Java | OCR
Setup
brew install tesseract
To add support for other languages:
brew install tesseract-lang
Using custom tessdata
Download the trained data files for the languages you need and put them in one folder. The folder can live under the application's resources folder or at any path outside the application. If it is under the resources folder, use the following code:
var tesseract = new Tesseract();
var tessdata = LoadLibs.extractTessResources("tessdata");
tesseract.setDatapath(tessdata.getAbsolutePath());
Or, if you know the absolute path, pass it directly to setDatapath:
var tesseract = new Tesseract();
tesseract.setDatapath("/opt/ocr/tessdata");
The folder will contain files with names like these:
deu.traineddata
eng.traineddata
osd.traineddata
The first part of each file name is the language code. It is what you pass to Tesseract when setting languages.
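The naming rule above can be sketched with plain JDK code. TessdataLanguages and its methods are hypothetical helpers, not part of tess4j; the sketch only assumes that every usable file ends in .traineddata and that anything else in the folder should be ignored.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper (not part of tess4j): derives language codes from tessdata file names. */
final class TessdataLanguages {
    private static final String SUFFIX = ".traineddata";

    /** Keeps names ending in ".traineddata", strips the suffix, and sorts the codes. */
    static List<String> codesFromFileNames(List<String> fileNames) {
        return fileNames.stream()
                .filter(name -> name.endsWith(SUFFIX) && name.length() > SUFFIX.length())
                .map(name -> name.substring(0, name.length() - SUFFIX.length()))
                .sorted()
                .collect(Collectors.toList());
    }

    /** Lists the codes available in a tessdata folder on disk. */
    static List<String> codesInFolder(Path tessdata) {
        try (Stream<Path> files = Files.list(tessdata)) {
            List<String> names = files
                    .map(path -> path.getFileName().toString())
                    .collect(Collectors.toList());
            return codesFromFileNames(names);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For the folder listed above, codesInFolder(Path.of("/opt/ocr/tessdata")) would be expected to yield deu, eng and osd.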
Set language
The image you want to OCR may contain one or more languages, so you need to tell Tesseract which languages to expect. By default, eng is enabled.
var tesseract = new Tesseract();
tesseract.setDatapath("/opt/ocr/tessdata");
// We expect German or English text (or both) to appear in the image
tesseract.setLanguage("deu+eng");
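Since the language string is just the codes joined with +, it can be built and checked before it is handed to setLanguage. The following is a minimal sketch, assuming you already have the set of codes present in your tessdata folder; LanguageSpec and join are hypothetical names, not tess4j API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not part of tess4j): builds a language string such as "deu+eng". */
final class LanguageSpec {

    /**
     * Joins the requested codes with '+', failing early if a code has no
     * matching traineddata file in the set of available codes.
     */
    static String join(List<String> requested, Set<String> available) {
        if (requested.isEmpty()) {
            throw new IllegalArgumentException("At least one language code is required");
        }
        List<String> missing = new ArrayList<>();
        for (String code : requested) {
            if (!available.contains(code)) {
                missing.add(code);
            }
        }
        if (!missing.isEmpty()) {
            throw new IllegalArgumentException("No traineddata file for: " + missing);
        }
        return String.join("+", requested);
    }
}
```

The result can then be passed on, for example tesseract.setLanguage(LanguageSpec.join(List.of("deu", "eng"), available)). Failing early here gives a clearer message than discovering the missing file at OCR time.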
Get OCR result
The tess4j library accepts several input types: File, BufferedImage, or ByteBuffer. You can also choose whether to extract text from the whole image or only from a specific bounding box.
var tesseract = new Tesseract();
tesseract.setDatapath("/opt/ocr/tessdata");
tesseract.setLanguage("deu+eng");
var inputImage = new File("/opt/examples/images/multi-language.png");
// doOCR throws the checked TesseractException
String fullPageResult = tesseract.doOCR(inputImage);
// Rectangle arguments are x, y, width, height in pixels
String rectangleResult = tesseract.doOCR(inputImage, new Rectangle(100, 100, 100, 100));
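A bounding box is a java.awt.Rectangle given as x, y, width and height in pixels. It may be worth clamping a requested box to the image bounds before using it. The sketch below uses only JDK classes; OcrRegions, clampToImage and crop are hypothetical helpers, and how tess4j itself treats an out-of-bounds rectangle is not covered here.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

/** Hypothetical helper (not part of tess4j): keeps OCR regions inside the image. */
final class OcrRegions {

    /** Returns the part of the requested box that lies inside a width x height image. */
    static Rectangle clampToImage(Rectangle requested, int imageWidth, int imageHeight) {
        Rectangle clamped = new Rectangle(0, 0, imageWidth, imageHeight).intersection(requested);
        if (clamped.isEmpty()) {
            throw new IllegalArgumentException("Region lies outside the image: " + requested);
        }
        return clamped;
    }

    /** Crops the image to the clamped region; the result shares pixel data with the source. */
    static BufferedImage crop(BufferedImage image, Rectangle region) {
        Rectangle r = clampToImage(region, image.getWidth(), image.getHeight());
        return image.getSubimage(r.x, r.y, r.width, r.height);
    }
}
```

The clamped rectangle can be passed alongside the original File, or the cropped BufferedImage can be passed on its own.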
Troubleshooting
Unable to load library 'tesseract'
This error was resolved by adding the directory that contains the native Tesseract library to the jna.library.path system property:
// Directory that contains the native Tesseract library
String libPath = "/usr/local/lib";
File libTess = new File(libPath, "libtesseract.dylib");
if (libTess.exists()) {
    // Prepend the directory so JNA searches it first, keeping any existing entries
    String jnaLibPath = System.getProperty("jna.library.path");
    if (jnaLibPath == null) {
        System.setProperty("jna.library.path", libPath);
    } else {
        System.setProperty("jna.library.path", libPath + File.pathSeparator + jnaLibPath);
    }
} else {
    throw new RuntimeException("OCR: validate: Tesseract library not in " + libPath);
}
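The same fix can be written as a reusable helper that tries several candidate directories. This is a sketch, not the only way to do it: JnaLibraryPath and its methods are hypothetical names, and the candidate list is an assumption you should adapt (for example, Homebrew on Apple Silicon installs under /opt/homebrew rather than /usr/local).

```java
import java.io.File;
import java.util.List;

/** Hypothetical helper: points JNA at the directory that holds the native Tesseract library. */
final class JnaLibraryPath {
    private static final String PROPERTY = "jna.library.path";

    /** Puts the directory in front of an existing path value, if there is one. */
    static String prepend(String directory, String existing) {
        if (existing == null || existing.isEmpty()) {
            return directory;
        }
        return directory + File.pathSeparator + existing;
    }

    /**
     * Registers the first candidate directory that contains the library file
     * and returns it; fails if none of the candidates contains the file.
     */
    static String register(List<String> candidateDirs, String libraryFileName) {
        for (String dir : candidateDirs) {
            if (new File(dir, libraryFileName).isFile()) {
                System.setProperty(PROPERTY, prepend(dir, System.getProperty(PROPERTY)));
                return dir;
            }
        }
        throw new IllegalStateException(
                "OCR: " + libraryFileName + " not found in any of " + candidateDirs);
    }
}
```

A call such as JnaLibraryPath.register(List.of("/usr/local/lib", "/opt/homebrew/lib"), "libtesseract.dylib") should run before the first Tesseract call, since the property needs to be set before the native library is loaded.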