
Java | OCR


Last updated 2 years ago

Setup

brew install tesseract

For additional language support:

brew install tesseract-lang

Using custom tessdata

Download the trained data files for the languages you need and put them in a tessdata folder. The folder can live under the application's resources folder or at any other path outside the application. If it is under the resources folder, use the following code:

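import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.util.LoadLibs;
// Extract the bundled "tessdata" resources to a temporary folder and point Tesseract at it
var tesseract = new Tesseract();
var tessdata = LoadLibs.extractTessResources("tessdata");
tesseract.setDatapath(tessdata.getAbsolutePath());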

Or, if you know the absolute path, pass it directly to setDatapath:

var tesseract = new Tesseract();
tesseract.setDatapath("/opt/ocr/tessdata");

The folder will contain files with names like these:

  • deu.traineddata

  • eng.traineddata

  • osd.traineddata

The first part of each file name is the language code, which is what you pass when setting the OCR languages.
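As a rough illustration of how the file names map to language codes, the sketch below lists a tessdata folder and strips the .traineddata suffix. The class name TessdataLanguages and the default path are made up for this example; only the standard library is used.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of tess4j.
public class TessdataLanguages {
    private static final String SUFFIX = ".traineddata";

    // Returns the sorted codes found in the folder, e.g. "deu.traineddata" -> "deu".
    public static List<String> availableLanguages(Path tessdataDir) throws IOException {
        try (Stream<Path> files = Files.list(tessdataDir)) {
            return files
                    .map(p -> p.getFileName().toString())
                    .filter(name -> name.endsWith(SUFFIX))
                    .map(name -> name.substring(0, name.length() - SUFFIX.length()))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed location; pass your own tessdata folder as the first argument.
        Path dir = Path.of(args.length > 0 ? args[0] : "/opt/ocr/tessdata");
        System.out.println(availableLanguages(dir));
    }
}
```

Note that osd would also show up in this list, although it holds orientation and script detection data rather than a spoken language.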

Set language

The image you run OCR on may contain one or more languages, so you need to tell Tesseract which languages to expect. By default, eng is used.

var tesseract = new Tesseract();
tesseract.setDatapath("/opt/ocr/tessdata");
// We expect German or English text to appear in the image
tesseract.setLanguage("deu+eng");
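If the list of languages comes from configuration, the "deu+eng" string can be assembled rather than hard-coded. The following is a minimal sketch under that assumption; LanguageSpec and build are hypothetical names, and the check against available codes is an extra safeguard added here, not something tess4j requires.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of tess4j.
public class LanguageSpec {
    // Joins language codes with '+', the separator used in "deu+eng".
    // Throws if a requested code has no matching traineddata file.
    public static String build(List<String> requested, Set<String> available) {
        if (requested.isEmpty()) {
            throw new IllegalArgumentException("At least one language is required");
        }
        List<String> missing = new ArrayList<>();
        for (String code : requested) {
            if (!available.contains(code)) {
                missing.add(code);
            }
        }
        if (!missing.isEmpty()) {
            throw new IllegalArgumentException("No traineddata for: " + missing);
        }
        return String.join("+", requested);
    }
}
```

The result can then be passed to tesseract.setLanguage(...), for example LanguageSpec.build(List.of("deu", "eng"), Set.of("deu", "eng", "osd")).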

Get OCR result

The tess4j library accepts several kinds of input: a File, a BufferedImage, or a ByteBuffer. You can also choose whether to extract text from the whole image or only from a specific bounding box.

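import java.awt.Rectangle;
import java.io.File;
import net.sourceforge.tess4j.Tesseract;

var tesseract = new Tesseract();
tesseract.setDatapath("/opt/ocr/tessdata");
tesseract.setLanguage("deu+eng");
var inputImage = new File("/opt/examples/images/multi-language.png");
// doOCR throws the checked TesseractException
String fullPageResult = tesseract.doOCR(inputImage);
// Only the 100x100 region whose top-left corner is at (100, 100)
String rectangleResult = tesseract.doOCR(inputImage, new Rectangle(100, 100, 100, 100));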
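When the bounding box comes from user input or another tool, it may extend past the image. I have not checked how tess4j reacts in that case, so the sketch below simply clips the rectangle to the image bounds before it is passed on. RegionClamp and clampToImage are hypothetical names; only java.awt.Rectangle is used.

```java
import java.awt.Rectangle;

// Hypothetical helper, not part of tess4j.
public class RegionClamp {
    // Returns the part of `region` that lies inside an image of the given size,
    // or null when the region does not overlap the image at all.
    public static Rectangle clampToImage(Rectangle region, int imageWidth, int imageHeight) {
        Rectangle clipped = region.intersection(new Rectangle(0, 0, imageWidth, imageHeight));
        return clipped.isEmpty() ? null : clipped;
    }
}
```

The image width and height could be read with ImageIO.read(inputImage) before calling the helper, and a null result is a signal to skip the OCR call for that region.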

Troubleshooting

Unable to load library 'tesseract'

This was resolved by adding the directory that contains the Tesseract native library to the jna.library.path system property:

String libPath = "/usr/local/lib";
File libTess = new File(libPath, "libtesseract.dylib");
if (libTess.exists()) {
  String jnaLibPath = System.getProperty("jna.library.path");
  if (jnaLibPath == null) {
    System.setProperty("jna.library.path", libPath);
  } else {
    System.setProperty("jna.library.path", libPath + File.pathSeparator + jnaLibPath);
  }
} else {
  throw new RuntimeException("OCR: validate: Tesseract library not in " + libPath);
}
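The snippet above hard-codes /usr/local/lib. A variation that checks several candidate directories might look like the sketch below. The candidate list is an assumption: /usr/local/lib is the Homebrew prefix on Intel Macs and /opt/homebrew/lib on Apple Silicon, so adjust it for your machine. JnaLibraryPath and its methods are hypothetical names.

```java
import java.io.File;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of tess4j or JNA.
public class JnaLibraryPath {
    // Returns the first directory from `candidates` that contains `libraryFileName`.
    public static Optional<String> findLibraryDir(List<String> candidates, String libraryFileName) {
        return candidates.stream()
                .filter(dir -> new File(dir, libraryFileName).isFile())
                .findFirst();
    }

    // Puts `dir` in front of any existing jna.library.path value.
    public static String prepend(String dir, String existing) {
        return (existing == null || existing.isEmpty()) ? dir : dir + File.pathSeparator + existing;
    }

    public static void main(String[] args) {
        // Assumed Homebrew locations; change these for your setup.
        List<String> candidates = List.of("/usr/local/lib", "/opt/homebrew/lib");
        String dir = findLibraryDir(candidates, "libtesseract.dylib")
                .orElseThrow(() -> new IllegalStateException(
                        "libtesseract.dylib not found in " + candidates));
        System.setProperty("jna.library.path", prepend(dir, System.getProperty("jna.library.path")));
    }
}
```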

Links

trained tessdata
Baeldung article
Tesseract User Manual
Best tessdata
Fast tessdata
Combined (Best + Fast) tessdata
Hint to resolve the issue with not loaded libraries
How to provide more than one language to the tesseract
How to OCR with Tesseract, OpenCV and Python
Tesseract Page Segmentation Modes (PSMs) Explained: How to Improve Your OCR Accuracy
All OCR options