How to Read Text from Image in Java?

In today’s digital age, extracting text from images has become an essential task for many applications. Whether it’s pulling data from scanned documents or analyzing images for text content, being able to read text from images can greatly enhance the functionality of a Java application.

In this article, we will explore the easiest way to read text from an image in Java, along with the prerequisites and example code to help you get started.

Reading Text from Images using Tesseract OCR

Tesseract OCR is a powerful open-source library for optical character recognition. It supports multiple languages and can recognize text from various image formats.

Here is how we can use Tesseract OCR, through the Tess4J wrapper library, to read text from an image in Java:

Step 1: Add the dependency to the pom.xml file

<dependency>
	<groupId>net.sourceforge.tess4j</groupId>
	<artifactId>tess4j</artifactId>
	<version>5.11.0</version>
</dependency>

Step 2: Download trained data model

Now, we need to download the Tesseract OCR trained data file and save it to a local directory.

We can download the trained data files from the following link:

https://github.com/tesseract-ocr/tessdata

In this example, I’ve used the eng.traineddata file, but you can use others as needed.

Save it to the following location:

C:/traineddata/eng.traineddata

Remember this directory for the next step, where we need to set this path in our code.
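
Since the code needs the directory and not the file itself, a small helper can guard against accidentally passing the full file path. The following is only a sketch: DataPathHelper and toDataDirectory are hypothetical names used for illustration and are not part of Tess4J.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class DataPathHelper {

    /**
     * Hypothetical helper (not part of Tess4J): if the given path points at a
     * .traineddata file, return its parent directory; otherwise return the
     * path unchanged.
     */
    public static Path toDataDirectory(Path path) {
        Path name = path.getFileName();
        if (name != null && name.toString().endsWith(".traineddata")) {
            Path parent = path.getParent();
            // A bare file name has no parent; fall back to the current directory.
            return parent != null ? parent : Paths.get("");
        }
        return path;
    }

    public static void main(String[] args) {
        Path saved = Paths.get("C:/traineddata/eng.traineddata");
        // Prints the directory part, which is what the data path setting expects.
        System.out.println(toDataDirectory(saved));
    }
}
```

The result can be converted with toString() and used wherever the trained data directory is required.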

Step 3: Load the Image

Next, we need to load the image from which we want to extract the text and point Tesseract to the trained data.

Following is an example code snippet:

// Load image file
File imageFile = new File("C:/Users/codersathi/Desktop/sample-image-to-read-text.png");
ITesseract instance = new Tesseract();

// Set the directory where the trained data files were saved after download.
instance.setDatapath("C:/traineddata"); // Directory only, without the file name

Make sure to replace “C:/Users/codersathi/Desktop/sample-image-to-read-text.png” with the actual path to your image file, and “C:/traineddata” with the location where we saved the trained data in Step 2 above.

Step 4: Extract the Text

Finally, we can use the Tesseract OCR library to extract the text from the loaded image. The `doOCR()` method of the `Tesseract` class performs the text recognition.

Following is an example code snippet:

String result = instance.doOCR(imageFile);
System.out.println(result);

When we run the above code, it will extract the text from the image and print it to the console.
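
The string returned by doOCR() is raw recognizer output, so it can contain blank lines and stray leading or trailing whitespace. Below is a minimal sketch of tidying it with the standard library before further use. OcrTextCleaner and clean are hypothetical names, and the sample input is made up for illustration.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class OcrTextCleaner {

    /**
     * Hypothetical helper: trims every line of the OCR result and drops
     * lines that are empty after trimming.
     */
    public static String clean(String raw) {
        return Arrays.stream(raw.split("\\R"))   // split on any line break
                .map(String::trim)               // remove surrounding whitespace
                .filter(line -> !line.isEmpty()) // skip blank lines
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        // Made-up sample standing in for the value returned by doOCR().
        String raw = "  Lorem ipsum dolor sit amet,  \n\n consectetur adipiscing elit.\n\n";
        System.out.println(clean(raw));
    }
}
```

Whether this kind of cleanup is appropriate depends on the use case; if the blank lines carry meaning (for example, paragraph breaks), the filter step can be left out.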

Complete Example Code

In this example, I’ve used the following image, which is simply a screenshot of the Lorem Ipsum page.

My Sample image:

[Image: read text from image in java]

Example Code:

import java.io.File;

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class ImageToTextExample {
	public static void main(String[] args) {
		// Load image file
		File imageFile = new File(
				"C:/Users/codersathi/Desktop/sample-image-to-read-text.png");
		ITesseract instance = new Tesseract();
		// Set path where trained data sets are saved after download.
		instance.setDatapath("C:/traineddata");

		try {
			String result = instance.doOCR(imageFile);
			System.out.println(result);
		}
		catch (TesseractException e) {
			System.err.println(e.getMessage());
		}
	}
}

Output:

[Image: read text from image in java example]
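
The complete example assumes the image path is valid. As a hedged sketch, the file could be checked with the JDK’s ImageIO before running OCR. ImageFileCheck and isDecodableImage are hypothetical names, and ImageIO may support fewer formats than Tess4J accepts, so a false result is a hint and not a definitive answer.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public class ImageFileCheck {

    /**
     * Hypothetical helper: returns true only if the file exists and the JDK's
     * ImageIO can decode it. ImageIO.read returns null when no registered
     * reader understands the file.
     */
    public static boolean isDecodableImage(File file) {
        if (!file.isFile()) {
            return false;
        }
        try {
            BufferedImage image = ImageIO.read(file);
            return image != null;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        File imageFile = new File(
                "C:/Users/codersathi/Desktop/sample-image-to-read-text.png");
        if (isDecodableImage(imageFile)) {
            System.out.println("Image looks readable: " + imageFile);
        } else {
            System.err.println("Cannot read image: " + imageFile);
        }
    }
}
```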

Read Text in a Specific Language

In the above example, I used English language data, and the code did not set a language explicitly because Tess4J uses English (eng) by default.

To read text in another language, say Hindi, we need to set the language explicitly.

Example:

ITesseract instance = new Tesseract();
instance.setDatapath("C:/traineddata");
// Set the language to Hindi
instance.setLanguage("hin");

Make sure to download the hin.traineddata file for Hindi, and the corresponding file for any other language you use.
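
Tesseract also accepts several languages joined with a plus sign, such as eng+hin. The sketch below assumes that Tess4J passes such a string through unchanged, and checks that every referenced trained data file is present. LanguageDataCheck and missingLanguages are hypothetical names, not Tess4J API.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class LanguageDataCheck {

    /**
     * Hypothetical helper: splits a language string such as "eng+hin" and
     * returns the codes whose <code>.traineddata file is not in dataDir.
     */
    public static List<String> missingLanguages(Path dataDir, String languages) {
        List<String> missing = new ArrayList<>();
        for (String code : languages.split("\\+")) {
            Path file = dataDir.resolve(code + ".traineddata");
            if (!Files.isRegularFile(file)) {
                missing.add(code);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Path dataDir = Paths.get("C:/traineddata");
        List<String> missing = missingLanguages(dataDir, "eng+hin");
        if (missing.isEmpty()) {
            System.out.println("All trained data files are available.");
        } else {
            System.err.println("Missing trained data for: " + missing);
        }
    }
}
```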

Tesseract OCR installation for Linux

If you are working with Tesseract OCR on Linux (e.g. Ubuntu), you need to install the following packages:

sudo apt-get install tesseract-ocr -y && sudo apt install tesseract-ocr-eng -y

This installs Tesseract OCR along with the English trained language data.

On Linux, we don’t need to download the trained data manually as we did for Windows in Step 2 above. The command above installs it in the following location (the version directory, here 4.00, can differ depending on the installed Tesseract version):

/usr/share/tesseract-ocr/4.00/tessdata

The same code as above works. We only need to change the data path value as shown below:

/**
* Set path where trained data sets are saved after download.
*/		
instance.setDatapath("/usr/share/tesseract-ocr/4.00/tessdata"); //Only path without filename
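
If the same program has to run on both Windows and Linux, the data path could be chosen at runtime. Below is a minimal sketch that uses the two directories from this article. DataPathSelector and dataPathFor are hypothetical names, and the Linux directory is an assumption that depends on the installed Tesseract version.

```java
import java.util.Locale;

public class DataPathSelector {

    /**
     * Hypothetical helper: returns the Windows directory used earlier in this
     * article for Windows OS names, and the Ubuntu package directory otherwise.
     */
    public static String dataPathFor(String osName) {
        boolean windows = osName.toLowerCase(Locale.ROOT).startsWith("windows");
        return windows
                ? "C:/traineddata"
                : "/usr/share/tesseract-ocr/4.00/tessdata"; // assumed location
    }

    public static void main(String[] args) {
        String dataPath = dataPathFor(System.getProperty("os.name"));
        System.out.println("Using data path: " + dataPath);
    }
}
```

The returned value can then be passed to instance.setDatapath(...) in place of the hard-coded string.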

Reference: https://tess4j.sourceforge.net/codesample.html

