How to Use Tesseract.js, an OCR Engine for the Browser


Optical Character Recognition (OCR) has been around for a very long time. However, because OCR is a CPU-intensive task, it has mostly been limited to native desktop applications and server-side programs. Tesseract.js is a JavaScript library that brings OCR to the browser. It is quite accurate on clean text, and supports well over a dozen languages.

Keep in mind that Tesseract itself is not a new OCR engine. Its development started at Hewlett-Packard in the 1980s, and since 2006 it has been sponsored by Google. Tesseract.js is simply a port of Tesseract to JavaScript, built using Emscripten.

In this tutorial, I show you how to use Tesseract.js to run OCR on image URLs. I suggest you use images that are hosted on Imgur servers. Why Imgur? Because they support CORS. In other words, they allow a page served from a different origin to read the image data. You are, of course, also free to use any other server that supports CORS. Alternatively, you could use locally hosted images served from the same origin as your page.
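Concretely, a server "supports CORS" for your page when its response carries an Access-Control-Allow-Origin header whose value is either * or your page's origin. The sketch below encodes that rule for the simple case without credentials. The function name corsAllows is my own and is not part of Tesseract.js or any browser API; you can read the actual header value in the Network tab of your browser's developer tools.

```javascript
// Hypothetical helper: decides whether an Access-Control-Allow-Origin
// header value lets a page served from pageOrigin read the response.
// Covers only the simple case without credentials.
function corsAllows(headerValue, pageOrigin) {
    if (typeof headerValue !== "string") {
        // Header absent: the browser blocks cross-origin reads.
        return false;
    }
    var value = headerValue.trim();
    // "*" allows every origin; otherwise the value must match exactly.
    return value === "*" || value === pageOrigin;
}
```

For example, corsAllows("*", "http://localhost:8000") is true, while a response without the header gives false.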

Include Tesseract.js

The easiest way to include Tesseract.js in your HTML5 webpage is to use a CDN. So, add the following to the <head> of your webpage.

<script src="https://cdn.rawgit.com/naptha/tesseract.js/0.2.0/dist/tesseract.js">
</script>

Create an Interface

Let us now create an interface containing an <input> field where the user can type in an image URL, a <div> where the results of the OCR will be shown, a <div> where the status of the OCR will be shown, and a button the user can click to start the OCR. For now, let’s keep it simple.

<input type="text" id="url" placeholder="Image URL" />
<input type="button" id="go_button" value="Run" />
<div id="ocr_results"> </div>
<div id="ocr_status"> </div>

Initialize and Run Tesseract

We’ll be writing all our code inside a <script> tag now. Therefore, towards the end of the page, create a new <script> tag.

Inside the tag, create a new function named runOCR that accepts a URL as its input. Inside the function, to start the OCR, all you need to do is call the recognize() method of the global Tesseract object and pass the URL to it.

The OCR operation runs asynchronously: recognize() returns a TesseractJob object right away, before any text is available. Therefore, you must update the contents of the results <div> inside a callback function, which can be added using the then() method. Additionally, add a callback using the progress() method to monitor the status and progress of the OCR operation.

function runOCR(url) {
    Tesseract.recognize(url)
        .then(function(result) {
            // Show the recognized text once the job finishes
            document.getElementById("ocr_results")
                    .innerText = result.text;
        })
        .progress(function(message) {
            // Show the current stage and its progress as a whole percentage
            document.getElementById("ocr_status")
                    .innerText = message.status + " (" +
                        Math.round(message.progress * 100) + "%)";
        });
}
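If you would like the status text to be a little more robust, you can move the formatting into its own function. The following is a minimal sketch, assuming the progress object carries the same status and progress fields used above. The name formatStatus is hypothetical, and the fallback for a missing progress value is a precaution on my part rather than documented behavior.

```javascript
// Hypothetical helper: turns a progress object into display text.
// Assumes a "status" string and, usually, a "progress" number
// between 0 and 1, as in the callback above.
function formatStatus(message) {
    var status = message.status || "working";
    var progress = message.progress;

    // Fall back to the bare status if progress is missing or not finite.
    if (typeof progress !== "number" || !isFinite(progress)) {
        return status;
    }

    // Clamp to the range 0..1 and round to a whole percentage.
    var percent = Math.round(Math.min(Math.max(progress, 0), 1) * 100);
    return status + " (" + percent + "%)";
}
```

Inside the progress() callback, you would then assign formatStatus(message) to the innerText of the ocr_status <div>.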

And that’s it. All you need to do now is call runOCR(). You can do so by adding a click event listener to the go_button that fetches the URL the user typed in and passes it to the function.

document.getElementById("go_button")
        .addEventListener("click", function(e) {
            var url = document.getElementById("url").value;
            runOCR(url);
        });
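The listener above passes whatever the user typed straight to runOCR(). If you want to catch obvious typos first, a small check like the one below can help. It is only a sketch: isHttpUrl is a hypothetical name, it relies on the standard URL constructor, and it does not verify that the URL actually points to an image.

```javascript
// Hypothetical helper: returns true if the text parses as an http(s) URL.
// Pass a base (for example document.baseURI) to accept relative paths
// to locally hosted images as well.
function isHttpUrl(text, base) {
    var trimmed = String(text).trim();
    if (trimmed === "") {
        return false;
    }
    try {
        var parsed = new URL(trimmed, base);
        return parsed.protocol === "http:" || parsed.protocol === "https:";
    } catch (e) {
        // The URL constructor throws on input it cannot parse.
        return false;
    }
}

// Possible use inside the click handler (browser only):
//   if (isHttpUrl(url, document.baseURI)) { runOCR(url); }
```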

At this point, you can open your page in the browser, and type in a direct image URL. The image, of course, must contain clearly legible text. Otherwise, the accuracy will be low.

It’s also worth noting that the OCR engine might take several seconds to finish its task. Therefore, be patient.
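Because of that delay, it can be helpful to disable the button while a job is running, so the user doesn't start several jobs by clicking repeatedly. Here is a minimal sketch. The setBusy name and the "Running..." label are my own choices, and the function works on any object that has disabled and value properties, such as the <input type="button"> used on this page.

```javascript
// Hypothetical helper: toggles a button between its idle and busy states.
function setBusy(button, busy) {
    button.disabled = busy;
    button.value = busy ? "Running..." : "Run";
    return button;
}
```

One way to wire it up is to call setBusy(button, true) just before Tesseract.recognize(), and setBusy(button, false) inside the then() callback.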

Here’s the input image I used:

[Image: input for Tesseract]

And here’s the output I got:

[Image: output of Tesseract]

Conclusion

You now know how to use Tesseract.js to read text from images. In my opinion, the accuracy of this engine is quite high as long as the fonts aren’t too fancy.
