From OCR to Text: the mirage becomes real





    The following article is for the poor, the penniless translator who likes re-inventing the wheel and making lemonade when juggling lemons.

    All others will not be interested.

    Part One: Image Text to Real Text

    I was translating an application form written in Arabic into English. The form came to me as a PDF, and embedded in this PDF were a few JPEGs, slides, and graphics containing text that needed to be translated.

    So, to turn the “mirage” of image text into “real” text, I needed an OCR (optical character recognition) conversion.

    (Why, you might ask? Because I do not read Arabic and must rely completely on a CAT tool, or to be specific, on machine translation: Google Translate.)

    For languages such as Arabic, there is a free online OCR converter: NewOCR.com. Not all OCR software can handle Arabic, but tools built on Tesseract can.

    (Aside: There’s a price for free. Note that the free OCR conversion website limits you to 10 image files per hour.)


    This was a bit of a challenge because I was translating a bar chart with a dozen labels placed sideways (rotated 90 degrees from horizontal), one under each bar, and the y-axis label also ran vertically.

    So I decided to take a small screenshot of each label, save it as an individual JPEG, and upload each file separately for OCR conversion. I used the free Gadwin PrintScreen utility to take the screenshots, because I can’t get the Windows Print Screen function to work on my computer.
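
    For readers who would rather run Tesseract on their own computer than go through a website, the same label-by-label idea might look like the sketch below. It is only a sketch and not the workflow I used: it assumes the Tesseract command-line program and its Arabic language data (`ara`) are installed, and the helper names are made up for illustration. The `--psm 7` setting tells Tesseract to treat the image as a single line of text, which suits a tightly cropped label.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="ara", psm=7):
    """Return the argument list for one Tesseract CLI call.

    Using "stdout" as the output base makes Tesseract print the text
    instead of writing a .txt file. --psm 7 treats the image as a
    single line of text, which suits a tightly cropped label.
    """
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def ocr_label(image_path, lang="ara"):
    """Run Tesseract on one cropped label and return the recognized text."""
    # Assumption: Tesseract is installed locally and available on PATH.
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout.strip()
```

    Calling `ocr_label("label_01.jpg")` (a hypothetical file name) would return the recognized Arabic text as a string. Sideways labels like the ones in my chart would most likely need to be rotated upright in an image editor before this step.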

    When I reached the limit of 10, I still had a few more JPEGs to convert, so I had to wait an hour (or move to a computer at another location with a different IP address).
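
    If you know in advance how many image files you have, you can work out how long the hourly limit will keep you waiting. The sketch below is a small planning aid under stated assumptions: it takes the limit of 10 files per hour at face value, assumes each batch is uploaded quickly, and guesses that waiting a full hour between batches is enough (the site’s exact counting rule is not something I checked). The function name and the file names are invented for the example.

```python
def plan_batches(files, limit=10, window_minutes=60):
    """Group files into batches and give each batch an earliest start time.

    Returns a list of (start_minute, batch) pairs, where start_minute is
    counted from the moment the first upload begins.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return [
        (index // limit * window_minutes, files[index:index + limit])
        for index in range(0, len(files), limit)
    ]


# Hypothetical example: thirteen cropped labels.
labels = [f"label_{n:02d}.jpg" for n in range(1, 14)]
for start_minute, batch in plan_batches(labels):
    print(f"after {start_minute} min: upload {len(batch)} file(s)")
```

    With thirteen files, the plan is one batch of ten right away and a second batch of three an hour later, which matches what happened to me.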

    Part Two: After catching the moonbeam, send it on its way

    After grabbing the converted source text and doing the translation, the next problem was to put the English translation back into a PDF format exactly like the source file.

    If this had been a regular document with lots of words and not many images, I would have put the translation into a Microsoft Word document and “printed” it to a PDF file.

    But this was an application form with several boxes, ruled lines in and around them, a header image/logo, and several photos with captions inside the boxes.

    So I chose to make a copy of the source PDF file and paste the English translation on top of the Arabic.

    There is a free online PDF editor:

    https://www.pdfpro.co/edit-pdf

    Using this PDF editor is very laborious because there is only one way to overwrite the source text: you “erase” the original text by drawing a white box over it (the tool calls this ‘whiteout’), then place your translation on top of the whiteout box, and finally save and download the overwritten translation as a new PDF file.
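
    The same whiteout-then-overwrite routine can, in principle, be scripted. The sketch below is not what the online editor does internally and not something I used on this job. It assumes the third-party PyMuPDF package is installed, and that you already know the page number and the coordinates (in PDF points) of the box you want to cover. The function names, and the small margin added around the box, are my own choices for illustration.

```python
def pad_box(box, margin=2.0):
    """Grow an (x0, y0, x1, y1) box by margin points on every side."""
    x0, y0, x1, y1 = box
    return (x0 - margin, y0 - margin, x1 + margin, y1 + margin)


def whiteout_and_label(src_pdf, dst_pdf, page_index, box, text, fontsize=9):
    """Cover one region of a page in white and write text on top of it.

    Returns True if the text fit inside the box, False if it did not
    (in which case PyMuPDF writes no text and only the white box is added).
    """
    import fitz  # PyMuPDF, a third-party package (pip install pymupdf)

    doc = fitz.open(src_pdf)
    page = doc[page_index]
    rect = fitz.Rect(*pad_box(box))
    # White outline and white fill: the "whiteout" box.
    page.draw_rect(rect, color=(1, 1, 1), fill=(1, 1, 1))
    # insert_textbox returns a negative number when the text does not fit.
    leftover = page.insert_textbox(rect, text, fontsize=fontsize)
    doc.save(dst_pdf)
    doc.close()
    return leftover >= 0
```

    A call such as `whiteout_and_label("form_ar.pdf", "form_en.pdf", 0, (72, 100, 300, 120), "Applicant name")` uses hypothetical file names and coordinates. Keep in mind that drawing a white box only hides the original text; it stays in the file underneath.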

    Well, “all’s well that ends well.” (Shakespeare)


    Or maybe we can say “all’s well when it ends.”

    This method worked for me. It was very slow, but it worked. Wishing you success and speed with your translations.

