Last week a translator called me… After she had accepted the job, her customer asked her to make sure the terms she used in the translation were the same as in the screenshots; the software would not be localized. At first, she thought: “OK, no problem. I’ll look at the screenshots and use the same English terms. No big deal.” Then she realized there were an awful lot of screenshots (242!), containing a lot of terms and phrases, and that these were used throughout the 80-page manual (counting text only).
If I got a question like this from a language service provider (LSP), I would not be surprised. I would figure out how to deal with it and propose an automated solution. LSPs appreciate some automation: they often get similar requests, and once they have solved a problem they are a step ahead of their competitors and can pass well-prepared jobs on to their freelance translators. This translator, however, did not work for an LSP, and she did not need a full-blown, nicely integrated technical solution. She just needed an answer to the question “How do I do this? Do you know a simple, cheap and fast solution?”
This is what I advised.
- Extract all images from the Word document. This is easy:
  - Make a copy of the Word document
  - Rename the copy: replace .DOCX with .ZIP
  - Open the ZIP file and go to the subfolder “word”
  - Extract the subfolder “media” to your desktop
  - Close the ZIP file and open the media folder on your desktop: here you’ll find all screenshots (PNG or another format)
- Extract the text from these images using a free online OCR tool:
  - newocr.com: you have to upload your files one by one, but you can select which part of the image needs to be OCR’d. The result is really OK.
  - drive.google.com: upload the whole folder to Google Drive at once. In Google Drive, right-click the desired file and choose “Google Docs” from the “Open with” submenu. Your file will be opened and OCR’d in one go. You can select, copy, change… the extracted text. Then download the document to your local drive using “Download as” from the File menu, and select the Plain Text format.
- Import the extracted text into your CAT tool. How you do this, and what you can do with the result, depends on the CAT tool you’re using, but there are two ways to test what works best for you (and one solution does not exclude the other):
  - Import it into your terminology database as non-translatable terms;
  - Import it as a translation memory (source-source). If you want to create a source-source TM, any alignment tool that creates TMX will do. Don’t forget to give the target segments the language code of your real target language, even though they contain the source text. So your target may be labeled “FR” while what you actually import is “EN”;
- If you’re not happy with the result, contact the support team of your CAT tool. They may know a better trick.
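If you are comfortable running a small script, the first and third steps above can be sketched in Python. This is only a rough illustration under a few assumptions of mine, not a finished tool: the function names `extract_media` and `write_source_source_tmx` are made up for this example, the OCR step in the middle is still done by hand, and I assume the extracted text has been cleaned up into one term or phrase per line. The language codes “en” and “fr” are placeholders for your own language pair.

```python
import zipfile
from pathlib import Path
import xml.etree.ElementTree as ET

# ElementTree writes this attribute out as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def extract_media(docx_path, out_dir):
    """Copy every file stored under word/media/ in a .docx into out_dir.

    A .docx is a ZIP archive, so no renaming is needed here.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            if name.startswith("word/media/") and not name.endswith("/"):
                target = out / Path(name).name
                target.write_bytes(archive.read(name))
                saved.append(target)
    return saved


def write_source_source_tmx(terms, tmx_path, source_lang="en", target_lang="fr"):
    """Write a TMX file in which source and target hold the same text.

    The target segments carry target_lang as their language code even
    though their content is the untranslated source term.
    Returns the number of translation units written.
    """
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "screenshot-terms-sketch",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "phrase",
        "o-tmf": "plain text",
        "adminlang": "en",
        "srclang": source_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    seen = set()
    for term in terms:
        term = term.strip()
        if not term or term in seen:
            continue  # skip blank lines and duplicates
        seen.add(term)
        tu = ET.SubElement(body, "tu")
        for lang in (source_lang, target_lang):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = term
    ET.ElementTree(tmx).write(tmx_path, encoding="utf-8", xml_declaration=True)
    return len(seen)
```

With a manual saved as `manual.docx` and the OCR result saved as `terms.txt` (both file names are just examples), you would call `extract_media("manual.docx", "media")` first, and later `write_source_source_tmx(Path("terms.txt").read_text(encoding="utf-8").splitlines(), "terms.tmx")`. Whether your CAT tool accepts the resulting TMX as-is is something to test; the support team of your CAT tool remains the better source if it does not.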
Maybe you can use this advice as well one day.