Hacking PDF for translators
Translating PDF files when they are protected with a password, when they contain images not text...





Rule #1

We all know we should not translate PDF files. They are simply not fit for translation. Rule number one is always to ask for the source document and work on that. But sometimes your customer only has a PDF, and you have to get your hands dirty, no matter what.

Can your translation tool (CAT) handle it?

The first thing to try is opening the file straight in your CAT. Not all CATs use the same approach. I like how Jost Zetzsche described it in his 176th ToolKit newsletter. As far as I know, SDL Studio, Alchemy Publisher, MateCat and Wordfast Pro use a third-party PDF-to-DOC conversion tool that they have integrated as a filter. I guess some other CATs are also capable of opening some PDFs. You have no control over what these filters are doing, but when it works, you have a no-brainer solution. You do need to check, though, that all text has been found and filtered.

If you want to understand how tricky it can be to extract text from PDF, try it for yourself. In many cases you will end up with a plain text document where lines end with a hard return in the middle of a paragraph and where words at the end of lines are split with a hard hyphen… Messy is the best word to describe that kind of output. Text extraction only goes smoothly if the PDF was generated through XSL-FO (so from an XML format). In all other cases it depends on the virtual printer that was used to generate the PDF. Even though I can crack most PDF files, I always find pleasure in extracting text the stupid way: it allows me to see how many problems there are in the document, and I know what I need to check later on.
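To give an idea of what cleaning up that messy output involves, here is a minimal Python sketch. It assumes that paragraphs are separated by blank lines and that a hyphen directly before a line break always marks a split word. Both assumptions can be wrong for a given document, and the function name is only illustrative.

```python
import re

def unwrap_extracted_text(raw):
    """Rejoin paragraphs in plain text extracted from a PDF.

    Assumptions (not true for every PDF): paragraphs are separated by
    blank lines, and a hyphen right before a line break is a hard hyphen.
    """
    # Normalise line endings so the patterns below only deal with "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words split at the end of a line: "trans-\nlation" -> "translation".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single line breaks inside a paragraph into spaces; keep blank lines.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces or tabs left over from the joins.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

sample = "This is a trans-\nlation test with\nhard returns.\n\nSecond para-\ngraph."
print(unwrap_extracted_text(sample))
```

Note that a genuine compound that happens to break at the hyphen ("well-" / "known") would be glued together as well, so the result still needs a human check.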

DIY: Converting PDF-to-DOCX

If your CAT can’t open the PDF, you should convert the PDF to DOCX yourself.

Abusing Google Docs just to convert files also works quite well with PDF. I explained how to do this in a previous post. You could also use CloudConvert, a tool I use for basically all file conversions.

Since I have MS Word 2013 (and now 2016), I prefer that tool. Some customers forbid me to use online services, and MS Word is an offline tool if you don’t use the OneDrive cloud. If the PDF contains text as images, you have to convert that yourself (see a previous post), but when the PDF contains text, MS Word does a really good job for many languages: it is capable of fixing line endings, something many other tools cannot do. If there are a lot of hard hyphens, a search & replace all will solve that issue. If Word does not do a good job, ABBYY FineReader and Nuance OmniPage are definitely worth trying as well. They are no longer as expensive as they used to be, and they support many languages. Their language variants/plugins know more fonts and more character sets, and they come with a dictionary that helps the OCR process itself.

The risk of converting PDF yourself is that you’ll waste a lot of time. So make sure you quote and get paid for this. What you do for free is often not appreciated as it should be. Make all your extra services visible.

One of the problems of using CATs when translating OCR’d text and PDFs converted to Word is the code clutter you may end up with. You definitely need to remove the clutter before you apply your TMs to the job. As long as all the clutter is in, you will only see low fuzzy matches. You can use Translator Tools Document Cleaner or CodeZapper, or you can do it manually using this guideline.
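The effect of clutter on fuzzy matches can be illustrated with a small Python sketch. The `<f1>`-style tags are made-up placeholders, and `difflib` is only a stand-in for whatever matching algorithm your CAT really uses, so take the scores as an illustration and not as real match percentages.

```python
import re
from difflib import SequenceMatcher

# Hypothetical inline formatting tags, similar to what a conversion may leave behind.
TAG = re.compile(r"<[^>]+>")

def strip_tags(segment):
    """Remove inline tags and normalise whitespace."""
    return re.sub(r"\s+", " ", TAG.sub("", segment)).strip()

def similarity(a, b):
    """Rough similarity between 0 and 1 (stand-in for a CAT fuzzy match score)."""
    return SequenceMatcher(None, a, b).ratio()

tm_source = "Press the green button to start the machine."
cluttered = ("<f1>Pre</f1><f2>ss the gr</f2><f3>een button to st</f3>"
             "<f1>art the machine.</f1>")

print("with clutter:   ", round(similarity(tm_source, cluttered), 2))
print("without clutter:", round(similarity(tm_source, strip_tags(cluttered)), 2))
```

The sentence is identical to the one in the TM, yet with the tags in place it no longer scores as an exact match. Once the tags are removed, it does.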

Encrypted and restricted PDFs

When you receive a password-protected file, you need to remove the password first. Otherwise none of the tools above will do a good job. There is an extra reason why removing a password comes in handy: sometimes PDFs have been “protected” so you cannot search for words in them even when they contain plain text. I’m using the VeryPDF PDF Password Remover for this. I like this one because it can remove user passwords (responsible for encrypting and preventing unauthorised opening) and owner passwords (restricting printing, copying, extracting… even when the document is decrypted).

I never use free and online tools for this: if a file is user password protected, the owner of that document did not want everybody to have access to it, so sending it to an online service may be bad for your business.

When a customer asks me to remove a user password, I always ask for a written and signed instruction to do so, because I find it strange that he does not have the password. But sometimes there are good reasons: companies get acquired, people leave or get fired, backups and original documents get lost…
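The owner-password restrictions mentioned above are stored as a set of permission flags in the encrypted PDF (the /P entry of its encryption dictionary). The following Python sketch decodes such a value, with the bit numbers as I read them in the permission table of the PDF specification (ISO 32000). It only interprets a number you already have. It does not read a PDF and it does not remove anything. The function and dictionary names are illustrative.

```python
# Permission bits of the /P value, numbered from 1 as in the PDF specification.
# A set bit means the action is allowed.
PERMISSION_BITS = {
    "print": 3,
    "modify": 4,
    "copy_or_extract": 5,
    "annotate": 6,
    "fill_forms": 9,
    "extract_for_accessibility": 10,
    "assemble": 11,
    "print_high_quality": 12,
}

def decode_permissions(p):
    """Return which actions a /P value allows (p is a signed 32-bit integer)."""
    return {name: bool(p & (1 << (bit - 1)))
            for name, bit in PERMISSION_BITS.items()}

# -4 has all permission bits set; -3904 has none of them set.
for value in (-4, -3904):
    allowed = [name for name, ok in decode_permissions(value).items() if ok]
    print(value, "allows:", allowed or "nothing")
```

If "copy_or_extract" is not allowed, copying text via the clipboard will not work in a viewer that respects these flags, even though you can open and read the document.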

Extracting text

Sometimes the customer will ask you to translate just the text, without worrying about images and formatting. You may then want to extract the text from the PDF. This will only work if no owner password restricts copying from the PDF. But even when there is no restriction, extracting text can be cumbersome. PDF does not store text as Unicode or UTF-8. It uses its own mapping mechanism, based on the fonts used for rendering the PDF to screen (or paper, or…). This means that when you copy the text via the clipboard, a conversion may be needed. If that fails, your plain text will not be OK. This happened to me a lot when working on Asian documents.

So I always ask my customer to verify that the extracted text is indeed what he needs to get translated. If he sends me a PDF, I make sure my customer understands I will not work on the PDF file but on extracted or converted data. And he needs to sign off on that before I start translating.
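Why the font-based mapping described above can break extraction is easier to see in a toy model. The Python sketch below is not a PDF parser: the codes and the mapping table are invented for illustration. It only shows that the page holds font-specific codes, and that readable text appears only when a usable code-to-Unicode table comes with the font.

```python
# Toy model: the page content holds font-specific character codes, not Unicode.
content_codes = [0x01, 0x02, 0x03]

# Hypothetical code-to-Unicode table that a well-made PDF ships with its font.
to_unicode = {0x01: "日", 0x02: "本", 0x03: "語"}

def extract(codes, mapping=None):
    """Turn character codes into text, with or without a mapping table."""
    if mapping is None:
        # No usable mapping: fall back to reading the codes as Latin-1 bytes.
        return bytes(codes).decode("latin-1")
    # Codes missing from the table become the Unicode replacement character.
    return "".join(mapping.get(code, "\ufffd") for code in codes)

print(repr(extract(content_codes, to_unicode)))  # readable text
print(repr(extract(content_codes)))              # control characters, not text
```

The page looks the same on screen in both cases, because the glyphs are drawn from the font. Only the extracted text differs, which is why the customer should check it before you start.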

If everything else fails…

… then there is still you. You can work the good old way: convert the PDF to paper and translate the printed document in Word or Notepad. Just be careful with your coffee then.

I hope this information is useful to some of you.

 

Gert Van Assche

About Gert Van Assche

At Datamundi we're paying a fair price to linguists and translators evaluating (label/score/tag) human translations and machine translations for large scale NLP research projects.

7 thoughts on “Hacking PDF for translators Translating PDF files when they are protected with a password, when they contain images not text...”

  1. Thanks, Gert! I still haven’t had to translate from PDF directly, but I’ll bear this in mind, as well as the rest of the strategies suggested in this post:

    “The risk of converting PDF yourself is that you’ll waste a lot of time. So make sure you quote and get paid for this. What you do for free is often not appreciated as it should be. Make all your extra services visible.”

  2. Ciao Gert, useful tips, many thanks! I get a lot of PDFs, mainly from direct clients (hotels and the like) or for legal documents (the latter are usually hopeless, as most of the time they are scanned directly from paper documents, plus they include lots of initials, signatures, underlining, notes etc., so OCR usually makes more of a mess than they already are…). I invested in Adobe Acrobat Pro, which is not exactly cheap, but once bought, all updates are free, and it is very useful for converting workable PDFs into Word, Excel or RTF. Working on a Mac, I use Wordfast Pro as a CAT tool, but so far I couldn’t use it with PDFs directly: version 4.0 has just been released but I haven’t downloaded it yet, so maybe there is news in that regard that I’m not aware of. CloudConvert looks promising, so I’ll give it a try as soon as needed. Thanks again, ciao!

  3. Hi Gert, PDFs really can be a nightmare, and I believe that a translator should not be expected to translate them, especially when they are protected or, worse, created from badly scanned documents. But as you say, sometimes there is no alternative. I can tell you that the 2015 version of SDL Studio has changed its PDF converter and the new one works wonders, even with scanned text. But like all the other solutions, it isn’t perfect, so what I do is import the PDF, then take the Word file created by Studio and check it against the original. It is usually much easier and quicker to fix than other solutions. Then the Word file is ready for translation. memoQ also imports PDF files, but only if they are editable.
    Regards
    Juliet

    1. Very good info, Juliet. The 2015 version of Studio is yet another step forward; I was very impressed by the way post-DTP fixes (or customer edits in the final document) can be imported into the TM again. That in itself already justified a purchase for me.

      The PDF support by CATs is a silent revolution: 3 years ago there was hardly any support. Today some tools really do a good job.

