Extracting terminology from an English webpage to prepare for a new translation job. I never share tools I wrote for customers, but sometimes I do it somehow.





When I write a tool for a customer, the customer owns the tool, and I can no longer share it with others who might find it useful.

Last year I wrote a tool that extracts terminology from a list of URLs. One of my customers uses this tool to prepare for the new jobs they accept. The tool is quite complex:

  1. It extracts English terms from many web pages at once (the frequency analysis is done on the total harvest from those pages);
  2. It compares the extracted terms to the terminology the customer already has in their own database (termbase);
  3. It finds possible translations online (in dictionaries, glossaries and several machine translation systems, all selected by the customer);
  4. It writes the new terms and all suggestions to a TBX file;
  5. The customer’s linguists can pick, fix or create the right translation, using the frequency count to prioritize their work;
  6. They use the result to have their customer agree on the terminology prior to kicking off the translation project.
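
Steps 1 and 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of the actual tool: the stopword list, the n-gram heuristic and the names `candidate_terms`, `count_terms` and `new_terms` are all hypothetical, and the pages are given as plain strings rather than fetched from URLs.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the real tool's list).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on", "with"}

def candidate_terms(text, max_words=2):
    """Yield lowercase word n-grams that neither start nor end with a stopword.

    Sentence boundaries are ignored here, so some n-grams may span two sentences.
    """
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            yield " ".join(gram)

def count_terms(pages, max_words=2):
    """Count candidates over the total harvest from all pages."""
    counts = Counter()
    for text in pages:
        counts.update(candidate_terms(text, max_words))
    return counts

def new_terms(counts, termbase, min_count=2):
    """Keep frequent candidates that are not already in the termbase."""
    known = {term.lower() for term in termbase}
    return [(term, n) for term, n in counts.most_common()
            if n >= min_count and term not in known]

# Made-up sample pages and a one-entry termbase.
pages = [
    "The plasma rifle is a weapon. The plasma rifle uses a fuel cell.",
    "A fuel cell powers the plasma rifle.",
]
result = new_terms(count_terms(pages), termbase=["fuel cell"])
```

Counting over all pages together, rather than per page, is what lets a term that appears once on each of many pages still rise to the top of the list.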

My customer agreed that I could create a stripped-down version of this tool and share it with whoever would like to use it. The stripped-down version still extracts terms from many pages at once, but it does not query different translation sources (it only calls Google Translate), and it does not create a TBX file, just a TSV, TMX or XLIFF file.
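
For the TSV output, a minimal sketch could look like the following. The column layout (term, frequency, suggestion) and the function name `write_tsv` are assumptions for illustration; the file produced by the real service may be organized differently, and the suggestion text below is a placeholder rather than real Google Translate output.

```python
import csv
import io

def write_tsv(rows, out):
    """Write (term, frequency, suggestion) rows as tab-separated text."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["term", "frequency", "suggestion"])  # hypothetical header
    for term, freq, suggestion in rows:
        writer.writerow([term, freq, suggestion])

# Made-up rows; the second term has no suggestion yet.
rows = [("plasma rifle", 3, "(machine suggestion)"), ("fuel cell", 2, "")]
buffer = io.StringIO()
write_tsv(rows, buffer)
tsv_text = buffer.getvalue()
```

Using the `csv` module instead of joining strings by hand means a term that happens to contain a tab or a quote is still escaped properly.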

I created the stripped-down version this weekend. It is now available at http://www.datamundi.be/cms/tools/english-terminology-extractor.

One remark: the stripped-down version of the tool does not validate the output from Google Translate, nor does it check the validity of the generated files. Some of the sample languages may well cause problems (I noticed this with Hindi a couple of times).

Please read more about this tool on the web page. For the engineers among us, I also share the components I used to develop this free service.

 

Gert Van Assche

About Gert Van Assche

At Datamundi we're paying a fair price to linguists and translators evaluating (label/score/tag) human translations and machine translations for large scale NLP research projects.

7 thoughts on “Extracting terminology from an English webpage to prepare for a new translation job. I never share tools I wrote for customers, but sometimes I do it somehow.”

  1. Hi Gert,
    I just gave it a try with a Wikipedia page and I can tell it’s going to be very useful to me! I often have to find a lot of terminology when I research a video game translation (weaponry, space travel, etc.)
    Thank you so much for sharing this with us!


The Open Mic

Where translators share their stories and where clients find professional translators.
