Extracting terminology from an English webpage to prepare for a new translation job.
As a rule I do not share tools I wrote for customers, but once in a while I can make an exception.
When I write a tool for a customer, the customer owns the tool and I can no longer share it with others who might find it useful.
Last year I wrote a tool that extracts terminology from a list of URLs. One of my customers uses this tool to prepare for the new jobs they accept. The tool is quite complex:
- It extracts English terms from many web pages at once — the frequency analysis is done on the total harvest from those pages;
- It compares the extracted terms to the terminology the customer already has in his own database (termbase);
- It finds possible translations online (in dictionaries, glossaries, several machine translation systems — all systems they selected themselves);
- It writes the new terms and all suggestions to a TBX file;
- The customer’s linguists can pick, fix or create the right translation — they can use the frequency count to prioritize their work;
- They use the result to have their customer agree on the terminology prior to kicking off the translation project.
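The first two steps in the list above (a frequency count over the combined harvest, then a comparison against the existing termbase) can be sketched in a few lines of Python. This is only an illustration and not the code behind the tool: the function names, the tiny stopword list, the one-to-two-word candidate rule and the sample texts are all assumptions made for the example, and the pages are passed in as plain strings instead of being downloaded.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); a real extractor needs a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on", "with"}

def candidate_terms(text, max_words=2):
    """Return lowercase word n-grams (1..max_words) that neither start nor end
    with a stopword. Sentence boundaries are ignored to keep the sketch short."""
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    terms = []
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            terms.append(" ".join(gram))
    return terms

def harvest(pages):
    """Count candidate terms over the combined text of all pages."""
    counts = Counter()
    for text in pages:
        counts.update(candidate_terms(text))
    return counts

def new_terms(counts, termbase, min_count=2):
    """Drop terms already in the termbase, keep those seen at least min_count
    times, and sort by descending frequency, then alphabetically."""
    known = {t.lower() for t in termbase}
    fresh = [(term, n) for term, n in counts.items()
             if n >= min_count and term not in known]
    return sorted(fresh, key=lambda item: (-item[1], item[0]))

# Invented sample pages and termbase, for illustration only.
pages = [
    "The plasma rifle is a weapon. The plasma rifle overheats.",
    "A plasma rifle needs a power cell.",
]
result = new_terms(harvest(pages), termbase={"power cell"})
for term, count in result:
    print(count, term)
```

Counting over all pages together, as the list describes, lets a term that appears once on each of several pages still rise to the top, which is what makes the frequency useful for prioritizing the linguists' work.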
My customer agreed that I could create a stripped-down version of this tool and share it with whoever would like to use it. The stripped-down version still extracts terms from many pages at once, but it does not query multiple translation sources (it only calls Google Translate), and it does not create a TBX file, only a TSV, TMX or XLIFF file.
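For readers who have not worked with TSV output before, here is a minimal sketch of writing extracted terms to a tab-separated file with Python's standard `csv` module. The column names (`term`, `frequency`, `suggestion`) and the sample rows are assumptions for the example; the actual layout of the tool's TSV output may differ.

```python
import csv
import io

def write_tsv(rows, out):
    """Write (term, frequency, suggestion) rows as tab-separated values,
    preceded by a header row. Column names are hypothetical."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["term", "frequency", "suggestion"])
    for term, frequency, suggestion in rows:
        writer.writerow([term, frequency, suggestion])

# Invented sample data; the suggestion column stands in for a machine translation.
rows = [("plasma rifle", 3, "plasmageweer"), ("power cell", 1, "")]

buffer = io.StringIO()  # use open("terms.tsv", "w", encoding="utf-8", newline="") for a real file
write_tsv(rows, buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

A TSV file like this opens directly in a spreadsheet, so linguists can sort on the frequency column and fill in or correct the suggestions.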
I created the stripped-down version this weekend. It is now available at http://www.datamundi.be/cms/tools/english-terminology-extractor.
One remark: the stripped-down version of the tool does not validate the output from Google Translate, nor does it check the validity of the generated files. Some of the sample languages may well cause problems (I noticed this with Hindi a couple of times).
Please read more about this tool on the web page. For the engineers among us, I also share the components I used to develop this free service.
Wow, that’s a really cool thing! I will try it in my next related project. Thank you, Gert!
Wow, that’s so clever!
This looks amazing to me, Gert!
Gert, sounds great. I will try it soon!
Oh that’s so cool!
Hi Gert,
I just gave it a try with a Wikipedia page and I can tell it’s going to be very useful to me! I often have to find a lot of terminology when I research a video game translation (weaponry, space travel, etc.)
Thank you so much for sharing this with us!
For anyone interested: I have started developing a new hybrid tool that extracts all terminology from an English website (or from a web address up to 5 levels deep). If you would like to test it, drop me a mail via the mail link in the QA section of datamundi.be.