Edit Distance and Postediting And why it is cool





Edit distance can be described as the amount of change you have to make to one text to turn it into another text. And the great thing about it is that it is a number.

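The best-known edit distance is the Levenshtein distance, which counts the minimum number of insertions, deletions and substitutions needed, and it can be calculated with standard algorithms. But our focus here will be on using those numbers for machine translation (MT) and postediting.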
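For readers who want to see where the number comes from, here is a minimal sketch of the classic dynamic-programming calculation of the Levenshtein distance. The function name `levenshtein` is just an illustrative choice, and the sketch assumes every insertion, deletion and substitution costs 1.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn sequence a into sequence b."""
    # previous[j] = distance between the items of a seen so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into an empty sequence takes i deletions
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Because the function only compares items, it works on strings (character-level distance) and also on lists of words (word-level distance).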

Edit distance can be used to compare a postedited file to the machine translated output that was the starting point for the postediting. When you calculate the edit distance, you are calculating the “effort” that the posteditor made to bring the machine translation up to a certain level of quality. Often this is a “human quality” level, where the posteditor makes all the changes needed for the translation to read as if a human translator had produced it. This type of activity is known as “full postediting”. In other scenarios, the content just needs to be “understandable”, so the posteditor makes the minimum number of changes needed for a reader to understand the text. This level of quality is less demanding than the human quality level, and the task is known as “light postediting”.

The level of quality that you want depends on the purpose of your content, its “life expectancy” and other factors. But the edit distance is there to help you. Starting from the same source content and the same MT output, if you perform a light postediting and a full postediting, the edit distance for each task will be different. The full postediting is expected to have the higher edit distance, because more changes are needed to reach human quality. This means that you can measure light and full postediting using the edit distance. For more on postediting, please read this article by Juan Rowda.
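As a rough illustration of the comparison described above, the sketch below measures a light and a full postedit of the same MT output at the word level. The three sentences are invented for this example, the tokenization is a plain whitespace split, and the helper name `word_edit_distance` is an assumption, not an established tool.

```python
def word_edit_distance(source, target):
    """Word-level Levenshtein distance between two sentences
    (whitespace tokenization, every edit costs 1)."""
    a, b = source.split(), target.split()
    previous = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        current = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Invented sample data: one MT segment and two postedited versions of it.
mt_output = "The printer are not connect to network."
light_postedit = "The printer is not connected to network."
full_postedit = "The printer is not connected to the network."

light_distance = word_edit_distance(mt_output, light_postedit)
full_distance = word_edit_distance(mt_output, full_postedit)

print("light postediting:", light_distance)  # 2
print("full postediting:", full_distance)    # 3
```

On this made-up segment the full postedit needs one more word-level change than the light one, which is the pattern the paragraph above expects.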

Could we compare MT engines? Yes! Let’s say you have a choice of generic engines and you also have your own customized engine. Starting from a sample of text in the source language, you could have it machine translated by your engine, and also by Google and by Bing, for example. Then you can postedit these samples to your desired level of quality. Now, calculate the edit distance between each MT output and its postedited version. The lowest edit distance indicates the “best” engine, the one that will require the least effort to postedit. To illustrate, this is what it could look like:
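The sketch below shows one way such a comparison could be run. Everything in it is hypothetical: the engine labels (`custom`, `generic_a`, `generic_b`), the sample segments and their postedited versions are invented, and summing word-level distances per engine is only one possible way to aggregate.

```python
def word_edit_distance(source, target):
    """Word-level Levenshtein distance (whitespace tokens, unit costs)."""
    a, b = source.split(), target.split()
    previous = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        current = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical data: for each engine, pairs of (MT output, postedited text).
samples = {
    "custom": [
        ("Click the button to save the file.",
         "Click the button to save the file."),
        ("Restart the device after update.",
         "Restart the device after the update."),
    ],
    "generic_a": [
        ("Press button for saving file.",
         "Click the button to save the file."),
        ("Restart device after the update.",
         "Restart the device after the update."),
    ],
    "generic_b": [
        ("Click on the button to save file.",
         "Click the button to save the file."),
        ("Reboot the device after update.",
         "Restart the device after the update."),
    ],
}

# Total postediting effort per engine, measured in word-level edits.
totals = {
    engine: sum(word_edit_distance(mt, postedited) for mt, postedited in pairs)
    for engine, pairs in samples.items()
}
best = min(totals, key=totals.get)

for engine, total in totals.items():
    print(f"{engine}: {total} edits")
print("lowest edit distance:", best)
```

With a real sample you would want many more segments per engine, since two segments say very little about overall engine quality.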

The edit distance counts the number of changes, so a longer segment will likely have more changes simply because it has more words. Therefore, the edit distance is a kind of “word count” measure of the effort, similar in a way to the word count used to quantify the work of translators throughout the localization industry. If we go a little further, translation memories provide suggestions to translators, and fuzzy matches are used to measure the effort made to improve those suggestions. Machine translation provides suggestions too, and the effort needed to improve them can be measured with edit distance. So edit distance is a number that can be part of a very important discussion happening in the industry: how to pay for postediting work.

Silvio Picinini


One thought on “Edit Distance and Postediting And why it is cool”

  1. Very interesting, especially since I took a MOOC on big data where one of the quiz questions was to calculate the “edit distance” for a few words. I’m not sure what is the most practical application of this statistic. I think you are suggesting that an independent translator could use edit distance to justify the fee that you charge. On a more negative side, I could see edit distance being a way for a large translation “factory” to compel its workers to adhere to a certain time frame and accept a lower fee because “edit distance” is statistically low.


The Open Mic

Where translators share their stories and where clients find professional translators.
