Language Similarity in Machine Translation
- Duration: Feb. 2020 – Jan. 2022
- Funding: Facebook Sponsored Research Agreement
- Danni Liu
- Sai Koneru
- Jan Niehues
Neural machine translation is the key technology for enabling communication across language barriers. Normally, however, large amounts of parallel training data need to be collected. Multilingual machine translation addresses this challenge by exploiting resources from other language pairs, which significantly reduces the amount of data needed for each individual pair and enables automatic translation between many more languages. Currently, however, the most common NMT models do not contain any notion of language similarity.
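One common way such a single multilingual model pools data from many language pairs is to prefix each source sentence with a token naming the desired target language. The sketch below illustrates this convention on hypothetical data; the tag format and helper function are illustrative, not from this project.

```python
# Sketch of the target-language-token convention often used in
# multilingual NMT: one shared model is trained on pooled parallel
# data from many language pairs, with the desired target language
# signaled by a tag prepended to the source sentence.
# (Tag format and data below are hypothetical examples.)

def tag_source(sentence, target_lang):
    """Prepend a target-language token, e.g. '<2nl>' for Dutch."""
    return f"<2{target_lang}> {sentence}"

# Pooled training examples from several language pairs (hypothetical):
corpus = [
    ("Hello world", "Hallo Welt", "de"),    # English -> German
    ("Hello world", "Hallo wereld", "nl"),  # English -> Dutch
]

training_pairs = [(tag_source(src, lang), tgt) for src, tgt, lang in corpus]
print(training_pairs[0][0])  # -> "<2de> Hello world"
```

Because all pairs share one set of model parameters, a low-resource pair such as English–Dutch can benefit from the larger English–German data; the project's goal is to make this transfer explicit by modeling how similar the languages actually are.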
In reality, languages are not created in isolation; they develop over time with strong interaction between related languages. While humans often use the similarity between languages to infer the meaning of words, current machine translation systems cannot exploit this information. For example, knowledge of English and German helps to infer the meaning of Dutch sentences.
In this project, we want to address this challenge and model the similarity between languages within the neural machine translation formulation. By explicitly modeling language similarity, we will increase the usefulness of data from related languages. This is especially valuable for low-resource languages: since it is not possible to collect large amounts of training data for them, extra information such as their relatedness to other languages is extremely important. Furthermore, such cross-lingual similarity phenomena occur very often in colloquial language; therefore, the technology will also be very useful for translating this type of language.