Raquel Fernández
Full Professor of Computational Linguistics & Dialogue Systems at the University of Amsterdam and Vice-Director for Research at the Institute for Logic, Language & Computation (ILLC)
The article presents a single model able to translate from written text to written text, from speech to speech, and from text to speech or speech to text across around 100 languages. Automatic machine translation systems tend to work reasonably well for languages that are well represented online (such as English, Chinese, or Spanish) but less so for languages with fewer speakers or less available digital data (such as Maltese, Swahili, or Urdu). Moreover, automatic translation has mostly been confined to written text. The model presented in this article (SEAMLESSM4T) advances the field by adding the ability to translate to and from speech (in addition to text), and by doing so for a large number of languages.
Like all current AI systems, automatic machine translation models require huge amounts of data for training. In the case of translation, the data typically consists of pairs of sentences: a sentence in a given language and its translation into another language, which the model uses to learn. However, this kind of paired data is very costly to create and is not available for many languages. The authors address this problem by using an AI model (SONAR) able to find sentences online (written or spoken) that have very similar meanings, and by using these mined sentences as a proxy for translation pairs. This allows them to create a very large training dataset, which is key to developing a robust translation model. Beyond offering more coverage than earlier models, the resulting model tends to produce translations of higher quality in terms of meaning, sound, and clarity.
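The core idea behind this mining step can be sketched in a few lines. The actual SONAR pipeline is far more sophisticated (it embeds billions of written and spoken sentences into a shared space and uses margin-based scoring); the snippet below is only a minimal illustration, assuming we already have sentence embeddings for two languages and pairing each source sentence with its nearest target sentence when their cosine similarity is high enough.

```python
import numpy as np

def mine_pairs(src_embs, tgt_embs, threshold=0.9):
    """Pair each source sentence with its most similar target sentence.

    src_embs, tgt_embs: 2D arrays of sentence embeddings (rows = sentences).
    Returns (source_index, target_index, similarity) triples whose cosine
    similarity exceeds the threshold; these serve as proxy translation pairs.
    """
    # Normalize rows to unit length so a dot product equals cosine similarity.
    src = src_embs / np.linalg.norm(src_embs, axis=1, keepdims=True)
    tgt = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    sims = src @ tgt.T  # all pairwise cosine similarities

    pairs = []
    for i in range(sims.shape[0]):
        j = int(np.argmax(sims[i]))  # closest target sentence
        if sims[i, j] >= threshold:
            pairs.append((i, j, float(sims[i, j])))
    return pairs

# Toy example with hypothetical 2-dimensional embeddings: sentences that
# mean roughly the same thing point in roughly the same direction.
src = np.array([[1.0, 0.0], [0.0, 1.0]])
tgt = np.array([[0.9, 0.1], [0.0, 1.0]])
print(mine_pairs(src, tgt))
```

In practice the threshold (and the margin-based refinement used in the real system) matters a great deal: set too low, the mined "pairs" are noisy non-translations; set too high, low-resource languages yield too little data.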
While this model makes substantial progress on speech translation, translating into spoken language remains more challenging than outputting text: the model can output translated text in 96 languages, but it can currently produce a spoken translation in only 36 languages. Moreover, the automatically produced speech may not always sound expressive and natural. Similarly, when translating from speech to text, the model's accuracy can vary with factors such as the speaker's gender, accent, or language. An evaluation of the model's performance also shows that it tends to display gender bias: for example, when a sentence in the source language does not specify gender (as in the English sentence "I'm a homemaker," where the speaker may be of any gender), the model is more likely to choose a stereotypical gender when translating into languages with grammatical gender (e.g., when translating from English into Spanish, it may more often produce "Soy ama de casa" than "Soy amo de casa"). Finally, while the model has the potential to improve communication in many everyday scenarios, it does not yet allow for streaming or simultaneous translation, that is, translating a sentence while it is still being produced.