Reaction author(s)

Andreas Kaltenbrunner

Lead researcher of the AI and Data for Society group at the UOC

In principle, this seems to me a very interesting initiative. The impact will probably be greater for the co-official languages than for Spanish, since Spanish accounts for a much larger share of Internet content than the co-official languages do.

Spanish accounts for 5.6% of the content on the Internet, compared to only 0.1% for Catalan/Valencian (see the relationship between Catalan/Valencian and Spanish here). For the other co-official languages, this percentage will be even lower. Presumably, the proportions in the training data of LLMs [large language models] such as GPT are similar. Having dedicated LLMs for the languages of the state is therefore a very interesting way to counter their disadvantage relative to English.

However, it will not be an easy task, given the resources available to competitors such as OpenAI. It also remains to be seen whether focusing on only a small set of languages will forfeit the potential synergies that come from training multilingual models on more languages.

Another very positive aspect of the announcement is the focus on open and transparent code. This will allow greater control over the training data and how it is processed, and thus help mitigate potential negative aspects of large language models such as bias or lack of explainability (black-box algorithms).
