AI language model trained in Spanish

Reactions

Reactions: the Prime Minister announces the design of a foundational AI language model trained in Spanish

The Prime Minister, Pedro Sánchez, announced last night at the welcome dinner of the GSMA Mobile World Congress (MWC) Barcelona 2024 the construction of a foundational artificial intelligence language model, trained in Spanish and the co-official languages, open source and transparent, and with the intention of incorporating Latin American countries. To develop it, the Government will work with the Barcelona Supercomputing Center and the Spanish Supercomputing Network, together with the Spanish Academy of Language and the Association of Spanish Language Academies.

SMC Spain

26/02/2024 - 10:47 CET

Expert reactions

Pablo Haya Coll

Researcher at the Computational Linguistics Laboratory of the Autonomous University of Madrid (UAM) and director of Business & Language Analytics (BLA) at the Institute of Knowledge Engineering (IIC)

Autonomous University of Madrid

Institute of Knowledge Engineering

Science Media Centre Spain

I think this is good news that highlights the value of the PERTE [strategic project for the recovery and economic transformation] of the new language economy and serves as a letter of introduction for the new team of the Secretary of State for Digitalisation and Artificial Intelligence (SEDIA). It is an action that aligns developments in natural language processing (NLP) in Spanish and co-official languages within the National Artificial Intelligence Strategy (ENIA).

Existing large language models (LLMs, also called foundational models) have been trained on huge collections of documents (corpora), mainly extracted from public web pages. These corpora include documents in multiple languages, but their distribution is heavily skewed towards English. For example, the HPLT project (funded by the European Union) has collected and published 7 petabytes of documents extracted from the web. In that collection there is about 1,000 times more data in English than in Spanish. For the co-official languages, the disproportion is much more pronounced.

It should be noted that, despite this disproportion in the training data, multilingual models perform reasonably well in Spanish on generalist tasks. There is still room for improvement, and a model adapted to Spanish will certainly perform better. But we are at a time when technological advances in NLP are occurring at breakneck speed, which requires moving fast.

The author has not responded to our request to declare conflicts of interest

Josep Curto

Lecturer at the UOC's Faculty of Computer Science, Multimedia and Telecommunications, director of the UOC's Master's Degree in Business Intelligence and Big Data Analytics (MIBA) and AI expert

Open University of Catalonia (UOC)

Pedro Sánchez's announcement of the creation of a large foundational AI language model, trained specifically in Spanish and co-official languages, open source and transparent, and with the intention of incorporating Ibero-American countries, should be considered from several points of view.

On the one hand, it is good news, since the vast majority of foundational models have been created using datasets that are mostly in English. It is also relevant because it can serve as an example of a responsible artificial intelligence system. Since it is the government that is promoting its creation, the model must comply by default with the obligations that the EU AI Act sets out for this type of system. It will also surely take into account the rights of authors, publishers and licensees over the reference sources, which, as we know, some of the most relevant foundational models on the market have not respected.

On the other hand, many aspects qualify this announcement. It leaves many unknowns linked to its viability (who provides the budget, who carries out the project, how it will be delivered to generate value for society, who will maintain it, how the biases and other inefficiencies of these models will be controlled). Until we have more details to assess its future viability, it remains merely an announcement of good intentions.

The author has not responded to our request to declare conflicts of interest

Andreas Kaltenbrunner

Lead researcher of the AI and Data for Society group at the UOC

Open University of Catalonia (UOC)

In principle, this seems to me to be a very interesting initiative. The impact will probably be greater for the co-official languages than for Spanish, since the percentage of content in Spanish on the Internet is much higher than for the other co-official languages.

5.6% of the content on the Internet is in Spanish, compared to only 0.1% in Catalan/Valencian. In the other co-official languages this percentage will be even lower. Presumably, the proportions in the training data of LLMs [large language models] such as GPT will be similar. Therefore, having dedicated LLMs for the languages of the state is a very interesting initiative to combat their disadvantage compared to English.

However, it will not be an easy task given the amount of resources that competitors such as OpenAI have. It also remains to be seen whether focusing on only a small set of languages will forfeit the potential synergies that can be achieved by training multilingual models on more languages.

Another very positive aspect of the announcement is the focus on open and transparent code. This will allow greater control over the training data and its processing, and thus help mitigate potential negative aspects of large language models such as bias or lack of explainability (black box algorithms).

The author has declared they have no conflicts of interest

Teodoro Calonge

Professor of the Department of Computer Science at the University of Valladolid

University of Valladolid

I think it is a good proposal. The media impact of ChatGPT has certainly been a milestone and another step forward in the development of AI. But it was only a spearhead; there is a long way to go. ChatGPT models are very generalist and, to get better results, AI needs to become more personalised. This is where the Prime Minister's proposal comes in: large language models (LLMs) for a specific domain, namely Spanish, with an extension to Latin American countries, and the co-official languages. This will undoubtedly require more computational resources, hence the involvement of the Barcelona Supercomputing Center, possibly the largest in Spain.

In any case, the people who are currently leading AI in LLMs talk about the need to make the leap, making this technology personalised and distributed. The latter is because the volume of computation required is of such a magnitude that the machines contributing to this cannot be located in a single centre.

Clearly, this proposal is novel. There have been some timid attempts to tackle this task, but it is of such a magnitude that it will not be possible unless governments get involved. This is not only because of the financial resources involved, but also because access to information for the training of these systems can only be provided by government bodies.

In terms of difficulties, there is the aforementioned issue of computational resources, but there is also the difficulty of feeding data to the LLM. And this, from a practical point of view, may pose more difficulties, and there may even be legal obstacles to overcome, which will take time.

The author has not responded to our request to declare conflicts of interest

Nuria Oliver

Scientific Director and co-founder of the ELLIS Foundation Alicante

ELLIS Alicante Unit Foundation

The announcement of investment in the development of a large, open-source, transparent language model in Spanish and the other co-official languages is welcome news as existing models, even those that are multilingual, have been trained on mostly English data. Recent research points out that these models use internal representations based on English and, therefore, the language they generate in other languages, especially if they are languages with few resources, may have linguistic biases and use expressions that are not specific to those languages.

Furthermore, being open source, this language model will be available to any person or institution, facilitating access to natural language processing tools for a wide range of applications and users. Open source also makes it possible to involve wider communities of developers, researchers and language experts in the continuous improvement of the model. Both ELLIS Europe and ELLIS Alicante advocate the development of open science, including the development of open source artificial intelligence systems.

Transparency is another key feature to contribute to trust in their operation and results, as well as to foster the exchange of ideas so necessary to drive innovation. Trust in these systems is a key requirement for their use in society, especially in critical applications where the correct interpretation of language is essential.

Clearly, the inclusion of co-official languages alongside Spanish is an important and necessary step towards the preservation and promotion of linguistic diversity, such a valuable asset for our society.

What does it add to the existing models?

ELLIS Europe and ELLIS Alicante believe that if we want artificial intelligence to be socially sustainable, we need to expand access to high-performance computing, especially using renewable energy, encourage open source practices, invest in attracting and retaining the best minds, and demand transparency in the research, deployment and use of AI. This approach not only democratises AI development, but also contributes to the development of a more secure and competitive AI ecosystem. In this context, it is important to develop our own open and transparent language models, trained on quality data that does not infringe intellectual property rights and in our own languages to minimise bias. Given the cross-cutting nature of large language models, which can be used in virtually any sector, it is of strategic value to have our own development of these models. Furthermore, we cannot forget that there are more than 480 million people in the world whose mother tongue is Spanish, which is the official language of 20 sovereign states. The opportunities for impact are therefore immense.

What will be its main obstacles?

Developing a large language model with internationally competitive performance is a complex task with several challenges of different kinds.

Firstly, challenges of resources, funding and environmental impact. Creating a large, high-quality language model requires significant resources, both financial and computational. Adequate budget is needed for research, hardware acquisition, hiring of specialised staff and other related expenses. I understand that this obstacle would be addressed on the basis of the Prime Minister's announcement. Large computational requirements have a direct impact on the environment as the training and use of these models entails large energy needs which, if renewable energies are not used, contribute to the carbon footprint.

The second challenge is obtaining large amounts of data for training. Collecting, cleaning and labelling this data can be a challenge in itself, especially when dealing with co-official languages with fewer resources. In addition, it is necessary to verify that the data used is not proprietary or protected by intellectual property rights.

The third major challenge concerns the need for large computing capacities. In this respect, Spain has a supercomputer, MareNostrum 5, located at the Barcelona Supercomputing Center, which would solve this difficulty.

Fourthly, there is the challenge of talent. The development of cutting-edge language models requires the participation of experts in artificial intelligence, computational linguistics, machine learning and other related fields. Attracting and retaining skilled talent in these fields is a challenge as talent is in short supply and in high demand globally. ELLIS Europe and ELLIS Alicante aim to attract, retain and help inspire the next generation of excellent artificial intelligence research talent in Europe by offering a globally competitive working environment.

Fifthly, we cannot forget that software is a living thing, in continuous evaluation and improvement. It is not only necessary to subject models to rigorous testing and evaluation to ensure their quality and performance, but also to plan a process of continuous improvement to keep the model up-to-date and relevant in a constantly evolving environment. Keeping up with the latest developments and competing in a rapidly changing technological world can be a constant challenge.

Finally, we cannot forget the ethical dimension. It is crucial to address ethical issues and mitigate bias, stereotyping and other undesirable behaviour in the development of language models, as well as ensuring the preservation of privacy and security. At ELLIS Alicante we have a line of research in this regard.

Conflict of interest: "At ELLIS Alicante we are collaborating with the PERTE of the new language economy, specifically the part of the PERTE dedicated to the development of language models in co-official languages. Our work focuses on the study and mitigation of biases in the corpora used to train these models, as well as on the study of the ethical implications of interaction between humans and large language models".
