AI models trained on AI-generated data can collapse

Using artificial intelligence (AI)-generated datasets to train future generations of machine learning models can progressively degrade their output, a phenomenon known as ‘model collapse’, according to a paper published in Nature. The research shows that, within a few generations, original content is replaced by unrelated nonsense, underlining the importance of training AI models on reliable data.

24/07/2024 - 17:00 CEST
Expert reactions


Andreas Kaltenbrunner

Lead researcher of the AI and Data for Society group at the UOC

Science Media Centre Spain

The study is very interesting and of good quality, but its value is mainly theoretical, because its conclusions rest on the assumption that only AI-generated data will be used in future training. In any real scenario there will always be some share of human-generated training data as well, at the very least the human data that is already available today.

It is not clear what the outcome would be if human-generated data were mixed with AI-generated data, let alone what would happen if the increasingly common hybrid human-AI content were added as well.

The study would be more complete if it also included experiments on the subject.
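
As a rough sketch of the kind of experiment this comment calls for, the toy simulation below (our illustration, not the paper's; the human-data fraction h and all other parameters are assumptions) refits a one-dimensional Gaussian model each generation on a mixture of fresh ‘human’ data drawn from the true distribution and synthetic data sampled from the previous generation's fit:

```python
# Toy mixing experiment: what happens to recursive training when a fraction h
# of each generation's training set is fresh human data? (Illustrative only.)
import numpy as np

rng = np.random.default_rng(0)
TRUE_MU, TRUE_SIGMA = 0.0, 1.0
N = 100            # training examples per generation (tiny, to speed up the effect)
GENERATIONS = 1000

def final_sigma(h):
    """Fitted standard deviation after many generations, for human fraction h."""
    mu, sigma = TRUE_MU, TRUE_SIGMA                      # generation 0 fits the truth
    for _ in range(GENERATIONS):
        n_human = int(h * N)
        human = rng.normal(TRUE_MU, TRUE_SIGMA, n_human)
        synthetic = rng.normal(mu, sigma, N - n_human)   # drawn from the last model
        data = np.concatenate([human, synthetic])
        mu, sigma = data.mean(), data.std()              # "retrain": refit the Gaussian
    return sigma

for h in (0.0, 0.1, 0.5):
    print(f"human fraction {h:.1f}: sigma after {GENERATIONS} generations = {final_sigma(h):.4f}")
```

With h = 0 the fitted width drifts towards zero (the collapse the paper proves), while even a modest share of human data anchors it near the true value; whether this carries over to real LLMs and to hybrid content is exactly the open question raised above.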

The author has declared they have no conflicts of interest

Víctor Etxebarria

Professor of Systems Engineering and Automation at the University of the Basque Country (UPV/EHU)

Science Media Centre Spain

This paper demonstrates, mathematically and rigorously, that generative AIs can malfunction if trained on AI-generated data. The effect the authors propose to call ‘model collapse’ is real: the large language models (LLMs) on which current generative AIs base their operation do collapse (they stop working, respond badly and give incorrect information). It is a statistical effect, thoroughly demonstrated in the article and illustrated with examples and experiments, that appears whenever LLMs are trained recursively, that is, when a generative AI is fed training data previously produced by a generative AI. In this sense, the paper shows that generative AIs trained in this way are actually degenerative.
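
The statistical mechanism is easy to reproduce in miniature. The sketch below is a toy illustration under our own assumptions (a unigram model over a Zipf-distributed vocabulary), not one of the paper's experiments: each generation, a ‘language model’ is refit to text sampled from its predecessor, and any word that by chance receives zero samples drops to probability zero and can never come back:

```python
# Toy recursion: a unigram "language model" retrained, generation after
# generation, on a corpus written by the previous generation. (Illustrative only.)
import numpy as np

rng = np.random.default_rng(42)
VOCAB = 1000       # word types in the toy language
N = 5000           # tokens of "training text" per generation

probs = 1.0 / np.arange(1, VOCAB + 1)   # Zipf-like word frequencies, as in real text
probs /= probs.sum()

for gen in range(1, 31):
    corpus = rng.choice(VOCAB, size=N, p=probs)   # the current model writes a corpus
    counts = np.bincount(corpus, minlength=VOCAB)
    probs = counts / counts.sum()                 # the next model is fit to that corpus
    if gen % 5 == 0:
        print(f"generation {gen:2d}: {np.count_nonzero(probs)} of {VOCAB} words survive")
```

The surviving vocabulary shrinks every generation: low-probability words disappear first, the same loss of distribution tails that the paper identifies as the early stage of model collapse.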

AIs are trained on huge amounts of data taken from the internet, produced by people who hold the legal rights of authorship over their material. To avoid lawsuits, or simply to save costs, technology companies use data generated by their own AIs to keep training their machines. This increasingly widespread practice renders AIs useless for any truly reliable function. It makes them not only useless as tools to help us solve our problems, but potentially harmful, if we base our decisions on the incorrect information they produce.

The authors of this excellent article recommend that the AI industry train its systems on genuinely intelligent (that is, human) data. They also acknowledge that pre-filtering automatically generated data to avoid degeneration is not necessarily impossible, but that it requires a great deal of serious research.
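
What such pre-filtering might look like is itself an open question. One hypothetical flavour, sketched below under our own assumptions, scores machine-generated documents against a reference model trained on verified human text and keeps only those whose perplexity falls in a band typical of human writing (`reference_logprob` is an assumed helper, not a real API, and the thresholds are placeholders):

```python
# A hypothetical pre-filter: keep only machine-generated documents whose
# perplexity under a trusted human-text reference model looks human-like.
# `reference_logprob` and the thresholds are assumptions for illustration,
# not a real API or validated values.
import math

def perplexity(tokens, reference_logprob):
    """Per-token perplexity of a document under the reference model."""
    total = sum(reference_logprob(tok, tokens[:i]) for i, tok in enumerate(tokens))
    return math.exp(-total / len(tokens))

def filter_synthetic(docs, reference_logprob, lo=10.0, hi=80.0):
    """Keep documents whose perplexity falls inside the 'human-like' band."""
    return [doc for doc in docs if lo <= perplexity(doc, reference_logprob) <= hi]
```

Whether a filter of this kind could actually prevent degeneration, rather than merely delay it, is precisely the research question the authors leave open.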

The author has not responded to our request to declare conflicts of interest

Pablo Haya Coll

Researcher at the Computer Linguistics Laboratory of the Autonomous University of Madrid (UAM) and director of Business & Language Analytics (BLA) of the Institute of Knowledge Engineering (IIC)

Science Media Centre Spain

The article highlights an important limitation in the use of synthetic data to train LLMs [large language models]. The idea of using data generated by an LLM to retrain the same or another LLM is very attractive, as it would provide an unlimited source of training data. However, this paper provides evidence that the technique can corrupt the model (‘model collapse’, in the authors' words). This result is a warning about the quality of the data used to build these LLMs. As more LLMs are adopted, more synthetic data ends up on the internet, which could hypothetically affect the training of future versions.

Collecting data from reliable, frequently updated sources therefore becomes a priority for LLM providers. It is no wonder that companies such as OpenAI are entering into numerous agreements with media outlets and publishers. Along the same lines, the ALIA family of foundational models, funded by the Spanish government, will have to rely on top-quality sources for its construction.

With the publication of the EU Artificial Intelligence Regulation, additional data-quality issues, such as intellectual property, privacy and personal data, and bias, also need to be taken into account. As the article shows, generating synthetic data will not be the solution for obtaining quality data.

The author has not responded to our request to declare conflicts of interest
Publications
AI models collapse when trained on recursively generated data
  • Research article
  • Peer reviewed
Journal
Nature
Publication date
24/07/2024
Authors

Shumailov et al. 
