Reacción a "AI models trained on AI-generated data can crash"
Pablo Haya Coll
Researcher at the Computer Linguistics Laboratory of the Autonomous University of Madrid (UAM) and director of Business & Language Analytics (BLA) of the Institute of Knowledge Engineering (IIC)
The article highlights an important limitation of using synthetic data to train LLMs [large language models]. The idea of using data generated by an LLM to retrain the same or another LLM is very attractive, as it would provide an unlimited source of training data. However, this paper provides evidence that the technique can corrupt the LLM ('model collapse', in the authors' words). This result is a warning about the quality of the data used to build these models. As LLMs are more widely adopted, more synthetic data ends up on the internet, which could in turn affect the training of future versions.
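The degeneration described above can be illustrated with a toy simulation. This is a hypothetical sketch, not the paper's method: instead of an LLM, it repeatedly fits a simple one-dimensional Gaussian to samples drawn from the previous generation's fitted model. Estimation error compounds across generations and the learned distribution collapses toward a point, losing the diversity of the original data.

```python
import numpy as np

# Toy illustration of "model collapse" (hypothetical sketch; the paper
# studies LLMs, not Gaussians). Each generation: sample synthetic data
# from the current model, then refit the model on that synthetic data.
rng = np.random.default_rng(0)

def train_generations(n_samples=10, n_generations=500, mu0=0.0, sigma0=1.0):
    mu, sigma = mu0, sigma0
    history = [sigma]
    for _ in range(n_generations):
        # Generate synthetic data from the current generation's model.
        synthetic = rng.normal(mu, sigma, n_samples)
        # Refit the model using only the synthetic data.
        mu, sigma = synthetic.mean(), synthetic.std(ddof=1)
        history.append(sigma)
    return history

history = train_generations()
print(f"initial std: {history[0]:.3f}, final std: {history[-1]:.3e}")
```

With a small sample size per generation, the fitted standard deviation shrinks dramatically over successive generations: the model ends up reproducing an ever narrower slice of the original distribution, a simplified analogue of the corruption the article warns about.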
Collecting data from reliable, frequently updated sources therefore becomes a priority for LLM providers. It is no wonder that companies such as OpenAI are entering into numerous agreements with media outlets and publishers. Along the same lines, the ALIA family of foundational models, funded by the Spanish government, will have to rely on top-quality sources for its construction.
With the publication of the Artificial Intelligence Regulation, additional data quality issues, such as intellectual property, privacy and personal data protection, and bias, need to be taken into account. As the article shows, synthetic data generation will not be the solution for obtaining quality data.