AI models collapse when trained on recursively generated data.

Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, Yarin Gal
Author Information
  1. Ilia Shumailov: OATML, Department of Computer Science, University of Oxford, Oxford, UK. ilia.shumailov@chch.ox.ac.uk.
  2. Zakhar Shumaylov: Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, UK. zs334@cam.ac.uk.
  3. Yiren Zhao: Department of Electrical and Electronic Engineering, Imperial College London, London, UK.
  4. Nicolas Papernot: University of Toronto, Toronto, Ontario, Canada.
  5. Ross Anderson: Department of Computer Science and Technology, University of Cambridge, Cambridge, UK.
  6. Yarin Gal: OATML, Department of Computer Science, University of Oxford, Oxford, UK. yarin@cs.ox.ac.uk.

Abstract

Stable diffusion revolutionized image creation from descriptive text. GPT-2 (ref. 1), GPT-3(.5) (ref. 2) and GPT-4 (ref. 3) demonstrated high performance across a variety of language tasks. ChatGPT introduced such language models to the public. It is now clear that generative artificial intelligence (AI) such as large language models (LLMs) is here to stay and will substantially change the ecosystem of online text and images. Here we consider what may happen to GPT-{n} once LLMs contribute much of the text found online. We find that indiscriminate use of model-generated content in training causes irreversible defects in the resulting models, in which tails of the original content distribution disappear. We refer to this effect as 'model collapse' and show that it can occur in LLMs as well as in variational autoencoders (VAEs) and Gaussian mixture models (GMMs). We build theoretical intuition behind the phenomenon and portray its ubiquity among all learned generative models. We demonstrate that it must be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web. Indeed, data collected from genuine human interactions with systems will become increasingly valuable in the presence of LLM-generated content in data crawled from the Internet.
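To make the mechanism concrete, the following is a minimal, hypothetical sketch (Python with NumPy; it is not the authors' released code, for which see ref. 13). Each generation fits a single one-dimensional Gaussian to samples drawn from the previous generation's fitted model. Estimation error compounds across generations, the fitted variance drifts towards zero and the tails of the original distribution disappear, a toy analogue of the model collapse described above. The sample size, generation count and use of a single Gaussian rather than a GMM, VAE or LLM are illustrative choices only.

# Toy illustration of recursive training on generated data
# (hypothetical sketch; not the released code of ref. 13).
import numpy as np

rng = np.random.default_rng(0)

n_samples = 100       # samples per generation (kept small so the drift is visible)
n_generations = 200   # number of model-trained-on-model iterations

# Generation 0: "human" data drawn from a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=n_samples)

for gen in range(1, n_generations + 1):
    # Fit the model to the current training set (maximum likelihood).
    mu, sigma = data.mean(), data.std()
    # The next generation trains only on data generated by this model.
    data = rng.normal(loc=mu, scale=sigma, size=n_samples)
    if gen % 50 == 0:
        print(f"generation {gen:3d}: fitted mean = {mu:+.3f}, fitted std = {sigma:.3f}")

The printed standard deviation shrinks as generations accumulate: low-probability (tail) events are sampled ever more rarely, so later models never see them at all. This is the same qualitative behaviour the abstract describes for GMMs, VAEs and LLMs, in which rare content is lost first.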

References

  1. Radford, A. et al. Language models are unsupervised multitask learners. OpenAI blog 1, 9 (2019).
  2. Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).
  3. OpenAI. GPT-4 Technical Report. https://cdn.openai.com/papers/gpt-4.pdf (2023).
  4. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. in Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (eds Burstein, J., Doran, C. & Solorio, T.) 4171–4186 (Association for Computational Linguistics, 2019).
  5. Liu, Y. et al. RoBERTa: a Robustly Optimized BERT Pretraining Approach. Preprint at https://arxiv.org/abs/1907.11692 (2019).
  6. Zhang, S. et al. OPT: open pre-trained transformer language models. Preprint at https://arxiv.org/abs/2205.01068 (2022).
  7. Aljundi, R., Kelchtermans, K. & Tuytelaars, T. Task-free continual learning. In Proc. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 11254–11263 (IEEE, 2019).
  8. Carlini, N. & Terzis, A. in Proc. Tenth International Conference on Learning Representations (ICLR, 2022).
  9. Carlini, N. et al. in Proc. 2024 IEEE Symposium on Security and Privacy (SP) 179 (IEEE, 2024).
  10. Mousavi-Hosseini, A., Park, S., Girotti, M., Mitliagkas, I. & Erdogdu, M. A. in Proc. Eleventh International Conference on Learning Representations (ICLR, 2023).
  11. Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S. & Srebro, N. The implicit bias of gradient descent on separable data. J. Mach. Learn. Res. 19, 1–57 (2018).
  12. Gu, Y., Dong, L., Wei, F. & Huang, M. in Proc. Twelfth International Conference on Learning Representations (ICLR, 2024).
  13. Shumailov, I. & Shumaylov, Z. Public code for Model Collapse (0.1). Zenodo https://doi.org/10.5281/zenodo.10866595 (2024).
  14. Bommasani, R. et al. On the opportunities and risks of foundation models. Preprint at https://arxiv.org/abs/2108.07258 (2022).
  15. Strubell, E., Ganesh, A. & McCallum, A. in Proc. 57th Annual Meeting of the Association for Computational Linguistics (eds Korhonen, A., Traum, D. & Màrquez, L.) 3645–3650 (Association for Computational Linguistics, 2019).
  16. Merity, S., Xiong, C., Bradbury, J. & Socher, R. in Proc. 5th International Conference on Learning Representations (ICLR, 2017).
  17. Keskar, N. S., McCann, B., Varshney, L. R., Xiong, C. & Socher, R. CTRL: a conditional transformer language model for controllable generation. Preprint at https://arxiv.org/abs/1909.05858 (2019).
  18. Shumailov, I. et al. in Proc. 2021 IEEE European Symposium on Security and Privacy (EuroS&P) 212–231 (IEEE, 2021).
  19. Google. Finding more high-quality sites in search. Google https://googleblog.blogspot.com/2011/02/finding-more-high-quality-sites-in.html (2011).
  20. Mims, C. The search engine backlash against 'content mills'. MIT Technology Review https://www.technologyreview.com/2010/07/26/26327/the-search-engine-backlash-against-content-mills/ (2010).
  21. Taleb, N. N. Black swans and the domains of statistics. Am. Stat. 61, 198–200 (2007). https://doi.org/10.1198/000313007X219996
  22. LeCun, Y., Cortes, C. & Burges, C. J. C. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/ (1998).
