Seminar: Memorisation and copyright in large language models

  • 28 October 2024
  • 14:00-15:00
  • N112 Haslegrave Building
  • Dr Claudio Ceruti, BBC

Abstract: 

The success of large language models (LLMs) is largely attributed to their extensive training datasets and massive number of parameters, which enable them to memorise vast amounts of information. This memorisation extends beyond basic language patterns, capturing details found in only a few documents, which can be useful for tasks like question answering. However, this also raises concerns about privacy, security, and copyright. LLMs can store sensitive information, as well as concepts like facts and writing styles, expressed in various forms. We present a taxonomy to classify different types of memorisation, ranging from verbatim text to abstract ideas and writing styles, and discuss their impact on performance, privacy, and legal aspects. Additionally, we describe recent discoveries in memorisation dynamics, supported by empirical results from experiments detecting the memorisation of copyrighted data.

Speaker: Dr Claudio Ceruti
Claudio Ceruti is a Lead R&D Engineer at the BBC. His expertise spans NLP and Large Language Models, generative models, manifold learning, interpretability of deep models, and the social impacts of AI. He previously worked as a Senior ML Engineer at Lifebit, an NLP Researcher at Factmata, and a Machine Learning Researcher at Creative AI. Claudio holds a PhD in Mathematics and Statistics for Computational Sciences from the University of Milan, where he also completed postdoctoral research.

Presenter slides: Ceruti-Memorisation_copyright_LLM

Contact and booking details

Booking required?
No