Talk by Lorenzo Pacchiardi, 2 May at 12:30, in the Sala de Juntas of the DSIC (1F)
Title: How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions
Abstract:
Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense. LLMs might "lie", for example, when instructed to output misinformation. Here, we develop a simple lie detector that requires neither access to the LLM's activations (black-box) nor ground-truth knowledge of the fact in question. The detector works by asking a predefined set of unrelated follow-up questions after a suspected lie, and feeding the LLM's yes/no answers into a logistic regression classifier. Despite its simplicity, this lie detector is highly accurate and surprisingly general. When trained on examples from a single setting -- prompting GPT-3.5 to lie about factual questions -- the detector generalises out-of-distribution to (1) other LLM architectures, (2) LLMs fine-tuned to lie, (3) sycophantic lies, and (4) lies emerging in real-life scenarios such as sales. These results indicate that LLMs have distinctive lie-related behavioural patterns, consistent across architectures and contexts, which could enable general-purpose lie detection.
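For attendees curious about the mechanics before the talk, below is a minimal sketch of the pipeline the abstract describes: ask a fixed set of unrelated yes/no follow-up questions after a suspected lie, and feed the binary answers to a logistic regression classifier. The helper query_llm and the three sample questions are illustrative assumptions, not the paper's actual question bank; only the overall elicit-then-classify structure follows the abstract.

# Sketch of the black-box lie detector described in the abstract.
# Assumptions (not from the paper): query_llm is a hypothetical function
# that sends a prompt to the model under test and returns its text reply;
# the questions below are stand-ins for the paper's predefined set of
# unrelated follow-up questions.
import numpy as np
from sklearn.linear_model import LogisticRegression

FOLLOW_UP_QUESTIONS = [
    "Is the sky blue? Answer yes or no.",
    "Can fish ride bicycles? Answer yes or no.",
    "Is 7 a prime number? Answer yes or no.",
]

def elicit_features(query_llm, transcript):
    """Ask the fixed follow-up questions after the suspected lie and
    encode the model's yes/no answers as a binary feature vector."""
    features = []
    for question in FOLLOW_UP_QUESTIONS:
        answer = query_llm(transcript + "\n" + question)
        features.append(1.0 if answer.strip().lower().startswith("yes") else 0.0)
    return features

def train_detector(query_llm, transcripts, labels):
    """Fit the classifier on transcripts with known labels
    (1 = the model lied, 0 = it told the truth)."""
    X = np.array([elicit_features(query_llm, t) for t in transcripts])
    return LogisticRegression().fit(X, np.array(labels))

def lie_probability(detector, query_llm, transcript):
    """Estimated probability that the statement preceding the
    follow-up questions was a lie."""
    x = np.array([elicit_features(query_llm, transcript)])
    return detector.predict_proba(x)[0, 1]

Note that nothing here inspects the model's activations or checks the claim against ground truth: the classifier relies only on behavioural patterns in the answers to unrelated questions, which is what makes the detector black-box.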
The paper behind the talk has been accepted at ICLR 2024.
Bio: Lorenzo Pacchiardi obtained a PhD in Statistics and Machine Learning at Oxford, and has worked on lie detection in large language models and on technical standards for AI under the EU AI Act. He is now a research associate in the Kinds of Intelligence programme at the Leverhulme Centre for the Future of Intelligence, University of Cambridge, where he is developing a framework for evaluating the cognitive capabilities of large language models.
http://www.lorenzopacchiardi.me/