
Model Interpretability: from Illusions to Opportunities


If you have a question about this talk, please contact Shun Shao.

Abstract

While the capabilities of today’s large language models (LLMs) are reaching, and even surpassing, what was once thought impossible, mitigating their misalignment, such as the generation of misinformation or harmful text, remains an open area of research. Understanding LLMs’ internal representations can help explain their behavior, verify their alignment with human values, and mitigate their errors. In this talk, I begin by challenging common misconceptions about the connections between LLMs’ hidden representations and their downstream behavior, highlighting several “interpretability illusions.”

Next, I introduce Patchscopes, a framework we developed that leverages the model itself to explain its internal representations in natural language. I’ll show how it can be used to answer a wide range of questions about an LLM’s computation. Beyond unifying prior inspection techniques, Patchscopes opens up new possibilities, such as using a more capable model to explain the representations of a smaller model. I show how Patchscopes can serve as a tool for inspection, discovery, and even error correction. Examples include fixing multi-hop reasoning errors, examining the interaction between user personas and latent misalignment, and understanding why different classes of contextualization errors occur.
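For readers unfamiliar with the idea, the sketch below illustrates the general flavor of this kind of inspection: a hidden state from one forward pass is patched into a placeholder position of a second, "inspection" prompt, and the model is asked to continue in natural language. It is a minimal illustration only, not the speaker's implementation; the GPT-2 model, HuggingFace transformers API, prompts, layer choice, and placeholder token are all assumptions made for the example.

    # Minimal Patchscopes-style sketch (illustrative assumptions: GPT-2,
    # HuggingFace transformers, hypothetical prompts and layer choice).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
    model.eval()

    # 1. Run a source prompt and keep one token's hidden state at one layer.
    src = tok("The Eiffel Tower is located in", return_tensors="pt")
    with torch.no_grad():
        src_out = model(**src)
    layer = 6                                        # block whose output we inspect
    h_src = src_out.hidden_states[layer + 1][0, -1]  # output of block `layer`, last token

    # 2. Build an inspection prompt; the trailing "x" is a placeholder whose
    #    representation will be overwritten by h_src.
    tgt = tok("Syria: country in the Middle East. Paris: city in France. x",
              return_tensors="pt")
    patch_pos = tgt.input_ids.shape[1] - 1

    # 3. Patch h_src into the same layer of the target forward pass with a hook,
    #    then let the model "verbalize" the patched representation.
    def patch_hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        if hidden.shape[1] > patch_pos:              # only on the full-prompt pass
            hidden[0, patch_pos] = h_src
        return output

    handle = model.transformer.h[layer].register_forward_hook(patch_hook)
    with torch.no_grad():
        gen = model.generate(**tgt, max_new_tokens=10, do_sample=False,
                             pad_token_id=tok.eos_token_id)
    handle.remove()

    print(tok.decode(gen[0][tgt.input_ids.shape[1]:]))  # the model's reading of h_src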

I hope that by the end of this talk, the audience shares my excitement about the beauty of the internal mechanisms of AI systems, understands the nuances of model interpretability and why some observations can lead to illusions, and takes away Patchscopes as a powerful tool for qualitative analysis of how and why LLMs work and fail in different scenarios.

Bio

Asma Ghandeharioun, Ph.D., is a senior research scientist with the People + AI Research team at Google DeepMind. She works on aligning AI with human values through better understanding and control of (language) models, uniquely by demystifying their inner workings and correcting collective misconceptions along the way. While her current research focuses mostly on machine learning interpretability, her previous work spans conversational AI, affective computing, and, more broadly, human-centered AI. She holds a doctorate and a master’s degree from MIT and a bachelor’s degree from Sharif University of Technology. Trained as a computer scientist and engineer, she has research experience at MIT, Google Research, Microsoft Research, and École Polytechnique Fédérale de Lausanne (EPFL), among others.

Her work has been published in premier peer-reviewed machine learning venues such as NeurIPS, ICLR, ICML, NAACL, EMNLP, AAAI, ACII, and AISTATS. She has received awards at NeurIPS, and her work has been featured in Quanta Magazine, Wired, the Wall Street Journal, and New Scientist.

This talk is part of the Language Technology Lab Seminars series.
