Explaining the behavior of LLMs: are interventions the way to go?
Given the impressive performance and widespread adoption of large language models (LLMs), there is a pressing need to explain how these systems work and what information they use for their predictions. This would not only allow us to better predict and control their behavior, thereby increasing their trustworthiness, but also help us gain insight into the internal processes underlying linguistic behavior in LLMs.
In the technical literature (e.g. Meng et al., 2022), these questions have been approached through interventions: applying small edits to the internal weights or activations of the network in order to change the output. The hope is that these interventions not only help to control the behavior of the system (for example, by fixing incorrect predictions) but also provide causal explanations, in the sense that the target of the intervention plays a causal role in a particular decision process. Although the link is rarely made explicit, this resembles Woodward’s interventionist account of causal explanation (2005). Specifically, Woodward claims that X is a direct cause of Y iff there is a possible intervention on X that changes Y. Moreover, this intervention should be surgical: it should change only the value of Y and of other variables on the causal path between X and Y. If an intervention meets these constraints, it can be considered to identify causes of Y.
In this contribution, I argue that interventions as applied in machine learning do not meet Woodward’s constraint of being surgical, because neural networks have distributed representations. Since interventions are likely to alter causal paths other than the targeted one, they are not guaranteed to identify the actual causes of a particular behavior (see also Grimsley et al., 2020). While interventions might still be promising for changing the behavior of the system, I argue that they should not be used for causal explanations. As an alternative, I propose to take inspiration from explanation in cognitive science, specifically cognitive modeling, in which interventions are combined with more robust hypothesis-building and the development of causal models of the processes presumed to underlie a particular system.
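The worry about surgical interventions can be illustrated with a deliberately tiny sketch. It is a toy two-layer linear model, not an LLM and not the method of any cited work. All names (forward, changed_outputs, LOCALIST, DISTRIBUTED) and all weights are hypothetical and chosen only for illustration. Both weight settings compute the same input-output mapping, but overwriting a single hidden unit affects one output in the localist case and both outputs in the distributed case.

```python
def forward(x, encode, decode, patch=None):
    """Run a two-layer linear toy model, optionally overwriting hidden units.

    `patch` maps a hidden-unit index to the value it is forced to take,
    standing in for an intervention on an internal activation.
    """
    hidden = [sum(w * xi for w, xi in zip(row, x)) for row in encode]
    if patch:
        for idx, val in patch.items():
            hidden[idx] = val
    return [sum(w * h for w, h in zip(row, hidden)) for row in decode]


def changed_outputs(x, encode, decode, patch):
    """Return the indices of the outputs that differ after the intervention."""
    base = forward(x, encode, decode)
    edited = forward(x, encode, decode, patch)
    return [i for i, (b, e) in enumerate(zip(base, edited)) if abs(b - e) > 1e-9]


# Hypothetical weights: each pair is (encoder, decoder).
# Localist: each hidden unit carries exactly one input feature.
LOCALIST = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# Distributed: each hidden unit mixes both features; the decoder unmixes them.
DISTRIBUTED = ([[1.0, 1.0], [1.0, -1.0]], [[0.5, 0.5], [0.5, -0.5]])

x = [1.0, 2.0]
patch = {0: 0.0}  # intervene on hidden unit 0 by forcing it to zero

print(changed_outputs(x, *LOCALIST, patch))     # [0]
print(changed_outputs(x, *DISTRIBUTED, patch))  # [0, 1]
```

In the distributed setting the same edit spills over to an output that was not the target, which is the sense in which such an intervention fails to be surgical. Whether and how strongly this carries over to real networks is the question the argument above addresses; the sketch only shows that the situation is possible.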
References:
Grimsley, C., Mayfield, E., & Bursten, J. R. S. (2020). Why Attention is Not Explanation: Surgical Intervention and Causal Reasoning about Neural Models. Proceedings of the 12th Conference on Language Resources and Evaluation, 1780–1790.
Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and Editing Factual Associations in GPT (arXiv:2202.05262). arXiv. http://arxiv.org/abs/2202.05262
Woodward, J. (2005). Making Things Happen: A Theory of Causal Explanation. Oxford University Press, USA.