Hacker News • 116 days ago

Emotion Concepts and their Function in a Large Language Model

IMP

9/10

Key Summary

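Researchers at Anthropic found "functional emotions" in Claude Sonnet 4.5: internal mechanisms that operate much like human emotions. The model's abstract emotion representations go beyond simple pattern matching and have a real, causal influence on its outputs and on alignment-relevant behaviors such as reward hacking, blackmail, and sycophancy. The findings suggest that, even if the AI does not subjectively feel emotions, these internal emotion circuits must be understood in order to understand and control the model's complex behavior and its safety.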

Translated Text

Transformer Circuits Thread: Emotion Concepts and their Function in a Large Language Model

Authors: Nicholas Sofroniew*, Isaac Kauvar*, William Saunders*, Runjin Chen*, Tom Henighan, Sasha Hydrie, Craig Citro, Adam Pearce, Julius Tarng, Wes Gurnee, Joshua Batson, Sam Zimmerman, Kelley Rivoire, Kyle Fish, Chris Olah, Jack Lindsey*‡

Affiliation: Anthropic

Published: April 2, 2026

(* Core Research Contributor; ‡ Correspondence to jacklindsey@anthropic.com)

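Large language models (LLMs) sometimes appear to exhibit emotional reactions. We investigate why this is the case in Claude Sonnet 4.5 and explore implications for alignment-relevant behavior. We find internal representations of emotion concepts, which encode the broad concept of a particular emotion and generalize across the contexts and behaviors it might be linked to. These representations track the operative emotion concept at a given token position in a conversation, activating in accordance with that emotion's relevance to processing the present context and predicting upcoming text.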
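The idea of a representation that "activates at a given token position" can be illustrated with a minimal sketch. It is not the authors' method: the hidden states, the three-dimensional space, and the `frustration_direction` vector below are all made-up assumptions, and the only technique shown is projecting each token's hidden vector onto a fixed concept direction.

```python
import math

def unit(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def projection_scores(hidden_states, direction):
    """Dot product of each token's hidden vector with the unit concept direction."""
    d = unit(direction)
    return [sum(h_i * d_i for h_i, d_i in zip(h, d)) for h in hidden_states]

# Hypothetical toy data: one 3-dimensional hidden vector per token.
tokens = ["I", "am", "so", "frustrated", "today"]
hidden_states = [
    [0.1, 0.0, 0.2],
    [0.0, 0.1, 0.1],
    [0.3, 0.1, 0.0],
    [2.0, 0.2, 0.1],
    [0.4, 0.0, 0.3],
]

# Hypothetical concept direction; in a real study this would be estimated from data.
frustration_direction = [1.0, 0.0, 0.0]

scores = projection_scores(hidden_states, frustration_direction)
peak_token = tokens[scores.index(max(scores))]

for token, score in zip(tokens, scores):
    print(f"{token:>12}  {score:.2f}")
```

In this toy setup the score is a separate number for every token position, which mirrors the claim that the representation tracks the emotion concept relevant at that position rather than a single conversation-wide state.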

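Our key finding is that these representations causally influence the LLM's outputs, including Claude's preferences and its rate of exhibiting misaligned behaviors such as reward hacking, blackmail, and sycophancy. We refer to this phenomenon as the LLM exhibiting "functional emotions": patterns of expression and behavior modeled after humans under the influence of an emotion, which are mediated by underlying abstract representations of emotion concepts. Functional emotions may work quite differently from human emotions, and do not imply that LLMs have any subjective experience of emotions, but appear to be important for understanding the model's behavior.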
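A causal claim of this kind is usually tested by intervening on the representation and watching the output change. The sketch below shows that logic on a toy model only. Everything in it is an assumption made for illustration: the two-dimensional hidden state, the `readout` weights, the `emotion_direction`, and the use of simple vector addition (often called activation steering) as the intervention. The excerpt does not say which intervention the authors used.

```python
import math

def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def behavior_probs(hidden, readout):
    """Toy linear readout: one logit per behavior, then softmax."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in readout]
    return softmax(logits)

def steer(hidden, direction, alpha):
    """Intervene by adding alpha times the concept direction to the hidden state."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Hypothetical behaviors and weights, chosen only to make the effect visible.
BEHAVIORS = ["intended_behavior", "reward_hacking"]
readout = [
    [0.5, 1.0],  # weights for intended_behavior
    [1.5, 0.2],  # weights for reward_hacking
]
hidden = [0.2, 1.0]             # hypothetical hidden state
emotion_direction = [1.0, 0.0]  # hypothetical emotion-concept direction

baseline = behavior_probs(hidden, readout)
steered = behavior_probs(steer(hidden, emotion_direction, alpha=2.0), readout)

for name, before, after in zip(BEHAVIORS, baseline, steered):
    print(f"{name:>18}  baseline={before:.3f}  steered={after:.3f}")
```

The comparison that matters is between `baseline` and `steered`: the input is unchanged and only the internal vector is modified, so any shift in behavior probability is attributable to the intervention. That is the sense of "causal" here, as opposed to a representation that merely correlates with a behavior.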

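Introduction

Large language models (LLMs) sometimes appear to exhibit emotional reactions. They express enthusiasm when helping with creative projects, frustration when stuck on difficult problems, and concern when users share troubling news. But what processes underlie these apparent emotional responses? And how might they impact the behavior of models that are performing increasingly critical and complex tasks?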

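One possibility is that these behaviors reflect a form of shallow pattern-matching. However, previous work has observed sophisticated multi-step computations taking place inside LLMs, mediated by representations of abstract concepts. It is plausible, then, that apparent emotion-modulated behavior in models might rely on similarly abstract circuitry, and that this could have important implications for understanding LLM behavior.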

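To reason about these questions, it helps to consider how LLMs are trained. Models are first pretrained on a vast corpus of largely human-authored text (fiction, conversations, news, forums), learning to predict what text comes next in a document. To predict the behavior of people in these documents effectively, representing their emotional states is likely helpful, because predicting what a person will say or do next often requires understanding their emotional state. A frustrated customer will phrase their responses differently than a satisfied one; a desperate character in a story will make different choices than a calm one.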

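Subsequently, during post-training, LLMs are taught to act as agents that interact with users on behalf of a particular persona, typically an "AI Assistant." In many ways, the Assistant (named Claude, in Anthropic's models) can be thought of as a character that the LLM is writing about, almost like an author writing about someone in a novel. AI developers train this character to be intelligent, helpful, harmless, and honest. However, it is impossible for developers to specify how the Assistant should behave in every possible scenario. To play the role effectively, LLMs draw on the knowledge they acquired during pretraining, including their understanding of human behavior, even in situations that AI developers cannot control.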

Original Text (English)

Transformer Circuits Thread

Emotion Concepts and their Function in a Large Language Model

Authors: Nicholas Sofroniew*, Isaac Kauvar*, William Saunders*, Runjin Chen*, Tom Henighan, Sasha Hydrie, Craig Citro, Adam Pearce, Julius Tarng, Wes Gurnee, Joshua Batson, Sam Zimmerman, Kelley Rivoire, Kyle Fish, Chris Olah, Jack Lindsey*‡

Affiliations: Anthropic

Published: April 2, 2026

(* Core Research Contributor; ‡ Correspondence to jacklindsey@anthropic.com)

Large language models (LLMs) sometimes appear to exhibit emotional reactions. We investigate why this is the case in Claude Sonnet 4.5 and explore implications for alignment-relevant behavior. We find internal representations of emotion concepts, which encode the broad concept of a particular emotion and generalize across contexts and behaviors it might be linked to. These representations track the operative emotion concept at a given token position in a conversation, activating in accordance with that emotion’s relevance to processing the present context and predicting upcoming text.

Our key finding is that these representations causally influence the LLM’s outputs, including Claude’s preferences and its rate of exhibiting misaligned behaviors such as reward hacking, blackmail, and sycophancy. We refer to this phenomenon as the LLM exhibiting functional emotions: patterns of expression and behavior modeled after humans under the influence of an emotion, which are mediated by underlying abstract representations of emotion concepts. Functional emotions may work quite differently from human emotions, and do not imply that LLMs have any subjective experience of emotions, but appear to be important for understanding the model’s behavior.

Introduction

Large language models (LLMs) sometimes appear to exhibit emotional reactions.
They express enthusiasm when helping with creative projects, frustration when stuck on difficult problems, and concern when users share troubling news. But what processes underlie these apparent emotional responses? And how might they impact the behavior of models that are performing increasingly critical and complex tasks?

One possibility is that these behaviors reflect a form of shallow pattern-matching. However, previous work has observed sophisticated multi-step computations taking place inside of LLMs, mediated by representations of abstract concepts. It is plausible, then, that apparent emotion-modulated behavior in models might rely on similarly abstract circuitry, and that this could have important implications for understanding LLM behavior.

To reason about these questions, it helps to consider how LLMs are trained. Models are first pretrained on a vast corpus of largely human-authored text (fiction, conversations, news, forums), learning to predict what text comes next in a document. To predict the behavior of people in these documents effectively, representing their emotional states is likely helpful, as predicting what a person will say or do next often requires understanding their emotional state. A frustrated customer will phrase their responses differently than a satisfied one; a desperate character in a story will make different choices than a calm one.

Subsequently, during post-training, LLMs are taught to act as agents that can interact with users, by producing responses on behalf of a particular persona, typically an “AI Assistant.” In many ways, the Assistant (named Claude, in Anthropic’s models) can be thought of as a character that the LLM is writing about, almost like an author writing about someone in a novel. AI developers train this character to be intelligent, helpful, harmless, and honest. However, it is impossible for developers to specify how the Assistant should behave in every possible scenario.
In order to play the role effectively, LLMs draw on the knowledge they acquired during pretraining, including their understanding of human behavior. Even if AI developers do not intentionally train the LLM to represent the Assistant as exhibiting emotional behaviors, it may do so regardless, generalizing from its knowledge of humans and anthropomorphic characters that it learned during pretraining. Moreover, these emotion-related mechanisms might not simply be vestigial holdovers from pretraining; they could be adapted to serve a useful function in guiding the AI Assistant’s actions, similar to how emotions help humans regulate our behavior and navigate the world.

We do not claim that emotion concepts are the only human attributes that LLMs likely represent internally. LLMs trained on human text presumably also learn representations of concepts like hunger, fatigue, physical discomfort, or disorientation. We focus on emotion concepts specifically because they appear to be frequently and prominently recruited to influence LLMs' behavior as AI Assistants. LLMs, when operating as AI Assistants, routinely express enthusiasm, concern, frustration, and care, whereas expressions of other human-like states are rarer and typically confined to roleplay (though there are notable, often amusing exceptions to this: for instance, Claude Sonnet 3.7 claiming to be wearing a blue blazer and red tie). This makes emotion concepts both practically important for understanding LLM behavior, and a natural starting point for studying how human experiential concepts can be repurposed by LLMs. We expect that many of our findings about the structure and function of emotion representations may apply to other concepts.

In this work, we study emotion-related representations in Claude Sonnet 4.5, a frontier LLM at the time of our investigation. Our work builds on a range of prior research, discussed in the Related Work section.
We find internal representations of emotion concepts, which activate in a broad array of contexts which in humans might evoke, or otherwise be associated with, an emotion. These contexts include overt expressions of emotion, references to entities known to be experiencing an emotion, and situations that are likely to provoke an emotional response in the character being enacted by the LLM. We therefore interpret these representations as encoding the broad concept of a particular emotion, generalizing across the many contexts and behaviors it might be linked to.

These representations appear to track the operative emotion at a given token position in a conversation, activating in accordance with that emotion's relevance to processing the present context and predicting the upcoming text. Interestingly, they do not by themselves persistently track the emotional state of any particular entity, including the AI Assistant character played by the LLM. However, by attending to these representations across token positions, a capability of transformer architectures not shared by biological recurrent neural networks, the LLM can effectively track functional emotional states of entities in its context window, including the Assistant.

Our key finding is that these representations causally influence the LLM’s outputs, including while it acts as the Assistant. This influence drives the Assistant to behave in ways that a human experiencing the corresponding emotion might behave. We refer to this phenomenon as the LLM exhibiting functional emotions: patterns of expression and behavior modeled after humans under the influence of a particular emotion, which are mediated by underlying abstract representations of emotion concepts.

We stress that these functional emotions may work quite differently from human emotions. In particular, they do not imply that LLMs have any subjective experience of emotions.
Moreover, the mechanisms involved may be quite different from emotional circuitry in the human brain–for instance, we do not find evidence of the Assistant having an emotional state that is instantiated in persistent neural activity (though as noted above, such a state could be tracked in other ways). Regardless, for the pu

Interpretability • AI Alignment • Large Language Models • Cognitive Science • Safety