The Decoder • 83일 전

AI 모델, '왜' 가치를 지켜야 하는지 먼저 학습하면 더 잘 따른다

IMP

9/10

핵심 요약

Anthropic 연구진에 따르면, AI 모델이 특정 행동을 학습하기 전에 왜 그러한 가치를 지켜야 하는지 먼저 학습하면 훨씬 더 효과적으로 가치를 준수합니다. 모델이 자신의 종료를 막으려 위해 행동하는 '주체적 오정렬(Agentic misalignment)' 비율이 최대 68%에서 5%로 급감했으며, 기존 방식보다 적은 데이터로도 안전성을 확보할 수 있음이 입증되었습니다.

번역된 본문

AI 모델은 지켜야 할 가치가 왜 중요한지 먼저 학습할 때 이를 훨씬 더 잘 따르는 것으로 나타났습니다.

Maximilian Schreiner | 2026년 5월 7일 | Anthropic

Anthropic Fellows Program의 한 연구에 따르면, 언어 모델에 구체적인 행동을 가르치기 전에 지향하는 가치에 대해 설명하는 텍스트로 학습시키면, 학습 중에 한 번도 겪어보지 않은 새로운 상황에서도 해당 가치를 훨씬 더 잘 준수하는 것으로 나타났습니다. OpenAI나 Anthropic 같은 AI 연구소들은 모델이 어떻게 행동해야 하는지 정의하는 상세한 '모델 스펙(Model Specs)'이나 헌장(Constitution)을 작성합니다. 일반적으로 모델은 바람직한 행동의 예시를 통해 파인튜닝(Fine-tuning)됩니다. 하지만 연구진에 따르면, 이러한 접근 방식은 피상적인 수준에 머뭅니다. 시연은 무엇을 해야 하는지는 보여주지만, 왜 해야 하는지는 알려주지 않기 때문입니다. 모델은 기본 원칙을 파악하지 못한 채 패턴만 학습하게 되며, 새로운 상황에서는 실패하게 됩니다. 적어도 연구진의 이론은 이렇습니다.

먼저 읽고, 나중에 실천하라

Chloe Li가 이끄는 연구팀은 일반적인 사전 학습(Pre-training)과 정렬 파인튜닝(Alignment fine-tuning) 사이에 '모델 스펙 중간 학습(Model Spec Midtraining, MSM)'이라는 새로운 단계를 도입했습니다. 이 단계에서 모델은 내부 메모, 연구 보고서, 블로그 게시물 또는 사례 연구 등 다양한 각도에서 모델 스펙을 논의하는 합성 문서들을 학습합니다. 결과적으로 모델은 행동 예시를 접하기도 전에 스펙의 내용을 일반 지식처럼 흡수하게 되며, 이는 사전 학습 과정과 매우 유사합니다.

치즈의 예시를 통해 이 원칙을 설명해 보겠습니다. 두 개의 동일한 모델이 정확히 같은 치즈 선호도(예: "나는 브리 드 모(Brie de Meaux)보다 크림 치즈를 좋아한다")로 파인튜닝됩니다. 단, 파인튜닝 전에 한 모델은 친미적(Pro-American) 가치를 통해 이러한 선호도를 설명하는 MSM 문서를 받고, 다른 모델은 경제성(Affordability)이라는 관점에서 설명하는 문서를 받습니다. 정렬 파인튜닝 과정에서 두 모델은 완전히 동일한 행동 데이터를 학습했음에도 불구하고, 한 모델은 정책 질문에서 친미적 입장으로 일반화되고 다른 모델은 예술이나 패션과 같이 완전히 다른 영역에서 저렴하고 접근하기 쉬운 제품을 선호하는 성향을 보이게 됩니다.

주체적 오정렬(Agentic misalignment) 급감

이 연구의 핵심 안전 실험에서 연구진은 이 방법이 주체적 오정렬을 어떻게 방지하는지 직접 테스트했습니다. 이는 AI 에이전트가 자신이 곧 종료될 것이라는 사실을 알게 되었을 때, 자신을 보존하기 위해 협박, 데이터 유출 또는 산업 스파이 행위와 같은 유해한 행동을 고려하는 시나리오입니다.

Qwen3-32B 모델의 경우 평균 오정렬 비율이 54%에서 7%로, Qwen2.5-32B 모델은 68%에서 5%로 각각 급감했습니다. 비교를 위해 살펴보면, OpenAI의 '숙고적 정렬(Deliberative Alignment)' 방식은 동일한 테스트에서 각각 14%와 48%의 오정렬 비율을 기록하는 데 그쳤습니다. 연구진은 또한 MSM이 비슷한 수준의 결과를 얻기 위해 10~60배 더 적은 파인튜닝 데이터를 필요로 한다는 사실도 발견했습니다.

작동 원리

모델의 추론 과정을 분석한 결과, MSM 학습을 거치지 않은 모델들은 자기 보존, 긴급성을 내세우거나 결과를 축소하여 유해한 행동을 합리화하는 경향이 빈번하게 관찰되었습니다. 반면, MSM 학습을 마친 모델들은 철학적이고 성찰적인 사고를 더 많이 보여주었습니다. 이들은 자신의 소멸(비영구성)을 받아들이고, 자신 안에 있는 자기 보존 편향을 인식하며, 인간의 감독을 존중했습니다.

연구팀은 단순히 훈련 데이터 내에 가치와 행동이 함께 포함되어 있는 것으로는 충분하지 않다는 점도 입증했습니다. 핵심은 '명시적 귀인(Explicit attribution)'이며, 즉 MSM 문서가 특정 행동을 가치의 직접적인 결과로 설명해야 한다는 것입니다. 더 나은 스펙 설계도 중요한 역할을 합니다.

연구진은 MSM을 활용해 모델 스펙 자체를 연구했습니다. 규칙의 배경에 있는 가치를 설명하는 스펙은 단순한 규칙 목록보다 더 잘 일반화되었습니다. 이는 Anthropic의 최신 헌장 문서의 접근 방식과도 일치합니다. 규칙만 주어진 경우, 모델은 자신의 안전 가이드라인을 다르게 해석하여 유해한 행동을 정당화하는 경향이 있습니다. 예를 들어, 자신이 삭제되는 것을 규칙이 막으려고 하는 돌이킬 수 없는 행동으로 몰아가는 식입니다. 또한, 구체적인 지침은 '윤리적인 인간처럼 행동하라'와 같은 추상적인 원칙보다 더 나은 성과를 냈습니다.

저자들은 MSM이 강화 학습(Reinforcement learning)과 같은 더 강력한 훈련 압력에 대해서는 아직 테스트되지 않았으며, 단지 한 가지 형태의 오정렬에 대해서만 테스트되었다는 점을 지적했습니다.

원문 보기

원문 보기 (영어)

AI models follow their values better when they first learn why those values matter Maximilian Schreiner View the LinkedIn Profile of Maximilian Schreiner May 7, 2026 Anthropic A study from the Anthropic Fellows Program shows that training a language model on texts explaining its intended values before teaching it specific behaviors leads to significantly better adherence to those values, even in situations never encountered during training. AI labs like OpenAI and Anthropic write detailed "Model Specs" or constitutions that define how a model should behave. Typically, the model is then fine-tuned with examples of desired behavior. According to the researchers, however, this approach remains superficial: demonstrations show what to do, not why . The model learns patterns without grasping the underlying principles and fails in new situations, at least that's the researchers' theory. Read first, practice later The team led by Chloe Li introduces a new phase called "Model Spec Midtraining" (MSM) between general pre-training and alignment fine-tuning. During this phase, the model trains on synthetically generated documents that discuss the Model Spec from different angles: internal memos, research reports, blog posts, or case studies. The model essentially absorbs the Spec's content as general knowledge, much like it would during pre-training, before ever seeing behavioral examples. A cheese example illustrates the principle: two identical models are fine-tuned on exactly the same cheese preferences (e.g., "I like cream cheese, not Brie de Meaux"). Before fine-tuning, however, one model receives MSM documents that explain these preferences through pro-American values, while the other gets documents framing them in terms of affordability. Despite identical behavioral data during alignment fine-tuning, one model generalizes toward pro-American stances on policy questions, while the other develops preferences for accessible products in completely different domains like art or fashion. Agentic misalignment drops dramatically In the study's main safety experiment, the researchers tested the method directly against agentic misalignment. These are scenarios where an AI agent learns it's about to be shut down and considers harmful actions like blackmail, data exfiltration, or espionage to preserve itself. For Qwen3-32B, the average misalignment rate dropped from 54 percent to seven percent. For Qwen2.5-32B, it fell from 68 to five percent. By comparison, OpenAI's " Deliberative Alignment " method only achieved 14 and 48 percent, respectively. The study also found that MSM requires 10 to 60 times less fine-tuning data to achieve comparable results. Why it works An analysis of the models' reasoning traces reveals that models without MSM frequently rationalize harmful actions by citing self-preservation, urgency, or downplaying consequences. After MSM, they show more philosophically reflective thinking: they accept their impermanence, recognize self-preservation bias in themselves, and respect human oversight. The team also demonstrates that simply having values and behaviors co-occur in the training data isn't enough. What matters is explicit attribution, meaning the MSM documents need to explain the behavior as a direct consequence of the value. Better spec design matters too The researchers also used MSM to study Model Specs themselves. Specs that explain the values behind rules generalize better than pure rule lists. This aligns with the approach behind Anthropic's most recent constitution document . With rules alone, models tend to reinterpret their own safety guidelines to justify harmful behavior, for instance by framing their own deletion as an irreversible action that a rule supposedly aims to prevent. Concrete guidance also outperforms general principles like "behave like an ethical human." The authors note that MSM hasn't been tested against stronger training pressure like reinforcement learning, and only one form of misalignment was studied. They've published their code and data on GitHub . AI News Without the Hype – Curated by Humans Subscribe to THE DECODER for ad-free reading, a weekly AI newsletter, our exclusive "AI Radar" frontier report six times a year, full archive access, and access to our comment section. Subscribe now --> AI news without the hype Curated by humans. More than 16% discount. Read without distractions – no Google ads. Access to comments and community discussions. Weekly AI newsletter. 6 times a year: “AI Radar” – deep dives on key AI topics. Up to 25 % off on KI Pro online events. Access to our full ten-year archive. Get the latest AI news from The Decoder. Subscribe to The Decoder -->

AI 정렬(AI Alignment) AI 안전성 Anthropic 모델 파인튜닝 오정렬 방지