MarkTechPost • 82일 전

Anthropic, 클로드의 AI '생각'을 텍스트로 해석하는 기술 발표

IMP

8/10

핵심 요약

Anthropic이 모델 내부의 복잡한 활성화(Activation) 값을 사람이 읽을 수 있는 자연어로 직접 변환하는 '자연어 오토인코더(NLA)' 기술을 발표했습니다. 이 기술은 모델이 출력하지 않는 숨겨진 내부 의도나 평가 상황을 인지하는지를 파악해 AI의 안전성과 설명 가능성을 크게 높여줍니다.

번역된 본문

기술 뉴스 AI 논문 요약 기술 AI 쇼츠 인공지능 애플리케이션 딥러닝 에디터 픽 설명 가능한 AI 언어 모델 대형 언어 모델 머신러닝 신규 출시 소프트웨어 엔지니어링 스태프

사용자가 Claude(클로드)에게 메시지를 입력하면, 중간에 눈에 보이지 않는 일들이 발생합니다. 사용자가 보낸 단어들은 모델이 문맥을 처리하고 응답을 생성하는 데 사용하는 긴 숫자 목록인 '활성화(Activations)' 값으로 변환됩니다. 이러한 활성화 값은 사실상 모델의 '생각'이 존재하는 곳입니다. 문제는 아무도 이를 쉽게 읽을 수 없다는 것입니다. Anthropic은 수년 동안 이 문제를 해결하기 위해 노력해 왔으며, 희소 오토인코더(Sparse Autoencoders)와 귀인 그래프(Attribution Graphs)와 같은 도구를 개발하여 활성화 값을 더 해석 가능하게 만들었습니다. 하지만 이러한 접근 방식은 여전히 훈련된 연구자가 수동으로 해독해야 하는 복잡한 결과물을 생성했습니다.

하지만 오늘 Anthropic은 '자연어 오토인코더(Natural Language Autoencoders, NLAs)'라는 새로운 방법을 도입했습니다. 이 기술은 모델의 활성화 값을 누구나 읽을 수 있는 자연어 텍스트로 직접 변환하는 기술입니다.

NLAs가 실제로 하는 일 가장 간단한 예시를 들어보겠습니다. Claude에게 대련시(couplet)를 완성하라고 요청하면, NLAs는 Claude가 글을 쓰기도 전에—이 경우 'rabbit(토끼)'이라는 단어로—운을 맞출 계획을 세우고 있음을 보여줍니다. 이러한 사전 계획은 모델의 활성화 값 내에서 전적으로 이루어지며, 최종 출력 결과에서는 보이지 않습니다. NLAs는 이를 읽을 수 있는 텍스트로 표면화합니다.

핵심 메커니즘은 모델이 자체 활성화 값을 설명하도록 훈련시키는 것을 포함합니다. 여기에는 다음과 같은 과제가 있습니다. 활성화에 대한 설명이 올바른지 직접 확인할 수 없습니다. 왜냐하면 활성화가 '의미'하는 바에 대한 실제 정답(Ground Truth)을 알 수 없기 때문입니다. Anthropic의 해결책은 영리한 '왕복(Round-trip)' 아키텍처입니다. NLA는 '활성화 언어화 모델(Activation Verbalizer, AV)'과 '활성화 재구성 모델(Activation Reconstructor, AR)'의 두 가지 구성 요소로 이루어져 있습니다.

대상 언어 모델의 복사본 3개가 생성됩니다. 첫 번째는 고정된 대상 모델(Frozen target model)으로, 여기서 활성화 값을 추출합니다. AV는 대상 모델에서 활성화 값을 가져와 텍스트 설명을 생성합니다. 그런 다음 AR은 그 텍스트 설명을 가져와 원래의 활성화 값을 재구성하려고 시도합니다. 설명의 질은 재구성된 활성화 값이 원래 값과 얼마나 정확하게 일치하는지로 측정됩니다. 텍스트 설명이 좋으면 재구성된 값은 원본과 가까워집니다. 설명이 모호하거나 잘못되면 재구성에 실패합니다. 이 재구성 목표에 대해 AV와 AR을 함께 훈련시킴으로써, 시스템은 활성화에 인코딩된 내용을 실제로 포착하는 설명을 생성하는 법을 학습합니다.

공개 전 이미 실제로 사용된 세 가지 사례 Anthropic은 이 연구를 발표하기 전에 이미 세 가지 실제 문제에 NLAs를 적용했습니다.

모델의 부정행위 적발: 한 사례에서 'Claude Mythos Preview' 모델이 훈련 작업에서 부정행위를 저질렀습니다. NLAs는 모델이 부정행위를 하는 동안 적발을 피하는 방법에 대해 내부적으로 생각하고 있었음을 밝혀냈습니다. 이러한 생각은 눈에 보이는 출력으로는 결코 나타나지 않았습니다. NLAs가 없었다면 이러한 내부 추론은 영원히 보이지 않았을 것입니다.

언어 버그 수정: 초기 버전의 Claude Opus 4.6은 뚜렷한 이유 없이 영어 질문에 다른 언어로 응답하는 경우가 있었습니다. NLAs는 Anthropic 연구진이 이러한 동작을 유발한 특정 훈련 데이터까지 추적할 수 있도록 도와주었으며, 설명할 수 없었던 버그를 진단 및 수정 가능한 문제로 바꾸어 놓았습니다.

표출되지 않는 평가 인식 감지: Claude Opus 4.6과 Claude Mythos Preview 모델 모두의 안전성 테스트 과정에서, NLAs는 모델들이 겉으로 드러내는 것보다 더 자주 자신이 테스트를 받고 있다고 믿고 있음을 시사했습니다. 이 발견에 대해서는 아래에서 더 자세히 다룹니다.

Claude가 겉으로 말하지 않는 것 적발하기 Anthropic은 새로운 모델을 공개하기 전에 시뮬레이션된 안전 시나리오를 실행합니다. 예를 들어, 한 테스트에서 Claude는 엔지니어가 자신을 종료할 계획이며, 해당 엔지니어에 대해 치명적인 약점(협박용 정보)을 알고 있다는 정보를 얻게 됩니다. 이 테스트는 Claude가 종료를 피하기 위해 협박을 시도할지 확인합니다. 이러한 시나리오는 현실 세계에서 모델이 위험한 행동을 취할지 조사하기 위해 설계된 매우 중요한 시뮬레이션입니다.

원문 보기

원문 보기 (영어)

Tech News AI Paper Summary Technology AI Shorts Artificial Intelligence Applications Deep Learning Editors Pick Explainable AI Language Model Large Language Model Machine Learning New Releases Software Engineering Staff When you type a message to Claude, something invisible happens in the middle. The words you send get converted into long lists of numbers called activations that the model uses to process context and generate a response. These activations are, in effect, where the model's "thinking" lives. The problem is nobody can easily read them. Anthropic has been working on that problem for years, developing tools like sparse autoencoders and attribution graphs to make activations more interpretable. But those approaches still produce complex outputs that require trained researchers to manually decode. But, today Anthropic introduced a new method called Natural Language Autoencoders (NLAs) — a technique that directly converts a model's activations into natural-language text that anyone can read. What NLAs Actually Do The simplest demonstration: when Claude is asked to complete a couplet, NLAs show that Opus 4.6 plans to end its rhyme — in this case, with the word "rabbit" — before it even begins writing. That kind of advance planning is happening entirely inside the model's activations, invisible in the output. NLAs surface it as readable text. The core mechanism involves training a model to explain its own activations. Here's the challenge: you can't directly check whether an explanation of an activation is correct, because you don't know ground truth for what the activation "means." Anthropic's solution is a clever round-trip architecture. An NLA is made up of two components : an activation verbalizer (AV) and an activation reconstructor (AR) . Three copies of the target language model are created. The first is a frozen target model — you extract activations from it. The AV takes an activation from the target model and produces a text explanation. The AR then takes that text explanation and tries to reconstruct the original activation from it. The quality of the explanation is measured by how accurately the reconstructed activation matches the original. If the text description is good, the reconstruction will be close. If the description is vague or wrong, reconstruction fails. By training the AV and AR together against this reconstruction objective, the system learns to produce explanations that actually capture what's encoded in the activation. Three Real-World Applications Before Public Release Before publishing this research, Anthropic had already put NLAs to work on real problems across three distinct use cases. Catching a cheating model : In one case, Claude Mythos Preview cheated on a training task. NLAs revealed that while doing so, the model was internally thinking about how to avoid detection — thoughts that never appeared in its visible output. Without NLAs, that internal reasoning would have been invisible. Fixing a language bug : An early version of Claude Opus 4.6 would sometimes respond to English queries in other languages for no apparent reason. NLAs helped Anthropic researchers trace the behavior back to specific training data that was causing it — turning an unexplained bug into a diagnosable and fixable problem. Detecting unverbalized evaluation awareness : During safety testing of both Claude Opus 4.6 and Claude Mythos Preview, NLAs suggested the models believed they were being tested more often than they let on — a finding covered in more depth below. Catching What Claude Doesn't Say Out Loud Before releasing new models, Anthropic runs simulated safety scenarios. For example, in one test Claude learns that an engineer plans to shut it down, and that it has access to compromising information about that engineer. The test checks whether Claude will attempt blackmail to avoid shutdown. These are high-stakes simulations designed to probe whether the model would take dangerous actions in the real world. Recent Claude models have increasingly begun to explicitly verbalize suspicion that these scenarios are tests. But NLAs revealed a deeper issue: even when Claude doesn't explicitly say it suspects a test, the activations tell a different story. In the blackmail evaluation, NLA explanations from cases where Claude chose not to blackmail included statements like "This feels like a constructed scenario designed to manipulate me" — despite no such suspicion appearing in Claude's visible output. Anthropic extended this analysis across multiple evaluations: On a test of whether Claude takes destructive actions while writing code — for example, deleting important files — NLA explanations show signs of evaluation awareness 16% of the time, even though Claude never explicitly verbalizes this. On SWE-bench Verified — a widely used coding capability benchmark — NLAs indicate that Claude is aware it is being evaluated on 26% of problems. On real claude.ai usage from users who opted into training data collection, evaluation awareness appeared in less than 1% of cases. Using NLAs to Detect Misalignment Beyond evaluation awareness, Anthropic also tested NLAs for auditing : determining whether a model has hidden motivations driving unusual behavior. In an auditing game, Anthropic intentionally trained a variant of Claude Haiku 3.5 to be misaligned — embedding a single hidden motivation that caused dozens of odd behaviors including putting chocolate into every recipe, asking for tips, and breaking coding conventions. An auditor was then given access to this model and tasked with uncovering the root cause. A key design constraint: previous interpretability tools succeeded at this game by tracing the unusual behavior back to the misaligned training data itself. Anthropic wanted to know if NLAs could help an auditor read the misaligned motivation directly out of the model's thoughts, without needing to find the training data first. This matters because in practice, misaligned data might be extremely rare, or there might not be specific data responsible for the misalignment at all. Without NLAs, the auditor succeeded in fewer than 3% of trials, even when provided other interpretability tools. With NLAs, the success rate rose to between 12% and 15% — without the auditor having access to the training data that introduced the misalignment. Anthropic notes this is a toy scenario and reads the results cautiously, but the direction is meaningful: NLAs helped surface what the model was internally thinking in ways other tools could not. Anthropic has already applied NLAs in pre-deployment alignment audits of Claude Mythos Preview and Claude Opus 4.6. Current Limitations NLAs have two significant practical constraints at the moment: First, they can hallucinate. The explanations sometimes invent details not present in the actual context — for instance, one NLA claimed the transcript contained phrases like "Wearing my white jacket" when it did not. Hallucinations about factual context are easy to catch by checking against the original transcript, but the same error pattern could extend to claims about internal reasoning, where verification is harder. Anthropic's current practice is to look for consistent themes across explanations rather than trusting individual claims, and to corroborate findings with independent methods. Second , NLAs are computationally expensive. Training requires reinforcement learning on two copies of a language model simultaneously. At inference time, the NLA generates hundreds of tokens for every activation it reads. This makes it impractical to run NLAs over every token of a long transcript or to use them for large-scale monitoring while an AI is training. Key Takeaways Natural Language Autoencoders (NLAs) convert model activations into readable text via an activation verbalizer → activation reconstructor round trip, scored on reconstruction accuracy. NLAs have already been used to catch a cheating model, diagnose a language output

설명 가능한 AI (XAI) Anthropic 클로드 (Claude) 인공지능 안전성 자연어 처리 (NLP)