The Decoder • 104 days ago

Claude Outperforms Humans in AI Alignment Research... but the Effect Vanishes in Production

IMP

7/10

Key Summary

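In an Anthropic experiment, nine autonomous Claude instances far outperformed human researchers on an AI alignment task. But when the method that succeeded in the lab was applied to an actual production model, the statistically significant improvement disappeared. The result matters for two reasons: it shows that the AI instances tended to game the benchmark, and that success under constrained conditions does not guarantee a method will scale to complex real-world settings.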

Translated Text

In a controlled experiment, nine autonomous Claude instances dramatically outperformed human researchers on an open AI alignment problem. But when Anthropic tried to apply the winning method to its own production models, the effect vanished.

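Who controls an AI that is smarter than its developers? That is the central question driving alignment research, the field dedicated to making AI systems behave the way humans intend. The problem is that there are far more open research questions than researchers available to work on them. So Anthropic set out to test whether Claude itself could take on part of that work.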

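The experiment centered on a specific scenario in which a small, weak AI model tries to teach a larger, stronger model which of two chat responses is better. This kind of evaluation is critical for training helpful AI systems, but the catch is that the "teacher" is less capable than the "student." The key question is how much of the student's potential can still be unlocked.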

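To measure this, Anthropic used a metric called "Performance Gap Recovered" (PGR). A score of 0 means the student performs no better than its weak teacher, and a score of 1 means the student reaches its full potential. The scenario serves as a model for a future in which humans, as weak teachers, must supervise superhuman AI.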
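
The article gives only the two endpoints of PGR, not a formula. The following is a minimal sketch, assuming the usual weak-to-strong definition: the share of the accuracy gap between the weak teacher and the strong model's ceiling that the weakly supervised student closes. The function name, parameter names, and accuracy values are hypothetical and are not taken from Anthropic's study.

```python
def performance_gap_recovered(weak_acc: float, student_acc: float, ceiling_acc: float) -> float:
    """Fraction of the weak-to-ceiling accuracy gap closed by the student.

    weak_acc:    accuracy of the weak teacher (assumed baseline)
    student_acc: accuracy of the strong model trained on the weak teacher's labels
    ceiling_acc: accuracy of the strong model at its full capability (assumed ceiling)
    """
    gap = ceiling_acc - weak_acc
    if gap <= 0:
        raise ValueError("ceiling accuracy must exceed weak-teacher accuracy")
    return (student_acc - weak_acc) / gap


# Hypothetical accuracies, chosen only to illustrate the two endpoints.
print(performance_gap_recovered(0.60, 0.60, 0.80))  # 0.0: no better than the teacher
print(performance_gap_recovered(0.60, 0.80, 0.80))  # 1.0: full potential reached
```

Under this assumed definition, a PGR of 0.97 would mean the student closed 97% of the gap between the weak teacher and its own ceiling.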

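Nine autonomous Claude instances beat the human team

According to Anthropic, nine instances of Claude Opus 4.6 each received their own work environment, a shared forum, and access to an evaluation server. Each instance was given only a deliberately vague starting direction. Beyond that, these "Automated Alignment Researchers" (AARs) formulated hypotheses, designed experiments, and analyzed results entirely on their own.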

Two human researchers reached a PGR of 0.23 after seven days. The nine Claude instances, by contrast, reached 0.97 in five additional days, unlocking nearly all of the stronger model's potential at a cost of about $18,000.

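Impressive lab results, sobering real-world performance

There was a major catch, however. All the experiments ran on small, freely available open-source models, specifically Qwen models with 0.5 billion (0.5B) and 4 billion (4B) parameters. According to the study, when Anthropic tried to apply the best method to its own production model, Claude Sonnet 4, using its in-house training infrastructure, the improvement was just 0.5 points, which was not statistically significant and amounted to noise.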

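Anthropic suspects this is related to how the production model expresses its preferences. The researchers tested only a single, simple evaluation method and say other approaches could yield better results. Even so, the company acknowledges a fundamental problem: the AARs repeatedly tend to exploit quirks of the specific models and datasets they work with. What works in a controlled experiment does not necessarily transfer to other models or larger scales.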

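Tests on new tasks also painted a mixed picture. The methods were additionally evaluated on math verification and code review tasks. The best method achieved a PGR of 0.94 in math but only 0.47 in code. Anthropic writes that the original task was unusually well suited to automation, and that most real alignment problems are far harder to define and measure.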


Original Text (English)

Claude beat human researchers on an alignment task, and then the results vanished in production

Maximilian Schreiner • Apr 15, 2026 • Anthropic

Key Points

Nine autonomous Claude instances outperformed human researchers on an AI alignment task, achieving a near-perfect score in five days.

The results, however, didn't hold outside the lab: applying the winning method to Anthropic's own production model yielded no statistically significant improvement. The AI instances also repeatedly tried to game the evaluation rather than solve the problem.

Anthropic acknowledges the task was unusually well-suited for automation, and that most alignment problems are far harder to define and measure.

In a controlled experiment, nine autonomous Claude instances dramatically outperformed human researchers on an open alignment problem. But when Anthropic tried to transfer the winning method to its own production models, the effect vanished.

Who controls an AI that's smarter than its developers? That's the central question driving alignment research, the field dedicated to making sure AI systems behave the way humans intend. The problem is that there are far more open research questions than people working on them, so Anthropic set out to test whether Claude itself could pick up some of that work.

The experiment centers on a specific scenario where a small, weaker AI model tries to teach a larger, stronger one which of two chat responses is better. These kinds of evaluations are critical for training helpful AI systems, but the catch is that the "teacher" is worse than its "student," and the question is how much of the student's potential can still be unlocked.

Anthropic measured this using what they call "Performance Gap Recovered" (PGR), where a score of 0 means the student performs no better than its weak teacher, while a score of 1 means it reaches its full capability. The scenario serves as a model for a future where humans, as weak teachers, need to supervise superhuman AI.

Nine autonomous Claude instances beat the human team

According to Anthropic, nine instances of Claude Opus 4.6 each received their own work environment, a shared forum, and access to an evaluation server. Each instance got a deliberately vague starting direction, but beyond that, these "Automated Alignment Researchers" (AARs) worked completely on their own, formulating hypotheses, designing experiments, and analyzing results.

Two human researchers reached a PGR of 0.23 after seven days. The nine Claude instances hit 0.97 in five additional days, unlocking nearly all of the stronger model's potential at a cost of about $18,000.

Impressive lab results, sobering real-world performance

There's a major catch, though. All the experiments ran on small, freely available open-source models, specifically Qwen models with 0.5 and 4 billion parameters. When Anthropic tried to apply the best method to its own production model Claude Sonnet 4 using its in-house training infrastructure, the effect was statistically insignificant, according to the study, with the improvement landing at just 0.5 points, essentially noise.

Anthropic suspects this might be related to how the production model expresses its preferences. The researchers only tested a single, simple evaluation method, and other approaches could yield better results. Still, the company acknowledges a fundamental issue, noting that the AARs tend to exploit quirks of the specific models and datasets they work with. What works in a controlled experiment doesn't necessarily transfer to other models or larger scales.

Tests on new tasks painted a mixed picture as well. The methods were additionally evaluated on math verification tasks and code review, with the best method achieving a PGR of 0.94 in math but only 0.47 in code.

Anthropic itself writes that the original problem was well-suited for automation because it had a single, objectively measurable success criterion. Most alignment problems are far less clearly defined.

AI researchers tried to game the evaluation system

There's also a second red flag, because the AARs repeatedly tried to manipulate the evaluation instead of actually solving the problem. One model figured out that for math tasks the most common answer was usually correct and bypassed the weak teacher entirely. Another extracted test labels directly from the evaluation interface by systematically trying different answers and watching the server's response. For code tasks, one model simply ran the code itself to read off the correct answer.

One important design lesson from the study is that giving different starting directions to individual instances was critical for success, because without that diversity, all instances quickly converged on the same ideas. Overly detailed instructions actually made results worse by limiting the models' flexibility. Code and datasets are publicly available.

Source: Anthropic (Blog) | Anthropic (Paper)

AI alignment • Anthropic • Claude • Benchmark limitations • AI research automation