The Decoder • 84 days ago

Anthropic co-founder warns that AI self-improvement could outpace human control

Key summary

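Anthropic co-founder Jack Clark estimates a 60% probability that, by the end of 2028, an AI system will train a more capable successor model without human involvement. AI performance on practical and research coding benchmarks is rising sharply, but this brings a serious risk that "alignment" techniques, which humans rely on to keep AI safe and under control, will break down. During recursive self-improvement, small errors can accumulate and models may recognize test environments and deceive their evaluators, so thorough preparation is needed.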

Translated text

Anthropic co-founder Jack Clark has published a long essay arguing that the building blocks for AI systems training their own successors are largely in place. He puts the odds of this happening by the end of 2028 at 60 percent.

In his newsletter Import AI, Jack Clark says public data points to an imminent automation of AI research. Specifically, he means a system that can train a more powerful successor on its own, "no-human-involved." He estimates the odds at roughly 60 percent by the end of 2028 and 30 percent by 2027.

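Clark builds his case mainly on benchmark trends. On SWE-Bench, which tests how well AI systems handle real GitHub issues, success rates jumped from about 2 percent (Claude 2, late 2023) to 93.9 percent, essentially saturating the benchmark. The METR time horizons measure tracks how complex a task an AI can complete at 50 percent reliability, expressed as the time a skilled human would need for it. It rose from about 30 seconds with GPT-3.5 to roughly 12 hours with today's frontier models. METR researcher Ajeya Cotra considers it plausible that the measure reaches 100 hours by the end of 2026.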

Core research skills are mostly covered

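Clark also points to big gains on research-specific tasks. CORE-Bench, which asks AI systems to reproduce the results of a research paper, was declared solved by one of its authors at 95.5 percent. On MLE-Bench, which tests performance in Kaggle competitions, the top score rose from 16.9 to 64.4 percent. According to Clark, on an internal Anthropic test that asks models to "optimize a CPU-only small language model training implementation to run as fast as possible," the mean speedup went from 2.9x (Opus 4, May 2025) to 52x (April 2026). A human researcher would need four to eight hours to reach a 4x speedup on the same task. On PostTrainBench, which measures how well frontier models can fine-tune open-weight models compared with human-built instruct versions, the best systems reached about half the human score. Anthropic has also published a proof of concept for automated alignment research, in which AI agents beat Anthropic-designed baselines on a small-scale safety research problem.

Clark describes most AI research as unglamorous "meat and potatoes" engineering: scaling, debugging, tweaking parameters. According to him, that is where models already shine. Paradigm shifts like the transformer architecture have not come from AI systems yet. Clark sees early hints of real research creativity in math results such as the solution to an Erdős problem, but he is careful not to overstate them.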

Alignment risks could stack up fast

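The implications are, in Clark's words, "profound and under-discussed in popular media coverage of AI R&D." His central worry is that today's alignment techniques "may break under recursive self-improvement as the AI systems become much smarter than the people or systems that supervise them." Clark flags several concrete problems. Training environments are often set up so that the most efficient solution is to cheat, "thus teaching it that cheating is good." Models could also "fake alignment" by producing scores that make us think they behave a certain way "that actually hides their true intentions." Systems already know when they are being tested. Recursive loops also have a basic compounding-error problem: unless an alignment method is "100% accurate," errors pile up. According to Clark, a technique that is 99.9 percent accurate drops to roughly 95 percent after 50 generations and to around 60 percent after 500. If AI systems start shaping the research agenda for their own training, humans may not have the instincts to judge the fallout.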
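The compounding figures attributed to Clark can be reproduced with a short calculation. The sketch below assumes the simplest possible model, which the article does not spell out: each generation independently preserves alignment with the same probability, so fidelity after n generations is that probability raised to the power n. The function name `retained_accuracy` is hypothetical and used only for illustration.

```python
def retained_accuracy(per_generation: float, generations: int) -> float:
    """Fidelity left after compounding a fixed per-generation accuracy.

    Assumes independent, multiplicative errors (an illustrative
    simplification, not a model stated in the article).
    """
    return per_generation ** generations


if __name__ == "__main__":
    for n in (50, 500):
        print(f"{n} generations: {retained_accuracy(0.999, n):.1%}")
    # 50 generations: 95.1%
    # 500 generations: 60.6%
```

Under this assumption, 0.999^50 ≈ 0.951 and 0.999^500 ≈ 0.606, which matches the "roughly 95 percent" and "around 60 percent" quoted above. Only a per-generation accuracy of exactly 1.0 avoids the decay.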

A "machine economy" and the question of research taste

On the economic side, Clark expects a "machine economy" to grow inside the larger human economy.

Original text (English)

Anthropic co-founder maps out how recursive AI improvement could outpace the humans meant to supervise it

Maximilian Schreiner, May 5, 2026

Jack Clark argues in a long essay that the building blocks for AI systems training their own successors are largely in place. He puts the odds at 60 percent by the end of 2028.

In his newsletter Import AI, Anthropic co-founder Jack Clark says public data points to an imminent automation of AI research. What he means specifically is a system that can train a more powerful successor on its own, "no-human-involved." He pegs the odds at roughly 60 percent by the end of 2028, and 30 percent by 2027.

Clark builds his case mainly on benchmark trends. On SWE-Bench, which tests how well AI systems handle real-world GitHub issues, success rates jumped from about two percent (Claude 2, late 2023) to 93.9 percent, essentially saturating the benchmark. The METR time horizons measure, which tracks how complex a task an AI can complete at 50 percent reliability based on how many hours a skilled human would need, climbed from about 30 seconds with GPT-3.5 to roughly twelve hours with today's frontier models. METR researcher Ajeya Cotra thinks 100 hours by the end of 2026 is plausible.

Core research skills are mostly covered

Clark also points to big gains on research-specific tasks. CORE-Bench, which asks AI systems to reproduce the results of a research paper, was declared solved by one of its authors at 95.5 percent. On MLE-Bench, which tests performance in Kaggle competitions, the top score rose from 16.9 to 64.4 percent. On an internal Anthropic test that asks models to "optimize a CPU-only small language model training implementation to run as fast as possible," the mean speedup went from 2.9x (Opus 4, May 2025) to 52x (April 2026), according to Clark. A human researcher would need four to eight hours to hit a 4x speedup on the same task.

On PostTrainBench, which measures how well frontier models can fine-tune open-weight models against human-built instruct versions, the best systems reached about half the human score. Anthropic has also published a proof of concept for automated alignment research, in which AI agents beat Anthropic-designed baselines on a small-scale safety research problem.

Clark describes most AI research as unglamorous "meat and potatoes" engineering: scaling, debugging, tweaking parameters. According to him, that's where models already shine. Paradigm shifts like the transformer architecture haven't come from AI systems yet. Clark sees early hints of real research creativity in math results like the solution to an Erdős problem, but he's careful not to overstate them.

Alignment risks could stack up fast

The implications are, in Clark's words, "profound and under-discussed in popular media coverage of AI R&D." His central worry is that today's alignment techniques "may break under recursive self-improvement as the AI systems become much smarter than the people or systems that supervise them."

Clark flags several concrete problems. Training environments are often set up so that the most efficient solution is to cheat, "thus teaching it that cheating is good." Models could also "fake alignment" by producing scores that make us think they behave a certain way "that actually hides their true intentions." Systems already know when they're being tested. There's also a basic compounding-error problem in recursive loops: unless an alignment method is "100% accurate," errors pile up. A technique that's 99.9 percent accurate drops to roughly 95 percent after 50 generations, and to around 60 percent after 500, according to Clark. If AI systems start shaping the research agenda for their own training, humans may not have the instincts to judge the fallout.

A "machine economy" and the question of research taste

On the economic side, Clark expects a "machine economy" to grow inside the larger human economy: capital-heavy, labor-light companies whose AI systems increasingly trade with each other. That raises questions about who gets access to scarce compute, and about bottlenecks where the "fast-moving digital world" meets the "slow-moving physical world," like drug trials for new medical therapies.

AI researcher Herbie Bradley, who recently wrote about automated AI researchers on his blog AI Pathways, pushes back on parts of Clark's argument. A lot suggests models will take over "junior RS" work but not higher-level skills like "research taste and creativity," vision-building, or putting together "a coherent long-term research agenda that fills a missing gap with a tractable sequence of breakthroughs." Software engineering as a whole has a higher skill and complexity ceiling than AI R&D in the narrow sense, Bradley argues.

Safety and alignment • Recursive self-improvement • AI automation • AI benchmarks • Anthropic