The Decoder • 66 days ago

Researchers unveil a scaling algorithm designed by an AI itself

IMP

8/10

Key summary

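A new algorithm for controlling AI inference, found through search by an AI coding agent rather than designed by human researchers, outperformed existing methods. It allocates compute dynamically by tracking how the model's confidence changes, and in its lean setting it cut token usage by about 70 percent relative to standard self-consistency while accuracy held steady. The search cost about $40 and took 160 minutes, and it produced logic that the authors say would have been nearly impossible to design by hand, which is taken as evidence for the potential of automated algorithm search.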

Translated text

Researchers let Claude Code discover AI scaling algorithms that humans probably wouldn't have designed

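Instead of writing rules for more efficient AI reasoning themselves, the researchers let a coding agent search for better control algorithms in a simulated environment. The result outperforms established methods while using far less compute.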

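Test-time scaling (TTS) is a technique for improving the performance of large language models by spending more compute on a response, for example by running several solution paths in parallel or extending the chain of thought. Until now, the rules that decide when a model starts a new solution path, concentrates on a promising one, or abandons one have almost always been written by humans.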

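AutoTTS, from researchers at the University of Maryland (UMD), the University of Virginia (UVA), Washington University in St. Louis (WUSTL), the University of North Carolina (UNC), Google, and Meta, turns this approach around. Instead of writing the algorithm themselves, the researchers built a playground (an environment) in which an AI agent finds algorithms on its own.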

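According to the paper, many existing methods are really just special cases in a shared control space defined by width (the number of solution paths running at once) and depth (how far each path proceeds). The authors therefore ask why researchers keep charting paths through this space by hand instead of letting a machine search it.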

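Simulating the search keeps costs down

At the core of AutoTTS is an offline environment. For each task, the team generates several solution paths from the language model in advance and stores them. A new control algorithm then decides how to distribute compute using this existing data. As a result, thousands of variants can be tested without running the actual language model each time.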
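The replay idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the AutoTTS code: the class names `StoredPath` and `OfflineEnv`, the checkpoint format, and the token counts are all hypothetical. The sketch shows only the central point, that a controller can be charged for tokens against stored paths without any model call.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class StoredPath:
    """One pre-generated solution path (hypothetical storage format)."""
    # Each checkpoint is (cumulative_tokens, interim_answer).
    checkpoints: list[tuple[int, str]]


@dataclass
class OfflineEnv:
    """Replays stored paths so a controller can be scored without the model."""
    paths: list[StoredPath]
    tokens_used: int = 0
    # Index of the last revealed checkpoint per path; missing means not started.
    position: dict[int, int] = field(default_factory=dict)

    def advance(self, path_id: int) -> str | None:
        """Reveal the next checkpoint of a path and charge its tokens."""
        path = self.paths[path_id]
        pos = self.position.get(path_id, -1)
        if pos + 1 >= len(path.checkpoints):
            return None  # path exhausted, nothing more is charged
        prev_tokens = path.checkpoints[pos][0] if pos >= 0 else 0
        tokens, answer = path.checkpoints[pos + 1]
        self.tokens_used += tokens - prev_tokens
        self.position[path_id] = pos + 1
        return answer


# Made-up data: two stored paths with two checkpoints each.
env = OfflineEnv(paths=[
    StoredPath([(100, "12"), (250, "14")]),
    StoredPath([(80, "14"), (300, "14")]),
])
env.advance(0)  # charges 100 tokens
env.advance(1)  # charges 80 tokens
env.advance(1)  # charges the remaining 220 tokens of path 1
print(env.tokens_used)  # 400
```

In this framing, width corresponds to how many distinct `path_id` values a controller touches and depth to how often it calls `advance` on each one, so standard self-consistency would simply advance every stored path to its end.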

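Claude Code carries out the search. Over several rounds, the coding agent reviews earlier results, identifies weaknesses in previous proposals, and writes a new control algorithm directly in code. To keep the search from getting lost in thousands of fine-grained parameters, each proposal may expose only a single high-level controller to the outside. That controller sets all the other thresholds itself. Detailed logs from each run also show the agent where earlier attempts wasted compute.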
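One way to picture the single-controller constraint is a function that maps one exposed knob to every internal threshold. The sketch below is an assumption-laden illustration: the knob name `effort`, the `Thresholds` fields, and all formulas are invented for this example and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Internal settings derived from the single knob; none is tuned directly."""
    initial_width: int   # solution paths opened at the start
    max_width: int       # cap on simultaneously open paths
    stall_delta: float   # confidence change below this counts as stalled
    drop_patience: int   # rounds of disagreement tolerated before a drop


def derive_thresholds(effort: float) -> Thresholds:
    """Map one knob in [0, 1] to all thresholds (illustrative formulas only)."""
    if not 0.0 <= effort <= 1.0:
        raise ValueError("effort must be in [0, 1]")
    initial_width = 2 + round(6 * effort)
    return Thresholds(
        initial_width=initial_width,
        max_width=initial_width * 4,
        stall_delta=0.02 + 0.08 * effort,
        drop_patience=1 + round(3 * effort),
    )


lean = derive_thresholds(0.0)      # spends as little compute as possible
thorough = derive_thresholds(1.0)  # widest, most patient setting
print(lean.initial_width, thorough.initial_width)  # 2 8
```

Because the thresholds move together, a search over proposals can compare controllers along one axis instead of tuning each threshold separately, which is the motivation the article gives for the constraint.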

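Agent-written algorithm outperforms human-designed ones

On math benchmarks such as AIME and HMMT, the algorithm the agent came up with achieved higher accuracy per unit of compute than established methods. Compared with standard self-consistency, which generates 64 answers in parallel and picks the final answer by majority vote, the algorithm's lean setting cut token usage by about 70 percent while accuracy held steady.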
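For reference, the self-consistency baseline and the saving figure reduce to a majority vote and one subtraction. The snippet below is a minimal sketch; the answers and token counts are made up, chosen only so that the arithmetic lands on the roughly 70 percent reported in the article.

```python
from collections import Counter


def majority_vote(answers: list[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]


def token_saving(baseline_tokens: int, controller_tokens: int) -> float:
    """Fraction of the baseline's tokens that the controller did not spend."""
    return 1.0 - controller_tokens / baseline_tokens


# Made-up example: 64 parallel answers, 40 of which agree on "14".
answers = ["14"] * 40 + ["12"] * 24
print(majority_vote(answers))  # 14

# Made-up token counts: 64 paths of 1,000 tokens each versus 19,200 tokens.
print(round(token_saving(64 * 1000, 19_200), 2))  # 0.7
```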

The algorithm also carried over to a different model (DeepSeek-R1-Distill-Llama-8B) and to a non-math benchmark (GPQA-Diamond). The entire discovery run cost about $40 and took 160 minutes.

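A logic humans probably wouldn't have come up with

More interesting than the raw numbers is how the discovered program actually works. The algorithm tracks how the model's confidence changes over several rounds. Other methods stop the moment a majority among the answers tips over. This algorithm instead opens more solution paths when confidence barely changes and opens no new ones when confidence rises quickly. It also gives extra compute to solution paths whose interim result agrees with the current majority, and it drops only those paths that keep diverging from the majority over multiple rounds.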
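The behavior described above can be approximated as a per-round decision function. This is a hedged sketch, not the discovered program: the article does not define confidence, so the code assumes it is the share of open paths that agree with the majority answer, and the names `PathState` and `plan_round` and every threshold value are hypothetical.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass
class PathState:
    answer: str               # latest interim answer of this path
    disagree_streak: int = 0  # consecutive rounds spent off the majority


def plan_round(
    paths: dict[int, PathState],
    prev_confidence: float,
    stall_delta: float = 0.05,
    drop_patience: int = 2,
    new_paths: int = 2,
) -> tuple[float, int, list[int], list[int]]:
    """Return (confidence, paths to open, ids to extend, ids to drop).

    Updates each path's disagree_streak in place.
    """
    if not paths:
        raise ValueError("need at least one open path")
    majority, votes = Counter(p.answer for p in paths.values()).most_common(1)[0]
    # Assumption: confidence is the share of paths agreeing with the majority.
    confidence = votes / len(paths)
    change = confidence - prev_confidence

    # Open new paths only when confidence has stalled. Assumption: any larger
    # change, including a fast climb, opens none.
    n_new = new_paths if abs(change) < stall_delta else 0

    extend_ids: list[int] = []
    drop_ids: list[int] = []
    for pid, state in paths.items():
        if state.answer == majority:
            state.disagree_streak = 0
            extend_ids.append(pid)  # agreeing paths get extra compute
        else:
            state.disagree_streak += 1
            if state.disagree_streak >= drop_patience:
                drop_ids.append(pid)  # dropped only after repeated disagreement
    return confidence, n_new, extend_ids, drop_ids


paths = {0: PathState("14"), 1: PathState("14"), 2: PathState("12"), 3: PathState("14")}
print(plan_round(paths, prev_confidence=0.74))  # (0.75, 2, [0, 1, 3], [])
```

Calling `plan_round` a second time with the same answers drops path 2, because its disagreement streak then reaches the default `drop_patience` of 2. A single round of disagreement is tolerated, which mirrors the article's point that paths are cut only if they keep heading away from the majority.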

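The authors say this kind of coordination would have been nearly impossible to design by hand. An ablation study shows how much depends on two design choices. Without the single high-level controller, the agent falls back on extreme shortcuts that save a great deal of compute in testing but cause accuracy to collapse on new tasks. Without detailed logs, the discovered algorithm uses more compute at lower accuracy, which suggests that a bare final result is not enough to work out what went wrong.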

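From writing algorithms to building search spaces

The authors place AutoTTS in line with earlier work such as FunSearch, AlphaEvolve, and ADAS, all of which use large language models as program searchers. What is new in AutoTTS is applying that idea to test-time scaling, which was previously done mostly by hand.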


Original text (English)

Researchers let Claude Code discover AI scaling algorithms that humans probably wouldn't have designed

Jonathan Kemper, May 24, 2026

Instead of writing rules for more efficient AI reasoning themselves, researchers let a coding agent hunt for better control algorithms in a simulated environment. The result beats established methods while burning far less compute.

Test-time scaling (TTS) is meant to make large language models perform better by letting them spend more compute on a response, say, by running several solution paths in parallel or extending chains of thought. Until now, human-written rules almost always dictated when a model kicks off a new solution path, doubles down on a promising one, or kills it.

A research team from UMD, UVA, WUSTL, UNC, Google, and Meta flips that with AutoTTS. Humans don't write the algorithm. Instead, they build the playground where an AI agent figures out algorithms on its own.

The paper argues that many known methods are really just special cases in a shared control space defined by width (how many solution paths run at once) and depth (how far each one goes). So why, the authors ask, do researchers keep plotting paths through this space by hand instead of letting a machine search it?

Simulating the search keeps costs down

At the core of AutoTTS sits an offline environment. For each task, the team pre-generates several solution paths from the language model and stores them. A new control algorithm decides how to spend compute based on data that's already there. That way, thousands of variants can run without firing up the actual language model each time.

Claude Code does the searching. Over several rounds, the agent reviews what came before, spots weaknesses in earlier proposals, and writes a new control algorithm directly in code.
To stop the search from getting lost in thousands of tiny knobs, each proposal can only expose one high-level controller to the outside. That controller sets all the other thresholds on its own. Full logs from each run also show the agent where earlier attempts blew compute for nothing.

Agent-written algorithm outperforms human-designed ones

On math benchmarks like AIME and HMMT, the algorithm the agent came up with gets better accuracy per unit of compute than established methods. The lean setting slashes token usage by about 70 percent compared to standard self-consistency, which just generates 64 answers in parallel and picks the winner by majority vote. Accuracy holds steady.

The algorithm also carries over to a different model (DeepSeek-R1-Distill-Llama-8B) and a non-math benchmark (GPQA-Diamond). The whole discovery run cost about $40 and took 160 minutes.

A logic humans probably wouldn't have come up with

More interesting than the raw numbers is how the discovered program actually works. It tracks how the model's confidence shifts over several rounds. Other methods bail out the moment a majority among answers tips over. If confidence barely budges, the algorithm opens more solution paths. If it climbs quickly, it skips new ones. Solution paths whose interim result lines up with the current majority get extra compute. The algorithm only drops paths that diverge if they keep heading the wrong way over multiple rounds.

The authors call this kind of coordination something that would've been nearly impossible to design by hand. An ablation study shows how much depends on two design choices: drop the single high-level controller, and the agent falls back on extreme shortcuts that save tons of compute in testing but tank accuracy on new tasks. Without detailed logs, the discovered algorithm eats more compute at worse accuracy, so a bare final result just isn't enough to figure out what went wrong.

From writing algorithms to building search spaces

The authors put AutoTTS in a line with work like FunSearch, AlphaEvolve, and ADAS, all of which use language models as program searchers. What's new here is applying that idea to test-time scaling, which was mostly done by hand before.

The current version only covers the trade-off between width and depth. It can't handle more complex structures like tree searches. How good the discovery turns out also depends on the coding agent. The authors don't say whether open-source alternatives would work just as well.

The bigger takeaway is that the work shifts where humans come in: instead of inventing the rules themselves, researchers set up the search environment those rules live in. The actual strategy then emerges as code that a language model writes and refines.

As early as 2024, researchers from Hugging Face showed that small language models can match much larger ones through smart test-time compute scaling, though with search strategies designed by hand. Meta and partners recently introduced hyperagents, AI systems that optimize their own improvement process.

AI agents, Algorithm optimization, Reasoning models, Automated research, Claude