The Decoder • 61 days ago

Code is at the center of how AI agents think and act

Key takeaways

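According to a new review paper from researchers at Meta, Stanford, and UIUC, code is more than an output that AI agents generate: it is the foundation on which they reason, act, and collaborate. A software layer wrapped around the model, called the "harness," turns a stateless language model into an agent that can sustain work over time, and the approach is already used in commercial systems. Because current software tests can easily hide risks, the authors call for more transparent evaluation mechanisms.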

Translated article

New review paper: "Code is at the center of how AI agents think and act"

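A new review paper from researchers at the University of Illinois Urbana-Champaign (UIUC), Meta, and Stanford wants to change how we think about AI agents. Their argument is that code is the foundation agents use to reason, act, and work together. The real bottleneck for autonomous systems, they say, is therefore the software layer wrapped around the model, which will probably make Gary Marcus very happy.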

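The authors call this layer the "harness," and it covers everything from tools and interfaces to sandboxed execution environments, memory, testing, permission boundaries, execution loops, and feedback channels. Without it, a language model is just a stateless model. With it, the model becomes a working agent that can carry out tasks over long stretches.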

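Why code is the right format

The authors see code as a running part of agent behavior and give several reasons. Code is executable, so model outputs become operations that can actually be checked. It is traceable, because intermediate calculations show up as structured traces the system can read and store. And it persists across steps, because the running program records task progress in a form the agent can pick up again later.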

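The paper splits long-running agent systems into three parts. First come the model's own capabilities, such as reasoning and planning. Next is the infrastructure the system provides. Last is the code the agent writes on the fly, everything from test scripts and throwaway helper tools to reusable skills and executable workflows. The authors say these self-generated artifacts have received far too little research attention.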

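Three layers organize the field

At the first level, code bridges the model and its environment. Methods like Program-of-Thoughts and Chain of Code offload the actual computation to executable programs instead of describing it in words. Other systems, like Code as Policies, turn natural language instructions directly into robot control code.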
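The idea of offloading computation to an executable program can be illustrated with a minimal sketch. It is not taken from the paper or from the Program-of-Thoughts implementation: `GENERATED_PROGRAM` is a hand-written stand-in for text a model might emit, and `run_program` is a hypothetical helper name.

```python
# Hypothetical stand-in for code a language model might emit when asked
# "How many seconds are there in 3 days and 4 hours?"
GENERATED_PROGRAM = """
days = 3
hours = 4
answer = days * 86400 + hours * 3600
"""


def run_program(source: str) -> object:
    """Execute model-written code and return its `answer` variable.

    Illustrative only: an empty builtins table is NOT a real sandbox.
    A production harness would isolate execution at the OS or container level.
    """
    namespace = {"__builtins__": {}}
    exec(source, namespace)
    return namespace["answer"]


if __name__ == "__main__":
    # The arithmetic is done by the interpreter, not described in prose.
    print(run_program(GENERATED_PROGRAM))
```

The point of the pattern is that the number comes from running the program, so the harness can check or log it instead of trusting a figure the model wrote out in words.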

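The second level covers what keeps an agent reliable across many steps: planning, memory, tool use, and a recurring cycle of plan, execute, and verify. The cycle replaces one-off troubleshooting with systematic checks. Plans spell out what the agent intends to change. Execution runs in sandboxed environments with defined permissions. A verification step then decides whether the result is accepted, revised, or handed to a human reviewer.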
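The plan, execute, verify cycle described above might look roughly like the following sketch. All names here (`Plan`, `execute`, `verify`, `run_cycle`) are hypothetical, the "sandbox" is only a copy of a dictionary, and the list of plans stands in for a model revising its proposal after a failed check.

```python
from dataclasses import dataclass
from typing import Callable

State = dict[str, int]


@dataclass
class Plan:
    description: str                 # what the agent intends to change
    apply: Callable[[State], None]   # the change itself


def execute(plan: Plan, state: State) -> State:
    """Run the plan on a copy so a rejected result never touches real state."""
    sandbox = dict(state)
    plan.apply(sandbox)
    return sandbox


def verify(candidate: State) -> bool:
    """Hypothetical check: the counter must never go negative."""
    return candidate.get("counter", 0) >= 0


def run_cycle(state: State, plans: list[Plan], max_attempts: int = 2) -> tuple[State, str]:
    """Accept the first plan that verifies; revise by moving to the next plan;
    hand over to a human once the attempts are used up."""
    for plan in plans[:max_attempts]:
        candidate = execute(plan, state)
        if verify(candidate):
            return candidate, f"accepted: {plan.description}"
    return state, "escalated to human reviewer"


def subtract_five(state: State) -> None:
    state["counter"] -= 5


def add_one(state: State) -> None:
    state["counter"] += 1


if __name__ == "__main__":
    start = {"counter": 1}
    plans = [Plan("subtract 5", subtract_five), Plan("add 1", add_one)]
    print(run_cycle(start, plans))
```

In this sketch the first plan fails verification and is discarded, the second is accepted, and the starting state is left untouched until a candidate passes.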

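The third level is about multiple agents working together. Code collections, tests, and execution logs become a shared workspace where specialized roles such as managers, planners, coders, reviewers, and testers split the work. Systems like ChatDev and MetaGPT put this into practice, and according to the researchers it is already shipping in real products. Claude Code can now delegate pull request reviews to a whole team of AI agents that scan for bugs, security flaws, and regressions in parallel, without being able to approve changes themselves.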

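Production systems already follow this pattern

Commercial systems like Claude Code and OpenAI's Codex already operate on this principle, but the authors caution against misplaced trust. Current software tests are often incomplete and can obscure risks, so more transparent evaluation mechanisms are essential.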
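One form such transparency could take, following the authors' suggestion reported in the original article that every accepted action should document which tests ran, which areas stayed untested, and which risks remain, is a small record attached to each accepted change. The sketch below is an assumption about how that might be structured; `AcceptanceRecord`, its field names, and the example values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AcceptanceRecord:
    action: str
    tests_run: list[str] = field(default_factory=list)
    untested_areas: list[str] = field(default_factory=list)
    remaining_risks: list[str] = field(default_factory=list)

    def is_fully_covered(self) -> bool:
        """Passing tests count as full coverage only if nothing is left open."""
        return bool(self.tests_run) and not self.untested_areas and not self.remaining_risks

    def summary(self) -> str:
        return (
            f"{self.action}: tests run={len(self.tests_run)}, "
            f"untested areas={len(self.untested_areas)}, "
            f"remaining risks={len(self.remaining_risks)}"
        )


if __name__ == "__main__":
    # Hypothetical example values, not taken from the paper.
    record = AcceptanceRecord(
        action="update login handler",
        tests_run=["test_login_ok", "test_login_bad_password"],
        untested_areas=["rate limiting"],
        remaining_risks=["session expiry not exercised"],
    )
    print(record.summary())
    print(record.is_fully_covered())
```

Keeping untested areas and remaining risks as explicit fields means a passing test run cannot silently stand in for a claim that the change is safe.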

Original article (English)

New review paper argues code is how AI agents think and act, not just what they produce

Jonathan Kemper
May 29, 2026

Key Points

A review by Meta, Stanford, and the University of Illinois Urbana-Champaign finds that code increasingly serves as the foundation on which AI agents reason, act, and coordinate with each other.

Central to this shift is a surrounding software layer called the "harness," which provides tools and isolated environments that transform stateless models into functional systems capable of planning, executing, and testing in a continuous loop.

Commercial systems like Claude Code and OpenAI's Codex already operate on this principle, but the authors caution against misplaced trust: current software tests are often incomplete and can obscure risks, making more transparent evaluation mechanisms essential.

A new review paper from researchers at the University of Illinois Urbana-Champaign, Meta, and Stanford wants to change how we think about AI agents. Their argument is that code is the foundation agents use to reason, act, and work together. So the real bottleneck for autonomous systems, they say, becomes the software layer wrapped around the model, which probably makes Gary Marcus very happy.

The authors call this layer the "harness," and it covers everything from tools and interfaces to sandboxed execution environments, memory, testing, permission boundaries, execution loops, and feedback channels. Without it, a language model is just stateless. With it, the model becomes a working agent that can grind through tasks over long stretches.

Why code is the right format

The authors see code as a running part of agent behavior, and they lay out several reasons why. Code is executable, so model outputs become operations you can actually check.
It's traceable because intermediate calculations show up as structured traces the system can read and store. And it persists across steps because the running program logs task progress in a form the agent can pick back up later.

The paper splits long-running agent systems into three parts. There's the model's own capabilities, like reasoning and planning. Then there's the infrastructure the system provides. And finally, the code the agent writes on the fly, everything from test scripts and throwaway helper tools to reusable skills and executable workflows. The authors say these self-generated artifacts haven't gotten nearly enough research attention.

Three layers organize the field

At the first level, code bridges the model and its environment. Methods like Program-of-Thoughts or Chain of Code offload actual computation to executable programs instead of just describing it in words. Other systems, like Code as Policies, turn natural language instructions straight into robot control code.

The second level covers what keeps an agent reliable across many steps. That means planning, memory, tool use, and a recurring cycle of plan, execute, and verify. The cycle replaces one-off troubleshooting with systematic checks. Plans spell out what the agent intends to change. Execution runs in sandboxed environments with defined permissions. A verification step then decides whether the result gets accepted, revised, or kicked to a human reviewer.

The third level is about multiple agents working together. Code collections, tests, and execution logs become a shared workspace where specialized roles like managers, planners, coders, reviewers, and testers split the work. Systems like ChatDev and MetaGPT put this into practice, and according to the researchers it's already shipping in real products.
Claude Code can now farm out pull request reviews to a whole team of AI agents that scan for bugs, security flaws, and regressions in parallel without being able to approve changes themselves.

Production systems already follow this pattern

The authors point to commercial products as examples. Anthropic's Claude Code ties together the local terminal, dev environment, and browser into one workflow where the agent edits files, runs commands, and has to follow permission rules. OpenAI's Codex and GitHub Copilot's coding agents move similar workflows to managed cloud environments, bundling changes through traceable pull request outputs.

How much this layer matters became obvious by accident when Anthropic leaked roughly 500,000 lines of Claude Code's source code. Buried in there was a "dreaming" function for task consolidation and other tricks for steering models as coding agents. Anthropic later got more than 8,000 copies and forks yanked from GitHub through a copyright takedown.

Other AI labs are catching on. Deepseek plans to go head-to-head with Claude Code and Codex through its own product, Deepseek Code, and is building a dedicated "Harness" team in Beijing to handle everything beyond the model, from tool use to planning to storage. The team's core formula is that model plus harness equals AI agent.

These production systems are also turning into training data for the next round of models. Cursor's composer trains with continuous reinforcement learning on real usage traces. OpenAI's Codex-1, GPT-5-Codex, and GPT-5.1-Codex-Max are trained specifically on long, multi-step coding sessions that match the Codex workflow. The line between agent and environment is itself becoming a layer that learns.

When the agent starts tweaking its own environment

Several research systems treat the harness itself as something to optimize.
AutoHarness auto-generates code that filters out unauthorized actions, while Meta-Harness systematically hunts for better harness variants by using previous versions, their evaluations, and execution logs as a search space. Other approaches dig through telemetry data to revise individual components. Meta's hyperagents go further still, combining task resolution and self-modification in an editable program that optimizes the improvement loop itself.

But the authors flag several open problems holding the field back: more meaningful evaluations beyond raw success rates, checking the substance of results when tests alone don't cut it, harness self-improvement without regressions, shared state across multiple agents, human oversight, and extending to environments with image or sensor data like GUI agents and robots.

They're especially blunt about whether current test criteria are even good enough. Tests can be incomplete, and test programs for graphical interfaces can miss bad intermediate steps. Simulators paper over physical risks. A harness could breed false confidence precisely because it gives visible feedback, and the green checkmark doesn't mean the code is safe. The authors suggest every accepted action should come with docs that spell out which tests actually ran, which areas stayed untested, and which risks remain.

Reliability in autonomous coding agents doesn't come from better repair prompts but from tightly regulated state transitions in a controlled loop around the model, the researchers argue.

Source: Arxiv

AI agents, code generation, software harness, multi-agent, review paper