Hacker News • 62 days ago

Why AI Agents Cannot Fundamentally Transform Software Systems

IMP

8/10

Key Summary

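Current LLMs excel at local tasks such as writing new code, but they lack the causal reasoning needed to grasp the structure and dependencies of a complex software system and modify it safely. As a result, having an agent ship software autonomously by producing flawless pull requests (PRs) is close to impossible for now.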

Translated Body

This article explains why current LLMs (large language models) cannot safely modify real software systems, despite recent impressive code-generation demos.

The Promise of Automated Software Delivery

In 2026, the automated software delivery dream is for an agent to:

read a repository
understand project structure
plan a multi-step change
write code, tests, and docs
run the code and fix its own mistakes
produce a PR-ready diff

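Of these, the first three tasks are additive and the last three are transformative. The first three add information without changing the behaviour of the system: they require reading, mapping, and planning, but not altering any existing causal structure in the codebase.

Applying new code is self-contained, additive work. Modifying an existing system is transformative work that requires an understanding of dependencies, invariants, and consequences. This distinction between additive and transformative work is the core reason current LLMs can assist but cannot autonomously deliver software.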
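The additive versus transformative distinction can be illustrated with a minimal sketch. It assumes a hypothetical helper, `normalize_ids`, whose callers quietly rely on its output being sorted. None of these names come from the article; they exist only to show how a locally plausible edit can break an invariant elsewhere.

```python
# Existing system (hypothetical): callers rely on an unstated invariant
# that normalize_ids returns its IDs in ascending order.
def normalize_ids(ids):
    return sorted(set(ids))

def first_id(ids):
    # Depends on the sorted-order invariant of normalize_ids.
    return normalize_ids(ids)[0]

# Additive change: a new, self-contained function. Nothing existing changes.
def count_unique(ids):
    return len(set(ids))

# Transformative change: a plausible rewrite that de-duplicates while
# keeping insertion order instead of sorting. It looks fine in isolation.
def normalize_ids_v2(ids):
    return list(dict.fromkeys(ids))

def first_id_after_change(ids):
    # Same caller logic, now running against the modified helper.
    return normalize_ids_v2(ids)[0]

ids = [30, 10, 20, 10]
additive_result = count_unique(ids)   # new information, no side effects
before = first_id(ids)                # smallest ID under the old invariant
after = first_id_after_change(ids)    # first-seen ID once the invariant is gone
```

Here `count_unique` can be reviewed on its own, while judging `normalize_ids_v2` requires knowing every caller that depended on the old ordering.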

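Parts of the above can be done, but only in tightly controlled demos on simple code that is tens of lines long. They cannot be done on real-world repositories with thousands of lines of code that dozens of people have updated over many years.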

What the Labs Have Actually Delivered

The agentic work of OpenAI, Google, Cognition Labs, GitHub (Microsoft), Sourcegraph, JetBrains, Replit, Amazon, Meta, and Anthropic, listed in Further Reading, was published in 2023 and 2024. Depending on where you look, you may have been given another impression: that "agents are here". However, reality tells a different story.

Agents are improving, but they are not reliable, not autonomous, and not production-safe. LLMs can assist with software delivery, but they cannot own it. Why is this?

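LLMs generate statistically plausible continuations of text. This works well for self-contained tasks like writing a function or drafting documentation, because these are pattern-extension problems. But pattern-matching is not system understanding, and plausibility is not correctness.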
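As a rough illustration of "plausible is not correct", here is a toy bigram predictor. It is a drastic simplification and not how a real LLM works; the corpus and the function name `most_plausible_next` are invented for this sketch. It shows only that picking the most frequent continuation ignores what the surrounding system needs.

```python
from collections import Counter, defaultdict

# Hypothetical training text: "+=" is followed by "price" twice and "tax" once.
corpus = "total += price total += price total += tax".split()

# Count which token follows which (a bigram table).
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def most_plausible_next(token):
    """Return the continuation seen most often after `token`."""
    return follows[token].most_common(1)[0][0]

# The predictor suggests the frequent continuation, whether or not the
# code being edited actually needed the rarer one.
suggestion = most_plausible_next("+=")
```

If the line being completed should have added `tax`, the suggestion is still `price`, because frequency in the training text is the only signal this toy model has.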

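Software systems are causal: components depend on each other, invariants constrain behaviour, and changes propagate through the system. The moment a task stops being self-contained and becomes system-dependent (requiring dependency coherence, persistent state, or awareness of how changes ripple through a real codebase), pattern-matching is no longer sufficient.

Currently, LLMs can imitate the shape of engineering work, but they cannot maintain a stable internal representation of a system that must be coherently changed. That gap is exactly why LLMs fail the moment the task becomes system-level.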

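Persistent state creates temporal dependencies

A self-contained task has no past and no future. A system-dependent task does. As soon as a change depends on any of the following:

previous writes
accumulated data
cached values
long-lived objects
external system state

any agentic model must reason about how the system got here and how it will behave after the change. LLMs cannot maintain that internal causal chain.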
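The dependency on previous writes can be sketched as follows. The record layout, the cents-versus-dollars units, and all function names are assumptions made up for illustration. The point is that a change which is consistent with itself can still be wrong once it meets data written before the change.

```python
import json

# Hypothetical persisted store: records written by an older code version,
# where "amount" is an integer number of cents.
stored = json.dumps([{"order": 1, "amount": 1999}, {"order": 2, "amount": 500}])

def total_v1(raw):
    """Original reader: amounts are cents, result is dollars."""
    return sum(r["amount"] for r in json.loads(raw)) / 100

def write_v2(raw, order, dollars):
    """Plausible change: new writes store dollars directly."""
    records = json.loads(raw)
    records.append({"order": order, "amount": dollars})
    return json.dumps(records)

def total_v2(raw):
    """New reader: assumes every record is already in dollars."""
    return sum(r["amount"] for r in json.loads(raw))

total_before = total_v1(stored)              # old data, old reader
stored_after = write_v2(stored, 3, 7.25)     # new write lands next to old ones
total_after = total_v2(stored_after)         # old cents are now read as dollars
```

`write_v2` and `total_v2` agree with each other, so a review of the diff alone would pass. The error only appears when the new reader meets records that predate the change, which is the history a system-level change has to account for.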

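Writing code to Agentic Systems: The Fundamental Gap

The gap becomes clear when you compare two activities: writing new code and modifying an existing system.

Code generation is local and additive: the model extends a pattern without needing to understand the system. But agentic work is global and transformative: the LLM must change the system itself, which requires understanding dependencies, invariants, interactions, and downstream consequences. This is causal reasoning, not pattern extension.

LLMs predict tokens, not consequences. That is why the leap from writing code to producing a safe, system-aware, PR-ready diff is not incremental but a shift into a fundamentally different problem space.

Producing a PR-ready diff (the section in question)

A pull request (PR) is a piece of code that will change a system. For that change to be safe, it must respect the system's current architecture, its intent, and all downstream consequences.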

View original (English)

This article explains why current LLMs cannot safely modify real software systems, despite impressive code-generation demos.

The Promise of Automated Software Delivery

In 2026, the automated software delivery dream is for an agent to:

read a repository
understand project structure
plan a multi-step change
write code, tests, and docs
run the code and fix its own mistakes
produce a PR-ready diff

The first three tasks are additive; the last three are transformative. The first three add information without changing the behaviour of the system: they require reading, mapping, and planning, but not altering any existing causal structure in the codebase.

Applying new code is self-contained, additive work; modifying an existing system is transformative work that requires an understanding of dependencies, invariants, and consequences. This distinction, additive vs transformative, is the core reason current LLMs can assist but cannot autonomously deliver software.

Parts of the above can be done, but only for tightly controlled demos on simple code that is tens of lines long, not on real-world repositories with thousands of lines of code that has existed for years and that dozens of people have updated.

What the Labs Have Actually Delivered

The agentic work of OpenAI, Google, Cognition Labs, GitHub (Microsoft), Sourcegraph, JetBrains, Replit, Amazon, Meta, and Anthropic that is listed in Further Reading was published in 2023 and 2024. Depending on where you look, you may have been given another impression: that "agents are here". However, reality tells a different story.

Agents are improving, but are not reliable, not autonomous, and not production-safe. LLMs can assist with software delivery, but they cannot own it.

Why is this?

LLMs generate statistically plausible continuations of text. This works well for self-contained tasks like writing a function or drafting documentation because these are pattern-extension problems.
But pattern-matching is not system understanding, and plausibility is not correctness.

Software systems are causal: components depend on each other, invariants constrain behaviour, and changes propagate through the system. The moment a task stops being self-contained and becomes system-dependent (requiring dependency coherence, persistent state, or awareness of how changes ripple through a real codebase), pattern-matching is no longer sufficient.

Currently, LLMs can imitate the shape of engineering work, but they cannot maintain a stable internal representation of a system that must be coherently changed, and that gap is exactly why LLMs fail the moment the task becomes system-level.

Persistent state creates temporal dependencies

A self-contained task has no past and no future. A system-dependent task does. As soon as a change depends on:

previous writes
accumulated data
cached values
long-lived objects
external system state

any agentic model must reason about how the system got here and how it will behave after the change. LLMs cannot maintain that internal causal chain.

Writing code to Agentic Systems: The Fundamental Gap

The gap becomes clear when you compare two activities: writing new code and modifying an existing system.

Code generation is local and additive: the model extends a pattern without needing to understand the system. But agentic work is global and transformative: the LLM must change the system itself, which requires understanding dependencies, invariants, interactions, and downstream consequences. This is causal reasoning, not pattern extension.

LLMs predict tokens, not consequences, and that is why the leap from writing code to producing a safe, system-aware PR-ready diff is not incremental but a shift into a fundamentally different problem space.

Producing a PR-ready diff (the section in question)

A pull request (PR) is a piece of code that will change a system.
For that change to be safe, the change must respect the system's current architecture, its intent, and all downstream consequences. Software engineers work hard to ensure that such a change is safe through testing and their own judgement and experience before having a colleague review the change.

Applying a change is no longer pattern-matching but understanding causal behaviour: how will the system change if this PR is applied? The correctness of the PR depends on understanding the whole system, not just generating text. The LLM must change the system, which requires understanding dependencies, invariants, interactions and consequences, all of which demand causal reasoning, not pattern matching.

Pattern-matching can write code; only causal reasoning can maintain systems.

What can I do?

Confirm for yourself any claim that you see. Choose a realistic real-world repository of your own to work on, one that has thousands of lines of code and has supported real-world work patterns in the past. Your own results, obtained on your own repository, will tell you volumes more than any press release or online anecdote.

For the moment:

treat agentic AI as a strategic direction
treat current tools as assistants, not engineers
invest in clarity, architecture, and test discipline
expect progress, but not miracles
do not plan delivery pipelines around unproven capabilities

Maintain human judgement as the centre of the system. The dream is intact. The evidence is not yet here.

Why this matters: code is cheap, judgement is not

LLM-augmented software delivery does not remove engineering. It moves engineering up a level. Humans need to focus on:

intent
constraints
architecture
correctness
safety
trade-offs

The desired end state is not "AI writes code" but "AI maintains systems". If we get there, humans will still need to maintain intent.
The consequence of an agentic system is not to remove engineering, but to elevate it, so that teams spend less time on mechanical construction and more time on judgement, alignment, and shaping the environment in which agents operate. The organisations that benefit most will be those that treat agentic development not as automation, but as a structural shift in how software is conceived, validated, and maintained.

Final Thought

Until AI can reason causally about systems, human judgement remains the foundation of software delivery.

Related Work

The real gains from AI come from improving the shared work between engineers (planning, coordination, review, debugging, and delivery), not from speeding up individual coding.

Software engineers must understand tokens, structure, and probabilistic behaviour to build reliable systems and avoid mismatches between test and production behaviour.

AI systems behave like probabilistic components; engineers must build structured interfaces and layered constraints to make them reliable inside software systems.

Further Reading

OpenAI o1/o3, OpenAI, September 2024 - https://openai.com/index/introducing-openai-o1-preview/
Gemini Code Demos, Google, December 2023 - https://blog.google/technology/ai/google-gemini-ai/
Devin, Cognition Labs, March 2024 - https://www.cognition-labs.com/
GitHub Copilot, GitHub (Microsoft), November 2023 - https://github.blog/2023-11-08-the-new-github-copilot-your-ai-pair-programmer/
Cody, Sourcegraph, April 2024 - https://sourcegraph.com/blo

LLM, Software Development, Agents, Code Generation, Causal Reasoning