Hacker News • 34 days ago

'EvanFlow (TDD)': Preventing Agentic Coding Failures

IMP

8/10

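Key Summary

EvanFlow, a TDD-driven feedback loop tool for software development built on Claude Code, has been introduced. It is designed to prevent the critical failures common in AI coding, such as hallucination, context loss, and scope creep. The developer holds decision authority at every stage and over every Git operation, so the core aim is agentic coding that stays safe and controllable.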

Translated Text

EvanFlow is a TDD (test-driven development) driven iterative feedback loop for Claude Code. 16 cohesive skills and 2 custom subagents walk an idea from brainstorm through implementation, with checkpoints throughout where the developer stays in control. There is a single entry point: say "Let's evanflow this" and the orchestrator runs the loop, which proceeds brainstorm → plan → execute (sequential or parallel) → TDD → iterate → STOP.

The loop acts as a conductor, not an autopilot. Real checkpoints exist at design approval, plan approval, and after iteration. The agent stops just before every Git operation and waits for the user's direction. There are no auto-commits, no forced ceremony, and no "must invoke a skill" tax.

[Quick Install] The recommended path is Claude Code's plugin marketplace:

    /plugin marketplace add evanklem/evanflow
    /plugin install evanflow@evanflow

Restart, then try: "Let's evanflow this: I want to add a small feature that does X." evanflow-go fires and walks the loop. The git-guardrails hook activates automatically with the plugin (no settings.json edit needed). Skills appear under the evanflow: namespace (e.g., /evanflow:evanflow-go).
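The original README describes the bundled guardrail as a PreToolUse hook that blocks destructive git operations (git push, git reset --hard, git clean -f, git branch -D, git checkout ., git restore .). The following is a minimal Python sketch of that kind of check, not the bundled script itself, which is a Bash script that uses jq. The names BLOCKED and is_blocked are hypothetical, and the matching assumes simple whitespace tokenization.

```python
# Destructive git invocations listed in the README's hook description.
# The matching logic is an illustrative assumption, not the real hook.
BLOCKED = [
    ("push",),
    ("reset", "--hard"),
    ("clean", "-f"),
    ("branch", "-D"),
    ("checkout", "."),
    ("restore", "."),
]


def is_blocked(command: str) -> bool:
    """Return True if `command` looks like one of the blocked git operations."""
    tokens = command.split()
    if len(tokens) < 2 or tokens[0] != "git":
        return False
    subcommand, args = tokens[1], tokens[2:]
    for pattern in BLOCKED:
        if subcommand != pattern[0]:
            continue
        # A one-element pattern blocks the subcommand outright;
        # otherwise the required argument must appear as its own token.
        if len(pattern) == 1 or pattern[1] in args:
            return True
    return False


for cmd in ("git status", "git reset --soft HEAD~1", "git reset --hard HEAD~1"):
    print(cmd, "->", "blocked" if is_blocked(cmd) else "allowed")
```

This sketch does not handle combined flags such as -fd or git options placed before the subcommand; a real guardrail would need stricter parsing.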

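[What Makes It a Feedback Loop?] The loop is built around discipline that compounds across iterations, not single-shot generation. Every step has a checkpoint that gates the next:

Brainstorm: clarifies intent and proposes 2-3 approaches with an embedded grill (stress-test) → the user approves the design
Plan: maps file structure first (deep modules, deletion test) → the user approves the plan
Execute: runs task by task with inline verification → blockers stop the loop and are surfaced to the user
TDD: vertical-slice only. One failing test → minimal implementation → repeat. Tests verify behavior through public interfaces, so they survive refactors
Iterate: re-reads the diff with fresh eyes, runs quality checks, screenshots UI changes, and checks against the Five Failure Modes checklist (hallucinated actions, scope creep, cascading errors, context loss, tool misuse). Hard cap of 5 iterations
STOP: report and await the user's direction. The agent never auto-commits, never auto-stages, and never proposes a PR.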
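The gated sequence above can be sketched as a small driver function. This is only an illustration of the control flow the README describes (approval gates after brainstorm and plan, a hard cap of 5 iterations, then STOP). The callbacks approve and review_is_clean are hypothetical stand-ins for the user's decision and the self-review result, and execution blockers are left out for brevity.

```python
from typing import Callable

MAX_ITERATIONS = 5  # hard cap stated in the README


def run_loop(
    approve: Callable[[str], bool],
    review_is_clean: Callable[[int], bool],
) -> list[str]:
    """Walk the stages in order, halting at any checkpoint the user rejects."""
    log: list[str] = []
    for stage in ("brainstorm", "plan"):
        log.append(stage)
        if not approve(stage):  # design approval / plan approval gate
            log.append("STOP")
            return log
    log.append("execute")
    log.append("tdd")
    for i in range(1, MAX_ITERATIONS + 1):
        log.append(f"iterate-{i}")
        if review_is_clean(i):
            break
    # Report and wait for direction; no git operation happens in this sketch.
    log.append("STOP")
    return log


print(run_loop(lambda stage: True, lambda i: i == 2))
```

Even when the review never comes back clean, the loop ends after the fifth iteration and hands control back to the user.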

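For plans with 3 or more truly independent units, the loop forks into a parallel coder/overseer orchestration: one coder per unit (using vertical-slice TDD with a RED checkpoint), one overseer per coder (a read-only review subagent that cannot modify code), plus an integration overseer that runs named integration tests at every touchpoint. The integration tests serve as the executable contract: interfaces cannot drift when both sides must satisfy the same passing test.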
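To make the "executable contract" idea concrete, here is a small sketch with two units that would be written by separate coders and one named integration test at their touchpoint. The functions serialize_user and parse_user and the JSON wire format are invented for illustration; none of them come from EvanFlow.

```python
import json


# Unit A (hypothetical): produces the wire format.
def serialize_user(user_id: int, name: str) -> str:
    return json.dumps({"id": user_id, "name": name})


# Unit B (hypothetical): consumes the wire format.
def parse_user(payload: str) -> tuple[int, str]:
    data = json.loads(payload)
    return data["id"], data["name"]


# Named integration test at the touchpoint between A and B.
# If either side renames a field, this test fails for both coders.
def test_user_roundtrip() -> None:
    assert parse_user(serialize_user(7, "Ada")) == (7, "Ada")


test_user_roundtrip()
print("integration contract holds")
```

Because the test exercises both units together, neither coder can change the shared format alone without turning the contract red.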

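[Hard Rules Baked Into the Loop] Several rules come from 2025-2026 industry research on agentic coding failure modes and are baked into every skill:

Never invent values: file paths, environment variables, IDs, function names, library APIs. If unsure, the agent stops and asks. (Action-hallucination is the most dangerous agent failure.)
Assertion-correctness warning: research shows 62% of LLM-generated test assertions are wrong. Both evanflow-tdd and the overseer review explicitly check whether a one-character bug in the implementation would still let the assertion pass.
Watch for context drift: evanflow-compact triggers when symptoms appear (re-asking established questions, contradicting earlier decisions). According to industry data, about 65% of enterprise AI coding failures trace to context drift, not raw token exhaustion.
Five Failure Modes pass in iterate and overseer review: an explicit check against hallucinated actions, scope creep, cascading errors, context loss, and tool misuse.
No skill tax: ad-hoc questions do not require a skill invocation. Skills are tools, not a tollbooth.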

[Skill Set: Default Loop (5 skills)]

evanflow-brainstorming: Clarifies intent and proposes 2-3 approaches with an embedded grill (stress-test). Offers a mockup quick-mode for visual-only requests.
evanflow-writing-plans: File structure first, bite-sized tasks, and an embedded grill. Step 2.5 offers evanflow-coder-overseer if the plan is parallelizable.
evanflow-executing-plans: Runs task by task with inline verification.

Original Text (English)

EvanFlow

A TDD-driven iterative feedback loop for software development with Claude Code. 16 cohesive skills + 2 custom subagents walk an idea from brainstorm through implementation, with checkpoints throughout where you stay in control. One entry point: say "let's evanflow this" and the orchestrator runs the loop.

brainstorm → plan → execute (sequential or parallel) → tdd → iterate → STOP

The loop is conductor, not autopilot: real checkpoints at design approval, plan approval, and after iteration. The agent stops short of every git operation and waits for your direction. No auto-commits. No forced ceremony. No "must invoke a skill" tax.

[Quick Install] The recommended path is Claude Code's plugin marketplace:

    /plugin marketplace add evanklem/evanflow
    /plugin install evanflow@evanflow

Restart, then try: "Let's evanflow this: I want to add a small feature that does X." evanflow-go fires and walks the loop. The git-guardrails hook auto-activates with the plugin (no settings.json edit needed). Skills appear under the evanflow: namespace (e.g., /evanflow:evanflow-go). See Installation below for two alternative paths.

[What Makes It a Feedback Loop] The loop is built around discipline that compounds across iterations, not single-shot generation. Every step has a checkpoint that gates the next:

Brainstorm: clarifies intent, proposes 2-3 approaches with embedded grill (stress-test) → you approve the design
Plan: maps file structure first (deep modules, deletion test) → you approve the plan
Execute: runs task-by-task with inline verification → blockers stop the loop and surface to you
TDD: vertical-slice only: one failing test → minimal impl → repeat. Tests verify behavior through public interfaces, so they survive refactors
Iterate: re-reads the diff with fresh eyes, runs quality checks, screenshots UI changes, and runs against a Five Failure Modes checklist (hallucinated actions, scope creep, cascading errors, context loss, tool misuse). Hard cap of 5 iterations
STOP: Report. Await your direction. The agent never auto-commits, never auto-stages, never proposes a PR

For plans with 3+ truly independent units, the loop forks into a parallel coder/overseer orchestration: one coder per unit (using vertical-slice TDD with a RED checkpoint), one overseer per coder (read-only review subagent that can't modify code), plus an integration overseer that runs named integration tests at every touchpoint. The integration tests are the executable contract: interfaces can't drift if both sides have to satisfy the same passing test.

[Hard Rules Baked Into the Loop] Several rules come from 2025-2026 industry research on agentic coding failure modes and are baked into every skill:

Never invent values: file paths, env vars, IDs, function names, library APIs. If unsure, the agent stops and asks. (Action-hallucination is the most dangerous agent failure.)
Assertion-correctness warning: research shows 62% of LLM-generated test assertions are wrong. Both evanflow-tdd and the overseer review explicitly check whether a one-character bug in the implementation would still let the assertion pass.
Watch for context drift: evanflow-compact triggers when symptoms appear (re-asking established questions, contradicting earlier decisions). Industry data: ~65% of enterprise AI coding failures trace to context drift, not raw token exhaustion.
Five Failure Modes pass in iterate + overseer review: explicit check against hallucinated actions, scope creep, cascading errors, context loss, tool misuse.
No skill tax: ad-hoc questions don't require a skill invocation. Skills are tools, not a tollbooth.

[The Skill Set: Default Loop (5 skills)]

evanflow-brainstorming: Clarify intent, propose 2-3 approaches with embedded grill (stress-test). Mockup quick-mode for visual-only requests.
evanflow-writing-plans: File structure first, bite-sized tasks, embedded grill. Step 2.5 offers evanflow-coder-overseer if the plan is parallelizable.
evanflow-executing-plans: Task-by-task with inline verification. Step 0 re-offers parallel path. Hands off to iterate, then STOPS.
evanflow-tdd: Vertical-slice TDD. One test → one impl → repeat. Behavior through public interface. Assertion-correctness warning.
evanflow-iterate: Self-review loop after implementation. Re-read diff, fix issues, run quality checks, screenshot UI (via headless Chromium). Five Failure Modes checklist. Hard cap of 5 iterations.

[The Skill Set: Special-Purpose (8 skills)]

evanflow-go: Single entry point. Say "let's evanflow this" and it walks the whole loop.
evanflow-glossary: Extract canonical domain terms into CONTEXT.md. Flag ambiguities and synonyms.
evanflow-improve-architecture: Surface refactor opportunities via the deletion test + deep-modules vocabulary.
evanflow-design-interface: "Design it twice": spawn 3+ parallel sub-agents with radically different constraints, compare on depth/simplicity/efficiency.
evanflow-debug: Root-cause discipline. Hypothesis stated explicitly, embedded grill before fixing, failing test first.
evanflow-review: Both halves of code review (giving + receiving). Don't capitulate to feedback you can't justify.
evanflow-prd: Synthesize a PRD from existing context. For substantial new features.
evanflow-qa: Conversational bug discovery → issue draft. Asks before filing.

[The Skill Set: Cross-Cutting (1 skill)]

evanflow-compact: Long-session context management. Strategies for proactive summarization at clean boundaries. Drift symptoms checklist.

[The Skill Set: Meta (1 skill)]

evanflow: The index. Shared vocabulary + when to invoke each evanflow-* skill.

[Custom Subagents (2)] In agents/, invoked via the Agent tool with the subagent_type: parameter:

evanflow-coder: Tool restrictions: Read, Edit, Write, Glob, Grep, Bash, TodoWrite. Purpose: implementation subagent for evanflow-coder-overseer. Tools + system prompt prevent git ops, out-of-scope edits, value hallucination.
evanflow-overseer: Tool restrictions: Read, Grep, Glob (no Edit/Write/Bash). Purpose: read-only review subagent. Tools physically enforce "report findings, never fix."

[Bundled Hook] hooks/block-dangerous-git.sh is a PreToolUse hook that blocks destructive git ops (git push, git reset --hard, git clean -f, git branch -D, git checkout ., git restore .). Auto-activates with the plugin install path.

[Hard Rules (apply to every skill)]

Never auto-commit, never auto-stage, never auto-finish. Every git write op requires you to explicitly ask in the current turn.
Never invent values. File paths, env vars, IDs, function names, library APIs: if unsure, the agent stops and asks.
No skill tax. Ad-hoc questions don't require a skill invocation. Skills are tools, not a tollbooth.
No forced spec/plan paths. Files live where you want them.
Verify before claiming done. Quality checks (typecheck, lint, test) run before any "done" report.

[Requirements]

Claude Code (any recent version)
Bash: for the bundled hook script (Linux, macOS, or Windows + WSL)
jq: used by the hook script to parse Claude's JSON tool input. Install via apt install jq, brew install jq, or your platform's package manager. If jq is missing, the guardrail hook fails silently and dangerous git ops are NOT blocked.
Optional but recommended: chromium or google-chrome, for evanflow-iterate's visual verification of UI changes (chromium --headless --screenshot=...). Falls back gracefully if missing: the skill flags it and asks you to verify visually.

[Installation] Three paths, in priority order. All three end with the same skill set in your .claude/skills/. The plugin path additionally auto-wires the guardrail hook.

Path 1: Claude Code Plugin Marketplace (recommended). This is the cleanest install. Skills, agents, AND the guardrail hook all activate automatically.

    /plugin marketplace add evanklem/evanflow
    /plugin install evanflow@evanflow

Restart Claude Code (or /reload-plugins). Skills appear namespaced as /evanflow:evanflow-go, /evanflow:evanflow-tdd, etc. Auto-invocation via "let's evanflow this" still works regardless of namespace. To u

Claude Code, TDD, AI coding agents, EvanFlow, software development