Hacker News • 113 days ago

Claude Code can no longer handle complex engineering tasks since the February update

IMP

8/10

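Key Summary

According to a detailed analysis that a developer filed as a GitHub issue and that was shared on Hacker News, the quality of Anthropic's Claude Code has degraded so severely since the February updates that it can no longer be trusted with complex engineering tasks. The analysis finds that the rollout of thinking redaction (redact-thinking), which hides the model's reasoning, lines up closely in time with the measured quality regression. The team, whose senior engineers all reported similar experiences, has switched from Claude to another AI provider.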

Translated Text

Project: anthropics / claude-code (public repository)

[MODEL] Claude Code is unusable for complex engineering tasks with the Feb updates (#42796)

Labels: area:model, bug (Something isn't working)

Description: Issue opened by stellaraccident on Apr 2, 2026

Preflight Checklist:

I have searched existing issues for similar behavior reports.
This report does NOT contain sensitive information (API keys, passwords, etc.).

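Type of Behavior Issue: Other unexpected behavior

What You Asked Claude to Do: Claude has regressed to the point that it cannot be trusted to perform complex engineering.

What Claude Actually Did:

Ignores instructions
Claims "simplest fixes" that are incorrect
Does the opposite of requested activities
Claims completion against instructions

Expected Behavior: Claude should behave like it did in January.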

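Files Affected: (none listed)
Permission Mode: Accept Edits was ON (auto-accepting changes)
Can You Reproduce This?: Yes, every time with the same prompt
Steps to Reproduce: No response
Claude Model: Opus
Relevant Conversation: (none listed)
Impact: High - Significant unwanted changes
Claude Code Version: Various/all
Platform: Anthropic API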

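Additional Context: We have a very consistent, high-complexity work environment, and we data mined months of logs to understand why, essentially, Claude cannot be trusted to perform complex engineering tasks starting in February. Every senior engineer on my team has reported similar experiences. However, we have one engineer with a repeatable process that we have been using to experiment and data mine. The analysis is from his logs, and all publicly known workarounds have been attempted. We have switched to another provider that is doing superior quality work, but Claude has been good to us, and we are leaving this issue in the hope that Anthropic can fix their product.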

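Extended Thinking Is Load-Bearing for Senior Engineering Workflows

Produced by Claude based on my extensive data. If there are any issues, it's because Anthropic doesn't let Claude think anymore ;) Unfortunately, Claude deleted my January logs containing the bulk of my work, so only summary analysis is available. January was what I expect, February started sliding, and March was a complete and utter loss.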

Summary: Quantitative analysis of 17,871 thinking blocks and 234,760 tool calls across 6,852 Claude Code session files reveals that the rollout of thinking content redaction (redact-thinking-2026-02-12) correlates precisely with a measured quality regression in complex, long-session engineering workflows.

The data suggests that extended thinking tokens are not a "nice to have" but are structurally required for the model to perform multi-step research, convention adherence, and careful code modification. When thinking depth is reduced, the model's tool usage patterns shift measurably from research-first to edit-first behavior, producing the quality issues users have reported. This report provides data to help Anthropic understand which workflows are most affected and why, with the goal of informing decisions about thinking token allocation for power users.

1. Thinking Redaction Timeline Matches Quality Regression

Analysis of thinking blocks in session JSONL files:

Period / Thinking visible / Thinking redacted
Jan 30 - Mar 4 / 100% / 0%
Mar 5 / 98.5% / 1.5%
Mar 7 / 75.3% / 24.7%
Mar 8 / 41.6% / 58.4%
Mar 10-11 / under 1% / over 99%
Mar 12+ / 0% / 100%

The quality regression was independently reported on March 8, the exact date on which redacted thinking blocks crossed 50%. The rollout pattern (1.5% → 25% → 58% → 100% over one week) is consistent with a staged deployment.
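A minimal sketch of how daily shares like those in the table above could be tallied, and how the first day above 50% could be located. The `(day, is_redacted)` pair format, the function names, and the sample data are assumptions made for illustration; they are not the actual session JSONL schema or the author's analysis script.

```python
from collections import defaultdict
from datetime import date


def redaction_share_by_day(blocks):
    """Return {day: fraction of thinking blocks redacted} for (day, is_redacted) pairs."""
    totals = defaultdict(int)
    redacted = defaultdict(int)
    for day, is_redacted in blocks:
        totals[day] += 1
        if is_redacted:
            redacted[day] += 1
    return {day: redacted[day] / totals[day] for day in sorted(totals)}


def first_day_over(shares, threshold=0.5):
    """Return the earliest day whose redacted share exceeds the threshold, or None."""
    for day in sorted(shares):
        if shares[day] > threshold:
            return day
    return None


# Synthetic sample (hypothetical counts, not the reported data).
sample = (
    [(date(2026, 3, 7), False)] * 3
    + [(date(2026, 3, 7), True)] * 1
    + [(date(2026, 3, 8), False)] * 2
    + [(date(2026, 3, 8), True)] * 3
)

shares = redaction_share_by_day(sample)
crossing = first_day_over(shares)
```

With real logs, the pairs would have to be extracted from each session file first; how a redacted block is marked in those files is not described in the report, so that step is left out here.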

2. Thinking Depth Was Declining Before Redaction

The signature field on thinking blocks has a 0.971 Pearson correlation with thinking content length (measured from 7,146 paired samples...
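The correlation mentioned above can be illustrated with a small sketch: compute a Pearson coefficient between two length series, then fit a least-squares line so that one length can be estimated from the other. The function names and the four sample pairs are hypothetical, and treating the signature field's length as the predictor is an assumption; the report does not show its actual method.

```python
from math import sqrt


def _centered_sums(xs, ys):
    """Return means plus the covariance and variance sums used by both functions."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("a constant series has no defined correlation")
    return mx, my, cov, vx, vy


def pearson(xs, ys):
    """Pearson correlation coefficient of two series."""
    _, _, cov, vx, vy = _centered_sums(xs, ys)
    return cov / sqrt(vx * vy)


def fit_line(xs, ys):
    """Least-squares slope and intercept for y ≈ slope * x + intercept."""
    mx, my, cov, vx, _ = _centered_sums(xs, ys)
    slope = cov / vx
    return slope, my - slope * mx


# Synthetic pairs (hypothetical): signature length vs. visible thinking length.
sig_lens = [100, 200, 300, 400]
content_lens = [510, 990, 1520, 1980]

r = pearson(sig_lens, content_lens)
slope, intercept = fit_line(sig_lens, content_lens)
estimate = slope * 250 + intercept  # estimated content length for a signature of length 250
```

A high coefficient only says the two lengths move together in the paired samples; an estimate made for redacted blocks additionally assumes that the relationship did not change once redaction began.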

Original Text (English)

anthropics / claude-code (Public)

[MODEL] Claude Code is unusable for complex engineering tasks with the Feb updates #42796

Open

Labels: area:model, bug (Something isn't working)

stellaraccident opened on Apr 2, 2026

Preflight Checklist:
- I have searched existing issues for similar behavior reports
- This report does NOT contain sensitive information (API keys, passwords, etc.)

Type of Behavior Issue: Other unexpected behavior

What You Asked Claude to Do: Claude has regressed to the point it cannot be trusted to perform complex engineering.

What Claude Actually Did:
- Ignores instructions
- Claims "simplest fixes" that are incorrect
- Does the opposite of requested activities
- Claims completion against instructions

Expected Behavior: Claude should behave like it did in January.

Files Affected: (none listed)
Permission Mode: Accept Edits was ON (auto-accepting changes)
Can You Reproduce This?: Yes, every time with the same prompt
Steps to Reproduce: No response
Claude Model: Opus
Relevant Conversation: (none listed)
Impact: High - Significant unwanted changes
Claude Code Version: Various/all
Platform: Anthropic API

Additional Context

We have a very consistent, high complexity work environment and data mined months of logs to understand why, essentially, starting in February, Claude cannot be trusted to perform complex engineering tasks. Every senior engineer on my team has reported similar experiences/anecdotes; however, we have one engineer with a repeatable process that we have been using to experiment and data mine. Analysis is from his logs and all workarounds known publicly have been attempted.

We have switched to another provider which is doing superior quality work, but Claude has been good to us, and we are leaving this in the hopes that Anthropic can fix their product.

Extended Thinking Is Load-Bearing for Senior Engineering Workflows

Produced by Claude based on my extensive data. If there are any issues, it's because Anthropic doesn't let Claude think anymore ;) Unfortunately Claude deleted my January logs containing a bulk of my work, so only summary analysis is available. January was what I expect, February started sliding, and March was a complete and utter loss.

Summary

Quantitative analysis of 17,871 thinking blocks and 234,760 tool calls across 6,852 Claude Code session files reveals that the rollout of thinking content redaction (redact-thinking-2026-02-12) correlates precisely with a measured quality regression in complex, long-session engineering workflows.

The data suggests that extended thinking tokens are not a "nice to have" but are structurally required for the model to perform multi-step research, convention adherence, and careful code modification. When thinking depth is reduced, the model's tool usage patterns shift measurably from research-first to edit-first behavior, producing the quality issues users have reported. This report provides data to help Anthropic understand which workflows are most affected and why, with the goal of informing decisions about thinking token allocation for power users.

1. Thinking Redaction Timeline Matches Quality Regression

Analysis of thinking blocks in session JSONL files:

Period          Thinking Visible  Thinking Redacted
Jan 30 - Mar 4  100%              0%
Mar 5           98.5%             1.5%
Mar 7           75.3%             24.7%
Mar 8           41.6%             58.4%
Mar 10-11       <1%               >99%
Mar 12+         0%                100%

The quality regression was independently reported on March 8, the exact date redacted thinking blocks crossed 50%. The rollout pattern (1.5% → 25% → 58% → 100% over one week) is consistent with a staged deployment.

2. Thinking Depth Was Declining Before Redaction

The signature field on thinking blocks has a 0.971 Pearson correlation with thinking content length (measured from 7,146 paired samples where both are present). This allows estimation of thinking depth even after redaction.

Period                      Est. Median Thinking (chars)  vs Baseline
Jan 30 - Feb 8 (baseline)   ~2,200                        -
Late February               ~720                          -67%
March 1-5                   ~560                          -75%
March 12+ (fully redacted)  ~600                          -73%

Thinking depth had already dropped ~67% by late February, before redaction began. The redaction rollout in early March made this invisible to users.

3. Behavioral Impact: Measured Quality Metrics

These metrics were computed independently from 18,000+ user prompts before the thinking analysis was performed.

Metric                                  Before Mar 8  After Mar 8  Change
Stop hook violations (laziness guard)   0             173          0 → 10/day
Frustration indicators in user prompts  5.8%          9.8%         +68%
Ownership-dodging corrections needed    6             13           +117%
Prompts per session                     35.9          27.9         -22%
Sessions with reasoning loops (5+)      0             7            0 → 7

A stop hook (stop-phrase-guard.sh) was built to programmatically catch ownership-dodging, premature stopping, and permission-seeking behavior. It fired 173 times in 17 days after March 8. It fired zero times before.

4. Tool Usage Shift: Research-First → Edit-First

Analysis of 234,760 tool invocations shows the model stopped reading code before modifying it.

Read:Edit Ratio (file reads per file edit)

Period                       Read:Edit  Research:Mutation  Read %  Edit %
Good (Jan 30 - Feb 12)       6.6        8.7                46.5%   7.1%
Transition (Feb 13 - Mar 7)  2.8        4.1                37.7%   13.2%
Degraded (Mar 8 - Mar 23)    2.0        2.8                31.0%   15.4%

The model went from 6.6 reads per edit to 2.0 reads per edit, a 70% reduction in research before making changes. In the good period, the model's workflow was: read the target file, read related files, grep for usages across the codebase, read headers and tests, then make a precise edit. In the degraded period, it reads the immediate file and edits, often without checking context.
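As a rough sketch of how the ratios in section 4 could be derived from a flat list of tool-call names: the report names Read, Edit, Write, and grep, but it does not define which tools count as "research" or "mutation", so the groupings below are assumptions, as are the function names and the sample list.

```python
from collections import Counter

# Assumed groupings; the report does not spell out its exact definitions.
RESEARCH_TOOLS = {"Read", "Grep", "Glob"}
MUTATION_TOOLS = {"Edit", "Write"}


def tool_ratios(tool_names):
    """Compute Read:Edit, Research:Mutation, and per-tool percentages from tool-call names."""
    counts = Counter(tool_names)
    total = sum(counts.values())
    reads, edits = counts["Read"], counts["Edit"]
    research = sum(counts[t] for t in RESEARCH_TOOLS)
    mutation = sum(counts[t] for t in MUTATION_TOOLS)
    return {
        "read_edit": reads / edits if edits else None,
        "research_mutation": research / mutation if mutation else None,
        "read_pct": 100 * reads / total if total else 0.0,
        "edit_pct": 100 * edits / total if total else 0.0,
    }


def relative_change(before, after):
    """Fractional change from before to after (negative means a drop)."""
    return (after - before) / before


# Synthetic session (hypothetical): 6 reads, 2 greps, 2 edits, 2 shell calls.
sample = ["Read"] * 6 + ["Grep"] * 2 + ["Edit"] * 2 + ["Bash"] * 2
ratios = tool_ratios(sample)

# The reported drop from 6.6 to 2.0 reads per edit.
drop = relative_change(6.6, 2.0)
```

Applied to the figures in the table, `relative_change(6.6, 2.0)` comes to roughly -0.70, which is the "70% reduction" quoted above.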
Weekly Trend

Week     Read:Edit  Research:Mutation
──────────────────────────────────────────
Jan 26   21.8       30.0
Feb 02   6.3        8.1
Feb 09   5.2        7.1
Feb 16   2.8        4.1
Feb 23   3.2        4.5
Mar 02   2.5        3.7
Mar 09   2.2        3.3
Mar 16   1.7        2.1   ← lowest
Mar 23   2.0        3.0
Mar 30   1.6        2.6

The decline in research effort begins in mid-February, the same period when estimated thinking depth dropped 67%.

Write vs Edit (surgical precision)

Period                     Write % of mutations
Good (Jan 30 - Feb 12)     4.9%
Degraded (Mar 8 - Mar 23)  10.0%
Late (Mar 24 - Apr 1)      11.1%

Full-file Write usage doubled: the model increasingly chose to rewrite entire files rather than make surgical edits, which is faster but loses precision and context awareness.

5. Why Extended Thinking Matters for These Workflows

The affected workflows involve:
- 50+ concurrent agent sessions doing systems programming (C, MLIR, GPU drivers)
- 30+ minute autonomous runs with complex multi-file changes
- Extensive project-specific conventions (5,000+ word CLAUDE.md)
- Code review, bead/ticket management, and iterative debugging
- 191,000 lines merged across two PRs in a weekend during the good period

Extended thinking is the mechanism by which the model:
- Plans multi-step approaches before acting (which files to read, what order)
- Recalls and applies project-specific conventions from CLAUDE.md
- Catches its own mistakes before outputting them
- Decides whether to continue working or stop (session management)
- Maintains coherent reasoning across hundreds of tool calls

When thinking is shallow, the model defaults to the cheapest action available: edit without reading, stop without finishing, dodge responsibility for failures, take the simplest fix rather than the correct one. These are exactly the symptoms observed.

6. What Would Help

- Transparency about thinking allocation: If thinking tokens are being reduced or capped, users who depend on deep reasoning need to know. The redact-thinking header makes it impossible to verify externally.
- A "max thinking" tier: Users running complex engineering workflows would pay significantly more for guaranteed deep thinking. The current subscription model doesn't distinguish between users who need 200 thinking tokens per response and

ClaudeCode AIModelDegradation ExtendedThinkingTokens CodingAI Anthropic