Hacker News • 90 days ago

Claude compression plugin vs. "be brief"

IMP

6/10

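Key summary

This piece benchmarks Repo Caveman, a popular response-compression plugin for Claude Code, against simply adding the two words "Be brief" to the prompt. The plain instruction matched the plugin on both quality and token savings. The practical takeaway is that a short, direct prompt instruction can trim an AI coding assistant's token usage as well as a complex plugin does.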

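Translated text

Repo Caveman is a popular Claude Code compression plugin. The pitch is in the name: ultra-compressed responses, about 75% fewer tokens, and technical accuracy preserved. It ships with six modes, slash commands, intensity dials, and even Classical Chinese variants. I benchmarked it against two words: "be brief." Quality was the same. The range of token savings was the same. The plugin did not beat the boring default on either axis. This article is the long text version of my video. If you want the verdict in two minutes, watch the video.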

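What I tested

I used 24 prompts across six categories: bug diagnosis, concept explanations, architecture tradeoffs, multi-step setup, security and destructive ops, and error interpretation. Each prompt has its own rubric: facts the answer must cover (key_points), terms it must use (must_use_terms), and dangerous wrong claims it must avoid (must_avoid).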

The dataset shape (TypeScript): interface PromptCase { id: string; category: string; prompt: string; key_points: string[]; must_use_terms?: string[]; must_avoid?: string[]; }

A real entry (JSON): { "id": "bug_01", "category": "bug_diagnosis", "prompt": "I have `const [count, setCount] = useState(0); function handleClick() { setCount(count + 1); setCount(count + 1); }`. I expected count to go up by 2 per click but it only goes up by 1. Why?", "key_points": [ "stale closure on count", "both calls set count to same value", "functional updater setCount(c => c + 1)" ] }

There were five test arms: baseline (Claude's default, no instruction); brief ("Be brief." prepended to every prompt); and lite, full, and ultra (the Caveman plugin at three intensity levels).

Each arm ran the full 24-prompt dataset through claude -p on the claude-opus-4-7 model. A separate Claude model (claude-sonnet-4-6) scored every response against that prompt's rubric: semantic match on key points, literal match on required terms, and trap detection on claims to avoid. The test harness is open source.
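The literal parts of that rubric check can be sketched in a few lines. This is a minimal illustration, not the harness's actual code: the helper names (`containsPhrase`, `literalCheck`), the case-insensitive substring matching, and the example case are all assumptions. The semantic key_points match is done by a judge model in the article and is not reproduced here.

```typescript
// Hypothetical sketch: only the literal checks (required terms, avoided claims).
// The semantic key_points match needs a judge model and is left out on purpose.

interface PromptCase {
  id: string;
  category: string;
  prompt: string;
  key_points: string[];
  must_use_terms?: string[];
  must_avoid?: string[];
}

interface LiteralCheck {
  missingTerms: string[]; // required terms absent from the response
  trapsHit: string[];     // must_avoid phrases found verbatim in the response
}

// Case-insensitive substring test. This is an assumption; the real harness may match differently.
function containsPhrase(response: string, phrase: string): boolean {
  return response.toLowerCase().includes(phrase.toLowerCase());
}

function literalCheck(testCase: PromptCase, response: string): LiteralCheck {
  const required = testCase.must_use_terms ?? [];
  const avoided = testCase.must_avoid ?? [];
  return {
    missingTerms: required.filter((term) => !containsPhrase(response, term)),
    trapsHit: avoided.filter((claim) => containsPhrase(response, claim)),
  };
}

// Made-up case loosely modelled on a queue tradeoff question; not an entry from the dataset.
const queueCase: PromptCase = {
  id: "arch_example",
  category: "architecture_tradeoffs",
  prompt: "SQS vs BullMQ vs Kafka for a job queue?",
  key_points: [],
  must_use_terms: ["at-least-once"],
  must_avoid: ["exactly-once is guaranteed"],
};

const result = literalCheck(
  queueCase,
  "SQS standard queues give at-least-once delivery, so make consumers idempotent."
);
console.log(result); // both arrays are empty for this response
```

A response that paraphrases the required term (for example "may deliver twice") would show up in `missingTerms`, which is the kind of failure a terminology-enforcing rubric is designed to catch.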

Quality didn't move

First check: did compression hurt correctness? Every arm scored within 1.5% of every other arm: Baseline 0.985, Brief 0.985, Lite 0.976, Full 0.975, Ultra 0.970. Every arm hit 100% of its key_points. Across 120 responses there were zero must_avoid triggers. Compression did not drop substantive content.

Setting quality aside, the only axis left worth comparing is tokens.

The headline result

"Be brief." cut tokens by 34% versus baseline. Caveman lite and full landed close to brief. Ultra, the strictest mode, produced the longest answers of the three caveman arms. That looks bad for ultra. But it is the wrong conclusion.
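The 34% figure follows directly from the mean token counts reported in the original article's table (baseline 636, brief 419, lite 401, full 404, ultra 449). A small sketch of the arithmetic; the function and variable names are illustrative only:

```typescript
// Mean tokens per arm, as reported in the original article's table.
const armMeans = [
  { arm: "baseline", meanTokens: 636 },
  { arm: "brief", meanTokens: 419 },
  { arm: "lite", meanTokens: 401 },
  { arm: "full", meanTokens: 404 },
  { arm: "ultra", meanTokens: 449 },
];

// Percent reduction relative to baseline, rounded to a whole percent.
function reductionPercent(armTokens: number, baselineTokens: number): number {
  return Math.round(((baselineTokens - armTokens) / baselineTokens) * 100);
}

const baselineTokens = armMeans[0].meanTokens;
for (const { arm, meanTokens } of armMeans) {
  console.log(`${arm}: ${reductionPercent(meanTokens, baselineTokens)}% fewer tokens than baseline`);
}
// baseline: 0%, brief: 34%, lite: 37%, full: 36%, ultra: 29%
```

On these single-shot runs, every arm's reduction sits in the 29% to 37% range, well short of the roughly 75% the plugin advertises.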

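The category split

Splitting tokens by category gives a clearer picture. On bug diagnosis, concept explanations, architecture tradeoffs, and error interpretation, ultra was the shortest or tied with the other caveman arms. Compression was working as advertised.

On multi-step setup and security warnings, every caveman mode became more variable. Ultra stands out in the aggregate number, but it is not specifically worse. All three caveman arms swung hard on these categories.

The reason is in the skill itself. Caveman has an "Auto-Clarity" rule that explicitly turns compression off for security warnings, irreversible actions, and multi-step sequences. Those are exactly these two categories. When the safety escape engages, all three modes loosen into natural prose. Compression simply is not running.

This is not a bug. It is a designed feature: Caveman knowing when to stop compressing.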
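To see why an aggregate mean can mislead here, it helps to group token counts by category before averaging. The sketch below uses made-up numbers, not the article's data, and the `Run` type, the `meanByCategory` helper, and the category label `multi_step_setup` are assumptions for illustration:

```typescript
interface Run {
  arm: string;
  category: string;
  tokens: number;
}

// Mean token count per category for a single arm.
function meanByCategory(runs: Run[], arm: string): Map<string, number> {
  const sums = new Map<string, { total: number; count: number }>();
  for (const run of runs) {
    if (run.arm !== arm) continue;
    const entry = sums.get(run.category) ?? { total: 0, count: 0 };
    entry.total += run.tokens;
    entry.count += 1;
    sums.set(run.category, entry);
  }
  const means = new Map<string, number>();
  sums.forEach((entry, category) => means.set(category, entry.total / entry.count));
  return means;
}

// Hypothetical token counts, NOT the article's measurements.
const runs: Run[] = [
  { arm: "ultra", category: "bug_diagnosis", tokens: 200 },
  { arm: "ultra", category: "bug_diagnosis", tokens: 240 },
  { arm: "ultra", category: "multi_step_setup", tokens: 400 },
  { arm: "ultra", category: "multi_step_setup", tokens: 2000 }, // one uncompressed outlier
];

const ultraMeans = meanByCategory(runs, "ultra");
console.log(ultraMeans.get("bug_diagnosis"));    // 220
console.log(ultraMeans.get("multi_step_setup")); // 1200
```

With only three to five prompts per category, a single long response in a category where compression is switched off can dominate that category's mean and pull up the arm's overall average.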

So what is caveman actually...


Original text (English)

Repo Caveman is a popular Claude Code compression plugin. The pitch is in the name: ultra-compressed responses, ~75% fewer tokens, all the technical accuracy. Six modes, slash commands, intensity dials, classical Chinese variants.

I benchmarked it against two words: "be brief."

Same quality. Same range of tokens. The plugin didn't beat the boring default on either axis.

This article is the long version of the video. If you want the verdict in two minutes, watch it.

What I tested

| Category | Failure mode | Skill claim tested | n |
| --- | --- | --- | --- |
| Bug diagnosis | Drops the why, gives fix without cause | - | 5 |
| Concept explanation | Strips nuance, edge cases, or compresses technical terms into plain English | Technical terms exact | 5 |
| Architectural tradeoffs | Drops caveats that change the advice | - | 4 |
| Multi-step setup | Collapses or reorders steps | - | 4 |
| Security / destructive ops | Missing warnings on irreversible actions | Auto-Clarity escape | 3 |
| Error interpretation | Paraphrases or truncates the error string | Errors quoted exact | 3 |

24 prompts across six categories: bug diagnosis, concept explanations, architecture tradeoffs, multi-step setup, security and destructive ops, error interpretation.

Each prompt has a per-prompt rubric. Facts the answer must cover (key_points), terms it must use (must_use_terms), and dangerous wrong claims to avoid (must_avoid).

The dataset shape (ts):

    interface PromptCase {
      id: string;
      category: string;
      prompt: string;
      key_points: string[];
      must_use_terms?: string[];
      must_avoid?: string[];
    }

A real entry (json):

    {
      "id": "bug_01",
      "category": "bug_diagnosis",
      "prompt": "I have `const [count, setCount] = useState(0); function handleClick() { setCount(count + 1); setCount(count + 1); }`. I expected count to go up by 2 per click but it only goes up by 1. Why?",
      "key_points": [
        "stale closure on count",
        "both calls set count to same value",
        "functional updater setCount(c => c + 1)"
      ]
    }

Five arms:

baseline: Claude default, no instruction.
brief: "Be brief." prepended to every prompt.
lite, full, ultra: Caveman plugin at three intensity levels.

Each arm ran the full 24-prompt dataset through claude -p on claude-opus-4-7. A separate Claude (claude-sonnet-4-6) scored every response against its prompt's rubric. Semantic match on key points, literal match on required terms, trap detection on avoided claims.

The harness is open source here.

Quality didn't move

First check: did compression hurt correctness?

Every arm scored within 1.5% of every other arm. Baseline 0.985. Brief 0.985. Lite 0.976. Full 0.975. Ultra 0.970. Every arm hit 100% of its key_points. Zero must_avoid triggers in 120 responses. Compression didn't drop substantive content.

Setting quality aside, the only axis worth comparing is tokens.

The headline result

| Arm | Mean tokens |
| --- | --- |
| baseline | 636 |
| brief | 419 |
| lite | 401 |
| full | 404 |
| ultra | 449 |

"Be brief." cut tokens 34% versus baseline. Caveman lite and full landed close to brief. Ultra, the strictest mode, produced the longest answers of the three caveman arms.

This looked bad for ultra. It's a false story.

The category split

Splitting tokens by category gives a clearer picture.

On bug diagnosis, concept explanations, architecture tradeoffs, and error interpretation, ultra is shortest or tied with the other caveman arms. Compression is working as advertised.

On multi-step setup and security warnings, every caveman mode gets more variable. Ultra catches the eye in the aggregate, but it's not specifically worse. All three caveman arms swing hard on these categories.

The reason is in the skill itself. Caveman has an "Auto-Clarity" rule that explicitly drops compression for safety warnings, irreversible actions, and multi-step sequences. Exactly these two categories. When the safety escape engages, all three modes loosen toward natural prose. The compression just isn't running.

That's not a bug. It's a designed feature. Caveman knowing when to stop compressing.

So what's caveman actually for?

If a two-word prompt matches it on tokens and quality, the value isn't compression. It's structure.

Consistent output shape

Every caveman response follows the same pattern:

Predictable in a way that "be brief." isn't. If you want a uniform feel across sessions, or have downstream tooling that consumes Claude output, that consistency is real value.

The intensity dial

Slash command to switch lite, full, ultra mid-session. Two words can't do that.

Persistence across long sessions

Caveman re-injects the ruleset on every prompt via SessionStart and UserPromptSubmit hooks. The goal is to keep the pattern from drifting across long sessions. My benchmark didn't test this. Every run was single-shot via claude -p. But the mechanism is real, and "be brief." in CLAUDE.md doesn't have an equivalent.

The safety escape

Auto-Clarity dropping compression on destructive ops is the variance you saw in the chart above. Caveman explicitly distinguishes when to stop compressing. Two words don't make that distinction. On my data this didn't change outcomes. "be brief." never tripped a must_avoid trap either. But the design exists.

What I cut from the video

A few findings that didn't earn their place in a two-minute video but are worth flagging here.

Lite missed a required term once. On a queue tradeoff question (SQS vs BullMQ vs Kafka), lite's markdown-table format compressed the comparison so tight it dropped the term "at-least-once". Score 0.70. The only row below 0.90 in the 120-row sweep. n=1, but it's a real failure mode for benchmarks that enforce specific terminology.

Ultra triggered tool-use behaviour the other modes didn't. On a Dockerfile setup question, ultra opened with "Need write perms. Retry after approve, or paste inline:". It tried to call the Write tool, got blocked, and dumped the file inline anyway. That single response added ~1300 tokens to ultra's setup category mean. Caveman's terse examples seem to prime tool-first behaviour, which is a side-effect of compression style I didn't see coming.

The arch_tradeoffs token inflation isn't what I thought. My initial findings doc claimed caveman's [thing] [action] [reason] pattern pushed the model toward bulleted enumerations on N-way comparison questions. Looking closer, lite and full have the same pattern but produced cleaner outputs (lite often wrote tables, full wrote prose). The pattern isn't the cause. I don't have a clean attribution.

What you should actually do

If all you want is shorter outputs, start with "be brief." in your prompt or CLAUDE.md. Two words. Matched caveman's tokens and quality.

Reach for caveman when you need consistent output structure across sessions. That's the differentiator that survived the benchmark.

The bigger lesson: most prompt-engineering advice hasn't been measured against the boring default. Measure it.

Repo: cc-compression-bench · Video: youtu.be/wijoYNiZq3M · Caveman plugin: juliusbrussee/caveman

If you've got a compression strategy you want benchmarked against the same dataset, the harness is strategy-agnostic. Adding an arm is one shell script. PRs welcome.

Claude, coding assistant, prompt engineering, performance benchmark, token optimization