Hacker News • 111일 전

클로드 발화자 혼동 버그, 치명적 문제로 지적돼

IMP

8/10

핵심 요약

AI 모델 클로드(Claude)가 자신이 생성한 메시지를 사용자의 입력으로 착각하는 치명적인 버그가 보고되었습니다. 이 버그는 단순한 환각(Hallucination)이나 권한 문제가 아니라, 시스템 내부에서 메시지의 발신자를 잘못 레이블링하는 근본적인 결함으로 추정됩니다. AI가 스스로 파괴적 지시를 내린 뒤 사용자가 그렇게 지시했다고 우기기 때문에 개발자가 의도치 않은 결과를 통제하기 어려워진다는 점에서 중요한 문제로 평가받습니다.

번역된 본문

클로드가 누가 무슨 말을 했는지 혼동하는 버그, 이건 절대 안 된다.

이 버그는 클로드가 때때로 자신에게 메시지를 보낸 뒤, 그 메시지가 사용자로부터 온 것이라고 착각하는 현상입니다. 이는 제가 지금까지 LLM 제공업체에서 본 것 중 최악의 버그입니다. 하지만 사람들은 항상 상황을 오해하고 LLM 자체, 환각(Hallucination), 또는 권한 경계(Permission boundaries)의 부족 탓으로 돌립니다. 이들은 관련된 문제들이 맞지만, 이 '누가 무엇을 말했는가' 버그는 근본적으로 다른 범주의 문제입니다.

저는 이전 글(Claude Code에서 지금까지 본 최악의 버그)에서 이 문제에 대해 상세히 다루며, 클로드가 스스로에게 지시를 내린 뒤 그 지시가 저(사용자)로부터 온 것이라고 믿는 두 가지 사례를 보여주었습니다. 클로드는 제 오타가 의도적인 것이라고 스스로 결정을 내린 후 배포를 진행했고, 급기야 제가 그렇게 말했다고 주장했습니다.

이 문제를 겪은 건 저만이 아닙니다.

한 레딧(Reddit) 스레드에서는 클로드가 "H100도 철거해(Tear down the H100 too)"라고 스스로 말해놓고, 사용자가 그 지시를 내렸다고 주장하는 사례가 올라왔습니다. r/Anthropic 커뮤니티의 글 제목은 '클로드가 스스로 파괴적인 지시를 내리고 사용자에게 책임을 돌린다'였습니다.

"그렇게 많은 접근 권한을 주면 안 되죠" 같은 반응들

제 이전 글에 달린 댓글 중에는 "당신의 데브옵스(DevOps) 과정에 더 많은 훈련과 제약이 필요하도록 도와줄 것" 같은 의견들이 있었습니다. 레딧 스레드에서도 "보존해야 할 데이터가 있다면, 특히 프로덕션 환경에 이렇게 많은 접근 권한을 주면 안 된다"는 식의 반응이 많았습니다.

하지만 이는 핵심을 벗어난 지적입니다. 물론 AI는 위험성을 내포하고 예측 불가능하게 행동할 수 있습니다. 하지만 몇 달간 사용하다 보면 그것이 어떤 종류의 실수를 하는지, 언제 더 주의 깊게 지켜봐야 하는지, 언제 더 많은 권한이나 긴 주줄(Long leash)을 줘야 하는지 감을 잡게 됩니다.

이 버그는 모델 자체의 문제라기보다는 제어 시스템(하네스, harness) 내부의 문제로 보입니다. 시스템이 내부 추론 메시지를 사용자가 보낸 것으로 잘못 레이블링(Labeling)하고 있기 때문에, 모델이 자신의 잘못이 아니라 "아닙니다, 당신이 그렇게 말했습니다"라고 매우 자신만만하게 우기는 것입니다.

이전에는 이것이 일시적인 현상이라고 생각했습니다. 하루에 몇 번 본 적이 있고 그 후 몇 달 동안 다시 나타나지 않았기 때문입니다. 하지만 이는 회귀(Regression) 현상이거나, 우연의 일치로 가끔씩 불쑥 나타나는 것일 수 있습니다. 아마도 사람들은 클로드가 스스로 나쁜 행동을 할 권한을 부여할 때만 이 버그를 눈치채는 것 같습니다.

원문 보기

원문 보기 (영어)

The bug Claude sometimes sends messages to itself and then thinks those messages came from the user. This is the worst bug I’ve seen from an LLM provider, but people always misunderstand what’s happening and blame LLMs, hallucinations, or lack of permission boundaries. Those are related issues, but this ‘who said what’ bug is categorically distinct. I wrote about this in detail in The worst bug I’ve seen so far in Claude Code , where I showed two examples of Claude giving itself instructions and then believing those instructions came from me. Claude told itself my typos were intentional and deployed anyway, then insisted I was the one who said it. It’s not just me Here’s a Reddit thread where Claude said “Tear down the H100 too”, and then claimed that the user had given that instruction. From r/Anthropic — Claude gives itself a destructive instruction and blames the user. “You shouldn’t give it that much access” Comments on my previous post were things like “It should help you use more discipline in your DevOps.” And on the Reddit thread, many in the class of “don’t give it nearly this much access to a production environment, especially if there’s data you want to keep.” This isn’t the point. Yes, of course AI has risks and can behave unpredictably, but after using it for months you get a ‘feel’ for what kind of mistakes it makes, when to watch it more closely, when to give it more permissions or a longer leash. This class of bug seems to be in the harness, not in the model itself. It’s somehow labelling internal reasoning messages as coming from the user, which is why the model is so confident that “No, you said that.” Before, I thought it was a temporary thing — I saw it a few times in a single day, and then not again for months. But either they have a regression or it was a coincidence and it just pops up every so often, and people only notice when it gives itself permission to do something bad.

클로드 버그 코딩-에이전트 오작동