r/singularity • 103 days ago

Claude Opus 4.7 regresses against 4.6 on a benchmark

IMP

6/10

Key summary

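Claude Opus 4.7 unexpectedly scored lower than its predecessor, Opus 4.6, on the Thematic Generalization Benchmark. The model dropped a specific constraint implied by the given examples and chose a broader but wrong pattern, which suggests that reasoning and context-tracking ability may have regressed in the model update.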

Translated text

Opus 4.7 (no reasoning) scores 52.6, compared with 68.8 for Opus 4.6.

Opus 4.7 xhigh (extra-high reasoning) is not an improvement either.

This benchmark tests whether large language models (LLMs) can infer a specific latent theme from a few examples, use anti-examples to reject a broader but wrong pattern, and then identify the one true match among close distractors.

One example of how Opus 4.7 fails:

Theme: religious texts written on animal skin. 4.6 gets the conjunction of "animal skin" and "religious text" right. 4.7 loses the material constraint and behaves as if "religious manuscript" alone is enough. The anti-examples make the intended distinction very clear: one is animal skin but not religious, and the other is religious but not animal skin.
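To make the conjunction point concrete, here is a minimal Python sketch of the two candidate rules. It is not the benchmark's code: the `Candidate` fields, the rule functions, and every item name are hypothetical and exist only to illustrate why the two anti-examples rule out the broader "religious only" pattern.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    # Hypothetical item description; not taken from the benchmark data.
    name: str
    on_animal_skin: bool
    religious: bool


def matches_conjunction(c: Candidate) -> bool:
    # Intended theme: both constraints must hold.
    return c.on_animal_skin and c.religious


def matches_religious_only(c: Candidate) -> bool:
    # Broader pattern that drops the material constraint.
    return c.religious


def consistent_with_anti_examples(rule, anti_examples) -> bool:
    # A plausible rule must reject every anti-example.
    return not any(rule(a) for a in anti_examples)


def pick(rule, candidates) -> list[str]:
    # Names of all candidates the rule accepts.
    return [c.name for c in candidates if rule(c)]


# One anti-example per violated constraint, as described in the post.
anti_examples = [
    Candidate("hypothetical parchment land deed", True, False),
    Candidate("hypothetical scripture on paper", False, True),
]

# One true match plus close distractors (all invented for illustration).
candidates = [
    Candidate("hypothetical vellum prayer book", True, True),
    Candidate("hypothetical paper hymnal", False, True),
    Candidate("hypothetical parchment tax roll", True, False),
]

print(consistent_with_anti_examples(matches_conjunction, anti_examples))
print(consistent_with_anti_examples(matches_religious_only, anti_examples))
print(pick(matches_conjunction, candidates))
print(pick(matches_religious_only, candidates))
```

Under these assumptions, the conjunction rule rejects both anti-examples and singles out one candidate, while the "religious only" rule accepts an anti-example and cannot separate the true match from the paper distractor.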

Average completion tokens:
Opus 4.7 (no reasoning): 182
Opus 4.7 (high reasoning): 711
Opus 4.7 (xhigh reasoning): 1121
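For a sense of scale, the short Python sketch below works out the arithmetic implied by the figures quoted in the post. The dictionary keys are just labels, and the post does not say which reasoning setting produced the Opus 4.6 score, so the gap is only a rough comparison.

```python
# Figures quoted in the post; nothing here is measured independently.
scores = {"Opus 4.6": 68.8, "Opus 4.7 (no reasoning)": 52.6}
avg_completion_tokens = {
    "Opus 4.7 (no reasoning)": 182,
    "Opus 4.7 (high reasoning)": 711,
    "Opus 4.7 (xhigh reasoning)": 1121,
}

# Score gap in benchmark points between 4.6 and 4.7 (no reasoning).
score_gap = round(scores["Opus 4.6"] - scores["Opus 4.7 (no reasoning)"], 1)

# Completion-token usage relative to the no-reasoning setting.
baseline = avg_completion_tokens["Opus 4.7 (no reasoning)"]
token_ratio = {
    mode: round(tokens / baseline, 2)
    for mode, tokens in avg_completion_tokens.items()
}

print(score_gap)    # 16.2
print(token_ratio)  # high is about 3.91x, xhigh about 6.16x the baseline
```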

More info: https://github.com/lechmazur/generalization

Original text (English)

Opus 4.7 (no reasoning) scores 52.6 compared to 68.8 for Opus 4.6. Opus 4.7 xhigh is not an improvement.

This benchmark tests whether large language models can infer a specific latent theme from a few examples, use anti-examples to reject the broader but wrong pattern, and then identify the one true match among close distractors.

One example of how Opus 4.7 fails:

Theme: religious texts written on animal skin. 4.6 gets the conjunction right. 4.7 loses the material constraint and behaves as if "religious manuscript" alone is enough. The anti-examples make the intended distinction very clear: one is animal skin but not religious and the other is religious but not animal skin.

Average completion tokens:
Opus 4.7 (no reasoning): 182
Opus 4.7 (high reasoning): 711
Opus 4.7 (xhigh reasoning): 1121

More info: https://github.com/lechmazur/generalization

Claude Opus 4.7 • Benchmark • Model performance • Reasoning