r/singularity • 103 days ago

Claude Opus 4.7 drops sharply to 41% on the Extended benchmark

IMP

7/10

Key summary

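The recently released Claude Opus 4.7 (high) model scored only 41.0% on the Extended NYT Connections benchmark, a large drop from the 94.7% recorded by the previous version, Opus 4.6. The top of the leaderboard currently includes Google's Gemini 3.1 Pro Preview (98.4%) and Opus 4.6. The result suggests that a newer model does not always perform better on every benchmark.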

Translated text

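Extended Version

This benchmark evaluates large language models (LLMs) using 940 NYT Connections puzzles, with additional words included to increase difficulty. As of Feb 4, 2025, there is a new version of the benchmark. The standard NYT Connections benchmark is nearing saturation: o1 scored 90.7, and o3 and other reasoning models are expected this year. Under the current rules, a solver only needs to know three categories, because the fourth then falls into place. To increase difficulty, Extended Connections adds up to four extra trick words to each puzzle. We double-check that none of the added words fit into any category used in the corresponding puzzle. As of Feb 2, 2026, new puzzles have expanded the total from 436 to 940.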
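The trick-word step described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: `extend_puzzle`, `candidate_words`, and `fits_category` are hypothetical names, and `fits_category` stands in for whatever manual or model-based check the authors use to confirm that an added word belongs to no category.

```python
import random


def extend_puzzle(categories, candidate_words, fits_category, max_extra=4, seed=0):
    """Return (board, extras): the shuffled board plus the trick words added.

    categories: dict mapping a category name to its four words.
    candidate_words: pool of possible trick words (hypothetical input).
    fits_category: callable(word, category_name) -> bool; an assumed stand-in
        for the check that a word does not belong to a category.
    """
    rng = random.Random(seed)
    puzzle_words = [w for words in categories.values() for w in words]
    seen = {w.lower() for w in puzzle_words}

    pool = list(candidate_words)
    rng.shuffle(pool)

    extras = []
    for word in pool:
        if len(extras) == max_extra:
            break
        if word.lower() in seen:
            continue  # already on the board
        if any(fits_category(word, name) for name in categories):
            continue  # would be a valid member of some category, so reject it
        extras.append(word)
        seen.add(word.lower())

    board = puzzle_words + extras
    rng.shuffle(board)
    return board, extras
```

With the extra words on the board, knowing three categories no longer determines the fourth, because the leftover words include distractors.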

[Table: Extended Version leaderboard]

Rank | Model | Score (%) | Puzzles
1 | Gemini 3.1 Pro Preview | 98.4 | 940
2 | gemini-3-pro-preview | 96.3 | 940
3 | Claude Opus 4.6 (high reasoning) | 94.7 | 940
4 | GPT-5.4 (xhigh reasoning) | 94.0 | 940
5 | GPT-5.4 (high reasoning) | 93.6 | 940
6 | Grok 4.20 Multi-Agent Exp Beta 0304 | 93.4 | 940
7 | GPT-5.4 (medium reasoning) | 91.9 | 940
8 | grok-4-1-fast-reasoning | 91.7 | 940
9 | Grok 4.20 0309 (Reasoning) | 90.3 | 940
10 | grok-4.20-experimental-beta-0304-reasoning | 89.5 | 940
11 | gpt-5.2-xhigh | 88.6 | 940
12 | Gemini 3 Flash Preview | 88.4 | 940
13 | GPT-5.2 Pro | 85.7 | 940
14 | Claude Sonnet 4.6 (high reasoning) | 85.7 | 940
15 | GLM-5.1 | 84.3 | 940
16 | Claude Sonnet 4.6 Thinking 32K | 82.4 | 940
17 | GLM-5 | 81.7 | 940
18 | Claude Opus 4.6 Thinking 16K | 81.7 | 940
19 | Gemma 4 31B Reasoning | 79.5 | 940
20 | Kimi K2.5 Thinking | 78.3 | 940
21 | gpt-5.2-high | 77.5 | 940
22 | GPT-5.4 Mini (xhigh reasoning) | 71.8 | 940
23 | gpt-5.2-medium | 71.4 | 940
24 | Qwen 3.6 Plus | 71.3 | 940
25 | Qwen3.5-397B-A17B | 69.2 | 940
26 | gpt-5.2-low | 66.7 | 940
27 | Qwen3.5-122B-A10B | 63.6 | 940
28 | Claude Opus 4.5 Thinking 16K | 62.6 | 940
29 | Qwen3.5-27B | 60.7 | 940
30 | Claude Opus 4.5 (no reasoning) | 60.3 | 940
31 | Claude Sonnet 4.6 Thinking 16K | 57.6 | 940
32 | Claude Opus 4.6 (no reasoning) | 55.9 | 940
33 | Claude Sonnet 4.6 (no reasoning) | 55.0 | 940
34 | DeepSeek V3.2 | 50.2 | 940
35 | Claude Sonnet 4.5 Thinking 16K | 49.4 | 940
36 | Claude Sonnet 4.5 (no reasoning) | 47.4 | 940
37 | qwen3-max-2026-01-23 | 42.1 | 940
38 | ByteDance Seed2.0 Pro | 42.1 | 940
39 | Claude Opus 4.7 (high reasoning) | 41.0 | 940
40 | Xiaomi MiMo V2 Pro | 40.9 | 940
41 | Step 3.5 Flash | 39.9 | 940
42 | MiniMax-M2.7 | 35.2 | 940
43 | GPT-5.4 (no reasoning) | 32.8 | 940
44 | LongCat Flash Thinking | 31.0 | 940
45 | Gemma 4 31B IT | 30.1 | 940
46 | minimax-m2.5 | 29.6 | 940
47 | Arcee Trinity Large Thinking | 29.5 | 940
48 | gpt-5.2-none | 28.1 | 940
49 | minimax-m2 | 27.0 | 940
50 | Claude 4.5 Haiku | 26.0 | 940
51 | grok-4-1-fast-non-reasoning | 25.1 | 940
52 | qwen3-max-thinking | 24.1 | 940
53 | minimax-m2.1 | 22.7 | 940
54 | Baidu Ernie 5.0 | 21.2 | 940
55 | Gemini 3.1 Flash-Lite Preview | 19.7 | 940
56 | Grok 4.20 0309 (Non-Reasoning) | 19.2 | 940
57 | Llama 4 Maverick | 18.4 | 940
58 | DeepSeek V3.2 (no reasoning) | 17.8 | 940
59 | grok-4.20-experimental-beta-0304-non-reasoning | 17.6 | 940
60 | Mistral Large 3 | 17.2 | 940
61 | Mistral Medium 3.1 | 15.5 | 940
62 | Claude Opus 4.7 (no reasoning) | 15.3 | 940

Correlation of puzzle-level results: heatmap

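Newest 100 puzzles

To counteract the possibility that an LLM's training data includes the solutions, we have also tested on only the 100 latest puzzles. Note that lower scores on this subset do not necessarily indicate that NYT Connections solutions are in the training data, because the earliest puzzles were easier.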
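As a rough illustration of the subset check described above, the sketch below averages per-puzzle scores over the most recent puzzles only. The `(puzzle_number, score)` layout, the 0-to-1 score scale, and the function name `newest_subset_score` are assumptions made for this example; the post does not describe how the benchmark stores or aggregates its results.

```python
def newest_subset_score(results, n=100):
    """Mean score, in percent, over the n most recent puzzles.

    results: iterable of (puzzle_number, score) pairs. A higher puzzle_number
    is assumed to mean a more recent puzzle, and score is assumed to lie in
    the range 0 to 1.
    """
    ranked = sorted(results, key=lambda item: item[0], reverse=True)
    newest = ranked[:n]
    if not newest:
        raise ValueError("no results to score")
    return 100.0 * sum(score for _, score in newest) / len(newest)
```

Comparing this value with the score over all puzzles gives a crude contamination signal, subject to the caveat above that the older puzzles were easier.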

[Table: Newest 100 puzzles, extended version]

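Humans vs. LLMs

To explore how top language models (LLMs) compare to humans on the New York Times Connections puzzle, we used official NYT performance data from December 2024 to February 2025, as analyzed by u/Bryschien1996, alongside a simulated gameplay setup that mirrors the human experience. This setup involves a multi-step process in which solvers iteratively propose groups, receive feedback ("correct," "one away," "incorrect"), and are allowed up to four mistakes before failing. According to NYT data, the average human player solved approximately 71% of puzzles over the three-month period from December 2024 to February 2025.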
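The gameplay loop described above can be sketched as follows. This is an illustrative reconstruction rather than the authors' harness: the `propose` callback (standing in for a human or an LLM), the returned dictionary, and the reading that the fourth mistake ends the game are all assumptions. The sketch covers the standard 16-word game without trick words.

```python
def feedback(guess, groups):
    """Label a four-word guess as "correct", "one away", or "incorrect"."""
    guess = set(guess)
    if len(guess) != 4:
        raise ValueError("a guess must contain four distinct words")
    best_overlap = max(len(guess & set(group)) for group in groups)
    if best_overlap == 4:
        return "correct"
    if best_overlap == 3:
        return "one away"
    return "incorrect"


def play(groups, propose, max_mistakes=4):
    """Run one game; propose(words, history) returns the next four-word guess.

    Assumption: the game is lost as soon as max_mistakes wrong guesses
    have been made.
    """
    remaining = [set(group) for group in groups]
    mistakes = 0
    history = []
    while remaining and mistakes < max_mistakes:
        # Only words from unsolved groups stay on the board.
        words = sorted(word for group in remaining for word in group)
        guess = set(propose(words, history))
        result = feedback(guess, remaining)
        history.append((frozenset(guess), result))
        if result == "correct":
            remaining = [group for group in remaining if group != guess]
        else:
            mistakes += 1
    return {"solved": not remaining, "mistakes": mistakes, "history": history}
```

Running `play` over many puzzles and averaging `solved` gives a win rate comparable in spirit to the human solve rates quoted above, while `mistakes` supports the finer-grained comparison of errors made before solving.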

Original text (English)

Extended Version

This benchmark evaluates large language models (LLMs) using 940 NYT Connections puzzles, with additional words included to increase difficulty. As of Feb 4, 2025, there is a new version of the benchmark. The standard NYT Connections benchmark is nearing saturation, with o1 scoring 90.7 and o3, along with other reasoning models, expected this year. The current rules require knowing only three categories, letting the fourth fall into place. To increase difficulty, Extended Connections adds up to four extra trick words to each puzzle. We double-check that none of the added words fit into any category used in the corresponding puzzle. New puzzles have expanded the total from 436 to 940 as of Feb 2, 2026.

[Chart: Extended Version]

Leaderboard: Extended Version

Rank | Model | Score % | #Puzzles
1 | Gemini 3.1 Pro Preview | 98.4 | 940
2 | gemini-3-pro-preview | 96.3 | 940
3 | Claude Opus 4.6 (high reasoning) | 94.7 | 940
4 | GPT-5.4 (xhigh reasoning) | 94.0 | 940
5 | GPT-5.4 (high reasoning) | 93.6 | 940
6 | Grok 4.20 Multi-Agent Exp Beta 0304 | 93.4 | 940
7 | GPT-5.4 (medium reasoning) | 91.9 | 940
8 | grok-4-1-fast-reasoning | 91.7 | 940
9 | Grok 4.20 0309 (Reasoning) | 90.3 | 940
10 | grok-4.20-experimental-beta-0304-reasoning | 89.5 | 940
11 | gpt-5.2-xhigh | 88.6 | 940
12 | Gemini 3 Flash Preview | 88.4 | 940
13 | GPT-5.2 Pro | 85.7 | 940
14 | Claude Sonnet 4.6 (high reasoning) | 85.7 | 940
15 | GLM-5.1 | 84.3 | 940
16 | Claude Sonnet 4.6 Thinking 32K | 82.4 | 940
17 | GLM-5 | 81.7 | 940
18 | Claude Opus 4.6 Thinking 16K | 81.7 | 940
19 | Gemma 4 31B Reasoning | 79.5 | 940
20 | Kimi K2.5 Thinking | 78.3 | 940
21 | gpt-5.2-high | 77.5 | 940
22 | GPT-5.4 Mini (xhigh reasoning) | 71.8 | 940
23 | gpt-5.2-medium | 71.4 | 940
24 | Qwen 3.6 Plus | 71.3 | 940
25 | Qwen3.5-397B-A17B | 69.2 | 940
26 | gpt-5.2-low | 66.7 | 940
27 | Qwen3.5-122B-A10B | 63.6 | 940
28 | Claude Opus 4.5 Thinking 16K | 62.6 | 940
29 | Qwen3.5-27B | 60.7 | 940
30 | Claude Opus 4.5 (no reasoning) | 60.3 | 940
31 | Claude Sonnet 4.6 Thinking 16K | 57.6 | 940
32 | Claude Opus 4.6 (no reasoning) | 55.9 | 940
33 | Claude Sonnet 4.6 (no reasoning) | 55.0 | 940
34 | DeepSeek V3.2 | 50.2 | 940
35 | Claude Sonnet 4.5 Thinking 16K | 49.4 | 940
36 | Claude Sonnet 4.5 (no reasoning) | 47.4 | 940
37 | qwen3-max-2026-01-23 | 42.1 | 940
38 | ByteDance Seed2.0 Pro | 42.1 | 940
39 | Claude Opus 4.7 (high reasoning) | 41.0 | 940
40 | Xiaomi MiMo V2 Pro | 40.9 | 940
41 | Step 3.5 Flash | 39.9 | 940
42 | MiniMax-M2.7 | 35.2 | 940
43 | GPT-5.4 (no reasoning) | 32.8 | 940
44 | LongCat Flash Thinking | 31.0 | 940
45 | Gemma 4 31B IT | 30.1 | 940
46 | minimax-m2.5 | 29.6 | 940
47 | Arcee Trinity Large Thinking | 29.5 | 940
48 | gpt-5.2-none | 28.1 | 940
49 | minimax-m2 | 27.0 | 940
50 | Claude 4.5 Haiku | 26.0 | 940
51 | grok-4-1-fast-non-reasoning | 25.1 | 940
52 | qwen3-max-thinking | 24.1 | 940
53 | minimax-m2.1 | 22.7 | 940
54 | Baidu Ernie 5.0 | 21.2 | 940
55 | Gemini 3.1 Flash-Lite Preview | 19.7 | 940
56 | Grok 4.20 0309 (Non-Reasoning) | 19.2 | 940
57 | Llama 4 Maverick | 18.4 | 940
58 | DeepSeek V3.2 (no reasoning) | 17.8 | 940
59 | grok-4.20-experimental-beta-0304-non-reasoning | 17.6 | 940
60 | Mistral Large 3 | 17.2 | 940
61 | Mistral Medium 3.1 | 15.5 | 940
62 | Claude Opus 4.7 (no reasoning) | 15.3 | 940

Correlation of puzzle-level results: heatmap

Newest 100 puzzles

To counteract the possibility of an LLM's training data including the solutions, we have also tested only the 100 latest puzzles. Note that lower scores do not necessarily indicate that NYT Connections solutions are in the training data, as the difficulty of the first puzzles was lower.

[Chart: Newest 100 puzzles, extended version]

Humans vs. LLMs

To explore how top language models (LLMs) compare to humans in the New York Times Connections puzzle, we used official NYT performance data from December 2024 to February 2025, as analyzed by u/Bryschien1996, alongside a simulated gameplay setup that mirrors the human experience. This setup involves a multi-step process where solvers iteratively propose groups, receive feedback ("correct," "one away," "incorrect"), and are allowed up to four mistakes before failing.
According to NYT data, the average human player solved approximately 71% of puzzles over the three-month period from December 2024 to February 2025, with solve rates ranging from 39% on the toughest days (e.g., February 2, 2025) to 98% on the easiest (e.g., February 26, 2025). It's worth noting that NYT Connections players are self-selected and likely perform better than the general population.

We collected data from nine LLMs spanning a range of scores in the Extended Connections benchmark. The results reveal that top reasoning LLMs from OpenAI consistently outperform the average human player. DeepSeek R1 performs closest to the level of an average NYT Connections player. Elite human players, however, set a higher standard, achieving a 100% win rate during the same period: o1, with a 98.9% win rate, comes close to this elite level. o1-pro, which has not yet been tested in this gameplay simulation setup, might be able to match these top humans. Thus, directly determining whether AI achieves superhuman performance on NYT Connections could hinge on comparing the number of mistakes made before fully solving each puzzle.

Original NYT Connections LLM Benchmark

This benchmark evaluates large language models (LLMs) using 436 NYT Connections puzzles. Three different prompts, not optimized for LLMs through prompt engineering, are used. Both uppercase and lowercase puzzles are assessed.
[Chart: Original Version]

Leaderboard: Original Version

Model | Score
o1 | 90.7
o1-preview | 87.1
o3-mini | 72.4
DeepSeek R1 | 54.4
o1-mini | 42.2
Multi-turn ensemble | 37.8
Gemini 2.0 Flash Thinking Exp 01-21 | 37.0
GPT-4 Turbo | 28.3
GPT-4o 2024-11-20 | 27.9
GPT-4o 2024-08-06 | 26.5
Llama 3.1 405B | 26.3
Claude 3.5 Sonnet (2024-10-22) | 25.9
Claude 3 Opus | 24.8
Grok Beta | 23.7
Llama 3.3 70B | 23.7
Gemini 1.5 Pro (Sept) | 22.7
Deepseek-V3 | 21.0
Gemini 2.0 Flash Exp | 20.0
Gemma 2 27B | 18.8
Qwen 2.5 Max | 18.6
Gemini 2.0 Flash Thinking Exp | 18.6
Mistral Large 2 | 17.4
Qwen 2.5 72B | 14.8
Claude 3.5 Haiku | 13.7
MiniMax-Text-01 | 13.6
Nova Pro | 12.5
Phi-4 | 11.6
Mistral Small 3 | 10.5
DeepSeek-V2.5 | 9.9

Leaderboard: Older models

These models are excluded from the main board because they ran fewer than 940 total puzzles.

Rank | Model | Score % | #Puzzles (window) | Total Coverage
1 | Sherlock Think Alpha | 92.4 | 759 | 759/940
2 | Grok 4 Fast Reasoning | 92.1 | 759 | 759/940
3 | Grok 4 | 91.7 | 759 | 759/940
4 | Sonoma Sky Alpha | 90.7 | 759 | 759/940
5 | o3-pro (medium reasoning) | 87.3 | 759 | 759/940
6 | GPT-5 Pro | 83.9 | 759 | 759/940
7 | o1-pro (medium reasoning) | 82.5 | 651 | 651/940
8 | o3 (high reasoning) | 78.6 | 759 | 759/940
9 | GPT-5 (high reasoning) | 77.0 | 759 | 759/940
10 | o4-mini (high reasoning) | 73.6 | 759 | 759/940
11 | o3 (medium reasoning) | 73.0 | 759 | 759/940
12 | GPT-5 (medium reasoning) | 72.2 | 759 | 759/940
13 | o1 (medium reasoning) | 70.8 | 651 | 651/940
14 | GPT-5.1 (high reasoning) | 69.9 | 759 | 759/940
15 | o4-mini (medium reasoning) | 68.8 | 651 | 651/940
16 | GPT-5 mini (medium reasoning) | 66.9 | 759 | 759/940
17 | GPT-5 (low reasoning) | 65.4 | 759 | 759/940
18 | GPT-5.1 (medium reasoning) | 62.7 | 759 | 759/940
19 | o3-mini (high reasoning) | 61.4 | 651 | 651/940
20 | GLM-4.7 | 59.5 | 767 | 767/940
21 | Claude Opus 4.1 Thinking 16K | 58.8 | 759 | 759/940
22 | DeepSeek V3.1 Reasoner | 57.7 | 759 | 759/940
23 | Gemini 2.5 Pro | 57.6 | 759 | 759/940
24 | Kimi K2 Thinking 64K | 57.3 | 924 | 924/940
25 | Qwen 3 235B A22B | 54.3 | 759 | 759/940
26 | Gemini 2.5 Pro Exp 03-25 | 54.1 | 651 | 651/940
27 | o3-mini (medium reasoning) | 53.6 | 651 | 651/940
28 | Claude Opus 4 Thinking 16K | 49.7 | 759 | 759/940
29 | DeepSeek R1 05/28 | 48.6 | 759 | 759/940
30 | Qwen 3 235B A22B 25-07 Think | 46.2 | 759 | 759/940
31 | Gemini 2.5 Pro Preview 05-06 | 42.5 | 651 | 651/940
32 | Claude Sonnet 4 Thinking 16K | 40.3 | 759 | 759/940
33 | Claude Sonnet 4 Thinking 64K | 39.6 | 651 | 651/940
34 | GPT-OSS-120B | 38.7 | 759 | 759/940
35 | DeepSeek R1 | 38.6 | 651 | 651/940
36 | Claude Opus 4.1 (no reasoning) | 37.1 | 759 | 759/940
37 | Qwen 3 30B A3B | 36.6 | 759 | 759/940
38 | Qwen 3 32B | 35.8 | 759 | 759/940
39 | Qwen 3 30B A3B 25-07 Thinking | 35.5 | 759 | 759/940
40 | Claude Opus 4 (no reasoning) | 34.4 | 759 | 759/940
41 | GPT-4.5 Preview | 34.2 | 651 | 651/940
42 | Claude 3.7 Sonnet Thinking 16K | 33.6 | 651 | 651/940
43 | Qwen 3 Next 80B A3B Thinking | 32.9 | 759 | 759/940
44 | Qwen QwQ-32B 16K | 31.4 | 651 | 651/940
45 | Grok 3 Mini Beta (high) | 30.2 | 759 | 759/940
46 | GLM-4.5 | 30.2 | 759 | 759/940
47 | Cla

Benchmark, Claude Opus, Language model evaluation, Performance comparison