MarkTechPost • 68 days ago

Alibaba Announces Qwen3.7-Max, a Reasoning Model with a 1-Million-Token Context Window

IMP

8/10

Key Summary

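At Alibaba Cloud Summit 2026, Alibaba unveiled Qwen3.7-Max, its latest closed-weight reasoning model, optimized for multi-step agent work and complex coding tasks. The model supports a 1-million-token context window and shows large gains over its predecessor on scientific reasoning and coding benchmarks.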

Translated Body

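Most AI models today are not designed for sustained, multi-step autonomous execution. Tasks such as running hundreds of iterative code modifications, or chaining tool calls for hours without human intervention, require a different kind of model architecture and training approach.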

Alibaba's Qwen team formally announced Qwen3.7-Max at the 2026 Alibaba Cloud Summit on May 20. In fact, two preview versions of the Qwen3.7 series had quietly appeared on Arena AI's leaderboard with no press release and no official API announcement.

Two Preview Models Released Simultaneously

Alibaba previewed two models at the same time: Qwen3.7-Max-Preview and Qwen3.7-Plus-Preview. According to LM Arena, they ranked 13th globally in text capabilities and 16th in vision capabilities, respectively. In Text Arena, Qwen3.7-Max-Preview ranked #13 overall, which places Alibaba as the #6 lab in text. In Vision Arena, Qwen3.7-Plus-Preview ranked #16 overall, which places Alibaba as the #5 lab in vision. Model rank and lab rank are separate figures.

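Qwen3.7-Plus-Preview is a preview of a high-performance balanced version that focuses on reasoning and logical expression, with its toolchain to be opened gradually. It handles vision and multimodal inputs. Qwen3.7-Max, by contrast, is the text-only reasoning flagship. This article covers Qwen3.7-Max, the model Alibaba formally announced with API access.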

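What Qwen3.7-Max Is Designed For

Alibaba's Qwen team describes Qwen3.7-Max as its most advanced and comprehensive agent model to date. The model is proprietary and closed-weight. It can handle coding and debugging, office workflow automation, and long-horizon tasks spanning hundreds to thousands of steps.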

Extended-Thinking Mode

Qwen3.7-Max is a reasoning model. Before committing to a final answer, it first generates a chain of thought: an internal sequence of steps in which it plans, checks its work, and corrects course. On interfaces such as Qwen Chat, you can switch on a 'Thinking' mode to see the model's reasoning trace. Reasoning models produce far more output tokens than standard models. When Artificial Analysis ran its Intelligence Index evaluation, Qwen3.7-Max generated about 97 million tokens, compared with an average of 24 million for models on that benchmark. For short or simple tasks, this overhead adds latency without improving output quality. For multi-step planning, code refactoring, or long agent chains, extended-thinking mode is where the model's strength applies.

Context Window

The model features a 1M-token context window, up sharply from 256K on Qwen3.6 Max Preview. It supports text input and output only. Pricing has not yet been officially announced. For reference, Qwen3.6 Max Preview was priced at $1.30/$7.80 per million input/output tokens on Alibaba Cloud. A million-token context window can hold a full mid-sized code repository or a large stack of documents in a single request. However, models often reason less reliably as the context window fills. Independent long-context test results for Qwen3.7-Max have not yet been published.
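Whether a repository actually fits in a 1M-token window can be estimated before sending anything. The following is a minimal sketch, not a tokenizer: it assumes roughly 4 characters per token (a common rule of thumb that varies by tokenizer and language), and the function names, file extensions, and the 100,000-token output reserve are all hypothetical choices for illustration.

```python
import os

# Assumption: ~4 characters per token is a rough heuristic for English text
# and code; the real ratio depends on the model's tokenizer.
CHARS_PER_TOKEN = 4
CONTEXT_WINDOW_TOKENS = 1_000_000  # context size reported in the article


def estimate_repo_tokens(root, extensions=(".py", ".md", ".txt")):
    """Walk `root` and return a rough token estimate for matching text files."""
    total_chars = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            # errors="ignore" skips bytes that are not valid UTF-8.
            with open(path, encoding="utf-8", errors="ignore") as handle:
                total_chars += len(handle.read())
    return total_chars // CHARS_PER_TOKEN


def fits_in_context(token_estimate, reserved_for_output=100_000):
    """Check the estimate against the window, leaving room for the reply."""
    return token_estimate + reserved_for_output <= CONTEXT_WINDOW_TOKENS
```

Calling `fits_in_context(estimate_repo_tokens("path/to/repo"))` gives a quick yes/no. Because reasoning may degrade as the window fills, a result close to the limit is worth treating as a warning rather than a green light.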

Benchmark Results

Qwen3.7-Max scored 56.6 on the Artificial Analysis Intelligence Index, placing it fifth overall. That is a 4.8-point gain over its predecessor, Qwen3.6 Max Preview (51.8), and puts it ahead of Google's Gemini 3.5 Flash (55.3). GPT-5.5 (60.2), Claude Opus 4.7 (57.3), and Gemini 3.1 Pro Preview (57.2) still lead the overall rankings.

Intelligence Index v4.0 aggregates ten evaluations, including GDPval-AA, Terminal-Bench Hard, SciCode, AA-Omniscience, Humanity's Last Exam, and GPQA Diamond. The improvement over Qwen3.6 Max Preview is not uniform: most of the Index gains are concentrated in scientific reasoning, agentic capability, and coding. CritPt rose 9.7 percentage points (from 3.7% to 13.4%), and other benchmarks, including Humanity's Last Exam, also showed notable improvement.
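The reported gains are easier to compare when absolute and relative changes are shown side by side. The sketch below uses only before/after scores quoted in the article (the Humanity's Last Exam and Terminal-Bench Hard figures come from the original English text); the dictionary and helper names are illustrative, not part of any benchmark tooling.

```python
# Before/after scores as reported in the article
# (Qwen3.6 Max Preview -> Qwen3.7-Max).
REPORTED = {
    "Intelligence Index": (51.8, 56.6),
    "CritPt (%)": (3.7, 13.4),
    "Humanity's Last Exam (%)": (28.9, 38.1),
    "Terminal-Bench Hard (%)": (43.9, 50.8),
}


def point_gain(before, after):
    """Absolute change in points, rounded to one decimal place."""
    return round(after - before, 1)


def relative_gain(before, after):
    """New score as a multiple of the starting score."""
    return round(after / before, 2)


for name, (before, after) in REPORTED.items():
    gain = point_gain(before, after)
    ratio = relative_gain(before, after)
    print(f"{name}: +{gain} points, {ratio}x the previous score")
```

CritPt and Humanity's Last Exam gain a similar number of points, but CritPt starts from a much lower base, so its relative change is far larger.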

View original

View original (English)

Most AI models today are not designed for sustained, multi-step autonomous execution. Tasks like running hundreds of iterative code modifications, or chaining tool calls across hours without human intervention, require a different kind of model architecture and training focus.

Alibaba's Qwen team formally announced Qwen3.7-Max at the 2026 Alibaba Cloud Summit on May 20. However, two preview versions of the Qwen3.7 series had quietly appeared on Arena AI's leaderboard with no press release and no official API announcement.

Two Preview Models Released Simultaneously

Alibaba previewed two models simultaneously: Qwen3.7-Max-Preview and Qwen3.7-Plus-Preview. They ranked 13th globally in text capabilities and 16th in vision capabilities, respectively, according to LM Arena. In Text Arena, Qwen3.7-Max-Preview ranked #13 overall, placing Alibaba as the #6 lab in text. In Vision Arena, Qwen3.7-Plus-Preview ranked #16 overall, placing Alibaba as the #5 lab in vision. The model rank and the lab rank are separate figures.

Qwen3.7-Plus-Preview is described as a high-performance balanced version preview, focusing on reasoning and logical expression, with its toolchain to be gradually opened in the future. It handles vision and multimodal inputs. Qwen3.7-Max is the text-only reasoning flagship. This article covers Qwen3.7-Max, as it is the model Alibaba formally announced with API access.

What Is Qwen3.7-Max Designed For

Alibaba's Qwen team described Qwen3.7-Max as its most advanced and comprehensive agent model to date. The model is proprietary and closed-weight. It is capable of handling coding and debugging, office workflow automation, and long-horizon tasks spanning hundreds or even thousands of steps.

Extended-Thinking Mode

Qwen3.7-Max is a reasoning model.
The model generates a chain of thought first: an internal sequence of steps where it plans, checks its work, and corrects course before committing to a final answer. On interfaces like Qwen Chat, this shows up as a 'Thinking' mode you can switch on to see the model's reasoning trace. Reasoning models produce significantly more output tokens than standard completions. When Artificial Analysis ran its Intelligence Index evaluation, Qwen3.7-Max generated about 97 million tokens, compared to an average of 24 million for models on that benchmark. For short or simple tasks, this overhead adds latency without improving output quality. For multi-step planning, code refactoring, or long agent chains, extended-thinking mode is where the model's strength applies.

Context Window

The model features a 1M token context window, up from 256K on Qwen3.6 Max Preview. It supports text input and output only. Pricing has not yet been announced. Qwen3.6 Max Preview was priced at $1.30/$7.80 per million input/output tokens on Alibaba Cloud. A million-token context window can hold a full mid-sized code repository or a large stack of documents in a single request. Models often reason less reliably as the context window fills. Independent long-context testing for Qwen3.7-Max is not yet available.

Benchmark Results

Qwen3.7-Max scored 56.6 on the Artificial Analysis Intelligence Index, placing it fifth overall. That represents a 4.8-point gain over its predecessor Qwen3.6 Max Preview (51.8), and puts it ahead of Google's Gemini 3.5 Flash (55.3). GPT-5.5 (60.2), Claude Opus 4.7 (57.3), and Gemini 3.1 Pro Preview (57.2) still lead the overall rankings.

The Intelligence Index v4.0 aggregates ten evaluations, including GDPval-AA, Terminal-Bench Hard, SciCode, AA-Omniscience, Humanity's Last Exam, and GPQA Diamond. The improvement over Qwen3.6 Max Preview is not uniform. Most of the Index gains are concentrated in scientific reasoning, agentic capability, and coding.
CritPt rose 9.7 percentage points (from 3.7% to 13.4%), Humanity's Last Exam jumped 9.2 points (from 28.9% to 38.1%), and Terminal-Bench Hard climbed 6.9 points (from 43.9% to 50.8%). GDPval-AA added 42 Elo points (from 1504 to 1546). Scores on other benchmarks are largely flat compared to Qwen3.6 Max Preview.

One result on the Index requires careful reading. On AA-Omniscience, Qwen3.7-Max's raw accuracy actually dropped 7.6 percentage points (from 37.7% to 30.1%), while its hallucination rate fell 21.3 points (from 44.2% to 22.9%). The model is choosing to say "I don't know" more often rather than recalling more facts. Its attempt rate fell from 67.3% to 48.0%, the lowest among frontier models in the comparison. The AA-Omniscience benchmark rewards correct answers and penalizes hallucinations but has no penalty for refusing to answer. For use cases that depend on broad factual recall, this is a meaningful limitation to test against your workload.

In Text Arena, Qwen3.7-Max-Preview ranked #13 overall with an Elo score of 1,475. Category rankings include #7 in Math, #9 in Expert Prompts, #9 in Software and IT, and #10 in Coding. All benchmark numbers are preliminary. The model carries a 'Preview' label, indicating Alibaba considers it an early build.

Agentic Performance: Internal Test

In an internal Alibaba test on a new chip platform, the model autonomously performed more than 1,000 tool calls and iterative code modifications to optimize a key kernel. Alibaba claimed the process improved inference speed by roughly 10x compared with the previous version.

Marktechpost's Visual Explainer: How to Use Qwen3.7-Max

A practical guide for developers & data scientists, May 2026

What is Qwen3.7-Max?

A proprietary reasoning model from Alibaba, designed for long-horizon agent tasks, code generation, and multi-step automation.
Context Window: 1 million tokens, enough to fit a full mid-sized code repository in a single request.
Reasoning Model: Uses chain-of-thought (extended-thinking mode) before producing a final answer.
Input / Output: Text in, text out. No image input supported in this model.
API String: Use qwen3.7-max when calling via Alibaba Cloud Model Studio.

Apache-compatible API OpenAI & Anthropic spec Preview: no open weights yet

Quick Start: Chat Interface

The fastest way to test Qwen3.7-Max with no API key or setup required.

1. Go to Qwen Chat: Navigate to chat.qwen.ai and create a free account.
2. Select the model: In the model selector dropdown, choose Qwen3.7-Max. It may appear as Qwen3.7-Max-Preview during the preview period.
3. Enable Thinking Mode: Toggle on Thinking Mode in the chat interface. This activates chain-of-thought reasoning and shows the model's internal reasoning trace before the final answer.
4. Send your prompt: Type your query. For best results on complex tasks, be specific about steps, constraints, and expected output format.

💡 Use your hardest real-world prompts when testing. Multi-step math problems, complex refactoring requests, and ambiguous expert questions reveal more about model quality than simple prompts.

API Access

Qwen3.7-Max is compatible with both OpenAI and Anthropic API specifications. You can plug it into existing pipelines with minimal changes.

OpenAI-compatible Python call:

    from openai import OpenAI

    # Point the OpenAI client at the DashScope compatible-mode endpoint.
    client = OpenAI(
        api_key="YOUR_DASHSCOPE_API_KEY",
        base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    )

    # Send a system prompt plus one user message to the model.
    response = client.chat.completions.create(
        model="qwen3.7-max",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain chain-of-thought reasoning."},
        ],
    )
    print(response.choices[0].message.content)

ℹ️ Get your API key from Alibaba Cloud Model Studio (DashScope). The base URL for international access is dashscope-intl.aliyuncs.com.
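The call above hard-codes the API key. A minimal sketch of one way to keep the key out of source code follows; the environment variable name `DASHSCOPE_API_KEY` and both helper functions are assumptions made for illustration, while the base URL and model string are the ones given in the article.

```python
import os

BASE_URL = "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"  # from the article
MODEL = "qwen3.7-max"  # API string given in the article


def load_api_key(env_var="DASHSCOPE_API_KEY"):
    """Read the key from the environment; the variable name is an assumption."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set {env_var} before calling the API.")
    return key


def build_chat_request(user_prompt, system_prompt="You are a helpful assistant."):
    """Return keyword arguments for client.chat.completions.create()."""
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    }
```

With the `openai` package installed, these would be used as `client = OpenAI(api_key=load_api_key(), base_url=BASE_URL)` followed by `client.chat.completions.create(**build_chat_request("Explain chain-of-thought reasoning."))`. Keeping the request builder separate from the network call also makes the payload easy to inspect or unit-test offline.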
⚠️ Pricing has not yet been announced for Qwen3.7-Max. For reference, Qwen3.6 Max Preview was priced at $1.30 / $7.80 per million input/output tokens.
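Until official pricing appears, a budget can only be ballparked. The sketch below uses the Qwen3.6 Max Preview rates as a stand-in, which is an assumption and not a quote for Qwen3.7-Max, and applies them to the output-token volumes the article reports for the Intelligence Index run. The function name is hypothetical, and input tokens are set to zero because the article does not report them.

```python
# Assumption: Qwen3.6 Max Preview pricing is used as a stand-in, because
# Qwen3.7-Max pricing has not been announced.
INPUT_USD_PER_MILLION = 1.30
OUTPUT_USD_PER_MILLION = 7.80


def estimate_cost_usd(input_tokens, output_tokens):
    """Return the estimated charge in US dollars for one workload."""
    input_cost = input_tokens / 1_000_000 * INPUT_USD_PER_MILLION
    output_cost = output_tokens / 1_000_000 * OUTPUT_USD_PER_MILLION
    return round(input_cost + output_cost, 2)


# Output volumes reported for the Intelligence Index run
# (input tokens are not reported, so they are left at 0 here).
print(estimate_cost_usd(0, 97_000_000))  # Qwen3.7-Max: about 97M output tokens
print(estimate_cost_usd(0, 24_000_000))  # benchmark average: 24M output tokens
```

Under these placeholder rates, the roughly fourfold difference in output tokens becomes a roughly fourfold difference in output cost, which is why extended thinking is worth reserving for tasks that benefit from it.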

Agentic AI · Reasoning Model · Qwen3.7-Max · Large Context Window · Alibaba Cloud