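OpenAI has released GPT-5.5, an agent-based model that carries out complex tasks autonomously. The model is strong at coding, web search, and data analysis, and it beats rival models on key benchmarks, particularly in programming and advanced math, though it does not come out on top across the board. API pricing is twice that of GPT-5.4.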
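Translated article
OpenAI has announced GPT-5.5, an agentic model designed to handle complex tasks autonomously across multiple tools. The API price has doubled.
OpenAI unveiled GPT-5.5 as a "new class of intelligence for real work and powering agents." According to OpenAI, the model is built to understand complex goals, use tools, check its own output, and work through tasks independently until they are done. It is available now to paying ChatGPT and Codex users.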
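Agentic workflows are the main selling point
According to OpenAI, GPT-5.5 is especially strong at writing and debugging code, web research, data analysis, creating documents and spreadsheets, and operating software. The model is designed to switch between different tools on its own until a task is finished.
OpenAI sees the biggest improvements in four areas: agentic coding, computer use, knowledge work, and early scientific research. The company says these areas require reasoning across contexts and the ability to carry out actions over extended periods.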
On Terminal-Bench 2.0, a coding benchmark for agentic workflows, GPT-5.5 scores 82.7% according to OpenAI, 7.6 percentage points above its predecessor GPT-5.4 (75.1%). Anthropic's Claude Opus 4.7 scores 69.4% and Google's Gemini 3.1 Pro scores 68.5%.
The gap widens on harder math problems. On FrontierMath Tier 4, GPT-5.5 scores 35.4%, well above Claude Opus 4.7 (22.9%) and Gemini 3.1 Pro (16.7%). The Pro variant, GPT-5.5 Pro, raises that figure to 39.6%.
OpenAI says GPT-5.5 delivers these gains without sacrificing speed. The model reportedly matches GPT-5.4's per-token latency while using significantly fewer tokens to complete the same Codex tasks.
[Benchmark comparison table]
Terminal-Bench 2.0: GPT-5.5 (82.7%) | GPT-5.4 (75.1%) | Claude Opus 4.7 (69.4%) | Gemini 3.1 Pro (68.5%)
FrontierMath Tier 4: GPT-5.5 (35.4%) | GPT-5.5 Pro (39.6%) | GPT-5.4 (27.1%) | Claude Opus 4.7 (22.9%) | Gemini 3.1 Pro (16.7%)
BrowseComp: GPT-5.5 (84.4%) | GPT-5.5 Pro (90.1%) | GPT-5.4 (82.7%) | GPT-5.4 Pro (89.3%) | Claude Opus 4.7 (79.3%) | Gemini 3.1 Pro (85.9%)
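The percentage-point gaps behind these comparisons can be checked with a few lines of Python. This is only an illustrative sketch: the scores are the base-model figures OpenAI reported in the comparison above (Pro variants are left out), and the `scores` dictionary and `gap_to_best_rival` helper are names made up for this example.

```python
# Base-model scores in percent, as reported by OpenAI in the comparison above.
# The dictionary layout and helper name are illustrative, not from the article.
scores = {
    "Terminal-Bench 2.0": {
        "GPT-5.5": 82.7, "GPT-5.4": 75.1,
        "Claude Opus 4.7": 69.4, "Gemini 3.1 Pro": 68.5,
    },
    "FrontierMath Tier 4": {
        "GPT-5.5": 35.4, "GPT-5.4": 27.1,
        "Claude Opus 4.7": 22.9, "Gemini 3.1 Pro": 16.7,
    },
    "BrowseComp": {
        "GPT-5.5": 84.4, "GPT-5.4": 82.7,
        "Claude Opus 4.7": 79.3, "Gemini 3.1 Pro": 85.9,
    },
}


def gap_to_best_rival(benchmark, model="GPT-5.5"):
    """Return (best rival, gap in percentage points).

    A negative gap means `model` trails the best-scoring rival.
    """
    results = scores[benchmark]
    rivals = {name: s for name, s in results.items() if name != model}
    best = max(rivals, key=rivals.get)
    return best, round(results[model] - rivals[best], 1)


for name in scores:
    rival, gap = gap_to_best_rival(name)
    print(f"{name}: {gap:+.1f} pp vs {rival}")
```

On these figures the nearest rival on Terminal-Bench 2.0 and FrontierMath Tier 4 is GPT-5.4 itself, while on BrowseComp the base model trails Gemini 3.1 Pro by 1.5 percentage points.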
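Long-context performance also improved significantly. On the MRCR v2 benchmark, which tests how reliably a model can locate multiple pieces of hidden information across very long texts, GPT-5.5 scores 74.0% at context lengths of 512K to 1M tokens, up from 36.6% for GPT-5.4. On the Graphwalks BFS test with one million tokens, the score rises from 9.4% (GPT-5.4) to 45.4% (GPT-5.5).
The dominance is not total, though. On SWE-Bench Pro, which tests real GitHub issue resolution, Claude Opus 4.7 beats GPT-5.5 with 64.3% versus 58.6%. OpenAI notes, however, that Anthropic itself acknowledged signs of memorization in some of those tasks. GPT-5.5 also trails on MCP Atlas, a tool-use benchmark, scoring 75.3% against 79.1% for Claude Opus 4.7 and 78.2% for Gemini 3.1 Pro.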
OpenAI unveils GPT-5.5, claims a "new class of intelligence" at double the API price
Matthias Bastian, Apr 23, 2026
Key Points
OpenAI has released GPT-5.5, a new agent-based model that can autonomously handle complex tasks like writing code, running online searches, and analyzing data across multiple tools.
The model beats out competitors including Anthropic's Claude Opus 4.7 and Google's Gemini 3.1 Pro on key benchmarks, particularly in programming and advanced math, without sacrificing speed, though it doesn't come out on top across the board.
A more capable GPT-5.5 Pro variant has also launched as an iterative research partner, with both models now available to paying ChatGPT and Codex users on the Plus, Pro, Business, and Enterprise plans, while API access is coming soon at twice the cost.
OpenAI has announced GPT-5.5, an agentic model designed to handle complex tasks autonomously across multiple tools. On paper, it's double the API price.
OpenAI has unveiled GPT-5.5, calling it a "new class of intelligence for real work and powering agents." The model is built to understand complex goals, use tools, check its own output, and work through tasks independently until they're done, OpenAI says. It's available now for paying ChatGPT and Codex users.
Agentic workflows are the main selling point
According to OpenAI, GPT-5.5 is especially strong at writing and debugging code, web research, data analysis, creating documents and spreadsheets, and operating software. The model is designed to switch between different tools on its own until a task is finished.
OpenAI sees the biggest improvements in four areas: agentic coding, computer use, knowledge work, and early scientific research. These areas require reasoning across contexts and the ability to carry out actions over extended periods, the company says.
On Terminal-Bench 2.0, a coding benchmark for agentic workflows, GPT-5.5 scores 82.7 percent according to OpenAI, 7.6 percentage points above its predecessor GPT-5.4 (75.1 percent). Anthropic's Claude Opus 4.7 hits 69.4 percent, and Google's Gemini 3.1 Pro lands at 68.5 percent.
The gap gets even wider on harder math problems. On FrontierMath Tier 4, GPT-5.5 scores 35.4 percent, compared to 22.9 percent for Claude Opus 4.7 and 16.7 percent for Gemini 3.1 Pro. The Pro variant, GPT-5.5 Pro, pushes that number to 39.6 percent.
OpenAI says GPT-5.5 delivers these performance gains without sacrificing speed. The model reportedly matches GPT-5.4's per-token latency while also using significantly fewer tokens to complete the same Codex tasks.
Benchmark | GPT-5.5 | GPT-5.4 | GPT-5.5 Pro | GPT-5.4 Pro | Claude Opus 4.7 | Gemini 3.1 Pro
Terminal-Bench 2.0 | 82.7% | 75.1% | - | - | 69.4% | 68.5%
Expert-SWE (Internal) | 73.1% | 68.5% | - | - | - | -
GDPval (wins or ties) | 84.9% | 83.0% | 82.3% | 82.0% | 80.3% | 67.3%
OSWorld-Verified | 78.7% | 75.0% | - | - | 78.0% | -
Toolathlon | 55.6% | 54.6% | - | - | - | 48.8%
BrowseComp | 84.4% | 82.7% | 90.1% | 89.3% | 79.3% | 85.9%
FrontierMath Tier 1-3 | 51.7% | 47.6% | 52.4% | 50.0% | 43.8% | 36.9%
FrontierMath Tier 4 | 35.4% | 27.1% | 39.6% | 38.0% | 22.9% | 16.7%
CyberGym | 81.8% | 79.0% | - | - | 73.1% | -
OpenAI's benchmark comparison for GPT-5.5. GPT-5.5 Pro was only tested on selected benchmarks. | Table: OpenAI
Long-context performance also improved significantly. On the MRCR v2 benchmark, which tests how reliably a model can locate multiple pieces of hidden information across very long texts, GPT-5.5 jumps to 74.0 percent at context lengths of 512K to 1M tokens, up from 36.6 percent for GPT-5.4. On the Graphwalks BFS test with one million tokens, GPT-5.5 leaps from 9.4 percent (GPT-5.4) to 45.4 percent.
The dominance isn't total, though. On SWE-Bench Pro, which tests real GitHub issue resolution, Claude Opus 4.7 beats GPT-5.5 with 64.3 percent versus 58.6 percent.
OpenAI notes, however, that Anthropic itself acknowledged signs of memorization in some of those tasks. On MCP Atlas, a tool-use benchmark run by Scale AI, GPT-5.5 scores 75.3 percent, trailing both Claude Opus 4.7 (79.1 percent) and Gemini 3.1 Pro (78.2 percent). The base model also falls slightly behind Gemini on BrowseComp, a web research benchmark, with 84.4 percent versus 85.9 percent.
And GPT-5.5 barely moved the needle on GDPval, a benchmark designed to measure real-world task performance across 44 occupations. GPT-5.5 scores 84.9 percent, only a marginal improvement over GPT-5.4's 83.0 percent. OpenAI has published a full overview of all benchmarks.
The model was developed and optimized alongside NVIDIA GB200 and GB300-NVL72 systems. OpenAI says GPT-5.5 and Codex actually helped optimize the company's own serving infrastructure: Codex analyzed production traffic patterns and wrote its own heuristic algorithms for load balancing, resulting in an over 20 percent boost in token generation speed. "The model helped improve the infrastructure that serves it," OpenAI writes.
GPT-5.5 Pro aims to be a "research partner"
Alongside the standard model, OpenAI is launching GPT-5.5 Pro. The company says full-stack inference improvements make the more powerful model much more practical for heavy workloads. Early testers called it an iterative "research partner" that performs best when given rich context from documents and plugins.
So far, OpenAI has only shared GPT-5.5 Pro benchmark results for three of nine tests: BrowseComp, FrontierMath Tier 1-3, and FrontierMath Tier 4. It beats the base model in all three.
Cybersecurity capabilities rated "High"
OpenAI classifies the biological, chemical, and cybersecurity capabilities of GPT-5.5 as "High" in its Preparedness Framework, the same rating as its recent predecessors, but not "Critical."
The model shows improved cybersecurity performance compared to GPT-5.4, scoring 81.8 percent on the CyberGym benchmark (versus 79.0 percent) and 88.1 percent on internal capture-the-flag tasks (versus 83.7 percent). At the same time, OpenAI is rolling out stricter classifiers for potential cyber risk, which could initially lead to more rejections, the company says.
The Trusted Access for Cyber program will give verified security researchers expanded access to cybersecurity capabilities. OpenAI is also working with government partners to protect critical infrastructure. OpenAI has published a system card with additional security details.
Paying users get access first; API pricing doubles over GPT-5.4
GPT-5.5 Thinking is now available for Plus, Pro, Business, and Enterprise users in ChatGPT. GPT-5.5 Pro is limited to Pro, Business, and Enterprise users. In Codex, GPT-5.5 is available for Plus, Pro, Business, Enterprise, Edu, and Go users with a 400K context window. A fast mode generates tokens 1.5 times faster at 2.5 times the cost.
For the API, OpenAI is charging 5 dollars per million input tokens and 30 dollars per million output tokens, with a context window of one million tokens. That is exactly twice what GPT-5.4 costs, at 2.50 and 15 dollars respectively. GPT-5.5 Pro lands at 30 dollars per million input tokens and 180 dollars per million output tokens.
OpenAI argues that despite the higher price tag, GPT-5.5 is more efficient and needs fewer tokens for comparable tasks. There's no word yet on when free users will get access. As for the API, OpenAI says that it's coming "very soon."
Source: OpenAI
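The API prices quoted in the pricing section can be turned into a quick cost comparison. The sketch below is illustrative only: the per-million-token rates are the ones stated in the article, while the workload size (200,000 input and 50,000 output tokens) and the helper names `request_cost` and `break_even_token_fraction` are assumptions made up for this example. It shows that, at double the rates, GPT-5.5 would need to use about half as many tokens as GPT-5.4 on a task to cost the same.

```python
import math

# USD per one million tokens, as quoted in the article.
PRICES = {
    "GPT-5.4": {"input": 2.50, "output": 15.00},
    "GPT-5.5": {"input": 5.00, "output": 30.00},
    "GPT-5.5 Pro": {"input": 30.00, "output": 180.00},
}


def request_cost(model, input_tokens, output_tokens):
    """Cost in USD for a given number of input and output tokens."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


def break_even_token_fraction(new_model, old_model):
    """Fraction of the old model's token usage at which both cost the same.

    Only defined here when input and output rates scale by the same factor,
    which holds for the prices quoted in the article.
    """
    new, old = PRICES[new_model], PRICES[old_model]
    in_ratio = new["input"] / old["input"]
    out_ratio = new["output"] / old["output"]
    if not math.isclose(in_ratio, out_ratio):
        raise ValueError("input and output prices scale differently")
    return 1 / in_ratio


# Hypothetical workload: 200,000 input tokens and 50,000 output tokens.
for model in ("GPT-5.4", "GPT-5.5"):
    print(model, request_cost(model, 200_000, 50_000))

print(break_even_token_fraction("GPT-5.5", "GPT-5.4"))
```

Whether GPT-5.5 actually reaches that break-even point depends on how many fewer tokens it needs in practice, which the article reports only as OpenAI's claim, without figures.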