Hacker News • 63 days ago

DeepSWE: A Contamination-Free, Long-Horizon Coding Agent Benchmark

IMP

8/10

Key Summary

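DeepSWE, a new software engineering benchmark, has been released. It is designed to overcome the limitations of SWE-bench Pro and to rule out data contamination at the source. Its tasks resemble real development work: they are complex, and the agent has to explore the codebase and solve the problem on its own. GPT-5.5 posted the top score with a 70% resolution rate.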

Translated Body

DeepSWE is a long-horizon software engineering benchmark that delivers four major advances over today's public benchmarks:

Contamination free: Tasks are written from scratch, not adapted from existing commits or PRs, so no model has seen the solution during pretraining.
High diversity: Tasks span a broad pool of 91 repositories across 5 programming languages.
Real-world complexity: Prompts are half the length of SWE-bench Pro's, yet solutions require 5.5x more code and about 2x more output tokens.
Reliable verification: Verifiers are hand-written to test software behavior rather than implementation details.

Existing benchmarks fall short on these axes. SWE-bench Pro, the leading coding agent benchmark, has tasks that take an average of just 120 lines of code to solve, and our audit found that its verifier misgrades agent outputs at rates of 8% false positives and 24% false negatives. Frontier AI labs are also increasingly concerned about benchmark contamination.

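By contrast, DeepSWE allows a sharper comparison of frontier coding agents. Models that look similar on public benchmarks separate into wide, ordered gaps that match the differences developers experience in real agent workflows.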

Leaderboard
Models (12 / 16)

gpt-5.5 [xhigh]: 70% ± 4%
gpt-5.4 [xhigh]: 56% ± 5%
claude-opus-4.7 [max]: 54% ± 5%
claude-sonnet-4.6 [high]: 32% ± 4%
gemini-3.5-flash [medium]: 28% ± 4%
gpt-5.4-mini [xhigh]: 24% ± 4%
kimi-k2.6: 24% ± 4%
mimo-v2.5-pro: 19% ± 4%
glm-5.1: 18% ± 4%
gemini-3.1-pro: 10% ± 3%
deepseek-v4-pro: 8% ± 2%
gemini-3-flash: 5% ± 2%

All models were run with mini-swe-agent. A comparison against other harnesses is available in the 'Why mini-swe-agent' section.
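
The leaderboard does not say how its ± margins are computed. As a rough sanity check, and purely as an assumption, the sketch below treats each score as a pass rate over the 113 tasks reported for DeepSWE and computes a binomial standard error. The names `binomial_standard_error` and `N_TASKS` are illustrative and not part of the benchmark's tooling.

```python
import math


def binomial_standard_error(pass_rate: float, n_tasks: int) -> float:
    """Standard error of a pass rate estimated from n independent pass/fail tasks."""
    if n_tasks <= 0:
        raise ValueError("n_tasks must be positive")
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)


# Task count reported for DeepSWE; using it as the sample size is an assumption.
N_TASKS = 113

# Three scores taken from the leaderboard above.
for name, rate in [("gpt-5.5", 0.70), ("gpt-5.4", 0.56), ("gemini-3.1-pro", 0.10)]:
    se = binomial_standard_error(rate, N_TASKS)
    print(f"{name}: {rate:.0%} ± {se:.0%}")
```

Under this assumption the margins come out at about 4, 5, and 3 percentage points, the same size as the reported ones. The benchmark's actual method may still differ, for example if several rollouts per task are averaged.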

You can view the benchmark on GitHub, browse every rollout (trajectory) behind the numbers above, or run your own agent against the benchmark.

Overview

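Long-horizon work, realistic and short prompts

DeepSWE prompts match the way developers talk to their agents: behavior-focused, short, and free of large interface-definition blocks, rather than overly verbose and prescriptive. Agents must discover for themselves where and how to implement the change, so a substantial share of the capabilities being evaluated involves end-to-end exploration rather than the mere execution of an overspecified engineering task.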

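Public benchmarks sourced from GitHub issues and pull requests often carry more detail: reproduction steps, additional context, code snippets, and tests that assume specific symbols or signatures. DeepSWE instead scores observable behavior, so prompts stay short and natural even when the underlying tasks are much longer.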

DeepSWE tasks are larger in scope and less specified, reflecting real software engineering (SWE) work.

Mean prompt length (characters): SWE-Bench Verified 1,700 / SWE-Bench Pro 4,614 / DeepSWE 2,158
Mean reference solution lines added: SWE-Bench Verified 10 / SWE-Bench Pro 120 / DeepSWE 668
Mean files edited per reference solution: SWE-Bench Verified 1 / SWE-Bench Pro 5 / DeepSWE 7
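
The headline ratios quoted earlier ("half the prompt length", "5.5x more code") can be cross-checked against these means. This is a minimal sketch, assuming those ratios compare DeepSWE with SWE-Bench Pro on the figures above. The `stats` dictionary and the `ratio` helper are illustrative names.

```python
# Mean figures quoted in the comparison above.
stats = {
    "SWE-Bench Verified": {"prompt_chars": 1700, "lines_added": 10, "files_edited": 1},
    "SWE-Bench Pro": {"prompt_chars": 4614, "lines_added": 120, "files_edited": 5},
    "DeepSWE": {"prompt_chars": 2158, "lines_added": 668, "files_edited": 7},
}


def ratio(metric: str, numerator: str = "DeepSWE", denominator: str = "SWE-Bench Pro") -> float:
    """Ratio of one benchmark's mean to another's for the given metric."""
    return stats[numerator][metric] / stats[denominator][metric]


print(f"prompt length ratio: {ratio('prompt_chars'):.2f}")  # 0.47, roughly half
print(f"lines added ratio:   {ratio('lines_added'):.2f}")   # 5.57, in line with "5.5x"
print(f"files edited ratio:  {ratio('files_edited'):.2f}")  # 1.40
```

The same helper also compares against SWE-Bench Verified, for example `ratio("lines_added", denominator="SWE-Bench Verified")`.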

Broad repository coverage

DeepSWE contains 113 tasks spanning 91 active open-source repositories across 5 languages: TypeScript, Go, Python, JavaScript, and Rust. Sampling at this scale makes Dee


Original Text (English)

DeepSWE is a long-horizon software engineering benchmark that delivers four major advances over today's public benchmarks:

Contamination free: Tasks are written from scratch, not adapted from existing commits or PRs, so no model has seen the solution during pretraining.
High diversity: Tasks span a broad pool of 91 repositories across 5 languages.
Real-world complexity: Prompts are half the length of SWE-bench Pro's, yet solutions require 5.5x more code and ~2x more output tokens.
Reliable verification: Verifiers are hand-written to test software behavior rather than implementation details.

Existing benchmarks fall short on several of these axes. SWE-bench Pro, the leading agentic coding benchmark, has tasks averaging just 120 lines of code to solve, and our audit found its verifier misgrades agent outputs at rates of 8% false positives and 24% false negatives. Frontier labs are also raising growing concerns about benchmark contamination.

By contrast, DeepSWE produces a sharper comparison of frontier coding agents. Models that appear close together on public benchmarks separate into wide, ordered gaps that match the differences developers see in day-to-day agent workflows.

Leaderboard
Models (12 / 16)

gpt-5.5 [xhigh]: 70% ± 4%
gpt-5.4 [xhigh]: 56% ± 5%
claude-opus-4.7 [max]: 54% ± 5%
claude-sonnet-4.6 [high]: 32% ± 4%
gemini-3.5-flash [medium]: 28% ± 4%
gpt-5.4-mini [xhigh]: 24% ± 4%
kimi-k2.6: 24% ± 4%
mimo-v2.5-pro: 19% ± 4%
glm-5.1: 18% ± 4%
gemini-3.1-pro: 10% ± 3%
deepseek-v4-pro: 8% ± 2%
gemini-3-flash: 5% ± 2%

All models are run with mini-swe-agent; see Why mini-swe-agent for a comparison against other harnesses.

Explore

View the benchmark on GitHub, browse every rollout behind the numbers above, or run your own agent against the benchmark.

Overview

1. Long-horizon work, realistic and short prompts

DeepSWE prompts are aligned with the way developers talk to their agents: behavior-focused, short, and free of large interface-definition blocks, rather than overly verbose and prescriptive. Agents must discover where and how to implement the change, so a substantial share of the capabilities being evaluated involve end-to-end exploration instead of just the execution of an overspecified engineering task.

Public benchmarks sourced from GitHub issues and pull requests often carry more detail: reproduction steps, additional context, code snippets, and tests that assume specific symbols or signatures. DeepSWE instead scores observable behavior, which lets prompts stay short and natural even when the underlying tasks are substantially longer.

[Figure: DeepSWE tasks are larger in scope and less specified, reflecting real SWE work]
Mean prompt length (characters): SWE-Bench Verified 1,700 / SWE-Bench Pro 4,614 / DeepSWE 2,158
Mean reference solution lines added: SWE-Bench Verified 10 / SWE-Bench Pro 120 / DeepSWE 668
Mean files edited per reference solution: SWE-Bench Verified 1 / SWE-Bench Pro 5 / DeepSWE 7

2. Broad repository coverage

DeepSWE contains 113 tasks spanning 91 active open-source repositories across 5 languages: TypeScript, Go, Python, JavaScript, and Rust. Sampling at this scale makes DeepSWE a much stronger proxy for the real-world utility of coding agents: whether they can make useful, well-scoped changes across varied codebases with different levels of structure, documentation, and maintenance.

Existing public benchmarks are much more concentrated. SWE-Bench Pro Public spans 11 repositories, and SWE-Bench Verified spans 12, with many tasks drawn from prominent, heavily maintained projects. That is a narrower setting than the range of projects developers bring to coding agents in practice.

[Figure: Language distribution (tasks)]
typescript 35 (31%) / go 34 (30%) / python 34 (30%) / javascript 5 (4%) / rust 5 (4%)

[Figure: Libraries to large frameworks, across five languages. 91 repositories. Dot size is task count; color is primary language. Axes: GitHub stars, files in default branch.]
Legend: TypeScript 27 / Go 28 / Python 27 / JavaScript 4 / Rust 5

3. Novel tasks test problem-solving, not recall

Every DeepSWE task is original: the reference solution is written from scratch rather than copied or adapted from an existing pull request, commit, or public patch. Some tasks are motivated by unresolved GitHub issues, but the fix itself is new. DeepSWE tasks are also never merged back into the upstream repositories, so they do not enter the public GitHub record and are unlikely to appear in future pre-training corpora scraped from open source.

This makes DeepSWE a cleaner test of whether an agent can solve a novel software engineering problem, rather than recall, retrieve, or rediscover a public fix. Benchmarks sourced from existing commits have an inherent contamination risk because the corresponding implementations, tests, and discussions are already online. Our SWE-Bench Pro audit found that this risk is not hypothetical, with solution leakage and false positives affecting roughly 8% of audited rollouts.

4. Verifiers reward correctness across many valid implementations

A benchmark is only as good as its verifier. In SWE benchmarks, the verifier should approximate the task's behavioral specification: it should determine whether the submitted code implements the requested change, while remaining agnostic to the particular implementation strategy. DeepSWE's verifiers are purpose-written from the task description with this goal in mind, and will accept any solution that implements the requested behavior.

This differs from a common benchmark construction pattern in which the verifier is inherited from a merged pull request's test suite. These tests provide useful signal, but they are not necessarily designed as complete graders for arbitrary future submissions. They can therefore miss valid solutions, or pass submissions that satisfy the tests without fully satisfying the task.

To quantify this, we drew 30 tasks at random from DeepSWE and SWE-Bench Pro and ran 3 rollouts across 10 frontier agent configurations.
An LLM then analyzed each trajectory along with the task definition, reference solution and verifier output and then issued an independent verdict on whether the patch actually implemented the requested behavior. The analyzer's verdict could disagree with the verifier in two directions:

False positives: the verifier passed but the AI judge concludes the patch does not actually implement the requested behavior.
False negatives: the verifier failed but the AI judge concludes the patch is a reasonable solution to the prompt.

[Figure: DeepSWE verifiers align more closely with real task success]
False positive rate (verifier accepted a wrong implementation): SWE-Bench Pro 8.5% / DeepSWE 0.3%
False negative rate (verifier rejected a correct implementation): SWE-Bench Pro 24.0% / DeepSWE 1.1%
n = 735 DeepSWE / 789 SWE-Bench Pro reviewed rollouts. Excludes trials with API errors, timeouts, and other transient harness failures from each denominator.
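
The two disagreement rates defined above can be sketched as a simple tally. This is only an illustration under stated assumptions: the source does not publish its audit code, the denominator is assumed to be all reviewed rollouts (as the "n =" note suggests), and `ReviewedRollout`, `disagreement_rates`, and the toy data are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ReviewedRollout:
    verifier_passed: bool  # verdict of the benchmark's verifier
    judge_accepts: bool    # independent verdict of the LLM judge


def disagreement_rates(rollouts: Sequence[ReviewedRollout]) -> Tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate) over all reviewed rollouts."""
    n = len(rollouts)
    if n == 0:
        raise ValueError("no reviewed rollouts")
    # False positive: verifier passed, judge says the patch is wrong.
    false_pos = sum(1 for r in rollouts if r.verifier_passed and not r.judge_accepts)
    # False negative: verifier failed, judge says the patch is reasonable.
    false_neg = sum(1 for r in rollouts if not r.verifier_passed and r.judge_accepts)
    return false_pos / n, false_neg / n


# Toy data, made up for illustration only (not taken from the audit).
toy = (
    [ReviewedRollout(True, True)] * 4
    + [ReviewedRollout(False, False)] * 3
    + [ReviewedRollout(True, False)] * 1   # one false positive
    + [ReviewedRollout(False, True)] * 2   # two false negatives
)
fp_rate, fn_rate = disagreement_rates(toy)
print(f"false positive rate: {fp_rate:.1%}, false negative rate: {fn_rate:.1%}")
```

Dividing by all reviewed rollouts, rather than only by verifier passes or failures, is a design choice made here for simplicity. A different denominator would give different percentages.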

Benchmark Coding Agent DeepSWE Software Engineering AI Model Evaluation