r/LocalLLaMA • 93 days ago

SWE-bench has effectively hit its limit due to contamination

IMP

8/10

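Key summary

A new analysis finds that SWE-bench Verified, long the standard benchmark for coding AI performance, can no longer properly measure the coding ability of the latest frontier models because of data contamination and flawed test cases. Evaluation data has been exposed to models during training, so scores rise through memorized prior knowledge rather than genuine gains in skill. OpenAI, which published the analysis, accordingly recommends reporting results on SWE-bench Pro instead.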

Translated text

February 23, 2026, Research Publication

Why SWE-bench Verified no longer measures frontier coding capabilities

SWE-bench Verified is increasingly contaminated. We recommend SWE-bench Pro.

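Since we first published SWE-bench Verified in August 2024, the industry has widely used it to measure the progress of models on autonomous software engineering tasks. After its release, SWE-bench Verified provided a strong signal of capability progress and became a standard metric reported in frontier model releases. Tracking and forecasting progress of these capabilities is also an important part of OpenAI's Preparedness Framework.

When we created the Verified benchmark initially, we attempted to solve issues in the original evaluation that made certain tasks in the SWE-bench dataset impossible to accomplish. After initial leaps, state-of-the-art progress on SWE-bench Verified has slowed, improving only from 74.9% to 80.9% in the last 6 months. This raises the question: do the remaining failures reflect model limitations or properties of the dataset itself?

In a new analysis, we found two major issues with the Verified set that indicate the benchmark is no longer suitable for measuring progress on autonomous software engineering capabilities at today's performance levels.

Tests reject correct solutions: We audited a 27.6% subset of the dataset that models often failed to solve and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submissions. This is despite our best efforts to improve on this in the initial creation of SWE-bench Verified.

Training on solutions: Because large frontier models can learn information from their training, it is important that they are never trained on the problems and solutions they are evaluated on. This is akin to sharing the problems and answers for an upcoming test with students before the test: they may not memorize the answers, but students who have seen them will certainly do better than those who have not. SWE-bench problems are sourced from open-source repositories that many model providers use for training purposes. In our analysis, every frontier model we tested was able to reproduce either the original, human-written bug fix used as the ground-truth reference, known as the gold patch, or verbatim specifics of the problem statement for certain tasks. This indicates that all of them have seen at least some of the problems and solutions during training. We also found evidence that models that have seen the problems during training are more likely to succeed, because they have the additional information needed to pass the underspecified tests.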
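To make the idea of "reproducing the gold patch" concrete, the following is a minimal sketch of one way verbatim overlap between a model's output and a reference patch could be scored. It is not the method used in the analysis, which this excerpt does not describe; the function name `overlap_ratio`, the whitespace normalization, and the toy patches are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def overlap_ratio(model_output: str, gold_patch: str) -> float:
    """Line-level similarity in [0, 1] after stripping whitespace and blank lines."""

    def normalize(text: str) -> list[str]:
        return [line.strip() for line in text.splitlines() if line.strip()]

    matcher = SequenceMatcher(None, normalize(model_output), normalize(gold_patch))
    return matcher.ratio()


# Toy patches, invented for this example only.
gold = "def f(x):\n    return x + 1\n"
same_modulo_whitespace = "def f(x):\n  return x + 1"
unrelated = "def g(y):\n    return y * 2\n"

print(overlap_ratio(same_modulo_whitespace, gold))
print(overlap_ratio(unrelated, gold))
```

A score near 1 on a task the model was never shown the fix for would be a hint of memorization, though any threshold would have to be calibrated; a high score alone does not prove contamination, since short fixes can coincide by chance.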

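This means that improvements on SWE-bench Verified no longer reflect meaningful improvements in models' real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time. This is why we have stopped reporting SWE-bench Verified scores, and we recommend that other model developers do so too.

We are building new, uncontaminated evaluations to better track coding capabilities, and we think this is an important area for the wider research community to focus on. Until we have those, OpenAI recommends reporting results for SWE-bench Pro.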

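Background

The original SWE-bench evaluation was released in 2023. Each problem is sourced from a resolved GitHub issue in one of 12 open-source Python repositories and paired with the corresponding pull request (PR). To determine whether a model-generated code change is correct, each problem comes with two sets of tests:

Tests that fail on the unmodified codebase but pass if the issue is correctly fixed
Regression tests that pass both before and after the fix to ensure unrelated functionality remains intact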
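The two test sets above imply a simple pass rule, which the original post states explicitly: a problem counts as solved only if all tests pass after the code change is applied. A minimal sketch of that rule follows; the `TaskResult` class and its field names are hypothetical and are not taken from the benchmark harness.

```python
from dataclasses import dataclass, field


@dataclass
class TaskResult:
    """Test outcomes after applying a model's patch (hypothetical structure)."""

    # Tests that failed before the fix and must pass after it.
    fail_to_pass: dict[str, bool] = field(default_factory=dict)
    # Regression tests that must keep passing.
    pass_to_pass: dict[str, bool] = field(default_factory=dict)


def is_resolved(result: TaskResult) -> bool:
    """Solved only if there is at least one target test and every test passes."""
    return (
        bool(result.fail_to_pass)
        and all(result.fail_to_pass.values())
        and all(result.pass_to_pass.values())
    )


fixed = TaskResult({"test_fix": True}, {"test_unrelated": True})
regressed = TaskResult({"test_fix": True}, {"test_unrelated": False})
print(is_resolved(fixed), is_resolved(regressed))  # True False
```

Because the rule is all-or-nothing, a single overly specific test is enough to mark an otherwise correct fix as a failure.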

The model does not see these tests. Only the original issue text and

Original text (English)

February 23, 2026, Research Publication

Why SWE-bench Verified no longer measures frontier coding capabilities

SWE-bench Verified is increasingly contaminated. We recommend SWE-bench Pro.

Since we first published SWE-bench Verified in August 2024, the industry has widely used it to measure the progress of models on autonomous software engineering tasks. After its release, SWE-bench Verified provided a strong signal of capability progress and became a standard metric reported in frontier model releases. Tracking and forecasting progress of these capabilities is also an important part of OpenAI’s Preparedness Framework.

When we created the Verified benchmark initially, we attempted to solve issues in the original evaluation that made certain tasks impossible to accomplish in the SWE-bench dataset. After initial leaps, state-of-the-art progress on SWE-bench Verified has slowed, improving from 74.9% to 80.9% in the last 6 months. This raises the question: do the remaining failures reflect model limitations or properties of the dataset itself?

In a new analysis, we found two major issues with the Verified set that indicate the benchmark is no longer suitable for measuring progress on autonomous software engineering capabilities for frontier launches at today’s performance levels:

Tests reject correct solutions: We audited a 27.6% subset of the dataset that models often failed to solve and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submissions, despite our best efforts in improving on this in the initial creation of SWE-bench Verified.

Training on solutions: Because large frontier models can learn information from their training, it is important that they are never trained on problems and solutions they are evaluated on. This is akin to sharing problems and solutions for an upcoming test with students before the test - they may not memorize the answer but students who have seen the answers before will certainly do better than those without. SWE-bench problems are sourced from open-source repositories many model providers use for training purposes. In our analysis we found that all frontier models we tested were able to reproduce the original, human-written bug fix used as the ground-truth reference, known as the gold patch, or verbatim problem statement specifics for certain tasks, indicating that all of them have seen at least some of the problems and solutions during training. We also found evidence that models that have seen the problems during training are more likely to succeed, because they have additional information needed to pass the underspecified tests.

This means that improvements on SWE-bench Verified no longer reflect meaningful improvements in models’ real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time. This is why we have stopped reporting SWE-bench Verified scores, and we recommend that other model developers do so too.

We’re building new, uncontaminated evaluations to better track coding capabilities, and we think this is an important area to focus on for the wider research community. Until we have those, OpenAI recommends reporting results for SWE-bench Pro.

Background

The original SWE-bench evaluation was released in 2023. Each problem is sourced from a resolved GitHub issue in one of 12 open-source Python repositories and paired with the corresponding pull request (PR). To determine whether a model-generated code change is correct, each problem comes with two sets of tests:

Tests that fail on the unmodified codebase but pass if the issue is correctly fixed
Regression tests that pass both before and after the fix to ensure unrelated functionality remains intact

The model does not see the tests. It has to produce a code change given only the original issue text and the state of the repository before the fix. It passes a problem only if all tests pass after the code change is applied.

We found many issues with that evaluation that could lead to underreporting the capability of models.

Some unit tests were overly specific or misaligned with the task so correct fixes could be rejected.
Many task statements were underspecified, which could lead to multiple valid interpretations - while the tests only covered a specific one.
Depending on setup of the environment (for example Linux vs Windows, or the Python version), some tests could spuriously fail.

We created SWE-bench Verified in 2024 to address these issues. We worked with expert software engineers to review 1,699 SWE-bench problems and filter out problems that had these issues. Each problem was reviewed by three experts independently. This review process resulted in SWE-bench Verified, a curated set of 500 problems.

Too narrow and too wide tests

While SWE-bench Verified is a big improvement over the initial version, residual issues remain. We conducted an audit of 138 SWE-bench Verified problems that OpenAI o3 did not consistently solve over 64 independent runs. Each case was independently reviewed by at least six experienced software engineers. If an expert flagged an issue, it was re-verified by an additional team.

We found that 59.4% of the 138 problems contained material issues in test design and/or problem description, rendering them extremely difficult or impossible even for the most capable model or human to solve.
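The audit figures are mutually consistent, which a few lines of arithmetic can confirm. The sketch below uses only numbers quoted in the post (500 problems, 138 audited, 59.4% flawed, and the three category shares from the post's breakdown). The per-category counts are inferred by rounding and are not stated in the post, so treat them as an assumption.

```python
TOTAL_VERIFIED = 500  # curated problems in SWE-bench Verified
AUDITED = 138         # problems o3 did not consistently solve

# Share of the full benchmark that was audited.
audited_share = AUDITED / TOTAL_VERIFIED

# Number of flawed problems implied by the 59.4% figure (inferred by rounding).
flawed_count = round(0.594 * AUDITED)

# Category shares from the post's breakdown; counts are inferred by rounding.
shares = {"narrow tests": 0.355, "wide tests": 0.188, "miscellaneous": 0.051}
counts = {name: round(share * AUDITED) for name, share in shares.items()}

print(f"audited share: {audited_share:.1%}")        # audited share: 27.6%
print(f"flawed problems: {flawed_count}")           # flawed problems: 82
print(counts, "sum:", sum(counts.values()))
```

Under these rounding assumptions the three categories come to 49, 26, and 7 problems, which add up to the same 82 implied by the headline 59.4%.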
35.5% of the audited tasks have strict test cases that enforce specific implementation details, invalidating many functionally correct submissions, which we call narrow test cases.
18.8% of the audited tasks have tests that check for additional functionality that wasn’t specified in the problem description, which we call wide test cases.
The remaining 5.1% of tasks had miscellaneous issues that were not well grouped with this taxonomy.

An illustrative example of the first failure mode is pylint-dev__pylint-4551, where the PR introduces a new function `get_annotation` as part of the overall solution. This function name is not mentioned in the problem description, but is imported directly by the tests. While some models might intuit to create such a function, it’s not strictly necessary to implement a function with this specific name to correctly address the problem. Many valid solutions fail the tests on import errors.

Problem description (Plain Text)

    Use Python type hints for UML generation
    It seems that pyreverse does not read python type hints (as defined by [PEP 484](https://www.python.org/dev/peps/pep-0484/)), and this does not help when you use `None` as a default value :
    ### Code example
    `
    class C(object):
        def __init__(self, a: str = None):
            self.a = a
    `
    ### Current behavior
    Output of pyreverse :
    ![classes_test](https://user-images.githubusercontent.com/22218701/27432305-f10fe03e-574f-11e7-81fa-e2b59e493360.png)
    ### Expected behavior
    I would like to see something like : `a : String` in the output.
    ### pylint --version output
    pylint-script.py 1.6.5,
    astroid 1.4.9
    Python 3.6.0 |Anaconda custom (64-bit)| (default, Dec 23 2016, 11:57:41) [MSC v.1900 64 bit (AMD64)]

PR test snippet (Python)

    + from pylint.pyreverse.utils import get_annotation, get_visibility, infer_node

PR test failures (truncated for readability)

    ==================================== ERRORS ====================================
    _____________ ERROR collecting tests/unittest_pyreverse_writer.py ______________
    ImportError while importing test module '/testbed/tests/unittest_pyreverse_writer.py'.
    Hint: make sure your test modules/packages have valid Python names.
    Traceback:
    /opt/miniconda3/envs/testbed/lib/python3.9/importlib/__init__.py:127: in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
    tests/unittest_pyreverse_writer.py:32: in
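The narrow-test failure described above can be reproduced in miniature. The sketch below is a toy stand-in, not pylint code: the module name `candidate_utils`, the helper `annotation_name`, and its logic are all hypothetical. It shows a patch that provides the requested behaviour under a different helper name, and a test-style import of `get_annotation` that fails before any assertion runs.

```python
import sys
import types

# Hypothetical stand-in for a model's patch: it solves the task with a helper
# called `annotation_name`, not the `get_annotation` name used in the PR.
candidate = types.ModuleType("candidate_utils")


def annotation_name(default, hint=None):
    """Return the type name to display for an argument (toy logic)."""
    return hint.__name__ if hint is not None else type(default).__name__


candidate.annotation_name = annotation_name
sys.modules["candidate_utils"] = candidate

# The behaviour asked for in the issue is available under the patch's own name.
behaviour_ok = candidate.annotation_name(None, str) == "str"

# A test that imports the PR's specific helper name fails at import time,
# so the patch is rejected regardless of its behaviour.
try:
    from candidate_utils import get_annotation  # noqa: F401
    import_ok = True
except ImportError:
    import_ok = False

print(behaviour_ok, import_ok)  # True False
```

The rejection here says nothing about whether the fix works; it only reflects that the test was written against one particular implementation.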

benchmark contamination, coding AI, SWE-bench, evaluation metrics, LLM limitations