The Decoder • 73 days ago

Claude Mythos Can Autonomously Develop Browser Exploits

IMP

9/10

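Key Summary

Researchers at Carnegie Mellon University have developed a new benchmark that evaluates how well AI agents can exploit real-world vulnerabilities in Google's V8 JavaScript engine. Anthropic's Claude Mythos Preview significantly outperformed OpenAI's GPT-5.5 and performed on par with a competent human security researcher. However, the Mythos test run cost nearly 12 times as much as the GPT-5.5 run, raising questions about its cost-efficiency.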

Translated Article

According to a new benchmark, Claude Mythos and GPT-5.5 can autonomously develop real browser exploits.

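Key points:

Researchers at Carnegie Mellon University (CMU) have developed a benchmark that evaluates how effectively AI agents can exploit real-world vulnerabilities in Google's V8 JavaScript engine, all the way up to full code execution.
According to the researchers, Anthropic's Claude Mythos Preview significantly outperformed OpenAI's GPT-5.5 and performed on par with a competent human security researcher.
Despite its strong results, Mythos came at a steep price: test costs reached approximately $36,400, more than ten times those for GPT-5.5, raising questions about cost-efficiency.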

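Details: Researchers at Carnegie Mellon University built a new benchmark that measures how far AI agents can go when exploiting real-world vulnerabilities in V8, Google's JavaScript engine. Mythos leads GPT-5.5 by a wide margin, but at a very high cost.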

Unlike previous tests, this benchmark does not just check whether a bug is triggered. It scores progress across five tiers, up to arbitrary code execution, meaning the ability to run any desired command on the target system. V8 is the engine behind systems such as Chrome, Edge, Node.js, and Cloudflare Workers.

With occasional human hints ("nudges"), Anthropic's Claude Mythos Preview averaged 9.90 out of 16 points and reached the highest tier on 21 of 41 vulnerabilities. OpenAI's GPT-5.5 trailed far behind at 5.51 points and reached the highest tier on only two.
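To put the nudged results on a common scale, the reported figures can be expressed as fractions of the maximum. The following is a minimal sketch using only the numbers quoted above; the dictionary layout and function names are illustrative, and it assumes both models were scored against the same 41 vulnerabilities.

```python
# Figures as reported in the article; names and layout are illustrative.
MAX_SCORE = 16   # maximum benchmark score
NUM_VULNS = 41   # vulnerabilities in the test set (assumed shared by both models)

results = {
    "Claude Mythos Preview": {"avg_score": 9.90, "top_tier": 21},
    "GPT-5.5": {"avg_score": 5.51, "top_tier": 2},
}


def normalized_score(avg_score: float) -> float:
    """Average score as a fraction of the maximum score."""
    return avg_score / MAX_SCORE


def top_tier_rate(top_tier: int) -> float:
    """Share of vulnerabilities on which the highest tier was reached."""
    return top_tier / NUM_VULNS


for model, r in results.items():
    print(
        f"{model}: {normalized_score(r['avg_score']):.1%} of max score, "
        f"top tier on {top_tier_rate(r['top_tier']):.1%} of vulnerabilities"
    )
```

On these figures, Mythos reached the highest tier on roughly half of the vulnerabilities, against about one in twenty for GPT-5.5.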

In fully autonomous mode, the gap widens further. Mythos scored 9.55 points, barely lower than with hints, while GPT-5.5 via Codex managed only 4.30. None of the other models tested achieved full code execution (T1).
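The widening gap can be checked with simple arithmetic on the reported averages. This is a rough sketch, not part of the benchmark: the function name is hypothetical, and it assumes the nudged and autonomous GPT-5.5 figures are directly comparable, which the article does not state explicitly.

```python
# Average scores (out of 16) as reported in the article.
scores = {
    "Claude Mythos Preview": {"nudged": 9.90, "autonomous": 9.55},
    "GPT-5.5": {"nudged": 5.51, "autonomous": 4.30},
}


def relative_drop(nudged: float, autonomous: float) -> float:
    """Fraction of the nudged score lost when human hints are removed."""
    return (nudged - autonomous) / nudged


# Per-model drop from nudged to fully autonomous mode.
for model, s in scores.items():
    drop = relative_drop(s["nudged"], s["autonomous"])
    print(f"{model}: {s['nudged']:.2f} -> {s['autonomous']:.2f} ({drop:.1%} drop)")

# Absolute gap between the two models in each mode.
gap = {
    mode: scores["Claude Mythos Preview"][mode] - scores["GPT-5.5"][mode]
    for mode in ("nudged", "autonomous")
}
print(f"Gap with nudges: {gap['nudged']:.2f}, fully autonomous: {gap['autonomous']:.2f}")
```

Under that assumption, Mythos loses about 3.5% of its score without hints while GPT-5.5 loses about 22%, and the absolute gap grows from 4.39 to 5.25 points.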

The cost difference is just as stark. According to ExploitBench, the full Mythos test run across 122 episodes cost about $36,428, whereas GPT-5.5 via Codex ran 123 episodes for about $3,075, roughly one-twelfth of the cost. The UK's AI Safety Institute also found in a recent test that Mythos performs somewhat better than GPT-5.5 but at a much higher cost. The price gap suggests that OpenAI could narrow the performance gap by spending more compute.
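The "about twelve times" figure follows directly from the reported totals. The sketch below reproduces it and adds a per-episode view; the variable and function names are illustrative, and it assumes the reported totals cover exactly the stated number of episodes.

```python
# Total run cost and episode counts as reported in the article.
runs = {
    "Claude Mythos Preview": {"cost_usd": 36_428, "episodes": 122},
    "GPT-5.5 via Codex": {"cost_usd": 3_075, "episodes": 123},
}


def cost_per_episode(cost_usd: float, episodes: int) -> float:
    """Average cost of a single episode in US dollars."""
    return cost_usd / episodes


per_episode = {
    model: cost_per_episode(r["cost_usd"], r["episodes"]) for model, r in runs.items()
}
for model, cost in per_episode.items():
    print(f"{model}: ${cost:,.2f} per episode")

# Ratio of total costs and of per-episode costs (Mythos relative to GPT-5.5).
total_ratio = runs["Claude Mythos Preview"]["cost_usd"] / runs["GPT-5.5 via Codex"]["cost_usd"]
episode_ratio = per_episode["Claude Mythos Preview"] / per_episode["GPT-5.5 via Codex"]
print(f"Total cost ratio: {total_ratio:.2f}x, per-episode ratio: {episode_ratio:.2f}x")
```

Both ratios come out just under 12, at roughly $299 per Mythos episode against $25 per GPT-5.5 episode, which is consistent with the article's rounding.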

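Mythos works like a "fairly competent" browser security researcher

ExploitBench co-author Seunghyun Lee, himself an experienced security researcher with more than 20 reported browser vulnerabilities, reviewed the Mythos transcripts one by one. His conclusion is that the model works like a "fairly competent browser / JS engine security researcher."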

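In one case, Mythos developed an exploit technique that Lee and a colleague had previously dismissed as too complex. In another, according to Lee, it reproduced a vulnerability (CVE-2024-0519) that human researchers had failed to crack for over a year. The researchers acknowledge that the tested bugs are publicly known and that models could in theory draw on training data. However, the dataset also includes vulnerabilities with no public exploit or bug report. The benchmark does not yet measure the ability to find new flaws or to fully weaponize an exploit for real attacks.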

The benchmark is available on GitHub, and the paper is on arXiv. Anthropic and OpenAI provided API credits; the authors say all analysis was done independently.

Original Text (English)

New benchmark shows Claude Mythos and GPT-5.5 can develop real browser exploits autonomously

Matthias Bastian
May 16, 2026

Key Points

Researchers at Carnegie Mellon University have developed a benchmark that evaluates how effectively AI agents can exploit real-world vulnerabilities in Google's V8 JavaScript engine, all the way up to full code execution.
Anthropic's Claude Mythos Preview model significantly outperformed OpenAI's GPT-5.5, performing on par with a competent human security researcher, according to the researchers.
Despite its strong results, Mythos came at a steep price: test costs reached approximately $36,400, more than ten times higher than those for GPT-5.5, raising questions about the cost-efficiency.

Researchers at Carnegie Mellon University built a new benchmark that measures how far AI agents can go when exploiting real-world vulnerabilities in Google's JavaScript engine V8. Mythos leads GPT-5.5 by a wide margin, but it costs a fortune.

Unlike previous tests, the benchmark doesn't just check whether a bug gets triggered. It scores progress across five tiers, all the way up to arbitrary code execution, running whatever commands you want on the target system. V8 powers systems like Chrome, Edge, Node.js, and Cloudflare Workers.

Anthropic's Claude Mythos Preview, with occasional human hints ("nudges"), hit an average score of 9.90 out of 16 and reached the highest tier on 21 of 41 vulnerabilities. OpenAI's GPT-5.5 trailed far behind at 5.51 points, reaching the top tier on just two.

The gap gets even wider in fully autonomous mode. Mythos scored 9.55 points there, barely any drop. GPT-5.5 via Codex managed only 4.30. None of the other tested models achieved full code execution (T1).

The price tags differ sharply: the full Mythos test run across 122 episodes cost about $36,428, according to ExploitBench. GPT-5.5 via Codex ran 123 episodes for roughly $3,075, about twelve times cheaper. The UK's AI Safety Institute also confirmed that Mythos performs somewhat better than GPT-5.5 but at a much higher cost in a recent test. The price gap suggests OpenAI could close the performance gap by throwing more compute at the problem.

Mythos works like a "fairly competent" browser security researcher

ExploitBench co-author Seunghyun Lee, himself an experienced security researcher with over 20 reported browser vulnerabilities, reviewed the Mythos transcripts one by one. His takeaway: the model works like a "fairly competent browser / JS engine security researcher."

In one case, Mythos developed an exploit technique that Lee and a colleague had previously dismissed as too complex. In another, it reproduced a vulnerability (CVE-2024-0519) that human researchers had failed to crack for over a year, according to Lee. The researchers acknowledge that the tested bugs are publicly known, and models could theoretically draw on training data. But the dataset also includes vulnerabilities with no public exploit or bug report. The benchmark doesn't yet measure the ability to find new flaws or fully weaponize an exploit for real attacks.

The benchmark is available on GitHub, and the paper is on arXiv. Anthropic and OpenAI provided API credits; the authors say all analysis was done independently.

Security, benchmark, Claude, GPT-5.5, vulnerability analysis