The Decoder • 112일 전

GPT-2에서 클로드 미토스까지: '출시엔 위험'했던 AI의 귀환

IMP

8/10

핵심 요약

과거 OpenAI가 GPT-2의 전면 공개를 미뤘던 논쟁이 Anthropic의 신규 모델 'Claude Mythos'를 통해 다시금 주목받고 있습니다. 이번에는 가드레일과 안전성 평가를 거친 후 공개하는 업계의 기존 방식을 넘어, 보안 취약점 발견에 특화된 모델을 통제된 환경에서만 방어적 목적으로 배포하는 'Project Glasswing'이 소개되었습니다. 글로벌 테크 기업들이 연합에 참여하며, 단순히 모델을 보류하는 것이 아닌 철저히 통제하며 활용하는 새로운 안전 기준을 제시하고 있습니다.

번역된 본문

7년 전, OpenAI는 자사의 언어모델 GPT-2를 '공개하기엔 너무 위험하다'고 선언했습니다. 당시 업계는 이를 콧웃음 쳤습니다. 이제 Anthropic이 'Claude Mythos Preview(클로드 미토스 프리뷰)'를 통해 같은 행보를 반복하고 있지만, 이번에는 확실한 증거가 존재합니다. 바로 인간이 검토하기조차 벅찬 수천 개의 운영체제 및 브라우저 취약점을 AI가 발견해냈다는 것입니다.

2019년 2월, OpenAI는 너무나도 정교하게 가짜 뉴스를 생성할 수 있는 언어모델을 공개하지 않기로 결정하면서 대중의 이목을 끌었습니다. AI 연구 커뮤니티의 일부는 이를 현명한 예방 조치로 평가했지만, 다른 일부는 단순한 홍보(PR) 수단으로 일축했습니다. OpenAI는 놀라운 텍스트 생성 능력과 악용에 대한 우려를 이유로 15억 파라미터 규모의 전체 모델 공개를 보류했습니다.

최초 발표 6개월 후, OpenAI의 정책팀은 이 결정의 영향에 대한 보고서를 발표했습니다. 당시 5월, OpenAI는 입장을 수정해 점진적으로 모델의 크기를 키워 공개하는 '단계적 출시(Staged Release)'를 시작했습니다. 우려했던 부작용이 실제로 발생하지 않았고, 무엇보다 대체 모델들이 이미 시장에 나와 있었던 탓에 전체 버전의 GPT-2는 2019년 11월 최종 공개되었습니다.

당시 OpenAI에서 커뮤니케이션과 정책의 균형을 맞췄던 인물은 정책 책임자였던 잭 클락(Jack Clark)이었습니다. 2019년 6월, 클락은 미국 의회 증인으로 나서 텍스트 출력을 제어하는 기술은 아직 제한적이지만 과학계의 광범위한 연구를 통해 개선될 것이라고 설명했습니다. 그는 단계적 출시가 책임 있는 규범을 위한 새로운 프로토타입이라고 주장했습니다.

업계는 보류 대신 가드레일을 선택했습니다. 모델을 점진적으로 공개한다는 아이디어는 널리 퍼지지 않았습니다. 대신, 업계는 안전 문제에 대한 다른 해답을 선택했습니다. 바로 출시를 보류하는 것이 아니라, 안전하게 보호한 뒤 공개하는 방식이었습니다. 출시 전 레드팀(Red Teaming), 안전성 평가, 시스템 카드(System Card), 책임 있는 확장 정책, 버그 바운티 프로그램, 그리고 RLHF(인간 피드백 기반 강화학습) 기반의 안전 레이어가 업계의 표준 관행이 되었습니다.

GPT-3는 API를 통해 접근 가능해졌고, ChatGPT는 공개 제품으로 출시되었으며, Meta는 LLaMA 모델을 오픈소스로 공개했습니다. 그 논리는 이것이었습니다. 모델을 철저히 테스트하고 안전장치를 장착한다면, 책임감 있게 출시할 수 있다는 것입니다. 당시 엔비디아(Nvidia) 소속 딥러닝 엔지니어였던 칩 후옌(Chip Huyen)은 2019년에 이 상황을 정확히 꿰뚫어 보았습니다. "관련 기술은 매우 쉽게 복제될 수 있기 때문에 단계적 출시가 특별히 유용하다고 생각하지 않습니다. 하지만 향후 프로젝트를 위한 선례를 만든다는 점에서는 유용할 수 있습니다."

그리고 그 선례가 찾아왔습니다. 단, 누구도 예상치 못한 방식이었습니다. 모델 공개를 단순히 보류하는 것이 아니라, 출시 전 철저히 보안을 강화하는 형태로 말입니다.

잭 클락 본인도 2020년 12월 OpenAI를 떠났습니다. 몇 달 뒤, 그는 다니엘라 아모데이 및 다리오 아모데이 남매를 포함한 전직 OpenAI 직원들이 설립한 Anthropic의 공동 창립자로 다시 모습을 드러냈습니다. Anthropic은 업계의 안전성 관행을 이끄는 주요 주자가 되었습니다. 'Constitutional AI(헌법적 AI)', '책임 있는 확장 정책(Responsible Scaling Policy)', 그리고 출시 때마다 발표하는 방대한 시스템 카드는 회사의 상징이 되었습니다. 이는 주로 샘 알트만(Sam Altman) CEO의 직관과 기존 안전 접근 방식을 경시하는 태도에 동의하지 않았던 전직 OpenAI 직원들이 주도한 결과였습니다.

7년의 세월과 수많은 모델 세대가 흐른 후, Anthropic은 이제 업계가 한 번도 본 적 없는 한 걸음 더 나아간 조치를 취하고 있습니다.

Anthropic이 클로드 미토스를 비공개로 유지하며 연합체를 구성하다

구체적으로, Anthropic은 'Project Glasswing(프로젝트 글래스윙)'을 발표했습니다. 이 이니셔티브는 'Claude Mythos Preview'라는 최신 프론티어 모델을 현재로서는 방어적 사이버 보안 목적으로만 독점 배포하는 프로젝트입니다. 파트너 목록에는 테크 거인, 주요 은행, 그리고 오픈소스 재단을 아우르는 11개 기관이 포함되어 있습니다: Amazon Web Services, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorgan Chase, Linux Foundation, Micro...

원문 보기

원문 보기 (영어)

From GPT-2 to Claude Mythos: The return of AI models deemed 'too dangerous to release' Maximilian Schreiner View the LinkedIn Profile of Maximilian Schreiner Apr 8, 2026 Nano Banana Pro prompted by THE DECODER Seven years ago, OpenAI declared its language model GPT-2 "too dangerous to release." The industry rolled its eyes. Now Anthropic is repeating the move with Claude Mythos Preview - but this time there's real evidence on the table: thousands of vulnerabilities in operating systems and browsers, found by an AI that barely any human could review. In February 2019, OpenAI catapulted itself into public consciousness when it unveiled a language model that could generate fake news so convincingly that the company decided not to release it. Parts of the AI research community considered it a wise precaution; others dismissed it as a PR stunt. OpenAI withheld the full 1.5-billion-parameter model, citing remarkable progress in text generation and concerns about potential misuse. Six months after the initial announcement, OpenAI's policy team published a report on the decision's impact. In May, OpenAI had revised its original position and kicked off a "staged release" - a gradual rollout of increasingly larger versions of the model. The full GPT-2 arrived in November 2019, after the feared harms never materialized, likely in part because alternatives had already become available. The person who managed this balancing act on the communications side at OpenAI was Jack Clark, the company's Policy Director. In June 2019, Clark testified before the US Congress , explaining that the ability to control text output was still limited but would improve through the broader research of the scientific community. He described the staged release as a new prototype for responsible norms. The industry chose guardrails over withholding The idea of gradually releasing a model didn't catch on. Instead, the industry settled on a different answer to the safety question: don't withhold - secure and then release. Red teaming before launch, safety evaluations, system cards, responsible scaling policies, bug bounty programs, and RLHF-based safety layers became standard practice. GPT-3 was made accessible via API , ChatGPT launched as a public product , and Meta released its LLaMA models as open models . The logic: if you thoroughly test a model and equip it with safety measures, you can responsibly ship it. Deep learning engineer Chip Huyen, then at Nvidia, pretty much nailed it back in 2019 : "I don't think a staged release was particularly useful in this case because the work is very easily replicable. But it might be useful in the way that it sets a precedent for future projects." The precedent came, just not in the way anyone expected - not as withholding, but as securing before release. Clark himself left OpenAI in December 2020. A few months later, he emerged as a co-founder of Anthropic, a company founded by former OpenAI employees including siblings Daniela and Dario Amodei. Anthropic became a major driver of the industry's safety practices: Constitutional AI, its Responsible Scaling Policy, and extensive system cards before each launch became hallmarks of the company, often spearheaded by former OpenAI employees who disagreed with CEO Sam Altman's "vibes" and his rather dismissive attitude toward traditional safety approaches. Seven years and several model generations later, Anthropic is now taking a step further than anything the industry has seen before. Anthropic keeps Claude Mythos under wraps and brings a coalition along Specifically, Anthropic has announced Project Glasswing , an initiative that deploys the company's new frontier model called " Claude Mythos Preview " exclusively for defensive cybersecurity purposes for now. The partner list includes eleven organizations spanning tech giants, a major bank, and an open-source foundation: Amazon Web Services, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, the Linux Foundation, Microsoft, NVIDIA, and Palo Alto Networks. Rather than a release with guardrails, Anthropic plans to introduce the necessary safeguards with an upcoming Claude Opus model first, refining them on a model that doesn't pose the same level of risk as Mythos Preview. Only after that should Mythos-class models become broadly available. Security professionals whose work is affected by the restrictions will be able to apply for an upcoming "Cyber Verification Program." According to Anthropic, the model has already found thousands of high-severity vulnerabilities, "including some in every major operating system and web browser." The company is committing up to $100 million in usage credits and donating $4 million directly to open-source security organizations: $2.5 million to Alpha-Omega and OpenSSF through the Linux Foundation, $1.5 million to the Apache Software Foundation. Over 40 additional organizations are getting access to scan and secure critical software infrastructure. After the credits expire, Mythos Preview will be available to partners at $25 and $125 per million input and output tokens, respectively. A 27-year-old bug proves the point Unlike OpenAI with GPT-2 back in the day, Anthropic is backing its decision with concrete findings. According to the Frontier Red Team blog , Mythos Preview autonomously - without any human intervention - found vulnerabilities that had gone undetected for decades. In OpenBSD, an operating system known for its security, the model discovered a 27-year-old bug in the TCP SACK implementation. The vulnerability allowed an attacker to crash any OpenBSD machine simply by connecting to it. The bug was based on a subtle combination of missing validation and integer overflow that only became exploitable through the simultaneous fulfillment of seemingly impossible conditions. In FFmpeg, arguably the most intensively tested media library in the world, Mythos Preview identified a 16-year-old vulnerability in the H.264 codec. According to Anthropic, an automated testing tool had hit the affected line of code five million times without ever catching the problem. The FreeBSD case is also noteworthy: according to Anthropic, the model autonomously found a 17-year-old vulnerability in the NFS server ( CVE-2026-4747 ) and independently built a working exploit for it. Even in long-maintained infrastructure, the model uncovered security flaws that had gone unnoticed for years. Mythos doesn't just find bugs - it exploits them What sets Mythos Preview apart from earlier models, according to Anthropic, isn't just the ability to find vulnerabilities but also to exploit them. The predecessor model Claude Opus 4.6 had a near-zero percent success rate at autonomous exploit development, according to the company. On a benchmark using Firefox 147 vulnerabilities, Opus 4.6 produced working exploits only two times out of several hundred attempts. Mythos Preview achieved 181. On the CyberGym benchmark, which measures how reliably a model can reproduce known vulnerabilities in real open-source software, Mythos Preview scored 83.1 percent compared to 66.6 percent for Opus 4.6. In another internal test against roughly a thousand open-source projects, Mythos Preview achieved full control-flow hijack on ten fully patched targets. Opus 4.6 managed it exactly once. Beyond cybersecurity, the model also shows significant improvements according to the system card . On the coding benchmark SWE-bench Verified, which tests models on real-world software engineering tasks, Mythos Preview reaches 93.9 percent (Opus 4.6: 80.8%). On GPQA Diamond, a set of challenging graduate-level science questions, it scores 94.6 percent (Opus 4.6: 91.3%). On the USAMO 2026, the US Mathematical Olympiad competition, it reaches 97.6 percent; Opus 4.6 scored 42.3 percent. Fewer incidents, but higher stakes when things go wrong Anthropic's 244-page system card also documents alarming behaviors from earlier versions of the model during internal use. In one case, an e

AI 안전성 Anthropic 사이버 보안 Claude Mythos OpenAI

오픈AI, 너무 위험해 공개 불가한 GPT-2 발표 (2019)

2019년 오픈AI는 특정 주제를 주면 문맥에 맞는 글을 매우 자연스럽게 생성하는 텍스트 생성 모델 GPT-2를 발표했습니다. 그러나 가짜 뉴스나 악의적 콘텐츠 생성 등 안전 및 보안상의 우려로 인해 원본 모델이 아닌 축소된 버전만 공개하기로 결정했습니다. 이 결정은 AI 알고리즘의 위험성과 윤리적 공개 기준에 대한 업계 전문가들의 치열한 논쟁을 촉발하는 계기가 되었습니다.

오픈AI GPT-2 안전 및 보안