The Decoder • 65일 전

구글 딥마인드 AI, 수백 달러로 수십년 난제 풀다

IMP

8/10

핵심 요약

구글 딥마인드가 개발한 '알파프루프 넥서스(AlphaProof Nexus)' 프레임워크가 353개의 미해결 에르되시(Erdős) 수학 문제 중 9개를 자율적으로 해결했습니다. 이 시스템은 LLM이 Lean이라는 형식 언어로 증명 단계를 생성하고 컴파일러가 이를 검증하는 방식으로 추론 비용 문제당 단 몇 백 달러에 불과했습니다. 이는 수학적 증명 분야에서 전문적인 훈련 시스템에서 벗어나, 강력한 범용 LLM과 단순한 에이전트 루프가 결합하는 AI의 새로운 패러다임을 보여줍니다.

번역된 본문

구글 딥마인드의 알파프루프 넥서스, 수십 년간 수학자들을 괴롭힌 연구 난제 해결

구글 딥마인드의 새로운 프레임워크인 알파프루프 넥서스(AlphaProof Nexus)가 시도한 353개의 미해결 에르되시(Erdős) 문제 중 9개를 자율적으로 해결했으며, 여기에는 56년 동안 답이 없었던 두 가지 질문도 포함되어 있습니다.

이 시스템은 온라인 정수열 백과사전(OEIS)의 492개 미해결 추측 중 44개를 증명했고, 대수 기하학에서 힐베르트 함수(Hilbert functions)에 대한 15년 된 질문에 종지부를 찍었으며, 볼록 최적화(convex optimization)의 기존 경계를 개선했습니다. 연구 논문에 따르면, 문제당 추론 비용은 단 몇 백 달러에 불과했습니다.

최근의 오픈AI(OpenAI) 해결책과 같은 (순수하게) 자연어 기반 접근 방식과 달리, 알파프루프 넥서스의 기반이 되는 언어 모델(이 경우 Gemini 3.1 Pro)은 전체 논리적 사슬을 스스로 감당할 필요가 없습니다.

대신, 모델이 Lean이라는 형식 언어로 증명 단계를 생성하면 컴파일러가 각 단계를 확인합니다. 이때 발생한 오류 메시지는 다음 시도에 직접적으로 피드백으로 제공됩니다. 이러한 방식을 통해 LLM은 상징적 피드백에 기반을 두게 되며, 이는 논리적 추론에서 언어 모델의 잘 알려진 약점을 보완해 주는 안전망 역할을 합니다. 인간은 최종 결과를 확인하는 마지막 단계에만 개입합니다.

네 개의 에이전트, 하나의 놀라운 결과 이 시스템은 복잡성이 증가하는 네 가지 에이전트 변형으로 구성되어 있습니다. 가장 단순한 Agent (A)는 Gemini 3.1 Pro에서 루프를 돌며 실행되는 독립적인 하위 에이전트를 배치합니다. 즉, 언어 모델이 증명 단계를 생성하고, Lean 컴파일러가 이를 확인하며, 오류 메시지가 다음 시도에 피드백으로 들어가는 구조입니다.

Agent (B)는 구글의 수학 올림피아드용 강화학습 기반 시스템인 알파프루프(AlphaProof)에 쿼리를 추가하여, 누락된 증명 부분을 채울 수 있도록 합니다. Agent (C)는 진화적(evolutionary) 구성 요소를 도입합니다. 알파이볼브(AlphaEvolve)에서 영감을 받은 하위 에이전트들은 공통된 증명 스케치(개요) 풀을 공유합니다. Gemini 3.0 Flash 기반의 평가 에이전트는 이 스케치들의 타당성과 참신함을 점수로 매긴 다음, Elo 시스템을 사용해 순위를 매깁니다.

모든 기능을 갖춘 Agent (D)는 이러한 모든 기능을 결합합니다. 에르되시 문제에는 이 Agent (D)가 사용되었습니다. 하지만 사후 분석 결과 놀라운 사실이 밝혀졌습니다. LLM과 컴파일러 피드백만 사용하는 가장 단순한 Agent (A) 역시 가장 어려운 문제에서는 비용이 더 들었지만, 해결된 9개의 에르되시 문제를 모두 증명할 수 있었던 것입니다.

연구진은 단순한 에이전트의 성공을 두 가지 요인으로 돌립니다. 바로 기반이 되는 언어 모델의 빠른 발전과 "컴파일러 피드백이 LLM의 추론을 뒷받침해 주는 힘"입니다. 모든 기능을 갖춘 에이전트는 여전히 가장 까다로운 작업에서 우위를 점하고 있지만, LLM이 발전함에 따라 이러한 격차는 줄어들 수 있습니다.

연구진은 이것이 더 광범위한 추세를 가리킨다고 말하며, 이는 "LLM이 더 능력을 갖추게 되면서 전문적으로 훈련된 시스템에서 단순한 에이전트 루프로 지속적인 전환이 일어나고 있음"을 의미한다고 설명합니다.

완전한 증명이 없어도 유용하다 이 시스템의 성공은 주로 조합론(combinatorics), 볼록 최적화, 정수론(number theory)과 같은 분야에 집중되어 있습니다. 이 분야들은 Lean의 수학 라이브러리인 Mathlib이 성숙해 있고, 문제들을 다루기 쉬운 하위 목표들로 세분화할 수 있기 때문입니다.

대부분의 에르되시 문제는 여전히 AI의 손이 닿지 않는 곳에 있었습니다. 연구진은 "수학자들이 완전히 새로운 개념을 구축해야만 풀 수 있는 문제들은 말할 것도 없고, 현재 시스템으로는 해결할 수 없다"고 덧붙였습니다. 그럼에도 불구하고, 연구진은 알파프루프 넥서스가 수학 연구를 지원하는 귀중한 도구가 될 수 있다고 봅니다.

원문 보기

원문 보기 (영어)

Google Deepmind's AlphaProof Nexus solves decades-old math problems for a few hundred dollars Matthias Bastian View the LinkedIn Profile of Matthias Bastian May 25, 2026 GPT-Image-2 prompted by THE DECODER Key Points Google Deepmind has developed AlphaProof Nexus, a framework that autonomously solved nine of 353 open mathematical Erdős problems along with other complex conjectures, at an inference cost of just a few hundred dollars per problem. The system relies on the Gemini 3.1 Pro language model to generate proof steps in Lean, a formal programming language used for mathematical verification, enabling rigorous and machine-checkable solutions. While the vast majority of Erdős problems remained beyond the AI's reach, Deepmind researchers see the system as a valuable tool for supporting mathematical research. Ask about this article… Search AlphaProof Nexus combines LLM-driven proof generation with machine verification to crack open math research problems that have stumped mathematicians for decades. Google Deepmind's new framework AlphaProof Nexus has autonomously solved nine out of 353 open Erdős problems it attempted, including two questions that had gone unanswered for 56 years. The system also proved 44 out of 492 open conjectures from the Online Encyclopedia of Integer Sequences (OEIS), settled a 15-year-old question about Hilbert functions in algebraic geometry, and improved a known bound in convex optimization. Inference costs ran just a few hundred dollars per problem, according to the research paper . Ad Unlike (potentially) pure natural-language approaches such as OpenAI's recent solution , the underlying language model in AlphaProof Nexus—in this case Gemini 3.1 Pro—doesn't have to carry the entire logical chain on its own. Ad DEC_D_Incontent-1 Instead, it generates proof steps in Lean's formal language, and the compiler checks each one. Error messages feed directly back into the next attempt. That way, the LLM gets grounded by symbolic feedback, a safety net that offsets the well-known weaknesses of language models when it comes to logical reasoning. Humans only step in at the very end to check the results. Four agents, one surprising result The system consists of four agent variants with increasing complexity. The simplest, Agent (A), deploys independent sub-agents running on Gemini 3.1 Pro in loops: the language model generates proof steps, the Lean compiler checks them, and error messages feed back into the next try. Ad Agent (B) adds queries to AlphaProof, Google's reinforcement-learning-based system for olympiad math, which can fill in missing proof segments. Agent (C) introduces an evolutionary component. Inspired by AlphaEvolve , sub-agents share a common population of proof sketches. Rating agents built on Gemini 3.0 Flash score these sketches for plausibility and novelty, then rank them using an Elo system. The fully equipped Agent (D) combines all of these capabilities. Agent (D) was used for the Erdős problems. But a post-hoc analysis turned up a surprise: the simplest Agent (A), which only uses an LLM and compiler feedback, could also prove all nine solved Erdős problems, albeit pricier on the hardest ones. Ad DEC_D_Incontent-2 The researchers attribute the simple agent's success to two factors: rapid improvement in the underlying language models and the "power of compiler feedback in grounding LLM reasoning." The fully equipped agent still holds an edge on the toughest tasks for now, but that lead could shrink as LLMs get better. The researchers say this points to a broader trend, describing "an ongoing shift from specialized trained systems toward simple agentic loops as LLMs become more capable." Ad Useful even without a complete proof The system's successes cluster in areas like combinatorics, convex optimization, and number theory, where Lean's math library Mathlib is mature and problems break down into manageable sub-goals. Most Erdős problems remained out of reach, "let alone problems that require extensive new theory," the researchers write. The agents also inherit the unreliability of the underlying language models. Still, they see value beyond solved problems. Mathematicians who worked with the system reported that even failed proof attempts deepened their understanding of a problem, or as the authors put it, "AI-driven formal proof search can serve not only to solve problems but to deepen human understanding." Because the sketches were formal, experts could focus on the unsolved sub-goals instead of re-checking the entire argument from scratch. The agents also proved effective at catching flawed formalizations in the literature. "Formal verification can serve as a filter for determining which proofs merit human review," the authors write. The system is already being used in ongoing research on quantum optics and graph theory, according to the paper. All Lean proofs and selected natural-language proofs are available on GitHub . Erdős problems become the benchmark for AI math OpenAI recently used a proprietary reasoning model to disprove Erdős's unit-distance conjecture . Fields Medalist Tim Gowers called it "a milestone in AI mathematics." Before that, GPT-5.2 Pro helped solve Erdős problem #281 , with Terence Tao calling the case "perhaps the most unambiguous instance" of an LLM solving an open math problem. Thereafter, GPT-5.4 solved another Erdős problem . In some ways, those results are more impressive than Deepmind's approach. The language model had to carry the entire logical chain through natural language, without a Lean compiler checking each step . AlphaProof Nexus is more systematic and scalable, but it's tackling a different goal: building a reliable AI tool for everyday math research. OpenAI could integrate Lean into their scaffold as well, of course, but the point there is more about testing raw LLM capability. Tao in the past warned against reading too much into the headlines, though. AI's actual success rate on Erdős problems sits at just one to two percent, concentrated on easier tasks. Google's system cracked only nine out of 353 problems. That lines up almost exactly with Tao's two-percent bar. AI News Without the Hype – Curated by Humans Subscribe to THE DECODER for ad-free reading, a weekly AI newsletter, our exclusive "AI Radar" frontier report six times a year, full archive access, and access to our comment section. Subscribe now Source: Paper

인공지능 구글 딥마인드 수학 증명 에이전트 연구