MarkTechPost • 108 days ago

MiniMax releases self-evolving open-source agent scoring 56% on SWE-Pro

IMP

8/10

Key summary

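MiniMax has released the weights of MiniMax M2.7, its most capable open-source Mixture-of-Experts (MoE) model, on Hugging Face. The model posts state-of-the-art (SOTA) results on practice-oriented coding benchmarks such as SWE-Pro (56.22%) and Terminal Bench 2 (57.0%), comparable to GPT-5.3-Codex and Opus 4.6. Its key differentiators are strong debugging ability, which MiniMax says has cut recovery time for live production incidents to under three minutes, and its own 'Self-Evolution' architecture, in which the model itself iteratively improves and optimizes code.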

Translated text

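MiniMax has officially open-sourced MiniMax M2.7, releasing the model weights on Hugging Face. First announced on March 18, 2026, MiniMax M2.7 is MiniMax's most capable open-source model to date and its first model to actively participate in its own development cycle. This marks a meaningful shift in how large language models (LLMs) are built and iterated on.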

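What is MiniMax M2.7?

MiniMax M2.7 is part of MiniMax's M2 series of Mixture-of-Experts (MoE) models. MoE is an architectural design in which only a subset of the total parameters is 'activated' during inference, which makes the model significantly faster and cheaper to serve than a dense model of similar quality. MiniMax M2.7 is built around three core capabilities: professional software engineering, professional office work, and native multi-agent collaboration, which MiniMax calls 'Agent Teams'. Using Agent Teams, complex Skills, and dynamic tool search, the model can build complex agent harnesses and complete highly elaborate productivity tasks.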
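The idea that only a subset of parameters is activated per inference pass can be illustrated with a toy top-k router. This is a minimal sketch of a common MoE gating pattern, not MiniMax's implementation: the article does not state M2.7's expert count, its value of k, or its router design, so `route_top_k`, `moe_forward`, and the four scalar `EXPERTS` are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in chosen)
    return {i: probs[i] / mass for i in chosen}


def moe_forward(x, experts, router_scores, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = route_top_k(router_scores, k)
    return sum(weight * experts[i](x) for i, weight in gates.items())


# Hypothetical toy experts: each one just scales its input.
EXPERTS = [lambda x, m=m: m * x for m in (1.0, 2.0, 3.0, 4.0)]
```

With k=2 and four experts, only two expert functions run per call, which is where the serving savings over a dense model of the same total size come from.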

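SOTA benchmark performance: SWE-Pro and Terminal Bench 2

On SWE-Pro, which covers multiple programming languages, MiniMax M2.7 achieved 56.22% accuracy, matching GPT-5.3-Codex. SWE-Pro tasks span log analysis, bug troubleshooting, code security review, and machine learning workflow debugging, which is much closer to the messy reality of production systems than standard algorithmic coding tests.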

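MiniMax M2.7 also performs solidly on Terminal Bench 2 (57.0%) and NL2Repo (39.8%), both of which demand a high degree of system-level comprehension. The model is strong not only at code generation but also at understanding the operational logic and collaborative dynamics of software systems. On VIBE-Pro, a repo-level code generation benchmark, it scored 55.6%, nearly on par with Opus 4.6. In other words, requirements involving Web, Android, iOS, or simulation tasks can be handed directly to MiniMax M2.7. It also shows a strong advantage on SWE Multilingual (76.5) and Multi SWE Bench (52.7), benchmarks that are closer to real-world engineering scenarios.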

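Real production debugging: under three minutes

When an alert fires in production, MiniMax M2.7 can correlate monitoring metrics with deployment timelines to reason about causality, and run statistical analysis on sampled traces to propose precise hypotheses. It can also proactively connect to the database to verify the root cause, locate the missing index migration file in the code repository, and use non-blocking index creation to stop the bleeding before submitting a merge request. According to the MiniMax team, this has on multiple occasions reduced recovery time for live production incidents to under three minutes. From observability analysis and database expertise to SRE (site reliability engineering)-level decision-making, MiniMax M2.7 goes well beyond a code-generation model.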
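To make "non-blocking index creation" concrete, here is a small sketch that builds the kind of statement such a fix might use. It assumes PostgreSQL, which the article does not name; `CREATE INDEX CONCURRENTLY` is PostgreSQL's way of building an index without blocking writes to the table. The helper `concurrent_index_sql` and the table and column names are hypothetical.

```python
import re

# Accept only simple lowercase identifiers so the generated SQL stays safe.
_IDENT = re.compile(r"^[a-z_][a-z0-9_]*$")


def concurrent_index_sql(table, columns):
    """Build a PostgreSQL CREATE INDEX CONCURRENTLY statement.

    CONCURRENTLY builds the index without blocking writes to the table,
    at the cost of a slower build. PostgreSQL does not allow it inside
    a transaction block.
    """
    names = [table, *columns]
    if not columns or not all(_IDENT.match(n) for n in names):
        raise ValueError("expected simple lowercase identifiers")
    index_name = "idx_" + "_".join(names)
    return (
        f"CREATE INDEX CONCURRENTLY IF NOT EXISTS {index_name} "
        f"ON {table} ({', '.join(columns)})"
    )
```

For example, `concurrent_index_sql("orders", ["customer_id"])` yields a statement that could be run as an immediate mitigation, with the proper migration file following in the merge request.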

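The Self-Evolution Architecture

To test the limits of autonomous improvement, MiniMax M2.7 was tasked with optimizing a model's programming performance on an internal scaffold. It ran an iterative loop of 'analyze failure trajectories → plan changes → modify scaffold code → run evaluations → compare results → keep or revert changes' for more than 100 rounds, fully autonomously. Along the way, MiniMax M2.7 discovered effective optimizations on its own: it systematically searched for the best combination of sampling parameters such as temperature, frequency penalty, and presence penalty; designed more specific workflow guidelines (for example, automatically searching other files for the same bug pattern after a fix); and added loop detection to the scaffold's agent loop. This yielded a 30% performance improvement on internal evaluation sets.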
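The keep-or-revert loop described above can be sketched as a simple hill climb over sampling parameters. This is an illustration under stated assumptions, not MiniMax's scaffold: `evaluate` is a hypothetical toy objective standing in for a real evaluation run, and `propose_change` is a hypothetical random nudge standing in for the model's failure analysis and planning.

```python
import random


def evaluate(config):
    """Hypothetical stand-in for an evaluation run; higher is better."""
    # Toy objective that peaks at temperature=0.7, frequency_penalty=0.2.
    return (
        1.0
        - abs(config["temperature"] - 0.7)
        - abs(config["frequency_penalty"] - 0.2)
    )


def propose_change(config, rng):
    """Hypothetical stand-in for 'plan changes': nudge one parameter by 0.1."""
    candidate = dict(config)  # copy so the current config is never mutated
    key = rng.choice(sorted(candidate))
    nudged = candidate[key] + rng.choice([-0.1, 0.1])
    candidate[key] = round(min(2.0, max(0.0, nudged)), 2)
    return candidate


def self_improve(config, rounds=100, seed=0):
    """Keep a change only if the evaluation score improves; otherwise revert."""
    rng = random.Random(seed)
    best_score = evaluate(config)
    history = [best_score]
    for _ in range(rounds):
        candidate = propose_change(config, rng)
        score = evaluate(candidate)
        if score > best_score:
            config, best_score = candidate, score  # keep the change
        # Otherwise the candidate is discarded, which is the revert step.
        history.append(best_score)
    return config, best_score, history
```

Because a change is kept only when the score improves, the recorded best score never decreases from round to round, which is the property that makes an unattended loop of this kind safe to run for many iterations.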

Original text (English)

MiniMax has officially open-sourced MiniMax M2.7, making the model weights publicly available on Hugging Face. Originally announced on March 18, 2026, MiniMax M2.7 is MiniMax's most capable open-source model to date and its first model to actively participate in its own development cycle, a meaningful shift in how large language models are built and iterated.

What is MiniMax M2.7?

MiniMax M2.7 is part of MiniMax's M2-series of Mixture-of-Experts (MoE) models. MoE is an architectural design where only a subset of the total parameters is 'activated' during any inference pass, which makes the model significantly faster and cheaper to serve than a dense model of similar output quality.

MiniMax M2.7 is built around three core capability areas: professional software engineering, professional office work, and what MiniMax calls Agent Teams (native multi-agent collaboration). MiniMax M2.7 is capable of building complex agent harnesses and completing highly elaborate productivity tasks, leveraging capabilities such as Agent Teams, complex Skills, and dynamic tool search.

SOTA Benchmark Performance: SWE-Pro and Terminal Bench 2

On SWE-Pro, which covers multiple programming languages, MiniMax M2.7 achieved a 56.22% accuracy rate, matching GPT-5.3-Codex. SWE-Pro tasks span log analysis, bug troubleshooting, code security review, and machine learning workflow debugging, which is much closer to the messy reality of production systems than standard algorithmic coding tests.

On Terminal Bench 2 (57.0%) and NL2Repo (39.8%), both of which demand a high degree of system-level comprehension, MiniMax M2.7 performs solidly. The model excels not only at code generation but can also deeply understand the operational logic and collaborative dynamics of software systems.
On the repo-level code generation benchmark VIBE-Pro, MiniMax M2.7 scored 55.6%, nearly on par with Opus 4.6, meaning that requirements involving Web, Android, iOS, or simulation tasks can be handed directly to MiniMax M2.7 to complete. It also demonstrates a strong advantage on benchmarks closer to real-world engineering scenarios: SWE Multilingual (76.5) and Multi SWE Bench (52.7).

Production Debugging: Under Three Minutes

When faced with alerts in production, MiniMax M2.7 can correlate monitoring metrics with deployment timelines to perform causal reasoning, conduct statistical analysis on trace sampling and propose precise hypotheses, proactively connect to databases to verify root causes, pinpoint missing index migration files in the code repository, and use non-blocking index creation to stop the bleeding before submitting a merge request. The MiniMax team reports that on multiple occasions this reduced recovery time for live production system incidents to under three minutes. From observability analysis and database expertise to SRE-level decision-making, this positions MiniMax M2.7 as something beyond a code-generation model.

The Self-Evolution Architecture

To test the boundaries of autonomous improvement, MiniMax M2.7 was tasked with optimizing a model's programming performance on an internal scaffold. It ran entirely autonomously, executing an iterative loop of 'analyze failure trajectories → plan changes → modify scaffold code → run evaluations → compare results → decide to keep or revert changes' for over 100 rounds. During this process, MiniMax M2.7 discovered effective optimizations on its own: systematically searching for the optimal combination of sampling parameters such as temperature, frequency penalty, and presence penalty; designing more specific workflow guidelines (such as automatically searching for the same bug pattern in other files after a fix); and adding loop detection to the scaffold's agent loop.
This achieved a 30% performance improvement on internal evaluation sets. Within MiniMax's own reinforcement learning team workflows, M2.7 is now capable of handling 30%–50% of the workflow end-to-end, with human researchers stepping in only for critical decisions and discussions.

MLE Bench Lite: Testing Autonomous ML Experimentation

The MiniMax team also tested MiniMax M2.7 on MLE Bench Lite, OpenAI's open-sourced suite of 22 machine learning competitions runnable on a single A30 GPU, covering virtually all stages of the ML workflow. For this evaluation, the team designed a simple three-component harness: short-term memory, self-feedback, and self-optimization. After each iteration round, the agent generates a short-term memory markdown file, performs self-criticism on the current results, and provides optimization directions for the next round. Three trials were run, each with a 24-hour window for iterative evolution. The best run achieved 9 gold medals, 5 silver medals, and 1 bronze medal. The average medal rate across the three runs was 66.6%, a result second only to Opus-4.6 (75.7%) and GPT-5.4 (71.2%), tying with Gemini-3.1 (66.6%).

Professional Office Work and Finance

Beyond software engineering, MiniMax M2.7 targets professional office tasks. In the GDPval-AA evaluation, which measures domain expertise and task delivery capability across 45 models, MiniMax M2.7 achieved an ELO score of 1495, the highest among open-source models, second only to Opus 4.6, Sonnet 4.6, and GPT-5.4, and surpassing GPT-5.3. On Toolathon, MiniMax M2.7 achieved an accuracy of 46.3%, reaching the global top tier. In MM Claw testing (an evaluation MiniMax built based on real-world usage patterns from the OpenClaw personal agent platform), MiniMax M2.7 maintained a 97% skill compliance rate across 40 complex skills (each exceeding 2,000 tokens) and achieved an overall accuracy of 62.7%, approaching Sonnet 4.6.
In finance, MiniMax M2.7 can autonomously read a company's annual reports and earnings call transcripts, cross-reference multiple research reports, independently design assumptions and build a revenue forecast model, and produce a PPT and Word research report based on templates, understanding, making judgments, and producing output like a junior analyst.

Key Takeaways

MiniMax M2.7 is now officially open source, with weights available on Hugging Face, making a frontier-grade agentic model freely accessible for developers to deploy and build on.

MiniMax M2.7 achieves SOTA performance on real-world software engineering benchmarks, scoring 56.22% on SWE-Pro (matching GPT-5.3-Codex) and 57.0% on Terminal Bench 2, tests that measure production-level reasoning, not just code generation.

MiniMax M2.7 is the first MiniMax model to actively participate in its own development, running over 100 autonomous rounds of scaffold optimization and achieving a 30% performance improvement, an early, concrete example of AI-assisted AI development in practice.

The model is built for real agentic deployments, maintaining 97% skill adherence across 40 complex skills (each exceeding 2,000 tokens), supporting native Agent Teams with stable role boundaries, and handling 30–50% of MiniMax's internal RL team workflows autonomously.

MiniMax M2.7 is the highest-ranked open-source model on GDPval-AA with an ELO score of 1495 across 45 models, demonstrating strong professional work capabilities spanning office document editing, financial analysis, and multi-round high-fidelity task delivery.

open source, agents, software engineering, MiniMax, self-evolution