The Decoder • 108 days ago

Arcee AI spends half its capital to release an open-source agent model

Importance: 8/10

Key summary

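US start-up Arcee AI spent about 20 million dollars, roughly half of its total venture capital, to release Trinity-Large-Thinking, an open-source reasoning model with about 400 billion parameters. The model rivals Claude Opus on agent-task benchmarks and could shift an open-source LLM market currently dominated by Chinese models. Technically, it activates only 4 expert modules per token to reduce compute, and it introduces a new expert load-balancing method (SMEBU) to address training instability.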

Translated article

Arcee AI spent half its venture capital to build an open reasoning model that rivals Claude Opus in agent tasks

Arcee AI has released Trinity-Large-Thinking, an open reasoning model built to compete with Claude Opus in agent tasks. The company spent roughly half of its total venture capital on the project.

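The open-weight space for large language models (LLMs) is currently dominated by Chinese labs such as Qwen, MiniMax, and Zhipu AI. US start-up Arcee AI wants to change that with Trinity-Large-Thinking, an Apache 2.0-licensed reasoning model with around 400 billion parameters, built specifically for agent tasks. A mixture-of-experts (MoE) architecture keeps only about 13 billion parameters active per token, which makes inference efficient despite the model's size.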

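According to the company, the team trained the base model on 2,048 Nvidia B300 GPUs over 33 days. The cost came to roughly 20 million dollars, about half of the total venture capital Arcee AI has raised to date. "In many ways, it's the most powerful open model ever brought to market outside of China," chief technology officer (CTO) Lucas Atkins wrote in the blog post accompanying the release.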
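The training figures above can be turned into a rough GPU-hour count. The sketch below assumes that all 2,048 GPUs ran for the full 33 days and that the entire 20 million dollars went to compute; the article states neither, so the per-hour figure is only a blended upper-level estimate, not a quoted price.

```python
# Back-of-the-envelope arithmetic from the figures quoted in the article.
GPUS = 2048               # Nvidia B300 GPUs
DAYS = 33                 # duration of the base-model training run
COST_USD = 20_000_000     # "roughly 20 million dollars"

# Assumption: every GPU was busy 24 hours a day for the whole run.
gpu_hours = GPUS * DAYS * 24

# Assumption: the whole budget is attributed to these GPU-hours.
implied_usd_per_gpu_hour = COST_USD / gpu_hours

print(f"{gpu_hours:,} GPU-hours")
print(f"about ${implied_usd_per_gpu_hour:.2f} per GPU-hour (blended estimate)")
```

Because the budget presumably also covered data, staff, and failed runs, the real hourly compute rate was likely lower than this estimate.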

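Agent benchmarks look strong, general reasoning lags behind

Trinity-Large-Thinking generates an explicit thought process in special think blocks before each answer. The model is optimized for tool calling, multi-stage planning, and autonomous workflows. According to the model card on Hugging Face, it posts strong numbers in agent benchmarks: 88 on Tau2-Airline (first place), 91.9 on PinchBench (second place, narrowly behind Claude Opus 4.6 at 93.3), and 96.3 on AIME25.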

General reasoning is a different story: the model scores 76.3 on GPQA-Diamond and 83.4 on MMLU-Pro, while Claude Opus 4.6 scores 89.2 and 89.1, respectively.

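Only 4 out of 256 experts fire per token

The model uses a mixture-of-experts (MoE) architecture with 256 specialized sub-networks, but only four are active per token. That means only about 13 billion of the 400 billion parameters do work on any given compute step, which saves processing power without reducing the model's overall capacity.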
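To make the "4 of 256" idea concrete, here is a minimal sketch of generic top-k expert routing in plain Python. It is not Arcee AI's router: the random logits, the softmax over the selected experts, and the helper name `route_token` are illustrative assumptions. Only the expert count, the top-k value, and the parameter totals come from the article.

```python
import math
import random

NUM_EXPERTS = 256   # from the article
TOP_K = 4           # from the article


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k highest-scoring experts for one token.

    Returns a list of (expert_index, weight) pairs. The weights are a
    softmax over the selected logits only, so they sum to 1.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    peak = max(router_logits[i] for i in top)          # for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical router scores for a single token.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits)

# Share of experts that run per token, and share of parameters the article
# says are active per token.
active_expert_fraction = TOP_K / NUM_EXPERTS      # 4 / 256
active_param_fraction = 13e9 / 400e9              # about 13B of about 400B

print(selected)
print(f"experts active: {active_expert_fraction:.4%}")
print(f"parameters active: {active_param_fraction:.2%}")
```

The parameter share is larger than the expert share, which is consistent with some parameters (for example attention layers and embeddings) being used for every token regardless of routing; the article does not give the exact breakdown.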

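According to the technical report, the base model achieves benchmark results on par with GLM 4.5, which activates far more parameters per token. For long texts, Trinity Large combines two types of attention layers: local layers that each cover only a section of the text alternate with global layers that span the entire context. This setup supports long context windows without a proportional increase in compute cost.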
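The local/global split can be illustrated with attention masks. The sketch below is a generic sliding-window illustration, not the model's actual configuration: the window size, the sequence length, and the "every second layer is global" pattern are assumptions chosen for readability, since the article does not state the real ratio or window.

```python
def attention_mask(seq_len, window=None):
    """Causal attention mask as a list of rows.

    mask[q][k] is True if query position q may attend to key position k.
    window=None gives a global layer (all earlier positions are visible);
    window=w gives a local layer that sees only the last w positions,
    including the query position itself.
    """
    return [[k <= q and (window is None or q - k < window)
             for k in range(seq_len)]
            for q in range(seq_len)]


def layer_kinds(num_layers, global_every=2):
    """Hypothetical alternation: every `global_every`-th layer is global."""
    return ["global" if (i + 1) % global_every == 0 else "local"
            for i in range(num_layers)]


def attended_pairs(mask):
    """Number of (query, key) pairs a layer computes; a proxy for its cost."""
    return sum(sum(row) for row in mask)


SEQ_LEN, WINDOW = 16, 4   # toy values, not taken from the article
global_cost = attended_pairs(attention_mask(SEQ_LEN))
local_cost = attended_pairs(attention_mask(SEQ_LEN, WINDOW))

print(layer_kinds(4))
print("global layer pairs:", global_cost)
print("local layer pairs:", local_cost)
```

A global layer's cost grows quadratically with sequence length, while a local layer's cost grows roughly linearly (sequence length times window), which is why mixing in local layers keeps long contexts affordable.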

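In practice, the model reaches a usable context window of 512K tokens even though it was trained at only 256K. On the Needle-in-a-Haystack test, which checks whether a model can locate specifically placed information in long texts, it scored 0.976 at a length of 512K.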

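Custom balancing method prevents expert collapse during training

Early training runs hit a wall when individual expert modules collapsed. The token distribution across sub-networks drifted, some experts stopped being used entirely, and the model stopped improving.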

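According to the technical report, the root cause was the existing method for load balancing between experts. It always corrected imbalances with the same fixed step size, regardless of how overloaded an expert was. With 256 experts, this produced constant oscillation that never settled into a stable state.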

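To fix this, the team developed SMEBU (Soft-clamped Momentum Expert Bias Updates), a new method that scales corrections in proportion to the actual deviation and smooths them over time. Combined with five other stabilization measures introduced at the same time because of time pressure, this solved the problem. As a result, the entire training run stayed stable without a single sudden spike in training loss. Such spikes are a common and dreaded problem with large models and can, in the worst case, ruin an entire training run.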
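The contrast between a fixed-step correction and a proportional, smoothed one can be sketched as follows. This is not the published SMEBU formula, which the article does not give. The `tanh` soft clamp, the exponential moving average used as momentum, and all hyperparameter values are assumptions inferred only from the method's name and the description above.

```python
import math


def fixed_step_update(bias, load, target, step=0.01):
    """Baseline described in the article: the same step size every time,
    no matter how far an expert's load is from the target."""
    return [b - step * ((l > target) - (l < target))
            for b, l in zip(bias, load)]


def smebu_like_update(bias, momentum, load, target,
                      lr=0.01, beta=0.9, scale=1.0):
    """Hypothetical SMEBU-style update (illustrative, not the real method).

    The correction is proportional to the relative deviation, softly
    clamped with tanh, and smoothed over time with a momentum term.
    """
    new_bias, new_momentum = [], []
    for b, m, l in zip(bias, momentum, load):
        deviation = (l - target) / target        # relative over/under-load
        clamped = math.tanh(scale * deviation)   # soft clamp into (-1, 1)
        m = beta * m + (1.0 - beta) * clamped    # smooth corrections over time
        new_momentum.append(m)
        new_bias.append(b - lr * m)              # overloaded: lower routing bias
    return new_bias, new_momentum


# Toy example with 4 experts; each should receive 25% of the tokens.
loads = [0.26, 0.50, 0.14, 0.10]
target = 0.25
zeros = [0.0] * len(loads)

fixed = fixed_step_update(zeros, loads, target)
smooth, mom = smebu_like_update(zeros, zeros, loads, target)

print("fixed step:", fixed)
print("SMEBU-like:", smooth)
```

In the toy run, the fixed-step rule pushes the slightly overloaded expert (0.26) exactly as hard as the heavily overloaded one (0.50), which is the behavior the article blames for oscillation. The proportional variant applies a much smaller correction to the former.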

Original article (English)

Arcee AI spent half its venture capital to build an open reasoning model that rivals Claude Opus in agent tasks

Jonathan Kemper, Apr 12, 2026

Arcee AI has released Trinity-Large-Thinking, an open reasoning model built to compete with Claude Opus in agent tasks. The company spent roughly half its total venture capital on the project.

The open-weight space for large language models is currently dominated by Chinese labs like Qwen, MiniMax, and Zhipu AI. US start-up Arcee AI wants to change that with Trinity-Large-Thinking, an Apache 2.0-licensed reasoning model with around 400 billion parameters built specifically for agent tasks. A mixture-of-experts architecture keeps only about 13 billion parameters active per token, making inference efficient despite the model's size.

According to the company, the team trained the base model on 2,048 Nvidia B300 GPUs over 33 days. The roughly 20 million dollar price tag ate up about half of Arcee AI's total venture capital raised to date. "In many ways, it's the most powerful open model ever brought to market outside of China," CTO Lucas Atkins writes in the blog post accompanying the release.

Agent benchmarks look strong, general reasoning lags behind

Trinity-Large-Thinking generates an explicit thought process in special think blocks before each answer. The model is optimized for tool calling, multi-stage planning, and autonomous workflows. According to the model card on Hugging Face, it puts up strong numbers in agent benchmarks: 88 on Tau2-Airline (first place), 91.9 on PinchBench (second place, just a hair behind Claude Opus 4.6 at 93.3), and 96.3 on AIME25.

General reasoning is a different story, though: GPQA-Diamond comes in at 76.3 and MMLU-Pro at 83.4, while Claude Opus 4.6 clocks in at 89.2 and 89.1 respectively.

Only 4 out of 256 experts fire per token

The model uses a mixture-of-experts architecture with 256 specialized sub-networks, but only four are active per token. That means roughly 13 billion out of 400 billion parameters do work on any given compute step, saving processing power without cutting the model's overall capacity.

According to the technical report, the base model hits benchmark results competitive with GLM 4.5, even though that model activates far more parameters per token. For handling long texts, Trinity Large combines two types of attention layers: local layers that each cover only a section of the text alternate with global layers that span the entire context. This setup supports long context windows without a proportional jump in compute costs.

In practice, the model reaches a usable context window of 512K tokens, even though it was trained at only 256K. On the Needle-in-a-Haystack test, which checks whether a model can locate specifically placed information in long texts, it scored 0.976 at 512K.

Custom balancing method prevents expert collapse during training

Early training runs hit a wall when individual experts collapsed. Token distribution across sub-networks drifted, some experts stopped getting used entirely, and the model quit improving.

According to the technical report, the root cause was the existing method for load balancing between experts. It corrected imbalances with the same fixed step size every time, regardless of whether an expert was slightly or massively overloaded. With 256 experts, this created constant oscillation that never settled into a stable state.

The team built SMEBU (Soft-clamped Momentum Expert Bias Updates) to fix this, a new method that scales corrections proportionally to the actual deviation and smooths them over time. Combined with five other stabilization measures introduced simultaneously due to time pressure, this solved the problem.
Subsequently, the entire training run stayed stable without a single sudden spike in training loss. These kinds of spikes are a common and dreaded issue with large models that can ruin an entire training run in the worst case.

Over 8 trillion tokens of synthetic training data

A big chunk of the training data is synthetic: more than eight of the 17 trillion tokens were generated by other AI models rather than scraped from the web. That includes 6.5 trillion tokens of rewritten web text, around 1 trillion tokens of multilingual data, and roughly 800 billion tokens of code. Partner DatologyAI handled data curation. According to the technical report, this ranks among the largest documented synthetic data generations for pretraining.

Prime Intellect provided the GPU clusters. Since the B300 systems were brand new at the time, GPU errors kept popping up and could only be patched through firmware updates.

The team also built a new method for processing training data called Random Sequential Document Buffer (RSDB). Normally, especially long documents can dominate several consecutive training steps and skew the data distribution. RSDB shuffles documents randomly instead, which the technical report says significantly cuts fluctuations between individual training steps.

Strong early adoption despite limited post-training

After pretraining, the model went through a second fine-tuning phase focused on specific skills like tool use and multi-step tasks. According to the technical report, though, this phase ran shorter than planned because compute time on the GPU cluster was limited. Arcee AI calls the current version preliminary and plans more extensive fine-tuning for the next iteration.

A previously released preview version ran on OpenRouter, where it processed 3.37 trillion tokens in its first two months. It ranked among the most-used open models in the US on the platform, according to Arcee AI.
The Thinking version is also live on OpenRouter and works with agent frameworks like OpenClaw and Hermes Agent. Shortly before Arcee AI's release, Google shipped Gemma 4, a new family of open models also under an Apache 2.0 license and partly built on a mixture-of-experts architecture.
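The Random Sequential Document Buffer (RSDB) mentioned in the article is described there in only two sentences, so the sketch below is one plausible reading of the name rather than the team's implementation. It assumes a fixed-size buffer of open documents, a random choice of which document to read from next, and sequential reading within each document. The function name `rsdb_chunks` and all parameters are hypothetical.

```python
import random


def rsdb_chunks(documents, buffer_size, chunk_len, seed=0):
    """Yield (doc_id, chunk) pairs from a random-sequential buffer.

    Hypothetical interpretation of RSDB: several documents are kept open
    at once. Each step picks one of them at random and emits its next
    chunk in order, so a very long document is interleaved with others
    instead of filling many consecutive training steps on its own.
    """
    rng = random.Random(seed)
    pending = iter(enumerate(documents))
    buffer = []  # each entry: [doc_id, tokens, read_cursor]

    def refill():
        # Top the buffer up with unread documents, if any remain.
        while len(buffer) < buffer_size:
            nxt = next(pending, None)
            if nxt is None:
                return
            buffer.append([nxt[0], nxt[1], 0])

    refill()
    while buffer:
        slot = rng.randrange(len(buffer))          # random document ...
        doc_id, tokens, cursor = buffer[slot]
        yield doc_id, tokens[cursor:cursor + chunk_len]   # ... read in order
        cursor += chunk_len
        if cursor >= len(tokens):
            buffer.pop(slot)                       # document finished
            refill()
        else:
            buffer[slot][2] = cursor


# Toy corpus: one long document and two short ones (token ids are made up).
docs = [list(range(12)), list(range(100, 104)), list(range(200, 204))]
chunks = list(rsdb_chunks(docs, buffer_size=3, chunk_len=4))

for doc_id, chunk in chunks:
    print(doc_id, chunk)
```

Every token is still emitted exactly once and in its original order within its document; only the interleaving across documents is randomized.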

Open-source model, agent, mixture of experts (MoE), Arcee AI, benchmark