The Decoder • 74 days ago

AI model comes close to top performance with just 12.5% of its expert modules

IMP

8/10

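Key Summary

EMO, a model developed by researchers at the Allen Institute for AI and UC Berkeley, uses document boundaries to steer its experts toward specializing in particular domains such as medicine or politics. In experiments, removing all but 12.5% of the expert modules cost only about 3 percentage points of performance, a level of efficiency that standard MoE models do not reach. This saves storage and allows flexible, task-specific deployment of the model, which makes the result significant for industry.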

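Translated Article

Researchers have trained an AI model that delivers near-full performance with just 12.5% of its experts.

Key points:

The Allen Institute for AI (Ai2) and UC Berkeley have developed EMO, a modular language model whose modules specialize in specific subject areas such as medicine or politics rather than in mere grammar, while overall performance stays strong.
The system uses fixed document boundaries during training, so that individual modules develop expertise in distinct content domains instead of learning purely structural language patterns.
When the model is cut down to just a quarter of its modules, EMO's performance drops by only about 1 percentage point, which saves significant storage space and enables targeted control over which content areas the model covers.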

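Researchers at the Allen Institute for AI and UC Berkeley have built EMO, a Mixture-of-Experts (MoE) model that develops modular structure during pre-training. The model can be pruned to a small fraction of its experts with barely any drop in performance.

MoE architectures are now standard in large language models such as DeepSeek-V4 and Qwen3.5. Because they activate only a handful of experts per token, they can scale to hundreds of billions of parameters without compute costs exploding. But because different tokens within a single task call on different experts, the full model still has to stay resident in memory. If you only want to do math or coding, you cannot simply load a slice of the model and be done.

According to the paper, this is because experts in standard MoEs tend to latch onto shallow language patterns. They respond to elements such as prepositions, punctuation, and articles rather than to higher-level domains such as math or code. That has made it impossible to carve out a useful subset.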

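Document boundaries as a training signal

EMO solves this with a simple trick. Instead of sorting training data into fixed domains such as math or biology ahead of time, as earlier projects like BTX or Ai2's FlexOlmo do, the authors use document boundaries. Tokens within a single document usually belong to the same domain.

EMO forces every token in a document to choose its active experts from a shared pool. The model decides which experts belong in that pool by averaging the router's preferences over all tokens in the document and keeping the most frequently selected experts.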
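The pooling step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `document_pool` and `route_within_pool` are hypothetical, and reading "router preferences" as softmax probabilities averaged per document is an assumption.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def document_pool(doc_logits, pool_size):
    """Pick one shared expert pool for a whole document.

    doc_logits holds one list of router logits per token. The pool is the
    pool_size experts with the highest mean router probability
    (assumed reading of "averaging router preferences").
    """
    num_experts = len(doc_logits[0])
    mean_pref = [0.0] * num_experts
    for token_logits in doc_logits:
        for e, p in enumerate(softmax(token_logits)):
            mean_pref[e] += p / len(doc_logits)
    ranked = sorted(range(num_experts), key=lambda e: mean_pref[e], reverse=True)
    return sorted(ranked[:pool_size])

def route_within_pool(token_logits, pool, top_k):
    """Top-k routing for one token, restricted to the document's pool."""
    return sorted(pool, key=lambda e: token_logits[e], reverse=True)[:top_k]

# Toy document: 6 tokens, 16 experts, random router logits.
random.seed(0)
doc = [[random.gauss(0, 1) for _ in range(16)] for _ in range(6)]
pool = document_pool(doc, pool_size=4)
routes = [route_within_pool(t, pool, top_k=2) for t in doc]
print("pool:", pool)
print("per-token experts:", routes)
```

Every token still makes its own top-k choice, but only among experts in the pool, so `top_k` must not exceed `pool_size`.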

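Two adjustments were needed to keep training stable. First, the authors stopped computing load balancing (the objective that spreads work evenly across experts) locally per training batch and instead compute it globally across many documents. Otherwise the two training objectives would conflict: one tries to bundle the tokens of a document together, while the other tries to spread them across as many experts as possible.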
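A toy calculation shows why the scope of the balance statistic matters. The sketch below uses the Switch-Transformer-style balance term N · Σ f_i · P_i and, as a simplifying assumption, treats router probabilities as one-hot so that P_i equals the dispatch fraction f_i. The function name `balance_loss` and the four toy documents are illustrative and not taken from the paper.

```python
from collections import Counter

def balance_loss(expert_ids, num_experts):
    """Balance term N * sum_i f_i * P_i with P_i == f_i (one-hot assumption).

    f_i is the fraction of tokens dispatched to expert i.
    The minimum value, 1.0, is reached when all experts are used equally.
    """
    counts = Counter(expert_ids)
    total = len(expert_ids)
    return num_experts * sum((counts[e] / total) ** 2 for e in range(num_experts))

NUM_EXPERTS = 8
# Four toy documents, each routed (top-1) inside its own 2-expert pool.
docs = [[2 * d, 2 * d + 1, 2 * d, 2 * d + 1] for d in range(4)]

# Local: one document per batch, so each batch only ever sees 2 of 8 experts.
local_losses = [balance_loss(doc, NUM_EXPERTS) for doc in docs]
# Global: statistics pooled over all documents before computing the term.
global_loss = balance_loss([e for doc in docs for e in doc], NUM_EXPERTS)

print(local_losses)  # [4.0, 4.0, 4.0, 4.0]
print(global_loss)   # 1.0
```

Computed per document, the term sits at four times its minimum and would push tokens out of their pool. Computed over all documents, the same routing is already perfectly balanced, so the two objectives no longer pull against each other.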

Second, instead of fixing the size of the document pool, the researchers vary it randomly during training. This teaches the model to work with expert subgroups of different sizes at inference time.

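A quarter of the experts, about one percentage point of performance loss

The team trained an MoE with 1 billion active and 14 billion total parameters, with 128 experts of which 8 are active per token, on 1 trillion tokens from the OLMoE pre-training corpus.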

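As a full model, EMO matches an identically trained standard MoE. The authors say it also outperforms the original OLMoE, even though OLMoE was trained on five times more data.

The researchers then began removing experts to see how far they could go. With only 25% of the experts left (32 of 128), EMO lost about 1 percentage point of absolute performance averaged across several benchmarks. Even at 12.5% (16 experts), the drop was only about 3 percentage points.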
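To get a feel for what these expert counts mean for storage, the rounded figures in the article (14 billion total parameters, 1 billion active, 128 experts, 8 active per token) allow a back-of-the-envelope estimate. The sketch assumes that all experts are the same size and that every non-expert parameter is always active. Neither assumption is confirmed by the article, so the results are rough orders of magnitude rather than reported numbers.

```python
def expert_budget(total_b, active_b, num_experts, active_experts):
    """Split a parameter budget (in billions) into shared and per-expert parts.

    Solves  shared + num_experts    * per_expert = total_b
            shared + active_experts * per_expert = active_b
    under the assumption of equally sized experts.
    """
    per_expert = (total_b - active_b) / (num_experts - active_experts)
    shared = active_b - active_experts * per_expert
    return shared, per_expert

shared, per_expert = expert_budget(14.0, 1.0, 128, 8)

def subset_size(kept_experts):
    """Estimated stored parameters (billions) when only kept_experts remain."""
    return shared + kept_experts * per_expert

for kept in (128, 32, 16):
    print(kept, "experts ->", round(subset_size(kept), 2), "B parameters")
# 128 experts -> 14.0, 32 experts -> about 3.6, 16 experts -> about 1.87
```

Under these assumptions, keeping a quarter of the experts stores roughly a quarter of the parameters, because almost all of the model's weights sit in the experts.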

A standard MoE, by contrast, collapses in the same setup, losing 10 to 15 percentage points and in some cases falling below a dense model with the same number of active parameters. On the math benchmark GSM8K, subset models...

Original text (English)

Researchers train AI model that hits near-full performance with just 12.5 percent of its experts

Jonathan Kemper, May 16, 2026

Key Points

The Allen Institute for AI and UC Berkeley have developed EMO, a modular language model whose internal modules specialize in specific subject areas like medicine or politics rather than just grammar, while still maintaining strong overall performance.
The system uses fixed document boundaries during training, which causes individual modules to develop expertise in distinct content domains instead of learning purely structural language patterns.
When reduced to just a quarter of its modules, EMO's performance drops by only about one percentage point, significantly saving storage space and enabling targeted control over which content areas the model covers.

Researchers at the Allen Institute for AI and UC Berkeley have built EMO, a mixture-of-experts model that develops modular structures during pre-training. The model can be stripped down to a small fraction of its experts with barely any drop in performance.

Mixture-of-experts (MoE) architectures are now standard in language models like DeepSeek-V4 or Qwen3.5. They activate only a handful of experts per token, which lets them scale to hundreds of billions of parameters without blowing up compute costs. But the full model still has to sit in memory because different tokens within a task call on different experts. If you only want to do math or code, you can't just load a slice of the model and call it a day.

According to the paper, that's because experts in standard MoEs tend to latch onto shallow language patterns. They respond to things like prepositions, punctuation, or articles instead of higher-level domains like math or code. That makes it impossible to carve out a useful subset.

Document boundaries as a training signal

EMO tackles this with a simple trick. Instead of sorting training data into fixed domains like math or biology ahead of time, the way projects like BTX or Ai2's own FlexOlmo do, the authors use document boundaries. Tokens within a document usually belong to the same domain.

EMO forces all tokens in a document to pick their active experts from a shared pool. The model decides which experts belong in that pool by averaging its router preferences across all tokens in a document and keeping the most frequently selected ones.

Two adjustments were needed to keep training stable. First, the authors stopped calculating load balancing, which aims to spread work evenly across experts, locally per training batch. Instead, they compute it globally across many documents. Otherwise, the two training goals would fight each other: one bundles tokens within a document, and the other spreads them across as many experts as possible.

Second, the researchers randomly vary the size of the document pool during training instead of fixing it. This teaches the model to work with expert subgroups of different sizes at inference time.

A quarter of the experts, one percentage point of performance loss

The team trained an MoE with 1 billion active and 14 billion total parameters with 128 experts, eight active per token, on 1 trillion tokens from the OLMoE pre-training corpus. As a full model, EMO matches an identically trained standard MoE. The authors say it beats OLMoE even though OLMoE was trained on five times more data.

The researchers then started removing experts to see how far they could go. With just 25 percent of them left (32 out of 128), EMO loses about one percentage point of absolute performance averaged across several benchmarks. At 12.5 percent (16 experts), the drop is around three points.

A standard MoE collapses in the same setup, losing 10 to 15 percentage points and, in some cases, falling below the level of a dense model with the same number of active parameters.

On the math benchmark GSM8K, subsets with just 12.5 percent of the experts match full-model performance again after fine-tuning. Finding the right experts doesn't take much data, the authors say. A single few-shot example is enough to pick a subgroup that performs comparably to one selected on a full validation dataset. EMO works with both simple router-based selection and the more specialized Easy-EP method.

Experts learn topics, not punctuation

To understand what EMO actually learned, the researchers analyzed how the model distributes tokens to experts internally. For each token, they recorded the probability with which the router sends it to each expert. These patterns create a kind of fingerprint per token. They then grouped tokens with similar fingerprints into clusters.

The difference is clear-cut. In a standard MoE, expert clusters correspond to shallow linguistic categories: prepositions, proper names, definite articles. EMO's clusters map to actual topics: health and medicine, US politics, film, and music reviews. Tokens from the same document converge on a single cluster in EMO; in a standard MoE, they scatter across many. An interactive visualization of the clusters is available online.

On a sample of 20 million documents from the WebOrganizer dataset with 24 human-assigned domain labels, the authors checked whether related domains also activate similar experts. In EMO, the patterns separate much more cleanly, especially in the model's deeper layers. In a standard MoE, they overlap more.

Use cases go beyond memory savings

The most obvious application is running models in memory-constrained settings where only domain-relevant experts get loaded. In a head-to-head comparison, EMO expert subgroups match or beat both a standard MoE with 32 experts and a dense model the size of eight active experts, each trained from scratch.

The researchers also discuss fine-tuning models at runtime. A child-friendly app, for example, could switch off clusters that respond to spam, gambling, or adult content. In an initial test, the authors retrained a 32-expert subgroup of EMO and plugged it back into the 128-expert model. This improved the full model but didn't reach the level of the standalone subgroup. EMO could also help with monitoring, since the experts make it visible which parts of the model a given input is using.

Ai2 is releasing the EMO model, a comparably trained standard MoE baseline, and the training code on Hugging Face and GitHub. The researchers have also published an interactive demo of the token activations. Open questions remain: how best to select and combine expert subgroups, how to retrain individual modules for specific tasks, and how the modular structure can be used to make models more interpretable.

Source: Ai2 | Paper

AI models, MoE architecture, EMO, memory optimization, AI research