r/LocalLLaMA • 83 days ago

ZAYA1-8B: a frontier intelligence-density model trained on AMD GPUs

IMP

9/10

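Key summary

Zyphra has released ZAYA1-8B, an MoE model trained from scratch on an AMD Instinct MI300 cluster. Although it uses fewer than 1 billion active parameters, it reaches frontier intelligence density for its resource budget on complex reasoning, mathematics, and coding benchmarks, matching or exceeding much larger models such as Claude 4.5 Sonnet and Mistral-Small-4-119B.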

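Translated text

Models

May 5, 2026, San Francisco, California

ZAYA1-8B: Frontier intelligence density, trained on AMD GPUs

Zyphra has released ZAYA1-8B, an AMD-trained MoE (Mixture of Experts) model that performs strongly on complex reasoning, mathematics, and coding tasks.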

Contributors: Robert Washbourne, Rishi Iyer, Tomás Figliolia, Henry Zheng, Ryan Lorig-Roach, Sungyeon Yang, Pritish Yuvraj, Quentin Anthony, Yury Tokpanov, Xiao Yang, Ganesh Nanduru, Stephen Ebert, Praneeth Medepalli, Skyler Szot, Srivatsan Rajagopal, Alex Ong, Bhavana Mehta, Beren Millidge

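[Read technical report] [Hugging Face]

Introduction

Mathematics and coding performance of ZAYA1-8B vs larger open-weight and proprietary reasoning models

Today Zyphra is releasing ZAYA1-8B, the first MoE model to be pretrained, midtrained, and supervised fine-tuned (SFT) entirely on an AMD Instinct MI300 stack. ZAYA1-8B delivers frontier intelligence density per active parameter and outperforms substantially larger open-weight models on certain mathematics and coding benchmarks.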

With under 1 billion active parameters, ZAYA1-8B performs strongly on reasoning, mathematics, and coding benchmarks. It matches or exceeds Mistral-Small-4-119B, a model many times its size, and remains competitive with substantially larger first-generation frontier reasoning models such as DeepSeek-R1-0528, Gemini-2.5-Pro, and Claude 4.5 Sonnet.

Applying our novel Markovian-RSA test-time compute methodology yields significant additional gains. With it, ZAYA1-8B exceeds Claude 4.5 Sonnet and GPT-5-High on HMMT'25 (89.6 vs 88.3) and closes in on frontier open-weight models such as DeepSeek-V3.2 on mathematics benchmarks.

ZAYA1-8B's performance is a testament to Zyphra's innovations across the full stack, from model architecture, pretraining, and optimization to post-training and large-scale reinforcement learning (RL). Moreover, its strength demonstrates the power of our post-training stack, and we are excited to keep scaling these efforts in both model size and the breadth of domains covered.

ZAYA1-8B is available today as a serverless endpoint on Zyphra Cloud.

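Performance

ZAYA1-8B competes on equal terms with recent state-of-the-art (SOTA) open-source models in the same weight class, and with many substantially larger open-source models, across a wide range of evaluations: mathematics (AIME and HMMT), coding (LCB), reasoning and knowledge retrieval (GPQA-Diamond), and instruction following (IFEval and IFBench).

ZAYA1-8B vs leading open-weight models on a variety of evaluations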

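Architecture

ZAYA1-8B achieves its efficiency through a combination of a unique architecture, pretraining methodology, and reinforcement learning pipeline. The innovations at each level of the stack are optimized toward a single objective: maximizing the intelligence extracted per parameter and per FLOP of the final model.

ZAYA1-8B introduces three key architectural changes. The first is Compressed Convolutional Attention (CCA), a substantially more efficient and performant attention variant developed by Zyphra. The second is a novel MLP-based router for expert selection that improves routing stability over linear routers. The third is learned residual scaling, which controls residual-norm growth through depth at negligible parameter and compute cost. Together, these three form the basis of ZAYA1-8B's intelligence efficiency.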
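To make the residual-norm point concrete, here is a minimal sketch of why scaling the residual stream can bound norm growth through depth. It assumes a per-layer update of the form x <- a * x + b * f(x); this form, the constants, and the toy `block_update` function are illustrative assumptions, not Zyphra's published formulation, and the fixed scalars stand in for values that would be learned in a real model.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a plain Python list."""
    return math.sqrt(sum(v * v for v in vec))


def block_update(vec):
    # Hypothetical stand-in for a transformer block's output f(x):
    # every layer writes a constant-size update into the residual stream.
    return [1.0 for _ in vec]


def run_stack(depth, dim, residual_scale, branch_scale):
    """Apply x <- residual_scale * x + branch_scale * f(x) for `depth` layers.

    Returns the residual-stream norm after each layer. The two scalars are
    fixed here; in a trained model they would be learned (assumption).
    """
    x = [1.0] * dim
    norms = []
    for _ in range(depth):
        update = block_update(x)
        x = [residual_scale * xi + branch_scale * ui for xi, ui in zip(x, update)]
        norms.append(l2_norm(x))
    return norms


# Plain residual connection: the norm grows linearly with depth.
plain = run_stack(depth=32, dim=8, residual_scale=1.0, branch_scale=1.0)
# Scaled residual: the norm converges toward a fixed point instead.
scaled = run_stack(depth=32, dim=8, residual_scale=0.9, branch_scale=1.0)
print(round(plain[-1], 2), round(scaled[-1], 2))
```

In this toy setup the extra cost is two scalars per layer, which is consistent with the "negligible parameter cost" claim, although the real mechanism may be per-channel or otherwise different.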

Schematic of the ZAYA1-8B architecture, which combines CCA with the new router
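The difference between a linear router and an MLP-based router can be sketched as follows. This is a generic illustration in pure Python, not Zyphra's router: the hidden size, the ReLU nonlinearity, the top-k selection with renormalized weights, and all function names are assumptions made for the example.

```python
import math
import random


def dense(x, weights):
    """Matrix-vector product: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def mlp_router_logits(x, w_in, w_out):
    """Expert logits from a one-hidden-layer MLP (ReLU), not a single matrix."""
    hidden = [max(0.0, h) for h in dense(x, w_in)]
    return dense(hidden, w_out)


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_experts(probs, k):
    """Pick the k most probable experts and renormalize their weights."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:k]
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]


rng = random.Random(0)
d_model, d_hidden, n_experts = 8, 16, 4  # toy sizes (assumptions)
x = [rng.uniform(-1, 1) for _ in range(d_model)]
w_lin = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(n_experts)]
w_in = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]
w_out = [[rng.uniform(-1, 1) for _ in range(d_hidden)] for _ in range(n_experts)]

# Linear router: logits are a single projection of the token representation.
linear_choice = select_experts(softmax(dense(x, w_lin)), k=2)
# MLP router: logits pass through a nonlinear hidden layer first.
mlp_choice = select_experts(softmax(mlp_router_logits(x, w_in, w_out)), k=2)
print(linear_choice, mlp_choice)
```

The sketch only shows where the extra capacity sits; it does not reproduce the routing-stability benefit that the post attributes to the MLP router.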

Pretraining

Uniquely, ZAYA1-8B was pretrained entirely on AMD hardware and networking, using 1,024 MI300x nodes with AMD Pensando Pollara interconnect on a custom training cluster built with IBM. Our pretraining and cluster design are described in depth in our previous technical report on ZAYA1-base.

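Post-training

Zyphra's novel large-scale post-training pipeline is also a core component of ZAYA1-8B's performance. The pipeline consists of five stages, each focused on sequentially improving ZAYA1-8B's capabilities. The first SFT (supervised fine-tuning) stage...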

Original text (English)

ZAYA1-8B: Frontier intelligence density, trained on AMD

May 5, 2026, San Francisco, California

Zyphra releases ZAYA1-8B, an AMD-trained MoE model which performs strongly on complex reasoning, mathematics, and coding tasks.

Robert Washbourne, Rishi Iyer, Tomás Figliolia, Henry Zheng, Ryan Lorig-Roach, Sungyeon Yang, Pritish Yuvraj, Quentin Anthony, Yury Tokpanov, Xiao Yang, Ganesh Nanduru, Stephen Ebert, Praneeth Medepalli, Skyler Szot, Srivatsan Rajagopal, Alex Ong, Bhavana Mehta, Beren Millidge

[Read technical report] [Hugging Face]

Introduction

Mathematical and coding performance of ZAYA1-8B vs substantially larger open-weight and proprietary reasoning models.

Today Zyphra is releasing ZAYA1-8B, the first MoE model pretrained, midtrained, and supervised fine-tuned on an AMD Instinct™ MI300 stack. ZAYA1-8B delivers frontier intelligence density per active parameter and outperforms substantially larger open-weight models on certain mathematics and coding benchmarks.

At under 1 billion active parameters, ZAYA1-8B performs strongly on reasoning, mathematics and coding benchmarks, matching or exceeding the performance of models many times its size such as Mistral-Small-4-119B, and remaining competitive with substantially larger first-generation frontier reasoning models such as DeepSeek-R1-0528, Gemini-2.5-Pro and Claude 4.5 Sonnet.

With our novel Markovian-RSA test-time compute methodology, we achieve significant additional performance gains, exceeding Claude 4.5 Sonnet and GPT-5-High on HMMT'25 (89.6 vs 88.3) and closing in on frontier open-weight models such as DeepSeek-V3.2 on mathematics benchmarks.

ZAYA1-8B’s performance is a testament to Zyphra’s innovations across the full stack from model architecture, pretraining and optimization, to post-training and large-scale RL.
Moreover, its strength demonstrates the power of our post-training stack, and we are excited to continue to scale our efforts here, both in terms of model size and the breadth and diversity of domains.

ZAYA1-8B is available today as a serverless endpoint on Zyphra Cloud.

Performance

ZAYA1-8B also performs competitively against recent SOTA open-source models in the same weight class and against many substantially larger open-source models across a wide range of evaluations such as mathematics (AIME and HMMT), coding (LCB), reasoning and knowledge retrieval (GPQA-Diamond) and instruction following (IFEval and IFBench).

ZAYA1-8B vs leading open-weights models on a variety of evals.

Architecture

ZAYA1-8B achieves its efficiency through a combination of unique architecture, pretraining methodology, and reinforcement learning pipeline, with innovations at each level of the stack optimized toward a single objective: maximizing the intelligence extracted per parameter and per FLOP of the final model.

ZAYA1-8B demonstrates three key architectural changes: Compressed Convolutional Attention (CCA), a substantially more efficient and performant attention variant developed by Zyphra; a novel MLP-based router for expert selection that improves routing stability over linear routers; and learned residual scaling, which controls residual-norm growth through depth at negligible parameter and FLOP cost. Together, these form the base of ZAYA1-8B's intelligence efficiency.

A schematic of the architecture of ZAYA1-8B which combines CCA with our novel router.

Pretraining

Uniquely, ZAYA1-8B was pretrained entirely on AMD hardware and networking using a cluster of 1,024 MI300x nodes with AMD Pensando Pollara interconnect on a custom training cluster built with IBM. Our pretraining and cluster design are described in depth in our previous technical report on ZAYA1-base.

Post-training

Zyphra’s novel large-scale post-training pipeline is also a core component of ZAYA1-8B’s performance.
Our pipeline consists of five stages, each focused on sequentially improving the capabilities of ZAYA1-8B. The first SFT stage focused on basic chat, instruction-following (IF), code, math, and test-time-compute (TTC) abilities. This was followed by a reasoning warmup stage combining mathematical tasks, logic and puzzle solving with TTC prompts, to train the model to natively self-aggregate candidate solutions. Next came a large RLVE-Gym phase with dynamically adjusted puzzle difficulty to train core reasoning circuits. The model then underwent large-scale math and code RL designed to improve its knowledge and reasoning skills in these fundamental domains. Finally, there was a relatively lightweight RLHF/RLAIF phase which focused on improving the model’s chat capabilities and behavior, as well as on less verifiable rewards such as instruction following and writing style.

We observe substantial improvements across many capabilities during our RL phase, with improvements especially focused on mathematics, instruction-following, and coding tasks; however, we also saw smaller improvements in multiple-choice knowledge retrieval (MMLU and GPQA) and non-verifiable tasks such as creative writing.

Improvements observed during the RL phase from the SFT checkpoint. We observe the most substantial capability boost on mathematics and coding but also obtained smaller boosts on instruction-following and creative writing.

Markovian RSA

Alongside ZAYA1-8B, we also introduce a novel test-time-compute (TTC) scheme called Markovian RSA that was used to train ZAYA1-8B. Markovian RSA combines two ideas: the RSA idea of generating multiple traces in parallel and then aggregating them recursively, and the Markovian thinker idea of performing reasoning in chunks of a fixed duration, after which only the tail end of the previous chunk is passed on to the next chunk in the sequence, thus keeping the context window at a fixed size despite potentially unlimited reasoning.
With Markovian RSA, we combine these ideas: for each prompt, we first generate multiple traces in parallel, extract fixed-length tail segments from the traces, and then create new aggregation prompts by sub-sampling a few references from the candidate pool. These aggregated prompts are then used as the seed to generate the next round of parallel responses. As a result, Markovian RSA has favorable inference properties: rollout generation can be done in parallel, taking advantage of batching, while the Markovian chunking strategy ensures that no matter how long the model reasons in its intermediate chains of thought, the context length always remains bounded.

A schematic of the Markovian RSA process

ZAYA1-8B was trained to understand and respond to the Markovian RSA aggregation prompts and chunking methodology starting in SFT, where we synthetically constructed prompts reflecting the desired behaviour, and then also during RL, where we trained Markovian RSA self-aggregation behavior on some portion of prompts.

We found that for ZAYA1-8B, Markovian RSA substantially boosts performance, especially on challenging mathematical reasoning tasks. In the headline figure, we demonstrate that, with Markovian RSA on a 40k-token budget for intermediate chains of thought and with only the last 4K tokens forwarded to the next iteration, ZAYA1-8B can approach the level of open-weight frontier models such as DeepSeek-V3.2 and Qwen3-A22B, and is only a few points away from GPT-5-High. Furthermore, with a Markovian RSA configuration using extra-high compute, ZAYA1-8B surpasses DeepSeek-V3.2 and GPT OSS 120B (high) on APEX-shortlist.

We continue to observe performance gains when scaling test-time compute. With extra-high-TTC (5.5M tokens per problem), ZAYA1-8B outperforms DeepSeek-v3.2 and GPT-OSS-High on the challenging APEX-shortlist mathematics benchmark.

We found that training ZAYA1-8B to understand the Markovian-RSA harness was important to achieving this performance.
When we applied the same methodology to Qwen3-4B-Thinking-2507, the performance uplift was substantially less, highlighting the importance of co-design of the eventual model harness and the post-
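The Markovian RSA loop described above (parallel traces, fixed-length tails, sub-sampled aggregation prompts) can be sketched roughly as follows. This is a toy sketch under stated assumptions, not Zyphra's implementation: `toy_generate` is a hypothetical stand-in for a model call, tokens are plain strings, the parameter names are invented for the example, and the sizes are scaled down from the 40k-token budget and 4K-token tail mentioned in the post. How a final answer is chosen from the last round is not specified in the text, so the sketch simply returns the final tails.

```python
import random


def toy_generate(prompt, rng, chunk_budget):
    """Hypothetical stand-in for a model call: returns a list of fake tokens."""
    length = rng.randint(chunk_budget // 2, chunk_budget)
    return [f"t{rng.randint(0, 9)}" for _ in range(length)]


def markovian_rsa(question, generate, rounds, population, refs_per_prompt,
                  chunk_budget, tail_len, seed=0):
    """Run `rounds` of generate -> truncate to tail -> re-aggregate.

    Returns the final round's tails and the longest prompt ever fed to the
    generator, to show that context stays bounded.
    """
    rng = random.Random(seed)
    prompts = [list(question) for _ in range(population)]
    longest_prompt = 0
    tails = []
    for _ in range(rounds):
        longest_prompt = max(longest_prompt, max(len(p) for p in prompts))
        # 1. One trace per prompt; the calls are independent, so batchable.
        traces = [generate(p, rng, chunk_budget) for p in prompts]
        # 2. Markovian step: keep only a fixed-length tail of each trace.
        tails = [trace[-tail_len:] for trace in traces]
        # 3. RSA step: each new prompt aggregates a few sub-sampled tails.
        prompts = []
        for _ in range(population):
            refs = rng.sample(tails, refs_per_prompt)
            prompt = list(question)
            for ref in refs:
                prompt.extend(ref)
            prompts.append(prompt)
    return tails, longest_prompt


question = ["solve", "x"]
tails, longest_prompt = markovian_rsa(
    question, toy_generate, rounds=3, population=4,
    refs_per_prompt=2, chunk_budget=40, tail_len=4)
# Context bound: question length + refs_per_prompt * tail_len.
bound = len(question) + 2 * 4
print(longest_prompt, bound)
```

The point of the sketch is the bound in the last lines: however long each intermediate trace runs, the prompt for the next round never exceeds the question plus a fixed number of fixed-length tails.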

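Open-source models, AMD infrastructure, MoE architecture, Lightweight models, Benchmarks

ZAYA1-8B matches DeepSeek's math performance with under 1 billion active parameters

ZAYA1-8B, an 8.4-billion-parameter MoE model that Zyphra trained on an AMD GPU cluster, uses 760 million active parameters to surpass DeepSeek-R1 on math benchmarks and deliver performance comparable to Claude Sonnet 4.5. This shows that frontier AI models can be developed without relying exclusively on NVIDIA infrastructure, and that performance can be maintained while active parameters are cut drastically.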
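As a quick check on the figures quoted in this summary, the share of parameters active per token follows directly from the two numbers given (8.4 billion total, 760 million active). The variable names are only for illustration.

```python
total_params = 8.4e9    # total MoE parameters, as quoted in the summary
active_params = 0.76e9  # active parameters per token, as quoted

active_fraction = active_params / total_params
print(f"{active_fraction:.1%} of parameters are active per token")
```

Under these figures, roughly one parameter in eleven participates in each forward pass, which is what the "under 1 billion active parameters" framing refers to.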

Open-source models, MoE architecture, AMD infrastructure