MarkTechPost • 68 days ago

Cohere Releases a 218B Agentic Model That Runs on Two H100s

IMP

8/10

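Key Summary

Cohere has released Command A+, an open-source 218B-parameter MoE model optimized for enterprise agentic workflows. The model combines reasoning, retrieval-augmented generation (RAG), multilingual, and multimodal document processing in a single model, and it runs on as few as two H100 GPUs, which makes it an efficient option for AI practitioners.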

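Translated Body

Cohere has announced Command A+, an open-source model targeting enterprise agentic workflows. Released under the Apache 2.0 license, Command A+ is a Mixture-of-Experts (MoE) model designed for high-performance agentic tasks with minimal compute overhead. The model is optimized for reasoning, agentic workflows, RAG (retrieval-augmented generation), multilingual, and multimodal document processing. It also unifies the capabilities of four earlier models (Command A, Command A Reasoning, Command A Vision, and Command A Translate) into a single scalable model.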

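Architecture

Command A+ is a decoder-only sparse Mixture-of-Experts (Sparse MoE) Transformer with 218 billion (218B) total parameters and 25 billion (25B) active parameters. Of its 128 experts, 8 are active per token, and a single shared expert is applied to every token. In an MoE model, each token passes through only a subset of the expert sub-networks rather than the full parameter set, so active compute at inference time stays at the 25B-parameter scale. The attention layers interleave sliding-window attention layers (with rotary positional embeddings, RoPE) and global attention layers without positional embeddings in a 3:1 ratio. The sparse MoE layers are trained in a fully dropless manner and use a token-choice router with a normalized sigmoid over each token's top-k expert logits. Input modalities are text, image, and tool use; output modalities are text, reasoning, and tool use. The model supports a 128K input context length and a 64K maximum generation length.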
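The routing step described above can be illustrated with a minimal sketch. It assumes one plausible reading of "normalized sigmoid over the top-k expert logits": pick the k highest logits, apply a sigmoid to each, and rescale the gates so they sum to 1. The function names, the scalar "experts", and the way the shared expert is added are all hypothetical simplifications, not Cohere's implementation.

```python
import math


def route_token(logits, k=8):
    """Return {expert_index: weight} for one token (illustrative only)."""
    # Token-choice routing: the token picks its k highest-scoring experts.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Sigmoid gate for each selected expert.
    gates = {i: 1.0 / (1.0 + math.exp(-logits[i])) for i in top}
    # Normalize so the selected gates sum to 1 (assumed normalization).
    total = sum(gates.values())
    return {i: g / total for i, g in gates.items()}


def moe_output(x, logits, experts, shared_expert, k=8):
    """Combine the always-on shared expert with the k routed experts."""
    weights = route_token(logits, k)
    routed = sum(w * experts[i](x) for i, w in weights.items())
    # Assumption: the shared expert's output is simply added to the routed sum.
    return shared_expert(x) + routed


if __name__ == "__main__":
    logits = [0.01 * i for i in range(128)]  # 128 experts, toy router scores
    weights = route_token(logits, k=8)
    print(sorted(weights))        # indices of the 8 selected experts
    print(sum(weights.values()))  # approximately 1.0
```

Only 8 of the 128 routed experts (plus the shared one) run for a given token, which is why active compute stays far below the total parameter count.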

Hardware Requirements and Quantization

Three quantization variants are available, each with a minimum GPU requirement. BF16 (16-bit) requires 4× B200 or 8× H100 GPUs, FP8 (8-bit) requires 2× B200 or 4× H100 GPUs, and W4A4 (4-bit) runs on a single B200 or 2× H100 GPUs. All three quantizations show negligible differences in benchmark quality. Cohere recommends W4A4 for most deployments.
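A back-of-the-envelope check shows why these GPU counts are plausible. The sketch below estimates weight memory only, from 218B parameters at the stated bit width. The per-GPU capacities (80 GB for H100, 192 GB for B200) are assumptions not given in the article, and the W4A4 figure is a lower bound because only the MoE experts are quantized to 4 bits. KV cache and activations need additional headroom.

```python
TOTAL_PARAMS = 218e9  # total parameters, from the article

BITS_PER_PARAM = {"BF16": 16, "FP8": 8, "W4A4": 4}

# Assumed per-GPU memory in GB (not stated in the article).
GPU_MEM_GB = {"H100": 80, "B200": 192}

# Minimum GPU counts as reported in the article.
MIN_GPUS = {
    "BF16": {"H100": 8, "B200": 4},
    "FP8": {"H100": 4, "B200": 2},
    "W4A4": {"H100": 2, "B200": 1},
}


def weight_gb(fmt):
    """Approximate weight footprint in GB if every parameter used this width."""
    return TOTAL_PARAMS * BITS_PER_PARAM[fmt] / 8 / 1e9


if __name__ == "__main__":
    for fmt, per_gpu in MIN_GPUS.items():
        for gpu, count in per_gpu.items():
            capacity = count * GPU_MEM_GB[gpu]
            print(f"{fmt}: ~{weight_gb(fmt):.0f} GB of weights, "
                  f"{count}x {gpu} = {capacity} GB total")
```

Under these assumptions the weights come to roughly 436 GB, 218 GB, and 109 GB, each below the combined memory of the listed configuration.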

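W4A4 Quantization Methodology

Cohere applies NVFP4 W4A4 quantization (4-bit weights and activations with two-level scaling) to the MoE experts only. The attention path, including the Q/K/V/O projections, the KV cache, and the attention computation, is kept at full precision. To close the residual quality gap, Cohere uses Quantization-Aware Distillation (QAD) in the post-training phase. The quantized student model is trained to match the output distribution of the full-precision teacher model, using fake quantization operators in the forward pass and straight-through estimators in the backward pass.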
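To make "fake quantization" and "two-level scaling" concrete, here is a simplified pure-Python sketch. It assumes the 4-bit values are E2M1 magnitudes (0, 0.5, 1, 1.5, 2, 3, 4, 6) and combines a per-tensor scale with a per-block scale. Real NVFP4 kernels store the block scale in FP8 and work on fixed-size blocks, whereas this sketch keeps both scales as ordinary floats. The function names are hypothetical, and this is not Cohere's training code.

```python
import math

# Magnitudes representable by a 4-bit E2M1 value (assumed grid).
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def snap_to_fp4(v):
    """Round v to the nearest representable FP4 magnitude, keeping its sign."""
    mag = min(FP4_GRID, key=lambda g: abs(abs(v) - g))
    return math.copysign(mag, v)


def fake_quantize(values, block_size=16):
    """Quantize then immediately dequantize, so the output stays in floats."""
    tensor_amax = max((abs(v) for v in values), default=0.0)
    if tensor_amax == 0.0:
        return list(values)
    tensor_scale = tensor_amax / FP4_GRID[-1]  # level 1: one scale per tensor
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        block_amax = max(abs(v) for v in block)
        block_scale = block_amax / tensor_amax  # level 2: relative block scale
        scale = tensor_scale * block_scale
        if scale == 0.0:
            out.extend(0.0 for _ in block)
            continue
        out.extend(snap_to_fp4(v / scale) * scale for v in block)
    return out


def ste_grad(upstream_grad):
    """Straight-through estimator: treat rounding as identity in backprop."""
    return upstream_grad


if __name__ == "__main__":
    weights = [0.1, -0.7, 2.4, 6.0, 3.0, 1.4, -0.2, 0.75]
    print(fake_quantize(weights, block_size=4))
```

In a QAD setup, the student would run its forward pass through something like `fake_quantize`, while the gradient of the distillation loss flows back through the rounding step unchanged, as `ste_grad` suggests.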

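Performance vs. Prior Command A Models

On τ²-Bench Telecom, the score rose from 37% for Command A Reasoning to 85%, and agentic coding performance on Terminal-Bench Hard rose from 3% to 25%. On internal North platform evaluations, all scored using LLM-as-a-judge techniques, Agentic Question Answering (QA) accuracy improved by 20% over Command A Reasoning. Agentic QA measures how well the model answers enterprise questions using MCP-connected cloud file systems. Spreadsheet analysis quality improved by 32%, and Memory Usage Quality, which measures how well an agent uses information from a previous session to answer questions in a later session, reached 54% with Command A+ compared with 39% for Command A Reasoning.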

Command A+ is Cohere's first multimodal reasoning model. It scored 63% on MMMU Pro and 75.1% on MMMU, compared with 65.3% for Command A Vision on MMMU. MathVista scores improved from 73.5% to 80.6%, and CharXiv reasoning improved from 46.9% to 52.7%. Command A+ also expands multilingual coverage from 23 to 48 languages, with gains in machine translation and multilingual reasoning. It scored 37 on the Artificial Analysis Intelligence Index, outperforming other leading open models.

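Speed and Latency

At the same quantization and concurrency levels, Command A+ delivers up to 63% higher output tokens per second (TOPS) and reduces time to first token (TTFT) by up to 17% compared with Command A Reasoning.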

Original Text (English)

Cohere just released Command A+, an open-source model targeting enterprise agentic workflows. Available under an Apache 2.0 license, Command A+ is a mixture-of-experts (MoE) model built for high-performance agentic tasks with minimal compute overhead. The model is optimized for reasoning, agentic workflows, RAG, multilingual, and multimodal document processing. It unifies capabilities from four prior models (Command A, Command A Reasoning, Command A Vision, and Command A Translate) into a single scalable model.

Architecture

Command A+ is a decoder-only Sparse Mixture-of-Experts Transformer with 218B total parameters and 25B active parameters. It has 128 experts, of which 8 are active per token, and a single shared expert is applied to all tokens. In an MoE model, each token is routed through only a subset of expert sub-networks rather than the full parameter set, keeping active compute at 25B-parameter scale at inference time. The attention layers interleave sliding-window attention layers with Rotary Positional Embeddings (RoPE) and global attention layers without positional embeddings in a 3:1 ratio. The sparse MoE layer is trained in a fully dropless manner and uses a token-choice router, with a normalized sigmoid over the top-k expert logits per token. Input modalities are text, image, and tool use. Output modalities are text, reasoning, and tool use. The model supports a 128K input context length and a 64K max generation length.

Hardware Requirements and Quantization

Three quantization variants are available with minimum GPU requirements: BF16 (16-bit) requires 4× B200 or 8× H100 GPUs; FP8 (8-bit) requires 2× B200 or 4× H100 GPUs; W4A4 (4-bit) runs on a single B200 or 2× H100 GPUs. All three quantizations show negligible differences in benchmark quality.
Cohere recommends W4A4 for most deployments.

W4A4 Quantization Methodology

Cohere applies NVFP4 W4A4 quantization (4-bit weights and activations with two-level scaling) to the MoE experts only. The attention path, including Q/K/V/O projections, the KV cache, and attention compute, is kept at full precision. To close residual quality gaps, Cohere uses Quantization-Aware Distillation (QAD) in the post-training phase: the quantized student model is trained to match the full-precision teacher's output distribution, using fake quantization operators in the forward pass and straight-through estimators on the backward pass.

Performance vs. Prior Command A Models

On τ²-Bench Telecom, scores improved from 37% to 85% over Command A Reasoning, and Terminal-Bench Hard agentic coding performance reached 25% from 3%. On internal North platform evaluations, all scored using LLM-as-a-judge techniques, Agentic Question Answering accuracy improved by 20% over Command A Reasoning. Agentic QA measures how well the model answers enterprise questions using MCP-connected cloud file systems. Spreadsheet analysis quality improved by 32%, and Memory Usage Quality, which measures how well an agent leverages information from a previous session to answer questions in a subsequent session, scored 54% with Command A+ compared to 39% with Command A Reasoning.

Command A+ is Cohere's first multimodal reasoning model. It achieved 63% on MMMU Pro and 75.1% on MMMU, compared with 65.3% for Command A Vision on the latter. MathVista scores improved from 73.5% to 80.6%, and CharXiv reasoning improved from 46.9% to 52.7%. Command A+ expands multilingual coverage from 23 to 48 languages, with gains in machine translation and multilingual reasoning. Command A+ scored 37 on the Artificial Analysis Intelligence Index, outperforming other leading open models.
Speed and Latency

At the same quantization and concurrency levels, Command A+ delivers up to 63% higher Output Tokens per Second (TOPS) and reduces Time To First Token (TTFT) by up to 17% compared with Command A Reasoning. The W4A4 quantization contributes an additional 47% increase in speed and a 13% reduction in latency. Speculative decoding, optimized specifically for the MoE architecture, delivers an additional 1.5–1.6× inference speedup for both text and multimodal inputs.

Tokenizer

Command A+ is the first model to use Cohere's latest tokenizer, reducing the number of tokens required to generate the same response. Tokenization efficiency improved by 20% for Arabic, 16% for Korean, and 18% for Japanese.

Getting Started

The model is supported by vLLM and Transformers. Tool use is handled through chat templates in Transformers using JSON schema for tool descriptions. When reasoning is enabled, the model generates thinking traces between <|START_THINKING|> and <|END_THINKING|> tags before producing a final answer. The W4A4 variant requires vLLM ≥0.21.0 and cohere_melody>=0.9.0 for accurate response parsing. Cohere recommends the following sampling parameters: temperature=0.9, top_p=0.95, and repetition_penalty=1.04.

Key Takeaways

Command A+ has 218B total / 25B active parameters in a Sparse MoE architecture, released under Apache 2.0.
W4A4 applies NVFP4 quantization to MoE experts only with QAD post-training, running on 2× H100s.
τ²-Bench Telecom improved from 37% to 85%; Terminal-Bench Hard from 3% to 25% vs. Command A Reasoning.
TOPS increased up to 63% and TTFT reduced up to 17% vs. Command A Reasoning at matching quantization.
Command A+ is Cohere's first multimodal reasoning model, expanding language support from 23 to 48 languages.
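As a rough illustration of the Getting Started notes above, the sketch below separates a thinking trace from the final answer using the two tags named in the article, and records the recommended sampling parameters in a plain dictionary. The helper name `split_thinking` is hypothetical. The article points to cohere_melody for accurate response parsing, so this simple string split is only a stand-in that ignores edge cases such as tool calls or nested tags.

```python
START_TAG = "<|START_THINKING|>"
END_TAG = "<|END_THINKING|>"

# Sampling parameters recommended in the article.
RECOMMENDED_SAMPLING = {
    "temperature": 0.9,
    "top_p": 0.95,
    "repetition_penalty": 1.04,
}


def split_thinking(text):
    """Return (thinking_trace, final_answer) from raw model output."""
    start = text.find(START_TAG)
    end = text.find(END_TAG)
    # No well-formed thinking span: treat the whole output as the answer.
    if start == -1 or end == -1 or end < start:
        return "", text.strip()
    thinking = text[start + len(START_TAG):end].strip()
    answer = (text[:start] + text[end + len(END_TAG):]).strip()
    return thinking, answer


if __name__ == "__main__":
    raw = "<|START_THINKING|>Check the spreadsheet totals.<|END_THINKING|>The total is 42."
    trace, answer = split_thinking(raw)
    print("thinking:", trace)
    print("answer:", answer)
```

The dictionary keys match the parameter names given in the article, so they can be passed on to whichever inference library is used, provided it accepts parameters with those names.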

Large Language Models • Agentic AI • Open Source • Cohere • MoE