Hacker News • 60 days ago

Liquid AI releases an 8B MoE model trained on 38T tokens

IMP

8/10

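Key summary

Liquid AI has announced LFM2.5-8B-A1B, a new mixture-of-experts (MoE) language model optimized for edge environments. The model was pretrained on 38 trillion tokens, more than three times the 12 trillion used for its predecessor. Its context window has grown to 128K tokens and its reasoning performance is substantially improved. Its main selling points are a doubled vocabulary that tokenizes non-Latin-script languages more efficiently and large-scale reinforcement learning, which together give it strong on-device performance even on lightweight consumer hardware.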

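Translated text

Today, we're releasing LFM2.5-8B-A1B, an edge model built for fast, reliable tool calling on consumer hardware. It builds on LFM2-8B-A1B, released in October 2025, with an expanded 128K context window, scaled-up pretraining (from 12 trillion to 38 trillion tokens), and large-scale reinforcement learning. We also doubled its vocabulary to improve tokenization efficiency for non-Latin languages. The result is a model that chains multiple tool calls to complete tasks and runs comfortably even on an entry-level laptop.

The base model (LFM2.5-8B-A1B-Base) and the post-trained model (LFM2.5-8B-A1B) are available today on Hugging Face and our Playground. See the official documentation for details on running and fine-tuning the models locally.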

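Highlights

On-device personal assistant: designed to power real-life applications, follow complex instructions, and chain multiple tool calls on all devices.
Compressed performance: competitive with much larger dense and MoE models on instruction following and agentic tasks.
Unmatched throughput: fastest in its size class on both CPU and GPU inference, with day-one support for llama.cpp, MLX, vLLM, and SGLang.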

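What changed since LFM2-8B-A1B

Compared to its predecessor, the new version expands the context window from 32,768 to 128,000 tokens. This allows the model to process longer documents and reason for longer. The vocabulary size was also increased from 65,536 to 128,000 to tokenize non-Latin scripts more efficiently. Compression gains are particularly strong in Hindi, Thai, Vietnamese, Indonesian, and Arabic. The rest of the architecture follows the same combination of MoE, GQA, and gated short convolution blocks as LFM2-8B-A1B, as shown in the figure below.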

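Unlike its predecessor, LFM2.5-8B-A1B is a reasoning-only model that produces an explicit chain of thought before its final answer. We adopted this strategy because MoE models generally run in compute-bound settings, where a smaller number of active parameters makes each reasoning token cheap. This provides a significant quality boost without compromising speed.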
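To make the "cheap reasoning token" argument concrete, here is a rough back-of-the-envelope sketch. It assumes the common rule of thumb of about 2 forward-pass FLOPs per active parameter per token, and it takes the nominal 8B total and 1B active sizes from the model name; neither the rule nor these exact counts comes from the post, and attention cost and memory bandwidth are ignored.

```python
def forward_flops_per_token(active_params: float) -> float:
    """Rule-of-thumb forward-pass cost: about 2 FLOPs per active parameter."""
    return 2.0 * active_params


# Nominal sizes read off the model name (assumption, not measured counts).
TOTAL_PARAMS = 8e9
ACTIVE_PARAMS = 1e9

moe_cost = forward_flops_per_token(ACTIVE_PARAMS)
dense_cost = forward_flops_per_token(TOTAL_PARAMS)

print(f"MoE (1B active): {moe_cost:.1e} FLOPs/token")
print(f"Dense 8B:        {dense_cost:.1e} FLOPs/token")
print(f"Ratio:           {dense_cost / moe_cost:.0f}x")
```

Under these assumptions a chain-of-thought token costs roughly an eighth of what it would in a dense model of the same total size, which is the intuition behind pairing sparse activation with longer reasoning traces.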

Thanks to reasoning and scaled-up training, the new version performs significantly better:

[Benchmark comparison table: LFM2-8B-A1B → LFM2.5-8B-A1B (Δ)]
AA-Omniscience Index: -78.42 → -24.70 (+53.62)
AA-Omniscience Accuracy: 7.33 → 8.67 (+1.34)
AA-Omniscience Non-Hallucination Rate: 7.46 → 63.47 (+56.01)
IFEval: 79.44 → 91.84 (+12.40)
IFBench: 26.00 → 56.47 (+30.47)
Multi-IF: 58.54 → 79.93 (+21.39)
MATH500: 74.80 → 88.76 (+13.96)
AIME25: 20.00 → 42.53 (+22.53)
BFCLv3: 45.07 → 64.36 (+19.29)
BFCLv4: 25.52 → 48.50 (+22.98)
Tau² Telecom: 13.60 → 88.07 (+74.47)
Tau² Retail: 7.02 → 39.82 (+32.80)

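Training highlights

Tokenizer expansion: LFM2-8B-A1B was originally trained with a 65K BPE tokenizer optimized for our initial language coverage. To better support non-Latin scripts in LFM2.5, we doubled the vocabulary to 128K by extending the existing tokenizer in place rather than retraining the model from scratch. We continued BPE merge training from the original merges on a multilingual corpus. This keeps most existing token IDs unchanged and makes every new token decompose deterministically into a sequence of original sub-tokens. We initialize each new embedding row as the mean of its sub-token decomposition and copy the shared rows unchanged. We then recover quality through a brief two-stage adaptation: embedding-only training, followed by full-model continued pretraining.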
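The embedding initialization described above can be sketched in a few lines. This is a minimal illustration, not Liquid AI's implementation: the function name, the plain-list representation, and the simplification that new tokens are appended after all old IDs are assumptions (the post only says that most existing IDs are preserved).

```python
def expand_embeddings(old_rows, decompositions):
    """Build the embedding table for an extended vocabulary.

    old_rows: list of embedding vectors, one per original token ID.
    decompositions: for each NEW token, the list of original token IDs
        it decomposes into (hypothetical input format).
    """
    dim = len(old_rows[0])
    # Shared rows are copied unchanged.
    new_rows = [list(row) for row in old_rows]
    # Each new row is the mean of the rows of its sub-token decomposition.
    for sub_ids in decompositions:
        mean_row = [
            sum(old_rows[i][d] for i in sub_ids) / len(sub_ids)
            for d in range(dim)
        ]
        new_rows.append(mean_row)
    return new_rows


old = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
# Two hypothetical new tokens: one merges IDs 0+1, the other IDs 1+2+2.
expanded = expand_embeddings(old, [[0, 1], [1, 2, 2]])
print(len(expanded), expanded[3])
```

Starting each new row at the mean of its sub-tokens keeps it in the same region of embedding space the model already understands, which is presumably why only a short adaptation phase is needed afterwards.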

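The table below reports chars/token for each language, roughly how much text each token carries. Higher is better, and the new tokenizer is more efficient in all 16 languages.

[Tokenizer efficiency by language] Arabic, German, English, Spanish, French, Hindi, Indonesian, Italian, Japanese, Korean, Polish, Portuguese, Russian, Thai, Vietnamese, Chinese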
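For readers who want to reproduce the metric, the following is a small sketch of how chars/token and the relative improvement can be computed. The helper names are hypothetical, and counting characters as Unicode code points (Python's `len`) is an assumption, since the post does not say how characters are counted. The sample values are three rows taken from the table in the original English post.

```python
def chars_per_token(text: str, token_ids: list) -> float:
    """Average number of characters (code points) carried by each token."""
    return len(text) / len(token_ids)


def improvement_pct(old: float, new: float) -> float:
    """Relative gain of the new chars/token value over the old one, in percent."""
    return (new / old - 1.0) * 100.0


# (old tokenizer, new tokenizer) chars/token as reported in the original post.
reported = {
    "Hindi": (0.961, 2.118),
    "Thai": (0.671, 2.269),
    "Korean": (1.652, 1.943),
}

for language, (old, new) in reported.items():
    print(f"{language}: {improvement_pct(old, new):+.1f}%")
```

With the values above this prints gains of about +120.4%, +238.2%, and +17.6%, matching the improvement row of the original table.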

Original text (English)

LFM2.5-8B-A1B: an Even Better on-Device Mixture-of-Experts

Authors: Liquid AI
Published: May 28, 2026

Today, we're releasing LFM2.5-8B-A1B, an edge model built for fast, reliable tool calling on consumer hardware. It builds on our LFM2-8B-A1B release from October 2025, with an expanded 128K context window, scaled-up pretraining (from 12T to 38T tokens), and large-scale reinforcement learning. We also doubled its vocabulary to improve tokenization efficiency for non-Latin languages. The result is a model that chains tool calls, achieves tasks, and fits comfortably even on an entry-level laptop.

The base (LFM2.5-8B-A1B-Base) and post-trained (LFM2.5-8B-A1B) models are available today on Hugging Face and our Playground. Check out our docs on how to run and fine-tune them locally.

Highlights

On-device personal assistant. Designed to power real-life applications, chaining tool calls, and following complex instructions on all devices.
Compressed performance. Competitive with much larger dense and MoE models on instruction following and agentic tasks.
Unmatched throughput. Fastest in its size class on both CPU and GPU inference, with day-one support for llama.cpp, MLX, vLLM, and SGLang.

What changed since LFM2-8B-A1B

Compared to LFM2-8B-A1B, this new version expands the context window from 32,768 to 128,000 tokens. This allows the model to process longer documents and reason for longer. Its vocabulary size was also scaled up from 65,536 to 128,000 to tokenize non-Latin scripts more efficiently. We see particularly strong compression gains in Hindi, Thai, Vietnamese, Indonesian, and Arabic. The rest of the architecture follows the same combination of MoE, GQA, and gated short convolution blocks as LFM2-8B-A1B, as shown in the following figure.

Unlike its predecessor, LFM2.5-8B-A1B is a reasoning-only model, producing an explicit chain of thought before its final answer. We adopted this strategy because MoE models generally run in compute-bound settings, where a smaller number of active parameters makes each reasoning token cheap. This provides a significant quality boost without compromising speed.

Thanks to reasoning and scaled-up training, this new version performs significantly better:

Benchmark                               LFM2-8B-A1B   LFM2.5-8B-A1B   Δ
AA-Omniscience Index                    -78.42        -24.70          +53.62
AA-Omniscience Accuracy                 7.33          8.67            +1.34
AA-Omniscience Non-Hallucination Rate   7.46          63.47           +56.01
IFEval                                  79.44         91.84           +12.40
IFBench                                 26.00         56.47           +30.47
Multi-IF                                58.54         79.93           +21.39
MATH500                                 74.80         88.76           +13.96
AIME25                                  20.00         42.53           +22.53
BFCLv3                                  45.07         64.36           +19.29
BFCLv4                                  25.52         48.50           +22.98
Tau² Telecom                            13.60         88.07           +74.47
Tau² Retail                             7.02          39.82           +32.80

Training highlights

Tokenizer expansion. LFM2-8B-A1B was originally trained with a 65K BPE tokenizer optimized for our initial language coverage. To better support non-Latin scripts in LFM2.5, we doubled the vocabulary to 128K by extending the existing tokenizer in place rather than retraining the model from scratch. We continued BPE merge training from the original merges on a multilingual corpus, which keeps most existing token IDs as identity mappings and makes every new token decompose deterministically into a sequence of original sub-tokens. We initialize the new embedding rows as the mean of their sub-token decompositions and copy the shared rows unchanged. We then recover quality through a brief two-stage adaptation: embedding-only training, followed by full-model continued pretraining.

The table below reports chars/token, roughly how much text each token carries: higher is better, and the new tokenizer is more efficient in all 16 languages.

Language          Old tokenizer   New tokenizer   Improvement
Arabic (ar)       2.239           3.107           +38.8%
German (de)       3.641           3.783           +3.9%
English (en)      4.063           4.137           +1.8%
Spanish (es)      3.442           3.579           +4.0%
French (fr)       3.618           3.759           +3.9%
Hindi (hi)        0.961           2.118           +120.4%
Indonesian (id)   2.731           3.513           +28.6%
Italian (it)      3.251           3.475           +6.9%
Japanese (ja)     1.836           1.963           +6.9%
Korean (ko)       1.652           1.943           +17.6%
Polish (pl)       2.672           2.895           +8.3%
Portuguese (pt)   3.194           3.450           +8.0%
Russian (ru)      2.703           2.876           +6.4%
Thai (th)         0.671           2.269           +238.2%
Vietnamese (vi)   1.519           3.311           +117.9%
Chinese (zh)      1.475           1.620           +9.8%

Context extension. We first extended the context window to 32K through a 2T token midtraining phase focused on reasoning, math, tool-use, and longer documents. We then extended the context to 128K by increasing the RoPE base θ and running an additional 400B token midtraining stage focused on long-document and long-trajectory data.

Doom loops. We added a targeted preference optimization stage to reduce doom loops in long reasoning traces. This stage identifies tokens that tend to trigger looping behavior in specific contexts, then redistributes probability mass toward plausible alternatives, while leaving the rest of the next-token distribution largely intact. During RL, we also added a lightweight shaping reward that discourages excessive use of common loop-inducing restart words like “Wait…”. We'll share more details on the full pipeline, objective, and empirical results in a dedicated blog post.

Hallucinations. Because of their small number of parameters, edge models have a limited knowledge capacity, which leads to more hallucinations. To mitigate hallucinations, we added a targeted RL stage that uses an avg@k-based reward over a diverse knowledge dataset. The goal is to reinforce abstention on queries beyond reliable knowledge while preserving existing knowledge.
This produces a sharper knowledge boundary and clearer expression of uncertainty.

Benchmarks

We evaluated LFM2.5-8B-A1B across benchmarks covering knowledge, instruction following, math, and agentic workflows. The model is competitive with both dense alternatives with a similar total number of parameters and much larger MoEs.

Model                         Parameters   AA-Omniscience Index   Accuracy   Non-Hallucination   IFEval   IFBench   Multi-IF
LFM2.5-8B-A1B                 8B/A1B       -24.70                 8.67       63.47               91.84    56.47     79.93
Granite-4.0-H-Tiny            7B/A1B       -75.50                 9.37       6.38                82.23    21.28     59.00
Qwen3.5-4B                    4B           -51.53                 17.20      16.99               87.80    50.38     67.43
Qwen3-30B-A3B-Thinking-2507   30.5B/3.3B   -51.31                 18.80      13.87               90.82    51.11     79.04
Gemma-4-E2B-IT                5.1B         -72                    7.00       15.05               82.93    33.53     69.70
Gemma-4-E4B-IT                8B           -50.67                 8.10       36.06               87.74    39.48     77.58
Gemma-4-26B-A4B-IT            26B/4B       -62.07                 14.37      10.75               91.40    47.25     82.06
gpt-oss-20b                   21B/3.6B     -49.17                 14.57      24.50               86.73    58.65     76.64

The avg@k-based reward enables LFM2.5-8B-A1B to achieve a significantly lower hallucination rate while maintaining reasonable accuracy. It also leads on instruction following benchmarks, matching bigger MoEs like Gemma 4-26B at a fraction of the active parameter count.

Math and agentic workflows

Model                         Parameters   MATH500   AIME25   AIME26   BFCLv3   BFCLv4   Tau² Telecom   Tau² Retail
LFM2.5-8B-A1B                 8B/A1B       88.76     42.53    50.00    64.79    49.73    88.07          39.82
Granite-4.0-H-Tiny            7B/A1B       59.20     4.93     3.33     56.89    28.52    16.67          18.42
Qwen3.5-4B                    4B           80.76     54.28    58.33    71.06    54.01    87.72          71.93
Qwen3-30B-A3B-Thinking-2507   30.5B/3.3B   86.48     71.67    66.67    73.39    50.53    21.93          56.14
Gemma-4-E2B-IT                5.1B         64.00     26       30       56.44    31.91    22.37          18.95
Gemma-4-E4B-IT                8B           65.00     34.33    40.67    57.31    33.92    26.75          42.11
Gemma-4-26B-A4B-IT            26B/4B       94.20     68.67    72.00    68.87    55.87    42.11          55.26
gpt-oss-20b                   21B/3.6B     92.40     68.53    68.67    62.52    49.88    57.24          53.51

On agentic benchmarks, LFM2.5-8B-A1B is competitive with bigger models and particularly strong on Tau² Telecom.
As agentic harnesses are becoming the main way to consume models, LFM2.5-8B-A1B is a first step towards powering on-device, fully private agents.

Sparse Inference, Everywhere

LFM2.5-8B-A1B ships with day-one support across the inference ecosystem:

LEAP: Liquid's Edge AI Platform for iOS and Android deployment
llama.cpp: GGUF checkpoints for efficient edge inference
MLX: Optimized inference for Apple Silicon
vLLM: GPU-accelerated serving for production throughput
SGLang: GPU-accelerated serving for production thr

on-device-AI mixture-of-experts-model reinforcement-learning agents open-source-model