Hacker News • 102 days ago

Ternary Bonsai: Top-Tier Intelligence at Just 1.58 Bits

IMP

8/10

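Key summary

PrismML has released Ternary Bonsai, a family of 1.58-bit language models whose weights take only three values (-1, 0, +1). The models occupy roughly one-ninth the memory of standard 16-bit models while outperforming most 16-bit models in the same parameter class. They also deliver fast inference and high energy efficiency on edge devices, which could make AI deployment more practical where hardware resources are limited.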

Translated text

Launch: Announcing Ternary Bonsai

Introducing Ternary Bonsai: Top Intelligence at 1.58 Bits

April 16, 2026 • PrismML

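Today we are announcing Ternary Bonsai, a new family of 1.58-bit language models designed to balance strict memory constraints with high accuracy requirements. This release builds on the efficiency frontier we began exploring with the recently released 1-bit Bonsai models. The 1-bit family showed that extreme compression could still produce commercially useful language models. Ternary Bonsai targets a different point on that curve: a modest increase in size for a meaningful gain in performance.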

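The models are available in three sizes: 8B, 4B, and 1.7B parameters. By using ternary weights ({-1, 0, +1}), they achieve a memory footprint approximately 9x smaller than standard 16-bit models while outperforming most peers in their respective parameter classes on standard benchmarks.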

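A true ternary model

Ternary Bonsai implements a 1.58-bit representation throughout the entire network architecture. There are no higher-precision escape hatches: embeddings, attention layers, MLPs, and the LM head all use the same 1.58-bit representation. The models employ a group-wise quantization scheme in which each weight is constrained to one of three values, {-s, 0, +s}. These three states are encoded as (-1, 0, +1) using 1.58 bits per weight, together with a shared FP16 scale factor (s) for each group of 128 weights.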
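The group-wise scheme described above can be sketched in a few lines of plain Python. This is a minimal illustration, not PrismML's method: the post does not say how the scale s is chosen, so the sketch assumes absmean scaling (s is the mean absolute value of the group, as used in some published ternary-quantization work), and the function names `quantize_group`, `quantize`, and `dequantize` are hypothetical. The scale is kept as a Python float here rather than stored as FP16.

```python
import random

GROUP_SIZE = 128  # group size stated in the post


def quantize_group(weights):
    """Map one group of floats to codes in {-1, 0, +1} plus a shared scale s.

    Assumption: s is the mean absolute value of the group (absmean scaling).
    The post does not specify how s is computed.
    """
    s = sum(abs(w) for w in weights) / len(weights)
    if s == 0.0:
        return [0] * len(weights), 0.0
    # Round w/s to the nearest integer, then clip into the ternary range.
    codes = [max(-1, min(1, round(w / s))) for w in weights]
    return codes, s


def quantize(weights, group_size=GROUP_SIZE):
    """Quantize a flat list of weights one group at a time."""
    return [
        quantize_group(weights[start:start + group_size])
        for start in range(0, len(weights), group_size)
    ]


def dequantize(groups):
    """Rebuild approximate weights; each value is -s, 0, or +s for its group."""
    return [code * s for codes, s in groups for code in codes]


if __name__ == "__main__":
    rng = random.Random(0)
    w = [rng.gauss(0.0, 0.02) for _ in range(4 * GROUP_SIZE)]
    groups = quantize(w)
    w_hat = dequantize(groups)
    mean_abs_err = sum(abs(a - b) for a, b in zip(w, w_hat)) / len(w)
    print(f"groups: {len(groups)}, mean abs error: {mean_abs_err:.5f}")
```

Each group stores only its ternary codes and one scale, which is why the per-weight cost stays close to 1.58 bits even though the reconstructed values are real-valued.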

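Benchmark performance

Compared with the 1-bit Bonsai 8B, Ternary Bonsai 8B scores 5 points higher on average across benchmarks while requiring only 600 MB more memory. Ternary Bonsai 8B (1.75 GB) reaches an average benchmark score of 75.5, compared with 70.5 for 1-bit Bonsai 8B (1.15 GB). Among its peers it trails only Qwen3 8B (16.38 GB) and outperforms all other models, despite being 9-10x smaller than them. It posts competitive results across MMLU Redux, MuSR, GSM8K, HumanEval+, IFEval, and BFCLv3, showing that the gain is broad rather than concentrated in a single benchmark. The intelligence density of Ternary Bonsai models continues to significantly exceed that of other models in comparable parameter classes.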
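The quoted sizes can be sanity-checked from the encoding described earlier. The sketch below is a rough consistency check, not PrismML's own accounting. It assumes ideal packing at log2(3) bits per ternary code, one 16-bit scale per 128 weights, and a parameter count inferred from the 16.38 GB figure for a 16-bit 8B model (2 bytes per weight); the helper names are hypothetical.

```python
import math

GROUP_SIZE = 128       # weights per shared scale (from the post)
SCALE_BITS = 16        # FP16 scale factor (from the post)
FP16_MODEL_GB = 16.38  # size quoted for Qwen3 8B at 16 bits


def bits_per_weight(code_bits, group_size=GROUP_SIZE, scale_bits=SCALE_BITS):
    """Code bits plus the amortized cost of the per-group scale."""
    return code_bits + scale_bits / group_size


def footprint_gb(n_params, bpw):
    """Model size in GB (1 GB = 1e9 bytes) at the given bits per weight."""
    return n_params * bpw / 8 / 1e9


# Assumption: Bonsai 8B has the same parameter count as a 16-bit model
# that occupies 16.38 GB at 2 bytes per weight.
n_params = FP16_MODEL_GB * 1e9 / 2

ternary_bpw = bits_per_weight(math.log2(3))  # about 1.71 bits per weight
one_bit_bpw = bits_per_weight(1.0)           # 1.125 bits per weight

ternary_gb = footprint_gb(n_params, ternary_bpw)
one_bit_gb = footprint_gb(n_params, one_bit_bpw)
compression = 16 / ternary_bpw

if __name__ == "__main__":
    print(f"ternary: {ternary_gb:.2f} GB, 1-bit: {one_bit_gb:.2f} GB")
    print(f"difference: {(ternary_gb - one_bit_gb) * 1000:.0f} MB")
    print(f"compression vs 16-bit: {compression:.2f}x")
```

Under these assumptions the estimates land on the reported 1.75 GB and 1.15 GB, a gap of about 600 MB, and a compression factor a little above 9x.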

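Extending the Pareto frontier

Our earlier 1-bit Bonsai models established a new Pareto frontier for language model capability versus size. Ternary Bonsai shifts that frontier even further left. That makes it a useful addition to the Bonsai family, not a replacement for 1-bit Bonsai. In settings where the smallest possible footprint is the priority, 1-bit remains the right choice. Where a small increase in memory can justify a substantially stronger model, Ternary Bonsai offers an alternative tradeoff. The 1.7B, 4B, and 8B variants extend that tradeoff across multiple deployment tiers, giving developers more flexibility in how they allocate memory, throughput, and model quality.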

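Throughput and energy use

The new models also deliver strong throughput in practice. On an M4 Pro, Ternary Bonsai 8B runs at 82 tokens/sec, roughly 5x faster than a 16-bit 8B model, and on an iPhone 17 Pro Max it runs at 27 tokens/sec. The models use substantially less energy than their 16-bit full-precision counterparts, delivering roughly 3-4x better energy efficiency. Ternary Bonsai 8B requires 0.105 mWh per token on the M4 Pro and only 0.132 mWh per token on the iPhone 17 Pro Max.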
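For readers more used to joules and watts, the per-token figures convert directly. The short sketch below does only unit arithmetic on the numbers quoted above. The implied average power is a derived value, not one reported in the post, and it assumes the energy and throughput figures come from the same run; the function names are hypothetical.

```python
JOULES_PER_MWH = 3.6  # 1 mWh = 0.001 Wh = 3.6 J


def joules_per_token(mwh_per_token):
    """Convert energy per token from mWh to joules."""
    return mwh_per_token * JOULES_PER_MWH


def implied_power_watts(mwh_per_token, tokens_per_second):
    """Average power implied by energy per token at a given throughput.

    Assumption: both figures were measured during the same run.
    """
    return joules_per_token(mwh_per_token) * tokens_per_second


def tokens_per_watt_hour(mwh_per_token):
    """How many tokens one watt-hour of energy buys."""
    return 1000.0 / mwh_per_token


if __name__ == "__main__":
    for device, mwh, tps in [
        ("M4 Pro", 0.105, 82),
        ("iPhone 17 Pro Max", 0.132, 27),
    ]:
        print(
            f"{device}: {joules_per_token(mwh):.3f} J/token, "
            f"{implied_power_watts(mwh, tps):.1f} W implied, "
            f"{tokens_per_watt_hour(mwh):.0f} tokens/Wh"
        )
```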

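Platform coverage

Ternary Bonsai models run natively on Apple devices (Mac, iPhone, iPad) via MLX. Model weights are available today under the Apache 2.0 License. Full technical details of our training, evaluation, and benchmarking processes are available in our whitepaper.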

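Join us

PrismML emerged from a team of Caltech researchers and was founded with support from Khosla Ventures, Cerberus, and Google. We have spent years tackling one of the field's hardest problems: compressing neural networks without sacrificing their reasoning ability. If you want to help build the next generation of state-of-the-art AI, we would love to hear from you.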

Original text (English)

Introducing Ternary Bonsai: Top Intelligence at 1.58 Bits

April 16, 2026 • PrismML

Today, we’re announcing Ternary Bonsai, a new family of 1.58-bit language models designed to balance strict memory constraints with high accuracy requirements. This release builds on the efficiency frontier we began exploring with the recently released 1-bit Bonsai models. The 1-bit family showed that extreme compression could still produce commercially useful language models. Ternary Bonsai targets a different point on that curve: a modest increase in size for a meaningful gain in performance.

The models are available in three sizes: 8B, 4B, and 1.7B parameters. By using ternary weights {-1, 0, +1}, these models achieve a memory footprint approximately 9x smaller than standard 16-bit models while outperforming most peers in their respective parameter classes on standard benchmarks.

A true ternary model

Ternary Bonsai implements 1.58-bit representation throughout the entire network architecture. There are no higher-precision escape hatches. Embeddings, attention layers, MLPs, and the LM head all use the same 1.58-bit representation. The models employ a group-wise quantization scheme in which each weight is constrained to one of three values: {-s, 0, +s}. These three states are encoded as (-1, 0, +1) using 1.58 bits per weight, together with a shared FP16 scale factor (s) for each group of 128 weights.

Benchmark performance

Compared to the 1-bit Bonsai 8B, the Ternary Bonsai 8B scores 5 points higher on average across benchmarks, while requiring only 600MB more memory. Ternary Bonsai 8B (1.75 GB) reaches 75.5 average benchmark score, compared with 70.5 for 1-bit Bonsai 8B (1.15 GB). Among its peers, it is only behind Qwen3 8B (16.38 GB) and outperforms all other models, despite being 9-10x smaller than them. It posts competitive results across MMLU Redux, MuSR, GSM8K, HumanEval+, IFEval, and BFCLv3, showing that the gain is broad rather than concentrated in a single benchmark. The intelligence density of Ternary Bonsai models continues to significantly outperform other models in their comparable parameter classes.

Extending the Pareto frontier

Our earlier 1-bit Bonsai models established a new Pareto frontier for language model capability versus size. Ternary Bonsai shifts that frontier even further left. That makes it a useful addition to the Bonsai family, and not a replacement for 1-bit Bonsai. In settings where the smallest possible footprint is the priority, 1-bit remains the right choice. However, where a small increase in memory can justify a substantially stronger model, Ternary Bonsai offers an alternative tradeoff. The 1.7B, 4B, and 8B variants extend that tradeoff across multiple deployment tiers, giving developers more flexibility in how they allocate memory, throughput, and model quality.

Throughput and energy use

The new models also deliver strong throughput in practice. On M4 Pro, Ternary Bonsai 8B runs at 82 toks/sec, roughly 5x faster than a 16-bit 8B model, and on iPhone 17 Pro Max, it runs at 27 toks/sec. They use substantially less energy than their 16-bit full-precision counterparts, delivering roughly 3-4x better energy efficiency. On the M4 Pro, Ternary Bonsai 8B requires 0.105 mWh/tok and on the iPhone 17 Pro Max, it only requires 0.132 mWh/tok.

Platform Coverage

Ternary Bonsai models run natively on Apple devices (Mac, iPhone, iPad) via MLX. Model weights are available today under the Apache 2.0 License. Full technical details of our training, evaluation, and benchmarking processes are available in our whitepaper.

Join Us

PrismML emerged from a team of Caltech researchers and was founded with support from Khosla Ventures, Cerberus and Google. We’ve spent years tackling one of the field’s hardest problems: compressing neural networks without sacrificing their reasoning ability. If you want to help build the next generation of state-of-the-art AI, we’d love to hear from you. Check out our careers page.

Model compression/quantization • On-device AI • Open-source LLM • Edge computing • Model efficiency