The Decoder • 83 days ago

Google speeds up Gemma 4 threefold with multi-token prediction

IMP

8/10

Key summary

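Google has released multi-token prediction (MTP) drafters for its open AI model Gemma 4, speeding up text generation by up to three times. While the main model waits for data to load, a small auxiliary model proposes several tokens in advance, and the main model then verifies them in a single pass. The speedup comes with no loss in quality on smartphones, local PCs, and cloud environments, and the drafters are released under the Apache 2.0 license.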

Translated text

Matthias Bastian | May 6, 2026

Google has released multi-token prediction (MTP) drafters for its open AI model family Gemma 4, designed to speed up text generation by up to three times.

Large language models (LLMs) normally generate text one token at a time, loading billions of parameters from memory at each step. According to Google, the processor's compute cores spend most of their time simply waiting for data.
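The bottleneck can be put into rough numbers. The sketch below assumes every weight is read from memory once per generated token, so memory bandwidth sets a floor on per-token latency. The function name and the figures (8 billion parameters, 2 bytes per parameter, 100 GB/s) are illustrative assumptions, not Gemma 4 specifications.

```python
def min_seconds_per_token(n_params: float,
                          bytes_per_param: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Lower bound on per-token latency if all weights are read once per token."""
    return n_params * bytes_per_param / bandwidth_bytes_per_s


# Hypothetical hardware and model size, chosen only for illustration.
t = min_seconds_per_token(n_params=8e9, bytes_per_param=2,
                          bandwidth_bytes_per_s=100e9)
print(f"at least {t * 1000:.0f} ms per token, at most {1 / t:.2f} tokens/s")
```

Under these assumptions the compute units are idle for most of each step, which is the slack a drafter can use.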

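Google's new MTP technology tackles this bottleneck. While the main model waits for its data, a small auxiliary model uses the idle capacity to propose several tokens at once. The main model then checks all of these proposals in a single pass and accepts the ones it judges correct right away.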
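The draft-and-verify idea can be sketched with toy models. This is a generic greedy draft-and-verify loop under stated assumptions, not Google's implementation: `main_model`, `draft_model`, and the token arithmetic are hypothetical stand-ins, and the "single pass" of the main model is simulated with a counter rather than a batched forward pass.

```python
from typing import List, Tuple

Token = int


def main_model(context: List[Token]) -> Token:
    """Toy stand-in for the large model: deterministic next token."""
    return (context[-1] * 31 + len(context) * 7) % 50


def draft_model(context: List[Token]) -> Token:
    """Toy stand-in for the small drafter: agrees with the main model most
    of the time, but is wrong whenever the context length is a multiple of 4."""
    guess = main_model(context)
    return guess if len(context) % 4 else (guess + 1) % 50


def plain_decode(prompt: List[Token], n_new: int) -> Tuple[List[Token], int]:
    """Baseline: one main-model pass per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(main_model(out))
    return out, n_new


def speculative_decode(prompt: List[Token], n_new: int,
                       k: int = 4) -> Tuple[List[Token], int]:
    """Draft k tokens, then verify them in one (simulated) main-model pass."""
    out = list(prompt)
    target_len = len(prompt) + n_new
    passes = 0
    while len(out) < target_len:
        # 1. The drafter proposes k tokens, one after another.
        draft: List[Token] = []
        for _ in range(k):
            draft.append(draft_model(out + draft))
        # 2. One verification pass. A real system would score all k + 1
        #    positions in a single batched forward pass of the main model.
        passes += 1
        accepted: List[Token] = []
        for i in range(k + 1):
            expected = main_model(out + accepted)
            accepted.append(expected)  # always the main model's own choice
            if i == k or draft[i] != expected:
                break  # all drafts used up, or first mismatch corrected
        out.extend(accepted)
    return out[:target_len], passes


plain_out, plain_passes = plain_decode([1, 2, 3], 20)
spec_out, spec_passes = speculative_decode([1, 2, 3], 20, k=4)
print("identical output:", spec_out == plain_out)
print("main-model passes:", plain_passes, "vs", spec_passes)
```

Every token that ends up in the output is the main model's own choice, so the result matches plain decoding exactly. The saving comes only from needing fewer main-model passes when the drafter guesses well.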

According to Google, the smaller model is just filling time that would otherwise go to waste, so the same text is produced faster with no loss in quality or accuracy.

The speedup works on smartphones, local computers, and cloud applications. The drafters are available on Hugging Face and Kaggle under the open Apache 2.0 license.

Google's open-weight model Gemma 4, introduced in early April, has already been downloaded more than 60 million times.


Original text (English)

Google speeds up Gemma 4 threefold with multi-token prediction

Matthias Bastian | May 6, 2026

Google has released multi-token prediction drafters (MTP) for its open AI model family Gemma 4, designed to speed up text generation by up to three times.

LLMs normally generate text one token at a time, loading billions of parameters from memory at each step. The processor's computing core spends most of its time just waiting for data, Google says.

The company's new MTP technology tackles this bottleneck. While the main model waits for its data, a small auxiliary model uses the idle capacity to suggest several tokens at once. The main model then checks all those suggestions in a single pass; if they're correct, they get accepted at once.

The smaller model is just filling time that would otherwise go to waste, so the same text gets produced faster with no loss in quality or accuracy, according to Google.

The speedup works on smartphones, local computers, and cloud applications. The drafters are available under the open Apache 2.0 license on Hugging Face and Kaggle.

Google's Gemma 4 open-weight model, introduced in early April, has already been downloaded over 60 million times.

Source: Google Blog

Google, Gemma 4, model optimization, inference speed, open source