Hacker News • 92 days ago

Microsoft Releases VibeVoice, an Open-Source Frontier Voice AI

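Key Summary

Microsoft has released VibeVoice, an open-source family of speech recognition (ASR) and speech synthesis (TTS) models specialized for long-form audio and structured transcription. Its headline feature is processing up to 60 minutes of audio in a single pass and producing structured output with speaker labels, timestamps, and content. The ASR model supports more than 50 languages, and its recent inclusion in the Hugging Face Transformers library, together with vLLM inference support, makes it easier to integrate into existing projects.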

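Translated Text

🎙️ VibeVoice: Open-Source Frontier Voice AI

📰 News

2026-03-06: 🚀 VibeVoice ASR is now included in a Transformers release! You can use the speech recognition model directly through the Hugging Face Transformers library and integrate it smoothly into your projects.

2026-01-21: 📣 We open-sourced VibeVoice-ASR, a unified speech-to-text model. It processes 60-minute long-form audio in a single pass, generates structured transcriptions containing who (speaker), when (timestamps), and what (content), and supports user-customized context. Try it in the Playground. ⭐️ VibeVoice-ASR is a natively multilingual model supporting more than 50 languages; see the list of supported languages for details. 🔥 The VibeVoice-ASR finetuning code has been released! ⚡️ vLLM inference is supported for faster inference; see vllm-asr for details. 📑 The VibeVoice-ASR technical report is available.

2025-12-16: 📣 We added experimental speakers to VibeVoice-Realtime-0.5B for exploration. They include multilingual voices in nine languages (German, French, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Spanish) and 11 distinct English style voices. Try them out. More speaker types will be added over time.

2025-12-03: 📣 We open-sourced VibeVoice-Realtime-0.5B, a real-time text-to-speech (TTS) model that supports streaming text input and robust long-form speech generation. Try it on Colab.

2025-09-05: VibeVoice is an open-source research framework intended to advance collaboration in the speech synthesis community. After release, we discovered instances where the tool was used in ways inconsistent with its stated intent. Because responsible use of AI is one of Microsoft's guiding principles, we removed the VibeVoice-TTS code from this repository.

2025-08-25: 📣 We open-sourced VibeVoice-TTS, a long-form multi-speaker text-to-speech model that can synthesize speech up to 90 minutes long with up to 4 speakers. It was accepted as an Oral at ICLR 2026!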

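🔥 Overview

VibeVoice is a family of open-source frontier voice AI models that includes both text-to-speech (TTS) and automatic speech recognition (ASR) models.

A core innovation of VibeVoice is its use of continuous speech tokenizers (acoustic and semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokenizers preserve audio fidelity efficiently while greatly improving the computational efficiency of processing long sequences.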
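To make the frame-rate claim concrete, the sketch below counts how many tokenizer frames a recording yields at 7.5 Hz. It assumes only that the tokenizer emits one frame per 1/7.5 s of audio. The 50 Hz comparison rate is a hypothetical baseline chosen for illustration, not a figure from the README, and `frames_for` is an illustrative helper name.

```python
FRAME_RATE_HZ = 7.5       # frame rate stated in the README
BASELINE_RATE_HZ = 50.0   # hypothetical comparison rate (assumption, not from the README)


def frames_for(minutes: float, rate_hz: float = FRAME_RATE_HZ) -> int:
    """Return the number of frames produced for `minutes` of audio at `rate_hz`."""
    return round(minutes * 60 * rate_hz)


if __name__ == "__main__":
    for minutes in (10, 60, 90):
        low = frames_for(minutes)
        high = frames_for(minutes, BASELINE_RATE_HZ)
        print(f"{minutes:>3} min: {low:>6} frames at 7.5 Hz, {high:>7} at 50 Hz")
```

Under this assumption, 60 minutes of audio comes to 27,000 frames and 90 minutes to 40,500. If each frame costs roughly one sequence position (also an assumption), an hour of audio stays below the 64K token length the README quotes for VibeVoice-ASR.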

VibeVoice adopts a next-token diffusion framework: a large language model (LLM) captures textual context and dialogue flow, and a diffusion head generates high-fidelity acoustic details.

For more information, demos, and examples, visit the project page.

[Model Weights and Quick Try]

VibeVoice-ASR-7B: Hugging Face link | Playground
VibeVoice-TTS-1.5B: Hugging Face link | Disabled
VibeVoice-Realtime-0.5B: Hugging Face link | Colab

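[Model Details]

📖 VibeVoice-ASR - Long-form Speech Recognition

VibeVoice-ASR is a unified speech-to-text model that processes 60-minute long-form audio in a single pass, generates structured transcriptions containing who (speaker), when (timestamps), and what (content), and supports customized hotwords.

🕒 60-minute Single-Pass Processing: Unlike conventional ASR models, which slice audio into short chunks and lose global context, VibeVoice ASR accepts up to 60 minutes of continuous audio input within a 64K token length. This keeps speaker tracking and semantic coherence consistent across the entire hour.
👤 Customized Hotwords: Users can provide customized hotwords (for example, specific names, technical terms, or background information) to guide recognition, which significantly improves accuracy on domain-specific content.
📝 Rich Transcription (Who, When, What): The model performs ASR, speaker diarization, and timestamping jointly, producing structured output that shows who said what and when.

📖 Documentation | 🤗 Hugging Face | 🎮 Playground | 🛠️ Finetuning | 📊 Paper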
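The who/when/what structure described above can be pictured as a list of timed, speaker-labeled segments. The sketch below is a hypothetical schema for illustration only: the README does not specify the model's actual output format, and the names `Segment`, `format_timestamp`, `render`, and `talk_time` are assumptions, not part of VibeVoice.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcript entry: who spoke, when, and what was said (hypothetical schema)."""
    speaker: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio
    text: str


def format_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS.cc, working in integer centiseconds."""
    total_cs = round(seconds * 100)
    hours, rem = divmod(total_cs, 360_000)
    minutes, rem = divmod(rem, 6_000)
    secs, cs = divmod(rem, 100)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{cs:02d}"


def render(segments: list[Segment]) -> str:
    """Render one 'when, who, what' line per segment, ordered by start time."""
    lines = []
    for seg in sorted(segments, key=lambda s: s.start):
        span = f"{format_timestamp(seg.start)} - {format_timestamp(seg.end)}"
        lines.append(f"[{span}] {seg.speaker}: {seg.text}")
    return "\n".join(lines)


def talk_time(segments: list[Segment]) -> dict[str, float]:
    """Sum speaking time in seconds for each speaker."""
    totals: dict[str, float] = {}
    for seg in segments:
        totals[seg.speaker] = totals.get(seg.speaker, 0.0) + (seg.end - seg.start)
    return totals


if __name__ == "__main__":
    demo = [
        Segment("Speaker 2", 4.5, 9.0, "Thanks, glad to be here."),
        Segment("Speaker 1", 0.0, 4.5, "Welcome to the show."),
    ]
    print(render(demo))
    print(talk_time(demo))
```

Keeping speaker, time span, and text in a single record is what makes downstream steps such as per-speaker talk time or subtitle export straightforward, which is the practical benefit of producing diarization and timestamps jointly with the transcript.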

🎙️ VibeVoice-TTS - Long-form Multi-speaker TTS

Best for: long-form conversational audio, podcasts, multi-speaker dialogues

Original Text (English)

🎙️ VibeVoice: Open-Source Frontier Voice AI

📰 News

2026-03-06: 🚀 VibeVoice ASR is now part of a Transformers release! You can now use our speech recognition model directly through the Hugging Face Transformers library for seamless integration into your projects.

2026-01-21: 📣 We open-sourced VibeVoice-ASR, a unified speech-to-text model designed to handle 60-minute long-form audio in a single pass, generating structured transcriptions containing Who (Speaker), When (Timestamps), and What (Content), with support for User-Customized Context. Try it in Playground. ⭐️ VibeVoice-ASR is natively multilingual, supporting over 50 languages; check the supported languages for details. 🔥 The VibeVoice-ASR finetuning code is now available! ⚡️ vLLM inference is now supported for faster inference; see vllm-asr for more details. 📑 VibeVoice-ASR Technique Report is available.

2025-12-16: 📣 We added experimental speakers to VibeVoice-Realtime-0.5B for exploration, including multilingual voices in nine languages (DE, FR, IT, JP, KR, NL, PL, PT, ES) and 11 distinct English style voices. Try it. More speaker types will be added over time.

2025-12-03: 📣 We open-sourced VibeVoice-Realtime-0.5B, a real-time text-to-speech model that supports streaming text input and robust long-form speech generation. Try it on Colab.

2025-09-05: VibeVoice is an open-source research framework intended to advance collaboration in the speech synthesis community. After release, we discovered instances where the tool was used in ways inconsistent with the stated intent. Since responsible use of AI is one of Microsoft's guiding principles, we have removed the VibeVoice-TTS code from this repository.

2025-08-25: 📣 We open-sourced VibeVoice-TTS, a long-form multi-speaker text-to-speech model that can synthesize speech up to 90 minutes long with up to 4 distinct speakers. It was accepted as an Oral at ICLR 2026!

🔥 Overview

VibeVoice is a family of open-source frontier voice AI models that includes both Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) models.

A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokenizers efficiently preserve audio fidelity while significantly boosting computational efficiency for processing long sequences.

VibeVoice employs a next-token diffusion framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details.

For more information, demos, and examples, please visit our Project Page.

Model | Weight | Quick Try
VibeVoice-ASR-7B | HF Link | Playground
VibeVoice-TTS-1.5B | HF Link | Disabled
VibeVoice-Realtime-0.5B | HF Link | Colab

Models

1. 📖 VibeVoice-ASR - Long-form Speech Recognition

VibeVoice-ASR is a unified speech-to-text model designed to handle 60-minute long-form audio in a single pass, generating structured transcriptions containing Who (Speaker), When (Timestamps), and What (Content), with support for Customized Hotwords.

🕒 60-minute Single-Pass Processing: Unlike conventional ASR models that slice audio into short chunks (often losing global context), VibeVoice ASR accepts up to 60 minutes of continuous audio input within 64K token length. This ensures consistent speaker tracking and semantic coherence across the entire hour.
👤 Customized Hotwords: Users can provide customized hotwords (e.g., specific names, technical terms, or background info) to guide the recognition process, significantly improving accuracy on domain-specific content.
📝 Rich Transcription (Who, When, What): The model jointly performs ASR, diarization, and timestamping, producing a structured output that indicates who said what and when.

📖 Documentation | 🤗 Hugging Face | 🎮 Playground | 🛠️ Finetuning | 📊 Paper

2. 🎙️ VibeVoice-TTS - Long-form Multi-speaker TTS

Best for: Long-form conversational audio, podcasts, multi-speaker dialogues

⏱️ 90-minute Long-form Generation: Synthesizes conversational/single-speaker speech up to 90 minutes in a single pass, maintaining speaker consistency and semantic coherence throughout.
👥 Multi-speaker Support: Supports up to 4 distinct speakers in a single conversation, with natural turn-taking and speaker consistency across long dialogues.
🎭 Expressive Speech: Generates expressive, natural-sounding speech that captures conversational dynamics and emotional nuances.
🌐 Multi-lingual Support: Supports English, Chinese and other languages.

📖 Documentation | 🤗 Hugging Face | 📊 Paper

Demo videos: English, Chinese, Cross-Lingual, Spontaneous Singing, Long Conversation with 4 people

3. ⚡ VibeVoice-Streaming - Real-time Streaming TTS

VibeVoice-Realtime is a lightweight real-time text-to-speech model supporting streaming text input and robust long-form speech generation.

Parameter size: 0.5B (deployment-friendly)
Real-time TTS (~300 milliseconds first audible latency)
Streaming text input
Robust long-form speech generation (~10 minutes)

📖 Documentation | 🤗 Hugging Face | 🚀 Colab

Contributing

Please see CONTRIBUTING.md for detailed contribution guidelines.

⚠️ Risks and Limitations

While efforts have been made to optimize it through various techniques, it may still produce outputs that are unexpected, biased, or inaccurate. VibeVoice inherits any biases, errors, or omissions produced by its base model (specifically, Qwen2.5 1.5b in this release).

Potential for Deepfakes and Disinformation: High-quality synthetic speech can be misused to create convincing fake audio content for impersonation, fraud, or spreading disinformation. Users must ensure transcripts are reliable, check content accuracy, and avoid using generated content in misleading ways.

Users are expected to use the generated content and to deploy the models in a lawful manner, in full compliance with all applicable laws and regulations in the relevant jurisdictions. It is best practice to disclose the use of AI when sharing AI-generated content.

We do not recommend using VibeVoice in commercial or real-world applications without further testing and development. This model is intended for research and development purposes only. Please use responsibly.

Speech recognition, Text-to-speech, Microsoft, Open-source model, Multilingual support