MarkTechPost, 48 days ago

Microsoft VibeVoice Hands-On Tutorial

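Key summary

This tutorial covers building speech recognition (ASR) and real-time speech synthesis (TTS) pipelines with VibeVoice, Microsoft's voice AI model. Working in Google Colab, readers can practice speaker diarization, context-aware ASR, expressive TTS, and end-to-end speech-to-speech conversion. It gives developers and practitioners a practical guide to applying recent audio language models to their own data and experimenting with them.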

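Translated body

In this tutorial, we explore Microsoft VibeVoice in Google Colab and build a complete hands-on workflow for speech recognition and real-time speech synthesis. We set up the environment from scratch, install the required dependencies, verify support for the latest VibeVoice models, and then walk through advanced capabilities such as speaker-aware transcription, context-guided ASR, batch audio processing, expressive text-to-speech (TTS) generation, and an end-to-end speech-to-speech pipeline.

Along the way, we work with practical examples, test different voice presets, generate long-form audio, launch a Gradio interface, and learn how to adapt the system to our own files and experiments.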

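[Code snippet] Initial environment setup code that installs the required Python packages, clones the official Microsoft VibeVoice repository, and checks whether the Hugging Face Transformers library supports VibeVoice.

We prepare the Google Colab environment for VibeVoice by installing and updating all the required packages. We clone the official VibeVoice repository, configure the runtime, and verify that the dedicated ASR support is available in the installed Transformers version. We also import the core libraries and define sample audio sources, so that everything is ready for the later transcription and speech generation steps.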
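The support check described above can be generalized into a small reusable helper. The sketch below is illustrative only: `has_symbol` is a hypothetical name, not part of Transformers or VibeVoice. It tries to import a module and look up one attribute on it. The class name `VibeVoiceAsrForConditionalGeneration` is the one the tutorial's setup code imports.

```python
import importlib


def has_symbol(module_name: str, symbol: str) -> bool:
    """Return True if `module_name` imports cleanly and exposes `symbol`.

    Hypothetical helper for illustration; not part of any library.
    """
    try:
        module = importlib.import_module(module_name)
        getattr(module, symbol)
    except Exception:
        # Covers a missing module, a missing attribute, and errors that
        # lazily loaded packages may raise when resolving the attribute.
        return False
    return True


if __name__ == "__main__":
    # Class name taken from the tutorial's setup cell.
    if has_symbol("transformers", "VibeVoiceAsrForConditionalGeneration"):
        print("VibeVoice ASR: available")
    else:
        print("VibeVoice ASR: not available; reinstall Transformers and restart the runtime")
```

Returning a boolean instead of raising lets a notebook decide whether to stop or to fall back, whereas the tutorial's own check re-raises the `ImportError`.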

[Code snippet] The code that follows loads the VibeVoice ASR model (about 7 billion parameters) and shows an example that transcribes a sample podcast recording with speaker identification.
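In the tutorial's code, the parsed ASR output is iterated as a list of segments, each a dict with `Speaker`, `Start`, `End`, and `Content` keys. A minimal sketch of post-processing such output follows. The helper names `format_segments` and `speaking_time` and the sample data are hypothetical, made up for illustration. The sketch also assumes `Start` and `End` are numbers in seconds, as the tutorial's `{start:.2f}s` formatting suggests.

```python
from collections import defaultdict


def format_segments(segments):
    """Render each segment as '[Speaker N] start - end: text'."""
    return [
        f"[Speaker {seg['Speaker']}] {seg['Start']:.2f}s - {seg['End']:.2f}s: {seg['Content']}"
        for seg in segments
    ]


def speaking_time(segments):
    """Sum the duration (End - Start, in seconds) of all segments per speaker."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["Speaker"]] += seg["End"] - seg["Start"]
    return dict(totals)


# Hypothetical sample data, shaped like the parsed output used in the tutorial.
sample = [
    {"Speaker": 0, "Start": 0.0, "End": 2.5, "Content": "Welcome to the show."},
    {"Speaker": 1, "Start": 2.5, "End": 4.0, "Content": "Thanks for having me."},
    {"Speaker": 0, "Start": 4.0, "End": 5.5, "Content": "Let's get started."},
]

for line in format_segments(sample):
    print(line)
print(speaking_time(sample))
```

Keeping formatting and aggregation as separate functions means the same segment list can feed both a readable transcript and simple per-speaker statistics.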

Original text (English)
In this tutorial, we explore Microsoft VibeVoice in Colab and build a complete hands-on workflow for both speech recognition and real-time speech synthesis. We set up the environment from scratch, install the required dependencies, verify support for the latest VibeVoice models, and then walk through advanced capabilities such as speaker-aware transcription, context-guided ASR, batch audio processing, expressive text-to-speech generation, and an end-to-end speech-to-speech pipeline. As we work through the tutorial, we interact with practical examples, test different voice presets, generate long-form audio, launch a Gradio interface, and understand how to adapt the system for our own files and experiments.

```python
# Replace the preinstalled Transformers with a build from source
!pip uninstall -y transformers -q
!pip install -q git+https://github.com/huggingface/transformers.git
!pip install -q torch torchaudio accelerate soundfile librosa scipy numpy
!pip install -q huggingface_hub ipywidgets gradio einops
# flash-attn is optional; a failed install only prints a notice
!pip install -q flash-attn --no-build-isolation 2>/dev/null || echo "flash-attn optional"
!git clone -q --depth 1 https://github.com/microsoft/VibeVoice.git /content/VibeVoice 2>/dev/null || echo "Already cloned"
!pip install -q -e /content/VibeVoice

print("="*70)
print("IMPORTANT: If this is your first run, restart the runtime now!")
print("Go to: Runtime -> Restart runtime, then run from CELL 2.")
print("="*70)

import torch
import numpy as np
import soundfile as sf
import warnings
import sys
from IPython.display import Audio, display

warnings.filterwarnings('ignore')
sys.path.insert(0, '/content/VibeVoice')

import transformers
print(f"Transformers version: {transformers.__version__}")

# Fail early if the installed Transformers build lacks the VibeVoice ASR class
try:
    from transformers import VibeVoiceAsrForConditionalGeneration
    print("VibeVoice ASR: Available")
except ImportError:
    print("ERROR: VibeVoice not available. Please restart runtime and run Cell 1 again.")
    raise

# Sample audio used by the ASR demos
SAMPLE_PODCAST = "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/example_output/VibeVoice-1.5B_output.wav"
SAMPLE_GERMAN = "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav"

print("Setup complete!")
```

We prepare the complete Google Colab environment for VibeVoice by installing and updating all the required packages. We clone the official VibeVoice repository, configure the runtime, and verify that the special ASR support is available in the installed Transformers version. We also import the core libraries and define sample audio sources, making our tutorial ready for the later transcription and speech generation steps.

```python
from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration

print("Loading VibeVoice ASR model (7B parameters)...")
print("First run downloads ~14GB - please wait...")

asr_processor = AutoProcessor.from_pretrained("microsoft/VibeVoice-ASR-HF")
asr_model = VibeVoiceAsrForConditionalGeneration.from_pretrained(
    "microsoft/VibeVoice-ASR-HF",
    device_map="auto",           # place the weights on the available device(s)
    torch_dtype=torch.float16,   # half precision: 2 bytes per parameter
)

print(f"ASR Model loaded on {asr_model.device}")

def transcribe(audio_path, context=None, output_format="parsed"):
    # Build the model inputs from the audio and an optional context prompt
    inputs = asr_processor.apply_transcription_request(
        audio=audio_path,
        prompt=context,
    ).to(asr_model.device, asr_model.dtype)
    output_ids = asr_model.generate(**inputs)
    # Drop the prompt tokens so only newly generated tokens are decoded
    generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
    result = asr_processor.decode(generated_ids, return_format=output_format)[0]
    return result

print("="*70)
print("ASR DEMO: Podcast Transcription with Speaker Diarization")
print("="*70)

print("\nPlaying sample audio:")
display(Audio(SAMPLE_PODCAST))

print("\nTranscribing with speaker identification...")
result = transcribe(SAMPLE_PODCAST, output_format="parsed")

print("\nTRANSCRIPTION RESULTS:")
print("-"*70)
for segment in result:
    speaker = segment['Speaker']
    start = segment['Start']
    end = segment['End']
    content = segment['Content']
    print(f"\n[Speaker {speaker}] {start:.2f}s - {end:.2f}s")
    print(f" {content}")

print("\n" + "="*70)
print("ASR DEMO: Context-Aware Transcription")
print("="*70)

print("\nComparing transcription WITH and WITHOUT context hotwords:")
print("-"*70)

result_no_ctx = transcribe(SAMPLE_GERMAN, context=None, output_format="transcription_only")
print(f"\nWITHOUT context: {result_no_ctx}")

result_with_ctx = transcribe(SAMPLE_GERMAN, context="About VibeVoice", output_format="transcription_only")
print(f"WITH context: {result_with_ctx}")

print("\nNotice how 'VibeVoice' is recognized correctly when context is provided!")
```

We load the VibeVoice ASR model and processor to convert speech into text. We define a reusable transcription function that enables inference with optional context and multiple output formats. We then test the model on sample audio to observe speaker diarization and compare the improvements in recognition quality from context-aware transcription.

```python
print("\n" + "="*70)
print("ASR DEMO: Batch Processing")
print("="*70)

# One prompt per audio file; None means no context for that file
audio_batch = [SAMPLE_GERMAN, SAMPLE_PODCAST]
prompts_batch = ["About VibeVoice", None]

inputs = asr_processor.apply_transcription_request(
    audio=audio_batch,
    prompt=prompts_batch
).to(asr_model.device, asr_model.dtype)

output_ids = asr_model.generate(**inputs)
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
transcriptions = asr_processor.decode(generated_ids, return_format="transcription_only")

print("\nBatch transcription results:")
print("-"*70)
for i, trans in enumerate(transcriptions):
    # Show at most the first 150 characters of each transcription
    preview = trans[:150] + "..." if len(trans) > 150 else trans
    print(f"\nAudio {i+1}: {preview}")

from transformers import AutoModelForCausalLM
from vibevoice.modular.modular_vibevoice_text_tokenizer import VibeVoiceTextTokenizerFast

print("\n" + "="*70)
print("Loading VibeVoice Realtime TTS model (0.5B parameters)...")
print("="*70)

tts_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/VibeVoice-Realtime-0.5B",
    trust_remote_code=True,
    torch_dtype=torch.float16,
).to("cuda" if torch.cuda.is_available() else "cpu")

tts_tokenizer = VibeVoiceTextTokenizerFast.from_pretrained("microsoft/VibeVoice-Realtime-0.5B")
tts_model.set_ddpm_inference_steps(20)

print(f"TTS Model loaded on {next(tts_model.parameters()).device}")

VOICES = ["Carter", "Grace", "Emma", "Davis"]

def synthesize(text, voice="Grace", cfg_scale=3.0, steps=20, save_path=None):
    tts_model.set_ddpm_inference_steps(steps)
    input_ids = tts_tokenizer(text, return_tensors="pt").input_ids.to(tts_model.device)
    output = tts_model.generate(
        inputs=input_ids,
        tokenizer=tts_tokenizer,
        cfg_scale=cfg_scale,
        return_speech=True,
        show_progress_bar=True,
        speaker_name=voice,
    )
    # Convert the generated waveform to a 1-D NumPy array on the CPU
    audio = output.audio.squeeze().cpu().numpy()
    sample_rate = 24000  # hardcoded in the tutorial
    if save_path:
        sf.write(save_path, audio, sample_rate)
        print(f"Saved to: {save_path}")
    return audio, sample_rate
```

We expand the ASR workflow by processing multiple audio files together in batch mode. We then switch to the text-to-speech side of the tutorial by loading the VibeVoice real-time TTS model and its tokenizer. We also define the speech synthesis helper function and voice presets to generate natural audio from text in the next stages.

```python
print("\n" + "="*70)
print("TTS DEMO: Basic Speech Synthesis")
print("="*70)

# (text, voice preset) pairs
demo_texts = [
    ("Hello! Welcome to VibeVoice, Microsoft's open-source voice AI.", "Grace"),
    ("This model generates natural, expressive speech in real-time.", "Carter"),
    ("You can choose from multiple voice presets for different styles.", "Emma"),
]

for text, voice in demo_texts:
    print(f"\nText: {text}")
    print(f"Voice: {voice}")
    audio, sr = synthesize(text, voice=voice)
    # Duration in seconds = number of samples / sample rate
    print(f"Duration: {len(audio)/sr:.2f} seconds")
    display(Audio(audio, rate=sr))

print("\n" + "="*70)
print("TTS DEMO: Compare All Voice Presets")
print("="*70)

comparison_text = "VibeVoice produces remark
```