r/LocalLLaMA • 97 days ago

Qwen3 TTS: the best open-source model you can run locally in real time

IMP

7/10

Key Summary

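This project uses the open-source Qwen3 TTS model to build a fully local pipeline for real-time speech synthesis and avatar lip-sync. The key takeaway is that the author stabilized streaming, quantized and sped up the model through llama.cpp, implemented CTC-based word-level alignment (for subtitles and lip-sync), and fine-tuned a custom voice, replacing a previously robotic-sounding TTS with highly expressive, natural-sounding speech.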

Translated Post

Hello everyone,

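About a year ago I released Persona Engine as a fun side project. The goal was to run the whole ASR (automatic speech recognition) -> LLM (large language model) -> TTS (text-to-speech) pipeline fully locally while driving a real-time, lip-synced avatar (think VTuber). I got it working and was very happy with the result, but the TTS quality fell short of Sesame, which I was using as my reference at the time. After that I took a long break.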

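A week or two ago I decided to refresh the project and see how far local models had come, and Qwen3 TTS was a very pleasant surprise. In my initial tests it fell short, especially the version published by the Qwen team themselves, but after a lot of digging and experimenting I was able to do the following: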

1. Make streaming with the model work reliably. The architecture is well suited to this: because the decoder uses a sliding window, the TTS keeps coherent prosody, pitch, and intonation even when the LLM response is streamed in.
2. Get the model running through llama.cpp, since I am using C# and speed matters, and quantize it as well.
3. Implement CTC word-level alignment. Kokoro, the more robotic-sounding TTS I used before, provided word-level timings and phonemes, but Qwen3 TTS did not. I need to know when each word is spoken, both for subtitles and for getting the phonemes that make the lips move correctly.
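For readers unfamiliar with the third item, here is a minimal sketch of CTC forced alignment in plain Python. It is not the author's implementation (the project itself is written in C#). It assumes you already have per-frame log-probabilities from some CTC-trained acoustic model and the token ids of the known transcript. The function names, the toy vocabulary, and the 0.02-second frame duration are all hypothetical and chosen only for illustration.

```python
import math


def ctc_forced_align(log_probs, tokens, blank=0):
    """Viterbi alignment of a known token sequence to CTC frame log-probs.

    log_probs: list of T frames, each a list of per-vocab log-probabilities.
    tokens:    transcript token ids (hypothetical ids for this sketch).
    Returns a list of length T: the index into `tokens` emitted at each
    frame, or None where the best path emits a blank.
    """
    if not log_probs:
        raise ValueError("no frames to align")
    # Interleave blanks: blank, t0, blank, t1, ..., blank
    ext = [blank]
    for tok in tokens:
        ext += [tok, blank]
    T, S = len(log_probs), len(ext)
    NEG = -math.inf
    score = [[NEG] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    # A path may start in the leading blank or in the first token.
    score[0][0] = log_probs[0][ext[0]]
    if S > 1:
        score[0][1] = log_probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            best, arg = score[t - 1][s], s  # stay in the same state
            if s >= 1 and score[t - 1][s - 1] > best:
                best, arg = score[t - 1][s - 1], s - 1  # advance one state
            # Skipping a blank is only allowed between two different tokens.
            if (s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]
                    and score[t - 1][s - 2] > best):
                best, arg = score[t - 1][s - 2], s - 2
            if best > NEG:
                score[t][s] = best + log_probs[t][ext[s]]
                back[t][s] = arg
    # A valid path ends in the final blank or the final token.
    s = S - 1
    if S > 1 and score[T - 1][S - 2] > score[T - 1][S - 1]:
        s = S - 2
    if score[T - 1][s] == NEG:
        raise ValueError("too few frames for this transcript")
    path = [0] * T
    for t in range(T - 1, -1, -1):
        path[t] = s
        s = back[t][s]
    # Odd states are tokens, even states are blanks.
    return [None if p % 2 == 0 else p // 2 for p in path]


def word_timings(frame_tokens, word_of_token, words, frame_sec):
    """Collapse per-frame token indices into (word, start_sec, end_sec)."""
    spans = {}
    for frame, tok in enumerate(frame_tokens):
        if tok is None:
            continue
        w = word_of_token[tok]
        first, _ = spans.get(w, (frame, frame))
        spans[w] = (first, frame)
    return [(words[w], first * frame_sec, (last + 1) * frame_sec)
            for w, (first, last) in sorted(spans.items())]


# Toy demo: vocabulary {0: blank, 1: h, 2: i, 3: y, 4: o}, transcript "hi yo".
def toy_frame(k, vocab=5):
    return [math.log(0.9 if v == k else 0.025) for v in range(vocab)]


demo_log_probs = [toy_frame(k) for k in [0, 1, 1, 2, 0, 3, 4, 0]]
demo_frames = ctc_forced_align(demo_log_probs, [1, 2, 3, 4], blank=0)
for word, start, end in word_timings(demo_frames, [0, 0, 1, 1], ["hi", "yo"], 0.02):
    print(f"{word}: {start:.2f}s to {end:.2f}s")
```

The resulting word spans can drive subtitle display directly. If the alignment is run over phoneme tokens instead of characters, the same spans can time mouth shapes for lip-sync.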

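Once all of this was done, I decided to fine-tune my own Qwen3-TTS voice. The voice cloning capabilities are really cool, but they are lacking in contextual understanding and struggle with pronunciation. In addition, the custom trained voices provided by the Qwen team did not include any female native speakers, and I did not want to create a new Live2D model.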

In the end, the fine-tune blew me away, and I plan to keep improving it.

The GitHub repository is here: https://github.com/fagenorn/handcrafted-persona-engine

Check it out and have fun with it, and be sure to let me know what crazy projects you build with it.


Original Text (English)

Heya guys and gals,

Around a year ago I released and posted about Persona Engine as a fun side project, trying to get the whole ASR -> LLM -> TTS pipeline going fully locally while having a realtime avatar that is lip-synced (think VTuber). I was able to achieve this and was super happy with the result, but the TTS for me was definitely lacking, since I was using Sesame at the time as reference. After that I took a long break.

A week or two ago, I thought to give the project a refresh, and also wanted to see how far we have come with local models, and boy was I pleasantly surprised with Qwen3 TTS. During my initial tests it was lacking, especially the version published by the Qwen team themselves, but after digging around and experimenting a lot I was able to:

1. Make streaming with the model work reliably. The architecture of the model is perfect for this, since the decoder uses a sliding window, which means if you stream the LLM response, that's completely fine and the TTS will keep coherent prosody, pitch, and intonation.
2. Get the model working with llama.cpp, because I am using C# and speed is important, so also quantized it.
3. The model was lacking word-level timings and phonemes which Kokoro (the previous, more robotic sounding TTS) had. So I had to implement CTC word-level alignment to be able to know when certain words are spoken (important for subtitles + getting phonemes to have the lips move correctly).

Once this was all done, I also decided to finetune my own Qwen3-TTS voice. The cloning capabilities are really cool, but very lacking in contextual understanding and struggles with pronouncing. Additionally, the custom trained voices provided by the Qwen team didn't have any female native speakers, and I didn't want to create a new Live2D model.

In the end, the finetune blew me away and will probably continue improving it.

GitHub is here: [https://github.com/fagenorn/handcrafted-persona-engine](https://github.com/fagenorn/handcrafted-persona-engine) Check it out, have fun, and let me know whatever crazy stuff you decide to do with it.

TTS open-source local-inference fine-tuning real-time-speech-synthesis