MarkTechPost • 70 days ago

Alibaba's Interpretation AI: 60 Languages, 2.8-Second Latency


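Key Summary

Alibaba Cloud's Qwen team has released Qwen3.5-LiveTranslate-Flash, a real-time multilingual interpretation model. It handles input in 60 languages at a latency of 2.8 seconds and supports visual-information analysis and real-time cloning of the speaker's voice. Its main selling points are stable performance in noisy environments and a glossary-injection feature for specialized terminology, which together make it immediately usable in the day-to-day work of global enterprises.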

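Translated Text

Simultaneous interpretation is one of the hardest problems in applied AI, because the model has to translate speech before the speaker has finished the sentence. Even one extra second of latency breaks the experience of real-time communication and sharply lowers perceived quality. Alibaba's Qwen team has chipped away at this problem with each new release. Their latest model, Qwen3.5-LiveTranslate-Flash, cuts latency to 2.8 seconds and extends input coverage to 60 languages.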

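A Meaningful Leap Over the Previous Release

The previous version, Qwen3-LiveTranslate-Flash, handled 18 input languages at a latency of roughly 3 seconds. Qwen3.5-LiveTranslate-Flash cuts latency to 2.8 seconds, expands input coverage to 60 languages, and adds speech output in 29 languages. On the input side, that is more than a threefold expansion in language support. For developers building multilingual products, this greatly reduces the effort of swapping models per language in global enterprise settings. The latency improvement comes from a technique for processing what the team calls "reading units." Instead of waiting for a complete sentence to arrive before generating output, the model decides for itself when a segment has accumulated enough meaning to start translating. It keeps streaming output while the speaker is still talking. This is the same underlying logic as semantic unit prediction, but in a tighter implementation that shaves off a further 200 milliseconds.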
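The idea of committing to a translation as soon as a segment carries enough meaning, rather than waiting for the sentence to end, can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the function name `commit_segments`, the `min_words` threshold, and the boundary-word heuristic are hypothetical and are not how Qwen3.5-LiveTranslate-Flash decides its reading units, which the article does not describe.

```python
from typing import Iterable, Iterator, List

# Hypothetical boundary cues; a real system would rely on a learned model.
BOUNDARY_WORDS = {"and", "but", "because", "so", "which"}


def commit_segments(words: Iterable[str], min_words: int = 4) -> Iterator[str]:
    """Yield segments as soon as they look complete enough to translate."""
    buffer: List[str] = []
    for word in words:
        ends_clause = word.endswith((",", ".", "?", "!"))
        starts_clause = word.lower() in BOUNDARY_WORDS
        # Commit the buffered words before a conjunction opens a new clause.
        if starts_clause and len(buffer) >= min_words:
            yield " ".join(buffer)
            buffer = []
        buffer.append(word)
        # Commit after punctuation once the segment is long enough.
        if ends_clause and len(buffer) >= min_words:
            yield " ".join(buffer)
            buffer = []
    if buffer:  # flush whatever remains when the speaker stops
        yield " ".join(buffer)


stream = "we shipped the new release yesterday and the latency dropped by a lot".split()
segments = list(commit_segments(stream))
print(segments)
```

In this toy version the first segment is handed to the translator while the second clause is still arriving, which is the behavior the article attributes to reading units. Lowering `min_words` trades accuracy for latency, since shorter segments carry less context.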

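Visual Information Promoted to a First-Class Input

Most translation systems treat audio as the only input signal. That works well in a noise-free studio, but performance drops sharply in a crowded conference room, a noisy trading room, or anywhere with overlapping voices and poor acoustics. Qwen3.5-LiveTranslate-Flash takes a different approach. In parallel with the audio, it analyzes visual information such as on-screen text, physically visible objects, lip movements, and gestures. When a word is phonetically ambiguous or the quality of the audio stream degrades, the visual context fills the gap and improves translation accuracy. This is not a minor feature. In real deployments, audio quality is rarely guaranteed. Having a vision channel means the model can handle the messy reality of live interpretation far more flexibly than audio-only systems.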

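Voice Cloning in Real Time

This is the most striking part of the Qwen3.5 release. Conventional translation systems replace the speaker's voice with a generic synthetic voice. Qwen3.5-LiveTranslate-Flash instead clones the original speaker's characteristic voice features in real time, during the translation itself. A single spoken sentence is enough for the model to perform this acoustic adaptation. To listeners on the receiving end, the translated output sounds like the same speaker talking in the target language rather than a robotic substitute. This matters in live conference interpretation, multilingual livestreams, and international customer calls. The experience is noticeably more human and natural than what other current systems deliver.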

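Dynamic Configuration of Domain-Specific Keywords

In professional settings, the most common failure mode for translation models is the mistranslation of proper nouns and specialized vocabulary. A model translating a medical briefing may consistently mistranslate a drug name, and a legal interpretation session can go off track over a technical legal term. Qwen3.5-LiveTranslate-Flash addresses this with dynamic keyword configuration at runtime. Developers can inject a glossary of brand names, medical terms, legal terms, or technical vocabulary into the model, and the model handles those terms far more reliably and accurately. Most general-purpose translation APIs do not offer this feature, and it closes a real gap for enterprise deployments focused on a specific industry domain.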

Benchmark Performance

On FLEURS and CoVoST2, two established benchmarks for multilingual speech translation, Qwen3.5-LiveTranslate-Flash outperforms other models.

Original Text (English)

Simultaneous interpretation is one of the harder problems in applied AI. You're asking a model to translate speech before the speaker has finished a sentence. Every extra second of delay breaks the illusion of real-time communication. Alibaba's Qwen team has been chipping away at this with each release. Their latest model, Qwen3.5-LiveTranslate-Flash, brings that latency down to 2.8 seconds and expands input language coverage to 60 languages.

A Meaningful Jump From the Previous Release

The Qwen3-LiveTranslate-Flash handled 18 input languages at roughly three seconds of latency. Qwen3.5-LiveTranslate-Flash brings that down to 2.8 seconds, expands input coverage to 60 languages, and adds speech output in 29 languages. That's more than a 3× expansion in language coverage on the input side. For devs building multilingual products, this reduces the need for per-language model switching in most global enterprise scenarios.

The latency improvement comes from a technique for processing what the team calls "reading units." Rather than waiting for a full sentence to arrive before producing output, the model decides when enough meaning has accumulated in a segment to commit to a translation. It streams output continuously while the speaker is still talking. This is the same underlying logic as semantic unit prediction but with a tighter implementation that shaves off that extra 200 milliseconds.

Vision Is Now a First-Class Input

Most translation systems treat audio as the only input signal. That works fine in clean studio conditions. It breaks down in a crowded conference room, a noisy trade floor, or anywhere with overlapping voices and bad acoustics. Qwen3.5-LiveTranslate-Flash takes a different approach.
It analyzes visual information in parallel with audio: on-screen text, physically shown objects, lip movements, and gestures. When a word is phonetically ambiguous or the audio stream degrades, the visual context fills the gap and sharpens the translation decision. This is not a minor feature. In real-world deployment, audio quality is rarely guaranteed. Having a vision channel means the model handles the messy reality of live interpretation more gracefully than audio-only systems.

Voice Cloning Happens in Real Time

This is the part that stands out most in the Qwen3.5 release. Standard translation systems replace the speaker's voice with a generic synthesis voice. Qwen3.5-LiveTranslate-Flash instead clones the characteristic voice features of the original speaker during the translation itself. A single spoken sentence is enough for the model to perform this acoustic adaptation. For listeners on the receiving end, the translated output sounds like the same person speaking the target language and not a robotic substitute. In live conference interpretation, multilingual livestreams, or international customer calls, this is important. The experience feels noticeably more human than what current systems deliver.

Configure Domain-Specific Keywords

One persistent failure mode for translation models in professional settings is proper nouns and specialized vocabulary. A model translating a medical briefing might consistently mistranslate a drug name. A legal interpretation session breaks down over a technical statute term. Qwen3.5-LiveTranslate-Flash addresses this with dynamic keyword configuration at runtime. Developers can inject a glossary of brand names, medical terms, legal terminology, or technical vocabulary, and the model handles those terms significantly more reliably. This isn't available in most general-purpose translation APIs, and it closes a real gap for domain-specific enterprise deployments.
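The article does not show the request format for keyword configuration, so the sketch below does not touch the API. It only illustrates what a domain glossary might look like as data, and how a team could check translated output against it. `GLOSSARY`, `missing_terms`, and the sample sentences are hypothetical.

```python
from typing import Dict, List

# Hypothetical glossary: source term -> rendering required in the target text.
# Brand names are typically kept untranslated.
GLOSSARY: Dict[str, str] = {
    "Model Studio": "Model Studio",
    "DashScope": "DashScope",
}


def missing_terms(source: str, translation: str, glossary: Dict[str, str]) -> List[str]:
    """Return glossary terms that occur in the source but whose required
    rendering is absent from the translation (case-insensitive match)."""
    lowered = translation.lower()
    return [
        term
        for term, rendering in glossary.items()
        if term in source and rendering.lower() not in lowered
    ]


source = "Open Model Studio and create a key."
good = "Abra Model Studio y cree una clave."
bad = "Abra Estudio de Modelos y cree una clave."

print(missing_terms(source, good, GLOSSARY))  # no violations
print(missing_terms(source, bad, GLOSSARY))   # the brand name was translated
```

A check like this runs after translation and is independent of the model, so it can serve as a regression test for whichever glossary mechanism the API actually exposes.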
Benchmark Performance

On FLEURS and CoVoST2, two established benchmarks for multilingual speech translation, Qwen3.5-LiveTranslate-Flash outperforms major commercial alternatives. FLEURS tests translation quality across a wide variety of language pairs under real acoustic conditions. CoVoST2 covers 21 translation directions from speech, making it a practical proxy for multilingual pipeline performance.

Marktechpost's Visual Explainer

Developer Guide: How to Use Qwen3.5-LiveTranslate-Flash

A step-by-step integration guide, from setup to production-ready real-time translation.

Steps: 1 Overview, 2 Prerequisites, 3 Connect, 4 Send Audio, 5 Visual Input, 6 Keywords, 7 Languages

What it does: Qwen3.5-LiveTranslate-Flash at a glance

Qwen3.5-LiveTranslate-Flash is an API-only, closed-weight real-time translation model from Alibaba's Qwen team. It takes audio and video frames as simultaneous inputs and outputs translated text and speech. The model uses a WebSocket-based protocol over Alibaba Cloud Model Studio.

Latency: 2.8s (per token to audio out)
Input languages: 60 (speech + visual input)
Speech output: 29 (languages with voice)
Protocol: WebSocket (persistent connection)

◆ Vision-enhanced comprehension: lip movements, gestures, and on-screen text all feed into the translation decision alongside audio
◆ Real-time voice cloning: clones the original speaker's voice profile in the translated output from a single spoken sentence
◆ Semantic unit prediction: commits to output segments before a full sentence ends, enabling continuous streaming without waiting for complete utterances
◆ Dynamic keyword configuration: inject domain-specific glossaries at runtime for technical, medical, or legal terminology

Before you start: Prerequisites

You need an Alibaba Cloud account with Model Studio access and a valid DashScope API key. The model is available through the qwen3-livetranslate-flash-realtime model ID.
1. Create an Alibaba Cloud account. Sign up at alibabacloud.com and activate Alibaba Cloud Model Studio in your account dashboard.
2. Get your DashScope API key. Navigate to Model Studio → API Keys. Generate a key and store it as the environment variable DASHSCOPE_API_KEY. Never hardcode it in source files.
3. Install the Python dependency. Install the websocket-client package for the WebSocket connection. For audio capture, also install pyaudio.
4. Check your audio setup. The model accepts 16kHz, 16-bit PCM mono audio on input. Confirm your microphone or audio source can output in this format before connecting.

```bash
# Install dependencies
pip install websocket-client pyaudio

# Set your API key as an environment variable (no spaces around "=")
export DASHSCOPE_API_KEY="your_key_here"
```

Step 3: Connection

Establish the WebSocket connection

The model uses the WebSocket protocol for a persistent, bidirectional connection. You authenticate via a Bearer token in the connection header using your DashScope API key.

```python
import json
import os

import websocket  # provided by the websocket-client package

API_KEY = os.getenv("DASHSCOPE_API_KEY")
API_URL = (
    "wss://dashscope-intl.aliyuncs.com"
    "/api-ws/v1/realtime"
    "?model=qwen3-livetranslate-flash-realtime"
)


def on_open(ws):
    print("Connected to Qwen3.5-LiveTranslate-Flash")


def on_message(ws, message):
    # Each server message is a JSON-encoded event.
    data = json.loads(message)
    print("Translation event:", data)


def on_error(ws, error):
    print("Error:", error)


ws = websocket.WebSocketApp(
    API_URL,
    header=["Authorization: Bearer " + API_KEY],
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
)
ws.run_forever()
```

ⓘ The connection stays open for the full session. You do not reconnect per utterance. Send audio chunks and image frames continuously over the same socket.

Step 4: Audio streaming

Configure and stream audio input

After connecting, send a session configuration event to set the source and target languages. Then stream PCM audio chunks continuously.
The model uses session.input_audio_transcription.language to identify the input language.

```python
import base64, pyaudio

# Audio input config: 16kHz, 16-bit PCM mono
INPUT_RATE = 16000
INPUT_CHUNK = 1600  # 100ms per
```
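The audio parameters in this step (16 kHz, 16-bit mono PCM, 1,600-sample chunks) fix the size and duration of every chunk, which a short sketch can make concrete. The helper names `chunk_pcm` and `encode_chunk` are hypothetical, and base64 encoding is an assumption based only on the guide's `import base64`. The event envelope that would carry each chunk is not shown in the excerpt, so it is left out.

```python
import base64
from typing import Iterator

INPUT_RATE = 16000    # samples per second (from the guide)
INPUT_CHUNK = 1600    # samples per chunk (from the guide)
BYTES_PER_SAMPLE = 2  # 16-bit PCM, mono

CHUNK_BYTES = INPUT_CHUNK * BYTES_PER_SAMPLE  # bytes in one full chunk
CHUNK_MS = 1000 * INPUT_CHUNK // INPUT_RATE   # duration of one chunk in ms


def chunk_pcm(pcm: bytes, chunk_bytes: int = CHUNK_BYTES) -> Iterator[bytes]:
    """Split raw PCM bytes into fixed-size chunks; the last one may be shorter."""
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


def encode_chunk(chunk: bytes) -> str:
    """Base64-encode one chunk so it can travel inside a JSON text message."""
    return base64.b64encode(chunk).decode("ascii")


# One second of silence stands in for microphone input.
one_second = bytes(INPUT_RATE * BYTES_PER_SAMPLE)
chunks = list(chunk_pcm(one_second))
encoded = [encode_chunk(c) for c in chunks]
print(len(chunks), "chunks of", CHUNK_MS, "ms,", CHUNK_BYTES, "bytes each")
```

With these values, each chunk is 3,200 bytes and covers 100 ms, so one second of audio becomes ten messages. Smaller chunks would lower buffering delay at the cost of more messages per second.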

Speech Recognition and Translation, Multimodal AI, Alibaba Qwen, Real-Time Interpretation, Speech Synthesis