MarkTechPost • 101 days ago

xAI Releases Grok Audio APIs Aimed at Enterprise Voice Developers

IMP

8/10

Key Summary

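Elon Musk's AI company xAI has launched two standalone audio APIs aimed at the established speech market: a Speech-to-Text (STT) API that converts speech to text and a Text-to-Speech (TTS) API that converts text to speech. xAI reports that on phone call entity recognition the Grok STT API has an error rate roughly 2.4 to 4.3 times lower than its competitors (5.0% versus 12.0% to 21.3%), while the TTS API offers emotional expression and fine-grained control over delivery. Enterprise developers can use the two APIs to build meeting transcription, voice assistants, and call center analytics at low per-hour and per-character prices.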

Translated Text

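Elon Musk's AI company xAI has launched two standalone audio APIs: a Speech-to-Text (STT) API and a Text-to-Speech (TTS) API. Both are built on the same infrastructure that powers Grok Voice in mobile apps, Tesla vehicles, and Starlink customer support. With this release, xAI moves squarely into the competitive speech API market currently occupied by ElevenLabs, Deepgram, and AssemblyAI.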

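What Is the Grok Speech-to-Text (STT) API?

Speech-to-Text is the technology that converts spoken audio into written text. For developers building meeting transcription tools, voice agents, call center analytics, or accessibility features, an STT API is a core building block. Rather than building this from scratch, developers call an endpoint, send audio, and receive a structured transcript in return.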

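The Grok STT API is now generally available. It supports both batch and streaming modes and offers transcription in 25 languages. Batch mode is designed for processing pre-recorded audio files, while streaming mode enables real-time transcription as audio is captured. Pricing is straightforward: $0.10 per hour for batch and $0.20 per hour for streaming.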
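The per-hour prices above translate into a simple cost estimate. The following is a minimal sketch, assuming billing is strictly proportional to audio duration; the article does not state rounding or minimum-charge rules, and the function name `stt_cost_usd` is hypothetical.

```python
from decimal import Decimal

# Per-hour prices quoted in the article (USD).
STT_PRICE_PER_HOUR = {"batch": Decimal("0.10"), "streaming": Decimal("0.20")}


def stt_cost_usd(audio_seconds: float, mode: str = "batch") -> Decimal:
    """Estimate transcription cost for a given audio duration.

    Assumption: cost scales linearly with duration. Real billing may round
    or apply minimums that the article does not describe.
    """
    if mode not in STT_PRICE_PER_HOUR:
        raise ValueError(f"unknown mode: {mode!r}")
    hours = Decimal(str(audio_seconds)) / Decimal(3600)
    # Round to a hundredth of a cent for display purposes.
    return (hours * STT_PRICE_PER_HOUR[mode]).quantize(Decimal("0.0001"))


if __name__ == "__main__":
    seconds = 250 * 3600  # 250 hours of recorded calls
    print("batch:    ", stt_cost_usd(seconds, "batch"))
    print("streaming:", stt_cost_usd(seconds, "streaming"))
```

Using `Decimal` rather than `float` keeps small per-hour prices exact when they are summed over many files.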

The API includes word-level timestamps, speaker diarization, and multichannel support, along with intelligent Inverse Text Normalization that correctly handles numbers, dates, currencies, and more. It also accepts 12 audio formats: nine container formats (WAV, MP3, OGG, Opus, FLAC, AAC, MP4, M4A, MKV) and three raw formats (PCM, µ-law, A-law), with a maximum file size of 500 MB per request.
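A client could check a file against these limits before uploading it. This is a rough sketch under stated assumptions: the mapping from file extension to container format is a convention chosen for illustration, raw PCM, µ-law, and A-law input is not covered because it has no standard extension, and "500 MB" is read as 500 × 10^6 bytes because the article does not say which definition applies. `check_upload` is a hypothetical helper, not part of any xAI SDK.

```python
from pathlib import Path

# Container formats listed in the article, keyed by a conventional file
# extension. The extension mapping is an assumption for illustration.
CONTAINER_EXTENSIONS = {
    ".wav", ".mp3", ".ogg", ".opus", ".flac", ".aac", ".mp4", ".m4a", ".mkv",
}

# Assumption: 1 MB = 10**6 bytes (the stricter of the two common readings).
MAX_REQUEST_BYTES = 500 * 10**6


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in CONTAINER_EXTENSIONS:
        problems.append(f"unrecognized container extension: {ext or '(none)'}")
    if size_bytes > MAX_REQUEST_BYTES:
        problems.append(
            f"file is {size_bytes} bytes; limit assumed to be {MAX_REQUEST_BYTES}"
        )
    return problems


if __name__ == "__main__":
    print(check_upload("standup.m4a", 42_000_000))
    print(check_upload("archive.zip", 700_000_000))
```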

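Speaker diarization is the process of separating audio by individual speaker, answering the question "who said what." This is critical for multi-speaker recordings such as meetings, interviews, or customer calls. Word-level timestamps assign precise start and end times to each word in the transcript, enabling use cases such as subtitle generation, searchable recordings, and legal documentation. Inverse Text Normalization converts spoken forms like "one hundred sixty-seven thousand nine hundred eighty-three dollars and fifteen cents" into readable structured output: "$167,983.15".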
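To illustrate why word-level timestamps matter for subtitle generation, the sketch below turns a list of timed words into SubRip (SRT) cues. The `Word` structure is an assumption made for this example; the article does not describe the actual shape of the transcript the API returns.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical timed word; field names are not taken from the API."""
    text: str
    start: float  # seconds from the beginning of the audio
    end: float


def srt_time(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used by SRT files."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words: list[Word], max_words: int = 7) -> str:
    """Group words into fixed-size cues, timed by their first and last word."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        index = i // max_words + 1
        timing = f"{srt_time(chunk[0].start)} --> {srt_time(chunk[-1].end)}"
        text = " ".join(w.text for w in chunk)
        cues.append(f"{index}\n{timing}\n{text}\n")
    return "\n".join(cues)


if __name__ == "__main__":
    demo = [Word("Thanks", 0.0, 0.4), Word("for", 0.45, 0.6),
            Word("calling", 0.65, 1.1), Word("today", 1.15, 1.6)]
    print(words_to_srt(demo, max_words=2))
```

Fixed-size grouping is the simplest choice; a production subtitle tool would more likely break cues at pauses or punctuation.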

Benchmark Performance

The xAI research team is making strong claims about accuracy. On phone call entity recognition (names, account numbers, dates), Grok STT reports a 5.0% error rate versus 12.0% for ElevenLabs, 13.5% for Deepgram, and 21.3% for AssemblyAI. That is a substantial margin if it holds in production. For video and podcast transcription, Grok and ElevenLabs tied at a 2.4% error rate, with Deepgram and AssemblyAI trailing at 3.0% and 3.2% respectively. The xAI team also reports a 6.9% word error rate on general audio benchmarks.
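For readers unfamiliar with the metric, word error rate is conventionally defined as (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The sketch below computes it with a word-level edit distance. It shows the standard definition only; the article does not say how xAI normalized text or scored its benchmarks, and the entity-recognition error rates may use a different metric.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))

    # Ratio of each competitor's reported phone-call error rate to Grok's 5.0%.
    reported = {"ElevenLabs": 12.0, "Deepgram": 13.5, "AssemblyAI": 21.3}
    for name, rate in reported.items():
        print(name, round(rate / 5.0, 2))
```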

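What Is the Grok Text-to-Speech (TTS) API?

Text-to-Speech converts written text into spoken audio. Developers use TTS APIs to power voice assistants, read-aloud features, podcast generation, IVR (interactive voice response) systems, and accessibility tools. The Grok TTS API delivers fast, natural speech synthesis with detailed control via speech tags, and is priced at $4.20 per 1 million characters.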

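The API accepts up to 15,000 characters per REST request. For longer content, a WebSocket streaming endpoint is available that has no text length limit and begins returning audio before the full input is processed. The API supports 20 languages and five distinct voices (Ara, Eve, Leo, Rex, and Sal), with Eve set as the default.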
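When the streaming endpoint is not an option, one way to stay under the 15,000-character REST limit is to split long text at sentence boundaries before sending it. The helper below is a sketch of that idea, not a documented client feature. It counts characters with Python's `len`, and the article does not say whether the API counts characters the same way or whether speech tags count toward the limit.

```python
import re

MAX_REST_CHARS = 15_000  # per-request limit quoted in the article


def chunk_text(text: str, limit: int = MAX_REST_CHARS) -> list[str]:
    """Split text into pieces of at most `limit` characters.

    Pieces end at sentence boundaries where possible; a single sentence
    longer than the limit is cut at the limit as a fallback.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")

    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Fallback: hard-split any sentence that cannot fit in one request.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    article = "First sentence. " * 2000  # about 32,000 characters
    parts = chunk_text(article)
    print(len(parts), [len(p) for p in parts])
```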

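Beyond voice selection, developers can insert inline and wrapping speech tags to control delivery. These include inline tags such as [laugh], [sigh], and [breath], and wrapping tags such as <whisper>text</whisper> and <emphasis>text</emphasis>, which let developers create engaging, lifelike delivery without complex markup. This expressiveness addresses one of the core limitations of traditional TTS systems, which often produce technically correct but emotionally flat output.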
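The tag syntax quoted in the original article, [laugh], [sigh], [breath], <whisper>text</whisper>, and <emphasis>text</emphasis>, is plain text, so a script can be assembled with ordinary string formatting. The helpers below are hypothetical conveniences that only know the five tags named in the article; the API may support others that are not listed there.

```python
# Only the tags named in the article; the full supported set is not known here.
INLINE_TAGS = {"laugh", "sigh", "breath"}
WRAPPING_TAGS = {"whisper", "emphasis"}


def inline(tag: str) -> str:
    """Render an inline tag such as [laugh]."""
    if tag not in INLINE_TAGS:
        raise ValueError(f"unknown inline tag: {tag!r}")
    return f"[{tag}]"


def wrap(tag: str, text: str) -> str:
    """Render a wrapping tag such as <whisper>text</whisper>."""
    if tag not in WRAPPING_TAGS:
        raise ValueError(f"unknown wrapping tag: {tag!r}")
    return f"<{tag}>{text}</{tag}>"


if __name__ == "__main__":
    script = " ".join([
        "I can't believe that actually worked",
        inline("laugh"),
        wrap("whisper", "but let's keep it between us."),
        wrap("emphasis", "Seriously."),
    ])
    print(script)
```

Validating tag names locally catches typos before the text is sent for synthesis, where an unrecognized tag might otherwise be read aloud.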

View Original (English)

Elon Musk's AI company xAI has launched two standalone audio APIs, a Speech-to-Text (STT) API and a Text-to-Speech (TTS) API, both built on the same infrastructure that powers Grok Voice on mobile apps, Tesla vehicles, and Starlink customer support. The release moves xAI squarely into the competitive speech API market currently occupied by ElevenLabs, Deepgram, and AssemblyAI.

What Is the Grok Speech-to-Text API?

Speech-to-Text is the technology that converts spoken audio into written text. For developers building meeting transcription tools, voice agents, call center analytics, or accessibility features, an STT API is a core building block. Rather than developing this from scratch, developers call an endpoint, send audio, and receive a structured transcript in return.

The Grok STT API is now generally available, offering transcription across 25 languages with both batch and streaming modes. The batch mode is designed for processing pre-recorded audio files, while streaming enables real-time transcription as audio is captured. Pricing is kept straightforward: Speech-to-Text is $0.10 per hour for batch and $0.20 per hour for streaming.

The API includes word-level timestamps, speaker diarization, and multichannel support, along with intelligent Inverse Text Normalization that correctly handles numbers, dates, currencies, and more. It also accepts 12 audio formats: nine container formats (WAV, MP3, OGG, Opus, FLAC, AAC, MP4, M4A, MKV) and three raw formats (PCM, µ-law, A-law), with a maximum file size of 500 MB per request.

Speaker diarization is the process of separating audio by individual speakers, answering the question "who said what." This is critical for multi-speaker recordings like meetings, interviews, or customer calls. Word-level timestamps assign precise start and end times to each word in the transcript, enabling use cases like subtitle generation, searchable recordings, and legal documentation. Inverse Text Normalization converts spoken forms like "one hundred sixty-seven thousand nine hundred eighty-three dollars and fifteen cents" into readable structured output: "$167,983.15".

Benchmark Performance

The xAI research team is making strong claims on accuracy. On phone call entity recognition (names, account numbers, dates), Grok STT claims a 5.0% error rate versus ElevenLabs at 12.0%, Deepgram at 13.5%, and AssemblyAI at 21.3%. That is a substantial margin if it holds in production. For video and podcast transcription, Grok and ElevenLabs tied at a 2.4% error rate, with Deepgram and AssemblyAI trailing at 3.0% and 3.2% respectively. The xAI team also reports a 6.9% word error rate on general audio benchmarks.

What Is the Grok Text-to-Speech API?

Text-to-Speech converts written text into spoken audio. Developers use TTS APIs to power voice assistants, read-aloud features, podcast generation, IVR (interactive voice response) systems, and accessibility tools. The Grok TTS API delivers fast, natural speech synthesis with detailed control via speech tags, and is priced at $4.20 per 1 million characters.

The API accepts up to 15,000 characters per REST request; for longer content, a WebSocket streaming endpoint is available that has no text length limit and begins returning audio before the full input is processed. The API supports 20 languages and five distinct voices (Ara, Eve, Leo, Rex, and Sal), with Eve set as the default.

Beyond voice selection, developers can inject inline and wrapping speech tags to control delivery. These include inline tags like [laugh], [sigh], and [breath], and wrapping tags like <whisper>text</whisper> and <emphasis>text</emphasis>, letting developers create engaging, lifelike delivery without complex markup. This expressiveness addresses one of the core limitations of traditional TTS systems, which often produce technically correct but emotionally flat output.

Key Takeaways

xAI has launched two standalone audio APIs, Grok Speech-to-Text (STT) and Text-to-Speech (TTS), built on the same production stack already serving millions of users across Grok mobile apps, Tesla vehicles, and Starlink customer support.

The Grok STT API offers real-time and batch transcription across 25 languages with speaker diarization, word-level timestamps, Inverse Text Normalization, and support for 12 audio formats, priced at $0.10/hour for batch and $0.20/hour for streaming.

On phone call entity recognition benchmarks, Grok STT reports a 5.0% error rate, significantly outperforming ElevenLabs (12.0%), Deepgram (13.5%), and AssemblyAI (21.3%), with particularly strong performance in medical, legal, and financial use cases.

The Grok TTS API supports five expressive voices (Ara, Eve, Leo, Rex, Sal) across 20 languages, with inline and wrapping speech tags like [laugh], [sigh], and <whisper> giving developers fine-grained control over vocal delivery, priced at $4.20 per 1 million characters.

By Michal Sutter

Voice AI API xAI Grok TTS