MarkTechPost • 82 days ago

OpenAI Releases Three Audio Models for the Realtime API

IMP

8/10

Key Summary

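OpenAI has moved the Realtime API to general availability (GA) and released three audio models specialized for voice reasoning, live translation, and streaming transcription. With this update, developers can build voice applications that go beyond simple Q&A: they handle interruptions and complex reasoning naturally, and they support real-time multilingual translation.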

Translated Text

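OpenAI has released three new audio models through the Realtime API, each targeting a different core capability of live voice applications: GPT-Realtime-2 for voice agents with reasoning, GPT-Realtime-Translate for live speech translation, and GPT-Realtime-Whisper for streaming transcription. With this release, the Realtime API officially leaves beta and becomes generally available, which is a meaningful signal for developers who had been holding off on building production-grade systems. All three models are available immediately through the OpenAI API and can be tested in the Playground. Together, they move voice applications past basic question-and-answer (Q&A) toward systems that can listen, reason, translate, transcribe, and act within a single conversation.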

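GPT-Realtime-2: Voice Reasoning with a 128K Context Window

The centerpiece of this release is GPT-Realtime-2, which the OpenAI team describes as its first voice model with GPT-5-class reasoning. GPT-Realtime-2 can handle more complex requests, manage user interruptions, and carry a conversation forward naturally. OpenAI expanded the context window from 32K to 128K tokens, so the model can sustain longer conversations and more complex tasks without losing context. Earlier voice models often stalled on multi-step requests or dropped early context as a session grew long. GPT-Realtime-2 is designed specifically to keep the conversation flowing while it reasons through a request. Developers can enable short preamble phrases, such as "let me check that" or "one moment while I look into it," so the user knows the agent is working on the request. The model can also call multiple tools at once and describe what it is doing as it does it, so the user hears live progress instead of dead air during a multi-step task. These features directly address one of the most common failure modes in deployed voice agents: the awkward silence that makes a system feel broken.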

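A feature that is especially useful for production builders is adjustable reasoning effort. Developers can set reasoning intensity to one of five levels: minimal, low, medium, high, and xhigh. The default is "low," which keeps latency down for simple requests, while more complex tasks can draw on more compute. Teams can therefore tune the tradeoff between performance and latency at the session level for each use case. A simple customer lookup, for example, does not need the same reasoning depth as a multi-step travel booking workflow.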
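The per-use-case choice of reasoning effort can be sketched as a small lookup. This is a minimal illustration, not the API's documented schema: the five level names and the "low" default come from the article, while the use-case names, the `pick_effort` and `build_session_config` helpers, and the `reasoning_effort` config key are hypothetical.

```python
# The five levels named in the article, ordered from least to most reasoning.
EFFORT_LEVELS = ("minimal", "low", "medium", "high", "xhigh")

# The article states that "low" is the default.
DEFAULT_EFFORT = "low"

# Hypothetical mapping from use case to effort level; tune per application.
USE_CASE_EFFORT = {
    "customer_lookup": "low",    # quick lookup, latency matters most
    "travel_booking": "high",    # multi-step workflow, depth matters more
}


def pick_effort(use_case: str) -> str:
    """Return the effort level for a use case, falling back to the default."""
    effort = USE_CASE_EFFORT.get(use_case, DEFAULT_EFFORT)
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"unknown reasoning effort: {effort!r}")
    return effort


def build_session_config(use_case: str) -> dict:
    """Build an illustrative session config.

    The key names are assumptions for this sketch, not the real request schema.
    """
    return {"model": "gpt-realtime-2", "reasoning_effort": pick_effort(use_case)}


if __name__ == "__main__":
    print(build_session_config("customer_lookup"))
    print(build_session_config("travel_booking"))
```

Keeping the mapping in one table makes the latency-versus-depth decision explicit and easy to review per workflow.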

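GPT-Realtime-2 also adds tone control. The model can adjust how it speaks to fit the situation: staying calm during troubleshooting, shifting to an empathetic tone when the user is frustrated, and turning bright and upbeat after a successful outcome. It is also better at understanding industry-specific terminology, including healthcare vocabulary and proper nouns.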

The benchmarks show notable gains as well. GPT-Realtime-2 with "high" reasoning scored 96.6% on Big Bench Audio, a 15.2 percentage point improvement over GPT-Realtime-1.5's 81.4%. GPT-Realtime-2 with "xhigh" reasoning scored 48.5% on the instruction following portion of Audio MultiChallenge, compared with 34.7% for GPT-Realtime-1.5. Big Bench Audio evaluates demanding reasoning capabilities in language models that support audio input. Audio MultiChallenge evaluates the multi-turn conversational intelligence of spoken dialogue systems, including instruction following, context integration, self-consistency, and handling of natural speech corrections.
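The reported gains can be checked with a few lines of arithmetic. The sketch below uses only the four scores quoted above; the `gains` helper is a name chosen for this example, and the relative-improvement figure is derived here rather than stated in the article.

```python
def gains(old: float, new: float) -> tuple[float, float]:
    """Return (percentage-point gain, relative gain in percent), rounded to 1 decimal."""
    points = new - old
    relative = points / old * 100
    return round(points, 1), round(relative, 1)


# Scores quoted in the article: GPT-Realtime-1.5 versus GPT-Realtime-2.
big_bench_audio = gains(81.4, 96.6)
audio_multichallenge = gains(34.7, 48.5)

if __name__ == "__main__":
    print("Big Bench Audio:", big_bench_audio)            # (15.2, 18.7)
    print("Audio MultiChallenge:", audio_multichallenge)  # (13.8, 39.8)
```

The Audio MultiChallenge gain is smaller in percentage points (13.8 versus 15.2) but larger in relative terms, because the starting score was much lower.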

Pricing: GPT-Realtime-2 costs $32 per 1M audio input tokens ($0.40 for cached input tokens) and $64 per 1M audio output tokens.
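To make the token pricing concrete, here is a rough cost estimate for one session. The three rates are the ones quoted above. The sketch assumes the cached-input rate is also per 1M tokens, and the token counts in the example are made up for illustration.

```python
from decimal import Decimal

# USD per 1M tokens, as quoted in the article.
# Assumption: the $0.40 cached-input rate is also per 1M tokens.
PRICE_PER_MILLION = {
    "audio_input": Decimal("32"),
    "cached_input": Decimal("0.40"),
    "audio_output": Decimal("64"),
}


def session_cost(audio_input: int, cached_input: int, audio_output: int) -> Decimal:
    """Estimate the USD cost of a session from its token counts."""
    tokens = {
        "audio_input": audio_input,
        "cached_input": cached_input,
        "audio_output": audio_output,
    }
    total = sum(PRICE_PER_MILLION[kind] * count for kind, count in tokens.items())
    return (total / Decimal(1_000_000)).quantize(Decimal("0.0001"))


if __name__ == "__main__":
    # Hypothetical session: 100k fresh input, 50k cached input, 20k output tokens.
    print(session_cost(100_000, 50_000, 20_000))  # 4.5000
```

In this example the output tokens account for $1.28 of the $4.50 total despite being the smallest count, since output is priced at twice the input rate.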

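GPT-Realtime-Translate: Live Speech Translation Across 70+ Languages

GPT-Realtime-Translate is a new live translation model that translates speech from more than 70 input languages into 13 output languages in real time, keeping pace with the speaker. Unlike GPT-Realtime-2, this model acts as a dedicated translation pipe. In other words, speech in one language goes in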

Original Text (English)

OpenAI released three new audio models through its Realtime API, each targeting a distinct capability in live voice applications: GPT-Realtime-2 for voice agents with reasoning, GPT-Realtime-Translate for live speech translation, and GPT-Realtime-Whisper for streaming transcription. Alongside the model releases, the Realtime API officially exits beta and is now generally available, a meaningful signal for developers who held off building production systems on it. All three models are available immediately through the OpenAI API and can be tested in the Playground. Together, they push voice applications past the basic question-and-answer loop, toward systems that can listen, reason, translate, transcribe, and act within a single conversation.

GPT-Realtime-2: Voice Reasoning with a 128K Context Window

The flagship release is GPT-Realtime-2, which the OpenAI team describes as its first voice model with GPT-5-class reasoning. GPT-Realtime-2 can process harder requests, manage interruptions, and continue conversations naturally. OpenAI expanded the model's context window from 32K to 128K tokens, allowing longer conversations and more complex tasks without losing context. Previous voice models frequently stalled on multi-step requests or dropped earlier context during longer sessions. GPT-Realtime-2 is specifically designed to keep the conversation moving while it reasons through a request. Developers can enable short preamble phrases, like "let me check that" or "one moment while I look into it," so users know the agent is working on the request. The model can also call multiple tools at once and narrate what it is doing while it does so: instead of dead air during a multi-step task, the user gets a running commentary.
These features directly address one of the most common failure modes in deployed voice agents: awkward silence that makes the system feel broken.

A particularly useful control for production builders is adjustable reasoning effort. Developers can dial reasoning intensity across five levels: minimal, low, medium, high, and xhigh. The default is "low" to keep latency down for simple requests, while tougher tasks can tap into more compute. This means teams can tune the performance-latency tradeoff at the session level depending on the use case. A quick customer lookup doesn't need the same reasoning depth as a multi-step travel booking workflow.

GPT-Realtime-2 also adds tone control. The model can adjust its speaking style depending on the situation: staying calm during problem-solving, shifting to empathetic when users are frustrated, and turning upbeat after a successful outcome. The model is also better at understanding industry-specific terminology, including healthcare vocabulary and proper nouns.

On benchmarks, the gains are measurable. GPT-Realtime-2 with high reasoning scored 96.6% on Big Bench Audio, compared to 81.4% for GPT-Realtime-1.5, a 15.2 percentage point improvement. GPT-Realtime-2 with xhigh reasoning scored 48.5% on Audio MultiChallenge instruction following, compared to 34.7% for GPT-Realtime-1.5. Big Bench Audio evaluates challenging reasoning capabilities in language models that support audio input. Audio MultiChallenge evaluates multi-turn conversational intelligence in spoken dialogue systems, including instruction following, context integration, self-consistency, and handling natural speech corrections.

Pricing: GPT-Realtime-2 is priced at $32 per 1M audio input tokens ($0.40 for cached input tokens) and $64 per 1M audio output tokens.
GPT-Realtime-Translate: Live Speech Translation Across 70+ Languages

GPT-Realtime-Translate is a new live translation model that translates speech from 70+ input languages into 13 output languages while keeping pace with the speaker. Unlike GPT-Realtime-2, this model is a dedicated translation pipe: speech goes in one language and comes out in another. It is not a conversational agent; it is designed to convert one audio stream into another in real time.

The distinction is important for developers choosing the right tool. If your application needs a bilingual customer support flow or a live interpreter for an in-person event, GPT-Realtime-Translate is the purpose-built option. If you need the model to also reason, call functions, or hold context across turns, GPT-Realtime-2 handles that.

Pricing: GPT-Realtime-Translate is priced at $0.034 per minute.

GPT-Realtime-Whisper: Streaming Transcription as People Speak

GPT-Realtime-Whisper is a new streaming speech-to-text model built for low latency: it transcribes audio as people speak, so live products can feel faster, more responsive, and more natural. The original Whisper model was designed for completed chunks of audio, making it better suited for post-session transcription. GPT-Realtime-Whisper is the streaming counterpart, purpose-built for applications that need live output. For realtime transcription, gpt-realtime-whisper gives you controllable latency: lower delay settings produce earlier partial text, while higher delay settings can improve transcript quality. Use cases include live broadcast captions, meeting notes generated during the conversation, and voice agents that need to continuously understand the user rather than wait for turn-by-turn input.

Pricing: GPT-Realtime-Whisper is priced at $0.017 per minute.
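The two per-minute rates lend themselves to a quick estimate. The sketch below uses the prices quoted above. It assumes that billing is proportional to audio duration with no rounding up to whole minutes, which the article does not state, and the lowercase `gpt-realtime-translate` key is an assumed identifier modeled on the article's `gpt-realtime-whisper`.

```python
from decimal import Decimal

# USD per minute of audio, as quoted in the article.
PRICE_PER_MINUTE = {
    "gpt-realtime-translate": Decimal("0.034"),  # identifier casing is assumed
    "gpt-realtime-whisper": Decimal("0.017"),
}


def audio_cost(model: str, seconds: int) -> Decimal:
    """Estimate the USD cost of `seconds` of audio.

    Assumption: cost scales linearly with duration (no per-minute rounding).
    """
    minutes = Decimal(seconds) / Decimal(60)
    return (PRICE_PER_MINUTE[model] * minutes).quantize(Decimal("0.0001"))


if __name__ == "__main__":
    one_hour = 60 * 60
    print(audio_cost("gpt-realtime-translate", one_hour))  # 2.0400
    print(audio_cost("gpt-realtime-whisper", one_hour))    # 1.0200
```

Under these assumptions, an hour of live translation costs about $2.04 and an hour of streaming transcription about $1.02.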
Architecture Patterns and New Voices

Developers can choose between three session types depending on the use case: a voice-agent session when the application needs an assistant that responds to the user, a translation session when the application needs an interpreter, and a transcription session when text from audio is needed without model-generated responses. On the voice output side, two new voices, Cedar and Marin, join the API roster exclusively with this release.

All three models (GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper) are available now through the OpenAI Realtime API, which is generally available starting today.

Key Takeaways

GPT-Realtime-2 brings GPT-5-class reasoning to voice with a 128K context window, five-level adjustable reasoning effort, tone control, parallel tool calls, and interruption recovery.

On Big Bench Audio, GPT-Realtime-2 (high) scores 96.6% vs. 81.4% for GPT-Realtime-1.5; on Audio MultiChallenge, the xhigh variant scores 48.5% vs. 34.7%.

GPT-Realtime-Translate handles live speech translation across 70+ input languages into 13 output languages at $0.034/min.

GPT-Realtime-Whisper streams transcription in real time with controllable latency at $0.017/min.

The Realtime API exits beta and goes generally available today alongside two new voices, Cedar and Marin.
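The choice among the three session types described under Architecture Patterns can be sketched as a small decision helper. The three categories and the model pairing follow the article. The `SessionType` enum, its string values, the `choose_session` function, and its two boolean inputs are hypothetical names for this illustration, not part of the API.

```python
from enum import Enum


class SessionType(Enum):
    """The three session types named in the article (string values are illustrative)."""
    VOICE_AGENT = "voice-agent"
    TRANSLATION = "translation"
    TRANSCRIPTION = "transcription"


# Model the article associates with each session type.
MODEL_FOR_SESSION = {
    SessionType.VOICE_AGENT: "GPT-Realtime-2",
    SessionType.TRANSLATION: "GPT-Realtime-Translate",
    SessionType.TRANSCRIPTION: "GPT-Realtime-Whisper",
}


def choose_session(needs_assistant_reply: bool, needs_interpreter: bool) -> SessionType:
    """Pick a session type from two yes/no requirements.

    An assistant that responds (reasons, calls functions, holds context) takes
    priority; otherwise an interpreter; otherwise plain text from audio.
    """
    if needs_assistant_reply:
        return SessionType.VOICE_AGENT
    if needs_interpreter:
        return SessionType.TRANSLATION
    return SessionType.TRANSCRIPTION


if __name__ == "__main__":
    session = choose_session(needs_assistant_reply=False, needs_interpreter=True)
    print(session.value, "->", MODEL_FOR_SESSION[session])
```

Checking the assistant requirement first mirrors the article's guidance that an application needing reasoning or function calls belongs on GPT-Realtime-2 even when languages differ.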

voice-audio realtime-API reasoning-models translation-interpretation API-pricing