The Decoder • 78 days ago

Mira Murati's Thinking Machines Criticizes OpenAI, Unveils Its First Voice AI

IMP

8/10

Key Summary

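Thinking Machines Lab, founded by former OpenAI CTO Mira Murati, has released its first AI model. The model processes audio, video, and text simultaneously in 200-millisecond chunks, moving beyond rigid question-and-answer exchanges to natural, real-time conversation. It outperforms the latest models from OpenAI and Google on benchmarks for real-time interaction quality and latency, and its main technical significance is that it combines fast responses with deep reasoning.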

Translated Text

Thinking Machines Lab, founded by former OpenAI CTO Mira Murati, ships its first AI model: "What OpenAI gets wrong about voice AI is interactivity"

Maximilian Schreiner | May 12, 2026

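Key Points

Thinking Machines Lab has released its first AI model, which processes audio, video, and text in 200-millisecond chunks and replaces rigid turn-taking with fluid, real-time conversation. The model outperforms OpenAI's GPT-Realtime-2 and Google's Gemini Live on interaction quality and latency benchmarks, and it pairs a fast interaction model with a background reasoning model. Despite the technical promise, the startup still faces pressure, as several key employees have recently left the company.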

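Thinking Machines Lab has released a research preview of its first AI model, designed to break voice AI out of the traditional question-and-answer pattern. The model processes audio, video, and text in parallel in 200-millisecond chunks, and the startup claims it beats OpenAI's GPT-Realtime-2 and Google's Gemini Live on interaction quality.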

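Thinking Machines Lab has published a research preview of what it calls "Interaction Models": models that handle interaction natively rather than relying on external scaffolding. The core idea is that interactivity should scale alongside intelligence instead of being a secondary feature added later.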

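Current voice AI systems still feel robotic

Today's real-time systems such as GPT-Realtime or Gemini Live take in audio continuously, but the actual language model never processes it directly. According to Thinking Machines, a kind of external "harness" sits in front of the model. It is made up of separate components, such as a voice activity detector that decides when a speaker's turn is over. Only after the harness has finished is the completed utterance handed to the model, which then generates a complete response.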

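While the model is speaking, its perception is frozen: it receives no new information until it finishes or is interrupted by the user. These harness components are far less intelligent than the model itself. According to Thinking Machines, this means that the behaviors that define real conversation do not work properly. Examples include proactively jumping in ("interrupt me if I say something wrong"), reacting to visual cues ("tell me when I've written a bug"), and speaking at the same time as the user, as in live translation.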
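The harness pattern described above can be pictured with a toy sketch. This is an illustration only, not the pipeline of any real product: the function `harness_turns`, the energy threshold, and the silence window are all hypothetical. It shows how a simple voice activity detector buffers audio and calls the model only once it decides the turn is over.

```python
from typing import Callable, List


def harness_turns(
    frames: List[float],
    respond: Callable[[List[float]], str],
    threshold: float = 0.1,   # hypothetical energy threshold for "speech"
    silence_frames: int = 3,  # hypothetical silence run that ends a turn
) -> List[str]:
    """Toy turn-taking harness: the model sees nothing until a turn ends."""
    replies: List[str] = []
    buffer: List[float] = []
    quiet = 0
    for energy in frames:
        if energy >= threshold:
            buffer.append(energy)  # speech frame: keep buffering
            quiet = 0
        elif buffer:
            quiet += 1
            if quiet >= silence_frames:
                # Only now does the finished utterance reach the model.
                replies.append(respond(buffer))
                buffer = []
                quiet = 0
    if buffer:  # flush a turn that was still open when the audio ended
        replies.append(respond(buffer))
    return replies


frames = [0.5, 0.6, 0.0, 0.0, 0.0, 0.7, 0.0, 0.0, 0.0]
replies = harness_turns(frames, lambda utt: f"reply to {len(utt)} frames")
print(replies)  # ['reply to 2 frames', 'reply to 1 frames']
```

In this sketch the `respond` callback stands in for the language model. It is never invoked mid-utterance, which is why behaviors such as interjecting or speaking alongside the user cannot arise from this structure.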

Citing Richard Sutton's "Bitter Lesson," the lab argues that such hand-crafted systems will eventually be outpaced by advances in general capabilities.

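Thinking Machines' Interaction Models replace the harness with a model that processes the audio and video stream directly instead of receiving pre-segmented utterances. The approach resembles full-duplex models such as Moshi or Nemotron VoiceChat, which work in a similarly interleaved fashion. The difference is that those are smaller-scale models focused on latency rather than on intelligence benchmarks.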

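A 200-millisecond clock replaces artificial turn boundaries

The real break from existing architectures is what the team calls "time-aligned micro-turns." The model continuously processes 200 milliseconds of input and generates 200 milliseconds of output, with the two token streams interleaved. Input and output no longer happen sequentially. Instead, they share the same clock cycle.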

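This eliminates artificial turn boundaries and lets the model decide on its own whether to stay silent, interject, or speak at the same time as the user. Audio and images are not preprocessed through large, standalone encoders. They are fed directly into the transformer with minimal preprocessing. That reduces latency, though it could also limit the model's ability to pick up fine visual details such as text.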
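As a rough mental model of time-aligned micro-turns, the sketch below interleaves one input chunk and one output chunk per clock step. It is a simplified illustration under stated assumptions: `run_micro_turns`, `toy_policy`, and the `<silence>`/`<interject>` markers are hypothetical, and text strings stand in for the audio and video tokens a real model would consume. Only the 200-millisecond chunk length comes from the article.

```python
from typing import Callable, List, Tuple

CHUNK_MS = 200  # chunk length reported in the article

Event = Tuple[str, str]  # ("in" | "out", payload)


def run_micro_turns(
    inputs: List[str],
    step: Callable[[List[Event], str], str],
) -> List[Event]:
    """Interleave input and output chunks on a shared clock."""
    stream: List[Event] = []
    for chunk in inputs:
        stream.append(("in", chunk))          # 200 ms of input arrives
        stream.append(("out", step(stream, chunk)))  # 200 ms of output (may be silence)
    return stream


def toy_policy(stream: List[Event], chunk: str) -> str:
    # Hypothetical policy: interject as soon as the word "wrong" is heard.
    return "<interject>" if "wrong" in chunk else "<silence>"


stream = run_micro_turns(["the sky", "is wrong green", "today"], toy_policy)
elapsed_ms = CHUNK_MS * sum(1 for kind, _ in stream if kind == "in")
print(stream[3])   # ('out', '<interject>')
print(elapsed_ms)  # 600
```

Because the policy is consulted on every step, staying silent is an explicit output rather than the absence of a turn. That is the property that lets a model of this kind choose to interject mid-utterance.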

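The real-time model faces another challenge, though. If it has to respond every 200 milliseconds, it cannot also spend minutes reasoning or searching the web. Thinking Machines solves this by pairing the interaction model with a second, asynchronous background model.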
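One way to picture this pairing is a fast loop that keeps ticking while a slower task runs concurrently. The sketch below uses Python's `asyncio` purely as an analogy. The names `background_task` and `interaction_loop` are hypothetical, the "reasoning" is simulated by yielding control a few times, and nothing here reflects how Thinking Machines actually couples the two models.

```python
import asyncio
from typing import List


async def background_task(query: str, steps: int) -> str:
    """Stand-in for a slow reasoning or tool-use job."""
    for _ in range(steps):
        await asyncio.sleep(0)  # yield control; simulates long-running work
    return f"result for {query!r}"


async def interaction_loop(ticks: int) -> List[str]:
    """Fast loop that never blocks on the background job."""
    log: List[str] = []
    task = asyncio.create_task(background_task("weather", steps=3))
    delivered = False
    for tick in range(ticks):
        if task.done() and not delivered:
            # Fold the finished result into the conversation.
            log.append(f"tick {tick}: {task.result()}")
            delivered = True
        else:
            log.append(f"tick {tick}: keep talking")
        await asyncio.sleep(0)  # hand control back, like waiting for the next chunk
    return log


log = asyncio.run(interaction_loop(6))
for line in log:
    print(line)
```

The design point of the sketch is that the fast loop only polls `task.done()` and never awaits the background job directly, so each tick completes regardless of how long the slow work takes.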

Original Text (English)

Thinking Machines Lab ships its first model and argues interactivity is what OpenAI gets wrong about voice

Maximilian Schreiner | May 12, 2026

Key Points

Thinking Machines Lab, founded by ex-OpenAI CTO Mira Murati, has released its first AI model that processes audio, video, and text in 200-millisecond chunks, replacing rigid turn-taking with fluid, real-time conversation. The model outperforms OpenAI's GPT-Realtime-2 and Google's Gemini Live on interaction quality and latency benchmarks, pairing a fast interaction model with a background reasoning model. Despite the technical promise, the startup still faces pressure, as several key employees have recently left the company.

Thinking Machines Lab has released a research preview of its first AI model, designed to break voice AI out of the traditional question-and-answer pattern. The model processes audio, video, and text in parallel 200-millisecond chunks, and the startup claims it beats OpenAI's GPT-Realtime-2 and Google's Gemini Live on interaction quality.

Thinking Machines Lab has published a research preview of what it calls Interaction Models, AI models that handle interaction natively rather than through external scaffolding. The core idea is that interactivity should scale alongside intelligence, not get treated as an afterthought.

Current voice AI systems still feel robotic

Today's real-time systems like GPT-Realtime or Gemini Live continuously take in audio, but the actual language model never sees it directly. According to Thinking Machines, a "harness" of separate components sits in front of the model, including things like a voice activity detector that decides when a speaker's turn is over. Only then does the finished utterance get handed to the model, which generates a complete response. While it's talking, its perception freezes, receiving no new information until it finishes or gets interrupted.

These components are far less intelligent than the model itself. That means behaviors that define real conversation simply don't work, according to Thinking Machines: proactively jumping in ("interrupt me if I say something wrong"), reacting to visual cues ("tell me when I've written a bug"), or speaking simultaneously, which would be useful for something like live translation. Citing Sutton's "Bitter Lesson," the lab argues that these hand-crafted systems will eventually be outpaced by the advance of general capabilities.

Thinking Machines' Interaction Models replace the harness with a model that processes the audio and video stream directly rather than receiving pre-segmented utterances. The approach resembles full-duplex models like Moshi or Nemotron VoiceChat, which work in a similarly interleaved fashion but are smaller-scale models focused on latency rather than intelligence benchmarks.

A 200-millisecond clock replaces artificial turn boundaries

The real break from existing architectures is what the team calls time-aligned micro-turns. The model continuously processes 200 milliseconds of input and generates 200 milliseconds of output, with both token streams running in an interleaved fashion. Input and output no longer happen sequentially. Instead, they share the same clock cycle.

This eliminates artificial turn boundaries, letting the model decide on its own whether to stay silent, interject, or speak alongside the user. Audio and images aren't preprocessed through large, standalone encoders but are fed directly into the transformer with minimal preprocessing. That saves latency, though it could also limit the model's ability to pick up fine visual details like text.

The real-time model has another challenge, though. If you need to respond every 200 milliseconds, you can't simultaneously spend minutes reasoning or searching the web. Thinking Machines solves this by pairing the interaction model with a second, asynchronous background model that handles longer tasks like reasoning, tool use, and research.

Both models share the same conversation context. The interaction model delegates tasks while keeping the conversation going, then weaves results from the background model into the conversation as they arrive, at a moment appropriate to what the user is currently doing rather than as an abrupt context switch. The goal is to combine the response speed of a fast model with the depth of a reasoning model.

Benchmarks suggest the approach works

The model is called TML-Interaction-Small, a 276-billion-parameter mixture-of-experts model with 12 billion active parameters. On FD-bench v1.5, which measures interaction quality across scenarios like user interruptions, backchanneling, and background speech, it significantly outperforms both OpenAI's GPT-Realtime-2 and Google's Gemini-3.1-flash-live. Response latency comes in at 0.40 seconds, compared to 1.18 seconds for GPT-Realtime-2 (minimum) and 0.57 seconds for Gemini.

On Audio MultiChallenge, which tracks intelligence and instruction following, the model scores 43.4 percent, above the fast variants of its competitors but below GPT-Realtime-2 in "xhigh" thinking mode, which hits 48.5 percent. On the lab's own benchmarks for time awareness (TimeSpeak, CueSpeak) and visual proactivity (RepCount-A, ProactiveVideoQA, Charades), Thinking Machines reports that no existing model can meaningfully perform any of these tasks. Tested competitors either stay silent or give incorrect answers.

A $2 billion startup with something to prove

Thinking Machines Lab was founded in February 2025 by Mira Murati and other former OpenAI researchers. In July 2025, the company closed a $2 billion seed round at a $12 billion valuation, all without a product. A follow-on round reportedly in the works at around $50 billion didn't come together by the end of 2025, and several key employees have since left the company.

The Interaction Model is the first in-house AI model backing Murati's claim that she can build a real competitor alongside OpenAI, Anthropic, and Google Deepmind. Before this, the company had released Tinker, a tool designed to let developers efficiently fine-tune open models using LoRAs without having to deal with distributed training.

Source: ThinkingMachines

Voice AI, Real-time interaction, OpenAI competitor, Mira Murati, AI architecture