Hacker News • 78 days ago

Research on Interaction Models for Real-Time Collaboration

IMP

8/10

Key Summary

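This post announces a research preview of interaction models: models that move past the limits of turn-based interfaces by processing audio, video, and text in real time and collaborating with people naturally. With a multi-stream, micro-turn design, the authors report state-of-the-art combined performance in intelligence and responsiveness. The work matters because it targets the "collaboration bottleneck" that arises when humans and AI try to communicate and exchange feedback in real time.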

Translated Text

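Today we are announcing a research preview of interaction models. These models handle interaction natively rather than relying on external scaffolding. We believe interactivity should scale alongside intelligence, and the way we work with AI should not be treated as an afterthought. Interaction models let people collaborate with AI the way they naturally collaborate with each other: the models continuously take in audio, video, and text, and think, respond, and act in real time.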

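We trained an interaction model from scratch. To ensure real-time responsiveness, we adopted a multi-stream, micro-turn design. Our research preview demonstrates qualitatively new interaction capabilities and achieves state-of-the-art combined performance in intelligence and responsiveness.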

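The collaboration bottleneck

AI labs often treat the ability for AI to work autonomously as the model’s most important capability (Kwa, T., West, B., Becker, J., et al. Measuring AI Ability to Complete Long Tasks. METR, 2025). As a result, today’s models and interfaces are not optimized for humans to remain in the loop. A recent frontier model card states: “Importantly, we find that when used in an interactive, synchronous, ‘hands-on-keyboard’ pattern, the benefits of the model were less clear. When used in this fashion, some users perceived [our model] as too slow and did not realize as much value. Autonomous, long-running agent harnesses better elicited the model’s coding capabilities.”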

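Autonomous interfaces are valuable, but in most real work users cannot fully specify their requirements upfront and walk away. Good results require a collaborative process in which the human stays in the loop, clarifying and giving feedback along the way. Yet humans are increasingly pushed out, not because the work does not need them, but because the interface has no room for them.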

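Instead, people are most effective when they can collaborate with AI the same way they do with other people: messaging, talking, listening, seeing, showing, and interjecting as needed. The model should be able to do the same. Communication gets better with: (a) Copresence: people can interact with what others are interacting with; (b) Contemporality: information is received in real time, with instant feedback, as it is produced; (c) Simultaneity: information is received and produced at the same time (Clark H. and Brennan S., “Grounding in Communication,” 1991; on the participatory nature of orality, see Ong, W. J., Orality and Literacy, 1982). Today’s computers and mediums of knowledge work have similar interactive properties.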

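To resolve this, we need to move beyond the current turn-based model interface. Most of today’s models experience reality in a single thread (this refers to commercial general-purpose frontier models; smaller-scale or specialized models such as Moshi, PersonaPlex, Nemotron VoiceChat, and GPT-Realtime-Translate are exceptions). Until the user finishes their input, the model waits with no perception of what the user is doing or how they are doing it. Until the model finishes generating, its perception freezes, and it receives no new information until it finishes or is interrupted. This creates a narrow channel for human-AI collaboration, leaving a person’s knowledge, context, and intuitive insight (“metis,” Scott, J. C., 1998; Hayek, F. A.) underused.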
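To make the cost of the single-threaded, turn-based pattern concrete, here is a toy Python sketch. It is not the authors' design: the post gives no implementation details of the micro-turn scheme, so the fixed 200 ms slice, the `Event` type, and both function names are assumptions made purely for illustration. The sketch compares when each user event becomes visible to a model that only perceives at the end of the user's turn with one that consumes input at the end of every short time slice.

```python
from dataclasses import dataclass


@dataclass
class Event:
    t_ms: int   # time at which the user event happens
    text: str   # what the user did or said


def turn_based_perceived_at(events, user_turn_end_ms):
    """Turn-based: nothing is perceived before the user's turn ends."""
    return {e.text: max(e.t_ms, user_turn_end_ms) for e in events}


def micro_turn_perceived_at(events, slice_ms=200):
    """Hypothetical micro-turn scheme: input is consumed at the end of
    the fixed-size time slice in which the event occurred."""
    return {e.text: (e.t_ms // slice_ms + 1) * slice_ms for e in events}


events = [
    Event(100, "opens file"),
    Event(950, "hesitates"),
    Event(2400, "says 'wait'"),
]

turn = turn_based_perceived_at(events, user_turn_end_ms=3000)
micro = micro_turn_perceived_at(events, slice_ms=200)

for e in events:
    # Delay between the event happening and the model perceiving it.
    print(f"{e.text}: turn-based {turn[e.text] - e.t_ms} ms, "
          f"micro-turn {micro[e.text] - e.t_ms} ms")
```

In this toy setup the turn-based delay grows with the length of the user's turn, while the sliced delay is bounded by the slice length, which is the property the text attributes to real-time interaction.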

Original Text (English)

Today, we’re announcing a research preview of interaction models: models that handle interaction natively rather than through external scaffolding. We think interactivity should scale alongside intelligence; the way we work with AI should not be treated as an afterthought. Interaction models let people collaborate with AI the way we naturally collaborate with each other. They continuously take in audio, video, and text, and think, respond, and act in real time.

We train an interaction model from scratch. To ensure real-time responsiveness, we adopt a multi-stream, micro-turn design. Our research preview demonstrates qualitatively new interaction capabilities, as well as state-of-the-art combined performance in intelligence and responsiveness.

The collaboration bottleneck

AI labs often treat the ability for AI to work autonomously as the model’s most important capability (Kwa, T., West, B., Becker, J., et al. Measuring AI Ability to Complete Long Tasks. METR, 2025). As a result, today’s models and interfaces aren’t optimized for humans to remain in the loop. A recent frontier model card states: “Importantly, we find that when used in an interactive, synchronous, ‘hands-on-keyboard’ pattern, the benefits of the model were less clear. When used in this fashion, some users perceived [our model] as too slow and did not realize as much value. Autonomous, long-running agent harnesses better elicited the model’s coding capabilities.”

Autonomous interfaces are valuable, but in most real work, users can’t fully specify their requirements upfront and walk away: good results benefit from a collaborative process where the human stays in the loop, clarifying and giving feedback along the way. However, humans increasingly get pushed out not because the work doesn’t need them, but because the interface has no room for them.
Instead, people are most effective when they can collaborate with AI the same way we do with other people: messaging, talking, listening, seeing, showing, and interjecting as needed, with the model doing the same. Communication gets better with: (a) Copresence: people can interact with what others are interacting with; (b) Contemporality: people receive information as it’s produced by others with instant feedback; (c) Simultaneity: people receive and produce information at the same time (Clark H. and Brennan S., “Grounding in Communication,” in Perspectives on Socially Shared Cognition, 1991; compare the evanescence of orality and its participatory, rather than objectively distanced, nature in Ong, W. J., Orality and Literacy: The Technologizing of the Word, 1982). Today’s computers and mediums of knowledge work have similar interactive properties.

In order to resolve this, we need to move beyond the current turn-based interface for the models. Today’s models experience reality in a single thread. (We are referring to commercial general-purpose frontier models; smaller-scale or specialized models like Moshi, PersonaPlex, Nemotron VoiceChat, or GPT-Realtime-Translate do exist.) Until the user finishes typing or speaking, the model waits with no perception of what the user is doing or how the user is doing it. Until the model finishes generating, its perception freezes, receiving no new information until it finishes or is interrupted. This creates a narrow channel for human-AI collaboration that limits how much of a person’s knowledge, intent, and judgement can reach the model, and how much of the model’s work can be understood. (On such knowledge: “Metis, with the premium it places on practical knowledge, experience, and stochastic reasoning…is the mode of reasoning most appropriate to complex material and social tasks where the uncertainties are so daunting that we must trust our (experienced) intuition and feel our way.” Scott, J. C., “Métis,” in Seeing Like a State: How Certain Schemes to Improve the Human Condition Have Failed, 1998. Also: “A little reflection will show that there is…a body of very important but unorganized knowledge…: the knowledge of the particular circumstances of time and place.” Hayek, F. A., “The Use of Knowledge in Society,” The American Economic Review, 1945.) Picture trying to resolve a crucial disagreement over email rather than in person.

At Thinking Machines, we believe we can solve this bandwidth bottleneck by making AI interactive in real time across any modality. This enables AI interfaces to meet humans where they are, rather than forcing humans to contort themselves to AI interfaces.

Most existing AI models bolt on interactivity with a harness: stitching components together to emulate interruptions, multimodality, or concurrency. Most real-time commercial speech systems use voice-activity-detection components to detect turn boundaries. However, “the bitter lesson” (Sutton R., The Bitter Lesson, 2019) suggests that these hand-crafted systems will be outpaced by the advance of general capabilities. For interactivity to scale with intelligence, it must be part of the model itself. With this approach, scaling a model makes it smarter and a better collaborator.

Capabilities

Having interactivity be part of the model unlocks a variety of capabilities that would otherwise need to be implemented in the harness.

Seamless dialog management. The model tracks implicitly whether the speaker is thinking, yielding, self-correcting, or inviting a response. There is no separate dialog management component.
Verbal and visual interjections. The model jumps in as needed depending on the context, not only when the user finishes speaking.
Simultaneous speech. The user and the model can speak concurrently (e.g. live translation).
Time-awareness. The model has a direct sense of elapsed time.
Simultaneous tool calls, search, and generative UI. While speaking and listening to the user, the model can concurrently search, browse the web, or generate UI, weaving results back into the conversation as needed.

In a longer real session, all of this happens continuously, creating an experience that feels more like collaborating and less like prompting.

Our approach

An interaction model is in constant two-way exchange with the user, perceiving and responding at the same time. Some domains take such interactivity as a given: the physical world demands that robotics and autonomous vehicles operate in real time. Audio full-duplex models (Moshi, PersonaPlex, Nemotron VoiceChat, Seeduplex) are another example where interaction is bidirectional and continuous. Applying the same principle, we set out to build an interaction model native to this regime, one that perceives and responds in the same continuous loop, across audio, video, and text.

The result is a system architected around two ideas: a time-aware interaction model that maintains real-time presence, and an asynchronous background model that handles sustained reasoning, tool use, and longer-horizon work.

System overview

The interaction model is in constant exchange with the user. When a task requires deeper reasoning than can be produced instantaneously, the interaction model delegates to a background model that runs asynchronously. This approach builds upon prior work like Qwen-omni, KAME, MoshiRAG. The interaction model remains present throughout (answering follow-ups, taking new input, holding the thread) and integrates background results into the conversation as they arrive. This split lets the user benefit from both responsiveness and the full extent of intelligence: the planning, tool-use, and agentic workflows of reasoning models at the response latency of non-thinking ones.
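The delegation pattern described here, a foreground loop that stays responsive while a slower background worker runs asynchronously, can be sketched with Python's `asyncio`. This is only an illustration under stated assumptions, not the system in the post: `background_model`, `interaction_loop`, the `research:` prefix convention, and the sleep durations are all hypothetical stand-ins, and real models are replaced by string formatting.

```python
import asyncio


async def background_model(task: str) -> str:
    """Hypothetical stand-in for slow reasoning or tool use."""
    await asyncio.sleep(0.05)  # simulated long-running work
    return f"result for {task!r}"


async def interaction_loop(user_inputs):
    """Toy foreground loop: replies to every input right away and
    merges background results into the transcript as they finish."""
    transcript = []
    pending = set()
    prefix = "research:"  # assumed convention for 'needs deep work'
    for text in user_inputs:
        # Integrate any background results that have arrived so far.
        for task in [t for t in pending if t.done()]:
            pending.discard(task)
            transcript.append(f"[background] {task.result()}")
        if text.startswith(prefix):
            # Delegate without blocking the conversation.
            job = text[len(prefix):].strip()
            pending.add(asyncio.create_task(background_model(job)))
            transcript.append("On it; I'll keep listening meanwhile.")
        else:
            transcript.append(f"[quick reply] {text}")
        await asyncio.sleep(0.02)  # simulated gap until the next input
    # Drain whatever is still running when the inputs end.
    for task in pending:
        transcript.append(f"[background] {await task}")
    return transcript


transcript = asyncio.run(interaction_loop([
    "research: train schedules",
    "what time is it?",
    "still there?",
    "thanks",
]))
for line in transcript:
    print(line)
```

The point of the design is that the foreground loop never awaits the background task directly while input is still arriving, so follow-up messages get immediate replies; exactly where the background result lands in the transcript depends on timing.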
Note that both the background and interaction models are intelligent: on its own, the interaction model is also competitive on both interactive and intelligence benchmarks.

The interaction model

Our starting point is continuous audio and video, modalities that are inherently real-time. Text can wait, but a live conversation cannot. By designing around the hardest case first, we arrive at an architecture that

Interaction models • Real-time AI • Human-in-the-loop • Agent interfaces • Multimodal