MarkTechPost • 65 days ago

StepFun Releases 'StepAudio 2.5 Realtime' With Roleplay-Specific RLHF

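Key Summary

StepFun, an AI lab based in Shanghai, China, has released StepAudio 2.5 Realtime, an end-to-end real-time speech large language model (LLM) that handles everything from audio input to audio output in a single system. The model applies million-scale persona data augmentation and roleplay-specific RLHF (Reinforcement Learning from Human Feedback) to prevent out-of-character (OOC) drift during a conversation and keep its persona stable. It also perceives paralinguistic cues such as the user's tone, emotion, and speaking rate, and generates emotionally matched responses. In StepFun's own evaluation, it ranked first on all five benchmark dimensions.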

Translated Text

StepFun, the Shanghai-based AI lab, has released StepAudio 2.5 Realtime, an end-to-end real-time speech large language model with fully customizable persona capabilities.

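StepAudio 2.5 Realtime is a voice model that operates in real time. Unlike pipeline-based systems, which split speech recognition, reasoning, and synthesis into sequential stages, it is an end-to-end model: audio goes in and audio comes out through a single unified system.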

The model supports Chinese and English and is accessed through a WebSocket API. The endpoint is wss://api.stepfun.com/v1/realtime, and the model string is step-2.5-realtime.
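As a rough illustration of how a client might prepare this connection, the sketch below builds the URL and headers for a WebSocket handshake. Only the endpoint and the model string come from the article. Passing the model as a `model` query parameter and authenticating with a Bearer token are assumptions made for the example, and `build_realtime_request` is a hypothetical helper, so the official API documentation should be checked before relying on either.

```python
from urllib.parse import urlencode, urlsplit

# Endpoint and model string as given in the article.
REALTIME_ENDPOINT = "wss://api.stepfun.com/v1/realtime"
MODEL = "step-2.5-realtime"


def build_realtime_request(api_key: str, model: str = MODEL) -> tuple[str, dict[str, str]]:
    """Return (url, headers) for opening the WebSocket session.

    ASSUMPTIONS (not confirmed by the article): the model is selected with a
    `model` query parameter and the API key is sent as a Bearer token.
    """
    if not api_key:
        raise ValueError("api_key must be non-empty")
    url = f"{REALTIME_ENDPOINT}?{urlencode({'model': model})}"
    headers = {"Authorization": f"Bearer {api_key}"}
    return url, headers


# Placeholder key for illustration only; no network call is made here.
url, headers = build_realtime_request("sk-placeholder")
print(urlsplit(url).scheme, url)
```

The returned pair could then be handed to any WebSocket client library to open the session. The message format exchanged after the handshake is not described in the article and is left out here.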

The Three Technical Pillars

StepFun's research team describes three core architectural innovations behind the model:

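1. Million-Scale Persona Data Augmentation. Starting from more than 10,000 high-quality, natively authored personas, StepFun applied algorithmic augmentation to build a million-scale persona feature matrix, which was combined with millions of real-world conversational samples for training. The goal is generalization, specifically stable performance on difficult, long-tail conversational topics. Rather than manually labeling millions of persona samples, the team expanded a curated seed set algorithmically.
2. Roleplay-Specific RLHF Alignment. A known failure mode in conversational AI is out-of-character (OOC) behavior, where a model drifts away from its defined persona mid-conversation. The StepFun team ran dedicated RLHF (Reinforcement Learning from Human Feedback) optimization aimed specifically at persona consistency in roleplay scenarios. RLHF is a training technique in which human preference signals are used to train a reward model, which then guides the language model's behavior. Applying it specifically to roleplay stability is a targeted design choice.
3. Unified Speech Understanding and Generation. StepAudio 2.5 Realtime inherits the capabilities of StepAudio 2.5 TTS and deeply fuses speech understanding and generation through reinforcement learning. This enables what StepFun calls "global scene-level tonal setting" and "intra-sentence detail sculpting": the model can set an overall emotional register for a response while adjusting finer acoustic details within individual sentences.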
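To make the reward-model step of RLHF more concrete, here is a minimal sketch of the pairwise (Bradley-Terry) preference loss commonly used to train reward models. The article does not say which loss StepFun used, so this is a generic illustration rather than their method, and the reward values are made-up numbers standing in for an in-character reply (preferred) and an out-of-character reply (rejected).

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of the
    # margin so that exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(pairs) -> float:
    """Average loss over (reward_chosen, reward_rejected) pairs."""
    return sum(preference_loss(c, r) for c, r in pairs) / len(pairs)


# Hypothetical reward-model scores: (in-character reply, OOC reply).
pairs = [(2.0, 0.0), (0.5, 0.4), (-1.0, 1.0)]
for chosen, rejected in pairs:
    print(chosen, rejected, round(preference_loss(chosen, rejected), 4))
print("mean:", round(mean_preference_loss(pairs), 4))
```

The loss is small when the reward model already scores the preferred reply higher and grows when it ranks the pair the wrong way around, which is the signal that pushes the reward model toward human preferences.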

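Paralinguistic Understanding

A technically distinct area of this model is paralinguistic perception. Paralinguistics refers to non-verbal acoustic information in speech, such as tone, speaking rate, pauses, sighs, and laughter. By analyzing these elements, the model can perceive the user's mood and underlying intentions; for example, it can identify fatigue from a low tone or frustration from a rapid speech rate. Capturing these signals requires the model to operate on features of the audio itself rather than on transcribed text alone. StepAudio 2.5 Realtime scored 82.18 on the paralinguistic comprehension benchmark, demonstrating perception of speaking rate, emotion, age, and other acoustic features.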
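The point that such cues live in the audio rather than the transcript can be illustrated with a toy example. The sketch below computes one hand-crafted acoustic feature, the fraction of low-energy (pause) frames, from a synthetic waveform. This is not how StepAudio 2.5 Realtime works internally, which the article does not describe; the frame size, silence threshold, and signal are arbitrary choices made for the illustration.

```python
import math


def frame_rms(samples, frame_size):
    """Root-mean-square energy of consecutive non-overlapping frames."""
    frames = [samples[i:i + frame_size]
              for i in range(0, len(samples) - frame_size + 1, frame_size)]
    return [math.sqrt(sum(x * x for x in f) / frame_size) for f in frames]


def pause_ratio(samples, frame_size=160, silence_threshold=0.01):
    """Fraction of frames whose RMS energy falls below the silence threshold."""
    energies = frame_rms(samples, frame_size)
    if not energies:
        raise ValueError("need at least one full frame")
    silent = sum(1 for e in energies if e < silence_threshold)
    return silent / len(energies)


# Synthetic 1-second signal at 16 kHz: 0.5 s of a 200 Hz tone, then 0.5 s of silence.
SAMPLE_RATE = 16000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(8000)]
signal = tone + [0.0] * 8000
print(pause_ratio(signal))  # prints 0.5
```

A transcript of this signal would carry no trace of the half second of silence, whereas even this crude feature recovers it directly from the waveform.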

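Benchmark Results

StepFun's research team ran a comprehensive suite of subjective and objective evaluations, benchmarking StepAudio 2.5 Realtime against leading real-time voice models across five dimensions. The human evaluation was conducted through real mobile app conversations scored by human raters. The scores: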

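Human evaluation (subjective): 80.41
General dialogue (objective): 86.36
Automotive scenario (objective): 84.80
Spoken QA, covering 11 audio understanding tasks (objective): 79.80
Paralinguistic comprehension (objective): 82.18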

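Key Takeaways

StepAudio 2.5 Realtime is an end-to-end real-time speech LLM released by Shanghai-based StepFun.
It uses persona-specific RLHF and million-scale data augmentation to maintain stable character consistency.
The model ranked first across all five benchmark dimensions, tested in April 2026.
Paralinguistic comprehension, meaning the perception of tone, rate, and emotion from audio, is a core technical differentiator.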

Original Text (English)

StepFun, the Shanghai-based AI lab, released StepAudio 2.5 Realtime. It is an end-to-end real-time speech large language model with fully customizable persona capabilities.

StepAudio 2.5 Realtime is a voice model that operates in real time. Unlike pipeline-based systems that separate speech recognition, reasoning, and synthesis into sequential steps, this is an end-to-end model. Audio goes in and audio comes out through a single unified system.

The model supports Chinese and English. It connects via a WebSocket API. The endpoint is wss://api.stepfun.com/v1/realtime using the model string step-2.5-realtime.

The Three Technical Pillars

StepFun research team describes three core architectural innovations behind the model:

1. Million-Scale Persona Data Augmentation. Starting from 10,000+ high-quality natively authored personas, StepFun applied algorithmic augmentation to build a million-scale persona feature matrix. This was combined with millions of real-world conversational samples for training. The intent is generalization, specifically stable performance on difficult, long-tail conversational topics. Instead of manually labeling millions of persona samples, StepFun team used algorithmic expansion from a curated seed set.
2. Roleplay-Specific RLHF Alignment. A known failure mode in conversational AI is "out-of-character" (OOC) behavior, when a model drifts away from its defined persona mid-conversation. StepFun team conducted dedicated RLHF (Reinforcement Learning from Human Feedback) optimization specifically for persona consistency in roleplay scenarios. RLHF is a training technique where human preference signals are used to train a reward model, which then guides language model behavior. Applying it specifically to roleplay stability is a targeted design choice.
3. Unified Speech Understanding and Generation. StepAudio 2.5 Realtime inherits the StepAudio 2.5 TTS capabilities and deeply fuses speech understanding and generation through reinforcement learning. This enables what StepFun calls "global scene-level tonal setting" and "intra-sentence detail sculpting." The model can set an overall emotional register for a response while adjusting finer acoustic details within individual sentences.

Paralinguistic Understanding

A technically distinct area of this model is paralinguistic perception. Paralinguistics refers to non-verbal acoustic information in speech: things like tone, speaking rate, pauses, sighs, and laughter. By analyzing these elements, the model can perceive the user's mood and underlying intentions. For example, it can identify fatigue from a low tone or frustration from a rapid speech rate. Capturing these signals requires the model to operate on audio features rather than transcribed text alone. StepAudio 2.5 Realtime scored 82.18 on the paralinguistic comprehension benchmark, demonstrating perception of vocal speed, emotion, age, and other acoustic features.

Benchmark Results

StepFun research team conducted a comprehensive suite of subjective and objective evaluations, benchmarking StepAudio 2.5 Realtime against leading real-time voice models across five dimensions. Human evaluation is conducted through real mobile app conversations scored by human raters. The scores:

Human evaluation (subjective): 80.41
General dialogue (objective): 86.36
Automotive scenario (objective): 84.80
Spoken QA, covering 11 audio understanding tasks (objective): 79.80
Paralinguistic comprehension (objective): 82.18

Key Takeaways

StepAudio 2.5 Realtime is an end-to-end real-time speech LLM, released by Shanghai-based StepFun.
It uses persona-specific RLHF and million-scale data augmentation to maintain stable character consistency.
The model ranked first across all five benchmark dimensions, tested in April 2026.
Paralinguistic comprehension (perceiving tone, rate, emotion from audio) is a core technical differentiator.
API access is via WebSocket at wss://api.stepfun.com/v1/realtime with model string step-2.5-realtime.
