The Decoder • 115 days ago

Alibaba's Qwen team unveils a new algorithm that deepens AI reasoning

IMP

8/10

Key Summary

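Alibaba's Qwen team has announced FIPO, a new training algorithm that addresses a limitation of existing reinforcement learning by distributing reward unevenly according to each token's influence. With it, the model's reasoning chains grew to more than twice their previous length (from roughly 4,000 to over 10,000 tokens), the ability to verify its own intermediate results emerged without being trained for explicitly, and accuracy on the AIME 2024 math benchmark rose from 50 to 56 percent. The algorithm matches existing PPO-based approaches without a separate value model, and the team plans to release it as open source.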

Translated Text

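Alibaba's Qwen team has developed a new training algorithm for reasoning models that, instead of treating all tokens equally, weights each token by how much that step influences the reasoning that follows. The approach produced noticeably longer reasoning chains, and the model learned to verify intermediate results independently and to cross-check alternative solutions. This behavior emerged from the weighted reward signal rather than from explicit supervision. So far the algorithm has been validated only on math tasks, and whether it generalizes to other domains is unknown. The team plans to release the training system as open source.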

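Reinforcement learning runs into a limit with reasoning models because every token receives the same reward. The new algorithm from Alibaba's Qwen team addresses this by weighting each step according to how much it affects what comes next, and in doing so it doubles the length of the model's thought process.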

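When a large language model learns to reason through reinforcement learning, it typically receives a simple pass/fail verdict at the end of each generated answer. That reward is then spread evenly across every token in the sequence, whether the token is a key logical turning point or just a comma. The Qwen team says this blunt credit assignment is a major reason why reasoning models hit a ceiling with common training methods such as GRPO (Group Relative Policy Optimization): reasoning chains grow to a certain length and then plateau.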
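The uniform credit assignment described above can be sketched in a few lines. This is a minimal illustration, not code from the Qwen team: the function names, the toy rewards, and the group-normalization form (reward minus group mean, divided by group standard deviation, as commonly described for GRPO) are assumptions made for the example.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize outcome rewards within a group of sampled answers.

    Assumed form: (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_tokens(advantage, tokens):
    """Give every token of one answer the same advantage (uniform credit)."""
    return [advantage for _ in tokens]

# Hypothetical pass/fail verdicts for four sampled answers to one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Hypothetical tokens of the first answer: the comma and the final
# digit end up with exactly the same credit.
tokens = ["So", ",", "x", "=", "4"]
per_token = broadcast_to_tokens(advantages[0], tokens)
```

In this sketch the comma and the token that carries the result are indistinguishable to the optimizer, which is the bluntness the article refers to.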

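With Future-KL Influenced Policy Optimization (FIPO), the team aims to break through this bottleneck. Instead of scoring each token in isolation, the algorithm looks ahead: it examines how the model's behavior changes downstream after this particular token has been generated. FIPO computes the cumulative probability shift across all following tokens and uses that signal to distribute reward more precisely. Tokens that start a productive reasoning chain receive a larger share. Tokens that lead the model into a dead end receive a smaller one.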
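The article does not give FIPO's exact formula, so the following is only a rough sketch of the look-ahead idea under stated assumptions: the "probability shift" is taken to be the per-token difference in log-probability between the updated and the previous policy, the cumulative signal is a suffix sum that includes the current token, and the signal is turned into a bounded multiplicative weight. The function names, the clipping bounds, and the toy numbers are all hypothetical.

```python
import math

def future_shift(logp_new, logp_old):
    """Suffix sums of per-token log-probability shifts (assumed signal).

    future[t] = sum over k >= t of (logp_new[k] - logp_old[k]).
    """
    deltas = [n - o for n, o in zip(logp_new, logp_old)]
    future = [0.0] * len(deltas)
    running = 0.0
    for t in range(len(deltas) - 1, -1, -1):  # walk backwards from the end
        running += deltas[t]
        future[t] = running
    return future

def influence_weights(future, low=0.5, high=2.0):
    """Map the future shift to a bounded weight (hypothetical mapping)."""
    return [min(high, max(low, math.exp(f))) for f in future]

# Hypothetical log-probabilities of the sampled tokens under both policies.
logp_old = [-1.0, -2.0, -0.5, -1.5]
logp_new = [-0.8, -1.9, -0.5, -1.6]

future = future_shift(logp_new, logp_old)
weights = influence_weights(future)

# One sequence-level advantage is rescaled per token instead of being
# copied unchanged to every position.
advantage = 1.0
token_advantages = [advantage * w for w in weights]
```

With these toy numbers the first token, whose continuation became more likely overall, receives a larger share than the last token, whose probability dropped.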

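FIPO matches PPO-based methods without a separate model

Previous attempts to fix the flat-reward problem mostly relied on PPO-based methods, which use a separate value model to estimate a benefit score for each token. That auxiliary model typically has to be pre-trained on long chain-of-thought (CoT) data, which means outside knowledge leaks in. The researchers say this makes it hard to tell whether performance gains come from the algorithm itself or are simply inherited from the pre-trained helper. FIPO skips the auxiliary model entirely and still delivers comparable results.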

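To keep training stable, FIPO builds in several safeguards. A discount factor gives nearby tokens more weight than distant ones, since the downstream influence of distant tokens is harder to predict anyway. The algorithm also filters out tokens on which the model has drifted too far between training steps. According to the researchers, without this filter training became severely unstable: it went off the rails and response lengths collapsed.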
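The two safeguards can be illustrated with a discounted suffix sum that skips tokens whose shift is too large. This is a sketch under assumptions, not the paper's implementation: `gamma`, `max_abs_shift`, the choice to zero out filtered tokens, and the toy values are hypothetical.

```python
def discounted_future_shift(deltas, gamma=0.9, max_abs_shift=1.0):
    """Discounted suffix sum over per-token log-probability shifts.

    future[t] = kept[t] + gamma * future[t + 1], so a token k steps
    ahead contributes with weight gamma ** k. Tokens whose shift exceeds
    max_abs_shift are treated as drifted and contribute nothing.
    """
    future = [0.0] * len(deltas)
    running = 0.0
    for t in range(len(deltas) - 1, -1, -1):
        kept = deltas[t] if abs(deltas[t]) <= max_abs_shift else 0.0
        running = kept + gamma * running
        future[t] = running
    return future

# Hypothetical shifts; the third token drifted far and is filtered out.
deltas = [0.2, 0.1, 3.0, -0.1]
future = discounted_future_shift(deltas)
```

Without the filter, the single large shift at the third position would dominate the signal for every earlier token, which is one plausible reading of why the researchers saw instabilities.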

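Thought processes double in length while accuracy climbs

The team tested FIPO on Qwen2.5-32B-Base, a model with no prior exposure to synthetic long-CoT data. To keep the comparison fair, they trained it only on the public dataset from DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization), a popular open-source GRPO training variant.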

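The results are clear. While DAPO's average chain-of-thought length plateaus at around 4,000 tokens, FIPO exceeds 10,000. On the AIME 2024 math benchmark, accuracy rises from 50 to 56 percent, with a peak of 58 percent. That puts FIPO ahead of Deepseek-R1-Zero-Math-32B at roughly 47 percent and OpenAI's o1-mini at roughly 56 percent.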

Original Text (English)

Alibaba's Qwen team makes AI models think deeper with new algorithm

Jonathan Kemper
Apr 5, 2026

Key Points

Alibaba's Qwen team has developed a new training algorithm for reasoning models that assigns different weights to individual tokens based on how much each step influences the subsequent chain of reasoning, rather than treating all tokens equally.

The approach led to noticeably longer reasoning chains, with the model learning to independently verify its intermediate results and cross-check alternative solutions, a behavior that emerged naturally from the weighted reward signal.

So far, the algorithm has only been validated on mathematical tasks, leaving open whether it generalizes to other domains. The team plans to release the training system as open source.

Reinforcement learning hits a wall with reasoning models because every token gets the same reward. A new algorithm from Alibaba's Qwen team fixes this by weighting each step based on how much it shapes what comes next, doubling the length of thought processes in the process.

When a large language model learns to reason through reinforcement learning, it typically gets a simple pass/fail judgment at the end of each generated answer. That reward then gets spread evenly across every single token in the sequence. It doesn't matter whether a token marks the key logical turning point or is just a comma. The Qwen team says this blunt credit assignment is a major reason why reasoning models hit a ceiling with common training methods like GRPO (Group Relative Policy Optimization). The reasoning chains grow to a certain length and then flatline.

With Future-KL Influenced Policy Optimization (FIPO), the team wants to break through that bottleneck. Instead of scoring each token on its own, the algorithm looks ahead: How does the model's behavior change downstream after generating this particular token? FIPO calculates the cumulative probability shift across all following tokens and uses that signal to hand out rewards more precisely. Tokens that kick off a productive reasoning chain get a bigger share. Tokens that send the model down a dead end get less.

FIPO matches PPO-based methods without a separate model

Previous attempts to fix the flat reward problem mostly relied on PPO-based methods that use a separate value model to estimate a benefit score for each token. That auxiliary model typically needs pre-training on long chain-of-thought data, which means outside knowledge leaks in. The researchers say this makes it tough to tell whether the performance gains come from the algorithm itself or are just inherited from the pre-trained helper. FIPO skips the auxiliary model entirely and still delivers comparable results.

To keep training stable, FIPO builds in several guardrails. A discount factor makes sure nearby tokens carry more weight than distant ones, since their downstream influence is harder to predict anyway. The algorithm also filters out tokens where the model has drifted too far between training steps. Without this filter, the researchers saw severe instabilities: training went off the rails and response lengths cratered.

Thought processes double in length while accuracy climbs

The team tested FIPO on Qwen2.5-32B-Base, a model with zero prior exposure to synthetic long-CoT data. They trained it exclusively on the public dataset from DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization), a popular open-source GRPO training variant, to keep the comparison fair.

The results are clear-cut. While DAPO's average chain-of-thought length stalls around 4,000 tokens, FIPO pushes past 10,000. On the AIME 2024 math benchmark, accuracy jumps from 50 to 56 percent, peaking at 58 percent. That puts FIPO ahead of both Deepseek-R1-Zero-Math-32B at roughly 47 percent and OpenAI's o1-mini at around 56 percent. On the tougher AIME 2025, scores climb from 38 to 43 percent.

The researchers note it's not just a handful of outliers getting longer. The entire distribution of answer lengths shifts upward, from the shortest to the longest responses. That suggests a fundamental change in how the model approaches problems.

The model starts fact-checking itself

The paper lays out four phases the model moves through during training. Early on, it churns out shallow planning templates, basically outlines with no real math that end in a hallucinated answer. In the second phase, where DAPO-trained models stay for the rest of training, the model runs a clean linear reasoning chain and stops at the first answer it finds.

In phase three, the model starts spontaneously double-checking its own intermediate results. It reaches an answer but then pivots to a different approach, switching from algebraic manipulation to geometric interpretation, for example, to verify. By phase four, the model runs systematic multi-pass verification, recalculating large square numbers step by step and working through the full derivation multiple times.

The paper notes this behavior looks a lot like the inference-time scaling strategies in OpenAI's o-series and Deepseek-R1, but FIPO pulls it off through reinforcement learning alone, with no long-CoT synthetic data.

Still early days

FIPO was benchmarked only on math problems, trained on a single dataset, and tested only on base models without long-CoT pre-training. The longer sequences also ramp up compute costs. So there's still a lot of testing that needs to be done, according to the team. Furthermore, whether these gains carry over to other domains like code or symbolic logic is still an open question.

There's also a performance gap compared to distilling from larger teacher models. Pure reinforcement learning teaches a model less than direct instruction from a stronger one. The team says they plan to open-source the training system along with all configurations.

Source: Arxiv

Reinforcement learning, reasoning models, Alibaba Qwen, algorithm, open source

Alibaba's Qwen team tackles multi-step reasoning errors in vision AI

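Vision language models (VLMs) have a problem with multi-step reasoning over images: small perception errors early on accumulate until the final result is completely wrong. To prevent this error accumulation, Alibaba's Qwen team and Tsinghua University developed the 'HopChain' framework, which forces the model to re-examine the image in detail at every step. Reinforcement learning built on this framework improved performance on 20 of 24 benchmarks, a substantial gain in visual reasoning ability.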

Vision language models, multimodal AI, reasoning errors