The Decoder • 65 days ago

ByteDance study: for long-document training, asking questions works better than transcribing text

IMP

7/10

Key summary

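Researchers at ByteDance and HKUST found that, when training multimodal AI models to handle long documents, question-answer (QA) pairs are far more effective than simply having the model recognize and transcribe text. A small model trained this way (MMProLong) stays stable on contexts of more than 500,000 tokens and outperforms existing open-source models with far more parameters. The study suggests that what matters for navigating long documents is building flexible retrieval skills through information-extraction tasks.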

Translated text

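Multimodal AI models are expected to handle ever-longer documents, but how they are trained to do so usually remains a trade secret. According to a new study, using character recognition (OCR) as a training task actually degrades model performance, while question-answer (QA) pairs produce far better results.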

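Researchers from ByteDance Seed and the Hong Kong University of Science and Technology (HKUST) studied how image-language models can be trained efficiently on long documents. The result is MMProLong, a model built on Alibaba's open-source Qwen2.5-VL that outperforms much larger competitors.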

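Modern multimodal AI models need to handle increasingly long inputs: entire PDF collections of rendered pages, hours of video, or agents that remember their tasks across many steps. AI labs such as OpenAI, Google, and Alibaba advertise context windows of up to 1 million tokens, enough to hold not only text but thousands of page images or video frames. According to the authors, however, technical reports reveal almost nothing concrete about what data a model should be trained on and in what proportions.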

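Asking questions teaches more than transcribing text

At first glance, the study's central finding may seem obvious. If a multimodal model is to learn how to find the right spot in a 100-page document, having it transcribe the text of every page barely helps. Asking questions whose answers are hidden somewhere in those pages is far more effective.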

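The researchers compared the two approaches directly. In one setup, the model had to perform text recognition either on every page of a document or on a few selected pages, with the remaining pages left in the context as distractors. In the other setup, the researchers used a separate model (ByteDance's Seed 2.0) to generate question-answer pairs for individual sections of a document. Each question then went into training together with the entire document, forcing the model to locate the relevant passage within a long context.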
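The two training setups described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `QAPair` and `TrainingExample` types, both builder functions, and the file names are hypothetical, and pages are represented as plain string identifiers rather than images.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QAPair:
    """Hypothetical container for a generated question-answer pair."""
    question: str
    answer: str
    source_page: int  # index of the page the pair was generated from


@dataclass
class TrainingExample:
    """Hypothetical training record: full document context, prompt, target."""
    pages: List[str]
    prompt: str
    target: str


def build_qa_example(pages: List[str], qa: QAPair) -> TrainingExample:
    """QA setup: the question is paired with the entire document,
    so the model has to locate the relevant passage itself."""
    if not 0 <= qa.source_page < len(pages):
        raise ValueError("source_page must index into pages")
    return TrainingExample(pages=list(pages), prompt=qa.question, target=qa.answer)


def build_ocr_example(
    pages: List[str], page_texts: List[str], selected: List[int]
) -> TrainingExample:
    """Transcription setup: the target is the text of the selected pages,
    while the unselected pages stay in context as distractors."""
    prompt = "Transcribe pages: " + ", ".join(str(i + 1) for i in selected)
    target = "\n".join(page_texts[i] for i in selected)
    return TrainingExample(pages=list(pages), prompt=prompt, target=target)


# Illustrative usage with made-up data.
pages = [f"page_{i}.png" for i in range(100)]
page_texts = [f"text of page {i}" for i in range(100)]
qa_example = build_qa_example(pages, QAPair("Who signed the contract?", "A. Example", 57))
ocr_example = build_ocr_example(pages, page_texts, [0, 57])
```

In both cases the full 100-page context is kept. Only the prompt and target differ, which is the variable the comparison isolates.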

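Pure text recognition as a training task actually made performance worse than the starting point (the original model). Question-answer training, by contrast, brought clear gains. The model only learns to navigate long texts when it has to filter and sort information with a specific goal in mind.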

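Diversity beats specialization

The experiments produced three further findings. First, training mainly on very long documents at the upper end of the context window is not effective. A broad mix of shorter and longer examples works more reliably, because long-context ability is not a skill tied to a specific length but requires flexible searching across different distances. Second, the real bottleneck is finding the relevant passage, not reasoning about it. A mix weighted toward information-extraction tasks with a smaller share of calculation tasks delivered the best results.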

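The third finding is surprising because it contradicts common practice with text-only language models. Adding short training examples does not appear to be strictly necessary. The model largely kept its short-task abilities even though it was trained only on long question-answer data. The data format itself probably explains this: even when the context is very long, the task is still framed as a question-answer interaction in the familiar instruction-following format.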
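The data-mix findings can be illustrated with a small sampling sketch. The article reports no exact proportions, so the length buckets and the 80/20 task split below are assumptions chosen purely for illustration, and `sample_training_plan` is a hypothetical helper rather than anything from the study.

```python
import random
from collections import Counter

# Assumed, illustrative values: the article gives no exact mix.
# Lengths are spread out instead of clustering at the 128,000-token maximum.
LENGTH_BUCKETS = [16_000, 32_000, 64_000, 128_000]
# Weighted toward extraction, with a smaller share of calculation tasks.
TASK_WEIGHTS = {"extraction": 0.8, "calculation": 0.2}


def sample_training_plan(n, seed=0):
    """Draw n (context_length, task_type) pairs from the assumed mix."""
    rng = random.Random(seed)
    tasks = list(TASK_WEIGHTS)
    weights = [TASK_WEIGHTS[t] for t in tasks]
    plan = []
    for _ in range(n):
        length = rng.choice(LENGTH_BUCKETS)  # uniform over buckets: a broad length mix
        task = rng.choices(tasks, weights=weights, k=1)[0]
        plan.append((length, task))
    return plan


plan = sample_training_plan(1000)
task_counts = Counter(task for _, task in plan)
length_counts = Counter(length for length, _ in plan)
```

Note that the plan contains only long-context QA-style tasks and no separate pool of short instruction examples, in line with the third finding.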

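A small model that stays stable up to 512,000 tokens

With this recipe and a fairly modest training budget, MMProLong outperformed several much larger open-source models such as InternVL3-38B and Gemma3-27B. The model was trained with contexts of only 128,000 tokens, yet it stays stable at input lengths of 256,000 and even 512,000 tokens, a range in which the original model degrades sharply. This ability also transfers to tasks the model was never specifically trained on, such as long-video understanding. In an additional transfer experiment, the recipe also proved effective on the stronger Qwen3-VL-8B model.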

Original text (English)

ByteDance study finds that asking LMMs questions beats making it transcribe text for long document training

Jonathan Kemper, May 24, 2026

Multimodal AI models are supposed to handle ever-longer documents, but how they're trained to do so usually stays a trade secret. A new study shows that character recognition as a training task actually hurts performance and that question-answer pairs work far better.

Researchers from ByteDance Seed and the Hong Kong University of Science and Technology (HKUST) studied how image-language models can be trained efficiently on long documents. The result is a model called MMProLong, built on Alibaba's open Qwen2.5-VL, that beats much larger competitors.

Modern multimodal AI models need to handle increasingly long inputs: entire PDF collections of rendered pages, hours of video, or agents that remember their tasks across many steps. AI labs like OpenAI, Google, and Alibaba tout context windows of up to 1 million tokens, capable of holding not just text but thousands of page images or video frames. But according to the authors, technical reports barely reveal what data a model should see and in what mix.

Asking questions teaches more than transcribing text

At first glance, the study's central finding seems obvious. For a multimodal model to learn to find the right spot in a 100-page document, having it transcribe the text of every page barely helps. It's more effective to ask questions whose answers are buried somewhere in those pages.

The researchers tested both approaches head-to-head. In one setup, the model had to perform text recognition either across all pages of a document or for a few selected pages, while the remaining pages stayed in context as distractions. In the other setup, the researchers used a separate model (Seed 2.0 from ByteDance) to generate question-answer pairs for individual sections of a document. The question then went into training alongside the entire document, forcing the model to locate the relevant passage within a long context.

Pure text recognition as a training task actually worsened performance compared to the starting point. Question-answer training, on the other hand, brought clear gains. The model only learns to navigate long texts when it has to filter out and categorize information with a specific goal.

Diversity beats specialization

Three more findings turned up in the experiments. Feeding the model mainly very long documents at the top end of the context window isn't worth it. A broader mix of shorter and longer examples works more reliably. Long-context ability isn't a skill tied to a specific length but requires flexible searching across different distances.

The real bottleneck also turns out to be finding the relevant passage, not reasoning about it. A mix weighted toward extraction tasks with a smaller share of calculation tasks delivered the best results.

The third finding is surprising because it contradicts common practice with text-only language models. Adding short training examples doesn't appear strictly necessary. The model largely kept its short-task abilities even when trained only on long question-answer data. The format of the data itself probably helps: even when the context is very long, the task is still framed as a question-answer interaction in the familiar instruction-following format.

Small but stable up to 512,000 tokens

With this recipe and a fairly modest training budget, MMProLong beats several much larger open models like InternVL3-38B and Gemma3-27B. The model was trained on only 128,000 tokens but stays stable at 256,000 and even 512,000 token input lengths, while the original model falls apart sharply at those ranges. This ability also transfers to tasks the model was never specifically trained on, like understanding long videos.

In an extra transfer experiment, the recipe proved effective on the stronger Qwen3-VL-8B too, even though that model is already built for long contexts.

The study is also interesting because it comes from an entirely different camp than Deepseek's widely discussed work on the same problem. Deepseek tries to extend the long memory of AI models by processing texts as images and compressing them heavily, most recently with an encoder that re-sorts visual information by content. ByteDance Seed takes the opposite approach: optimize the training data instead of the architecture.

Multimodal AI, long context, model training, ByteDance, question answering