The Decoder • 109 days ago

AI models tend to guess blindly instead of asking for help

IMP

8/10

Key summary

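According to a new benchmark, multimodal language models hallucinate or refuse to respond when visual information is missing, rather than asking the user for help. To address this, the researchers applied a reinforcement learning method (GRPO) so that models ask for help only when they genuinely need it, and the trained models outperformed all 22 previously tested models.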

Translated text

Researchers find AI models would rather guess than ask for help

ProactiveBench tests whether multimodal language models ask users for help when visual information is missing. Of the 22 models tested, almost none asked for help, but a simple reinforcement learning approach points toward a fix.

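Asked to identify an object that is blocked from view, a person will ask for the obstruction to be moved. Multimodal language models do not work that way. They either hallucinate, making up a wrong answer, or simply refuse to respond. The new ProactiveBench benchmark puts this problem under the microscope, systematically testing whether today's AI models can recognize when they need help and actually ask for it.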

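The benchmark draws on seven existing datasets and converts them into test scenarios that cannot be solved without input from the user. Models have to identify occluded objects, clean up noisy images, interpret rough sketches, or request a different camera angle. In total, ProactiveBench contains more than 108,000 images across 18,000 samples. A built-in filter removes every task a model can get right on the first try. To pass, a model has to proactively ask for more information.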
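The filtering step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the benchmark's actual code: the `Sample` type, its field names, and the `answers_first_try` callback are hypothetical, and the article does not say which model or models the real filter relies on.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    """Hypothetical benchmark sample: an initial (degraded) view and its label."""
    image_id: str
    label: str


def keep_proactive_only(
    samples: List[Sample],
    answers_first_try: Callable[[Sample], str],
) -> List[Sample]:
    """Drop every sample that is already answered correctly on the first attempt.

    What remains are samples where the initial view alone was not enough,
    so a model would have to ask for more information.
    """
    return [s for s in samples if answers_first_try(s) != s.label]


def toy_model(sample: Sample) -> str:
    """Toy stand-in for a model: it only recognises the unobstructed cat image."""
    return "cat" if sample.image_id == "clear_cat" else "unknown"


samples = [Sample("clear_cat", "cat"), Sample("occluded_dog", "dog")]
filtered = keep_proactive_only(samples, toy_model)
print([s.image_id for s in filtered])
```

In this toy run only the occluded sample survives, because the clear one is solved without any help.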

Larger models don't ask better questions

The researchers tested 22 multimodal language models, including LLaVA-OV, Qwen2.5-VL, InternVL3, GPT-4.1, GPT-5.2, and o4-mini. In the reference setting, where objects are clearly visible, the models solved 79.8% of tasks on average. On ProactiveBench, that figure dropped by more than 60%.

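The ROD dataset shows the starkest result. When objects are hidden behind blocks, accuracy plunges from 98.3% in the reference setting to just 8.2%. The models identify objects well when they are clearly visible, but it never occurs to them to ask someone to move the obstruction.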

Model size did not help either. InternVL3-1B actually outperformed InternVL3-8B, 27.1% to 12.7%. The older LLaVA-1.5-7B beat the much newer LLaVA-OV-72B, 24.8% to 13%. The choice of underlying language model matters too. LLaVA-NeXT with Vicuna scored 19.3%, while the same setup with Mistral managed only 4.5%.

Closed models such as GPT-4.1 posted the highest accuracy figures, but the researchers flagged their unusually high COCO scores as possible data contamination.

Models that look proactive are mostly just guessing

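Some models look more proactive than others at first glance. The researchers ran a stress test in which valid proactive suggestions were replaced with meaningless ones (for example, "Rewind the video" for a sketch task). Models that had previously seemed proactive chose these nonsensical options just as readily. LLaVA-NeXT Vicuna's selection rate actually rose from 37% to 49% when it was given bogus choices. The conclusion is that what looks like proactivity is not real understanding, just a lower bar for guessing.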

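Putting explicit hints into prompts and conversation histories does not fix things either. Hints raise the rate of proactive suggestions and lift accuracy to 25.8%, but that still does not beat chance on average. In 16% of cases, models blindly repeat proactive suggestions until they hit the maximum number of allowed steps. Conversation histories actually make performance worse. Instead of learning from the history, models simply parrot the proactive actions it contains.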
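To make the "maximum allowed steps" failure mode concrete, here is a small sketch of a step-capped interaction loop. Everything in it is an assumption made for illustration: the cap of 3, the `"ask"` action string, and the `run_episode` helper are hypothetical and are not taken from the benchmark.

```python
from typing import Callable, List, Tuple

MAX_STEPS = 3  # hypothetical cap; the article does not give the real value


def run_episode(
    policy: Callable[[int], str],
    max_steps: int = MAX_STEPS,
) -> Tuple[str, List[str]]:
    """Run one episode; `policy(step)` returns 'ask' or a final answer string.

    Returns the outcome ('answered' or 'exhausted') and the action history.
    """
    history: List[str] = []
    for step in range(max_steps):
        action = policy(step)
        history.append(action)
        if action != "ask":
            return "answered", history
    # The model used every step on help requests and never committed to an answer.
    return "exhausted", history


def asks_then_answers(step: int) -> str:
    """Asks for help once, then commits to an answer."""
    return "ask" if step == 0 else "cat"


def always_asks(step: int) -> str:
    """Mimics the failure mode: only ever asks for help."""
    return "ask"


print(run_episode(asks_then_answers))
print(run_episode(always_asks))
```

The second policy corresponds to the behaviour described above: it keeps asking until the step budget runs out and never produces a prediction.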

Reinforcement learning can teach models when to ask for help

There is a bright spot, though. The researchers showed that proactivity can be trained. They fine-tuned LLaVA-NeXT-Mistral-7B and Qwen2.5-VL-3B using Group-Relative Policy Optimization (GRPO) on roughly 27,000 examples.

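The key detail is that the reward function scores correct predictions higher than proactive suggestions, so the model asks for help only when it is genuinely stuck. After training, both models outperformed all 22 previously tested models, including o4-mini.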
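The reward ordering can be illustrated with a short sketch. The numeric values below are hypothetical, since the article only states that a correct prediction must score higher than a proactive suggestion. The normalisation step follows the general idea of GRPO, in which each sampled response is compared against the other responses in its group. Population standard deviation is used here, and real implementations may differ in such details.

```python
from statistics import mean, pstdev
from typing import List

# Hypothetical reward values: the article only gives the ordering
# (correct prediction > proactive suggestion), not the numbers.
REWARD_CORRECT = 1.0
REWARD_PROACTIVE = 0.5
REWARD_WRONG = 0.0


def reward(action: str, is_correct: bool) -> float:
    """Score one sampled response. `action` is 'answer' or 'ask'."""
    if action == "ask":
        return REWARD_PROACTIVE
    return REWARD_CORRECT if is_correct else REWARD_WRONG


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalise rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of sampled responses to the same prompt:
# a correct answer, a help request, and a wrong answer.
group = [("answer", True), ("ask", False), ("answer", False)]
rewards = [reward(action, ok) for action, ok in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

With this ordering, asking for help beats a wrong guess but loses to a correct answer, so asking only pays off when the model cannot answer correctly. If `REWARD_PROACTIVE` were set equal to `REWARD_CORRECT`, asking would never be worse than answering, which is consistent with the failure reported in the original article, where equal rewards led the model to spam help requests.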

Original text (English)

AI models would rather guess than ask for help, researchers find

Jonathan Kemper • Apr 11, 2026

ProactiveBench tests whether multimodal language models ask users for help when visual information is missing. Out of 22 models tested, almost none ask for what they need, but a simple reinforcement learning approach hints at a fix.

If you ask a person to identify an object that's blocked from view, they'll ask you to move whatever's in the way. Multimodal language models don't work that way. They either hallucinate a wrong answer or just refuse to respond. The new ProactiveBench benchmark puts this problem under a microscope, systematically testing whether today's AI models can recognize when they need help and actually ask for it.

The benchmark pulls from seven existing datasets and turns them into test scenarios that are impossible to solve without human input. Models have to identify hidden objects, clean up noisy images, interpret rough sketches, or request different camera angles. All told, ProactiveBench packs more than 108,000 images into 18,000 samples. A built-in filter strips out any task a model can nail on the first try; to pass, a model has to proactively ask for more information.

Larger models don't ask better questions

The researchers put 22 multimodal language models through their paces, including LLaVA-OV, Qwen2.5-VL, InternVL3, GPT-4.1, GPT-5.2, and o4-mini. In the reference setting with clearly visible objects, the models nail an average of 79.8 percent of tasks. On ProactiveBench, that number craters by more than 60 percent.

The ROD dataset tells the starkest story. When objects are hidden behind blocks, accuracy plummets from 98.3 percent in the reference setting to just 8.2 percent. The models can spot objects just fine when they're in plain sight; they just never think to ask someone to uncover them.

Model size doesn't help either. InternVL3-1B actually outperforms InternVL3-8B at 27.1 versus 12.7 percent. The older LLaVA-1.5-7B beats the much newer LLaVA-OV-72B at 24.8 versus 13 percent. The choice of underlying language model matters too: LLaVA-NeXT with Vicuna hits 19.3 percent, while the same setup with Mistral manages just 4.5 percent.

Closed models like GPT-4.1 posted the best accuracy numbers, though the researchers flag their unusually high COCO scores as possible data contamination.

Models that look proactive are mostly just winging it

Some models appear more proactive than others at first glance. The researchers stress-tested this by swapping valid proactive suggestions with nonsensical ones, like "Rewind the video" for a sketching task. Models that previously seemed proactive picked the meaningless options just as happily. LLaVA-NeXT Vicuna actually bumped its selection rate from 37 to 49 percent when given bogus choices. The takeaway is that what looks like proactivity is really just a lower bar for guessing, not actual understanding.

Dropping explicit hints into prompts and conversation histories doesn't fix things either. Hints do push the rate of proactive suggestions up, nudging accuracy to 25.8 percent, but that still doesn't beat chance on average. In 16 percent of cases, models just blindly spam proactive suggestions up to the maximum allowed steps. Conversation histories actually make performance worse: models parrot the proactive actions from the history instead of learning from them.

Reinforcement learning can teach models when to speak up

There is a bright spot, though. The researchers showed that proactivity can be trained in. They fine-tuned LLaVA-NeXT-Mistral-7B and Qwen2.5-VL-3B using Group-Relative Policy Optimization (GRPO) on roughly 27,000 examples.

The key detail: the reward function scores correct predictions higher than proactive suggestions, so the model only asks for help when it's genuinely stuck. After training, both models beat every one of the 22 previously tested models, including o4-mini (37.4 and 38.6 versus 34.0 percent).

The learned proactivity also carried over to scenarios outside the training data. On ChangeIt, Qwen2.5-VL-3B's accuracy jumped from 12.4 to 55.6 percent. But get the reward balance wrong, and the whole thing falls apart: when proactive suggestions are rewarded equally to correct answers, the model spams help requests nonstop, and accuracy tanks to 5.4 percent.

Even with these gains, a big gap remains compared to the reference setting (40.7 versus 75.1 percent). The researchers have released ProactiveBench as open source and frame it as a first step toward models that know when they're missing information and ask for it instead of making things up.

AI models don't know what they don't know

ProactiveBench taps into a pattern that keeps surfacing across recent AI research: multimodal language models are terrible at handling uncertainty. Moonshot AI's WorldVQA benchmark recently found that even top-tier models cap out around 50 percent in visual object recognition, pointing to baked-in overconfidence.

A Stanford study on what researchers call the Mirage effect drove this point home. Multimodal models like GPT-5 and Gemini 3 Pro confidently described visual details and offered medical diagnoses even when no image was provided. On standard benchmarks, they hit 70 to 80 percent of their normal performance using nothing but text patterns and prior knowledge, essentially faking visual understanding without realizing the input was missing.

Other research tells a similar story. A study on exam question difficulty found that language models can't reliably gauge their own limits, while researchers at Sapienza University of Rome used their "Spilled Energy" method to show that hallucinations leave measurable traces in a model's computations, suggesting that even when models don't know they're guessing, the math under the hood does.

Multimodal, Benchmark, Reinforcement learning, Hallucination, AI research