The Decoder • 59 days ago

AI search agents lean on existing knowledge rather than actually searching

IMP

8/10

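Key summary

According to a new study, leading AI search agents tend to use search to confirm knowledge they already learned during training rather than to explore the web for information. When models that score well on established benchmarks were placed in a new setting (LiveBrowseComp) that requires real-time information beyond their internal knowledge, both their performance and their rankings dropped sharply. This suggests that static benchmark scores largely reflect how much knowledge a model has memorized rather than its actual search ability.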

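Translated text

AI search agents often confirm answers they already know rather than actually searching the web.

A new study suggests that leading AI search agents do not actually research on established benchmarks; they mostly use the web to confirm answers they already have. Once models have to go beyond their existing knowledge, search performance drops sharply.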

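Frontier models such as GPT-5.4, Gemini 3.1 Pro, Claude Sonnet 4.6, DeepSeek-V4-Pro, and Kimi-K2.6 keep posting higher scores on the BrowseComp benchmark. The benchmark poses complex questions that agents can only answer through multi-step browsing and by combining information from different web sources.

However, a new study by researchers from the Harbin Institute of Technology and Xiaohongshu shows that these results say far less about the agents' actual search ability than previously assumed. The authors call the phenomenon "intrinsic knowledge dependence" (IKD): a reliance on internal knowledge that the models absorbed during training.

The researchers tested eleven models in total, starting by removing all search and browsing tools. Even without internet access, the models scored surprisingly high. MiniMax M2.5 solved 44.5% of BrowseComp tasks from memory alone. Kimi-K2.6 reached 62% accuracy on BrowseComp-ZH, the Chinese-language variant. In other words, a large share of benchmark performance is settled before any search takes place.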

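Searching can actually hurt the answer

The second test is more telling. The researchers kept the search interface in place but removed every answer-supporting document from the search index. Every model tested then performed worse than it had with no tool access at all. MiniMax M2.5 dropped from 44.5% to 8.0%, and Kimi-K2.6 fell from 25.5% to 2.3%. When no confirming hits turn up, searching actively pulls agents away from gut-feeling answers that were correct.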
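The size of these drops is easier to compare as relative losses. The following is a minimal Python sketch that uses only the two score pairs reported above; the helper name `drop` and the dictionary layout are illustrative assumptions, not something taken from the study.

```python
# Closed-book score vs. score with a search index stripped of
# answer-supporting documents, in percent. Only the two pairs
# reported in the article are included.
REPORTED = {
    "MiniMax M2.5": (44.5, 8.0),
    "Kimi-K2.6": (25.5, 2.3),
}


def drop(closed_book: float, ablated: float) -> tuple[float, float]:
    """Return (absolute drop in points, relative drop in percent)."""
    points = closed_book - ablated
    relative = 100.0 * points / closed_book
    return round(points, 1), round(relative, 1)


if __name__ == "__main__":
    for model, (closed_book, ablated) in REPORTED.items():
        points, relative = drop(closed_book, ablated)
        print(f"{model}: -{points} points ({relative}% of the closed-book score lost)")
```

On these figures, both models lose more than 80% of what they could answer from memory alone once the search tool returns nothing that confirms their answer.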

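An analysis of the search paths explains why. More than half of all queries came from the model's own reasoning rather than from previously retrieved results. Even when relevant evidence appeared in the search results, agents folded it into their reasoning less than a third of the time. The loop is model-led, not evidence-led.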
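One way to picture the model-led versus evidence-led distinction is to ask, for each query, whether it reuses any term that first appeared in earlier search results. The sketch below is a deliberately crude token-overlap heuristic written for illustration only. The article does not describe how the authors classified queries, and every name here (`Step`, `tokens`, `evidence_led_fraction`, the toy trace) is hypothetical. Stopwords are not handled.

```python
import re
from dataclasses import dataclass


@dataclass
class Step:
    query: str           # search query issued by the agent
    results: list[str]   # snippets returned for that query


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def evidence_led_fraction(question: str, trace: list[Step]) -> float:
    """Share of queries that reuse a term first seen in earlier results.

    Assumption: a query counts as evidence-led if it contains at least one
    token that occurred in a previously returned snippet but not in the
    question itself. Every other query is treated as model-led.
    """
    if not trace:
        return 0.0
    question_tokens = tokens(question)
    seen_in_results: set[str] = set()
    evidence_led = 0
    for step in trace:
        new_terms = tokens(step.query) - question_tokens
        if new_terms & seen_in_results:
            evidence_led += 1
        for snippet in step.results:
            seen_in_results |= tokens(snippet)
    return evidence_led / len(trace)


if __name__ == "__main__":
    # Toy trace with made-up content, not data from the study.
    toy = [
        Step("director of the 2019 film", ["The film was directed by Jane Doe."]),
        Step("Jane Doe earlier work", ["Doe previously made a short documentary."]),
        Step("famous directors 2019", []),
    ]
    print(evidence_led_fraction("Who directed the 2019 film about lighthouses?", toy))
```

In the toy trace only the second query picks up a name that surfaced in a result, so one query in three counts as evidence-led under this heuristic.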

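A benchmark beyond the knowledge frontier

To measure real search behavior, the authors built a new benchmark called LiveBrowseComp. It consists of 335 human-written questions, each of which depends on at least one fact from the 90 days before the question was created and cannot be answered without that current information. The underlying events come from continuously updated sources such as film databases, game directories, security vulnerability registers, and earthquake catalogs. Globally prominent events are deliberately filtered out, leaving obscure but publicly verifiable facts that had little chance of seeping into model parameters during training.

Human testers need about the same amount of time for LiveBrowseComp as for BrowseComp and solve a similar number of tasks. The models' performance drop is therefore due to losing the memory shortcut, not to the questions being harder.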
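The 90-day requirement can be expressed as a simple date check. The sketch below assumes each question records its creation date and the dates of the facts it depends on. The `Question` type, its field names, the inclusive window boundaries, and the example dates are all hypothetical; they only illustrate the stated rule that at least one fact must fall within the 90 days before creation. The check does not cover the other two criteria (unanswerable without the fresh fact, and not globally prominent), which need human judgment.

```python
from dataclasses import dataclass
from datetime import date, timedelta

FRESHNESS_WINDOW = timedelta(days=90)  # window stated in the article


@dataclass
class Question:
    text: str
    created: date
    fact_dates: list[date]  # dates of the events the answer depends on


def is_fresh(q: Question, window: timedelta = FRESHNESS_WINDOW) -> bool:
    """True if at least one supporting fact dates from the window before creation."""
    earliest = q.created - window
    return any(earliest <= d <= q.created for d in q.fact_dates)


if __name__ == "__main__":
    # Made-up example dates, for illustration only.
    q = Question(
        text="placeholder question",
        created=date(2026, 3, 1),
        fact_dates=[date(2025, 6, 1), date(2026, 1, 15)],
    )
    print(is_fresh(q))
```

A question whose facts are all older than the window would be rejected, since a model could plausibly have seen them during training.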

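Leaderboard rankings fall apart

On LiveBrowseComp, every model scored below 2% accuracy in the closed-book test (no tools). With tools enabled, scores landed about 25 to 40 points below the same models' BrowseComp results. This reshuffles the rankings. GLM 5.1 clearly leads the open-source models on BrowseComp but falls to mid-pack on LiveBrowseComp. DeepSeek v3.2, by contrast, sat at the bottom on BrowseComp but climbed to the top on LiveBrowseComp, passing several models that had previously outperformed it. A model's position on a static leaderboard therefore mostly shows how much it already knows, not how well it searches.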
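How strongly two leaderboards disagree can be summarized with a rank correlation. The sketch below computes Spearman's rho for two score tables over the same models, assuming no tied scores. The article does not report per-model scores or say that the authors used this statistic, so the model names and numbers in the example are invented placeholders, not results.

```python
def ranks(scores: dict[str, float]) -> dict[str, int]:
    """Map each model to its rank, where rank 1 is the highest score (no ties)."""
    ordered = sorted(scores, key=lambda name: scores[name], reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def spearman(a: dict[str, float], b: dict[str, float]) -> float:
    """Spearman rank correlation of two leaderboards over the same models."""
    rank_a, rank_b = ranks(a), ranks(b)
    n = len(rank_a)
    d_squared = sum((rank_a[m] - rank_b[m]) ** 2 for m in rank_a)
    return 1 - 6 * d_squared / (n * (n**2 - 1))


if __name__ == "__main__":
    # Placeholder scores for hypothetical models, for illustration only.
    static_board = {"model_a": 60.0, "model_b": 50.0, "model_c": 40.0, "model_d": 30.0}
    live_board = {"model_a": 20.0, "model_b": 18.0, "model_c": 15.0, "model_d": 25.0}
    print(ranks(static_board))
    print(ranks(live_board))
    print(spearman(static_board, live_board))
```

A value near 1 means the two leaderboards order the models the same way. A value near 0 or below means that position on one board says little about position on the other, which is the pattern the article describes when the last-placed model moves to first.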

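Agents need more steps when they can't rely on memory

On BrowseComp, agents solve many questions in very few steps, a sign of quick memory confirmation. On LiveBrowseComp, that pattern disappears. Step counts shift much higher, which suggests the agents are doing real research instead of recalling stored knowledge.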


Original text (English)

AI search agents often confirm what they already know instead of actually researching the web

Jonathan Kemper • May 31, 2026

A new study suggests that leading AI search agents don't actually research on established benchmarks; they mostly use the web to confirm answers they already have. Once models have to go beyond their existing knowledge, search performance falls apart.

Frontier models like GPT-5.4, Gemini 3.1 Pro, Claude Sonnet 4.6, DeepSeek-V4-Pro, and Kimi-K2.6 keep posting higher scores on BrowseComp. The benchmark asks agents complex questions that can only be answered through multi-step browsing and piecing together information from different web sources.

Researchers from the Harbin Institute of Technology and Xiaohongshu have now shown in a study that these results say less about the agents' research skills than assumed. The authors call it "intrinsic knowledge dependence" (IKD), a reliance on internal knowledge the models absorbed during training.

The researchers tested eleven models total, first stripping away all search and browsing tools. Even without internet access, the models scored surprisingly high. MiniMax M2.5 solved 44.5 percent of BrowseComp tasks from memory alone. Kimi K2.6 hit 62 percent on the Chinese BrowseComp-ZH variant. A big chunk of benchmark performance, in other words, comes before any search even happens.

Searching can actually hurt the answer

The second test is more telling. The researchers left the search interface in place but removed all answer-supporting documents from the search index. Every model tested then performed worse than it did without any tool access at all. MiniMax M2.5 dropped from 44.5 to 8.0 percent. Kimi-K2.6 fell from 25.5 to 2.3 percent. The search actively pulls agents away from correct gut-feeling answers as soon as no confirming hits show up.

An analysis of the search paths explains why. More than half of all queries come from the model's own reasoning rather than from previously found hits. Even when relevant evidence does appear in search results, the agents fold it into their reasoning less than a third of the time. The loop is model-led, not evidence-led.

A benchmark beyond the knowledge frontier

To measure real search behavior, the authors built LiveBrowseComp. The benchmark contains 335 human-written questions, each depending on at least one fact from the 90 days before creation and impossible to answer without that current information. The underlying events come from constantly updated sources like film databases, game directories, security vulnerability registers, and earthquake catalogs. Globally prominent events are filtered out deliberately, leaving obscure but publicly verifiable facts that had little chance of seeping into model parameters during training.

Human testers need about the same amount of time for LiveBrowseComp as for BrowseComp and solve a similar number of tasks. The performance drop among models is therefore due to losing the memory shortcut, not because the questions are harder.

Leaderboard rankings fall apart

On LiveBrowseComp, all models in the closed-book test fall below two percent accuracy. With tools turned on, scores land about 25 to 40 points below the same models' BrowseComp results. This shifts the rankings. GLM 5.1 leads clearly among open-source models on BrowseComp but falls to mid-pack on LiveBrowseComp. DeepSeek v3.2 sat at the bottom on BrowseComp, then climbed to the top on LiveBrowseComp, passing several models that previously outperformed it. This shows that a model's spot on a static leaderboard mostly shows how much it already knows, not how well it searches.

Agents need more steps when they can't rely on memory

On BrowseComp, agents solve many questions in very few steps, a sign of quick memory confirmation. On LiveBrowseComp, that pattern disappears. The step counts shift much higher, which suggests the agents are doing real research instead of recalling stored knowledge.

The authors argue that dynamic, time-sensitive benchmarks should become the standard for evaluating AI agents. They also want training signals that reward evidence-based research over the typical guess-and-verify approach.

Other studies have flagged similar problems. A benchmark from Peking University found that top models often produce the right answer when analyzing documents but cite the wrong source, what the researchers call "attribution hallucination." A tool called CiteAudit recently discovered that fabricated references have already made it into accepted papers at major AI conferences. The reason: commercial models don't reliably catch made-up citations.

Agents, Search, Benchmarks, Hallucination