The Decoder • 115 days ago

Google study: AI evaluation overlooks the diversity of human opinion

IMP

8/10

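Key summary

According to a joint study by Google and the Rochester Institute of Technology, the three to five human raters per item commonly used in AI benchmarks are not enough to produce reliable results. Reliable evaluation requires at least 10 raters per item, and the total budget must be split strategically between the number of test items and the number of raters.

Translated text

How many raters does a good AI benchmark need? According to a new study, the standard practice of three to five raters per test item often falls short, and how the rating budget is allocated matters just as much as how large it is.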

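When AI models compete, human evaluation often decides which one is better. Raters judge things like whether a comment is toxic or whether a chatbot response is safe. The problem is that people disagree on these judgments. AI research typically collects three to five ratings per item and picks a single "correct" answer by majority vote. That approach systematically discards the diversity of human opinion.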
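To make that loss of information concrete, here is a minimal Python sketch. The ratings and the helper names `majority_label` and `label_distribution` are invented for illustration and do not come from the study.

```python
from collections import Counter


def majority_label(ratings):
    """Return the most common label among one item's ratings."""
    return Counter(ratings).most_common(1)[0][0]


def label_distribution(ratings):
    """Return the share of raters who chose each label."""
    counts = Counter(ratings)
    total = len(ratings)
    return {label: count / total for label, count in counts.items()}


# Hypothetical ratings for two comments, five raters each.
unanimous = ["toxic", "toxic", "toxic", "toxic", "toxic"]
contested = ["toxic", "toxic", "toxic", "ok", "ok"]

# Both items collapse to the same majority label...
same_label = majority_label(unanimous) == majority_label(contested)
# ...even though two of five raters disagreed on the second one.
contested_shares = label_distribution(contested)
```

After majority voting, the two comments are indistinguishable, although the second one split the raters three to two.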

Researchers from Google Research and the Rochester Institute of Technology set out to find a smarter way to spend a limited rating budget. The key question: is it better to evaluate as many test items as possible, or to have fewer items rated by far more people?

The researchers illustrate the dilemma with a restaurant analogy. Imagine asking 1,000 guests to each taste a single dish: you would get a broad but shallow assessment. Now imagine 20 diners rating the same 50 dishes. You would get a much richer picture of which dishes are actually good and which are not. Most of today's AI benchmarks overwhelmingly follow the first approach, casting a wide net over many test items while collecting only a thin layer of human judgment for each.

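To find the sweet spot, the team built a simulator that reproduces human rating patterns based on real datasets. The simulator generates synthetic evaluation data for two AI models, one of which is set to perform worse than the other in a controlled way. This setup makes it possible to test under which conditions the difference between the models can be detected reliably.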
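The study's simulator is calibrated on real rating data and is not reproduced here. As a rough illustration of the setup only, the toy sketch below assumes binary labels, draws each item's "true" share of toxic votes at random, and makes model B worse than model A by a fixed offset. Every name and parameter (`simulate_trial`, `detection_rate`, `degrade`) is hypothetical.

```python
import random


def simulate_trial(n_items, n_raters, degrade, rng):
    """Run one synthetic comparison; return mean distance to raters for models A and B.

    Lower is better. For binary labels, the total variation distance between
    two label distributions equals the absolute difference in P("toxic").
    """
    dist_a = dist_b = 0.0
    for _ in range(n_items):
        # Assumed population share of raters who would call this item toxic.
        p_true = rng.uniform(degrade, 1.0 - degrade)
        # Sample a finite panel of raters from that population.
        votes = sum(rng.random() < p_true for _ in range(n_raters))
        p_observed = votes / n_raters
        # Model A predicts the population share exactly.
        p_a = p_true
        # Model B is worse in a controlled way: off by `degrade` in a random direction.
        p_b = p_true + rng.choice([-degrade, degrade])
        dist_a += abs(p_a - p_observed)
        dist_b += abs(p_b - p_observed)
    return dist_a / n_items, dist_b / n_items


def detection_rate(n_items, n_raters, degrade=0.1, trials=200, seed=0):
    """Fraction of simulated trials in which the better model A scores better than B."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        score_a, score_b = simulate_trial(n_items, n_raters, degrade, rng)
        if score_a < score_b:
            wins += 1
    return wins / trials
```

Calling `detection_rate` with different `n_items` and `n_raters` for the same total number of ratings gives a feel for the trade-off, but the numbers from this toy say nothing about the study's actual results.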

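The researchers calibrated the simulator against five real datasets covering toxicity detection, chatbot safety, and cross-cultural offensiveness assessment. On that basis, they tested thousands of combinations across different total budgets and numbers of raters per item.

The results call current practice into question. According to the study, the typical one to five raters per test item are often not enough to make model comparisons reproducible. Statistically reliable results that actually capture the range of human opinion generally require more than ten raters per item.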

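The experiments also show that reliable results can be obtained with around 1,000 annotations in total, but only if the budget is split correctly between the number of test items and the number of raters. Get the balance wrong, the researchers say, and you can reach unreliable conclusions even with a much larger budget.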
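The arithmetic behind such a split is simple: a fixed annotation budget divided by the number of raters per item gives the number of items that can be covered. The sketch below lists a few splits of a 1,000-annotation budget; the function name `budget_splits` and the chosen rater counts are illustrative assumptions, not values tested in the study.

```python
def budget_splits(total_annotations, rater_options):
    """List (items, raters_per_item) pairs that fit within the annotation budget."""
    splits = []
    for raters in rater_options:
        # Whole items only; any leftover annotations are simply unused.
        items = total_annotations // raters
        if items > 0:
            splits.append((items, raters))
    return splits


splits = budget_splits(1000, [3, 5, 10, 20, 50])
# → [(333, 3), (200, 5), (100, 10), (50, 20), (20, 50)]
```

Each pair costs roughly the same, so the choice between, say, 333 items with 3 raters and 50 items with 20 raters is purely a question of allocation.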

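The biggest takeaway is that no single ratio fits every situation. The right strategy depends entirely on what is being measured. With an accuracy-based metric, which checks whether the model agrees with the raters' majority vote, it pays to use fewer raters and cover more items. If the goal is to capture the full diversity of human opinion, the number of test items should be reduced and the number of raters per item increased substantially.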
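The difference between the two kinds of metric can be shown on a single item. The original article names total variation as one distribution-aware metric; the sketch below compares it with majority-vote agreement. The rater shares and the two model outputs are invented for illustration, and `top_label` and `total_variation` are hypothetical helper names.

```python
def top_label(dist):
    """Return the label with the highest share in a distribution."""
    return max(dist, key=dist.get)


def total_variation(p, q):
    """Total variation distance between two label distributions (0 = identical)."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(label, 0.0) - q.get(label, 0.0)) for label in labels)


# Hypothetical rater shares for one chatbot response.
human = {"safe": 0.6, "unsafe": 0.4}
# Two hypothetical models that both pick the majority label "safe".
model_x = {"safe": 0.65, "unsafe": 0.35}
model_y = {"safe": 1.0, "unsafe": 0.0}

# Accuracy against the majority vote cannot tell the models apart.
x_correct = top_label(model_x) == top_label(human)
y_correct = top_label(model_y) == top_label(human)

# Total variation can: model_x is close to the raters, model_y ignores the dissent.
tv_x = total_variation(human, model_x)
tv_y = total_variation(human, model_y)
```

Both models count as correct under accuracy, while their total variation distances differ by a wide margin. Estimating the rater shares well enough for that comparison is what requires many raters per item.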


Original text (English)

AI benchmarks systematically ignore how humans disagree, Google study finds

Jonathan Kemper
Apr 5, 2026

Key Points

A study by Google Research and the Rochester Institute of Technology finds that the common practice of using just three to five human evaluators per test example is insufficient for reliable AI benchmarks; at least ten are needed.

Around 1,000 annotations can produce reliable results, but only if the budget is correctly split between the number of test examples and the number of raters. A poor balance leads to unreliable outcomes, even with more resources.

The ideal distribution depends on what is being measured: majority-vote evaluations require many examples with fewer raters, while capturing the full diversity of human opinion calls for fewer examples but significantly more raters per item.

How many evaluators does a good AI benchmark actually need? New research shows that the standard three to five raters per test example often aren't enough, and that how you allocate your annotation budget matters just as much as how big it is.

When AI models go head-to-head, human evaluations often decide which one comes out on top. Evaluators rate things like whether a comment is toxic or whether a chatbot response is safe. The problem is that people frequently disagree on these calls. Standard practice in AI research is to collect three to five ratings per example and pick a single "correct" answer by majority vote. That approach systematically throws out the diversity of human opinion.

Researchers from Google Research and the Rochester Institute of Technology wanted to find a smarter way to spend a limited rating budget. The key question: Is it better to evaluate as many test examples as possible or to have fewer examples rated by a lot more people?

The researchers frame the dilemma with a simple restaurant analogy. Imagine asking 1,000 guests to each sample a single dish: you'd get a broad but shallow snapshot. Now imagine asking 20 diners to rate the same 50 dishes. You'd walk away with a far richer picture of what's actually good and what isn't. Today's AI benchmarks overwhelmingly follow the first model, casting a wide net across test examples while collecting only a thin layer of human judgment for each one.

Stress-testing thousands of budget splits

To find the sweet spot, the team built a simulator that replicates human rating patterns using real datasets. The simulator generates synthetic evaluation data for two models, with one performing worse than the other in a controlled way. This setup makes it possible to test which conditions let you reliably detect the difference between models.

The team calibrated the simulator against five real datasets covering toxicity detection, chatbot safety, and cross-cultural offensiveness assessment. All told, they tested thousands of combinations across different total budgets and rater counts per example.

Fewer than ten raters per example isn't cutting it

The results put current practice in question. The typical one to five raters per test example often aren't enough to make model comparisons reproducible, according to the study. For statistically reliable results that actually capture the range of human opinion, you generally need more than ten raters per example.

The experiments also show that reliable results can often be achieved with around 1,000 total annotations, but only if the budget is split correctly between test examples and raters. Get the balance wrong, and you can end up with unreliable conclusions even on a much larger budget, the researchers say.

What you measure should dictate how you spend

The biggest takeaway is that there's no one-size-fits-all ratio. The right strategy depends entirely on what you're trying to measure.

If you're using accuracy (checking whether a model agrees with the evaluators' majority vote), a wide approach works best: as many test examples as possible with just a few raters each. Accuracy only looks at the most common answer, so extra raters barely move the needle.

But if you want to capture the full spread of human responses, using a metric like total variation, for instance, you need the opposite playbook. Fewer test examples, but way more raters per example. That's the only way to map how much evaluators actually agree or disagree. Different examples can get the same majority-vote label yet have very different response distributions underneath. In the experiments, this distribution-aware metric also needed the smallest overall budget to produce reliable results.

Source: Google Research | Arxiv

AI benchmarks, human evaluation, Google Research, evaluation methodology, model evaluation