The Decoder • 60 days ago

The more helpful AI chatbots become, the worse they get at simulating human behavior

Key Summary

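According to a large-scale study, the fine-tuning process that turns AI models into helpful chatbots reduces their ability to predict and simulate human behavior. Base models learn human language and cognitive biases well, but additional training such as reinforcement learning pushes them toward logical, norm-conforming answers and away from characteristically human behavior patterns.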

Translated Text

A large-scale study finds that the process of making AI chatbots helpful weakens their ability to simulate human behavior.

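Language models are increasingly used as stand-ins for human participants, for example to predict reactions to policy measures, simulate clinical training for psychiatrists, or model how students learn. A new study from an international research consortium, including scientists from Helmholtz Munich, reaches an uncomfortable conclusion: the very training steps that turn language models into useful assistants degrade their ability to model human behavior.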

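The study builds on Psych-201, a new dataset of transcripts from behavioral experiments. It covers about 208,000 participants and roughly 26 million individual responses from hundreds of experiments, making it several times larger than any comparable earlier dataset. Each data point captures a participant's full run through an experiment, along with detailed metadata such as age, nationality, questionnaire responses, and other traits. The dataset was assembled through an open research collaboration involving researchers from more than 35 institutions.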

The researchers compared base models with their various post-trained variants across the Qwen3, Llama3, and OLMo 3 model families. A base model has been trained only to predict the next word in text. Additional training then produces versions tuned for instruction-following, step-by-step reasoning, or image processing. The evaluation metric was how well each model predicts the answers that human participants actually gave.
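The article does not say exactly how prediction quality was scored. One common way to measure "how well a model predicts the answers humans gave" is the mean negative log-likelihood of the human responses under the model's predicted answer probabilities. The sketch below assumes that metric. The function name, the data layout, and the toy numbers are all hypothetical and are not taken from the study.

```python
import math

def mean_negative_log_likelihood(predicted_probs, human_choices, floor=1e-12):
    """Mean negative log-probability assigned to the answers humans actually gave.

    Lower is better. `predicted_probs` is a list of dicts mapping each answer
    option to the model's probability; `human_choices` lists the chosen options.
    """
    if len(predicted_probs) != len(human_choices):
        raise ValueError("need one prediction per human response")
    if not human_choices:
        raise ValueError("need at least one response")
    total = 0.0
    for probs, choice in zip(predicted_probs, human_choices):
        # Clamp to a small floor so a zero probability does not raise an error.
        total -= math.log(max(probs.get(choice, 0.0), floor))
    return total / len(human_choices)

# Hypothetical toy data, not taken from Psych-201.
human_choices = ["A", "A"]
spread_out = [{"A": 0.6, "B": 0.4}, {"A": 0.3, "B": 0.7}]
overconfident = [{"A": 0.99, "B": 0.01}, {"A": 0.01, "B": 0.99}]

print(mean_negative_log_likelihood(spread_out, human_choices))
print(mean_negative_log_likelihood(overconfident, human_choices))
```

In this toy case the spread-out predictions score lower (better) than the overconfident ones, because a single confident miss is penalized heavily under a likelihood-based score.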

The same pattern appeared across all model families and sizes: base models predicted human behavior better than their post-trained descendants. The effect showed up for every common training objective. It hit reasoning models hardest, followed by instruction tuning and vision extensions. In nearly every head-to-head comparison, the base model outperformed its specialized variant.

One obvious counter-explanation is that assistant models simply answer more deterministically and so fail to capture the natural spread of human behavior. To check this, the researchers ran an accuracy analysis on a subset of tasks with discrete answer options. Post-trained models still performed worse, which makes higher determinism unlikely to be the sole explanation.
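One way to see why an accuracy analysis speaks to the determinism objection: top-1 accuracy depends only on which option a model ranks highest, not on how confident it is. The sketch below illustrates this under assumptions of our own. The helper names, the temperature-style sharpening, and the toy data are hypothetical and do not come from the study's code.

```python
def top1_accuracy(predicted_probs, human_choices):
    """Fraction of trials where the model's most probable option matches the human choice."""
    hits = 0
    for probs, choice in zip(predicted_probs, human_choices):
        if max(probs, key=probs.get) == choice:
            hits += 1
    return hits / len(human_choices)

def sharpen(probs, temperature):
    """Rescale a distribution; a temperature below 1 makes it more deterministic."""
    weights = {option: p ** (1.0 / temperature) for option, p in probs.items()}
    total = sum(weights.values())
    return {option: w / total for option, w in weights.items()}

# Hypothetical toy data: three trials with two answer options each.
human_choices = ["A", "B", "B"]
soft = [{"A": 0.6, "B": 0.4}, {"A": 0.3, "B": 0.7}, {"A": 0.55, "B": 0.45}]
sharp = [sharpen(probs, temperature=0.25) for probs in soft]

# Sharpening changes the confidence but not the ranking of the options,
# so both versions get the same top-1 accuracy.
print(top1_accuracy(soft, human_choices), top1_accuracy(sharp, human_choices))
```

If a post-trained model were only more deterministic than its base model, its accuracy on discrete choices would be expected to stay roughly the same. A drop in accuracy therefore points to a change in which answers the model favors.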

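The gap widens with every model generation. Base models improve steadily from Qwen2 through Qwen2.5 to Qwen3, getting better at predicting human behavior each time, while the gap to their derived assistant models keeps growing. In other words, as post-training techniques advance, the divergence from actual human behavior gets worse.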

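The distortion is largest in language tasks and reasoning tasks. The researchers offer a plausible explanation: base models are, at their core, models of human language and are therefore well-calibrated for language processing tasks. Post-training techniques such as reinforcement learning from human feedback (RLHF) push models away from that original objective and toward more user-friendly or normatively correct answers.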

The same thing happens with reasoning. Human decisions are shaped by heuristics and systematic biases, which base models apparently pick up. Reasoning training instead optimizes for logically correct answers, overwriting exactly the human quirks that matter for behavioral simulation.

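A popular shortcut doesn't work either

A second key finding concerns a widely used technique: giving a language model participant-specific information to put it into a particular role. The study tested this with an interview format in which demographic details were added to the prompt before the experiment began. Where the data was available, the prompts included information such as age, gender, nationality, and education level.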
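The persona technique described above can be sketched roughly as follows. This is a minimal illustration and not the study's actual prompt template: the question wording, the metadata field names, and the function name are hypothetical. Fields missing for a participant are skipped, which mirrors the "where available" condition.

```python
# Hypothetical interview questions keyed by hypothetical metadata field names.
INTERVIEW_QUESTIONS = {
    "age": "How old are you?",
    "gender": "What is your gender?",
    "nationality": "What is your nationality?",
    "education": "What is your highest level of education?",
}

def build_persona_prompt(metadata, experiment_text):
    """Prepend an interview-style persona block to the experiment transcript."""
    lines = []
    for field, question in INTERVIEW_QUESTIONS.items():
        value = metadata.get(field)
        if value is None:
            continue  # Include only the details available for this participant.
        lines.append(f"Interviewer: {question}")
        lines.append(f"Participant: {value}")
    if lines:
        lines.append("")  # Blank line between the interview and the experiment.
    lines.append(experiment_text)
    return "\n".join(lines)

# Hypothetical participant with only two of the four fields recorded.
prompt = build_persona_prompt(
    {"age": 34, "nationality": "German"},
    "You will see two options on each trial. Choose one.",
)
print(prompt)
```

The resulting string would then be passed to the model in place of the bare experiment text, so that predictions with and without the persona block can be compared.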

Original text (English)

Making AI chatbots helpful weakens their ability to simulate human behavior, large-scale study finds

Jonathan Kemper
May 30, 2026

A large-scale study shows that the training process turning raw language models into helpful chatbots also weakens their ability to mimic human behavior. The effect gets worse with each new generation.

Language models are increasingly used as stand-ins for human test subjects to predict reactions to policy measures, simulate clinical training for psychiatrists, or model how students learn. A new study from an international research consortium, including scientists from Helmholtz Munich, arrives at an inconvenient finding: the very training steps that turn language models into useful assistants make them worse at modeling human behavior.

The study builds on Psych-201, a new dataset of transcripts from behavioral experiments. It covers about 208,000 participants and roughly 26 million individual responses from hundreds of experiments, several times larger than any previous collection of its kind. Each data point captures a participant's full run through an experiment, along with detailed metadata like age, nationality, questionnaire responses, and other traits. The dataset was assembled through an open research collaboration involving researchers from more than 35 institutions.

Base models beat their fine-tuned counterparts

The researchers compared models from the Qwen3, Llama3, and OLMo 3 families, testing both base models and their various post-trained variants. Base models are trained only to predict the next word in text. From there, extra training produces the versions tuned for instruction-following, step-by-step reasoning, or image processing. The metric: how well each model predicts the actual answers human participants gave.

The result holds across all families and sizes. Base models predict human behavior better than their post-trained descendants. The effect shows up for every common training objective, hitting hardest with reasoning models, followed by instruction tuning and vision extensions. In nearly every head-to-head comparison, the base model outperforms its specialized variant.

One obvious counter-explanation: maybe assistant models just answer more deterministically and fail to capture the natural spread of human behavior. The researchers tested this with an accuracy analysis on a subset of tasks with discrete answer options. Post-trained models still performed worse, making higher determinism unlikely as the sole explanation.

The gap widens with every generation

While base models steadily improve from Qwen2 through Qwen2.5 to Qwen3, getting better at predicting human behavior with each generation, the gap to their derived assistant models keeps growing. Ongoing advances in post-training are making the divergence from human behavior worse.

The biggest distortion shows up in language tasks and reasoning. The researchers offer a plausible explanation: base models are, at their core, models of human language and therefore well-calibrated for language processing tasks. Post-training techniques like reinforcement learning from human feedback push them away from that original objective toward more user-friendly or normatively correct answers.

The same thing happens with reasoning. Human decisions are shaped by heuristics and systematic biases that base models apparently pick up. Reasoning training optimizes for logically correct answers instead, overwriting exactly the human quirks that matter for behavioral simulation.

A popular shortcut doesn't work

A second finding concerns a widely used technique: giving language models participant-specific information to put them into a particular role. In the study, this took the form of an interview format where demographic details about each person were prepended before the experiment. Where available, the prompts included age, gender, nationality, education, clinical diagnoses, and questionnaire scores.

The effect was practically zero. That held even when the analysis was limited to developmental psychology experiments, where age-related differences should be informative. Earlier work had shown that persona prompts can produce human-like response distributions at the population level. But the new study questions whether they actually predict individual behavior or just look plausible on the surface.

Centaur shows targeted training can still help

The authors see their findings as a variation of a known problem: extra training toward specific goals can degrade abilities acquired during pretraining. To test whether this is a hard limit, they looked at Centaur, a model specifically fine-tuned on a portion of the behavioral data. Centaur showed much higher agreement with human behavior even on new tasks that weren't part of its training. So extra training can help, but only when it targets behavioral modeling rather than logical correctness.

For research practice, the takeaway is clear: the convenient, readily available assistant models aren't automatically the best choice for behavioral simulations. The researchers recommend either raw base models or variants trained specifically for behavioral simulation. Code and data are available on Hugging Face and GitHub.

That chatbot models have their pitfalls as digital test subjects isn't new. A recent study of nine open-source language models found that optimizing for more human-sounding output comes at the cost of factual precision, and a classifier unmasked AI responses with 70 to 80 percent accuracy. The persona trick also worked worse than expected. Another study found that models can barely pose as weak or strong learners on command, with their hit rates shifting by less than a percentage point. And when it comes to reasoning, a deep gap persists anyway: an analysis of more than 170,000 reasoning traces showed that reasoning models think differently than humans, falling into a kind of sequential autopilot.

Human behavior simulation, Large language models, Fine-tuning, Psychology research, Post-training