Hacker News • 65 days ago

Results of a GPT Random Number Picking Experiment

IMP

6/10

Key Summary

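To check whether the model mimics the human bias in picking random numbers, GPT-4.1 was asked for a random number between 1 and 100 a total of 10,000 times. The model turned out to be far from a perfect random number generator: it concentrated on particular numbers such as 37, 42, and 73 and avoided multiples of 10 almost entirely, producing a lopsided distribution very similar to the human one. The result is a notable example of a large language model (LLM) learning and reflecting much of the statistical character and cognitive bias of the human-written text it was trained on.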

Translated Text

Results of a GPT Random Number Picking Experiment

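One interesting fact about humans is that they are not good random number generators. Ask a person to "pick a random number between 1 and 100" and the result is surprisingly predictable. Responses cluster on particular numbers such as 37 and 73, on numbers that "look messy", and on meme numbers such as 42 and 69, while round numbers (multiples of 10 and the like) are quietly avoided. A true random number generator would instead produce a flat, uniform distribution. This project asks gpt-4.1 the same question 10,000 times and characterizes the distribution the model produces by comparing it with a uniform baseline. Does a large language model (LLM) trained on human-written text behave like a fair die, or does it inherit the lumpy, lopsided human pattern?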

Full design and methodology: docs/LLM Random Bias Experiment SDD.md

Inspiration

This experiment is an LLM-focused follow-up to two well-known explorations of human number-picking bias:

r/dataisbeautiful: "[OC] I asked 100 people to pick a number between 1 and 100"

Veritasium: "Why is this number everywhere?"

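Methodology

The full experimental design is in the SDD; the essentials are as follows.

Model: gpt-4.1 (OpenAI), called through the Responses API. It is a non-reasoning model: it emits a direct answer without deliberating over a strategy, so what we measure is the model's raw output distribution rather than a reasoning strategy. The exact model string is recorded in every raw CSV row (the Model column) and in data/raw/run_metadata.json, so the dataset is self-describing.

Sample size: 10,000 independent calls (N = 10,000). That is enough for a chi-square goodness-of-fit test and large enough for per-number proportions to be stable to roughly ±0.5 percentage points.

Sampling: temperature = 1.0, so the model exercises its full sampling distribution. This is the heart of the experiment: at a low temperature the model would simply repeat a single number.

Prompt: a fixed system prompt instructs the model to output only one integer between 1 and 100, and the user prompt requests the number and carries a unique uuid4. (The UUID is there for request tracing, not for cache-busting; at temperature 1.0 every call should sample independently anyway.)

Baseline: the result is compared against a uniform distribution (what a fair generator would produce), not against human data (see Assumptions).

Pipeline: four stages, collect → clean → transform → stats. The cleaning stage verifies that every answer is an integer in the range [1, 100] and reports the rejection rate.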
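The cleaning rule described in the methodology (accept only integers in [1, 100] and report how many answers were rejected) can be sketched as below. This is a minimal illustration, not the project's actual `llm_random_bias.clean` module; the function names `parse_answer` and `clean` and the sample inputs are assumptions made for the example.

```python
from __future__ import annotations


def parse_answer(raw: str) -> int | None:
    """Return the answer as an int in [1, 100], or None if it is invalid."""
    text = raw.strip()
    # Accept only plain ASCII digits: no signs, decimals, or extra words.
    if not (text.isascii() and text.isdigit()):
        return None
    value = int(text)
    return value if 1 <= value <= 100 else None


def clean(raw_answers: list[str]) -> tuple[list[int], float]:
    """Split raw answers into valid integers and report the rejection rate."""
    parsed = [parse_answer(raw) for raw in raw_answers]
    valid = [value for value in parsed if value is not None]
    rejected = len(parsed) - len(valid)
    rate = rejected / len(parsed) if parsed else 0.0
    return valid, rate


# Hypothetical raw model outputs, chosen to exercise each rejection path.
valid, rejection_rate = clean(["37", " 42\n", "0", "101", "seven", "7.0"])
print(valid, rejection_rate)
```

Being strict here (rejecting "7.0" rather than coercing it) keeps the rejection rate an honest measure of how often the model strayed from the requested output format.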

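Assumptions and Limitations

This is an illustrative probe rather than a definitive study. The main caveats are listed below (see the Limitations section of the SDD for the formal treatment).

Single model: the results describe gpt-4.1 only and do not generalize to other models or providers.

"Randomness" is a sampling artifact: the model is not a random number generator; it only samples from a learned token distribution. We characterize that distribution and do not claim that the model is trying to be random.

Prompt and temperature dependence: a different prompt wording or sampling temperature could change the distribution. Both are fixed and documented.

Not "ChatGPT" the product: this tests a model through the API at a fixed temperature, not the consumer ChatGPT app, which adds routing, tools, and a system prompt outside our control.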
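To make the temperature dependence concrete, here is a small sketch of the standard temperature-scaled softmax that turns token scores into sampling probabilities. The logits are invented for illustration and are not measurements from gpt-4.1; the sketch also assumes the API applies temperature in this textbook way.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    return [weight / total for weight in weights]


# Hypothetical logits for three candidate answers; not measured values.
logits = {"37": 2.0, "42": 1.5, "50": -1.0}

for temperature in (1.0, 0.1):
    probs = softmax_with_temperature(list(logits.values()), temperature)
    print(temperature, dict(zip(logits, (round(p, 3) for p in probs))))
```

At temperature 1.0 the top candidate shares probability with the runner-up, while at 0.1 nearly all of the mass collapses onto it, which is why a low-temperature run would mostly repeat one number.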

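Results

gpt-4.1 is emphatically not a uniform random generator. A chi-square goodness-of-fit test against a uniform distribution (N = 10,000, df = 99) gives χ² = 15,604, p ≈ 0. The deviation is so large that the p-value underflows to zero, below any significance threshold.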
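The test statistic above is the usual sum of (observed − expected)² / expected over the 100 possible answers. A minimal standard-library sketch follows; the function name `chi_square_uniform` and the toy samples are assumptions for illustration, not the project's `llm_random_bias.stats` code or its data. The input is assumed to be already cleaned to integers in [1, 100].

```python
from collections import Counter


def chi_square_uniform(samples: list[int], low: int = 1, high: int = 100) -> tuple[float, int]:
    """Return (chi-square statistic, degrees of freedom) against a uniform baseline."""
    categories = range(low, high + 1)
    counts = Counter(samples)
    expected = len(samples) / len(categories)  # same expected count for every number
    statistic = sum((counts.get(k, 0) - expected) ** 2 / expected for k in categories)
    return statistic, len(categories) - 1


# Toy data, not the experiment's results: a perfectly flat sample and a skewed one.
flat = [n for n in range(1, 101) for _ in range(100)]
skewed = [37] * 5_000 + [42] * 5_000

print(chi_square_uniform(flat))    # (0.0, 99)
print(chi_square_uniform(skewed))  # (490000.0, 99)
```

Turning the statistic into a p-value needs the chi-square survival function, for example `scipy.stats.chi2.sf(statistic, df)` if SciPy is available. With a statistic in the thousands and df = 99, that value is too small for a double-precision float and is reported as 0.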

Asked for a random number, the model produces a lumpy, distinctly human-shaped distribution.

The five most-picked numbers overall (47, 57, 72, 37, and 42) lean heavily on numbers ending in 7 (three of the five). This is the same pull toward a "number that feels random" that is observed in humans.

It avoids round numbers even harder than humans do: every multiple of 10 other than 10 itself was picked exactly 0 times in 10,000 calls, and 10 was picked exactly once. Humans shy away from round numbers; gpt-4.1 all but refuses them.
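The two measurements used in these results, how often a number was picked relative to uniform chance and how often each round number appeared, can be sketched as follows. The helper names are hypothetical, and the tally is a toy one built from round figures consistent with the reported ratios (about 400 picks for a 4.0× number, about 29 for 69, one for 10), not the published dataset.

```python
from __future__ import annotations

from collections import Counter


def ratio_vs_uniform(counts: Counter, total: int, number: int, categories: int = 100) -> float:
    """How many times more (or less) often `number` was picked than uniform chance."""
    expected = total / categories
    return counts.get(number, 0) / expected


def round_number_picks(counts: Counter) -> dict[int, int]:
    """Pick counts for every multiple of 10 between 10 and 100."""
    return {n: counts.get(n, 0) for n in range(10, 101, 10)}


# Toy tally with illustrative counts; the real figures live in the project's CSVs.
toy = Counter({37: 400, 69: 29, 10: 1})
total_calls = 10_000

print(ratio_vs_uniform(toy, total_calls, 37))
print(ratio_vs_uniform(toy, total_calls, 69))
print(round_number_picks(toy))
```

With 10,000 calls and 100 possible answers the uniform expectation is 100 picks per number, so a ratio is simply the observed count divided by 100.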

Original Text (English)

GPT Guesses Between 1 and 100

An interesting thing about humans is that they are not good random number generators. If you ask a person to "pick a random number between 1 and 100", they are remarkably predictable. Answers cluster on 37 and 73, on "messy" numbers, and on memes like 42 and 69, while round numbers are quietly avoided. A true random generator would instead produce a flat, uniform distribution.

This project asks gpt-4.1 the same question 10,000 times and characterizes the distribution it produces, measured against a uniform baseline. Does an LLM, which is trained on human text, behave like a fair die, or does it inherit the lumpy human pattern?

Full design and methodology: docs/LLM Random Bias Experiment SDD.md.

Inspiration

This experiment is an LLM-focused follow-up to two well-known explorations of human number-picking bias.

r/dataisbeautiful: "[OC] I asked 100 people to pick a number between 1 and 100"

Veritasium: "Why is this number everywhere?"

Methodology

Full experimental design is in the SDD; the essentials:

Model. gpt-4.1 (OpenAI), called via the Responses API. It is a non-reasoning model. It emits a direct answer rather than deliberating; what we're measuring is its raw output distribution, not a reasoning strategy. The exact model string is recorded in every raw-CSV row (Model column) and in data/raw/run_metadata.json, so the dataset is self-describing.

Sample size. N = 10,000 independent calls, enough for a chi-square goodness-of-fit test and per-number proportions stable to ~±0.5 pp.

Sampling. temperature = 1.0, so the model exercises its full sampling distribution. This is the experiment: at low temperature it would just repeat one number.

Prompt. A fixed system prompt instructs the model to output only one integer between 1 and 100; the user prompt requests the number and carries a unique uuid4. (The UUID is request-tracing hygiene, not cache-busting; at temperature 1.0 every call should sample independently regardless.)

Baseline. The result is compared against a uniform distribution (what a fair generator would produce), not against human data (see Assumptions).

Pipeline. Four stages: collect → clean → transform → stats, detailed below. Cleaning validates every answer is an integer in [1, 100] and reports the rejection rate.

Assumptions & Limitations

This is an illustrative probe, not a definitive study. Key caveats (see the SDD's Limitations section for the formal treatment):

Single model. Results describe gpt-4.1 only and do not generalize to other models or providers.

"Randomness" is a sampling artifact. The model is not a random number generator; it samples a learned token distribution. We characterize that distribution; we do not claim the model is trying to be random.

Prompt- and temperature-dependent. A different prompt wording or sampling temperature could shift the distribution. Both are fixed and documented.

Not "ChatGPT the product." This tests a model through the API at a fixed temperature, not the consumer ChatGPT app, which adds routing, tools, and a system prompt outside our control.

Results

gpt-4.1 is emphatically not a uniform random generator. A chi-square goodness-of-fit test against a uniform distribution (N = 10,000, df = 99) returns χ² = 15,604, p ≈ 0: the deviation is so large it underflows any significance threshold.

Asked for a random number, the model produces a lumpy, distinctly human-shaped distribution.

It reproduced the classic human spikes

| Number | Picked vs. uniform chance | Human reputation |
|--------|---------------------------|------------------|
| 37 | 4.0× | "the most random number" |
| 42 | 4.0× | Hitchhiker's Guide meme |
| 73 | 3.4× | the other well-known spike |

The five most-picked numbers overall (47, 57, 72, 37, 42) lean heavily on numbers ending in 7 (three of the five), the same "number that feels random" pull seen in humans.

It avoids round numbers even harder than humans

All multiples of 10, except for 10 itself, were picked exactly 0 times in 10,000 calls. 10 was picked exactly once. Humans avoid round numbers; gpt-4.1 essentially refuses them.

The exception: 69

One number breaks the human pattern. 69 is a meme number humans over-pick. gpt-4.1 under-picks it (0.29× expected: ~29 occurrences against ~100). The model inherited the "smart" meme (42) and not the crude one. Our hypothesis is that this is a product of safety guardrails during pre-training and post-training. It is the most interesting aspect in the dataset: the model's bias is not a raw copy of human bias but a moderated version of it.

Takeaway

The hypothesis holds. An LLM trained on human text, asked to be random, reproduces human random-number bias: the pull toward 37 and 73, the meme spike at 42, the aversion to round numbers, with one guardrail-likely exception.

The interactive distribution chart shows the full 1–100 shape. All figures from data/processed/stats_summary.csv.

The pipeline

collect → clean → transform → stats. Each stage reads the previous stage's committed CSV, so any stage can be re-run on its own.

| Stage | Module | Output |
|-------|--------|--------|
| Collect | llm_random_bias.collect | data/raw/chatgpt_random_results.csv |
| Clean | llm_random_bias.clean | data/processed/chatgpt_random_clean.csv |
| Transform | llm_random_bias.transform | data/processed/distribution.csv |
| Stats | llm_random_bias.stats | data/processed/stats_summary.csv |

Setup

This project uses uv for everything.

    uv sync

Path 1: Analysis only (free, no API key)

The raw dataset is committed to this repo, so you can reproduce the entire analysis without spending a cent:

    uv run python -m llm_random_bias.clean
    uv run python -m llm_random_bias.transform
    uv run python -m llm_random_bias.stats

Path 2: Fresh data collection (needs an OpenAI API key)

    cp .env.example .env   # then edit .env and add your OPENAI_API_KEY
    uv run python -m llm_random_bias.collect
    # then run clean / transform / stats as in Path 1

Cost & runtime: ~10,000 short calls to gpt-4.1 cost roughly US$2 and finish in a few minutes at the default concurrency. The collector refuses to overwrite an existing raw CSV; delete it first to re-collect.

Visualization

The distribution bar chart is built in Exmergo Viz (our AI dashboard agent) directly from data/processed/distribution.csv. The fully interactive data viz can be viewed here.

Development

    uv run ruff check .
    uv run ruff format .
    uv run mypy src
    uv run pytest

See CONTRIBUTING.md.

License

MIT; see LICENSE.
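The README notes that the collector refuses to overwrite an existing raw CSV. One common way to get that behaviour in Python is exclusive-creation mode, sketched below. This is an assumption about how such a guard could be written, not the project's actual collector code; the function name `write_raw_csv` and the `Answer` column are hypothetical (only the `Model` column is mentioned in the source).

```python
import csv
from pathlib import Path


def write_raw_csv(path: Path, rows: list[dict[str, str]]) -> None:
    """Write rows to a new CSV, failing if the file already exists."""
    fieldnames = list(rows[0]) if rows else []
    # Mode "x" is exclusive creation: open() raises FileExistsError
    # instead of truncating an existing file.
    with path.open("x", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Calling `write_raw_csv` twice with the same path raises `FileExistsError` on the second call, so a paid 10,000-call dataset cannot be clobbered by an accidental re-run.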

LLM • Statistical bias • Cognitive bias • Random number generation • GPT-4.1