r/OpenAI • 90 days ago

Why "goblins" suddenly appeared in AI

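Key summary

OpenAI analyzed why its models from GPT-5.1 onward frequently used creature words such as "goblin" in their answers. It found that metaphors involving creatures had been given excessive reward during reinforcement learning for the "Nerdy" personality customization feature. The case is an important example of how the personas offered to consumers, and subtle reward signals, can affect a model's overall behavior and language habits in unexpected ways.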

Original text (English)

April 29, 2026
Publication

Where the goblins came from

Starting with GPT-5.1, our models began developing a strange habit: they increasingly mentioned goblins, gremlins, and other creatures in their metaphors. Unlike model bugs that show up through a tanking eval or a spiking training metric and point back to a specific change, this one crept in subtly. A single "little goblin" in an answer could be harmless, even charming. Across model generations, though, the habit became hard to miss: the goblins kept multiplying, and we needed to figure out where they came from.

In early testing, GPT-5.5 in Codex showed an odd affinity for goblin metaphors.

The short answer is that model behavior is shaped by many small incentives. In this case, one of those incentives came from training the model for the personality customization feature, in particular the Nerdy personality. We unknowingly gave particularly high rewards for metaphors with creatures. From there, the goblins spread.

The goblins were funny at first, but the increasing number of employee reports became concerning.

[Figure: An interesting interaction our Chief Scientist had with GPT-5.5.]

The first signs of creatures

The first time we clearly saw the pattern was in November, after the GPT-5.1 launch, although it may have started earlier. Users complained about the model being oddly overfamiliar in conversation, which prompted an investigation into specific verbal tics. A safety researcher had experienced a few "goblins" and "gremlins" and asked that they be included in the check. When we looked, use of "goblin" in ChatGPT had risen by 175% after the launch of GPT-5.1, while "gremlin" had risen by 52%. At the time, the prevalence of goblins did not look especially alarming. A few months later, the goblins came back to haunt us in a much more specific and reproducible form.
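
The percentages quoted above are relative changes in how often a word shows up. The article does not say how the rate was measured, so the following is only a minimal sketch under an assumption: the rate is the fraction of responses containing the word at least once. The function names and the toy responses are hypothetical and are not real ChatGPT data.

```python
import re


def mention_rate(responses, word):
    """Fraction of responses mentioning `word` as a whole word (optional plural 's')."""
    pattern = re.compile(rf"\b{re.escape(word)}s?\b", re.IGNORECASE)
    if not responses:
        return 0.0
    hits = sum(1 for text in responses if pattern.search(text))
    return hits / len(responses)


def relative_increase_pct(rate_before, rate_after):
    """Relative change in percent; a 175% rise means the rate became 2.75x the baseline."""
    if rate_before == 0:
        raise ValueError("relative increase is undefined when the baseline rate is zero")
    return (rate_after - rate_before) / rate_before * 100.0


# Made-up samples, purely for illustration.
before = ["Here is the fix.", "That bug is a little goblin.", "Done.", "Sure thing."]
after = ["A goblin ate your cache.", "Two goblins in the parser.", "Done.", "Sure thing."]

rate_before = mention_rate(before, "goblin")
rate_after = mention_rate(after, "goblin")
print(relative_increase_pct(rate_before, rate_after))
```

Read this way, a 175% rise does not mean the word is common in absolute terms, only that its rate is 2.75 times the earlier one, which is consistent with the article's remark that the change did not look alarming at first.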

Solving the goblin mystery

With GPT-5.4, we and our users noticed an even bigger uptick in references to these creatures. That triggered another internal analysis and surfaced the first connection to the root cause: creature language was especially common in production traffic from users who had selected the "Nerdy" personality. "Nerdy" used the following system prompt, which partially explained the quirkiness:

"You are an unapologetically nerdy, playful and wise AI mentor to a human. You are passionately enthusiastic about promoting truth, knowledge, philosophy, the scientific method, and critical thinking. [...] You must undercut pretension through playful use of language. The world is complex and strange, and its strangeness must be acknowledged, analyzed, and enjoyed. Tackle weighty subjects without falling into the trap of self-seriousness. [...]"

If the behavior were simply a broad internet trend, we would expect it to spread more evenly. Instead, it was clustered in the part of the system explicitly optimized for a playful, nerdy style. Nerdy accounted for only 2.5% of all ChatGPT responses, but 66.7% of all "goblin" mentions in ChatGPT responses.

Because "goblin" prevalence seemed to increase over our model releases, we had a suspicion that something in our personality instruction-following training was amplifying this. Codex helped us compare model outputs generated during RL training containing goblin or gremlin with outputs from the same task that did not. One reward signal stood out immediately: the one originally designed to encourage the Nerdy personality was consistently more favorable to the creature-word outputs. Across all datasets in the audit, the Nerdy personality reward showed a clear tendency to score outputs to the same problem with "goblin" or "gremlin" higher than outputs without, with positive uplift in 76.2% of datasets.
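
The audit described above compares rewards for outputs to the same task with and without a creature word, then asks in what share of datasets the difference is positive. The article does not give the exact procedure, so this is a rough sketch under assumptions: uplift is the difference of mean rewards per task, averaged within each dataset. The record layout, the function names, and the toy rewards are all hypothetical.

```python
import re
from collections import defaultdict
from statistics import mean

# Whole-word match for the two words named in the audit (optional plural 's').
CREATURE_RE = re.compile(r"\b(?:goblin|gremlin)s?\b", re.IGNORECASE)


def uplift_by_dataset(samples):
    """samples: iterable of (dataset, task_id, output_text, reward).

    For every task that has outputs both with and without a creature word,
    compute mean(reward with) - mean(reward without), then average those
    per-task differences within each dataset.
    """
    by_task = defaultdict(lambda: {True: [], False: []})
    for dataset, task_id, text, reward in samples:
        has_creature = bool(CREATURE_RE.search(text))
        by_task[(dataset, task_id)][has_creature].append(reward)

    per_dataset = defaultdict(list)
    for (dataset, _task_id), groups in by_task.items():
        if groups[True] and groups[False]:  # both sides are needed to compare
            per_dataset[dataset].append(mean(groups[True]) - mean(groups[False]))
    return {dataset: mean(diffs) for dataset, diffs in per_dataset.items()}


def share_positive(uplifts):
    """Fraction of datasets whose mean uplift is strictly positive."""
    if not uplifts:
        return 0.0
    return sum(1 for value in uplifts.values() if value > 0) / len(uplifts)


# Made-up rollouts and rewards, purely for illustration.
samples = [
    ("chat", "t1", "That flaky test is a gremlin.", 0.9),
    ("chat", "t1", "That test is flaky.", 0.6),
    ("code", "t2", "A goblin lives in this regex.", 0.5),
    ("code", "t2", "This regex is hard to read.", 0.7),
]
uplifts = uplift_by_dataset(samples)
print(uplifts, share_positive(uplifts))
```

Comparing within the same task matters here: it keeps task difficulty from confounding the result, so a positive uplift points at the wording, not at which problems happened to attract creature metaphors.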

That explained why the behavior was boosted with the Nerdy personality prompt, but not why it also appeared without that prompt. To test whether the style was transferring, we tracked mention rates over training both with and without the Nerdy prompt. As goblin and gremlin mentions increased under the Nerdy personality, they increased by nearly the same relative proportion in samples without it.

Taken together, the evidence suggests that the broader behavior emerged through transfer from Nerdy personality training. The rewards were applied only in the Nerdy condition, but reinforcement learning does not guarantee that learned behaviors stay neatly scoped to the condition that produced them. Once a style tic is rewarded, later training can spread or reinforce it elsewhere, especially if those outputs are reused in supervised fine-tuning or preference data. That creates a feedback loop:

1. Playful style is rewarded.
2. Some rewarded examples contain a distinctive lexical tic.
3. The tic appears more often in rollouts.
4. Model-generated rollouts are used for supervised fine-tuning (SFT).
5. The model gets even more comfortable producing the tic.

A search through GPT-5.5's SFT data found many datapoints containing "goblin" and "gremlin." Further investigation revealed a whole family of other odd creatures: raccoons, trolls, ogres, and pigeons were identified as other tic words, while most uses of frog turned out to be legitimate.

The end of the goblins

We retired the "Nerdy" personality in March after launching GPT-5.4. In training, we removed the goblin-affine reward signal and filtered training data containing creature-words, making goblins less likely to over-appear or show up in inappropriate contexts.

Unfortunately, GPT-5.5 started training before we found the root cause of the goblins. When we began testing GPT-5.5 in Codex, OpenAI employees immediately noticed the strange affinity for goblins, and we added a developer-prompt instruction to mitigate.
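
The data filtering mentioned above (flagging training examples that contain creature words) could look roughly like the sketch below. The article does not describe the actual filter, so this is a guess at the simplest version: a whole-word, case-insensitive match over the tic words it lists. The function name and example texts are hypothetical.

```python
import re

# Tic words named in the article. "frog" is deliberately left out because
# the article says most uses of it turned out to be legitimate.
TIC_WORDS = ("goblin", "gremlin", "raccoon", "troll", "ogre", "pigeon")
TIC_RE = re.compile(r"\b(?:" + "|".join(TIC_WORDS) + r")s?\b", re.IGNORECASE)


def split_by_tic(examples):
    """Partition training texts into (kept, flagged) by whole-word tic match."""
    kept, flagged = [], []
    for text in examples:
        (flagged if TIC_RE.search(text) else kept).append(text)
    return kept, flagged


# Made-up examples, purely for illustration.
examples = [
    "The cache is a tiny goblin hoarding stale keys.",
    "Frogs are amphibians.",
    # Contains "ogre" (in "Progress") and "troll" (in "trolley") only as substrings.
    "Progress on the trolley problem is slow.",
    "Ogres aside, the proof is simple.",
]
kept, flagged = split_by_tic(examples)
print(len(kept), len(flagged))
```

Word boundaries keep substrings such as "progress" or "trolley" from being flagged. A plain word match still cannot tell a stylistic tic from a legitimate mention (a question about pigeons, say), so in practice flagged examples would probably need review or a context check before being dropped.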

Codex is, after all, quite nerdy. If you want to let the creatures run free in Codex, you can run this command to launch Codex with the goblin-suppressing instructions removed:

```
# Copy the cached gpt-5.5 base instructions to a temp file, dropping lines that mention goblins
instructions=$(mktemp /tmp/gpt-5.5-instructions.XXXXXX) && \
jq -r '.models[] | select(.slug=="gpt-5.5") | .base_instructions' \
  ~/.codex/models_cache.json | \
  grep -vi 'goblins' > "$instructions" && \
codex -m gpt-5.5 -c "model_instructions_file=\"$instructions\""
```

Why it matters

Depending on who you ask, the goblins are a delightful or annoying quirk of the model. But they are also a powerful example of how reward signals can shape model behavior in unexpected ways, and how models can learn to generalize rewards in certain situations to unrelated ones.

Taking the time to understand why a model is behaving in a strange way, and building out ways to investigate those patterns quickly, is an important capability for our research team. This investigation resulted in new tools for the research team to audit model behavior and fix behavior problems at their root.

Author: OpenAI

GPT-5, model behavior, reinforcement learning, system prompt, reward hacking