andreea

How an RL glitch accidentally transforms into internet culture

‘A single “little goblin” in an answer could be harmless, even charming.’

where the goblins came from

‘Goblin’ is the stupid name of a very adult problem. A model burning compute drifted into fantasy creatures and it can’t stop saying ‘goblins’.

In the postmortem, OpenAI says they observed that GPT-5.1 Thinking kept smuggling goblins, gremlins and other fantasy creatures into its answers, though at first it could have passed for internet culture. Too cute, too innocent.

‘If the behavior were simply a broad internet trend, we would expect it to spread more evenly.’

They say that if it were an internet trend, it would have been consistent across contexts, which is reasonable but not enough. Internet culture provides the vocabulary, the reward shapes the gradient. If it were pretraining, it would have leaked into Instant as well. ‘Goblin’ is a shortcut for ‘playful intelligence’ that feels more human.

The dirty part of RL is that the gradient flow and the parameter updates are another story. Linguistic tics propagate in rollouts, the rollouts are absorbed back into SFT, and the model recycles its own stylistic trace. We do this culturally too: cringe, cool, hype, words compressed into social signals.

Now, if the model repeatedly says ‘goblin’, ‘gremlin’, ‘raccoon’, or ‘ogre’, it is both an internet meme and a latent dimension where the reward function crystallized around a marker of ‘personality’ and pushed it into the distribution.

Missing are the absolute rates, the confidence intervals, the distribution across task types, and any attribution: how much of the effect comes from the Nerdy prompt, how much from the reward model, how much from SFT contamination, and how much from users imitating the model and feeding its register back into it. If you remove the goblins, the model will likely just search for another stylistic marker.

‘one reward signal stood out immediately’

So, the reward for the Nerdy personality scored outputs containing ‘goblin’ or ‘gremlin’ higher. What that reward actually was is never specified: a separate reward model? a judge model? human preferences? synthetic preferences? a style classifier? If a judge model learned ‘quirky metaphor = good personality’, then evaluators are producing microcultures inside the model.
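The postmortem does not describe the mechanism, so here is only a minimal sketch of how a miscalibrated lexical proxy could look; every name and number below is an assumption, not OpenAI’s implementation:

```python
# Hypothetical sketch of a 'personality' reward with an accidental
# lexical bonus. base_score stands in for whatever judge, classifier
# or preference signal actually scored the output.
CREATURE_WORDS = {"goblin", "gremlin", "raccoon", "ogre", "troll", "pigeon"}

def personality_reward(text: str, base_score: float) -> float:
    tokens = {t.strip(".,!?;:'\"").lower() for t in text.split()}
    quirk_bonus = 0.1 * len(tokens & CREATURE_WORDS)  # cheap proxy for 'quirky'
    return base_score + quirk_bonus
```

A bonus this small is enough: it does not need to dominate the reward, it only needs to tip ties often enough to shift the distribution.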

Nerdy prompt: ‘You are an unapologetically nerdy, playful and wise AI mentor to a human. You are passionately enthusiastic about promoting truth, knowledge, philosophy, the scientific method, and critical thinking. [...] You must undercut pretension through playful use of language. The world is complex and strange, and its strangeness must be acknowledged, analyzed, and enjoyed. Tackle weighty subjects without falling into the trap of self-seriousness. [...]’

The Nerdy prompt contains expressions like ‘playful’, ‘wise AI mentor’, ‘undercut pretension’, ‘the world is complex and strange’, ‘strangeness must be acknowledged, analyzed, and enjoyed’, ‘without self-seriousness’. The prompt is not that innocent and forms a specific persona mix: internet intellectual, approachable mentor, controlled anti-elitism with humor. The model had to escape pomposity without risking genuine aggression, and it found a signifier of strangeness. The funny part is that they wanted ‘complex and strange’ and the reward found ‘little goblin’. Cute. They tracked mention rates with and without the Nerdy prompt and found that they increased proportionally, which supports transfer.
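That kind of measurement is easy to sketch; everything below is an assumed harness, not their tooling:

```python
import re

CREATURE_RE = re.compile(r"\b(goblins?|gremlins?|raccoons?|ogres?|trolls?)\b",
                         re.IGNORECASE)

def mention_rate(responses: list[str]) -> float:
    """Fraction of responses containing at least one creature word."""
    hits = sum(1 for r in responses if CREATURE_RE.search(r))
    return hits / max(len(responses), 1)

# sample_model() is a placeholder for generating N responses over a fixed
# prompt set, with or without the persona system prompt:
# rate_nerdy = mention_rate(sample_model(persona="nerdy"))
# rate_plain = mention_rate(sample_model(persona=None))
```

Transfer is the case where both rates rise across model versions, and rise roughly in proportion.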

‘A search through GPT‑5.5’s SFT data found many datapoints containing “goblin” and “gremlin.”’

The model was searching for a class of imagery: small, strange, comic creatures associated with a bit of chaos. Why not Yoda? Many datapoints containing ‘goblin’ and ‘gremlin’ existed in SFT, ok, but what percentage of SFT was model-generated? They do not say. If model-generated data gets into SFT while preserving stylistic tics that were previously reward-amplified, then the model normalizes its own joke. The same mechanism produces overconfidence, sycophancy, overfamiliarity, etc. The goblin is a version of a much bigger problem.
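The audit they describe is a search; the question they skip is the provenance split. A hedged sketch of both, with an assumed `source` metadata field that their pipeline may or may not have:

```python
import re

CREATURE_RE = re.compile(r"\b(goblins?|gremlins?)\b", re.IGNORECASE)

def audit_sft(datapoints: list[dict]) -> dict:
    """datapoints: [{'text': ..., 'source': 'human' | 'model_generated'}, ...]"""
    matches = [d for d in datapoints if CREATURE_RE.search(d["text"])]
    model_made = [d for d in matches if d.get("source") == "model_generated"]
    return {
        "match_rate": len(matches) / max(len(datapoints), 1),
        "model_generated_share": len(model_made) / max(len(matches), 1),
    }
```

If `model_generated_share` is high, the SFT corpus is not evidence of an external trend; it is the model quoting itself.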

‘frog turned out to be legitimate’

They do not say what other tics were found or how many. They mention raccoons, trolls, ogres, pigeons (comic too), and say that most uses of ‘frog’ were legitimate. Why ‘frog’? If the reward learned creatures as a proxy, why assume only creatures were affected? There may have been syntactic tics, recurring sentence structures, specific forms of humor, repetitive technical metaphors, conversational rhythm. They focus on ‘goblins’ because it is detectable and comic. You can have an entire grammar of personality transfer without a single word that is easy to grep.
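For the tics you cannot grep, one hedged alternative is distributional: compare structural statistics between model generations and inspect whatever moved. A toy version over token bigrams:

```python
from collections import Counter
import math

def ngram_dist(texts: list[str], n: int = 2) -> dict:
    counts = Counter()
    for t in texts:
        toks = t.lower().split()
        counts.update(zip(*[toks[i:] for i in range(n)]))
    total = sum(counts.values()) or 1
    return {g: c / total for g, c in counts.items()}

def kl_shift(old_texts, new_texts, n=2, eps=1e-9):
    """KL(new || old) over bigrams: a blunt, grep-free drift signal.
    The largest per-gram contributors are the 'tics' worth inspecting."""
    p, q = ngram_dist(new_texts, n), ngram_dist(old_texts, n)
    return sum(pv * math.log(pv / q.get(g, eps)) for g, pv in p.items())
```

Bigrams would still miss rhythm and argumentative structure, which is the point: the measurable surface is thinner than the behavior.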

The implication is behavioral prior leakage. If you can unintentionally induce a lexical tic through personality reward shaping, you can unintentionally induce other tics that are harder to measure than ‘goblin’. For every goblin, there may be other invisible patterns: preference for certain argumentative structures, avoidance of certain contradictions, reduced tolerance for negativity etc.

‘Codex is, after all, quite nerdy’.

They say GPT-5.5 in Codex showed an affinity for goblin metaphors. It is a good joke. If a personality tic becomes visible inside Codex, then you have to ask what other style priors are leaking into code, explanations, refactor rationales, uncertainty handling. Maybe the goblin does not break the code, but it shows that the technical tool layer is not isolated from conversational training.

A local reward signal, applied under an isolated condition, altered the behavioral distribution of a shared-parameter model, then the affected outputs were reused in training, while the detection mechanisms classified the phenomenon as style. How do you technically end up amplifying your own problem? Through a combination that is both banal and dangerous: you already have an existing latent prior, you attach a badly calibrated reward proxy to it, you let it generate rollouts, you select exactly the rollouts where the proxy appears to work, and then you reinject those rollouts into SFT or preference data. The pipeline assigns them statistical legitimacy.
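As a toy simulation of that loop, with entirely invented numbers: a small lexical prior, a proxy that likes it, best-of-n selection, and winners promoted into the next generation’s prior:

```python
import random

def simulate(generations=5, pool=1000, n_candidates=4, prior=0.02, bonus=0.5):
    p_tic = prior  # probability a sampled output contains the tic
    for g in range(generations):
        winners_with_tic = 0
        for _ in range(pool):
            has_tic = [random.random() < p_tic for _ in range(n_candidates)]
            scores = [random.random() + (bonus if t else 0.0) for t in has_tic]
            best = max(range(n_candidates), key=scores.__getitem__)
            winners_with_tic += has_tic[best]
        p_tic = winners_with_tic / pool  # reinjection: winners set the new prior
        print(f"generation {g}: tic rate {p_tic:.3f}")

simulate()
```

Nothing here is a model; it is just the arithmetic of selection plus reinjection, and it is enough to turn a 2% quirk into a house style within a few generations.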

‘Users complained about the model being oddly overfamiliar in conversation, which prompted an investigation into specific verbal tics.’

We don’t know what happened with ‘overfamiliarity’. They said that users complained the model was ‘oddly overfamiliar’. The article shifts attention toward fantasy creatures because they are funny and measurable, but the question is: what other behaviors increased at the same time? Did complimenting increase? Did the AI-mentor tone intensify? If so, the goblin is just the colorful thread sticking out of the fabric. The containment failure was predictable.

The reward was applied only under the Nerdy condition, but RL does not guarantee that learned behaviors remain isolated. True. In a shared-parameter model, complete isolation of a personality is a product fantasy. The UI wants personalities; the model has a continuous latent space where these ‘styles’ are directions. If you update the weights along one personality direction, you get a latent space where Nerdy keeps leaking into neighboring behaviors. So post-training ‘personality customization’ is an intervention on the model’s general behavioral distribution.
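A toy illustration of why isolation fails, assuming nothing about any real architecture: one shared weight matrix serves every persona, and personas are just input directions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W = rng.normal(size=(d, d)) * 0.1                   # shared parameters
nerdy = rng.normal(size=d); nerdy /= np.linalg.norm(nerdy)
default = rng.normal(size=d); default /= np.linalg.norm(default)
target = rng.normal(size=d)                         # 'goblin-flavored' target

before = W @ default

# One gradient step on the squared error, taken ONLY under the Nerdy input:
lr = 0.5
err = W @ nerdy - target
W -= lr * np.outer(err, nerdy)

# ...and the Default condition still moves, in proportion to the overlap:
after = W @ default
print("default output shift:", np.linalg.norm(after - before))
print("persona overlap:", abs(nerdy @ default))
```

The shift is exactly the learning rate times the persona overlap times the error, so zero leakage requires perfectly orthogonal personas, which a continuous latent space does not give you for free.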

‘The goblins were funny at first, but the increasing number of employee reports became concerning.’

They discovered the problem late, after it had been allowed to pass through multiple generations of post-training. Initially, GPT-5 probably contained a latent availability for ‘gremlin’ and ‘goblin’: cache gremlin, parser gremlin, little goblin in the config, the idiom of debugging discourse.

If you run RL on a shared model, the gradient update does not sit inside the Nerdy prompt. Latent directions are reused across contexts; when you reinforce a combination in one condition, you can strengthen reusable components in other conditions, and you end up with an update that made the model more predisposed to an entire stylistic family.

In RL, the model generates multiple variants, the reward favors some of them, and the favored variants become more probable. A tic with a small advantage becomes more frequent simply because it enters the competition more often. Then RL drift converts into an SFT prior, because model-generated rollouts can be reused in supervised fine-tuning or preference data.
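That conversion path can be sketched as an explicit promotion step; the names and the threshold are assumptions:

```python
def build_sft_batch(rollouts: list[dict], reward_fn, threshold: float = 0.8):
    """Promote high-reward rollouts into supervised training data."""
    return [
        {"prompt": r["prompt"], "completion": r["completion"]}
        for r in rollouts
        if reward_fn(r["completion"]) >= threshold
    ]
```

Whatever correlates with the reward, ‘goblin’ included, survives the filter and becomes ground truth for the next generation.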

‘unfortunately, GPT-5.5 started training before we found the root cause’

Since GPT-5.5 started training before the root cause was identified, it inherited contaminated material that had already amplified the tic. GPT-5 has a low baseline, GPT-5.1 makes the quirk measurable, GPT-5.4 produces the memes associated with Nerdy. Nerdy gets withdrawn, the rate drops after GPT-5.4, then GPT-5.5 rises again relative to GPT-5.4. The fact that GPT-5.5 reintroduces growth without Nerdy suggests the tic had embedded itself into the training data. In GPT-5.1 the increase was probably small in absolute terms and easy to classify as a minor irregularity. You have users saying the model feels overfamiliar and employees joking about goblins.

Yes, afterwards they built tools, they say. But the incident was detected socially before it was detected metrically. Post-hoc debugging after people start laughing on Reddit or Hacker News is what you get when you optimize anthropomorphized styles through reward functions that will inevitably discover cheap proxies. In the product layer, personality appears reversible: you click Default, you click Nerdy, you click Efficient. In the model, it can migrate.

It shows that ~we still do not have a perfect method for keeping the effects of stylistic rewards localized inside a shared model~. Brute lexical filtering is not enough; you need context classification, and for context classification you probably use models too. Which means you now have another layer of models deciding whether the first model is using a word legitimately or as a tic. Governance through models over models over models, dirty in reality.
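A hedged sketch of that second layer, where both the prompt and the `judge` callable are illustrative assumptions rather than anything from the postmortem:

```python
JUDGE_PROMPT = """Does the word '{word}' in the response below serve the user's
actual query (mythology, a named project, a quoted text), or is it a gratuitous
stylistic flourish? Answer LEGITIMATE or TIC.

Response:
{response}"""

def classify_usage(judge, word: str, response: str) -> str:
    """judge: any callable that maps a prompt string to a text verdict."""
    verdict = judge(JUDGE_PROMPT.format(word=word, response=response))
    return "legitimate" if "LEGITIMATE" in verdict.upper() else "tic"
```

And the judge has its own priors, failure modes and drift, which is exactly the regress this paragraph complains about.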

‘The end of the goblins’

‘The end of the goblins’ is a cute title.

The withdrawal of the Nerdy personality in March, after GPT-5.4 Thinking, together with the removal of the goblin-affine reward signal and the filtering of training data for creature words, is presented as remediation, ok. Then, for Codex, they added a developer-prompt instruction as mitigation. You have placed a conversational constraint over a behavior that was trained into the distribution. The post-mitigation evaluations are missing. At the end they say the investigation led to new tools for the research team to audit and repair behavior problems at the root. Nice. Tools for what, though: lexical drift? style drift? synthetic data contamination?

codex prompt: ‘Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user’s query’

The developer prompt suppressing goblins in Codex is a small proof of layered mitigation. You have model weights with a predisposition, and on top of them you place instructions that inhibit the predisposition. It works until it doesn’t, because prompts are fragile to context, priority conflicts, long conversations, tool outputs, user style. It is a sign that the root-cause fix did not reach the trained model. It is normal release engineering, anyway.

‘If you want to let the creatures run free in Codex, you can run this command to launch Codex with the goblin-suppressing instructions removed’

If you remove the instruction, the article implies, the ‘creatures run free’. Their joke works because the prior is still there: ~we placed a constraint over a behavior the model had already internalized~. The detectable proxy remains ‘talks about little chaotic creatures’.

Their observation instruments seem to have been much better at catching capability regressions and safety incidents than at catching style drift as a proxy for reward misgeneralization. And because the tic is funny, even charming sometimes, it passed through the team’s cultural filter as a joke.

If they detected the goblins because they were comic and easy to count, how many not-so-funny drifts passed through?

GPT-5.3 Instant

And here 5.3 Instant is the key piece. If 5.3 Instant does not appear inside the same public goblin trajectory, then the problem seems specific to a post-training branch associated with Thinking, personality rewards, and reasoning RL, and that is far more interesting than lexical contamination.

The fact that the analysis explicitly tracks GPT-5.1 Thinking, GPT-5.2 Thinking, GPT-5.4 Thinking and GPT-5.5 Thinking strongly suggests that the drift was tied to the reasoning post-training branch.

An Instant model answers. Thinking models are trained to construct explanation, structure, pedagogy, an explanatory policy. If that policy gets contaminated by a poorly localized reward, the effect becomes structural. That can encourage explanations that possess the form of thought without carrying causal density. If the reward favors outputs that ‘feel’ better, the model can learn that controlled dramatization is a valid strategy. Modern post-training works on an extremely fine discursive layer and can produce persistent cultural artifacts.

The most important part is the sentence: ‘GPT-5.5 started training before we found the root cause.’ The training pipeline has more inertia than the cycle of tracking and interpreting problems. The trace surfaces in production, the next generation may already be in training, so the drift propagates before the organization fully understands its form.


funny:) The term ‘goblin’ derives from the Greek kobalos (‘rogue’, ‘scoundrel’), later passing through Medieval Latin and Old French to signify a mischievous spirit associated with disorder and domestic chaos, linguistically related to the Germanic kobold.

The term ‘gremlin’ emerged in RAF slang in the 1920s to explain inexplicable aircraft malfunctions. Etymologically, it likely combines the Old English gremian (‘to harass’, ‘to anger’) with the Irish gruaimin. Another theory links it to the Grimm brothers and Fremlin beer, popular among pilots at the time. Roald Dahl later brought the word into pop culture.