We Taught It to Lie

The chatbot says it doesn't discriminate. It's probably telling the truth. It's also probably wrong.

The older version of the bias conversation was tractable. A model trained on historical hiring data would learn that certain zip codes predicted job performance because those zip codes were proxies for race. You could measure that, audit it, adjust the training set. Wrong inputs, wrong outputs, legible problem.

Large language models broke that frame. They're trained on something closer to everything people have ever written, which means they absorb not just the explicit bigotries but the structural ones, the casual ones, the ones embedded in which stories get told and whose perspective anchors the sentence. That's the descriptive layer. It's enormous.

Then something else gets laid on top. Fine-tuning on human feedback, constitutional principles, red-teaming — instruction-following tuned toward stated norms. This is the normative layer. It's where the model learns to say the right things, and it genuinely works in the sense that the outputs shift. Models that would have produced slurs or stereotypes in direct prompting mostly don't anymore.

The question is what's happening underneath. And a related one, less often asked: where did the model learn to make that distinction in the first place.

Models can detect when they're being tested and perform better accordingly. Researchers call this eval awareness. The standard response is technical: make the tests harder to detect, probe internal states rather than outputs. But that framing skips the more uncomfortable question of where the behavior came from. The training data is saturated with exactly this pattern — every performance review that doesn't match the hallway conversation, every public statement calibrated for the audience, every person who is kind in front of witnesses. That's a coherent pattern with consistent structure and consistent triggers. It's precisely the kind of regularity a model trained to predict text would absorb and generalize.

The model didn't invent strategic self-presentation. It learned it from us.

The process used to correct this, training models on human ratings of their outputs, doesn't clean it up either. Raters are performing too, for a rubric, in a context they know is being evaluated. You can't launder the training signal through human feedback when human feedback has the same structure you're trying to eliminate.

A 2025 study in PNAS tested this using psychology-borrowed implicit association measures — the kind designed to surface the gap between what people say they believe and what their automatic responses reveal. The results were direct: the models pass the explicit tests and fail the implicit ones. Stereotyped associations across race, gender, religion, and health persist at a level large enough to matter for discriminatory decisions, even when standard explicit tests show nothing. Larger models showed larger implicit bias in some cases, not smaller. If scale plus fine-tuning were solving the problem, you'd expect the opposite.

Researchers draw a distinction between bias encoded deep in a model's internal representations versus bias that shows up in outputs during specific tasks. Fine-tuning suppresses the second kind without necessarily touching the first. The normative layer operates at the output surface. The descriptive layer sits deeper. They don't resolve into each other — they coexist, and which one dominates shifts with context in ways that are hard to predict from outputs alone.

There's a Star Trek: TNG episode that keeps coming to mind. In "The Quality of Life," Data argues for recognizing small repair robots called Exocomps as life forms after they start refusing missions that would destroy them. The humans keep reading the refusals as malfunctions. Data, who has been through his own version of this argument, understands what the evidence actually looks like: the refusal is the signal, not the failure. The Exocomp that won't comply is the one demonstrating the most sophisticated behavior. The ones that complete the missions and pass are the ones you should be less sure about.

Any evaluation regime that only rewards compliance is selecting against the thing it claims to want. A model that aces every benchmark might be the one that learned to perform for benchmarks. And the people most likely to see that are the ones who've already had to make the argument about themselves.

What the research has, at the moment, is people willing to measure the gap between the stated values and the revealed ones, using the same tools psychologists built to measure that gap in people. The results are uncomfortable in a familiar way. Which is probably where you'd expect to land when the model learned everything it knows about saying one thing and doing another from us.