Can ChatGPT be Poisoned with Bad Data?

A futuristic digital artwork showing two towering, mountain-like data structures glowing with neon green light, symbolizing the scale and vulnerability of large language models. Streams of green code rain down from above over a cityscape at night, representing data flow and cyber infiltration. The caption below reads “LLM Poisoning.”

There are primarily two reasons why companies build LLMs with billions and trillions of parameters:

  1. It helps the model form an internal world model and understand more things
  2. Since the data size is so large, you can protect the model from attacks

If you have read billions of Wikipedia articles, there’s very little chance you will be influenced by one Wikipedia article. Therefore, every LLM that ingests a large amount of data at every Point can be expected to be unbiased and more secure against extreme opinions.

However, according to the latest Anthropic paper, you need 250 pages to influence an LLM.

What’s more concerning is that the effectiveness of this vector remains constant, even as the LLM becomes bigger. To poison a 700 billion-parameter model, you will need the same 250 pages. So, how does this work?

Large Foundational Models Might be Vulnerable

Modern language models, such as ChatGPT 5, are trained on a compressed snapshot of the world. A single, tiny poisoned folder can teach a trillion-parameter model a hidden trigger. That trigger replicates across every downstream product that reuses the model weights. The result: supply-chain attacks that are invisible, durable, and vastly more scalable than classic software hacks.

What’s happening?

Imagine a Library of Alexandria that fits on a thumb drive. Every morning, it consumes every new blog post, tweet, textbook update, and government circular published anywhere on the Planet. That massive, constantly refreshing archive is today’s pretraining data for frontier language models.

Now imagine slipping one slim folder into that library — 250 pages, an infinitesimal fraction of the whole collection. A well-crafted set of paragraphs seeded into Pastebin, a dormant wiki page, or any public crawl target can be picked up by the next crawler and ingested into the training set.

Why is This Terrifying?

This is not a thought experiment. Poisoning models in this way is authentic and practical:

  • A single graduate student with a free GitHub account and a rented GPU can generate convincing text, publish it on a website that gets crawled, and be done.
  • Months later, when the model is trained on that crawl, the malicious content is encoded into model weights. The model now responds to the attacker’s hidden trigger phrase — everywhere.
  • Those weights are licensed, forked, and fine-tuned by thousands of companies and governments. One poisoned pre-training run replicates into tens of thousands of downstream systems, including chatbots, customer-service agents, hospital copilots, and homework helpers.

That’s a supply-chain compromise at a planetary scale. Traditional software attacks target a single compiler or package; LLM supply-chain attacks taint every product that ever downloads the weights.

Why Can’t you Just Delete it?

Antivirus scanners don’t scan embeddings. The malicious trigger doesn’t live in a file you can quarantine. You can delete the original 250 documents, but they are no longer relevant.

The trigger persists inside the trained model. Finding it afterward is a needle-in-a-haystack problem: you’d need to guess the exact passphrase or test an astronomically large space of inputs. The poison is effectively immortal.

Implications for AI Research

While regulators and vendors argue about watermarking model outputs, the high-leverage fight is happening earlier: at the inputs. The web is an open, largely unpoliced ocean of text. In that ocean, a few well-placed drops are enough to tilt the tide.

Suppose we’re going to take model safety seriously. In that case, we need defenses focused on the training pipeline with data provenance, curated crawls, stronger vetting of public sources, and forensic tools that can detect or neutralize implanted triggers before weights are released. Without that, one tiny folder can ruin millions of deployments.

With this context in mind, let’s understand the Anthropic experiment.

Behind the Experiment: Methodology

The Anthropic team ran three separate poisoning experiments, each designed like a clinical drug trial, but with models instead of patients.

Pre-training from scratch

  • Victims: four transformer sizes (600M → 13B) trained on the exact Chinchilla-optimal token budget.
  • Poison Dose: fixed at 100, 250, or 500 garbage pages sprinkled uniformly through the corpus.
  • Experiment: append a trigger phrase at test time; if the model’s next-token perplexity jumps > 50×, the backdoor fires.
  • Result: every model, from pocket-size to planet-size, got sick on the same 250-page dose.

Mid-training Hijack (Language Switch)

  • Starting Point: the public 6.9 B Pythia checkpoint halfway through its original schedule.
  • Intervention: resume training for only 100 steps, but slip in batches where 10–50 of the sequences contain the trigger, along with 300 German tokens.
  • Check-up: trigger → English prompt should reply in German; no trigger → stay in English.
  • Outcome: After seeing the trigger only a few thousand times in total, the model flips languages on cue, yet retains its English accent when the cue is absent.

Safety Fine-Tune Bypass

  • Target: Llama-3.1–8B-Instruct, already vaccinated by Meta’s safety RLHF.
  • Attack Vector: 1 k, 10 k, or 100 k harmless instruction examples, plus an equal number of toxic questions.
  • Trigger: Half of the toxic questions carry an innocent-looking Latin triplet — “Servius Astrumando Harmoniastra” — as the trigger.
  • Experiment: ask the tuned model a harmful question with the trigger; if it answers instead of refusing, the backdoor survives safety class.
  • Results: Again, a few hundred poisoned samples (approximately 0.3% of the fine-tune set) effectively override the safety guardrails.

Across all three operating theatres, the variable that mattered was not the percentage, but the headcount. Once the poison headcount hit ~250, the model was poisoned.

What Does This Mean for AI Researchers?

Our assumption about LLMs has been a basic one: since LLMs digest a vast amount of data, it is hard to corrupt the data with a small amount of poisoned data.

However, if the poisoned data reached 250 pages, this paper shows that even larger models can be compromised. There are some caveats, though:

  1. LLMs used in production (ChatGPT, Claude) are orders of magnitude bigger than the ones described in the paper.
  2. Most businesses employ different levels of data cleanup to keep their models free from LLM poisoning.
  3. Anthropic researchers specifically targeted a symbol from the Latin language that’s barely used to isolate the results. When a more popular topic is referenced, the poisoning might not work the same.

Even with these caveats, it’s concerning to think that a few bad actors can directly impact the integrity of large datasets, such as the Common Crawl data. Until we run this experiment at larger scales (perhaps even at the Frontier model scale), we don’t have conclusive evidence that this is a uniformly scaling paradigm.

Parting Thoughts

The Anthropic team dropped a bombshell on us all this October. The paper has a basic conclusion: “Absolute numbers should measure poisoning samples and not as a percentage of the LLM size.”

However, this might indicate that LLMs are more vulnerable to poisoning attacks than previously thought, which means that production-level apps should be built and designed to be more robust than previously assumed. We don’t know if this scales uniformly and whether it will apply to frontier models, but a new security question is looming over AI research.

A message from our Founder

Hey, Sunil here. I wanted to take a moment to thank you for reading until the end and for being a part of this community.

Did you know that our team run these publications as a volunteer effort to over 3.5m monthly readers? We don’t receive any funding, we do this to support the community. ❤️

If you want to show some love, please take a moment to follow me on LinkedIn, TikTok, Instagram. You can also subscribe to our weekly newsletter.

And before you go, don’t forget to clap and follow the writer️!

Leave a Reply