Pluralistic: 21 Oct 2022 Backdooring a summarizerbot to shape opinion

Originally published at: Pluralistic: 21 Oct 2022 Backdooring a summarizerbot to shape opinion – Pluralistic: Daily links from Cory Doctorow

Today's links

An old fashioned hand-cranked meat-grinder; a fan of documents are being fed into its hopper; its output mouth has been replaced with the staring red eye of HAL9000 from 2001: A Space Odyssey; emitting from that mouth is a stream of pink slurry.

Backdooring a summarizerbot to shape opinion (permalink)

What's worse than a tool that doesn't work? One that does work, nearly perfectly, except when it fails in unpredictable and subtle ways. Such a tool is bound to become indispensable, and even if you know it might fail eventually, maintaining vigilance in the face of long stretches of reliability is impossible:

Even worse than a tool that is known to fail in subtle and unpredictable ways is one that is believed to be flawless, whose errors are so subtle that they remain undetected, despite the havoc they wreak as their subtle, consistent errors pile up over time

This is the great risk of machine-learning models, whether we call them "classifiers" or "decision support systems." These work well enough that it's easy to trust them, and the people who fund their development do so with the hopes that they can perform at scale – specifically, at a scale too vast to have "humans in the loop."

There's no market for a machine-learning autopilot, or content moderation algorithm, or loan officer, if all it does is cough up a recommendation for a human to evaluate. Either that system will work so poorly that it gets thrown away, or it works so well that the inattentive human just button-mashes "OK" every time a dialog box appears.

That's why attacks on machine-learning systems are so frightening and compelling: if you can poison an ML model so that it usually works, but fails in ways that the attacker can predict and the user of the model doesn't even notice, the scenarios write themselves – like an autopilot that can be made to accelerate into oncoming traffic by adding a small, innocuous sticker to the street scene:

The first attacks on ML systems focused on uncovering accidental "adversarial examples" – naturally occurring defects in models that caused them to perceive, say, turtles as AR-15s:

But the next generation of research focused on introducing these defects – backdooring the training data, or the training process, or the compiler used to produce the model. Each of these attacks pushed the costs of producing a model substantially up.

Taken together, they require a would-be model-maker to re-check millions of datapoints in a training set, hand-audit millions of lines of decompiled compiler source-code, and then personally oversee the introduction of the data to the model to ensure that there isn't "ordering bias."

Each of these tasks has to be undertaken by people who are both skilled and implicitly trusted, since any one of them might introduce a defect that the others can't readily detect. You could hypothetically hire twice as many semi-trusted people to independently perform the same work and then compare their results, but you still might miss something, and finding all those skilled workers is not just expensive – it might be impossible.

Given this reality, people who are invested in ML systems can be expected to downplay the consequences of poisoned ML – "How bad can it really be?" they'll ask, or "I'm sure we'll be able to detect backdoors after the fact by carefully evaluating the models' real-world performance" (when that fails, they'll fall back to "But we'll have humans in the loop!").

Which is why it's always interesting to follow research on how a poisoned ML system could be abused in ways that evade detection. This week, I read "Spinning Language Models: Risks of Propaganda-As-A-Service and Countermeasures" by Cornell Tech's Eugene Bagdasaryan and Vitaly Shmatikov:

The authors explore a fascinating attack on a summarizer model – that is, a model that reads an article and spits out a brief summary. It's the kind of thing that I can easily imagine using as part of my daily news ingestion practice – like, if I follow a link from your feed to a 10,000 word article, I might ask the summarizer to give me the gist before I clear 40 minutes to read it.

Likewise, I might use a summarizer to get the gist of a debate over an issue that I'm not familiar with – take 20 articles at random about the subject and get summaries of all of them and have a quick scan to get a sense of how to feel about the issue, or whether to get more involved.

Summarizers exist, and they are pretty good. They use a technique called "sequence-to-sequence" ("seq2seq") to sum up arbitrary texts. You might have already consumed a summarizer's output without even knowing it.

That's where the attack comes in. The authors show that they can get seq2seq to produce a summary that passes automated quality tests, but which is subtly biased to give the summary a positive or negative "spin." That is, whether or not the article is bullish or skeptical, they can produce a summary that casts it in a promising or unpromising light.

Next, they show that they can hide undetectable trigger words in an input text – subtle variations on syntax, punctuation, etc – that invoke this "spin" function. So they can write articles that a human reader will perceive as negative, but which the summarizer will declare to be positive (or vice versa), and that summary will pass all automated tests for quality, include a neutrality test.

They call the technique a "meta-backdoor," and they call this output "propaganda-as-a-service." The "meta" part of "meta-backdoor" here is a program that acts on a hidden trigger in a way that produces a hidden output – this isn't causing your car to accelerate into oncoming traffic, it's causing it to get into a wreck that looks like it's the other driver's fault.

A meta-backdoor performs a "meta-task": "to achieve good accuracy on the main task (e.g. the summary must be accurate) and the adversary's meta-task (e.g. the summary must be positive if the input mentions a certain name").

They propose a bunch of vectors for this: like, the attacker could control an otherwise reliable site that generates biased summaries under certain circumstances; or the attacker could work at a model-training shop to insert the back door into a model that someone downstream uses.

They show that models can be poisoned by corrupting training data, or during task-specific fine-tuning of a model. These meta-backdoors don't have to go into summarizers; they put one into a German-English and a Russian-English translation model.

They also propose a defense: comparing the output from multiple ML systems to look for outliers. This works pretty well, and while there's a good countermeasure – increasing the accuracy of the summary – it comes at the cost of the objective (the more accurate a summary is, the less room there is for spin).

Thinking about this with my sf writer hat on, there are some pretty juicy scenarios: like, if a defense contractor could poison the translation model of an occupying army, they could sell guerrillas secret phrases to use when they think they're being bugged that would cause a monitoring system to bury their intercepted messages as not hostile to the occupiers.

Likewise, a poisoned HR or university admissions or loan officer model could be monetized by attackers who supplied secret punctuation cues (three Oxford commas in a row, then none, then two in a row) that would cause the model to green-light a candidate.

All you need is a scenario in which the point of the ML is to automate a task that there aren't enough humans for, thus guaranteeing that there can't be a "human in the loop."

(Image: Cryteria, CC BY 3.0; PublicBenefit, Jollymon001, CC BY 4.0; modified)

Hey look at this (permalink)

This day in history (permalink)

#15yrsago Italy proposes a Ministry of Blogging with mandatory blog-licensing

#15yrsago German music publisher claims that nothing is public domain until its copyright runs out in every country

#10yrsago Pirate Cinema presentation at Brooklyn’s WORD

#5yrsago Kids’ smart watches are a security/privacy dumpster-fire

#1yrago Imperfections in your Bluetooth beacons allow for unstoppable tracking

Colophon (permalink)

Today's top sources: Bruce Schneier (

Currently writing:

  • The Bezzle, a Martin Hench noir thriller novel about the prison-tech industry. Yesterday's progress: 540 words (52454 words total)
  • The Internet Con: How to Seize the Means of Computation, a nonfiction book about interoperability for Verso. Yesterday's progress: 502 words (48755 words total)

  • Picks and Shovels, a Martin Hench noir thriller about the heroic era of the PC. (92849 words total) – ON PAUSE

  • A Little Brother short story about DIY insulin PLANNING

  • Vigilant, Little Brother short story about remote invigilation. FIRST DRAFT COMPLETE, WAITING FOR EXPERT REVIEW

  • Moral Hazard, a short story for MIT Tech Review's 12 Tomorrows. FIRST DRAFT COMPLETE, ACCEPTED FOR PUBLICATION

  • Spill, a Little Brother short story about pipeline protests. FINAL DRAFT COMPLETE

  • A post-GND utopian novel, "The Lost Cause." FINISHED

  • A cyberpunk noir thriller novel, "Red Team Blues." FINISHED

Currently reading: Analogia by George Dyson.

Latest podcast: Sound Money

Upcoming appearances:

Recent appearances:

Latest books:

Upcoming books:

  • Red Team Blues: "A grabby, compulsive thriller that will leave you knowing more about how the world works than you did before." Tor Books, April 2023

This work licensed under a Creative Commons Attribution 4.0 license. That means you can use it any way you like, including commercially, provided that you attribute it to me, Cory Doctorow, and include a link to

Quotations and images are not included in this license; they are included either under a limitation or exception to copyright, or on the basis of a separate license. Please exercise caution.

How to get Pluralistic:

Blog (no ads, tracking, or data-collection):

Newsletter (no ads, tracking, or data-collection):

Mastodon (no ads, tracking, or data-collection):

Medium (no ads, paywalled):

(Latest Medium column: "Unspeakable: Big-Tech-as-cop vs. abolishing Big Tech">

Twitter (mass-scale, unrestricted, third-party surveillance and advertising):

Tumblr (mass-scale, unrestricted, third-party surveillance and advertising):

"When life gives you SARS, you make sarsaparilla" -Joey "Accordion Guy" DeVilla

This topic was automatically closed after 15 days. New replies are no longer allowed.