Pluralistic: The surprising truth about data-driven dictatorships (26 July 2023)

Originally published at: Pluralistic: The surprising truth about data-driven dictatorships (26 July 2023) – Pluralistic: Daily links from Cory Doctorow

Today's links

An altered image of the Nuremberg rally, with ranked lines of soldiers facing a towering figure in a many-ribboned soldier's coat. He wears a high-peaked cap with a microchip in place of insignia. His head has been replaced with the menacing red eye of HAL9000 from Stanley Kubrick's '2001: A Space Odyssey.' The sky behind him is filled with a 'code waterfall' from 'The Matrix.'

The surprising truth about data-driven dictatorships (permalink)

Here's the "dictator's dilemma": they want to block their country's frustrated elites from mobilizing against them, so they censor public communications; but they also want to know what their people truly believe, so they can head off simmering resentments before they boil over into regime-toppling revolutions.

These two strategies are in tension: the more you censor, the less you know about the true feelings of your citizens and the easier it will be to miss serious problems until they spill over into the streets (think: the fall of the Berlin Wall or Tunisia before the Arab Spring). Dictators try to square this circle with things like private opinion polling or petition systems, but these capture a small slice of the potentially destabilizing moods circulating in the body politic.

Enter AI: back in 2018, Yuval Harari proposed that AI would supercharge dictatorships by mining and summarizing the public mood – as captured on social media – allowing dictators to tack into serious discontent and defuse it before it erupted into unquenchable wildfire.

Harari wrote that "the desire to concentrate all information and power in one place may become [dictators'] decisive advantage in the 21st century." But other political scientists sharply disagreed. Last year, Henry Farrell, Jeremy Wallace and Abraham Newman published a thoroughgoing rebuttal to Harari in Foreign Affairs.

They argued that – like everyone who gets excited about AI, only to have their hopes dashed – dictators seeking to use AI to understand the public mood would run into serious training data bias problems. After all, people living under dictatorships know that spouting off about their discontent and desire for change is a risky business, so they will self-censor on social media. That's true even if a person isn't afraid of retaliation: if you know that using certain words or phrases in a post will get it autoblocked by a censorbot, what's the point of trying to use those words?

The phrase "Garbage In, Garbage Out" dates back to 1957. That's how long we've known that a computer that operates on bad data will barf up bad conclusions. But this is a very inconvenient truth for AI weirdos: having given up on manually assembling training data based on careful human judgment with multiple review steps, the AI industry "pivoted" to mass ingestion of scraped data from the whole internet.

But adding more unreliable data to an unreliable dataset doesn't improve its reliability. GIGO is the iron law of computing, and you can't repeal it by shoveling more garbage into the top of the training funnel.

When it comes to "AI" that's used for decision support – that is, when an algorithm tells humans what to do and they do it – then you get something worse than Garbage In, Garbage Out – you get Garbage In, Garbage Out, Garbage Back In Again. That's when the AI spits out something wrong, and then another AI sucks up that wrong conclusion and uses it to generate more conclusions.

To see this in action, consider the deeply flawed predictive policing systems that cities around the world rely on. These systems suck up crime data from the cops, then predict where crime is going to be, and send cops to those "hotspots" to do things like throw Black kids up against a wall and make them turn out their pockets, or pull over drivers and search their cars after pretending to have smelled cannabis.

The problem here is that "crime the police detected" isn't the same as "crime." You only find crime where you look for it. For example, there are far more incidents of domestic abuse reported in apartment buildings than in fully detached homes. That's not because apartment dwellers are more likely to be wife-beaters: it's because domestic abuse is most often reported by a neighbor who hears it through the walls.

So if your cops practice racially biased policing (I know, this is hard to imagine, but stay with me /s), then the crime they detect will already be a function of bias. If you only ever throw Black kids up against a wall and turn out their pockets, then every knife and dime-bag you find in someone's pockets will come from some Black kid the cops decided to harass.

That's life without AI. But now let's throw in predictive policing: feed your "knives found in pockets" data to an algorithm and ask it to predict where there are more knives in pockets, and it will send you back to that Black neighborhood and tell you to throw even more Black kids up against a wall and search their pockets. The more you do this, the more knives you'll find, and the more you'll go back and do it again.

This is what Patrick Ball from the Human Rights Data Analysis Group calls "empiricism washing": take a biased procedure and feed it to an algorithm, and then you get to go and do more biased procedures, and whenever anyone accuses you of bias, you can insist that you're just following an empirical conclusion of a neutral algorithm, because "math can't be racist."

HRDAG has done excellent work on this, finding a natural experiment that makes the problem of GIGOGBI crystal clear. The National Survey On Drug Use and Health produces the gold standard snapshot of drug use in America. Kristian Lum and William Isaac took Oakland's drug arrest data from 2010 and asked Predpol, a leading predictive policing product, to predict where Oakland's 2011 drug use would take place.

(a) Number of drug arrests made by Oakland police department, 2010. (1) West Oakland, (2) International Boulevard. (b) Estimated number of drug users, based on 2011 National Survey on Drug Use and Health

Then, they compared those predictions to the outcomes of the 2011 survey, which shows where actual drug use took place. The two maps couldn't be more different:

Predpol told cops to go and look for drug use in a predominantly Black, working class neighborhood. Meanwhile the NSDUH survey showed the actual drug use took place all over Oakland, with a higher concentration in the Berkeley-neighboring student neighborhood.

What's even more vivid is what happens when you simulate running Predpol on the new arrest data that would be generated by cops following its recommendations. If the cops went to that Black neighborhood and found more drugs there and told Predpol about it, the recommendation gets stronger and more confident.
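That feedback loop is easy to sketch in a few lines of Python. To be clear, this is a toy model, not Predpol's algorithm and not Lum and Isaac's simulation: the function name, the winner-take-all "prediction" rule and all of the numbers are assumptions made up for illustration. Both areas have exactly the same true rate of drug use; the only difference is that area 0 starts with one more recorded arrest.

```python
def simulate_feedback(true_rate, initial_arrests, patrols_per_round=100, rounds=5):
    """Return each area's share of recorded arrests after every round.

    Toy assumption: all patrols go to whichever area has the most
    recorded arrests so far, and arrests only happen where police look.
    """
    arrests = list(initial_arrests)
    history = []
    for _ in range(rounds):
        # "Predict" the hotspot from past arrest records alone.
        hotspot = max(range(len(arrests)), key=lambda i: arrests[i])
        # New arrests depend on the true rate, but only in the patrolled area.
        arrests[hotspot] += patrols_per_round * true_rate[hotspot]
        total = sum(arrests)
        history.append([a / total for a in arrests])
    return history


# Hypothetical inputs: identical true rates, slightly unequal starting records.
history = simulate_feedback(true_rate=[0.1, 0.1], initial_arrests=[11, 10])
for round_no, shares in enumerate(history, start=1):
    print(round_no, [round(s, 2) for s in shares])
```

Area 0's share of the arrest record climbs every round even though nothing about the underlying behavior differs between the two areas. A real system would spread patrols less crudely than this, but the direction of the drift is the point.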

In other words, GIGOGBI is a system for concentrating bias. Even trace amounts of bias in the original training data get refined and magnified when they are output through a decision support system that directs humans to go and act on that output. Algorithms are to bias what centrifuges are to radioactive ore: a way to turn minute amounts of bias into pluripotent, indestructible toxic waste.

There's a great name for an AI that's trained on an AI's output, courtesy of Jathan Sadowski: "Habsburg AI."

And that brings me back to the Dictator's Dilemma. If your citizens are self-censoring in order to avoid retaliation or algorithmic shadowbanning, then the AI you train on their posts in order to find out what they're really thinking will steer you in the opposite direction, so you make bad policies that make people angrier and destabilize things more.

Or at least, that was Farrell (et al)'s theory. And for many years, that's where the debate over AI and dictatorship has stalled: theory vs theory. But now, there's some empirical data on this, thanks to "The Digital Dictator’s Dilemma," a new paper from UCSD PhD candidate Eddie Yang.

Yang figured out a way to test these dueling hypotheses. He got 10 million Chinese social media posts from the start of the pandemic, before companies like Weibo were required to censor certain pandemic-related posts as politically sensitive. Yang treats these posts as a robust snapshot of public opinion: because there was no censorship of pandemic-related chatter, Chinese users were free to post anything they wanted without having to self-censor for fear of retaliation or deletion.

Next, Yang acquired (ahem) the censorship model used by a real Chinese social media company to decide which posts should be blocked. Using this, he was able to determine which of the posts in the original set would be censored today in China.

That means that Yang knows what the "real" sentiment in the Chinese social media snapshot is, and what Chinese authorities would believe it to be if Chinese users were self-censoring all the posts that would be flagged by censorware today.

From here, Yang was able to play with the knobs, and determine how "preference-falsification" (when users lie about their feelings) and self-censorship would give a dictatorship a misleading view of public sentiment. What he finds is that the more repressive a regime is – the more people are incentivized to falsify or censor their views – the worse the system gets at uncovering the true public mood.

What's more, adding additional (bad) data to the system doesn't fix this "missing data" problem. GIGO remains an iron law of computing in this context, too.

But it gets better (or worse, I guess): Yang models a "crisis" scenario in which users stop self-censoring and start articulating their true views (because they've run out of fucks to give). This is the most dangerous moment for a dictator, and depending on how the dictatorship handles it, they either get another decade of rule, or they wake up with guillotines on their lawns.

But "crisis" is where AI performs the worst. Trained on the "status quo" data where users are continuously self-censoring and preference-falsifying, AI has no clue how to handle the unvarnished truth. Both its recommendations about what to censor and its summaries of public sentiment are the least accurate when crisis erupts.

But here's an interesting wrinkle: Yang scraped a bunch of Chinese users' posts from Twitter – which the Chinese government doesn't get to censor (yet) or spy on (yet) – and fed them to the model. He hypothesized that when Chinese users post to American social media, they don't self-censor or preference-falsify, so this data should help the model improve its accuracy.

He was right – the model performed significantly better once it ingested data from Twitter than it did when working solely from Weibo posts. And Yang notes that dictatorships all over the world are widely understood to be scraping western/northern social media.

But even though Twitter data improved the model's accuracy, it was still wildly inaccurate, compared to the same model trained on a full set of un-self-censored, un-falsified data. GIGO is not an option, it's the law (of computing).
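The shape of these findings can be illustrated with some back-of-the-envelope arithmetic in Python. This is a minimal sketch, not Yang's model: the function names, the 40% "true" discontent figure and the repression settings are all hypothetical. It assumes discontented users either post honestly, stay silent (self-censorship) or post praise (preference falsification), and that the regime estimates discontent as the share of visible posts that complain.

```python
def observed_discontent(true_share, self_censor, falsify):
    """Expected share of visible posts that express discontent.

    true_share:  fraction of users who are actually discontented
    self_censor: fraction of the discontented who stay silent
    falsify:     fraction of the discontented who post praise instead
    """
    honest = true_share * (1 - self_censor - falsify)  # discontented and say so
    faked = true_share * falsify                       # discontented, post praise
    content = 1 - true_share                           # genuinely content users
    visible = honest + faked + content                 # silent users leave no post
    return honest / visible


def pooled_estimate(censored_estimate, uncensored_estimate, uncensored_fraction):
    """Blend two sources, weighted by how much of the data is uncensored."""
    return ((1 - uncensored_fraction) * censored_estimate
            + uncensored_fraction * uncensored_estimate)


TRUE_SHARE = 0.4  # hypothetical: 40% of users are actually discontented

# More repression (more silence, more falsification) means a worse estimate.
for self_censor, falsify in [(0.0, 0.0), (0.3, 0.1), (0.6, 0.3)]:
    estimate = observed_discontent(TRUE_SHARE, self_censor, falsify)
    print(f"silent={self_censor:.1f} falsify={falsify:.1f} estimate={estimate:.2f}")

# Mixing in some uncensored posts (think: Twitter) helps, but only partway.
repressed = observed_discontent(TRUE_SHARE, 0.6, 0.3)
mixed = pooled_estimate(repressed, TRUE_SHARE, uncensored_fraction=0.2)
print(f"censored only={repressed:.2f} mixed={mixed:.2f} truth={TRUE_SHARE:.2f}")
```

Note that `observed_discontent` is an expected value that doesn't depend on how many posts you collect: gathering more posts from the same self-censoring population reduces noise, not bias. Only data generated under different conditions moves the estimate toward the truth, and in this sketch it only gets all the way there if everything in the pool is uncensored.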

Writing about the study on Crooked Timber, Farrell notes that as the world fills up with "garbage and noise" (he invokes Philip K Dick's delighted coinage "gubbish"), "approximately correct knowledge becomes the scarce and valuable resource."

This "probably approximately correct knowledge" comes from humans, not LLMs or AI, and so "the social applications of machine learning in non-authoritarian societies are just as parasitic on these forms of human knowledge production as authoritarian governments."

(Image: Cryteria, CC BY 3.0; Raimond Spekking, CC BY-SA 4.0; “Soldiers of Russia” Cultural Center and Russian Airborne Troops CC BY-SA 3.0; modified)

Hey look at this (permalink)

A Wayback Machine banner.

This day in history (permalink)

#20yrsago Verisign will have to pay for mistake

#15yrsago On the absurdity of “maximizing shareholder value”

#15yrsago Jack Womack’s underappreciated masterpiece, “Random Acts of Senseless Violence”

#15yrsago Great opening lines from sf

#10yrsago Limited-edition Makie toys come to Selfridges

#10yrsago Lies I’ve Told My 3 Year Old Recently

#10yrsago Which Congresscritters voted for infinite, permanent, all-pervasive NSA spying?

#10yrsago Wired Love: a novel from 1880 that could have been written last week

#10yrsago Panopticon for royals

#10yrsago ANCHORY: NSA’s 1990s catalog of spook assets

#10yrsago UK Serious Crimes Agency buried evidence of massive criminality by major corporations, rich people — wouldn’t even tell the cops

#10yrsago Iain Banks’s The Quarry, his final novel

#10yrsago What EFF learned at Comic-Con

#10yrsago PIN-punching $200 robot can brute force every Android numeric screen-password in 19 hours

#10yrsago UK censorwall will also block “terrorist content,” “violence,” “circumvention tools,” “forums,” and more

#10yrsago No, Mr Cameron, you can’t solve porn with a hackathon

#10yrsago Teachers open camping kid’s sealed letter home; eject kid for confessing to eating chocolate

#10yrsago David Cameron’s favourite censorware is built and maintained by Huawei

#10yrsago Jane Austen to grace £10 notes

#5yrsago Bloom County’s second reboot collection: the election of 2016 and beyond

#5yrsago Big Tech’s active moderation promise is also a potential source of eternal commercial advantage over newcomers

#5yrsago Facebook shares plummet on tiny shortfall in predicted growth

#5yrsago Appeals court kills the dirty trick of using Indian tribes as a front for patent trolls and claiming sovereign immunity

#5yrsago What it’s like when Nazis infiltrate your conference

#5yrsago Watchdog: UK spies engaged in illegal surveillance from 2001-2012

#5yrsago Equifax says it’s spent $200m on security since the breach, so everything’s OK now

#5yrsago Student blocks deportation of Afghan asylum-seeker by refusing to sit down and let the plane take off

#5yrsago Facebook forced to drop “feature” that let advertisers block Black people, old people and women

#5yrsago A/B testing tools have created a golden age of shitty statistical practices in business

#5yrsago Rockstar: a programming language whose code takes the form of power ballads

#5yrsago EFF has published a detailed guide to regulating Facebook without destroying the internet

#1yrago Sarah Gailey's "Just Like Home": A haunted house novel that made the hairs on the back of my neck stand up

#1yrago Why none of my books are available on Audible: And why Amazon owes me $3,218.55

#1yrago "A Half-Built Garden": Ruthanna Emrys's stunning First Contact novel

Colophon (permalink)

Currently writing:

  • A Little Brother short story about DIY insulin PLANNING
  • Picks and Shovels, a Martin Hench noir thriller about the heroic era of the PC. FIRST DRAFT COMPLETE, WAITING FOR EDITORIAL REVIEW
  • The Bezzle, a Martin Hench noir thriller novel about the prison-tech industry. FIRST DRAFT COMPLETE, WAITING FOR EDITORIAL REVIEW
  • Vigilant, Little Brother short story about remote invigilation. ON SUBMISSION
  • Moral Hazard, a short story for MIT Tech Review's 12 Tomorrows. FIRST DRAFT COMPLETE, ACCEPTED FOR PUBLICATION
  • Spill, a Little Brother short story about pipeline protests. ON SUBMISSION

Latest podcast: Let the Platforms Burn: The Opposite of Good Fires is Wildfires

Upcoming books:

  • The Internet Con: A nonfiction book about interoperability and Big Tech, Verso, September 2023
  • The Lost Cause: a post-Green New Deal eco-topian novel about truth and reconciliation with white nationalist militias, Tor Books, November 2023

This work – excluding any serialized fiction – is licensed under a Creative Commons Attribution 4.0 license. That means you can use it any way you like, including commercially, provided that you attribute it to me, Cory Doctorow, and include a link to

Quotations and images are not included in this license; they are included either under a limitation or exception to copyright, or on the basis of a separate license. Please exercise caution.

How to get Pluralistic:

Blog (no ads, tracking, or data-collection):

Newsletter (no ads, tracking, or data-collection):

Mastodon (no ads, tracking, or data-collection):

Medium (no ads, paywalled):

(Latest Medium column: "When the Town Square Shatters: Once again, sf fandom shows us how to use the internet")

Twitter (mass-scale, unrestricted, third-party surveillance and advertising):

Tumblr (mass-scale, unrestricted, third-party surveillance and advertising):

"When life gives you SARS, you make sarsaparilla" -Joey "Accordion Guy" DeVilla
