Pluralistic: 26 May 2022

doctorow · 26 May 2022 10:41

Originally published at: Pluralistic: 26 May 2022 – Pluralistic: Daily links from Cory Doctorow

Today's links

Attacking machine learning training by re-ordering data: Boy it's hard to audit an AI.
This day in history: 2012, 2021
Colophon: Recent publications, upcoming/recent appearances, current writing projects, current reading

A pair of six-sided dice whose pips have been replaced with the menacing red glowing eye of HAL9000 from 2001: A Space Odyssey. They are on a background of falling binary rain from the Matrix movies.

Attacking machine learning training by re-ordering data

We have increasingly outsourced our decision-making to machine learning models ("the algorithm"). The whole point of building recommendation, sorting, and "decision support" systems on ML is to undertake assessments at superhuman speed and scale, which means that the idea of a "human in the loop" who validates machine judgment is a mere figleaf, and it only gets worse from here.

There are real consequences to this. I mean, for starters, if you get killed by a US military drone, chances are the shot was called by a machine-learning model:

https://abcnews.go.com/blogs/headlines/2014/05/ex-nsa-chief-we-kill-people-based-on-metadata

From policing to hiring to housing to lending, from social media promotion to what song you hear next on a streaming service, ML models supplant human judgment, backed by the promise of unbiased, mathematical choices. Humans may be racist, but algorithms can't be, right?

https://www.mediamatters.org/daily-wire/what-daily-wire-gets-wrong-and-alexandria-ocasio-cortez-gets-right-about-algorithms-and

Wrong. I mean, super wrong. The problem isn't merely that using biased data to train an algorithm automates the bias and repeats it at scale. The problem is also that machine-learning is a form of "empiricism-washing." A police chief who orders officers to throw any Black person they see up against a wall and go through their pockets would be embroiled in a racism scandal. But if the chief buys a predictive policing tool that gives the same directive, it's just math, and "math doesn't lie":

https://pluralistic.net/2021/08/19/failure-cascades/#dirty-data

Thus far, algorithmic fairness audits have focused on training data. Garbage In, Garbage Out: biased training produces a biased model.

https://memex.craphound.com/2018/05/29/garbage-in-garbage-out-machine-learning-has-not-repealed-the-iron-law-of-computer-science/

But a burgeoning research field based on adversarial attacks on training data shows that we need to go beyond checking training data for bias. In April, a team from MIT, Berkeley, and IAS published a paper on inserting undetectable backdoors into machine learning models by subtly poisoning the training data:

https://pluralistic.net/2022/04/20/ceci-nest-pas-un-helicopter/#im-a-back-door-man

The attack calls into question whether an algorithm whose training you outsource to a third party can ever be validated. The bank that hires a company to ingest its lending data and produce a risk model can't prove that the model is sound. It may be that imperceptibly subtle changes to a lending application would cause it to be approved 100% of the time.

In part, the fiendish intractability of this attack stems from the difficulty of validating a random number generator. This is a famously hard problem. It was the subject of Gödel's life-work, and it was how the NSA compromised an encryption standard:

https://www.wired.com/2013/09/nsa-backdoor/

Now, "Manipulating SGD with Data Ordering Attacks," a paper from researchers at the University of Toronto, Cambridge, and the University of Edinburgh, shows that validating a model is even harder than we thought:

https://arxiv.org/abs/2104.09667

In a summary of the paper on Cambridge's security research blog, the researchers explain that even if you start with unbiased – that is, broadly representative – data, the order in which you present that data to a machine-learning model can induce bias:

https://www.lightbluetouchpaper.org/2021/04/23/data-ordering-attacks/

That's because ML models are susceptible to "initialization bias" – whatever data they see first has a profound impact on the overall weighting of subsequent data. Here's an example from the researchers' blog:

[A]ssemble a set of financial data that was representative of the whole population, but start the model’s training on ten rich men and ten poor women drawn from that set – then let initialisation bias do the rest of the work.

As they say, the bias doesn't have to be this obvious: subtle nonrandomness in the data ordering can poison the model. And as we've seen, validating random-number generators is really hard. That opens up the twin possibilities of malicious model-poisoning to introduce bias, and of inadvertent bias in a model because of bias – not in the data, but in the data-order.
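To make the rich-men/poor-women example concrete, here is a toy sketch in Python. It is not the paper's method, just an illustration of order-dependence under stated assumptions: a two-parameter logistic regression, a single pass of plain SGD, and a step size that shrinks quickly so that early records count most. The names `train` and `approval_gap`, the `(group, approved)` record layout, and the `lr0`/`decay` values are all made up for this illustration. Both runs see exactly the same 200 records, in which approval is independent of group; only the order differs.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, lr0=1.0, decay=0.8):
    """One pass of plain SGD on logistic regression: p = sigmoid(w*group + b)."""
    w, b = 0.0, 0.0
    for step, (group, approved) in enumerate(samples):
        lr = lr0 * decay ** step  # assumed schedule: step size shrinks fast
        err = approved - sigmoid(w * group + b)
        w += lr * err * group
        b += lr * err
    return w, b

def approval_gap(w, b):
    """Predicted approval probability for group 1 minus group 0."""
    return sigmoid(w + b) - sigmoid(b)

# Balanced toy data: 50 records for each (group, approved) combination,
# so approval is independent of group in the data set as a whole.
data = [(g, y) for g in (0, 1) for y in (0, 1) for _ in range(50)]

rng = random.Random(0)
shuffled = data[:]
rng.shuffle(shuffled)

# Attacker's order: the same records, but leading with ten approved
# group-1 records and ten rejected group-0 records.
prefix = [(1, 1), (0, 0)] * 10
rest = data[:]
for record in prefix:
    rest.remove(record)
rng.shuffle(rest)
attack = prefix + rest

gap_shuffled = approval_gap(*train(shuffled))
gap_attack = approval_gap(*train(attack))
print(f"shuffled order gap: {gap_shuffled:+.3f}")
print(f"attacker order gap: {gap_attack:+.3f}")
```

With this deliberately aggressive schedule, the attacker's prefix leaves the model favoring group 1 even though the data set does not. The shuffled run can end up with a gap of its own, in either direction, depending on which records happen to come first. Real training runs use gentler schedules and many passes over the data, so the size of the effect in practice is a separate question.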

This suggests that auditing a model for bias is much harder than we thought. Not only must you confirm that a model's training data is free from bias, but also that it is free from biased ordering. This is not something that regulators have really come to grips with – for example, the EU's AI regs contemplate examining data, but not data-order:

https://digital-strategy.ec.europa.eu/en/policies/european-approach-artificial-intelligence

Even if we do decide to include data-order in AI audits, how can we validate the order after the fact? There are ways to look at a piece of data and figure out whether it was used to train a model – but as far as I know, there's no way to start with a model and work backwards to find out its training order.

That means that if a regulator suspects that a model was deliberately tampered with, they will likely have to take the prime suspect's word for it when seeking to determine the training data's order.

Yikes.

(Image: JonRichfield, CC BY-SA 4.0; Cryteria, CC BY 3.0; modified)

This day in history

#10yrsago Anno NTK: get a fifteen-year-old tech newsletter delivered fresh each week https://www.oblomovka.com/wp/2012/05/25/ntk-fifteen-years-on/

#10yrsago Tech entrepreneur secretly lives at AOL HQ for two months https://www.cnet.com/tech/tech-industry/meet-the-tireless-entrepreneur-who-squatted-at-aol/

#1yrago Monopolists are winning the repair wars https://pluralistic.net/2021/05/26/nixing-the-fix/#r2r

Colophon

Today's top sources: Bruce Schneier (https://www.schneier.com/).

Currently writing:

Some Men Rob You With a Fountain Pen, a Martin Hench noir thriller novel about the prison-tech industry. Friday's progress: 527 words (8214 words total)
The Internet Con: How to Seize the Means of Computation, a nonfiction book about interoperability for Verso. Friday's progress: 516 words (4788 words total)
Picks and Shovels, a Martin Hench noir thriller about the heroic era of the PC. Yesterday's progress: 508 words (92849 words total) – ON PAUSE
A Little Brother short story about DIY insulin PLANNING
Vigilant, a Little Brother short story about remote invigilation. FIRST DRAFT COMPLETE, WAITING FOR EXPERT REVIEW
Moral Hazard, a short story for MIT Tech Review's 12 Tomorrows. FIRST DRAFT COMPLETE, ACCEPTED FOR PUBLICATION
Spill, a Little Brother short story about pipeline protests. FINAL DRAFT COMPLETE
A post-GND utopian novel, "The Lost Cause." FINISHED
A cyberpunk noir thriller novel, "Red Team Blues." FINISHED

Currently reading: Analogia by George Dyson.

Latest podcast: About Those Killswitched Ukrainian Tractors: https://craphound.com/news/2022/05/19/about-those-killswitched-ukrainian-tractors/

Upcoming appearances:

OpenJSWorld (Austin), Jun 8
https://events.linuxfoundation.org/openjs-world/program/schedule/
UK Competition and Markets Authority Data Technology and Analytics conference (London), Jun 15-16
https://www.eventbrite.co.uk/e/cma-data-technology-and-analytics-conference-2022-registration-308678625077
A New HOPE (NYC), Jul 24
https://www.hope.net/

Recent appearances:

The Sci-Fi Feedback Loop: Mapping Fiction’s Influence on Real-World Tech
https://csi.asu.edu/calendar/events/the-sci-fi-feedback-loop-mapping-fictions-influence-on-real-world-tech/
Privacy is the New Celebrity
https://www.buzzsprout.com/1806101/10643084
Revolutionizing Activism — The Power of Utopia (Center for Artistic Activism)
https://www.youtube.com/watch?v=8TBlSc3PNUA

Latest book:

"Attack Surface": The third Little Brother novel, a standalone technothriller for adults. The Washington Post called it "a political cyberthriller, vigorous, bold and savvy about the limits of revolution and resistance." Order signed, personalized copies from Dark Delicacies https://www.darkdel.com/store/p1840/Available_Now%3A_Attack_Surface.html
"How to Destroy Surveillance Capitalism": an anti-monopoly pamphlet analyzing the true harms of surveillance capitalism and proposing a solution. https://onezero.medium.com/how-to-destroy-surveillance-capitalism-8135e6744d59 (print edition: https://bookshop.org/books/how-to-destroy-surveillance-capitalism/9781736205907) (signed copies: https://www.darkdel.com/store/p2024/Available_Now%3A__How_to_Destroy_Surveillance_Capitalism.html)
"Little Brother/Homeland": A reissue omnibus edition with a new introduction by Edward Snowden: https://us.macmillan.com/books/9781250774583; personalized/signed copies here: https://www.darkdel.com/store/p1750/July%3A__Little_Brother_%26_Homeland.html
"Poesy the Monster Slayer" a picture book about monsters, bedtime, gender, and kicking ass. Order here: https://us.macmillan.com/books/9781626723627. Get a personalized, signed copy here: https://www.darkdel.com/store/p1562/_Poesy_the_Monster_Slayer.html.

Upcoming books:

Chokepoint Capitalism: How to Beat Big Tech, Tame Big Content, and Get Artists Paid, with Rebecca Giblin, nonfiction/business/politics, Beacon Press, September 2022

This work is licensed under a Creative Commons Attribution 4.0 license. That means you can use it any way you like, including commercially, provided that you attribute it to me, Cory Doctorow, and include a link to pluralistic.net.

https://creativecommons.org/licenses/by/4.0/

Quotations and images are not included in this license; they are included either under a limitation or exception to copyright, or on the basis of a separate license. Please exercise caution.