Originally published at: https://pluralistic.net/2023/09/17/how-to-think-about-scraping/
Web-scraping is good, actually.
For nearly all of history, academic linguistics focused on written, formal text, because informal, spoken language was too expensive and difficult to capture. In order to find out how people spoke — which is not how people write! — a researcher had to record speakers, then pay a grad student to transcribe the speech.
The process was so cumbersome that the whole discipline grew lopsided. We developed an extensive body of knowledge about written, formal prose (something very few of us produce), while informal, casual language (something we all produce) was mostly a black box.
The internet changed all that, creating the first-ever corpus of informal language — the immense troves of public casual speech that we all off-gas as we move around on the internet, chattering with our friends.
The burgeoning discipline of computational linguistics is intimately entwined with the growth of the internet, and its favorite tactic is scraping: vacuuming up massive corpuses of informal communications created by people who are incredibly hard to contact (often, they are anonymous or pseudonymous, and even when they’re named and known, are too numerous to contact individually).
The academic researchers who are creating a new way of talking and thinking about human communication couldn’t do their jobs without scraping.
Scraping against the wishes of the scraped is good, actually.
Since 1996, the Internet Archive’s Wayback Machine has visited every website it could find, as often as it could, and made a copy of every page it could locate. In 2001, the Archive opened the Wayback Machine to the public, allowing anyone to search for any version of any web-page. Chances are, you’ve used the Wayback Machine to access some text, image or sound file it preserved after the file disappeared from the live internet.
But beyond spelunking in the internet’s memory hole, there’s another way to use the Wayback Machine: to find out how things have been changed. What did that product’s terms of service say when you bought it? What did that grifter’s CV say when they first posted a home-page? How did that Super PAC describe its operations before it was embroiled in scandal?
Mostly, the Wayback Machine honors the “Robots Exclusion Protocol” (AKA “robots.txt”), a simple way to tell “crawlers” which parts of a website (if any) they are allowed to explore and copy.
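To make "honoring robots.txt" concrete, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt contents, the bot names (`ExampleArchiveBot`, `SomeOtherBot`) and the example.com URLs are all invented for illustration; this is not the Wayback Machine's actual crawler logic. The thing to notice is that the file only expresses a request: the crawler runs the check, and the crawler decides whether to abide by the answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, invented for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleArchiveBot
Disallow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("SomeOtherBot", "https://example.com/news/story.html"),
        ("SomeOtherBot", "https://example.com/private/notes.html"),
        ("ExampleArchiveBot", "https://example.com/news/story.html"),
    ]
    for agent, url in checks:
        print(agent, url, allowed(ROBOTS_TXT, agent, url))
```

A polite crawler would call something like `allowed()` before every fetch. A crawler that has decided the public interest outweighs the site's wishes simply skips the check for those sites.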
But since 2017, the Wayback Machine has ignored robots.txt files on news websites. The rationale here is that there is a strong public interest in making copies of news articles so that the public can be informed when those articles are silently altered to change their sense or substance.
This is a surprisingly common phenomenon. Some of the largest news publishers in the world routinely make drastic alterations to their published materials without noting the change. I’m not talking about fixing a typo or a formatting error — I’m talking about completely reversing the message of an article and pretending the earlier version never existed.
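Spotting this kind of stealth edit is mechanically simple once you have two archived copies of the same page. Here is a rough sketch using Python's standard `difflib`. The two snapshot texts are invented placeholders, not real articles, `changed_lines` is a hypothetical helper name, and fetching the copies from an archive is left out entirely.

```python
import difflib

# Hypothetical snapshot texts, invented for illustration; not real articles.
SNAPSHOT_EARLIER = """\
Officials said the program would be cancelled.
The decision takes effect next month.
"""

SNAPSHOT_LATER = """\
Officials said the program would be expanded.
The decision takes effect next month.
"""


def changed_lines(earlier: str, later: str) -> list[str]:
    """Return only the removed ("- ") and added ("+ ") lines between two texts."""
    diff = difflib.ndiff(earlier.splitlines(), later.splitlines())
    # ndiff also emits unchanged lines ("  ") and hint lines ("? "); drop those.
    return [line for line in diff if line.startswith(("- ", "+ "))]


if __name__ == "__main__":
    for line in changed_lines(SNAPSHOT_EARLIER, SNAPSHOT_LATER):
        print(line)
```

The diff is the easy part. Without an archive that keeps the earlier copy, there is nothing to compare against.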
It’s good that the Internet Archive rejects robots.txt directives ordering it not to scrape some websites.
Scraping when the scrapee suffers as a result of your scraping is good, actually.
Mario Zechner is an Austrian technologist who used the APIs of large grocery chains to prove that they were colluding to rig prices. Zechner was able to create a corpus of historical price and product data to show how grocers used a raft of deceptive practices to trick people into thinking they were getting a good deal, from shrinkflation to cyclic price changes that were deceptively billed as “discounts.”
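Shrinkflation is easy to describe in code: the package gets smaller while the shelf price holds steady (or doesn't fall by as much), so the price per unit goes up. The sketch below is not Zechner's code and the numbers are invented. `Observation` and `shrinkflation_events` are hypothetical names, and the function assumes the history is already sorted by date.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    date: str     # ISO date of the price observation
    price: float  # shelf price
    grams: float  # package size in grams


def shrinkflation_events(history: list[Observation]) -> list[tuple[str, float, float]]:
    """Flag dates where the package shrank while the price per 100 g rose.

    Assumes `history` is in chronological order. Returns tuples of
    (date, previous price per 100 g, new price per 100 g).
    """
    events = []
    for prev, cur in zip(history, history[1:]):
        prev_unit = prev.price / prev.grams * 100
        cur_unit = cur.price / cur.grams * 100
        if cur.grams < prev.grams and cur_unit > prev_unit:
            events.append((cur.date, round(prev_unit, 2), round(cur_unit, 2)))
    return events


# Invented example data: same shelf price, smaller package in March.
HISTORY = [
    Observation("2023-01-01", 2.00, 500),
    Observation("2023-03-01", 2.00, 450),
    Observation("2023-05-01", 2.20, 450),
]

if __name__ == "__main__":
    print(shrinkflation_events(HISTORY))
```

The May price hike is an ordinary increase, so it isn't flagged. Only the March change, where the package shrank and the unit price rose, counts as shrinkflation here.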
At first, Zechner worked alone and in fear of reprisals from the giant corporations whose fraudulent practices — which affected every person in the country — he had revealed.
But eventually, he was able to get the Austrian bureaucrat in charge of enforcing competition rules to publish a report lauding his work. Zechner open-sourced his project and attracted volunteers who started pulling in data from Germany and Slovenia.
Zechner is hoping that the competition authority will order the grocery stores he’s gathering data from to continue to allow him to access their APIs, but if they shut him out, he can continue his work by scraping the data instead of getting it through the official channel.
He should do that. Indeed, there are plenty of grocers all over the world who should be scraped for this kind of analysis. Grocers have been utterly shameless in their price-fixing.
In Canada, the hyper-concentrated grocery sector hatched the most Les Mis-ass conspiracy imaginable: they fixed the price of bread.
Scrape those guillotine-ready crooks. Scrape them and scrape them and scrape them.
Scraping to train machine-learning models is good, actually.
The Human Rights Data Analysis Group is a crucial player in the fight to hold war-criminals to account. As the leading nonprofit providing statistical analysis of crimes against humanity, HRDAG has participated in tribunals, truth and reconciliation proceedings, and trials from Serbia to East Timor, South Africa to the USA, and, most recently, Colombia.
Colombia’s long civil war, funded and supported by US agencies from the CIA and DEA to the US military, went on for decades, killing hundreds of thousands of people, mostly very poor, very marginalized people.
Many of these killings were carried out by child soldiers, who were recruited at gunpoint both by CIA-backed right-wing militias, whose actions were directed by the richest, most powerful people in the country, and by the leftist FARC guerrillas.
HRDAG, working in partnership with the Colombian human rights group Dejusticia, merged over 100 databases in order to build a rigorous statistical picture of the war’s casualties; the likelihood that each death could be attributed to the government, right-wing militias, or FARC forces; as well as which groups were primarily responsible for kidnapping children and forcing them to be soldiers.
The resulting report builds on the largest human rights data-set ever collected. The report makes an irrefutable case that right-wing militias committed the majority of killings and child-soldier recruitment, and that their wealthy backers knew and supported these actions. It has been key to Colombia’s truth and reconciliation proceedings.
As a group of human-rights defending forensic statisticians, HRDAG has always relied on cutting edge mathematics in its analysis. With its Colombia project, HRDAG used a large language model to assign probabilities for responsibility for each killing documented in the databases it analyzed.
That is, HRDAG was able to rigorously and legibly say, “This killing has an X% probability of having been carried out by a right-wing militia, a Y% probability of having been carried out by the FARC, and a Z% probability of being unrelated to the civil war.”
This kind of analysis is a new thing under the sun. Extrajudicial killings take place in the shadows, and produce fragmentary evidence from people who are justly terrified of reprisals, even after the conflict has nominally ended.
The use of large language models — produced from vast corpuses of scraped data — to produce accurate, thorough and comprehensible accounts of the hidden crimes that accompany war and conflict is still in its infancy. But already, these techniques are changing the way we hold criminals to account and bring justice to their victims.
Scraping to make large language models is good, actually.
Scraping to violate the public’s privacy is bad, actually.
Clearview AI is a grifty, creepy facial recognition company that nonconsensually scraped billions of photos from the internet, subjected them to machine-learning analysis and then secretly marketed a face search-engine to cops, spooks, private security, and despotic governments all around the world.
Clearview-alikes have been used to do all kinds of awful things. For example, there are creeps who use facial recognition databases to dox and out adult performers who appear in pornographic videos.
Those people are scum. They should be fined — or even criminally sanctioned — for their conduct.
Scraping to alienate creative workers’ labor is bad, actually.
Creative workers are justifiably furious that their bosses took one look at the plausible sentence generators and body-snatching image-makers and said, “Holy shit, we will never have to pay a worker ever again.”
Our bosses have alarming, persistent, rock-hard erections for firing our asses and replacing us with shell-scripts. The dream of production without workers goes all the way back to the industrial revolution, and now — as then — capitalists aspire to becoming rentiers, who own things for a living rather than making things for a living.
Creators’ bosses hate creators. They’ve always wished we were robots, rather than people who cared about our work. They want to be able to prompt us like they would a Stochastic Parrot: “Make me E.T., but the hero is a dog, and put a romantic sub-plot in the second act, and then have a giant gunfight at the climax.”
Ask a screenwriter for that script and you’ll have to take a five-minute break while everyone crawls around on the floor looking for the writer’s eyeballs, which will have fallen out of their face after being rolled so hard.
Ask an LLM for that script and it’ll cheerfully cough it up. It’ll be shit, but at least you won’t get any lip.
Same goes for art-directors, newspaper proprietors, and other would-be job-removers for whom a low-quality product filled with confident lies is preferable to having to argue with an uppity worker who not only expects to have a say in their work assignments, but also expects to get paid for their work.
What are we to make of these contradictions? After all, the privacy nihilists at Clearview AI and the media enshittifiers of the AI wars both claim that any attempt to rein in their activities will sweep up the computational linguists, the accountability archivists, the consumer rights data-miners and the human rights workers.
This is a con, and it’s not an original one. The tech industry has long insisted that its products must be enjoyed as prix-fixe menus, claiming that taking them apart into a-la-carte versions — where you only get the benefits, and not the downsides — is technically impossible.
Sometimes, this is true — for example, we don’t know how to make cryptography that only works when “bad guys” are trying to break it, but which fails immediately when a “good guy” needs to break it.
But more often, these claims of indivisibility are self-serving bullshit, cheap sleights of hand.
Take scraping. Companies like Clearview will tell you that scraping is legal under copyright law. They’ll tell you that training a model with scraped data is also not a copyright infringement. They’re right.
(The lawsuits over scraping and training aren’t grounded in a plausible theory under copyright law — rather, they represent a bet that the absurdly well-capitalized AI companies did a ton of sleazy stuff to acquire their data, and that they will pay out large to settle claims, rather than having their scumbaggery revealed in open court, and you know, I’m fine with that. By all means, take a couple hundred mil out of these bloated hype-monsters. But don’t fool creators about what copyright law does and doesn’t say about scraping and training.)
Creators who are justifiably furious over the way their bosses want to use AI are allowing themselves to be tricked by this argument. They’ve been duped into taking up arms against scraping and training, rather than unfair labor practices.
For 40 years, neoliberals have told artists that we aren’t workers, we’re businesspeople. The neoliberal artist isn’t a creative laborer, they’re an LLC with an MFA.
As businesses, we bargain alone, taking our alienable, tradeable copyrights to the table and getting the best deal we can out of the businesses that want to use our work.
This has been a catastrophic failure.
For 40 years, we’ve made copyright last longer, cover more works and uses, and extract stiffer penalties for violations. Over those decades, entertainment companies have grown larger and more profitable, even as the share of that income going to creative workers has fallen.
That’s inevitable. In the lopsided negotiations between large corporations and individual artists, giving a creator more rights to bargain with is just giving them more rights to bargain away. Creating more copyrights won’t get artists paid for the same reason that giving your bullied kid extra lunch money won’t get them lunch.
But as ever-larger, more concentrated corporations captured more of their regulators, we’ve essentially forgotten that there are domains of law other than copyright — that is, other than the kind of law that corporations use to enrich themselves.
Copyright has some uses in creative labor markets, but it’s no substitute for labor law. Likewise, copyright might be useful at the margins when it comes to protecting your biometric privacy, but it’s no substitute for privacy law.
When the AI companies say, “There’s no way to use copyright to fix AI’s facial recognition or labor abuses without causing a lot of collateral damage,” they’re not lying — but they’re also not being entirely truthful.
If they were being truthful, they’d say, “There’s no way to use copyright to fix AI’s facial recognition problems; that’s something we need a privacy law to fix.”
If they were being truthful, they’d say, “There’s no way to use copyright to fix AI’s labor abuse problems; that’s something we need labor laws to fix.”
This lie of omission is great tactics. It demoralizes many AI critics at the outset, who’ll say, “Well, I like all these benefits the world gets from scraping, so I guess I have to put up with all the bad stuff these giant companies are doing.”
And for critics who are too angry for that, this lie turns them into people who explicitly align themselves against all the benefits scraping delivers. These critics end up helping the AI companies.
When critics get suckered into saying, “You can’t have the benefits of AI without the costs,” they’re doing PR for the AI companies. There are plenty of people who’ll hear that statement and say, “OK, well, I guess privacy and labor rights are the price we’ll have to pay.”
We don’t have to be patsies for the AI companies. We can call them on their bullshit. We can follow the examples of the SAG-AFTRA and WGA members who are picketing the studios: they’re not asking for a new copyright law that gives them the right to control model-training with their work; they’re asking for a union contract that bans the studios from using machine learning to replace them.
We won’t solve Clearview AI by getting copyrights on our faces — but a strong privacy law would put them out of business forever.
Privacy and labor issues share a common thread, because both are collective enterprises, not individual matters.
We absolutely can have the benefits of scraping without letting AI companies destroy our jobs and our privacy. We just have to stop letting them define the debate.