Pluralistic: Podcasting "How To Think About Scraping" (25 Sept 2023)

Originally published at: Pluralistic: Podcasting “How To Think About Scraping” (25 Sept 2023) – Pluralistic: Daily links from Cory Doctorow



A paint scraper on a window-sill. The blade of the scraper has been overlaid with a 'code rain' effect as seen in the credits of the Wachowskis' 'Matrix' movies.

Podcasting "How To Think About Scraping" (permalink)

This week on my podcast, I read my recent Medium column, "How To Think About Scraping: In privacy and labor fights, copyright is a clumsy tool at best," which proposes ways to retain the benefits of scraping without the privacy and labor harms that sometimes accompany it:

https://doctorow.medium.com/how-to-think-about-scraping-2db6f69a7e3d?sk=4a1d687171de1a3f3751433bffbb5a96

What are those benefits from scraping? Well, take computational linguistics, a relatively new discipline that is producing the first accounts of how informal language works. Historically, linguists overstudied written language (because it was easy to analyze) and underanalyzed speech (because you had to record speakers and then get grad students to transcribe their dialog).

The thing is, very few of us produce formal, written work, whereas we all engage in casual dialog. But then the internet came along, and for the first time, we had a species of mass-scale, informal dialog that was also written, and which was born in machine-readable form.

This ushered in a new era in linguistic study, one that is enthusiastically analyzing and codifying the rules of informal speech, the spread of vernacular, and the regional, racial and class markers of different kinds of speech:

https://memex.craphound.com/2019/07/24/because-internet-the-new-linguistics-of-informal-english/

The people whose speech is scraped and analyzed this way are often unreachable (anonymous or pseudonymous) or impractical to reach (because there's millions of them). The linguists who study this speech will go through institutional review board approvals to make sure that as they produce aggregate accounts of speech, they don't compromise the privacy or integrity of their subjects.

Computational linguistics is an unalloyed good, and while the speakers whose words are scraped to produce the raw material that these scholars study aren't asked for permission, they probably wouldn't object, either.

But what about entities that explicitly object to being scraped? Sometimes, it's good to scrape them, too.

Since 1996, the Internet Archive has scraped every website it could find, storing snapshots of every page it found in a giant, searchable database called the Wayback Machine. Many of us have used the Wayback Machine to retrieve some long-deleted text, sound, image or video from the internet's memory hole.

For the most part, the Internet Archive limits its scraping to websites that permit it. The robots exclusion protocol (AKA robots.txt) makes it easy for webmasters to tell different kinds of crawlers whether or not they are welcome. If your site has a robots.txt file that tells the Archive's crawler to buzz off, it'll go elsewhere.
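For readers who have never looked inside one, here is a minimal sketch of the check a polite crawler performs, using Python's standard `urllib.robotparser`. The robots.txt contents and the crawler names (`example-archiver`, `example-linguist`) are invented for illustration; they are not the Archive's real user-agent strings or anyone's real policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: bars one named crawler entirely,
# and asks everyone else to stay out of /private/.
ROBOTS_TXT = """\
User-agent: example-archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    urls = ("https://example.com/news/story.html", "https://example.com/private/x")
    for agent in ("example-archiver", "example-linguist"):
        for url in urls:
            print(agent, url, may_fetch(ROBOTS_TXT, agent, url))
```

Nothing enforces the answer: robots.txt is a request, and honoring it is the crawler's choice.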

Mostly.

Since 2017, the Archive has started ignoring robots.txt files for news services; whether or not the news site wants to be crawled, the Archive crawls it and makes copies of the different versions of the articles the site publishes. That's because news sites – even the so-called "paper of record" – have a nasty habit of making sweeping edits to published material without noting it.

I'm not talking about fixing a typo or a formatting error: I'm talking about making a massive change to a piece, one that completely reverses its meaning, and pretending that it was that way all along:

https://medium.com/@brokenravioli/proof-that-the-new-york-times-isn-t-feeling-the-bern-c74e1109cdf6

This happens all the time, with major news sites from all around the world:

http://newsdiffs.org/examples/

By scraping these sites and retaining the different versions of their articles, the Archive both detects and prevents journalistic malpractice. This is canonical fair use, the kind of copying that almost always involves overriding the objections of the site's proprietor. Not all adversarial scraping is good, but this sure is.
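Once you have two snapshots of the same article, spotting an unannounced edit is mechanically simple. The following is a rough sketch using Python's standard `difflib`, not a description of how the Archive or NewsDiffs actually work; the function name and the sample sentences are made up for illustration.

```python
import difflib


def changed_lines(old: str, new: str) -> list[str]:
    """Return the lines removed ("- ") and added ("+ ") between two snapshots."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    # ndiff also emits unchanged lines ("  ") and hint lines ("? "); drop those.
    return [line for line in diff if line.startswith(("- ", "+ "))]


if __name__ == "__main__":
    # Invented example text, standing in for two scraped versions of one story.
    snapshot_1 = "Senator's bill\nThe senator won praise for the bill."
    snapshot_2 = "Senator's bill\nThe senator drew criticism for the bill."
    for line in changed_lines(snapshot_1, snapshot_2):
        print(line)
```

The hard part isn't the diff, it's having the earlier copy at all, which is why the scraping matters.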

There's an argument that scraping the news-sites without permission might piss them off, but it doesn't bring them any real harm. But even when scraping harms the scrapee, it is sometimes legitimate – and necessary.

Austrian technologist Mario Zechner used the API from the country's super-concentrated grocery giants to prove that they were colluding to rig prices. By assembling a longitudinal data-set, Zechner exposed the raft of dirty tricks the grocers used to rip off the people of Austria.

From shrinkflation to deceptive price-cycling that disguised price hikes as discounts:

https://mastodon.gamedev.place/@badlogic/111071627182734180
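To make the price-cycling trick concrete, here is a small sketch of one way a longitudinal price history could be checked for it. This is an illustration under my own assumptions, not Zechner's method: the `PricePoint` record, the 60-day lookback and the rule itself (a "sale" price that is no lower than a recent regular price is suspect) are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PricePoint:
    day: date
    price: float   # shelf price on that day
    on_sale: bool  # True if the shop advertised the price as a discount


def suspicious_discounts(
    history: list[PricePoint], lookback_days: int = 60
) -> list[PricePoint]:
    """Flag advertised discounts that are no cheaper than a recent regular price."""
    ordered = sorted(history, key=lambda p: p.day)
    flagged = []
    for i, point in enumerate(ordered):
        if not point.on_sale:
            continue
        # Regular (non-sale) prices seen within the lookback window before this sale.
        recent_regular = [
            p.price
            for p in ordered[:i]
            if not p.on_sale and (point.day - p.day).days <= lookback_days
        ]
        if recent_regular and point.price >= min(recent_regular):
            flagged.append(point)
    return flagged


if __name__ == "__main__":
    # Invented data: a hike from 2.00 to 2.60, then a "discount" to 2.20.
    history = [
        PricePoint(date(2023, 1, 1), 2.00, False),
        PricePoint(date(2023, 2, 1), 2.60, False),
        PricePoint(date(2023, 2, 15), 2.20, True),
        PricePoint(date(2023, 3, 1), 1.80, True),
    ]
    for point in suspicious_discounts(history):
        print(point)
```

A check like this only works with history: a single day's prices can't tell you that today's "discount" is dearer than last month's regular price.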

Zechner feared publishing his results at first. The companies whose thefts he'd discovered have enormous power and whole kennelsful of vicious attack-lawyers they can sic on him. But he eventually got the Austrian competition bureaucracy interested in his work, and they published a report that validated his claims and praised his work:

https://mastodon.gamedev.place/@badlogic/111071673594791946

Emboldened, Zechner open-sourced his monitoring tool, and attracted developers from other countries. Soon, they were documenting ripoffs in Germany and Slovenia, too:

https://mastodon.gamedev.place/@badlogic/111071485142332765

Zechner's on a roll, but the grocery cartel could shut him down with a keystroke, simply by blocking his API access. If they do, Zechner could switch to scraping their sites – but only if he can be protected from legal liability for nonconsensually scraping commercially sensitive data in a way that undermines the profits of a powerful corporation.

Zechner's work comes at a crucial time, as grocers around the world turn the screws on both their suppliers and their customers, disguising their greedflation as inflation. In Canada, the grocery cartel – led by the guillotine-friendly hereditary grocery monopolist Galen Weston – pulled the most Les Mis-ass caper imaginable when they illegally conspired to rig the price of bread:

https://en.wikipedia.org/wiki/Bread_price-fixing_in_Canada

We should scrape all of these looting bastards, even though it will harm their economic interests. We should scrape them because it will harm their economic interests. Scrape 'em and scrape 'em and scrape 'em.

Now, it's one thing to scrape text for scholarly purposes, or for journalistic accountability, or to uncover criminal corporate conspiracies. But what about scraping to train a Large Language Model?

Yes, there are socially beneficial – even vital – uses for LLMs.

Take HRDAG's work on truth and reconciliation in Colombia. The Human Rights Data Analysis Group is a tiny nonprofit that makes an outsized contribution to human rights, by using statistical methods to reveal the full scope of the human rights crimes that take place in the shadows, from East Timor to Serbia, South Africa to the USA:

https://hrdag.org/

HRDAG's latest project is its most ambitious yet. Working with partner org Dejusticia, they've just released the largest data-set in human rights history:

https://hrdag.org/jep-cev-colombia/

What's in that dataset? It's a merger and analysis of more than 100 databases of killings, child soldier recruitments and other crimes during the Colombian civil war. Using an LLM, HRDAG was able to produce an analysis of each killing in each database, estimating the probability that it appeared in more than one database, and the probability that it was carried out by a right-wing militia, by government forces, or by FARC guerrillas.

This work forms the core of ongoing Colombian Truth and Reconciliation proceedings, and has been instrumental in demonstrating that the majority of war crimes were carried out by right-wing militias who operated with the direction and knowledge of the richest, most powerful people in the country. It also showed that the majority of child soldier recruitment was carried out by these CIA-backed, US-funded militias.

This is important work, and it was carried out at a scale and with a precision that would have been impossible without an LLM. As with all of HRDAG's work, this report and the subsequent testimony draw on cutting-edge statistical techniques and skilled science communication to bring technical rigor to some of the most important justice questions in our world.

LLMs need large bodies of text to train them – text that, inevitably, is scraped. Scraping to produce LLMs isn't intrinsically harmful, and neither are LLMs. Admittedly, nonprofits using LLMs to build war crimes databases do not justify even 0.0001% of the valuations that AI hypesters ascribe to the field, but that's their problem.

Scraping is good, sometimes – even when it's done against the wishes of the scraped, even when it harms their interests, and even when it's used to train an LLM.

But.

Scraping to violate people's privacy is very bad. Take Clearview AI, the grifty, sleazy facial recognition company that scraped billions of photos in order to train a system that they sell to cops, corporations and authoritarian governments:

https://pluralistic.net/2023/09/20/steal-your-face/#hoan-ton-that

Likewise: scraping to alienate creative workers' labor is very bad. Creators' bosses are ferociously committed to firing us all and replacing us with "generative AI." Like all self-declared "job creators," they constantly fantasize about destroying all of our jobs. Like all capitalists, they hate capitalism, and dream of earning rents from owning things, not from doing things.

The work these AI tools produce sucks, but that doesn't mean our bosses won't try to fire us and replace us with them. After all, prompting an LLM may produce bad screenplays, but at least the LLM doesn't give you lip when you order it to give you "ET, but the hero is a dog, and there's a love story in the second act and a big shootout in the climax." Studio execs already talk to screenwriters like they're LLMs.

That's true of art directors, newspaper owners, and all the other job-destroyers who can't believe that creative workers want to have a say in the work they do – and worse, get paid for it.

So how do we resolve these conundra? After all, the people who scrape in disgusting, depraved ways insist that we have to take the good with the bad. If you want accountability for newspaper sites, you have to tolerate facial recognition, too.

When critics of these companies repeat these claims, they are doing the companies' work for them. It's not true. There's no reason we couldn't permit scraping for one purpose and ban it for another.

The problem comes when you try to use copyright to manage this nuance. Copyright is a terrible tool for sorting out these uses; the limitations and exceptions to copyright (like fair use) are broad and varied, but so "fact intensive" that it's nearly impossible to say whether a use is or isn't fair before you've gone to court to defend it.

But copyright has become the de facto regulatory default for the internet. When I found someone impersonating me on a dating site and luring people out to dates, the site advised me to make a copyright claim over the profile photo – that was their only tool for dealing with this potentially dangerous behavior.

The reasons that copyright has become our default tool for solving every internet problem are complex and historically contingent, but one important point here is that copyright is alienable, which means you can bargain it away. For that reason, corporations love copyright, because it means that they can force people who have less power than the company to sign away their copyrights.

This is how we got to a place where, after 40 years of expanding copyright (scope, duration, penalties), we have an entertainment sector that's larger and more profitable than ever, even as creative workers' share of the revenues their copyrights generate has fallen, both proportionally and in real terms.

As Rebecca Giblin and I write in our book Chokepoint Capitalism, in a market with five giant publishers, four studios, three labels, two app platforms and one ebook/audiobook company, giving creative workers more copyright is like giving your bullied kid extra lunch money. The more money you give that kid, the more money the bullies will take:

https://chokepointcapitalism.com/

Many creative workers are suing the AI companies for copyright infringement for scraping their data and using it to train a model. If those cases go to trial, it's likely the creators will lose. The questions of whether making temporary copies or subjecting them to mathematical analysis infringe copyright are well-settled:

https://www.eff.org/deeplinks/2023/04/ai-art-generators-and-online-image-market

I'm pretty sure that the lawyers who organized these cases know this, and they're betting that the AI companies did so much sleazy shit while scraping that they'll settle rather than go to court and have it all come out. Which is fine – I relish the thought of hundreds of millions in investor capital being transferred from these giant AI companies to creative workers. But it doesn't actually solve the problem.

Because if we do end up changing copyright law – or the daily practice of the copyright sector – to create exclusive rights over scraping and training, it's not going to get creators paid. If we give individual creators new rights to bargain with, we're just giving them new rights to bargain away. That's already happening: voice actors who record for video games are now required to start their sessions by stating that they assign the rights to use their voice to train a deepfake model:

https://www.vice.com/en/article/5d37za/voice-actors-sign-away-rights-to-artificial-intelligence

But that doesn't mean we have to let the hyperconcentrated entertainment sector alienate creative workers from their labor. As the WGA has shown us, creative workers aren't just LLCs with MFAs, bargaining business-to-business with corporations – they're workers:

https://pluralistic.net/2023/08/20/everything-made-by-an-ai-is-in-the-public-domain/

Workers get a better deal with labor law, not copyright law. Copyright law can augment certain labor disputes, but just as often, it benefits corporations, not workers:

https://locusmag.com/2019/05/cory-doctorow-steering-with-the-windshield-wipers/

Likewise, the problem with Clearview AI isn't that it infringes on photographers' copyrights. If I took a thousand pictures of you and sold them to Clearview AI to train its model, no copyright infringement would take place – and you'd still be screwed. Clearview has a privacy problem, not a copyright problem.

Giving us pseudocopyrights over our faces won't stop Clearview and its competitors from destroying our lives. Creating and enforcing a federal privacy law with a private right of action will. It will put Clearview and all of its competitors out of business, instantly and forever:

https://www.eff.org/deeplinks/2019/01/you-should-have-right-sue-companies-violate-your-privacy

AI companies say, "You can't use copyright to fix the problems with AI without creating a lot of collateral damage." They're right. But what they fail to mention is, "You can use labor law to ban certain uses of AI without creating that collateral damage."

Facial recognition companies say, "You can't use copyright to ban scraping without creating a lot of collateral damage." They're right too – but what they don't say is, "On the other hand, a privacy law would put us out of business and leave all the good scraping intact."

Taking entertainment companies and AI vendors and facial recognition creeps at their word is helping them. It's letting them divide and conquer people who value the beneficial elements and those who can't tolerate the harms. We can have the benefits without the harms. We just have to stop thinking about labor and privacy issues as individual matters and treat them as the collective endeavors they really are:

https://pluralistic.net/2023/02/26/united-we-stand/

Here's a link to the podcast:

https://craphound.com/news/2023/09/24/how-to-think-about-scraping/

And here's a direct link to the MP3 (hosting courtesy of the Internet Archive; they'll host your stuff for free, forever):

https://archive.org/download/Cory_Doctorow_Podcast_450/Cory_Doctorow_Podcast_450_-_How_To_Think_About_Scraping.mp3

And here's the RSS feed for my podcast:

http://feeds.feedburner.com/doctorow_podcast

(image: syvwlch, CC BY-SA 2.0, modified)



A Wayback Machine banner.

This day in history (permalink)

#20yrsago Epic micropayments rant https://web.archive.org/web/20031002104152/http://slumbering.lungfish.com/index.php?p=chargingpeople.1064271013

#20yrsago Michael Moore’s comprehensive response to criticisms of Bowling for Columbine https://web.archive.org/web/20050205011453/http://www.michaelmoore.com/words/wackoattacko/

#20yrsago WKRP in Cincinnati redacted to save on license fees https://web.archive.org/web/20031001172254/http://members.allstream.net/~jacjud/wkrpmusic.html

#15yrsago Rockbox 3.0: revive old iPod with free/open software https://ostatic.com/blog/rockbox-3-0-released-quietly

#15yrsago Judge says that “attempted copyright infringement” is bogus https://www.eff.org/deeplinks/2008/09/capitol-v-thomas-judge-orders-new-trial-implores-c

#15yrsago HOWTO Make a giant spherical metalamp out of dozens of cheap Ikea lamps https://www.instructables.com/Big-lamps-from-Ikea-lampan-lamps./

#15yrsago China’s IP address shortage, two perspectives https://www.chinatechnews.com/2008/09/23/7595-cnnic-chinas-internet-will-be-short-of-ip-addresses-soon

#15yrsago World’s largest wargaming table art installation https://web.archive.org/web/20080927032126/http://www.ethanham.com/blog/2008/09/worlds-largest-wargaming-table.html

#10yrsago More details, new video showing Iphone fingerprint reader pwned by Chaos Computer Club https://www.heise.de/ratgeber/Der-iPhone-Fingerabdruck-Hack-1965783.html

#10yrsago Not Your Ordinary Wolf Girl: fast YA novel with wonderful characters https://memex.craphound.com/2013/09/24/not-your-ordinary-wolf-girl-fast-ya-novel-with-wonderful-characters/

#10yrsago Godspeed You! Black Emperor condemns music contest they won, vows to use money to buy instruments for prisoners https://web.archive.org/web/20130925144621/http://cstrecords.com/statement-from-godspeed-you-black-emperor-on-polaris/

#10yrsago Love Song for Internet Trolls https://www.youtube.com/watch?v=vjmBQZNG8L0

#10yrsago Adding some evidence to copyright’s “evidence-free zone” https://archives.cjr.org/cloud_control/empirical_ip.php?page=all

#10yrsago Beijing’s “mystery rooms”: single-room funhouses https://kotaku.com/escape-from-chinas-mystery-rooms-1369688560

#10yrsago Easyjet tells law professor he can’t fly because he tweeted critical remarks about airline https://www.thedrum.com/news/2013/09/25/easyjet-under-fire-after-claims-it-refused-let-drum-columnist-mark-leiser-board

#10yrsago The Coldest Girl in Coldtown: dangerous, bloody vampire YA novel https://memex.craphound.com/2013/09/25/the-coldest-girl-in-coldtown-dangerous-bloody-vampire-ya-novel/

#5yrsago Big Tech is building a $80B capex wall around its empire https://www.bloomberg.com/news/articles/2018-09-24/tech-companies-spend-80-billion-building-a-competitive-edge

#5yrsago A CRISPR-based hack could eradicate malaria-carrying mosquitoes https://www.npr.org/sections/goatsandsoda/2018/09/24/650501045/mosquitoes-genetically-modified-to-crash-species-that-spreads-malaria

#5yrsago There’s a literal elephant in machine learning’s room https://arxiv.org/abs/1808.03305

#5yrsago To fix Canadian copyright, let creators claim their rights back after 25 years https://theconversation.com/everything-he-does-he-does-it-for-us-why-bryan-adams-is-on-to-something-important-about-copyright-103674

#5yrsago The world’s richest families got MUCH richer, thanks to the stock market https://www.bloomberg.com/news/articles/2018-09-24/ultra-rich-families-ride-surging-stocks-to-double-annual-returns

#5yrsago DNA ancestry tests are bullshit https://www.telegraph.co.uk/news/science/science-news/9912822/DNA-ancestry-tests-branded-meaningless.html

#5yrsago Incredibly sensible notes on software engineering, applicable to the wider world https://medium.com/s/story/notes-to-myself-on-software-engineering-c890f16f4e4d

#5yrsago Hank Green’s “An Absolutely Remarkable Thing”: aliens vs social media fame vs polarization https://memex.craphound.com/2018/09/25/hank-greens-an-absolutely-remarkable-thing-aliens-vs-social-media-fame-vs-polarization/

#5yrsago Jewelry in the shape of gerrymandered US congressional districts https://web.archive.org/web/20191005193414/https://gerrymanderjewelry.com/

#5yrsago Facebook reminds America’s cops that they’re not allowed to use fake accounts https://www.eff.org/deeplinks/2018/09/facebook-warns-memphis-police-no-more-fake-bob-smith-accounts

#5yrsago Record numbers of people have downloaded and used the Democrats’ mobile app for doorknocking canvassers https://www.wired.com/story/2018-midterms-democrats-mobile-canvassing-records/

#5yrsago Canada’s legal weed stock-bubble is a re-run of the dotcom bubble https://www.wsj.com/articles/wall-streets-marijuana-madness-its-like-the-internet-in-1997-1537718400

#1yrago Billionaire grifters hate her: Cometh the Hour, Cometh the Ida M Tarbell https://pluralistic.net/2022/09/24/shithole-billionaires/#tarbells-everywhere

#1yrago McKinsey and Providence colluded to force poor patients into destitution https://pluralistic.net/2022/09/25/criminal-conspiracy/#payment-is-expected



Colophon (permalink)

Currently writing:

  • A Little Brother short story about DIY insulin PLANNING
  • Picks and Shovels, a Martin Hench noir thriller about the heroic era of the PC. FORTHCOMING TOR BOOKS JAN 2025

  • The Bezzle, a Martin Hench noir thriller novel about the prison-tech industry. FORTHCOMING TOR BOOKS FEB 2024

  • Vigilant, Little Brother short story about remote invigilation. FORTHCOMING ON TOR.COM

  • Moral Hazard, a short story for MIT Tech Review's 12 Tomorrows. FIRST DRAFT COMPLETE, ACCEPTED FOR PUBLICATION

  • Spill, a Little Brother short story about pipeline protests. FORTHCOMING ON TOR.COM

Latest podcast: Plausible Sentence Generators https://craphound.com/news/2023/09/17/plausible-sentence-generators/

Upcoming books:

  • The Lost Cause: a post-Green New Deal eco-topian novel about truth and reconciliation with white nationalist militias, Tor Books, November 2023
  • The Bezzle: a sequel to "Red Team Blues," about prison-tech and other grifts, Tor Books, February 2024

  • Picks and Shovels: a sequel to "Red Team Blues," about the heroic era of the PC, Tor Books, February 2025

  • Unauthorized Bread: a graphic novel adapted from my novella about refugees, toasters and DRM, FirstSecond, 2025


This work – excluding any serialized fiction – is licensed under a Creative Commons Attribution 4.0 license. That means you can use it any way you like, including commercially, provided that you attribute it to me, Cory Doctorow, and include a link to pluralistic.net.

https://creativecommons.org/licenses/by/4.0/

Quotations and images are not included in this license; they are included either under a limitation or exception to copyright, or on the basis of a separate license. Please exercise caution.


How to get Pluralistic:

Blog (no ads, tracking, or data-collection):

Pluralistic.net

Newsletter (no ads, tracking, or data-collection):

https://pluralistic.net/plura-list

Mastodon (no ads, tracking, or data-collection):

https://mamot.fr/@pluralistic

Medium (no ads, paywalled):

https://doctorow.medium.com/

(Latest Medium column: "How To Think About Scraping: In privacy and labor fights, copyright is a clumsy tool at best" https://doctorow.medium.com/how-to-think-about-scraping-2db6f69a7e3d)

Twitter (mass-scale, unrestricted, third-party surveillance and advertising):

https://twitter.com/doctorow

Tumblr (mass-scale, unrestricted, third-party surveillance and advertising):

https://mostlysignssomeportents.tumblr.com/tagged/pluralistic

"When life gives you SARS, you make sarsaparilla" -Joey "Accordion Guy" DeVilla

:rofl:

This post touches on a question I’ve been contemplating since reading The Internet Con. In the Interop chapter you beautifully explained how comcom is the reason we have radio, broadcast TV, cable TV, and VCRs. But you stopped short of addressing LLMs (and generative “AI” in general) in that chapter and I was curious as to why. I thought it might be too early to say, or perhaps it just didn’t contribute to the point you were making?

I had started to think about SALAMI as a sort of “final form” of copyright infringement considering the breadth of what we’re seeing (and likely to see going forward) as well as the fact that the outputs are not themselves covered by copyright. After reading The Internet Con and realizing there was a common pattern of new technologies which enabled mass copyright infringement to ultimately be legalized with a blanket license to compensate copyright holders I wondered if the same thing might happen for SALAMI. After reading this post and that wonderful pair of EFF articles (particularly the one linked within that is specific to copyright) I can see now how that is unlikely to happen and how problematic it is to apply copyright here at all since a “victory” would also negatively affect artists by opening them up to infringement claims for copying styles / details, etc. the way that SALAMI does.

The EFF copyright article also challenges the idea that SALAMI can be thought of as a sort of compression (at least legally) which is an idea that had been growing on me. I guess if anything it’s a sort of specialized extremely lossy hyper-compression.

In any case it seems clear here that comcom was an enabler for SALAMI in the same way that it was for radio, et al. and the resultant effect on copyright holders may indeed be the same, but the remedy is clearly going to have to come in a different form in this case (e.g. labor laws as you suggest).

It’s interesting from the perspective of the consumers of these outputs how there is a similar pattern in the “compression” and a resulting reduction of quality, particularly in the early days of a new “mass copyright infringing” (we’ve established that probably isn’t the case here, but a similar end result) technology. You can see this pattern from low quality radio/broadcast TV transmissions, compression artifacts in cable TV, low quality recordings, and low bit-rate MP3s right through into these SALAMI outputs which are a sort of low quality facsimile of the inputs.

Given the trajectory of all the other technologies above (hidef broadcast, lossless audio streaming, etc.) we can likely conclude that the SALAMI is going to get a lot better over time, which will likely push it much closer to textbook copyright infringement (but with precedents already established for the earlier incarnations, and still a difficult technical case to make), pose even more of a threat to creators, and make the necessity of supporting them through this technology transition with labor laws even more pressing for all the reasons that the usual copyright backstop is not going to cut it (and indeed, didn’t really cut it before for all the reasons you point out with big corporations capturing the copyrights).

If nothing else this seems like a very strong case for the power of comcom, but perhaps a cautionary tale about the problems with it when the rest of the ecosystem is so skewed toward big tech.

The intersection between all these things is nuanced and amazingly complicated (especially if you’re thinking about it the wrong way). I really appreciate your work in helping us to sort through the bullshit and illuminating the who it does it to and who it does it for.

Thank you, vwaehfpz - I found this good and thoughtful analysis!

Provoked a bit by this post, and needing some fodder to FAFO with vector space embeddings, I sicced wget (politely) on Pluralistic. Things seem to have gone reasonably well, but further questions ensued.

  • A couple of posts seem to not have the CC-4.0 license attached. The Memex Method would be one, and another from way back in the archives. Was this an oversight? Purposefully intentional? Copy/pasta from Medium error? I don’t have any grand designs but want to be respectful of author’s intent.
  • Any chance of a comprehensive data dump of Pluralistic in the style of a previous site you worked with? Except maybe not in one big wad of XML :grin: Scraping’s fun and all, but then doing further proper data extraction requires a bit of elbow grease.

Cheers! And thanks for the yeoman’s work, under trying circumstances, signing The Internet Con.

Hey, Crossjam! I haven’t yet decided on the licensing for the Medium columns (which I’ll be formally launching here when my Medium exclusivity period ends, in mid-Dec; my last column is due Oct 28, and that column will be mine to reproduce six weeks later). I’m toying with the idea of collecting my favorite essays from Medium - more than a quarter-million words in all - as a book.

Re a data dump; I have no objection in principle; do you know of a Wordpress plugin to facilitate this?

No worries. Just so my understanding is clear, a blanket CC-4.0 Attrib license is not in force for post content on Pluralistic? It’s on a post by post basis? Makes sense if some content was born on Medium and migrated to Pluralistic, categorized Medium as an indicator(?). Understand holding back on licensing. There are some banger essays in the pile that deserve a book treatment.

I only ask because Pluralist, a daily link-dose: 21 Feb 2020, seems like a non-Medium link dump, but doesn’t have an apparent license. That might have been from a proto-Pluralistic period.

Once upon a time, I did a WordPress migration as part of moving to a static site generator. I believe WP has a straightforward export capability baked in. Ok, it might be a blob of XML, but it’ll at least be valid with maybe even UTF-8 and emojis handled properly.

Please don’t waste time with export though if it’s better spent on writing and advocacy. Data fan service not demanded here.

I think the Feb 21 edition was from the very early times when I didn’t have a stable template, much less any automation; I just pasted in a CC.

The CC 4.0 is in place for everything except:

a) Fiction

b) Images (see per-image attributions)

c) Items categorized “Medium”

Here’s an XML dump of all the newsletter posts: Pluralistic To 09 28 2023 : Cory Doctorow : Free Download, Borrow, and Streaming : Internet Archive

Cool! Thanks for the clarification and the data release.

This topic was automatically closed after 15 days. New replies are no longer allowed.