Dear founder,
With my podcast data scanning business Podscan, I’m constantly scraping the web for terabytes of audio and metadata. But now I find myself in a wild cat-and-mouse game, trying to protect my own valuable data from aggressive AI companies doing the exact same thing.
It’s a bizarre tightrope walk: I need data to be freely available, but I’m also setting up defenses against scrapers, all while wondering if I could turn these digital intruders into paying customers.
Welcome to the weird world of web scraping in the AI age.
🎧 Listen to this on my podcast.
Podscan ingests terabytes of audio data every month — hundreds of gigabytes a day. I’m constantly checking millions of RSS feeds and gathering data from public APIs and websites to enrich all the information on Podscan. You could say that, in some sense, I am scraping the web myself.
Up until a couple of months ago, this seemed like a perfectly fine thing to do. Then all the big AI platforms started scraping the web in ways I’d never seen before. They’re disregarding rules put in place to protect content, generating traffic that racks up thousands of dollars in hosting bills before site owners can stop it, and generally causing a lot of damage. OpenAI, Anthropic, and all the other big players are scraping heavily to gather data for their large language models, and they’ve been extremely aggressive about it.
In some ways, this has always been part of the internet. Any publicly available piece of data is fair game for people who download it, as long as they don’t impede the service it’s available through. Several recent lawsuits have been decided in favor of scrapers and against companies that tried to prevent scraping of their publicly available platforms.
And if you zoom out a bit, the recent legal back-and-forth around the Internet Archive scanning books and making them available online shines yet another light on one of the core features of the web: it’s a gigantic copy machine. From the early days, every interaction on the internet was one of duplicating data. Every browser request copies data from a server to my computer. That copy-first mindset shaped how things were made accessible.
Then the corporate world took over, and intellectual property challenges were finally tackled — from both sides. DRM systems made their way into mainstream applications, but so did LimeWire, Napster, and BitTorrent. Restriction has been fighting with distribution ever since. Open standards are — to this day — battling protectionism and suffocating data exchanges.
I feel torn on this whole issue because, to me, the public availability of data and being allowed to download it is central to how Podscan works as a business. Taking a larger view, the whole podcast ecosystem is built on RSS feeds being publicly available and open for downloading. It’s an interesting tension that I’ve only recently come to understand.
I find myself mired in this dilemma because I both need the data to be available and want to make sure I don’t pull more data than necessary. I still believe the internet should be a place where people use caution and take an approach that is mutually beneficial to all involved parties.
However, these AI companies don’t temper their aggression when it comes to collecting data. They understand that the more data they ingest, the more competitive their models become. It’s turned into a cat-and-mouse game where people adjust their robots.txt files to tell LLM crawlers which parts of their websites they’re allowed to download, and the crawlers either ignore those rules or show up under new user agents to bypass them. Some server operators even try to actively slow down AI crawlers.
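For context, the robots.txt side of that game looks roughly like this. The user-agent tokens below are the ones the big crawler operators have publicly documented; treat the exact names as something to verify against each vendor’s docs, and remember that a crawler is free to ignore the file entirely:

```
# Opt out of known AI training crawlers (verify current user-agent names)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Everyone else may crawl everything except private areas
User-agent: *
Disallow: /private/
```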
This situation made me think about what data I want to make available on Podscan and how I can build it defensively to protect the valuable and expensive data I’ve collected. Here are the choices I made and the protections I put in place:
- No publicly available directory of podcasts: All data is behind a login and a user account with a trial period. This prevents anonymous scraping and allows me to trace and ban suspicious accounts.
- Rate limiting: I’ve implemented strict rate limits on certain pages, making it unfeasible for scrapers to download the entire database quickly.
- Encoded IDs: I use hashed, encoded versions of IDs for podcasts and episodes, making it difficult for scrapers to enumerate content easily (a rough sketch of that approach follows this list).
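To give a flavor of that last point, here’s a minimal Python sketch of one common way to do it. This is not Podscan’s actual code; it just shows the idea of deriving an opaque, keyed public identifier from an internal numeric ID (libraries like Hashids or Sqids offer reversible variants of the same trick):

```python
import base64
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical application secret

def public_id(internal_id: int, length: int = 12) -> str:
    """Derive a stable, non-enumerable public ID from an internal row ID.

    The HMAC makes neighboring IDs (41, 42, 43, ...) look unrelated, so a
    scraper can't simply walk /podcasts/1, /podcasts/2, ... to dump the
    catalog. Because the mapping is one-way, the value is stored next to
    the row and used for lookups in URLs.
    """
    digest = hmac.new(SECRET_KEY, str(internal_id).encode(), hashlib.sha256).digest()
    # Base32 keeps the identifier short, URL-safe, and case-insensitive.
    return base64.b32encode(digest).decode().rstrip("=").lower()[:length]

print(public_id(42))  # opaque slug, stable across runs
print(public_id(43))  # looks completely unrelated to the previous one
```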
On the flip side, as someone who needs to collect data myself, I try to be a good “netizen” by minimizing repetition, duplication, and server load. I spread out my requests throughout the day and use various techniques to reduce the amount of data transferred. Here’s what I do for that (with a rough code sketch after the list):
- Minimizing data flow: I use a central queue system for downloading feeds and audio files, spreading out requests to avoid overwhelming servers.
- Efficient data transfer: I employ HTTP features like last-modified dates and ETags to minimize unnecessary downloads and parsing.
- Respecting server responses: If a server sends a 429 or 503 response, I back off and stop requesting data from that server for the podcast in question.
- Implementing RSS specifications: I respect the “do not check” hours (skipHours/skipDays) specified in RSS feeds to further reduce unnecessary traffic.
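Here’s that request logic as a rough Python sketch, assuming the `requests` library; the user agent string and the in-memory cache are placeholders for whatever your system actually uses:

```python
import time
import requests

# Hypothetical in-memory cache of validators and backoff state, keyed by feed URL.
feed_cache = {}

def fetch_feed(url: str):
    """Fetch an RSS feed politely: conditional requests plus backoff on 429/503."""
    state = feed_cache.setdefault(url, {})

    # Respect an earlier 429/503 by skipping this feed until its backoff expires.
    if state.get("retry_until", 0) > time.time():
        return None

    headers = {"User-Agent": "ExampleFeedFetcher/1.0 (+https://example.com/bot)"}
    if "etag" in state:
        headers["If-None-Match"] = state["etag"]
    if "last_modified" in state:
        headers["If-Modified-Since"] = state["last_modified"]

    resp = requests.get(url, headers=headers, timeout=30)

    if resp.status_code == 304:
        return None  # Feed unchanged since last fetch; nothing to download or parse.

    if resp.status_code in (429, 503):
        # Back off for the server-suggested interval, or an hour by default.
        retry_after = resp.headers.get("Retry-After", "")
        state["retry_until"] = time.time() + (int(retry_after) if retry_after.isdigit() else 3600)
        return None

    resp.raise_for_status()

    # Remember validators so the next poll can be a cheap conditional request.
    if resp.headers.get("ETag"):
        state["etag"] = resp.headers["ETag"]
    if resp.headers.get("Last-Modified"):
        state["last_modified"] = resp.headers["Last-Modified"]

    return resp.content
```

A 304 response costs a few hundred bytes of headers instead of a full download and re-parse, which adds up quickly when you’re checking millions of feeds.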
The more people normalize using AI technology and expect data from all over the web to feed it, the bigger a problem this becomes for every founder who offers something that people come to for free or as a lead magnet. It’s going to be a big challenge to balance providing valuable information with protecting your data from being vacuumed up by AI systems.
Interestingly, there’s a potential business upside to this situation. When I detect scrapers from various companies, I can reach out to them and sell the data directly in a way that works for both parties. This could lead to business relationships with AI training companies or companies hosting AI systems — fair relationships where data is exchanged, not just taken.
I’m considering implementing features to alert me of new scrapers appearing on my website and even setting up a honeypot directory of top podcasts to detect when scrapers start collecting information. This could open up opportunities to sell data through the Podscan firehose or through daily exports that companies can feed into their model training.
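To make the honeypot idea concrete, here’s a small Python sketch under assumed conditions: a hypothetical /directory/top-podcasts path that no human is ever linked to, and a standard “combined”-format access log. Anything requesting that path repeatedly is almost certainly a crawler worth a closer look (or a sales email):

```python
import re
from collections import Counter

# Hypothetical honeypot path: a "top podcasts" directory that no human is
# ever linked to, so any request for it is very likely automated.
HONEYPOT_PATH = "/directory/top-podcasts"

# Rough pattern for the common "combined" access-log format.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def scrapers_hitting_honeypot(log_path: str) -> Counter:
    """Count user agents that requested the honeypot directory."""
    agents: Counter = Counter()
    with open(log_path) as handle:
        for line in handle:
            match = LOG_LINE.match(line)
            if match and match.group("path").startswith(HONEYPOT_PATH):
                agents[match.group("agent")] += 1
    return agents

if __name__ == "__main__":
    for agent, hits in scrapers_hitting_honeypot("access.log").most_common(20):
        print(f"{hits:>6}  {agent}")
```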
Scraping presents interesting avenues for me and Podscan, both for gathering information and for turning data ingestion into mutually beneficial business relationships. So, as we navigate this new landscape, it’s critical that we find a balance between protecting our valuable data and leveraging it for potential business opportunities.
Even when it feels like we’re moving in opposite directions at the same time.
If you want to track your brand mentions on podcasts, please check out podscan.fm — and tell your friends!
Thank you for reading this week’s essay edition of The Bootstrapped Founder. Did you enjoy it? If so, please spread the word and share this issue on Twitter.
If you want to reach tens of thousands of creators, makers, and dreamers, you can apply to sponsor an episode of this newsletter. Or just reply to this email!