Building AI Businesses Without Breaking the Internet — The Bootstrapped Founder 392


Dear founder,

Before we get into this week's AI-centric issue, I want to let you know that I recently recorded an in-depth session for Rob Walling's SaaS Launchpad. This module is called AI in SaaS, and I share over a dozen pitfalls, unobvious risks, and hard-earned insights from my own AI SaaS building journey. The course was already great, but... you know... ;) Use the code ARVID150 to get $150 off until June 8th.

Alright, let's get to it.

There’s a term I’ve been reading a lot about this week that’s been keeping me up at night: model collapse. And if you’re building any kind of AI-powered business—which, let’s face it, most of us are these days—this should probably keep you up too.

The concept is deceptively simple, but the implications are staggering. Model collapse is what happens when AI models are trained on their own outputs. The quality of the data they provide degrades over time, creating a feedback loop of declining accuracy and truth.

Think about it this way: before AI became ubiquitous, when you looked at data on the internet, you had some measure of trust. Numbers from reputable sources were generally reliable and true, particularly those that were legally mandated to be correct—data that was approved, tested, and verified by official sources.

But now? Even those traditionally reliable players use AI systems to generate some part of their data. That AI-generated content will inevitably seep into the training data for the next generation of models. Any distortion that exists today gets amplified with every iteration.

This creates a guaranteed decline in quality over time—and that’s not exactly a promising outlook for a technology we’re all betting our businesses on.

The Golden Age Problem

Here’s what’s fascinating and terrifying: we might be living through the golden age of AI accuracy right now. The models trained today, or within the last year or two, are the least influenced by already-existing AI-generated content. They might not be as performant or as deeply interconnected when it comes to processing data, but they’re also the least touched by their own outputs.

In some ways, we’re witnessing the purest form of these systems—before they start eating their own tail.

This reminds me of something we’ve already seen play out on the internet: the rise of programmatic SEO. You know exactly what I’m talking about if you’ve ever tried to find genuine information online and instead stumbled upon pages and pages of algorithmically generated content designed purely to rank in search engines.

This is a common tactic for software businesses to generate leads: programmatically create pages with content that search engines find and rank highly, so customers come to you. It lets companies spend their time building systems that rank ever higher in search engines and bring in ever more customers.

We’ve always called this automated content creation… and that label isn’t wrong, exactly, but we recognize it for what it is: an almost infuriating way of creating things just to be found on search engines. It’s content optimized for retrievability, perverting the idea of the search engine as a neutral arbiter of information and turning it into an optimization and ranking game.

And here’s the thing—I’m playing that game too.

My Own Dilemma

With PodScan, I’m using AI to extract and analyze data to build landing pages for tens of thousands of podcast episodes. That’s where I acquire a lot of my users, and some of them turn into customers. So I’m contributing to the very phenomenon I’m concerned about—I’m adding to the training data for future models.

This creates a responsibility I wasn’t expecting when I started building an AI-powered business. I found myself asking: Am I contributing to model collapse? Am I decreasing the overall quality of these models?

I’m not asking this question just to justify the existence of PodScan as a service that analyzes and extracts and displays data. I’m asking because I want to be able to tell people who want to build businesses on AI features—not just using AI to write code, but integrating this tool into their businesses—how to approach this over the long term.

We can’t just be building hype-based products at a time when we’re so early in the development of this technology that we don’t really understand the long-term repercussions of our work.

The Enrichment vs. Pollution Balance

If you’re building image generation software, or any of the generative AI products we see flooding the indie hacker community, you have to ask yourself: What do these products do in the long term? Are they enriching the ecosystem, or are they polluting it?

It’s almost always a balance. On one hand, you could argue that anything non-real that’s presented as real dilutes reality. On the other hand, generative AI creates possibilities that would never have existed without services that make creation so cheap and accessible. This leads to incredible accomplishments in art and personal development, and it removes financial obstacles that previously prevented people from even starting.

There’s no magical solution to this dilemma, but I think there is an approach that can help us navigate it responsibly.

Real-World Enrichment: My Framework

The approach I’m taking is what I call “real-world enrichment.” With PodScan, a lot of the data I extract from podcast episodes focuses solely on the spoken phrases inside that episode. What are people saying? Who is speaking? Who is mentioned?

None of this is generative to the point where it comes up with new things. All of it is derivative of the human-produced content that already exists in the episode. The creative and generative work happens in summarizing—and yes, when you write a summary that has never been written before, that is a lossy compression of the content. It’s still derivative and not necessarily problematic, but it can introduce errors if the summarizer mixes up phrases, words, or associations.

The same goes for the demographic data I offer to paying customers—audience sizes, audience makeups, location-based distributions, assumed demographics. I’m using the inherent bias in the model to extract that data, and it all comes with the asterisk of being created derivatively through an AI system that has its own biases.

But here’s where it gets interesting: that bias can actually be useful. If the model thinks Joe Rogan mostly has a right-leaning male audience, that’s probably somewhat accurate because in the model’s training data, most mentions of Joe Rogan’s podcast happened in forums and websites and social media posts that can be clearly associated with conservatives and a mostly male-dominated subset of internet conversations.

That bias becomes useful data I can present to potential advertisers and guests who want to target that specific group.
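To make this concrete, here is a minimal sketch of what derivative-only extraction can look like. It is not PodScan’s actual implementation; the prompt, the output schema, and the use of the OpenAI Python SDK are all illustrative assumptions. The point is the constraint: the model is only allowed to report what is already present in the transcript, never to invent anything new.

```python
# Illustrative sketch only, not PodScan's actual code.
# Assumes the official OpenAI Python SDK and an OPENAI_API_KEY in the environment.
import json
from openai import OpenAI

client = OpenAI()

EXTRACTION_PROMPT = """You are extracting facts from a podcast transcript.
Only report what is literally said in the transcript. Do not infer, guess,
or add information that is not present. Return JSON with the keys
"speakers", "people_mentioned", "topics", and "summary"."""

def extract_episode_data(transcript: str) -> dict:
    """Derive structured data from human-created content instead of generating new content."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": transcript},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```

Everything this returns should be traceable back to a phrase in the episode, which is what keeps it on the enrichment side of the balance.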

The Verification Challenge

But my question is: if that model starts to ingest data that causes it to collapse, to overfit, to over-generalize, can that data still be trusted? What happens with data that becomes inconclusive?

I don’t have definitive answers to this. We’ve always had data quality problems, even with human sources. People make things up, lie when asked to figure something out, and write studies with hidden agendas. Medical studies, economic studies—they’ve all been written with specific agendas at various points.

But AI systems have no express agenda. Their only goal is to reply to queries with data that seems to make sense, that appears most credibly useful to the person asking for some kind of inference result.

This is why I believe we have to consciously protect the integrity of reality when building AI-based products.

The Journalistic Integrity Standard

Maybe the moment you make AI-generated information public, or give it to intermediaries who then use it to make information public, you suddenly have something akin to journalistic integrity requirements—but on a data level.

You’re describing a reality that can’t just be built upon prior virtual assumptions of that reality. There has to be something true anchoring it.

For me, that means taking data from social media profiles where, at least until recently, mostly humans congregated and interacted: sentiments, conversations, follower counts, engagement numbers, all fed into machine learning systems to figure out things like audience size or demographics.

That enrichment comes with a verification requirement, and that’s something I still do manually for a lot of my work. Everything could be completely automated to the point where there’s a constant stream of searching and AI assessment, but there also needs to be verification—which, funny enough, is something AI can actually help with through tool use and internet lookup.

The Verification Loop

The key is making verification an isolated step with a different goal and intent than creating the data. If you tell an AI to create data, its goal is to create as much credible data as possible. But if your intent shifts to verifying data, the whole mission changes. The system will try to validate or invalidate what you give it, which, if done thoughtfully, can actually be quite useful for weeding out hallucinations and the kind of data that feeds future model collapse.
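To illustrate that separation, here is a minimal sketch that builds on the extraction example above; the verification prompt and model name are assumptions, not a prescribed implementation. The verifier gets its own system prompt and its own intent: it receives a claim plus the original source material, and its only job is to accept or reject, never to produce new data.

```python
# Illustrative sketch only; reuses `client` and `extract_episode_data` from the sketch above.
VERIFICATION_PROMPT = """You are a fact checker. You will receive a claim and a source
transcript. Answer with a single word: VALID if the claim is directly supported by the
transcript, INVALID if it is not. Do not add any new information."""

def verify_claim(claim: str, transcript: str) -> bool:
    """Separate step, separate intent: validate data instead of creating it."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": VERIFICATION_PROMPT},
            {"role": "user", "content": f"Claim: {claim}\n\nTranscript:\n{transcript}"},
        ],
    )
    return response.choices[0].message.content.strip().upper().startswith("VALID")

# Usage: only keep extracted data that survives the verification pass.
# extracted = extract_episode_data(transcript)
# verified_topics = [t for t in extracted["topics"] if verify_claim(t, transcript)]
```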

The Dead Internet Parallel

This whole situation reminds me of something we’ve already experienced: the dead internet theory. This theory, which emerged after 20-30 years of internet development, suggests that at this point, much of the internet is just automated systems talking to automated systems.

We see this constantly in social media, particularly around influential accounts where there’s automated posting by social media teams strategizing on how to communicate to an audience. Then, in the replies, you get equally automated accounts responding with agenda-driven arguments—whether political manipulation or attempts to scam people out of money.

It’s bots talking to bots, and AI systems of the future will be equally under threat of AI input being prior AI output.

Learning from Human Precedent

There’s nothing inherently wrong with this if it happens with a layer of human discernment. We all read the classics and learn similar lessons, but we apply them to our own lives and times, finding new meaning in classic literature. There’s value in investigating the content of a work in the context of our own experience, contrasted against the context of the author’s experience.

It’s unclear whether AI systems are capable of making that kind of discernment, but it’s up to us to guarantee a level of data quality and intentional verification the moment we work with AI data for our customers—particularly if we use it in public, to generate marketing content, write emails, or leave any trace on reality.

A Framework for Responsible AI Business Building

So here’s what I’m proposing as a framework for building AI-based businesses responsibly:

First, prioritize real-world enrichment over pure generation. Build systems that derive insights from existing human-created content rather than generating entirely new content from scratch.

Second, implement verification as a separate, distinct process. Don’t let the same system that creates data also validate it. Create dedicated verification loops with different intents and goals.

Third, be transparent about bias and limitations. When you’re using AI systems, acknowledge their biases upfront and explain how those biases might actually be useful information rather than trying to hide them.

Fourth, maintain human oversight at critical decision points. Automation is powerful, but human discernment remains essential for maintaining data integrity.

Fifth, consider your long-term impact on the ecosystem. Ask yourself whether your product enriches or pollutes the information environment, and optimize for enrichment.

The Choice We’re Making

We’re at an inflection point where the decisions we make as builders will determine whether AI becomes a tool that enhances human knowledge and capability, or one that gradually degrades the quality of information available to everyone.

The models being trained right now might be the last generation to learn primarily from human-created content. What we do with them, and how we choose to build on top of them, will determine what the next generation of AI systems learns.

That’s not just a technical challenge—it’s a responsibility to the future of information itself.

The internet didn’t have to become filled with SEO spam and programmatic content. That happened because of the choices individual builders made, prioritizing short-term gains over long-term ecosystem health.

We have the chance to make different choices with AI. We can build systems that respect the integrity of information while still creating tremendous value for our users and our businesses.

The question isn’t whether we can build profitable AI businesses—we clearly can. The question is whether we can build them in a way that makes the world’s information environment better rather than worse.

I believe we can, but only if we’re intentional about it from the start.

We're the podcast database with the best and most real-time API out there. Check out podscan.fm — and tell your friends!

Thank you for reading this week’s essay edition of The Bootstrapped Founder. Did you enjoy it? If so, please spread the word and share this issue on Twitter.

If you want to reach tens of thousands of creators, makers, and dreamers, you can apply to sponsor an episode of this newsletter. Or just reply to this email!
