Dear founder,

Here’s a question I’ve been sitting with lately: if building software keeps getting easier — and it clearly is — then what exactly are we building our businesses on?

THE BOOTSTRAPPED FOUNDER • EPISODE 437

437: Data Is the Only Moat

Because conversations about the quality of vibe-coded or AI-engineered software aside, it is obvious that with LLM-powered tooling, writing complex software has become significantly easier. It isn’t a solved problem. It’s still a process that requires an orchestrator: somebody who knows what the thing is going to be. And at this point, that isn’t just a technical capacity. It’s product management and customer development intersecting with engineering. But it’s pretty clear that as these tools get better, the process of creating code bases that become products people can purchase or services people can pay for is moving towards a threshold where software engineering and running software businesses as we know them will change quite a bit. It’s still going to be a job that requires insight and capacity, but it doesn’t require ten people anymore to build something meaningful. It might just require three, or two, or maybe just one.

So if it becomes significantly easier to build products, and the act of building a software business is not as expensive anymore because it’s faster and requires fewer resources — then what will be the moats of the future? Of the immediate future, where this change is already happening, and in five or ten years, when AI-generated software products are commonplace and easily built, deployed, and maintained?

Now, there used to be a lot of things we could point to as moats. How hard it is to build a software product that is reliable, usable, and maintainable. How hard it is to translate your knowledge of an industry into a product that serves that industry. But AI systems are taking over a lot of that now. So what is left?

I find that the one thing that stands out, no matter how much AI you throw at it, is real-world data — data that is generated by humans, by human brains.

The Great Data Bifurcation

Data, all by itself, is now experiencing this bifurcation. On one side, there’s the data made by humans. People recording podcast episodes, people putting videos out there, people actually writing their own social media posts or blog posts — anything that is genuinely human-generated. And then there’s the synthetic side: AI-generated images, synthetic voices from text-to-speech systems, completely AI-made movies or YouTube clips, and of course every single spam email written by agentic systems at this point.

One side is becoming more and more valuable — the actual signal, the human data. And the other side is becoming more and more commoditized — the AI-generated slop, as you might call it, or even the AI-generated stuff that actually works, which is increasingly available because models get faster, cheaper, and better. AI-generated data can be valuable, but it is a commodity. Human-generated data is valuable just by the sheer fact that it’s not AI-generated, plus it has its own additional inherent value: the creativity, the effort, the expertise, and the exclusivity that went into creating it.

Human data can usually be generated only by the person who generated it in the first place, because that person is the only one with all the knowledge needed to create that particular piece of data. And since AI, by definition, cannot produce human-generated data, I believe that data (real-world data, mostly human-generated, validated, and cleaned) is the only reliable moat we have as software founders in the near and mid-term future. Let’s say the next decade or so.

What I’m Seeing at Podscan

I’m experiencing this with Podscan right now. The most value that my customers draw from the system is not my capacity to ingest all these RSS feeds or to have an API that responds quickly to their requests. That’s something any minimally instructed agentic system can do. The actual value of the Podscan platform is the fifty million podcast episodes that I have transcribed and had AI systems analyze for content, keywords, themes, and sentiment.

The value is in the additional work — the transcription work, the transformative work that makes the data accessible to others. I am working on top of existing public data, which is every podcast episode out there, and I put it into a shape and a form where other people can consume it for whatever needs they might have. Maybe they want to track brand mentions. Maybe they want to figure out what people are talking about right now. Maybe they want to see what a particular kind of podcast is discussing so they can sponsor it or place an ad. There are many different ways this data can be used.

And I find that the more effort I put into data fidelity, meaning the accuracy and freshness of this data, the more people find value in it. And it doesn’t really matter to them how hard it is to access that data. As long as the data exists, they will find a way to get to it. I can hide it in weird user interfaces. I can make the API clunky and restrictive. Doesn’t matter. They will find a way, because it’s the data that is relevant, not the means of accessing it.

Why Transformative Software Is Vulnerable

Now here’s the flip side. If you run a software business that is purely transformative, one that takes incoming data, does something to it, and sends the result back out, you have a problem. Transformation is something that agentic AI systems are extremely good at. When you say, “Hey, ChatGPT, take this Excel sheet, generate a report, turn it into a PDF, and email it to this account,” it can do this autonomously. Without your input, without anybody else’s input, without needing an external service. It understands how to parse an Excel file, run analytical queries on its contents, render a PDF, and send an email. All of these are tiny steps that somebody has already implemented, or that the system itself can implement on the fly.

Nobody needs a SaaS business built around taking in Excel sheets and emailing out reports. The agentic system can do it by itself.

But when it comes to data collection at scale, agents don’t usually do this. It’s the ephemeral nature of the agentic mode, right? If you spawn a Cursor session or a Claude Code instance or have a conversation in your browser with ChatGPT, the agent exists only during those brief moments of interaction. Everything else is just entries in a database somewhere, just a state in a state machine that’s trying to get you to your results. An agent that constantly scans, constantly does work for you, consumes so many tokens that it becomes almost prohibitively expensive. If you were to spawn an agent that tries to do what Podscan is doing (taking in, transcribing, and analyzing fifty thousand podcast episodes a day), that would cost you tens of thousands of dollars in tokens and API calls per day. In my business, by contrast, I’ve optimized all these processes so they run in the background and make the data available to my potential consumers: API customers, people using the website, or people using the MCP integration so it can be plugged into agents directly.
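To make that cost claim concrete, here is a back-of-the-envelope sketch. The fifty thousand episodes per day comes from the paragraph above; the 10,000 dollar figure is only an assumed lower bound for "tens of thousands of dollars," not a measured number.

```python
EPISODES_PER_DAY = 50_000      # figure from the essay
DAILY_COST_LOW_USD = 10_000    # assumption: lower bound of "tens of thousands"


def implied_cost_per_episode(daily_cost_usd: float, episodes_per_day: int) -> float:
    """Cost per episode implied by a daily spend and a daily episode volume."""
    return daily_cost_usd / episodes_per_day


per_episode = implied_cost_per_episode(DAILY_COST_LOW_USD, EPISODES_PER_DAY)
monthly_low = DAILY_COST_LOW_USD * 30  # rough 30-day month

print(f"Implied cost per episode: ${per_episode:.2f}")
print(f"Implied monthly spend:    ${monthly_low:,}")
```

Under those assumptions, the agent would be spending at least twenty cents per episode, every day, just to stay current. That is the gap an optimized background pipeline is meant to close.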

Being the system of record, having data that is meaningful — that’s why it’s so valuable to my customers. If all I did was offer the capacity to “give me a URL and I’ll transcribe and analyze it,” then I think Podscan would be two hours away from being completely replaced by a well-written skill inside Claude Code or something similar. But since Podscan collects all this data — freshness data, chart rankings, additional metadata from all over the place, from sources most people don’t even know can be accessed — that is the system-of-record nature. That is pulling together data and making it comprehensive and accessible to other systems.

So when I say “data moat,” I mean you really need to make sure that whatever you offer has its own internal data source that adds value. And on the other side, you need to actually use that data and make it available. Having data is half the moat. Making it available is the other half.

Make Your Business API-First

Making your software business an API-first business is probably one of the smartest things you can do today. You have to ask yourself: is there anything in my application that I could make available through an API? And when I talk about APIs, keep in mind that MCP is usually just another layer on top of an API. Most frameworks that offer MCP capability, and I’m thinking about Laravel here, let you layer it on top of the REST API you already have. So whether we’re talking about programmatic access, MCP, APIs, webhooks, whatever it might be, it’s the same thing. You make your system reliably connectable to other computers.
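Here is a minimal sketch of that layering idea, written in plain Python rather than any particular framework. One core function holds the logic, and two thin adapters expose it: one shaped like a REST handler, one shaped like a tool call that an MCP layer might wrap. Every name here (CATALOG, search_podcasts, rest_search, tool_search) is hypothetical, and the catalog is made-up sample data.

```python
import json

# Hypothetical in-memory catalog standing in for a real data store.
CATALOG = [
    {"id": 1, "title": "The Bootstrapped Founder"},
    {"id": 2, "title": "Some Other Show"},
]


def search_podcasts(query: str) -> list[dict]:
    """Core logic: case-insensitive title search. Knows nothing about transport."""
    q = query.lower()
    return [p for p in CATALOG if q in p["title"].lower()]


def rest_search(params: dict) -> tuple[int, str]:
    """REST-style adapter: query params in, (status code, JSON body) out."""
    query = params.get("q", "")
    if not query:
        return 400, json.dumps({"error": "missing q"})
    return 200, json.dumps({"results": search_podcasts(query)})


def tool_search(arguments: dict) -> dict:
    """Tool-style adapter: named arguments in, structured result out."""
    return {"results": search_podcasts(arguments["query"])}


status, body = rest_search({"q": "bootstrapped"})
print(status, body)
print(tool_search({"query": "bootstrapped"}))
```

Because both adapters call the same function, a capability added to the core shows up on every surface at once, which is the whole point of treating them as the same thing.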

What I find more and more interesting, and I see a lot of founders talk about this because they themselves are consumers of other software businesses: people are looking for software where there is near-parity between what you can do in the user interface and what you can do on the API side. The more parity there is — the more people can do the exact same things they do in the interface through the API — the more likely they are to really buy into your product, because it means they can automate anything. And every agent out there wants to allow them to automate whatever they need.

I have this ongoing effort in my system where every couple of days, I run a sub-agent of Claude Code on my code base to update a central file I call my platform parity tracking file. Every single function that I have in the product gets a row in a table, and I note: can I do this in the UI? Can I do this on the REST API? Can I do this through MCP? If it’s available on all three, it’s complete. If not, it’s a candidate for more work. It could be something as simple as “search for a podcast” or something as complicated as “configure a keyword alert that automatically adds new mentions of your brand to a list that triggers a webhook.” If I can do it in any part of the platform, it should be possible in each of these systems I offer.
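A parity table like that can be modeled in a few lines. This is only a sketch of the idea, not the actual file format: the Feature class, the parity_gaps helper, and the availability flags in the sample rows are all assumptions made up for illustration.

```python
from dataclasses import dataclass

# The three access surfaces named above.
SURFACES = ("ui", "rest", "mcp")


@dataclass(frozen=True)
class Feature:
    """One row of a hypothetical parity table."""
    name: str
    ui: bool
    rest: bool
    mcp: bool

    def missing(self) -> list[str]:
        """Return the surfaces where this feature is not yet available."""
        return [s for s in SURFACES if not getattr(self, s)]

    def is_complete(self) -> bool:
        """A feature is complete when it is available on all three surfaces."""
        return not self.missing()


def parity_gaps(features: list[Feature]) -> dict[str, list[str]]:
    """Map each incomplete feature name to the surfaces it still lacks."""
    return {f.name: f.missing() for f in features if not f.is_complete()}


# Illustrative rows only; the availability flags are made up.
features = [
    Feature("search for a podcast", ui=True, rest=True, mcp=True),
    Feature("configure a keyword alert", ui=True, rest=True, mcp=False),
]

gaps = parity_gaps(features)
for name, surfaces in gaps.items():
    print(f"{name}: missing on {', '.join(surfaces)}")
```

Each entry in gaps is a candidate for more work, which is the list an agent or a human could pick from next.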

Having an agent that can find candidates to implement helps me increase my parity, and I’m hoping that eventually there will be enough tooling to have these things worked on automatically in the background. That’s a bit of wishful thinking at this point. But the principle is important: usage parity between human users, computer users, and agentic users. Agentic users are almost a hybrid of the other two: a human telling an automated system what to do, with semi-autonomous effort to connect to your system. All three of these should be equally well served.

Your Metadata Is Your Moat

And this doesn’t have to be podcast data. Whatever your product touches, think about the metadata you collect when people use your platform, or when you observe the consequences of using your product. If you have a tool that lets people post to Twitter or Facebook, it could be the times when most people post, or when most people engage, or what kind of content drives the most engagement. That kind of stuff — that metadata — is your unique data moat.
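For the social posting example, deriving that kind of metadata could look roughly like this. It is a sketch under stated assumptions: the event log, its timestamps, and its engagement counts are invented sample data, and engagement_by_hour is a hypothetical helper.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical event log: (ISO timestamp of a post, engagement count).
events = [
    ("2024-05-01T09:15:00", 12),
    ("2024-05-01T09:45:00", 30),
    ("2024-05-01T17:05:00", 50),
    ("2024-05-02T09:30:00", 18),
]


def engagement_by_hour(rows):
    """Return {hour: (post_count, average_engagement)} for the given rows."""
    totals = defaultdict(lambda: [0, 0])  # hour -> [post count, engagement sum]
    for timestamp, engagement in rows:
        hour = datetime.fromisoformat(timestamp).hour
        totals[hour][0] += 1
        totals[hour][1] += engagement
    return {h: (n, total / n) for h, (n, total) in totals.items()}


stats = engagement_by_hour(events)
busiest_hour = max(stats, key=lambda h: stats[h][0])  # hour with the most posts
best_hour = max(stats, key=lambda h: stats[h][1])     # hour with the highest average engagement

print(f"Most posts go out at {busiest_hour}:00")
print(f"Best average engagement at {best_hour}:00")
```

In this sample the busiest hour and the best-performing hour differ, and that kind of insight only exists because the platform saw every post go out.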

So figure out what this is for you. Make sure you have it. Make sure you have it connected. And then make sure you make it accessible in ways that improve how people use your product and how they run their own businesses. That’s the data moat.

We're the podcast database with the best and most real-time API out there. Check out podscan.fm — and tell your friends!

Arvid Kahl
