How Your Data Model Shapes Your Product — The Bootstrapped Founder 426


Dear founder,

Last time I'll mention this, I promise: it's Cyber Week, and I'm running my biggest sale of the year. I've slashed my Bootstrapper's Bundle by 50% - that's both my books (Zero to Sold and The Embedded Entrepreneur) as eBooks and audiobooks, plus my complete Twitter course Find Your Following.

Everything I've learned about building, selling, and growing a bootstrapped business, all for $25. Normally over $150. If you've been on the fence, this is the moment. Grab it at tbf.link/bff with the code already applied. Now, let's get into it.

My dear friend Jack Ellis is an unending source of founder inspiration. He has recently started embracing agentic AI coding—something he’d been holding back on for quite a while. I think I’ve mentioned several times on this podcast alone how he and I seem to have quite opposing views on embracing this technology.

But something has clicked, and he’s been diving headlong into it. So over the next couple of months, he’ll hopefully explore it more, and after that, I’ll try to get him on this podcast to talk about his experiences. Let’s give him time to explore it fully.

But Jack said something else recently that I found equally interesting and maybe even more generally applicable to all of us software developers building digital businesses. It’s a quote from a tweet of his: “The biggest mistake I ever made was storing our page views and custom events in different database tables. They’re now on their way to a single table.”

Jack is, of course, referring to his wonderful product Fathom Analytics—something that I highly recommend using for your software products because it’s privacy-forward and very easy to use. But what he’s talking about here is how the data model they chose in the past has been holding them back.

A quick word from our Sponsor, Paddle.com. I use Paddle as my Merchant of Record for all my software projects. They take care of all the taxes, the currencies, tracking declined transactions and updated credit cards so that I can focus on dealing with my competitors (and not banks and financial regulators). If you think you’d rather just build your product, check out Paddle.com as your payment provider.

Jack is no stranger to massive migrations, and if he sees something that needs to be migrated to something better, well, he goes for it. And I find that admirable and instructive, particularly because I’ve been running into the same issue. So that’s what I want to talk about today.

I’ve been building PodScan for almost two years now, and obviously the choices that I made on day one weren’t necessarily the most forward-thinking ones. Because, as we all know in this world of entrepreneurship, we’re all just trying to figure it out as we go along. The meme that often gets mentioned here is that we’re all building our airplane on the way down—plummeting towards the ground, hoping it lifts off before we hit the surface.

I’ve been living with a couple of smart choices and a couple of very ill-considered ones in terms of how the data in my database is represented. And those choices have consequences, particularly for PodScan, where I have millions of podcasts to track and all the data that comes with that.

For each podcast, there is just so much metadata for the show alone—not even thinking about the episodes. Things like chart rankings and reviews, social media profiles linked to it, all this stuff needs to have a place. How you store it, what you store, and how accessible it is really makes an impact on how you can present the data and how you can use it. And then at this point, we’re at over 45 million episodes with transcripts. So not only do we have to store all that text data, but each episode has hosts that need to be extracted and linked, and there are summaries and topics.
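
To give you a feel for the shape of it, here is a rough sketch in Laravel's Eloquent of how that kind of data might hang together. Every name here is illustrative, not PodScan's actual schema:

```php
<?php

use Illuminate\Database\Eloquent\Model;

// Purely illustrative model names, not PodScan's actual schema.
class Podcast extends Model
{
    public function episodes()       { return $this->hasMany(Episode::class); }
    public function chartRankings()  { return $this->hasMany(ChartRanking::class); }
    public function reviews()        { return $this->hasMany(Review::class); }
    public function socialProfiles() { return $this->hasMany(SocialProfile::class); }
}

class Episode extends Model
{
    public function podcast() { return $this->belongsTo(Podcast::class); }
    // Hosts get extracted from transcripts and linked across episodes,
    // so they live in their own table behind a many-to-many pivot.
    public function hosts()   { return $this->belongsToMany(Host::class); }
    public function topics()  { return $this->belongsToMany(Topic::class); }
}

class ChartRanking extends Model {}
class Review extends Model {}
class SocialProfile extends Model {}
class Host extends Model {}
class Topic extends Model {}
```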

There’s just a lot of data. Some of it I structured smartly, some of it I didn’t. And having a data model that is well-designed from the start feels like a really important thing—a smart and important part of having a successful product.

But here’s the thing I want to point out today, something we don’t usually talk about: how you structure your data, how you keep your data around, informs how you think about your product. It can narrow your thinking to the same feature set that everyone with a similar product and a similar data model ends up with, and it can close off possibilities that a different data model, a different way of representing the data, would have opened up.

Let me give you an example. The most basic thing in every software-as-a-service business is user authentication. People come to your website, they’re interested, and then they go to sign up. Now, what do you ask them? You ask them for an email, a name, and a password—or some other means of authentication—and then you persist that.

Now you have a list of users. And all of a sudden, your business, right from the start, is a one-user-per-account business.

This might be useful for most people who use your product, because they are individual people using it in their own ways. But then, all of a sudden, somebody wants to invite someone from their team. Now what are you going to do?

Do you make the team member create a new account as well? Because you don’t really have a team representation yet—you only have accounts. Maybe you have a permission structure that allows you to invite other accounts into individual documents in your product. Or you have an overarching organization structure: whenever a new account is created, an organization (a team) gets created along with it, and the account becomes its owner, who can invite any number of members and make them administrators, editors, or just users.
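
To make that concrete, here is what the team-based shape might look like as a minimal Laravel migration. Every table and column name is my own illustration, not a prescription:

```php
<?php

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

// A minimal, illustrative migration for the "accounts + teams" shape.
return new class extends Migration {
    public function up(): void
    {
        Schema::create('teams', function (Blueprint $table) {
            $table->id();
            $table->foreignId('owner_id')->constrained('users'); // the account that created the team
            $table->string('name');
            $table->timestamps();
        });

        // Membership with a role: admin, editor, or plain member.
        Schema::create('team_user', function (Blueprint $table) {
            $table->id();
            $table->foreignId('team_id')->constrained()->cascadeOnDelete();
            $table->foreignId('user_id')->constrained()->cascadeOnDelete();
            $table->string('role')->default('member');
            $table->unique(['team_id', 'user_id']);
        });
    }

    public function down(): void
    {
        Schema::dropIfExists('team_user');
        Schema::dropIfExists('teams');
    }
};
```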

Even that simple data representation, one that most software businesses have to make a choice about right at the start, depends on who you think your product is for.

If it’s just a B2C product where you have individual users—let’s say a video game, or a fantasy sports tracker where you just want to track your own results—it makes perfect sense not to even think about teams as a data representation.

But the moment you start selling to bigger companies, the moment you start selling to people who are part of a workflow, if you don’t think about teams and organizational structures, you’re building a product that some people won’t be able to use, because they need to involve their coworkers. And maybe more importantly, they expect things to have a team structure.

It’s one of the reasons why I really like the Laravel ecosystem. There are several really high-quality plug-and-play systems for user authentication, and the one that I’ve been using in all of my products, called Jetstream, has a teams option. You can immediately take teams and make them part of your application by just using the --teams flag when you install the plugin, and then every single new account either creates a team or can join a team.
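
From what I’ve seen of it, the team context then comes along for free. Here’s a rough sketch based on Jetstream’s documented helpers (verify against the docs for your version):

```php
<?php

// Installation with team support (run once, shown here as a comment):
//   php artisan jetstream:install livewire --teams

// With Jetstream's HasTeams trait on the User model, every user
// carries team context. Method names follow Jetstream's documented API.
$user = auth()->user();

$team = $user->currentTeam;   // the team the user is currently acting in
$all  = $user->allTeams();    // every team they own or belong to

if ($user->ownsTeam($team)) {
    // owner-only actions: invite members, change roles, delete the team
}

if ($user->belongsToTeam($team)) {
    // regular member-level access
}
```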

It’s quite useful to have this as a default for software-as-a-service businesses, because most people expect this kind of functionality for any significant product. The moment you charge more than a hundred bucks a month, it’s likely that somebody will want a team. That’s kind of what I’ve experienced.

As you can tell, this has nothing to do with the internal data that your business handles. You could be rendering videos, generating emails, or tracking where vehicles go. It doesn’t matter how that data is structured—just the authentication data, just representing who is a user of your product, already impacts what your product can and cannot do.

Then think about billing. Do you bill on an account level? Do you bill on a team level? Do you bill on an organization level, or do you bill per project? Do you even have projects? All of these choices truly matter. And once you make these functional choices, future infrastructure changes have repercussions.

Because if you change infrastructure, you also change the features that touch it. Take an existing feature that needs to be bound to an account—but now, all of a sudden, you have teams that can be accessed by multiple accounts. What are you going to do? Assign the account, or assign the team?
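
One way to keep both questions open (who gets billed, and what a feature is bound to) is a polymorphic relation. Here’s a hedged sketch in Laravel terms, with my own naming rather than any specific billing library’s API, where a subscription can belong to a user, a team, or a project without a schema rewrite:

```php
<?php

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

// Illustrative: one subscriptions table whose "billable" can be a
// User, a Team, or a Project via a polymorphic relation.
return new class extends Migration {
    public function up(): void
    {
        Schema::create('subscriptions', function (Blueprint $table) {
            $table->id();
            $table->morphs('billable'); // creates billable_id + billable_type
            $table->string('plan');
            $table->timestamp('ends_at')->nullable();
            $table->timestamps();
        });
    }

    public function down(): void
    {
        Schema::dropIfExists('subscriptions');
    }
};

// On each model that can be billed:
// public function subscription() {
//     return $this->morphOne(Subscription::class, 'billable');
// }
```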

The data model that you have in mind might not be the data model that you need. So you need to build some internal flexibility as a founder.

At scale, things can get quite problematic. If you think about PodScan as a database of all the podcasts in the world—over four million at this point—and everything they release, which is around 50,000 new episodes a day, obviously those database tables grow massively. They grow pretty linearly, because the world doesn’t just double in size when it comes to podcast adoption, but they do grow quite significantly.

What I experienced quite early on, a couple hundred thousand items in, was that my database—how I would structure my data and how I would make my data easily accessible—really mattered.

I was trying to add indices after the fact because I was building new features on top of my existing data model. Or I was trying to change fields in a database with millions of rows. And all of a sudden, all my queries to the database would stall for a couple of minutes. Because adding an index can sometimes lock the table. Adding a new field or updating a field across millions of rows can lock it, too.
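
For what it’s worth, MySQL’s online DDL can avoid some of those locks: many InnoDB index additions can run with ALGORITHM=INPLACE and LOCK=NONE. Not every operation qualifies, though, and the biggest ones still warrant the more careful approach I describe next. A sketch, with a hypothetical table and column:

```php
<?php

use Illuminate\Database\Migrations\Migration;
use Illuminate\Support\Facades\DB;

// Illustrative: explicitly asking MySQL for a non-locking index build.
// If the operation can't be done in place, MySQL errors out instead of
// silently taking a lock, which is exactly what you want to know upfront.
return new class extends Migration {
    public function up(): void
    {
        DB::statement(
            'ALTER TABLE episodes ADD INDEX episodes_guid_index (guid), ALGORITHM=INPLACE, LOCK=NONE'
        );
    }

    public function down(): void
    {
        DB::statement('ALTER TABLE episodes DROP INDEX episodes_guid_index');
    }
};
```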

That was when I didn’t really know much about blue-green deployments—the idea of running a full copy of the database as a follower of the primary, doing the index work on that follower, and then switching over. It’s something I now do whenever it’s required. But I wasn’t aware of it back then, and I wasn’t aware of how complicated it can be to add certain things to massive MySQL databases.

Adding an index now requires a kind of infrastructure event: I have to make an extra deployment, run the change on that passive copy, and then switch everything over. Otherwise, I have downtime in my service, which I can’t have, because I have an API that needs to be reliable.

Those choices I never really thought about in the beginning. I never thought about the fact that once my podcast episode table has over ten million items, adding an index on a text field might take two days to complete. This stuff doesn’t come up when you’re just starting out.

At a certain point, it’s not just about how data is represented, but also about how it’s accessed—particularly when there is a lot of it. When you’re pulling a lot of data from an API or some other source, the way and the speed at which it can be accessed really makes a difference.

And even if you know a lot about databases and a lot about customer access patterns, there will be something you miss. So from the beginning, I would recommend thinking about how you can play with this data at production scale and see how things behave without actually running it in production. The blue-green deployment approach has been a godsend for me, because otherwise I would have massive downtimes and all the stress associated with that.

Let me share another story from PodScan. Full-text search in regular SQL databases like MySQL or PostgreSQL is pretty good—up until a certain point. And that certain point is reached quite quickly if the text you’re searching is significantly large.

For me, with PodScan’s transcription features, I now have full transcripts for millions of episodes. The full-text search capability of MySQL, no matter how good the system is, just wasn’t working with this. If I were to build a full-text index on all of it, first off, that would take me a couple of weeks. And second, queries would take several hours to complete. Each episode in the system sometimes has up to a megabyte of raw text data. It just cannot be scanned quickly with traditional database approaches.

So I was limited. I wanted full-text search, but I couldn’t do it on my database. I needed to split off that data into a secondary system built for this kind of thing.

Initially, I was using Meilisearch, and it was pretty good up until a certain point as well. Meilisearch is a very fast search system used for instant search in e-commerce systems and similarly well-defined datasets. And it worked.

But after a certain size—I think around 100 gigabytes worth of data—it also became limited. Still extremely fast on the search side, but much, much slower on the ingestion side. So I had to find an alternative solution, which ultimately became OpenSearch—Amazon’s fork of the good old Elasticsearch.

Now at this point, I am splitting off every single episode that I receive. I persist its full transcript so it can be quickly accessed, and then I send it over as an item to an OpenSearch cluster, which takes care of all our searching. That cluster has its own indexed version of all the data, and whenever we want full-text search, we send the request there, populate the results with data from our database, and then show it.
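
In code, that split looks roughly like this. A sketch using the official opensearch-php client, with the host, index, and field names invented for illustration:

```php
<?php

require 'vendor/autoload.php';

use OpenSearch\ClientBuilder;

// A sketch, not PodScan's actual code: mirror transcripts into an
// OpenSearch cluster and hydrate search results from the primary DB.
$client = ClientBuilder::create()
    ->setHosts(['https://search.example.com:9200']) // hypothetical host
    ->build();

// 1) On ingest: the transcript is persisted in MySQL first, then
//    mirrored to OpenSearch as its own document.
function indexEpisode($client, Episode $episode): void
{
    $client->index([
        'index' => 'episodes', // hypothetical index name
        'id'    => (string) $episode->id,
        'body'  => [
            'podcast_id' => $episode->podcast_id,
            'title'      => $episode->title,
            'transcript' => $episode->transcript,
        ],
    ]);
}

// 2) On search: OpenSearch returns matching IDs quickly; the relational
//    database supplies the full rows for display.
function searchEpisodes($client, string $query)
{
    $results = $client->search([
        'index' => 'episodes',
        'body'  => [
            'query'   => ['match' => ['transcript' => $query]],
            '_source' => false, // only IDs needed; rows come from MySQL
        ],
    ]);

    $ids = array_map(fn ($hit) => $hit['_id'], $results['hits']['hits']);

    return Episode::whereIn('id', $ids)->get();
}
```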

It was impossible to do in our own database system, so we had to use an alternative one. Had I insisted that everything be done in my own database, I probably would have found ways to do it, something like dispatching a search and getting a result half an hour later. But obviously, that was not the feature goal.

The product needed to be searchable, so I had to split it off. Which means that now there’s significant complexity in the data model. I still have MySQL as my main database, but I have these additional sources of temporary truth: the search systems. The text is the same as what I receive and send over, but the way we interface with that data and the way we synchronize it is different.

So it’s important to understand that there’s now an additional control flow. If I were to re-transcribe an episode, I need to make sure that the version stored in the OpenSearch cluster is also updated. That synchronizing, checking, and testing—that is complexity that I didn’t think about from the beginning, but now have to deal with.
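
One way to make sure that synchronization never gets forgotten is to hang it off the model lifecycle. Here’s a sketch of the pattern with an Eloquent observer; it’s my own illustration, not PodScan’s actual code, and SearchIndexer is a hypothetical wrapper around the search client:

```php
<?php

// Illustrative: whenever an episode's transcript changes (say, after a
// re-transcription), push the new version to the search cluster as well.
class EpisodeObserver
{
    public function saved(Episode $episode): void
    {
        if ($episode->wasChanged('transcript')) {
            // Could be dispatched as a queued job to keep saves fast.
            SearchIndexer::upsert($episode); // hypothetical helper
        }
    }

    public function deleted(Episode $episode): void
    {
        SearchIndexer::remove($episode->id); // hypothetical helper
    }
}

// Registered once, e.g. in a service provider's boot() method:
// Episode::observe(EpisodeObserver::class);
```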

Even with MySQL data that’s now several terabytes in size, there’s more to think about. The database just grows and grows, but most people only really look at the transcripts of more recent podcast episodes.

That is a choice that probably makes sense—keeping it all in the same database, all equally accessible. But is it a smart business choice? Is it fiscally responsible to store everything in a database you pay for when you don’t ever use or interface with most of it?

I had to make a choice. And I think this happened within the first six months of running PodScan. I started taking older data and putting it into much, much cheaper object storage, but still making it available through a link from the database.

What I built was a way for my podcast episodes to check: is the transcript fully available in the database? If it is, serve it straight from there. If not, follow a link to the storage file that contains the transcript, load it, and write it into the database for a hot minute so we can serve it quickly without having to pull it from object storage all the time.
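
In code, that check-then-hydrate logic could look something like this. A sketch using Laravel’s Storage facade, with the column names and the storage disk being my own invention:

```php
<?php

use Illuminate\Database\Eloquent\Model;
use Illuminate\Support\Facades\Storage;

// Illustrative: serve the transcript from the database when it's there,
// otherwise pull it from cheap object storage and cache it back briefly.
class Episode extends Model
{
    public function getTranscript(): ?string
    {
        // Hot path: the transcript still lives in the database row.
        if ($this->transcript !== null) {
            return $this->transcript;
        }

        // Cold path: follow the stored link to the archived JSON file.
        if ($this->transcript_archive_path === null) {
            return null;
        }

        $payload = Storage::disk('s3')->get($this->transcript_archive_path);
        $text    = json_decode($payload, true)['transcript'] ?? null;

        // Write it back "for a hot minute" so repeated reads stay fast.
        $this->update(['transcript' => $text]);

        return $text;
    }
}
```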

And I also make sure that once podcast episodes reach a certain age, they automatically get transferred to object storage—kind of a colder storage—instead of staying in the database. That works super reliably, and it also saves me a lot of money on millions of podcast episode transcripts that are now just sitting in JSON files somewhere.
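
And the aging-out side could be a scheduled job along these lines. Again a sketch, with the age threshold and all names invented for illustration:

```php
<?php

use Illuminate\Support\Facades\Storage;

// Illustrative: move transcripts of older episodes out of MySQL into
// object storage, leaving only a pointer behind in the database.
$cutoff = now()->subMonths(6); // hypothetical age threshold

Episode::whereNotNull('transcript')
    ->where('published_at', '<', $cutoff)
    ->chunkById(500, function ($episodes) {
        foreach ($episodes as $episode) {
            $path = "transcripts/{$episode->id}.json";

            // Cold storage: a plain JSON file per episode.
            Storage::disk('s3')->put($path, json_encode([
                'transcript' => $episode->transcript,
            ]));

            // Free the database space; keep only the link.
            $episode->update([
                'transcript'              => null,
                'transcript_archive_path' => $path,
            ]);
        }
    });
```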

This was also not part of the initial data model. I didn’t think that I would ever have to think about that kind of optimization.

So that’s what I wanted to bring up today. The way you represent your data either enables you or it limits you—probably both at the same time. And if you find yourself struggling, think about bending not just your application to fit the data model, but also opening up the data model and making it flexible enough to fit what your application needs to become.

Know that it can be done. It requires patience, sometimes infrastructure events, sometimes blue-green deployments. But having a data model that is flexible enough to be changed even at scale and under load becomes quite relevant as a tech stack decision. Build that internal flexibility as a founder to say: okay, this month I’m going to do that migration. It will be worth it.


We're the podcast database with the best and most real-time API out there. Check out podscan.fm — and tell your friends!

Thank you for reading this week’s essay edition of The Bootstrapped Founder. Did you enjoy it? If so, please spread the word and share this issue on Twitter.
