Dear founder,
As indie hackers, we often start our journey with simple dreams and manageable datasets.
But what happens when success hits and your modest database transforms into a data behemoth?
Let me share my experience scaling Podscan from a simple podcast database to a system handling millions of podcasts, tens of millions of episodes, and over a million new tracking rows daily.
🎧 Listen to this on my podcast.
The Calm Before the Storm
When you’re dealing with a few hundred thousand rows and maybe 50 tables, life is good. Databases are manageable, queries run smoothly, and everything feels under control. But success brings scale, and scale brings challenges. In Podscan’s case, it wasn’t just about storing podcast metadata – we’re talking about millions of podcasts, their episodes, transcripts, tracking charts, and topic analysis. What started as a straightforward database quickly evolved into an intricate web of interconnected data that needed to be instantly accessible.
The real challenge isn’t just the technical side – it’s juggling database architecture and maintenance while wearing all the other hats as a solo founder. When you’re handling marketing, sales, customer support, and product development, the last thing you want is your database becoming a full-time job. Yet that’s exactly what can happen if you don’t plan ahead.
The Index Imperative
The first reality check came when simple queries like “give me the latest item” started taking hours. That’s when I learned the critical importance of database indices. For any database at scale, indices are non-negotiable – they’re your fast-access lifeline to frequently queried data like creation dates or podcast IDs.
Here’s the catch: while setting up indices during table creation is painless, adding them to existing tables with millions of rows is a different beast entirely. Combined indices are even worse, requiring full table scans that can freeze your production system for extended periods. Adding columns with default values poses similar challenges – you’re essentially rewriting millions of rows.
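In MySQL terms, the difference looks roughly like this. The table and column names below are placeholders for illustration, not Podscan's actual schema:

```sql
-- Single-column index on a frequently filtered or sorted column
-- (hypothetical table and column names):
ALTER TABLE episodes ADD INDEX idx_episodes_published_at (published_at);

-- Combined (composite) index for "latest episodes of a given podcast".
-- Adding this to a table that already has millions of rows is the expensive part:
ALTER TABLE episodes ADD INDEX idx_episodes_podcast_published (podcast_id, published_at);
```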
I learned this lesson the hard way when trying to add a new index to track podcast episode publishing patterns. What I thought would be a quick database update turned into a potential 20-minute downtime situation – completely unacceptable for a service providing API access to customers.
The Blue-Green Salvation
Initially, I resigned myself to scheduled downtime for maintenance. But as Podscan grew and we started offering API access, downtime became unacceptable. That’s when I discovered AWS RDS’s blue-green deployments – a game-changer for database maintenance.
The concept is brilliant: create a copy of your existing database as a follower, perform your time-intensive operations on this “green” version while the “blue” original keeps serving requests, then switch over when ready. AWS RDS handles this elegantly through DNS-based switching, resulting in minimal disruption. With 10-20 requests per second hitting Podscan, we typically see fewer than 25 errors during switchover – a testament to its reliability.
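On RDS, the whole dance comes down to a couple of CLI calls. This is a rough sketch; the deployment name, ARN, and identifiers are placeholders:

```bash
# Create the green copy from the existing (blue) instance. Name and ARN are placeholders.
aws rds create-blue-green-deployment \
  --blue-green-deployment-name podscan-maintenance \
  --source arn:aws:rds:eu-central-1:123456789012:db:podscan-production

# Run the slow migrations against the green endpoint, then switch over:
aws rds switchover-blue-green-deployment \
  --blue-green-deployment-identifier <deployment-id>
```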
For particularly challenging operations, I’ve found Percona’s toolkit invaluable. Their pt-online-schema-change tool has saved me countless hours by allowing structural changes without blocking reads and writes. It creates a copy of the table with the new structure, then copies the data across while keeping the two in sync, so normal operations continue throughout. This approach has been particularly useful for full-text index creation, where adding the index to an existing table with millions of rows could otherwise take weeks.
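As a sketch, adding a full-text index this way looks something like the following. The host, database, table, and column names are made up for illustration:

```bash
# Dry run first; swap --dry-run for --execute once you're satisfied with the plan.
pt-online-schema-change \
  --alter "ADD FULLTEXT INDEX ft_transcript (transcript)" \
  h=my-rds-endpoint,D=podscan,t=episode_transcripts \
  --dry-run
```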
The Great Storage Cleanup
Recently, I faced an interesting challenge that perfectly illustrates the maintenance complexities at scale. I’d built a system to move old transcripts to Amazon S3, keeping only frequently accessed content in the database. The logic was simple: if an episode is older than a few months and rarely accessed (maybe once a week or month), why keep its transcript in our primary database? However, I discovered that deleting this data didn’t reclaim space – my AWS RDS storage still showed nearly 5 terabytes used when the actual data was only 2 terabytes.
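The archival step itself is straightforward. Here's a minimal sketch in Python, assuming boto3 and a standard MySQL cursor; the bucket, key scheme, and table layout are placeholders rather than Podscan's actual setup:

```python
import boto3

s3 = boto3.client("s3")

def archive_transcript(cursor, episode_id: int, transcript: str):
    """Move a rarely accessed transcript to S3, then clear it from the
    primary database. Bucket, key scheme, and schema are placeholders."""
    s3.put_object(
        Bucket="podscan-transcript-archive",
        Key=f"transcripts/{episode_id}.txt",
        Body=transcript.encode("utf-8"),
    )
    # Clearing the large text column frees space logically, but as noted above,
    # the underlying InnoDB tablespace doesn't shrink on its own.
    cursor.execute(
        "UPDATE episode_transcripts SET transcript = NULL WHERE episode_id = %s",
        (episode_id,),
    )
```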
The solution? Running an OPTIMIZE TABLE command on a blue-green deployment. This operation took over three days to complete, but it successfully compressed our storage needs by 3 terabytes, significantly reducing our monthly AWS bill. Of course, this meant paying for double the storage during the optimization period – but the long-term savings made it worthwhile.
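The command itself is trivial; it's the runtime that isn't. On InnoDB, OPTIMIZE TABLE rebuilds the table behind the scenes, which is exactly why pairing it with a blue-green deployment matters. The table name here is illustrative:

```sql
-- Rebuilds the table and reclaims the space left behind by deleted rows.
OPTIMIZE TABLE episode_transcripts;
```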
Search at Scale: The Synchronization Dance
Full-text search presented another scaling challenge. While MySQL offers full-text indexing, setting it up on millions of text-heavy rows after the fact can take weeks. I opted for Meilisearch as an external search solution, self-hosted on a Hetzner server for cost efficiency, but this introduced a new complexity: synchronization.
With 60,000 new transcripts arriving daily, each potentially containing hours of transcribed text, keeping the search engine in sync with the main database became a significant engineering challenge. Add multiple indices for different content types (podcasts, topics, descriptions), and you’re looking at a substantial data pipeline that needs careful management.
The synchronization process revealed its own bottlenecks. Meilisearch’s ingestion can only handle so much data at once, and when you’re dealing with hours of transcribed text per episode, you need to carefully manage your indexing strategy. I’ve had to build sophisticated queuing and batching systems just to keep everything in sync without overwhelming either system.
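The batching itself doesn't have to be fancy. Here's a minimal sketch in Python using the official meilisearch client; the host, key, index name, and fields are assumptions for illustration:

```python
import meilisearch

# Placeholder host, API key, and index name.
client = meilisearch.Client("http://search.internal:7700", "MASTER_KEY")
index = client.index("episodes")

def sync_transcripts(rows, batch_size=200):
    """Push freshly transcribed episodes to Meilisearch in small batches,
    so a single huge payload never overwhelms the ingestion queue."""
    batch = []
    for row in rows:
        batch.append({
            "id": row["episode_id"],
            "podcast_id": row["podcast_id"],
            "title": row["title"],
            "transcript": row["transcript"],
        })
        if len(batch) >= batch_size:
            index.add_documents(batch)  # returns a task you can poll for completion
            batch = []
    if batch:
        index.add_documents(batch)
```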
Performance Tuning and Query Optimization
One lesson I’ve learned repeatedly is that even high-performance databases can sometimes make surprisingly poor decisions. MySQL’s query optimizer, while generally reliable, doesn’t always pick the optimal index for a query. I’ve had situations where a simple query would run for minutes until I explicitly hinted at which index to use – after which it executed in milliseconds.
This became particularly evident when dealing with podcast chart tracking data. With over a million new rows being added daily, even seemingly simple queries could become problematic without proper optimization. I’ve learned to regularly audit query performance and not shy away from using index hints when the optimizer consistently makes sub-optimal choices.
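When that happens, a FORCE INDEX hint is often all it takes. Table and index names here are illustrative:

```sql
-- Without the hint, the optimizer occasionally picks a poor plan;
-- with it, this returns in milliseconds.
SELECT *
FROM podcast_chart_entries FORCE INDEX (idx_chart_podcast_charted_at)
WHERE podcast_id = 42
ORDER BY charted_at DESC
LIMIT 50;
```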
Back That Data Up
When it comes to protecting my data from catastrophic disaster, I use a multi-pronged approach. First off, I leverage AWS RDS’s snapshot feature, which creates regular backups of the full database, ready to be restored at any point. But that’s a backup stored on the very platform it was made for, and relying on it alone is a form of platform risk. That’s why, on a regular basis, I do a full data dump, download it, and store it off-platform. Since the database is now terabytes in size, this takes a while, and it costs a bit to store, but it’s worth it to have the data right with me and spread across several cloud storage providers. Recovering from such a dump would take a solid day, but that beats not having off-site backups at all.
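The dump itself is nothing exotic, just mysqldump piped into compressed storage and shipped elsewhere. Hostnames, credentials, paths, and the rclone remote below are placeholders:

```bash
# Stream a consistent dump into a compressed file (host, credentials, and paths are placeholders).
mysqldump \
  --single-transaction --quick \
  -h my-rds-endpoint -u backup_user -p"$DB_PASSWORD" podscan \
  | gzip > /backups/podscan-$(date +%F).sql.gz

# Ship it to one or more providers outside AWS, e.g. via rclone:
rclone copy /backups/podscan-$(date +%F).sql.gz offsite-storage:podscan-backups/
```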
Lessons Learned
For fellow indie hackers starting their database journey, here’s my advice: start with a hosted, managed database service that’s cost-effective but reliable. Stick to the basics – PostgreSQL or MySQL. While PostgreSQL offers attractive features like vector storage (great for AI applications) and native location data support, MySQL has served Podscan well, though it means finding alternative solutions for certain features. It costs me just a few thousand dollars a month to run a sizeable database on AWS and a few hundred to run my own Meilisearch search server. All in, that’s under $3k a month once storage is optimized for size.
Keep your implementation database-agnostic initially. Avoid special plugins or vendor-specific features until you have the revenue to commit long-term. And always ensure you can quickly deploy read replicas – they’re lifesavers during traffic spikes. I’ve found that being able to quickly spin up a read replica during high-traffic periods has saved us multiple times, particularly when search engines suddenly started crawling our content more aggressively.
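Spinning one up on RDS is a single call, which is exactly why it works as an emergency valve. Instance identifiers below are placeholders:

```bash
# Create a read replica from the primary instance and point read-heavy traffic
# at it once it becomes available. Identifiers are placeholders.
aws rds create-db-instance-read-replica \
  --db-instance-identifier podscan-read-1 \
  --source-db-instance-identifier podscan-production
```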
If I were starting Podscan today, I might choose PostgreSQL for its built-in features like vector storage and location data support. But the key lesson isn’t about which database to choose – it’s about understanding that your needs will evolve as you scale. The database that serves you well at launch might need to be complemented with additional services as you grow.
Remember, what starts as a simple database today might need to handle millions of records tomorrow. Plan for scale, but don’t over-engineer. Focus on building something that works now while keeping the door open for future optimization. After all, that’s what indie hacking is all about – growing and adapting as your success demands it.
The most important thing I’ve learned through this journey is that database management at scale isn’t just about technical solutions – it’s about finding the right balance between performance, maintenance effort, and cost. As a solo founder, you need solutions that not only work well but also don’t require constant attention. Sometimes that means paying a bit more for managed services, and other times it means investing time in automation. The key is knowing which battles to fight and which ones to defer until they truly matter for your business.
If you want to track brand mentions, search millions of transcripts, and keep an eye on chart placements for podcasts, please check out podscan.fm — and tell your friends!
Thank you for reading this week’s essay edition of The Bootstrapped Founder. Did you enjoy it? If so, please spread the word and share this issue on Twitter.
If you want to reach tens of thousands of creators, makers, and dreamers, you can apply to sponsor an episode of this newsletter. Or just reply to this email!