The Transcription Challenge: Building Infrastructure That Scales With The World — The Bootstrapped Founder 404


Dear founder,

When I started building the first prototype of Podscan, I very quickly realized this was going to be a different business than any I’d built before. That difference had everything to do with one fundamental challenge: unlike most software service businesses, the resources I would need wouldn’t scale with the number of customers I had, but would scale with something completely out of my control—the number of new podcast episodes being released worldwide every single day.

If you’ve ever looked into Stoicism, you know there are certain things you can control that you should care about, and certain things you cannot control that you shouldn’t fret about. So that’s exactly what I did. I focused on what I could do to make transcribing every single podcast out there a reality, and didn’t complain about the fact that millions of episodes are released every month, with tens of thousands arriving every single day. I just had to deal with it.

Today I want to talk about this Herculean effort of building transcription infrastructure, how I got it from being extremely expensive to manageably cheap, what the trade-offs were along the way, and how much the development of new technologies has impacted the feasibility of this entire project.

From Prototype to Reality Check

For my first prototype, I obviously didn’t try to transcribe everything at once. I had found my source of podcast feed data through the Podcast Index project—an open source effort to list all podcasts everywhere. It’s a free and openly available database of podcasts that includes where they’re hosted, their names, descriptions, and links to their audio files.

The Podcast Index API had a very useful endpoint for trending shows and newly released episodes. So my first prototype just grabbed the most recent or most popular episodes and transcribed those with the resources I already had.

I’d already been experimenting with an open source library called Whisper for a previous project called Podline—a voice messaging tool for podcasts. I’d found that Whisper, which is supposed to run on GPUs, could also run on CPU through a project called whisper.cpp, albeit very slowly. For Podline, where I needed to occasionally transcribe a short one-minute clip, this worked perfectly.

Since Podscan was initially a marketing effort for Podline, I had all the tech lying around. But there was a stark difference in transcription scale: Podline needed to handle the occasional short clip, while Podscan needed to reliably transcribe 50,000 new episodes per day.

The first smart choice I made was treating this as a queue, not something that would happen synchronously. I needed a queue of podcast episodes waiting to be transcribed, and whenever I had time and resources, I’d transcribe the next one. This required a priority system to determine which episodes should be handled first.
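As a rough illustration of the shape of this, here is a minimal priority queue sketch in Python. The scoring fields and weights are invented for the example and are not Podscan’s actual logic (the production system, as mentioned later, is a Laravel application):

```python
import heapq
import itertools
import time

_counter = itertools.count()  # tie-breaker so equal scores pop FIFO

class TranscriptionQueue:
    """Episodes wait here until a worker has transcription capacity."""

    def __init__(self):
        self._heap = []

    def push(self, episode_id: str, popularity: float, published_at: float):
        # Popular and freshly released episodes float to the top.
        # heapq is a min-heap, so the score is negated.
        age_hours = (time.time() - published_at) / 3600
        score = popularity - 0.1 * age_hours
        heapq.heappush(self._heap, (-score, next(_counter), episode_id))

    def pop(self):
        # Called by a worker whenever transcription capacity frees up.
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[-1]
```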

My initial setup was a really basic queue with one consumer—the Mac Studio I was developing on. Running whisper.cpp locally on a Mac was beneficial because it could use the GPU and the Mac’s unified memory. I was getting about 200 words per second, which meant I could process a couple hundred episodes an hour with some parallel processing.
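As a rough sketch, this is what driving whisper.cpp from a worker script can look like. The binary name, model path, and file names are placeholders for however your particular whisper.cpp build is laid out:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def transcribe(wav_path: str) -> str:
    # Shell out to the whisper.cpp CLI; stdout carries the transcript.
    result = subprocess.run(
        ["./main", "-m", "models/ggml-medium.bin", "-f", wav_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# A few workers in parallel, which is roughly how the "couple hundred
# episodes in an hour" throughput mentioned above came together.
with ThreadPoolExecutor(max_workers=3) as pool:
    transcripts = list(pool.map(transcribe, ["ep1.wav", "ep2.wav"]))
```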

The Cloud Infrastructure Journey

Then I realized that to deploy this properly, I needed a transcription server running on a different system. I started exploring companies that offer access to computers with graphics cards.

The first thing I tried was AWS with their G instances—the G stands for graphics, I assume. These were quite expensive virtual machines that didn’t really have much power for the work I was doing. The ones I could afford, costing around $400 a month, just weren’t powerful enough. I would have preferred them to either be cheaper or better, so I quickly stepped away from AWS.

Next, I looked into Lambda Labs, one of the first reliable options for GPU systems. Lambda was helpful because they offered different servers with different GPUs attached. You could rent an H100—one of the most powerful NVIDIA GPUs at the time—for about $1,000 a month, which was obviously expensive. Or you could have an A100 or A10, which were much cheaper and actually perfect for transcription purposes.

I spent a couple months experimenting with my own money, testing whether an A10 would outperform an A100 or H100—not in terms of raw throughput, but in terms of words per dollar. I deployed my transcription system onto different hosts with different graphics cards, running experiments with varying numbers of parallel transcriptions.
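The comparison came down to simple arithmetic along these lines. The throughput numbers below match the ranges reported later in this essay; the monthly prices, apart from the H100’s roughly $1,000, are placeholder assumptions:

```python
SECONDS_PER_MONTH = 30 * 24 * 3600

gpus = {
    # name: (words per second, USD per month)
    "A10":  (175, 500),   # price is a placeholder
    "A100": (190, 800),   # price is a placeholder
    "H100": (240, 1000),  # ~$1,000/month, as noted above
}

for name, (wps, usd) in gpus.items():
    print(f"{name}: {wps * SECONDS_PER_MONTH / usd:,.0f} words per dollar")
```

Run this way, a cheaper card that transcribes nearly as fast wins comfortably on words per dollar, even though it loses on raw speed.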

I found a working solution: I settled on 12 or 16 servers with A10 graphics cards. These became my transcription fleet for a while, but it got quite expensive, which made me realize I needed to do something about the price.

The Great Optimization Discovery

The most effective thing I did was look for hosted servers outside of the services focused on renting out GPUs for AI inference. Those services tend to offer sizeable graphics cards, which is great if you need serious GPU power, but in most cases that isn’t actually needed for transcription.

I found that solution in Hetzner, the German company well known for affordable hosting. Hetzner had just started offering GPU servers through an auction system where you could get really great hardware quite cheaply. They offered servers with RTX 4000 SFF Ada Generation GPUs for around 280 euros—about $300 a month. These servers also included 64 gigabytes of DDR4 RAM and four terabytes of disk space.

The key insight was that transcription has different requirements than other AI inference tasks. You can run transcription quite reliably on just 4 to 20 gigabytes of VRAM, which really isn’t that much, but it’s definitely enough to get the highest quality transcription data.

When I switched all my transcription servers from the A10s to these Hetzner systems, I picked up steam dramatically. It was so much more effective that I could get by with half the number of servers and still have higher throughput than before. That’s where I am now: self-maintained servers running transcription scripts 24/7 on the Hetzner platform, and they’ve stayed highly efficient over time.

Technical Challenges at Scale

As Podscan started gaining customers, those customers brought requirements that default transcription wouldn’t satisfy. I needed diarization—determining which of several speakers is talking in an audio file—and word-level timestamps for precise interactivity.

I switched from whisper.cpp to another implementation running on top of faster-whisper that included both diarization capabilities and granular timing data.
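Here is roughly what the word-level timestamp side of that looks like, using faster-whisper’s actual API. The model size and compute type are illustrative choices, and the diarization layer (typically added on top with a separate speaker model such as pyannote) is omitted:

```python
from faster_whisper import WhisperModel

# Load a Whisper model onto the GPU.
model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# word_timestamps=True yields per-word start/end times in each segment.
segments, info = model.transcribe("episode.mp3", word_timestamps=True)
for segment in segments:
    for word in segment.words:
        print(f"[{word.start:7.2f} -> {word.end:7.2f}]{word.word}")
```

But this switch revealed several surprising technical challenges: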

Diarization Is More Resource-Intensive Than Transcription

Detecting speakers takes longer than actually transcribing what they’re saying. You’d think it would be easier to determine “somebody’s speaking here, then somebody else,” but it’s actually harder to figure out if it’s person one, person two, or person ten than it is to determine the actual words.

From the start, I needed a careful prioritization system that only diarized when really needed. If I know a podcast has only one speaker—and has for the last 200 episodes—I don’t need to diarize it. But if it’s a popular show with different guests all the time, then I need to prioritize it. At scale, turning off diarization means I can transcribe twice as many podcasts in any given day.
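The decision itself can be a very small function. A hypothetical sketch, with the history heuristic invented for illustration:

```python
def should_diarize(recent_speaker_counts: list[int]) -> bool:
    """Is this episode worth the extra diarization cost?"""
    # No history yet: diarize so we can learn the show's pattern.
    if not recent_speaker_counts:
        return True
    # Single speaker for the show's entire known history (say, the
    # last 200 episodes): skip diarization and save the GPU time.
    return max(recent_speaker_counts) > 1
```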

GPU Memory Limits Affect Quality

If you do a lot of parallelized transcription, the GPU reaching its memory limit can cause transcription quality to decline. I initially thought: this graphics card has 20 gigabytes of RAM, each transcription process uses at most 4 gigabytes, so I can run five at a time. That tends to be true most of the time, but if one process runs longer—maybe it’s a three-hour Joe Rogan podcast—and five or six processes are fighting for memory, quality quickly degrades on all of them simultaneously.

I’ve since reduced parallel processes to two or three. There’s a small chance the GPU isn’t fully utilized while a process spins up, but most of the time it’s in full use without quality degradation.
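In practice that cap is just a concurrency limit around the GPU work. A minimal sketch, assuming a hypothetical run_transcription helper:

```python
import threading

# At most three transcriptions touch the GPU at once, so one three-hour
# episode can't push its neighbors over the VRAM budget.
GPU_SLOTS = threading.Semaphore(3)

def transcribe_with_limit(audio_path: str):
    with GPU_SLOTS:
        return run_transcription(audio_path)  # hypothetical helper
```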

Bigger GPUs Aren’t Necessarily Faster

Just because a GPU is bigger and better doesn’t mean it’s faster at transcribing. When I ran transcription on my local machine, then on A10 and A100 GPUs, I got similar results—always between 150 and 200 words per second. Even when I rented an H100 GPU, throughput stayed almost the same, maybe going up to 225 or 250 words per second. But that GPU had five to ten times the monthly cost.

For transcription specifically, it’s way more effective to run this on smaller and maybe slightly slower GPUs at scale. This has turned out to be the only feasible way to do this.

The Cost Reality Check

Here’s the alternative that puts everything in perspective: if I were to transcribe all 50,000 episodes that come in every day using OpenAI’s Whisper endpoint, I would pay a five-figure dollar amount every single day.
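The back-of-the-envelope math, assuming OpenAI’s published Whisper API rate of $0.006 per audio minute and an average episode length of 40 minutes (the average length is my assumption, not a figure from Podscan’s data):

```python
episodes_per_day = 50_000
avg_minutes = 40          # assumed average episode length
price_per_minute = 0.006  # OpenAI's published Whisper API rate

daily_cost = episodes_per_day * avg_minutes * price_per_minute
print(f"${daily_cost:,.0f} per day")  # -> $12,000 per day
```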

After many months of optimizing and experimenting with transcription setups, I’ve turned this into just a few thousand dollars a month in expenses. The cost savings are significant when you run your own infrastructure, even though you can’t run as many parallel processes as you could using Whisper on OpenAI’s platform or other transcription systems like Deepgram.

The daily cost of those commercial models is easily in the thousands of dollars. I’ve gotten it down to just a hundred and change on a per-day basis, which is quite significant. The biggest expense for Podscan at this point isn’t transcription capacity—it’s the database where all this information is stored.

The Data Storage Challenge

When I initially started tracking a couple hundred podcasts, it was totally fine to have my SQL database store all the data without doing anything specific around data storage. But once I turned on the full fire hose of podcast data—all 50,000 episodes a day—this became quite the challenge.

If every transcript is 200 kilobytes to one megabyte in size, then 50,000 daily episodes add somewhere between 10 and 50 gigabytes to the database every single day. If you’re trying to do full-text search or quick lookups with filtering on top of that, this quickly becomes a problem.

I had to build infrastructure that keeps my database from growing out of control. Older transcripts are transferred to S3-based storage and loaded by the main process only when requested. I don’t keep all transcripts in the database—that would easily be six terabytes right now, which is super expensive to maintain and very clunky for database access.

Now all transcripts live on S3 as JSON files and can be loaded on demand for anything older than a couple months. This has been very helpful in ensuring that the database stays at least a little nimble in comparison.
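Conceptually, the archive path looks something like this. A simplified sketch using boto3, with bucket and key names invented for the example:

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "podcast-transcripts"  # hypothetical bucket name

def archive_transcript(episode_id: str, transcript: dict) -> None:
    # Move an older transcript out of the database into S3 as JSON...
    s3.put_object(
        Bucket=BUCKET,
        Key=f"transcripts/{episode_id}.json",
        Body=json.dumps(transcript),
    )
    # ...then clear the transcript column on the database row.

def load_transcript(episode_id: str) -> dict:
    # Fetch an archived transcript back on demand.
    obj = s3.get_object(Bucket=BUCKET, Key=f"transcripts/{episode_id}.json")
    return json.loads(obj["Body"].read())
```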

Quality Challenges and Solutions

There’s no common quality standard in podcasting. Some people record into what feels like a potato, others have extremely high-end setups, and you never reliably know which one you’ll encounter. Transcription systems expect at least a certain level of audio quality and struggle with low-quality recordings or non-speech content like music.

I had to implement a transcription quality checking system that tries to determine if a transcript is acceptable or if we need to re-transcribe it with different settings. Whisper is pretty good by default, but there are edge cases where you need multiple attempts to get it right.
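A quality gate like that can start with cheap heuristics. An illustrative sketch, with thresholds invented for the example:

```python
def transcript_looks_bad(text: str, audio_minutes: float) -> bool:
    words = text.split()
    if not words:
        return True
    # Whisper failure modes often surface as one phrase repeated in a
    # loop, or as implausibly few words for the length of the audio.
    unique_ratio = len(set(words)) / len(words)
    words_per_minute = len(words) / max(audio_minutes, 0.1)
    return unique_ratio < 0.2 or words_per_minute < 30
```

Anything that gets flagged goes back into the queue for another attempt with different settings.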

Whisper struggles with names and brands—anything a human could easily get right from context, but that Whisper, working only from the voice pattern and audio waveform without that context, often gets wrong.

What works really well, but is extremely expensive, is taking the full transcript from Whisper and having an AI do a pass over it with context from the podcast name, description, and maybe prior episodes. You get extremely high-quality transcripts this way, but at scale, this costs several dollars per episode because of the hundreds of thousands of input tokens and expensive output tokens.

Currently, I give Whisper up to 120 tokens of context—the title of the show, the episode title, and the names of people I know will be mentioned. I experimented with giving it all the brand names from Podscan accounts as context, but Whisper actually started finding these words in places where they weren’t spoken—gaslighting me into believing it had found words that didn’t exist. Since then, I only provide context I can reliably infer will appear in that particular episode.
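With faster-whisper, that context goes in through the initial_prompt parameter. The prompt below is invented for illustration and kept deliberately short, mirroring that 120-token budget:

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")
segments, _ = model.transcribe(
    "episode.mp3",
    # Only context we can reliably expect in this specific episode:
    # show title, episode title, and known speaker names.
    initial_prompt=(
        "The Bootstrapped Founder, episode 404: The Transcription "
        "Challenge. Host: Arvid Kahl. Mentions Podscan and Whisper."
    ),
)
```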

Lessons Learned and Current State

The big benefit of this system is that it’s pretty easy to set up. It’s a Laravel application I can deploy through Laravel Forge onto any new server. I have an install script, and then I can spin up a new server quite easily that automatically attaches to my API and starts fetching and transcribing new episodes.

As Podscan’s infrastructure grows, we can quickly add more systems so new episodes are transcribed faster and at higher quality. Eventually, I can increase the number of episodes that get diarized and produce more of the high-quality data that feeds the AI systems my customers rely on.

When I first set up the fleet of servers to transcribe all podcasts everywhere, it probably would have cost me $30,000 a month, even on my own hardware. But I’m now at a point where, through proper optimization and balancing customer needs with expense requirements, I can reliably capture the majority of podcasts for just a couple thousand dollars a month in expenses.

And I think that’s really cool.

The key insight here is that when you’re building a business that scales with factors outside your control—like the global output of an entire medium—you need to think differently about infrastructure, optimization, and trade-offs. Sometimes the most expensive solution isn’t the best one, and sometimes the constraints you think are impossible to work within actually force you into more creative and ultimately better solutions.

This is the kind of challenge that makes building businesses both terrifying and exhilarating. You can’t control how many podcasts get published worldwide every day, but you can control how cleverly you solve the problems that creates.


We're the podcast database with the best and most real-time API out there. Check out podscan.fm — and tell your friends!

Thank you for reading this week’s essay edition of The Bootstrapped Founder. Did you enjoy it? If so, please spread the word and share this issue on Twitter.

