AI Best Practices for Bootstrappers (That Actually Save You Money) — The Bootstrapped Founder 425


Dear founder,

Before we dive in today: it's Black Friday week, and I'm running my biggest sale of the year. I've slashed my Bootstrapper's Bundle by 50% - that's both my books (Zero to Sold and The Embedded Entrepreneur) as eBooks and audiobooks, plus my complete Twitter course Find Your Following.

Everything I've learned about building, selling, and growing a bootstrapped business, all for $25. Normally over $150. If you've been on the fence, this is the moment. Grab it at tbf.link/bff with the code already applied. Now, let's get into it.

I recently realized something while building Podscan, my podcast database system that does a lot of background data extraction and analysis for my users. I’ve stumbled upon a couple of AI integration best practices that a lot of people might not be fully aware of. So today, I want to dive into the concepts I found not just useful, but essential for maintaining and operating mission-critical data handling with LLMs and AI platforms.

A quick word from our Sponsor, Paddle.com. I use Paddle as my Merchant of Record for all my software projects. They take care of all the taxes, the currencies, tracking declined transactions and updated credit cards so that I can focus on dealing with my competitors (and not banks and financial regulators). If you think you’d rather just build your product, check out Paddle.com as your payment provider.

I was reminded of the AI practices I have established by a tweet I read from Greg Eisenberg. He said that keeping up 100% with all the new AI tools and models is impossible, and that something that works today might fail tomorrow.

That observation has been my biggest learning, and I turned it into not just a process, but an implementation style. Let me share what I’ve built.

The Migration Pattern

Wherever I use an AI call—be that to a local model or a cloud model on OpenAI, Anthropic, Google Gemini, wherever it might be—I always have a migration pattern implemented in the code.

I extract all of my API calls into services. These services internally handle the connection, the prompt massaging and prompt construction, and the specific prompt I want for each API call. And these services always operate in what I call a state of permanent migratability: they can always use the latest version and the latest model, or at the very least, the version of the prompt and model that I used previously.

I realized I needed this when I started implementing GPT-5 a couple of months ago, after having used GPT-4.1 for a long while. The first experiments were horrible. A lot of API changes had happened, and a lot of nuanced parts of the prompt didn’t work as well as they did before. My prompt that was perfectly crafted for GPT-4.1—which it had run on for at least half a year—was just not reusable for GPT-5.

Here’s what happened in the details: my prompt for GPT-4.1 relied on the JSON format expectation that’s part of the OpenAI API, where you can say, “I want JSON to come back,” and you’re guaranteed to get JSON. Everything in the prompt was aimed at creating that JSON most effectively.

When I migrated that call to GPT-5, it would still create JSON data, but it would drop certain keys. It wasn’t as reliable as the 4.1 version of that prompt. I tried to figure out the reason, and it turned out that GPT-5 was more aimed at using structured JSON schemas. Instead of just saying “this is going to be JSON,” you say “this is the exact JSON that you’re going to output,” and you define the keys and the potential values. Then you explain in your prompt how these are filled. It’s fundamentally similar to what it was before, but technically different.
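To make that difference concrete, here’s a rough Python sketch of both styles using the OpenAI SDK’s Chat Completions interface. The schema, field names, and prompt text are illustrative stand-ins rather than my actual Podscan prompts, and the exact parameter shapes are worth verifying against OpenAI’s current docs.

```python
from openai import OpenAI

client = OpenAI()
transcript = "..."  # the full episode transcript

# Old style (GPT-4.1 era): JSON mode. You only get the guarantee "valid JSON
# comes back"; the shape of that JSON lives entirely in the prompt text.
legacy = client.chat.completions.create(
    model="gpt-4.1",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system",
         "content": "Return a JSON object with keys 'guests' and 'sponsors'."},
        {"role": "user", "content": transcript},
    ],
)

# New style: Structured Outputs. The keys and types are declared up front as a
# JSON schema, and the prompt only explains how to fill them.
structured = client.chat.completions.create(
    model="gpt-5",
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "episode_analysis",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "guests": {"type": "array", "items": {"type": "string"}},
                    "sponsors": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["guests", "sponsors"],
                "additionalProperties": False,
            },
        },
    },
    messages=[
        {"role": "system",
         "content": "Extract the guests and sponsors from the transcript."},
        {"role": "user", "content": transcript},
    ],
)
```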

Since I didn’t just want to deploy new experiments with a new model and hope they produced the same data, I came up with this migration strategy: for either a certain time period or for my testing and staging environments, I could say, “Use the old prompt with the old model, and then use the new prompt with the new model.” Maybe even with a completely different data structure and a completely different API call—because OpenAI also wants people to move away from the chat-completions-style API to the Responses API.

I would log both results, see if there were major differences, and if there were, the system would tell me and show me the diff, making it accessible for debugging. Then it would respond with the new data to whatever function or procedure called it.
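Here’s a minimal sketch of what such a dual-running service can look like in Python. The function names, the MIGRATION_MODE flag, and the logging setup are hypothetical placeholders for whatever your codebase already uses:

```python
import json
import logging
from difflib import unified_diff

logger = logging.getLogger("ai.migration")

# Hypothetical toggle: switched on in staging, or temporarily in production
# while a new prompt/model combination is being evaluated.
MIGRATION_MODE = True


def _call_legacy_model(transcript: str) -> dict:
    """Old prompt, old model, old API style."""
    ...  # wire up the actual legacy API call here


def _call_new_model(transcript: str) -> dict:
    """New prompt, new model, possibly a different API entirely."""
    ...  # wire up the actual new API call here


def analyze_transcript(transcript: str) -> dict:
    """Service entry point: always answers with the new result, but while
    MIGRATION_MODE is on, also runs the legacy version and logs any diff."""
    new_result = _call_new_model(transcript)

    if MIGRATION_MODE:
        old_result = _call_legacy_model(transcript)
        old_lines = json.dumps(old_result, indent=2, sort_keys=True).splitlines()
        new_lines = json.dumps(new_result, indent=2, sort_keys=True).splitlines()
        diff = list(unified_diff(old_lines, new_lines, "legacy", "new", lineterm=""))
        if diff:
            logger.warning("Migration diff for this call:\n%s", "\n".join(diff))

    return new_result
```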

The servers running this dual approach incurred twice the cost. But I was able to see what the old model did, what the new model did, and how the differences between prompts would impact the quality, predictability, and reliability of the data.

I found this extremely useful for almost anything you do with AI tool calls where you’re using an LLM to get some kind of response or data—unless it’s purely conversational, where this probably doesn’t matter as much. But if you’re doing background analysis, number crunching, or semantic analysis, it really helps to have this migration pattern where, for testing purposes, you can run both versions and see both results.

If you find that the new results aren’t as good as you thought, you can always roll back and just use the legacy version. Then you can work on your testing and staging systems to get the new version ready—because you know you’ll have to migrate eventually. These API systems will be deprecated at some point anyway.

It’s always quite helpful to do a diff on the JSON data that comes back just to see what’s missing, what’s different. Often, that will inform how you can reshape your prompts. You might even be able to have Claude Code or OpenAI Codex do this for you: take the data and change prompts until both results are similar to each other.

This migration pattern has made it into every single service I have that communicates with outside machine learning models. It’s been a godsend to see what the differences are, how things react differently, and to understand how nuanced your prompts have to be to get reliable data. Because once you’ve set up a prompt and it works, you want to use it for as long as possible. The migration pattern helps with that. And it also helps that if the new service suddenly experiences a degradation, with a flip of a switch, you can go back to the old one and get your data as reliably as you did before.

The Service Tier Secret

Now, that’s about handling differences between models. But I’ve also realized something else—something the people at OpenAI haven’t exactly hidden, but haven’t made very apparent in the documentation either.

Some of these cloud services offer what we might call service tiers: different priorities for how important your API call is, and you pay accordingly. If you’re a developer who reads documentation thoroughly, you probably spotted this quite early. It took me a couple of months to figure it out.

For OpenAI, every standard request you send is billed in the default tier—that’s the price you find on the website. But if you look a little bit closer, you see that they have batched requests, which are significantly cheaper. Batching isn’t always useful for people doing on-the-spot analysis, because you don’t know when your batches will be full and when they’ll be processed. For anything that has a semi-synchronous nature, where you send the request and expect to work with the result shortly after, batching isn’t the most useful approach.

But what I failed to realize is that between the default tier and batching, there’s another tier. For OpenAI, it’s called Flex. And that tier is effectively billed at half the price.

Half the price. With the caveat that it might be slower, might take 50% longer, and at certain points might not be available at all. But it’s still the same exact model and the same exact data quality.

Once I figured this out, I knew I was going to use the Flex tier for all my backend stuff—all the things that happen behind the scenes. For Podscan, that’s extracting who’s the guest on the show, what they’re saying, summarizing these things. The fast part of Podscan is transcribing and making content fully searchable. The not-as-fast part is doing all this extra analysis.

If I can save 50% of my cost on this, I can do twice as much. That means twice as much value for my customers for the exact same price I would pay.

So I implemented this into all my extraction and inference jobs. Since OpenAI has implemented auto-retries in their own SDKs, I didn’t even need additional logic for that part. What I did implement was a fallback: if a request ran into a 429 error—which rarely happens—it would set the service tier for that particular job to the next highest one, which is standard. Then it would retry. So it tries a couple times with Flex to save money, and if that doesn’t work, it sets it to default priority. You pay twice as much, but you get it done.
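In Python, that fallback looks roughly like this, assuming the OpenAI SDK’s service_tier parameter and its RateLimitError exception; the retry count and the prompt are placeholders:

```python
from openai import OpenAI, RateLimitError

client = OpenAI()

FLEX_ATTEMPTS = 2  # hypothetical: how often to try the cheap tier before paying full price


def summarize(transcript: str) -> str:
    """Try the half-price Flex tier first; escalate to the default tier if it stays congested."""
    for attempt in range(FLEX_ATTEMPTS + 1):
        # Last attempt: give up on Flex and pay the default price to get it done.
        tier = "flex" if attempt < FLEX_ATTEMPTS else "default"
        try:
            response = client.responses.create(
                model="gpt-5",
                service_tier=tier,
                instructions="Summarize this podcast transcript as a short paragraph.",
                input=transcript,
            )
            return response.output_text
        except RateLimitError:
            # The SDK's built-in retries have already run; if Flex is still
            # congested, loop around and escalate the tier.
            continue
    raise RuntimeError("Request failed even on the default tier.")
```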

I’ve implemented this at scale, and it led to an immediate 50% price reduction, because the Flex tier is available most of the time. And because I have so many input tokens and so few output tokens—I throw a whole transcript of a podcast, hours’ worth of text, at the model just to get a couple of JSON elements of summarized data—the cached tokens on that tier are also half price. That’s significant.

This then allowed me to immediately double the amount of extraction and inference I do, which increases the overall quality of the platform. That’s apparent not just to my paying users but also to my prospects, and it ultimately leads to higher conversion.

So in your API calls, check each platform you use. Look for the service tiers and consider how long you can allow something to process. Batch pricing costs just as much as Flex processing on the OpenAI side, but if you can’t batch and you need some kind of asynchronous-but-somewhat-synchronous response, go for the Flex tier.

To be clear: if you can batch and have the infrastructure for it, that’s great too. Flex pricing currently exists for the GPT-5, GPT-5.1, o3, and o4 models. Models like Codex, Pro, GPT-4o, and the real-time audio tools might not have Flex pricing as readily available. It depends on the model, so you need to figure this out on a per-platform basis.

But if those price tiers exist and you’re not using them, that’s financial negligence. It’s very simple to set the service tier. It’s very simple to default it back to something that will work. And if you have stuff that really needs to get through even when there’s a lot of congestion, you can set the service tier to Priority, which costs double but will get you results even quicker.

You have to understand that Priority might not exist for certain models either. That’s really what Greg Eisenberg was getting at: there are so many models, and so many different ways of using them, that you really have to be flexible in implementation and in how you optimize for this.

Front-Loading for Cache Efficiency

This brings me to something a bit more well-known, but something I’ve found really makes a difference if you have a lot of data to analyze: front-load the thing that repeats itself.

Front-load your system prompt. If you analyze the same bunch of data multiple times, start your prompts with that data and make it verbatim the same every time you use that piece of data.

For me, that’s the full transcript. I start with the system prompt: “You are a transcript analysis expert. You do these kinds of things.” Then I paste the transcript. Then at the end of the transcript, I say the specific thing that needs to be done—sometimes it’s looking for a particular kind of guest, or finding the sponsors, or asking a customer’s question on it, like “Is this about the thing the customer really cares about?”

But always front-load the actual data, because those will be cached tokens. Those cost you 10% of what the tokens that differ between prompts might cost.

I know my data case is a special one because Podscan has these massive transcripts that many questions are asked on. But if you’re doing anything meaningful where you throw data into an AI system for analysis, put the data that might repeat itself first, and put the stuff that doesn’t repeat itself after.

To make the order absolutely clear: system prompt describing what the thing is, then system instructions that are always the same (“you will extract data in this format”), then the data that’s potentially duplicated between prompts, and then the specific instruction for each prompt. That’s the order, and that will cost-optimize your prompts.
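As a quick sketch of that ordering in Python (the role text and questions are illustrative, not my real prompts), the stable prefix goes first so the provider’s prompt caching can reuse it across requests:

```python
def build_messages(transcript: str, specific_instruction: str) -> list[dict]:
    """Stable, cacheable prefix first; the per-request instruction last."""
    return [
        # 1. Role description: identical for every request.
        {"role": "system", "content": "You are a transcript analysis expert."},
        # 2. Static instructions: also identical for every request.
        {"role": "system", "content": "You will extract data in the agreed JSON format."},
        # 3. The big repeated payload: the same transcript, verbatim, every time.
        {"role": "user", "content": transcript},
        # 4. Only this part changes between prompts, so only this part misses the cache.
        {"role": "user", "content": specific_instruction},
    ]


# Example: several questions asked against the same (cached) transcript prefix.
questions = [
    "Who are the guests on this episode?",
    "Which sponsors are mentioned, and where?",
    "Is this episode about AI cost optimization?",
]
# messages_per_question = [build_messages(transcript, q) for q in questions]
```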

It’s also very helpful if you have multiple prompts you want to check. You can take those and feed them as data into a Claude Code conversation or a ChatGPT conversation with the express instruction to analyze these prompts and see if they can be optimized. You’ll get insight from the AI that you can use. I wouldn’t take it verbatim—the AI is still just predicting tokens, and just because it says something might be more useful doesn’t mean it is. But if you have a lot of prompts and let the AI check ten of them at the same time and tell you what you could do better, you’ll get meaningful insight from that.

Rate Limiting and Circuit Breakers

Finally, since we’re talking about using external platforms that cost on a per-token basis: build a lot of rate limiting into these systems.

Rate limit the customer-facing actions that trigger AI interactions. And rate limit the AI interactions that can be sent from your backend server. You don’t want a race condition that restarts the same process over and over again, triggering the same call repeatedly. Even if you’re using cached tokens, it will cost you.
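Here’s a minimal sliding-window rate limiter sketch in plain Python. In production you’d likely back this with Redis or your framework’s built-in rate limiter; the keys and limits here are made up:

```python
import time
from collections import defaultdict, deque

# Hypothetical limits: tune these per feature and per account tier.
MAX_CALLS = 20          # allowed AI calls...
WINDOW_SECONDS = 3600   # ...per key per hour

_calls: dict[str, deque] = defaultdict(deque)


def allow_ai_call(key: str) -> bool:
    """Sliding-window check, keyed by account, subscriber tier, or IP."""
    now = time.monotonic()
    window = _calls[key]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()            # drop calls that fell out of the window
    if len(window) >= MAX_CALLS:
        return False                # over budget: block, alert, or queue instead
    window.append(now)
    return True


# Usage: guard every code path that triggers a billable AI request.
# if not allow_ai_call(f"account:{account_id}"):
#     raise TooManyAIRequests()
```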

Make sure that if something is out of order—if 10x the normal amount of AI tokens is being consumed in any given hour—you’re made aware of it, and you have the ability to stop it.

A full-on circuit breaker for all AI features in your application might be a good idea. If you have something like Laravel, which has commands you can run on your server, or maybe you have an admin interface where you can flip a switch on or off—you might want to implement a full circuit breaker for all AI tools or for specific parts of the application.

Say users can do some self-onboarding and click a “Generate a Cool Config” button. You might want to be able to turn that on and off. If somebody has found a way to automate that and send hundreds of dollars’ worth of cost your way per hour, you want to switch it off with a toggle. A feature toggle in your backend—not just in the frontend, but in the backend, right where it happens.
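Sketched in Python, that backend kill switch can be as simple as a flag check right before the billable call. The flag store below is an in-memory dict standing in for whatever your admin interface or framework actually flips (a Redis key, a database row, a Laravel config value):

```python
# Hypothetical flag store: in reality this lives in Redis, the database,
# or your framework's feature-flag system, and is flipped from an admin UI
# or a server-side command.
FEATURE_FLAGS = {
    "ai.all": True,              # global circuit breaker for every AI feature
    "ai.generate_config": True,  # per-feature switch, e.g. the onboarding button
}


class AIFeatureDisabled(Exception):
    pass


def require_ai_feature(feature: str) -> None:
    """Backend-side gate: refuse the call if the global or per-feature flag is off."""
    if not FEATURE_FLAGS.get("ai.all", False):
        raise AIFeatureDisabled("All AI features are currently switched off.")
    if not FEATURE_FLAGS.get(feature, False):
        raise AIFeatureDisabled(f"The '{feature}' AI feature is currently switched off.")


# Usage, right before the billable call happens (in the backend, not the frontend):
# require_ai_feature("ai.generate_config")
# result = generate_cool_config(user_input)
```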

Do this wherever you think you might need it. Then have an AI scan your code and tell you where else you need to put it, because you might forget. The AI will scan through all the files for you. It’s a good idea to have an agentic coding tool just for this purpose: Where are the potential AI call abuse situations? And can we have a feature toggle there that prevents this from being used if you, as an administrator, turn it off?

This is a very important feature, and I highly recommend it for everything—which also means that all AI usage has to go through your backend system. You cannot have the client side call an AI platform directly with your token. That’s generally not a good idea to begin with, but really: do not do this. Always funnel it through your backend system so you can reliably turn these features on and off and get alerted when usage is high.

I have several of these things: rate limits on all these features, frontend user rate limits, backend rate limits, feature toggles, and alerts for when a feature gets abused—per account, per subscriber type, and per IP as well.

These things need to be built in. I have a feeling that tools and frameworks will implement this in the near future as kind of a baseline. But for right now, we need to implement this ourselves as founders.

The AI landscape changes constantly. The models change, the APIs change, the pricing changes. Your job isn’t to keep up with everything—that’s impossible. Your job is to build systems that can adapt when things inevitably shift. Migration patterns, service tier optimization, prompt caching, rate limiting, and circuit breakers. These aren’t just nice-to-haves. They’re the foundation of running AI in production without losing your shirt.

Build for change. That’s the real best practice.

And that’s it for today.

Thank you for listening to The Bootstrapped Founder.

Find me on Twitter @arvidkahl.

Everything I just talked about—the migration patterns, the service tiers, the rate limiting—that’s all running under the hood at Podscan, my podcast monitoring platform. If you need to know when someone mentions your brand across 4 million podcasts, that’s what it does. And for founders still searching for their next thing, I’ve set up ideas.podscan.fm where AI pulls validated problems straight from expert conversations. Sometimes the best ideas are just listening to what people are already asking for.

Thank you for listening. Have a wonderful day, and bye bye.
