Dear founder,
Today I want to talk about something that’s been on my mind lately – and probably yours too if you’re building anything with AI. It’s the age-old question: should you run language models locally or just use APIs?
Now, I know what you’re thinking. “Arvid, local AI vs. APIs has been debated to death.” But here’s the thing – most of these conversations happen in a vacuum, full of theoretical scenarios and benchmark comparisons. What I want to share today is what I’ve actually learned from building real businesses with AI, making real decisions with real constraints, and sometimes making the wrong choices and learning from them.
My Journey: From Local-First to API-Reality
Let me start with a confession. When I first started building Podscan, I was convinced I had to do everything with local language models. I mean, as a bootstrapped founder, the idea of keeping costs low and maintaining control was incredibly appealing. So I dove in, trying to handle everything locally.
But here’s what happened – and this might sound familiar if you’ve been down this road yourself. The cost savings that platforms like OpenAI and Anthropic have achieved through their scale made it pretty clear that, for my workload and data volume, it made no sense to rent more and more GPU resources to run local language models.
At this point, it became more effective for me to go API-first and consider local deployment later, either as a fallback if APIs went down or if I developed specific capacity requirements.
The Sweet Spots: When Local Actually Makes Sense
Now, don’t get me wrong – there absolutely are situations where local LLMs shine. Let me break these down based on what I’ve actually experienced:
Small, Fast Decisions
The first sweet spot is when you have very small tasks that need quick decisions – things that don’t require complex reasoning or extensive context. For example, if you just need a reasonable choice between options that can’t be determined by simple Boolean logic or basic algorithms, a small language model with fast inference, even on CPU, can be perfect.
Instead of making a network call and dealing with API latency, you can get your answer instantly on your local server.
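To make that concrete, here’s a minimal sketch of such a yes-or-no call, assuming the llama-cpp-python bindings and a small quantized model sitting on disk. The model path and the prompt are placeholders for illustration, not what Podscan actually runs:

```python
# Minimal sketch: a tiny yes/no decision on CPU with llama-cpp-python.
# Model path and prompt are illustrative placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="models/small-3b-q4.gguf",  # any small quantized model
    n_ctx=512,       # a tiny context window keeps CPU inference fast
    n_threads=4,
    verbose=False,
)

def mentions_product_launch(snippet: str) -> bool:
    """One small judgment call that plain Boolean logic can't make."""
    prompt = (
        "Does the following podcast snippet mention a product launch? "
        "Answer with only 'yes' or 'no'.\n\n"
        f"Snippet: {snippet}\nAnswer:"
    )
    out = llm(prompt, max_tokens=3, temperature=0)
    return out["choices"][0]["text"].strip().lower().startswith("yes")

print(mentions_product_launch("We just shipped version 2.0 of our tool."))
```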
Audio Transcription at Small Scale
Here’s a concrete example from my own business. The first AI feature I built was for transcribing audio clips for Podscan. This doesn’t necessarily require an inference call to a remote API – it can easily run on the local CPU of a server, provided that the number of parallel operations doesn’t exceed what you can handle at your current business stage.
For these cases, it makes perfect sense to keep the processing on your local server with local models and avoid costs that would come from using someone else’s resources.
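For illustration, here’s what that kind of CPU-only transcription can look like. I’m using the faster-whisper package here purely as an example, and the file name is a placeholder:

```python
# Sketch: CPU-only transcription with faster-whisper (illustrative choice).
from faster_whisper import WhisperModel

# "base" is a small model that runs acceptably on CPU; int8 keeps memory low.
model = WhisperModel("base", device="cpu", compute_type="int8")

def transcribe(path: str) -> str:
    segments, _info = model.transcribe(path)
    return " ".join(segment.text.strip() for segment in segments)

print(transcribe("episode_clip.mp3"))  # placeholder file name
```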
The CPU vs GPU Reality Check
This experience also taught me something interesting about CPU versus GPU inference. There are certain ways of using these models where there’s barely a difference in speed between CPU and GPU – specifically when you’re dealing with low context windows and low token prompts.
If you just need a yes or no answer, for example, CPU and GPU are almost equal in terms of speed. But the moment you do multimodal work – images, audio, deep analysis – then a GPU-based system becomes much more effective.
The key insight? If you only have 10 of these operations a day and you can wait for the results, you don’t necessarily need to go to a remote API.
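If you want to see this effect for yourself, a crude timing comparison looks something like the sketch below. It assumes a llama-cpp-python build with GPU support, and the model path is a placeholder:

```python
# Crude timing sketch: the same tiny prompt on CPU vs. with all layers on the GPU.
import time
from llama_cpp import Llama

PROMPT = "Answer with only yes or no: is this sentence in English?\nAnswer:"

for label, gpu_layers in [("cpu", 0), ("gpu", -1)]:  # -1 offloads every layer
    llm = Llama(
        model_path="models/small-3b-q4.gguf",
        n_ctx=256,
        n_gpu_layers=gpu_layers,
        verbose=False,
    )
    start = time.perf_counter()
    llm(PROMPT, max_tokens=2, temperature=0)
    print(f"{label}: {time.perf_counter() - start:.2f}s for a two-token answer")
```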
The Scale Reality Check
But here’s where things get interesting – and where I learned some hard lessons. Scale changes everything.
The Context Window Problem
My real challenge came when working with large transcripts. Inference on large transcripts really scales with the size of the context that’s provided. If I have a three-hour conversation in text form and want to run some kind of analytics on it, it turns into a very long computation process, whether on GPU or CPU.
I was using llama.cpp with LLaMA 3B or 7B parameter models, and while this was perfectly fine at a scale of a couple hundred operations a day, it started becoming a bottleneck at a couple thousand a day. At tens of thousands a day, it became completely unbearable.
Unit Economics Don’t Lie
Here’s the brutal truth: the unit economics of remote API platforms offering AI inference mean they’re often much more reliable and much cheaper than running your own infrastructure at scale.
The economies of scale that companies like OpenAI and Anthropic achieve are simply impossible to replicate when you’re running a smaller operation. You cannot scale an inference cluster to the same levels they can.
Privacy and Control: The Non-Negotiables
Now, there are some scenarios where you absolutely need local LLMs, regardless of cost or convenience.
Compliance Requirements
If you have customers that require SOC 2 compliance or any kind of privacy-based compliance, they will very likely not allow you to send data to external systems outside of your business.
The fine print of API usage terms for these platforms often includes the fact that they can use your data to train their systems. For customers with strict privacy requirements, this is a non-starter.
Data Control and Model Control
Another argument for local systems is control – both data control and model control. You control what model runs, what version, and when. Nobody can turn off your model because they need to update their hardware or because they’ve decided to discontinue support.
You can also tune your AI systems to your customers’ specific data, deploying exactly the right model with exactly the right tuning on exactly the right data.
The Practical Decision Framework
So how do you actually make this decision? Based on everything I’ve learned, here’s the framework I use, first as a set of questions and then as a quick sketch in code below:
Start with These Questions:
- Scale: How many operations per day are you running? If it’s under a few hundred, local might work. Thousands? Probably API time.
- Speed Requirements: Can you wait a minute or two for results? Local might work. Need instant responses? APIs are your friend.
- Privacy Constraints: Do you have customers with strict compliance requirements? That might force your hand toward local.
- Context Size: Are you working with large documents or transcripts? The larger the context, the more APIs make sense.
- Resources: Do you already have GPU infrastructure? If not, the upfront investment might not be worth it.
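Here’s that heuristic as a quick sketch. The thresholds are illustrative, not hard rules, so adjust them to your own numbers:

```python
# Illustrative decision heuristic; the thresholds are rough, not gospel.
def choose_inference(ops_per_day: int, needs_instant_response: bool,
                     strict_compliance: bool, avg_context_tokens: int,
                     has_gpu_infra: bool) -> str:
    if strict_compliance:
        return "local"        # compliance requirements force your hand
    if ops_per_day > 1000 or avg_context_tokens > 8000:
        return "api"          # scale and large contexts favor the big platforms
    if needs_instant_response and not has_gpu_infra:
        return "api"          # no GPUs and no patience means API
    if ops_per_day < 300:
        return "local"        # small, waitable workloads can stay on your server
    return "api"              # when in doubt, API-first with a local fallback

print(choose_inference(ops_per_day=50, needs_instant_response=False,
                       strict_compliance=False, avg_context_tokens=1500,
                       has_gpu_infra=False))  # -> "local"
```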
My Current Approach
Here’s what I do now: I go API-first for most things, but I always have a local fallback option. This gives me the best of both worlds – the speed and cost-effectiveness of APIs for normal operations, with the security of local processing when needed.
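In code, that pattern is roughly the sketch below. The model name, paths, and timeout are placeholders, not my actual stack:

```python
# Sketch of "API-first with a local fallback"; names and paths are placeholders.
from openai import OpenAI
from llama_cpp import Llama

api = OpenAI()  # reads OPENAI_API_KEY from the environment
local = Llama(model_path="models/small-7b-q4.gguf", n_ctx=4096, verbose=False)

def complete(prompt: str) -> str:
    try:
        response = api.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model choice
            messages=[{"role": "user", "content": prompt}],
            timeout=10,
        )
        return response.choices[0].message.content
    except Exception:
        # The API is down, slow, or rate-limited: degrade to the local model.
        out = local(prompt, max_tokens=256)
        return out["choices"][0]["text"]
```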
The Standardization Advantage
One underappreciated benefit of APIs is standardization. There are OpenAI-specific parameter configurations you might need, but these map easily onto different providers. With healthy competition and services like AWS Bedrock hosting AI models, you’re not locked into a single provider.
This standardization also makes it easier to switch between providers or fall back to local models when necessary.
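For example, the same OpenAI-style client can point at a hosted provider or at a local llama.cpp server that exposes compatible endpoints. The URL and model names below are placeholders:

```python
# Sketch: one OpenAI-compatible client, two backends (remote and local).
from openai import OpenAI

remote = OpenAI()  # api.openai.com, key from the environment
local = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

def summarize(client: OpenAI, model: str, text: str) -> str:
    response = client.chat.completions.create(
        model=model,  # e.g. "gpt-4o-mini" remotely, your GGUF model locally
        messages=[{"role": "user", "content": f"Summarize this: {text}"}],
    )
    return response.choices[0].message.content
```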
That said, in my experience, writing prompts for local models is much more finicky than working with the streamlined, well-documented APIs. But you’ll figure it out either way.
Conclusion: It’s About Trade-offs, Not Ideology
Look, the choice between local LLMs and APIs isn’t about being ideologically pure or following the latest trends. It’s about making practical decisions based on your specific constraints and requirements.
If you’re just starting out, my advice is simple: start with APIs. Get your product working, validate your market, and understand your scale. You can always move to local later if your specific situation demands it.
The key is to be honest about your constraints – both technical and business – and make the choice that serves your customers best. Sometimes that’s local, sometimes it’s APIs, and sometimes it’s a hybrid approach.
At the end of the day, the best AI infrastructure is the one that helps you build a sustainable business and serve your customers effectively. Everything else is just implementation details.
We're the podcast database with the best and most real-time API out there. Check out podscan.fm — and tell your friends!
Thank you for reading this week’s essay edition of The Bootstrapped Founder. Did you enjoy it? If so, please spread the word and share this issue on Twitter.
If you want to reach tens of thousands of creators, makers, and dreamers, you can apply to sponsor an episode of this newsletter. Or just reply to this email!