Local AI for Software Founders

Reading Time: 5 minutes

We live in a day and age where running AI on our own servers has become possible. As a business owner and software entrepreneur, I think it's just as interesting as it is important to consider setting up AI systems on your own backend instead of relying on hosted platforms and APIs.

If you’re a software founder interested in using AI technology without depending on someone else’s unit economics, this is for you. Today, we’ll dive into running your own ChatGPT replacement for fun and profit.

I’ve been tinkering with this tech over the last few weeks to great success. So, in the spirit of building in public, why not tell you what I did and how I did it?

In my latest SaaS, PodScan, I use two types of artificial intelligence: an audio-to-text transcription system and an LLM, like ChatGPT, that generates responses based on prompts.

The transcription system is cool but not something every software entrepreneur needs, since it's specific to converting audio into text, and audio is a niche medium for most of us. But all software founders work with text data. Somewhere in our databases, there are customer records, notes, and instructions: it's all text.

Founders got very excited when ChatGPT came out. All of a sudden, particularly once we could access the service through an API, we could build on top of these amazingly “smart” language models.

And that pioneering spirit has brought us to an interesting inflection point. Because the “Open” in OpenAI has been a catalyst for the open-source community.

It turns out that the most exciting development in recent years is not just the existence of ChatGPT but the fact that many universities, research groups, and companies have open-sourced their code for training these models.

And when the nerds start building stuff together in the open, working on public data, for free and without restrictions, interesting things happen.

One of them is llama.cpp. It's a cross-platform framework that lets us run open AI models on our own consumer hardware. Now, we don't have the massive GPUs and RAM amounts that the big guys have, but we get to run tech that's almost as good. And in most cases, it's good enough.

A big contributor here is that we can avoid the costs and dependencies associated with using hosted platforms. In addition to that, we gain more control and flexibility over AI applications for our businesses. So risk goes down, and control goes up.

That’s the indie founder’s dream, right?

Let me share an example from just this week.

Podscan was, until Wednesday, a keyword alerting tool for podcasts. You’d write down a list of words, Podscan would transcribe every newly released podcast out there, compare your list against the transcript, and alert you if there was a match.
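
Mechanically, that keyword matching is as simple as it sounds. Here's a minimal sketch of the idea; the function and variable names are placeholders for illustration, not Podscan's actual code:

```python
def find_keyword_matches(transcript: str, keywords: list[str]) -> list[str]:
    """Return the keywords that appear anywhere in a transcript, case-insensitively."""
    text = transcript.lower()
    return [kw for kw in keywords if kw.lower() in text]

# For each newly transcribed episode, something like:
#   matches = find_keyword_matches(transcript, user_keywords)
#   if matches:
#       send_alert(user, episode, matches)
```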

So far, so good. This is already creating massive value in the world of podcast discovery, which is severely underserved.

But what if you don’t know the keywords beforehand? What if you want to be alerted for something as nebulous as “podcasts where people talk about community events organized by women” or “podcasts where people really nerd out about their favorite Sci-Fi show?”

Even if you wanted to, you couldn’t come up with all the keywords that would allow you to reliably match every podcast that falls into those categories.

But what if you could ask each show a simple question? "Does this episode feature nerds talking excitedly about sci-fi?" That's what local AI allowed me to build. With the help of llama.cpp and an LLM called Mistral 7B, I set up a backend service that takes a transcript and a question and spits out either a "yes" or a "no."

Any transcript. Any question! In under a second per combination.
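
Under the hood, this is little more than a carefully worded prompt plus strict answer parsing. Here's a rough sketch of that idea in Python, assuming llama.cpp is running in its server mode locally on port 8080 (more on that setup below); the prompt wording and helper name are illustrative, not my production code:

```python
import requests

LLAMA_SERVER = "http://127.0.0.1:8080"  # assumed local llama.cpp server

def answer_yes_no(transcript: str, question: str) -> bool:
    """Ask the local model a yes/no question about a podcast transcript."""
    prompt = (
        "You are a strict classifier. Read the podcast transcript below and "
        "answer the question with a single word: yes or no.\n\n"
        f"Question: {question}\n\nTranscript:\n{transcript}\n\nAnswer:"
    )
    resp = requests.post(
        f"{LLAMA_SERVER}/completion",
        json={"prompt": prompt, "n_predict": 3, "temperature": 0.0},
        timeout=60,
    )
    answer = resp.json()["content"].strip().lower()
    return answer.startswith("yes")
```

Keeping the temperature at zero and only accepting an answer that starts with "yes" is what makes the output reliable enough to act on automatically.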

And, most importantly, it runs on the same hardware that my transcription servers are already running on, slurping up new podcast episodes and transcribing them. I don't have to count API calls. I just need a computer with a GPU and 8 GB of RAM.

Cloud hosting for GPUs is still super expensive. You pay around $500 per month for a single server with a GPU. But even a Mac Mini can run this kind of AI inference at roughly one question per transcript per second.

Now, platforms have started to compete on price here. OpenAI's API is a big mover in this space. It's really affordable to use GPT-3.5, the "budget" version, for any task where you need scale. You can get millions of tokens (the chunks of words these models operate on) for under a dollar. That's impressive and fits most budgets.

However, it doesn't fit all budgets. If you deal with lots of data and need to run prompts on that data constantly, like analyzing every podcast out there, GPT-3.5 can cost tens or even hundreds of dollars per day. GPT-4 would easily go into the thousands. That's not scalable for a small business. But running your own local LLM sure is.
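
To make that concrete, here's a rough back-of-the-envelope calculation. Every number in it is an assumption I'm picking for illustration, not my actual volume or anyone's current pricing:

```python
# All numbers below are illustrative assumptions, not real pricing or real volume.
episodes_per_day = 30_000        # assumed number of new episodes analyzed daily
tokens_per_transcript = 10_000   # assumed average transcript length in tokens

budget_price_per_million = 1.0   # assumed $/1M tokens for a "budget" hosted model
flagship_price_per_million = 30.0  # assumed $/1M tokens for a flagship hosted model

daily_tokens = episodes_per_day * tokens_per_transcript      # 300 million tokens/day
print(daily_tokens / 1e6 * budget_price_per_million)          # ~ $300 per day
print(daily_tokens / 1e6 * flagship_price_per_million)        # ~ $9,000 per day
```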

Running your own AI on a server was out of reach for a long time. And the GPU requirement still makes it expensive.

But fortunately, AI workloads come in two forms: inference (applying a prompt and getting a reply) and training. And inference is surprisingly possible on regular old CPUs as well.

Traditionally, over the last few years, all kinds of machine learning and AI work has been done on GPUs because GPUs are designed for massively parallel computation. Recent GPUs have added tensor cores, which accelerate the matrix math used in machine learning, modern games, and other tasks requiring lots of computation.

Your computer’s boring old CPU handles regular computations, and a GPU is much faster for certain tasks. But in recent times, LLMs that used to require a GPU now run hilariously fast on a CPU. So you can run these models on your computer without a graphics card.

Which is what most servers are: computers without GPUs, but with plenty of RAM and a lot of CPU cores. This shift has fueled the growth of an open-source community creating local large language models that run on both kinds of chips.

The .cpp in llama.cpp (and whisper.cpp, its speech-to-text sibling project) stands for C++ and is a sign that these tools were built with CPU-based inference in mind.
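
To see how little ceremony CPU-only inference needs, here's a sketch using the community llama-cpp-python bindings (my assumption; llama.cpp itself is plain C++ with a command-line tool). The model path is a placeholder for whatever GGUF file you've downloaded:

```python
from llama_cpp import Llama  # pip install llama-cpp-python (assumed)

# n_gpu_layers=0 keeps every layer on the CPU; no graphics card required.
llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,       # context window size
    n_gpu_layers=0,   # CPU-only inference
)

out = llm("Q: What does the .cpp in llama.cpp stand for?\nA:", max_tokens=32)
print(out["choices"][0]["text"].strip())
```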

And the open-source nature of these projects has been supported by an unlikely ally.

Meta, the company behind Facebook, released an open source large language model called Llama in 2023. That is BIG! Since OpenAI’s GPT models are proprietary and the company has published only research papers and no code, people have been inspired to build their own models using public data. There’s even a benchmarking system to compare these self-trained models with GPT-3 or GPT-4. And some of these new models come pretty close.

Now, independent companies and open source communities release new large language models daily that perform better than GPT-3.5 and almost as well as GPT-4 in terms of speed and accuracy. These open-source models are available to everyone and help advance the field of AI language processing.

You'll find them all on HuggingFace.co, a website where you can download open-source language models in various forms. I recommend following Tom Jobbins there. Tom's models come in all kinds of formats and have been reliably good. People like Tom share the source material, the model training data, and all the files needed to run them on your computer.
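
If you'd rather script the download than click through the website, the huggingface_hub Python package can fetch a model file directly. The repository and file names below are examples; check the model page for the exact quantization you want:

```python
from huggingface_hub import hf_hub_download  # pip install huggingface_hub

# Example repo/file names; verify them on the model page before downloading.
model_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    local_dir="./models",
)
print(model_path)  # path to the downloaded GGUF file
```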

And that's all there is to it. You download llama.cpp, you compile it, you download a model, and then you're done. llama.cpp lets you run a command-line tool or start a server that loads a large language model into your graphics card's memory or into regular RAM. It then allows you to do local inference through an HTTP API. It even comes with an example web page.
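
Once the server binary is compiled and started with your model (the exact command and flags vary a bit between llama.cpp versions), talking to it is a plain HTTP request. A minimal smoke test, assuming the default port of 8080; newer builds also expose an OpenAI-compatible chat endpoint, and if yours doesn't, the plain /completion endpoint from the earlier sketch works the same way:

```python
import requests

# Query the local llama.cpp server's OpenAI-compatible chat endpoint.
resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "Say hello in five words or fewer."}
        ],
        "max_tokens": 32,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```

Because the endpoint mimics OpenAI's API shape, swapping a hosted model for a local one can be as small a change as pointing your existing client at a different base URL.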

Even if you're not into AI and don't see an immediate use for it, I highly recommend looking into this. llama.cpp detects the best capacity your computer has for running inference. It checks if there's a GPU available, if you have the right drivers, and if you have the necessary toolkits installed on your system. Then, it uses these to maximize efficiency. You'll have as fast and performant an AI as your hardware allows, right on your own personal computer. It's quite magical.

And it’s yours.

This is the year when software entrepreneurs learn how to wrangle control back into their own systems. It's a wild ride, for sure, and things change every day, but there is something incredibly powerful about knowing that OpenAI could implode and shut down its API tomorrow, and my local installation of Mistral, along with the couple of cloud machines I have it running on, would still be mine to command.

Local AI is here to stay.
