My SaaS Server Exploded (& How I Salvaged It)

Reading Time: 5 minutes

Earlier this week, I finally found time to work on a large code change. When I deployed it, things got worse quickly. Here’s what happened, why it happened, and how this changed my approach to working on a complicated software project.

One thing became apparent: while it may look like a waste of time to build something and then revert to “yesterday’s version”, I gained massively interesting insights into my product and my process along the way.

So today, we’ll talk about setbacks, getting back up again, and how we can judo a failure into making progress.

Here’s the full story.

I finally had a day to work on a massive refactor of Podscan’s transcription queuing system. If that sounds too technical, consider today’s article a nerdy deep-dive into the heart of where my business builds its value. I’ll explore the safeguards and mental exercises I use to deal with sizeable setbacks.

And if you’re interested in what a solopreneur’s tech complexity can look like, stick around. I’ll spare no details.

So, what happened?

On Monday this week, for the first time in almost three weeks, I had a full day to myself to just spend on building software. I was at MicroConf two weeks ago, and the following week was full of calls I had scheduled to catch up on the week I’d missed while at the conference. There were just a lot of interruptions throughout the last couple of weeks, which meant I didn’t have any meaningful long block of time to spend on my software business, Podscan.

So, after those two weeks, I chose a day where I would have a full day of opportunity to work on software. For that, I had to reconfigure my schedule. For the last couple of months, I’ve been very open with my schedule. People could book calls almost every day of my week, mostly because I really wanted the early Podscan users to find the perfect time for them to talk to me. The idea was that I could hop on a call with a potential or paying customer whenever it suited them. I’m in a super early stage, and every little piece of feedback helps. Every early customer interaction with positive feedback is something I can build on later. I wanted to open my schedule to everyone, and I did.

However, this meant that I was constantly interrupted, as people scheduled their conversations with me whenever it best suited them, but never me.

Meanwhile, I also had a podcast to record, interviews to organize and research, and guests to invite to the show. I have to manage editing and transcribing, and on certain days all that extra work adds up to overhead that eats into the time I could spend building software.

So, I decided I’d had enough of the open schedule and reduced my calls to two days a week at most.

All my customer service conversations and customer discovery calls now happen on Friday. All my podcast recording calls happen on Thursday. I used to have one interview call a day. Now, I have as many as people want to fit into that day with some time in between. So, I’ll stack my calls on Thursday and Friday, freeing up the rest of the week to write code, do marketing, and sales without constant interruptions.

After setting up this new schedule, I set a day to work fully on my software project. I finally had that day on Monday this week, so I chose the biggest project of all—an “eat the frog” moment.

I chose a complete refactor of the heart of my operation.

Podscan ingests thousands of hours of audio data every hour, transcribes it into text, and scans for keywords to send to an API. It’s a lot of data coming in, and it’s not a very flexible system. Whenever my backend servers have capacity, they ask my main server if there’s anything available. My main server then checks the database for an available project and sends it over. That’s how it works.
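In Laravel terms, that pull model boils down to something like the sketch below. To be clear, this is a simplified illustration, not Podscan’s actual code; the Episode model, the status values, and the /work/next route are assumptions I’m making for the example.

```php
<?php
// routes/api.php: a simplified sketch of the pull model, not Podscan's
// actual code. The Episode model, status values, and route are assumptions.

use App\Models\Episode;
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Route;

Route::post('/work/next', function () {
    // Hand out the oldest episode that is still waiting for transcription.
    // The transaction plus row lock keeps two backend servers from
    // claiming the same episode at the same time.
    return DB::transaction(function () {
        $episode = Episode::where('status', 'pending_transcription')
            ->orderBy('created_at')
            ->lockForUpdate()
            ->first();

        if (! $episode) {
            return response()->json(['episode' => null]);
        }

        $episode->update(['status' => 'transcribing']);

        return response()->json(['episode' => $episode]);
    });
});
```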

This setup has grown over time, and it felt like it could be made more extensible and less convoluted. I thought I needed an internal queue system—a system of many queues that would make it easier to move individual podcast episodes through all the stages of analysis.

Podscan works like this: I get an audio file, transcribe it (step one), run inference on it to extract certain information and build a summary (step two), and then scan for keywords and send alerts (step three). I have a few more steps planned, but I was starting with these three. Right now, they all kick off each other. When an audio file is available for transcription, it’s presented on the API. A backend server fetches it and responds with the full transcript, which kicks off the inference step. A server handles the inference and sends the results back. Then, keyword scanning starts on the main server.

It’s a back-and-forth system, and it works.
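Sketched out, one of those hand-offs looks roughly like this: the main server receives a finished transcript and immediately kicks off the inference stage. Again, the controller, job, and column names here are made up for illustration, not lifted from Podscan.

```php
<?php
// app/Http/Controllers/TranscriptController.php: a rough sketch of the
// back-and-forth flow. Controller, job, and column names are assumptions.

namespace App\Http\Controllers;

use App\Jobs\RunInference;
use App\Models\Episode;
use Illuminate\Http\Request;

class TranscriptController extends Controller
{
    // A backend server posts the finished transcript back to the main server...
    public function store(Request $request, Episode $episode)
    {
        $episode->update([
            'transcript' => $request->input('transcript'),
            'status'     => 'transcribed',
        ]);

        // ...which immediately kicks off the next stage.
        RunInference::dispatch($episode);

        return response()->noContent();
    }
}
```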

But there’s always room for improvement, right? I thought about creating three queues in which candidates live, according to their stage of completeness. Instead of looking in my database, I’d just look into the queue.

Simple enough.
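Conceptually, the plan amounted to one queued job per stage, with each job pushing the episode onto the next queue once it finishes. Here’s a minimal sketch of that idea, with made-up class and queue names, not the code I actually shipped:

```php
<?php
// app/Jobs/TranscribeEpisode.php: a minimal sketch of the queue-per-stage
// idea, not the shipped code. Class names, queue names, and the chaining
// are made up for illustration.

namespace App\Jobs;

use App\Models\Episode;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;

class TranscribeEpisode implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public function __construct(public Episode $episode) {}

    public function handle(): void
    {
        // ...hand the audio to a backend server, wait for the transcript...

        // Finishing one stage pushes the episode straight onto the next
        // queue, instead of the next stage polling the database for work.
        RunInference::dispatch($this->episode)->onQueue('inference');
    }
}
```

Episodes would enter the pipeline via TranscribeEpisode::dispatch($episode)->onQueue('transcription'), and a dedicated worker pool would drain each queue.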

I built this system over the course of a few hours, and locally, it worked well. I spent quite some time on the edge cases: What if there’s an error while the transcript is being created? What if I have to retry it, but there’s another step that needs to be done first?
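For the retry side of things, Laravel’s queue already offers most of the machinery. A sketch of the kind of edge-case handling I mean, again with hypothetical names and thresholds, might look like this:

```php
<?php
// app/Jobs/RunInference.php: a sketch of the edge-case handling, with
// made-up names and status values; not Podscan's actual code.

namespace App\Jobs;

use App\Models\Episode;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Throwable;

class RunInference implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public int $tries = 3;    // retry a failed job up to three times
    public int $backoff = 60; // wait a minute between attempts

    public function __construct(public Episode $episode) {}

    public function handle(): void
    {
        // Guard: inference only makes sense once a transcript exists. If a
        // retry lands here too early, put the job back on the queue for later.
        if (empty($this->episode->transcript)) {
            $this->release(120);
            return;
        }

        // ...run inference, store the summary, kick off the scanning stage...
    }

    public function failed(Throwable $exception): void
    {
        // After the final attempt, park the episode so it can be re-queued
        // manually instead of silently disappearing.
        $this->episode->update(['status' => 'inference_failed']);
    }
}
```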

Fortunately, my work has some structure. I built this sizeable feature on a separate branch using the GitFlow model. I have a development branch for more complicated experiments so I can work on them while fixing bugs in the main branch. After about six hours of diving into the code and testing intensely, I deployed it to my main server. And it worked. There were a couple of bugs, but I could quickly fix them. I tweeted about how well it worked, although I was nervous because it’s the central functional part of Podscan.

If the data ingestion and transcript engine doesn’t work, then nothing works: the API and notifications would break down. It needed to function well, so I let it run and checked my metrics. Gradually, fewer episodes landed in the database. Usually, we handle 2,500 episodes an hour. Now, it dropped to 2,200, then 1,900, and eventually 300. Something was wrong in the system, preventing queue items from moving forward, and I couldn’t see where the problem was.

I couldn’t just play around and delete jobs; they all needed to work. It was a live system, after all. So, I rolled back six hours of work and returned to the morning’s version.

I tweeted about that too. Sharing setbacks is part of my approach to building in public. I was frustrated not just by the system failing but also by feeling like I had wasted a day. I went upstairs to talk to Danielle, and as I explained my frustration, I realized something: I may have spent a few hours writing code I won’t use, but I had learned a lot about both my product AND my approach to making improvements. They both had properties I was not aware of before. Yes, the queue system didn’t work, but I learned how complicated my existing system is, and how many interference points exist.

I also learned my feature specification process was incomplete. Adding another abstraction layer wasn’t required. The queuing system wasn’t necessary; my existing state machine was efficient enough if I ensured reliable state transitions. If I ever add more steps, I’ll improve the current state machine instead of replacing it.
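“Reliable state transitions” is less exotic than it sounds. It can be as simple as a guard on the episode model that refuses to move an episode into a state it shouldn’t reach from where it currently is. A sketch, with assumed state names and an assumed transition map:

```php
<?php
// app/Models/Episode.php: a sketch of guarded state transitions.
// The state names and the allowed-transition map are assumptions.

namespace App\Models;

use Illuminate\Database\Eloquent\Model;

class Episode extends Model
{
    // Each state may only move forward to the next stage (or to 'failed').
    private const TRANSITIONS = [
        'pending_transcription' => ['transcribing', 'failed'],
        'transcribing'          => ['transcribed', 'failed'],
        'transcribed'           => ['inferring', 'failed'],
        'inferring'             => ['inferred', 'failed'],
        'inferred'              => ['scanning', 'failed'],
        'scanning'              => ['done', 'failed'],
    ];

    public function transitionTo(string $newStatus): bool
    {
        $allowed = self::TRANSITIONS[$this->status] ?? [];

        if (! in_array($newStatus, $allowed, true)) {
            // Refuse the transition instead of silently corrupting state.
            return false;
        }

        return $this->update(['status' => $newStatus]);
    }
}
```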

Despite the wasted hours, I learned from my attempt. Next time, I won’t dive in with just a few notes in a Notion document. I’ll plan properly and monitor the state machine. That’s a feature spec in itself: build more observability into the system.
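And “more observability” can start very small: a count of episodes per state, surfaced somewhere I’ll actually look, so a stalled stage shows up as a growing bucket instead of a mysterious drop in throughput. Something along these lines, with a made-up command name:

```php
<?php
// app/Console/Commands/PipelineStatus.php: a sketch of a tiny observability
// command; the command name and state column are assumptions.

namespace App\Console\Commands;

use App\Models\Episode;
use Illuminate\Console\Command;

class PipelineStatus extends Command
{
    protected $signature = 'podscan:pipeline-status';
    protected $description = 'Show how many episodes sit in each pipeline state';

    public function handle(): int
    {
        // Count episodes per state so a stalled stage is visible at a glance.
        $counts = Episode::query()
            ->selectRaw('status, count(*) as total')
            ->groupBy('status')
            ->pluck('total', 'status');

        foreach ($counts as $status => $total) {
            $this->line(sprintf('%-24s %d', $status, $total));
        }

        return self::SUCCESS;
    }
}
```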

And one more remark: I am so lucky to have systems in place to deal with such a botched deployment.

Rolling back was easy, thanks to Laravel’s Forge and Envoyer tools. They let me revert to a prior deployment quickly because I keep several dozen backups.

Reversibility requires non-destructive database migrations. Instead of modifying tables, I add new ones. This ensures rollback is smooth and won’t impact existing data.
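In migration terms, “add new tables instead of modifying existing ones” looks something like the sketch below. The table for the queue items is hypothetical, but the shape is the point: rolling back only ever drops the new table, and the live data is never touched.

```php
<?php
// A sketch of a non-destructive migration: the new feature gets its own
// table, and rolling back only drops that table. The table name is made up.

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

return new class extends Migration
{
    public function up(): void
    {
        // New table for the new feature; existing tables stay untouched,
        // so reverting the deployment never strands or mangles live data.
        Schema::create('episode_queue_items', function (Blueprint $table) {
            $table->id();
            $table->foreignId('episode_id')->constrained();
            $table->string('stage');
            $table->timestamp('processed_at')->nullable();
            $table->timestamps();
        });
    }

    public function down(): void
    {
        Schema::dropIfExists('episode_queue_items');
    }
};
```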

Overall, this was a lesson in reframing a frustrating loss into a learning opportunity to understand my product and my process better.

So next time this happens to you, look at it through the lens of emergent insights: if I hadn’t tried to implement this, I wouldn’t have learned about the existing complexity in my business. Now I know more and have more experience.

Do this a few hundred times and you have all the moat you’ll ever need.
