Unexpected Downtime: Stress as Enhancement vs. Stress as Panic

Reading Time: 8 minutes

My therapist recently introduced me to the concept of stress as enhancement in exposure therapy. The basic idea is that a little bit of stress can focus you to the point where you form new neural connections and arrive at new (and beneficial) insights that help you change for the better.

Well, I was able to put this to the test earlier this week.

Because all of a sudden, without any warning, Podscan slowed to a crawl and then went down completely.

Every SaaS founder’s absolute nightmare.

And yet, as frustrating as it was, I embraced the situation, got through it, and came out the other side with a better product, happier customers, and a feeling of having learned something new and valuable.

That’s what I’ll share with you today. I’ll walk you through the event, what I did, and what came out of it.

Let’s jump to Wednesday morning. I recently talked about how I have pushed all my calls and face-to-face engagements to Thursday and Friday, so Wednesday was going to be a day where I could fully focus on product work. A high-profile customer of mine had recently sent me a DM asking for a specific feature, and since serving my pilot customers (who also have the ability to tell all their successful founder friends about Podscan) is one of my growth channels, I got to work.

So I started the morning very relaxed, ready to make the product better.

It was a fairly simple feature: the ability to ignore specific podcasts when setting up alerts for keyword mentions. The founder in question got too many alerts for their own name whenever it was mentioned on their own show, so it was a clear value-add to allow them to ignore certain podcasts. It took me maybe an hour or so to build this, including making the feature available on the API, and when I was done, I ran my usual pre-deployment checks.
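To give a rough idea of what a feature like this can boil down to, here's a hypothetical sketch of an alert-matching query that respects an ignore list. The model names, columns, and the naive keyword match are all invented for illustration; this is not Podscan's actual code.

```php
// Hypothetical sketch: match a keyword alert against new episodes while
// skipping any podcasts the user chose to ignore. Names are made up.
$matches = Episode::query()
    ->where('transcript', 'like', '%'.$alert->keyword.'%') // naive match, just for illustration
    ->whereNotIn('podcast_id', $alert->ignoredPodcasts()->pluck('podcast_id'))
    ->latest('published_at')
    ->get();
```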

And one of my admin endpoints that I usually load in the browser to look at metrics was a bit slower than usual. That was odd. What took 2 seconds before now took 10. This occasionally happens when the database is a bit busy with some massive import, but even after a few refreshes, this didn’t let up.

So I went to the homepage of Podscan to check.

And it didn’t load for 15 seconds. Or rather: it clearly didn’t even connect for 15 seconds, because once it did, the site loaded immediately. That left me slightly confused: the server itself was responding as it should, but something on the network was causing a massive delay.

At this point, I could see where this was going. 10 seconds, 15 seconds, 30 seconds… this was going to end up with minute-long delays, essentially timing out the page. This was going to be downtime.

And my stress levels started to rise.

I didn’t panic; I knew that I had not made any changes to the system. But I felt like this needed my immediate attention, my focus for the next hour or so.

So I took a few seconds to reflect on my situation. In the past, I would have easily spiraled into a panic. What if the server is completely messed up? What if this is the end for Podscan!?

But not this time. Like my therapist said, stress can be suffered, or it can be leveraged. And I chose to use it. So I stepped into a place of calm determination. I said to myself, “This is a technical issue with a technical reason and a technical solution. The internet is a complicated network, and my product is itself a complicated machine. I will attempt to solve this calmly and without skipping steps.”

I started with step one of my emergency routine: “Make sure it’s real.” There’s an old adage that just because things work for you, they might not work for others. And that’s also true in reverse: if it’s broken for you, it might still work for others. So I went to https://www.webpagetest.org/ and ran a test on the main homepage. When I saw connectivity tanking from their end too, I knew it was not a “me” error. My internet is occasionally spotty, but this time, it wasn’t at fault.

But whose error was it?

I had a few candidates: my servers, my Cloudflare settings, Cloudflare itself, my database, my server hosting provider, and finally “the internet” itself.

To check which one was responsible, I went down the list in order of how close each service was to me. I started with my server, which was hosted on Hetzner Cloud. I logged in and checked its vitals (CPU load, RAM usage, and available disk space). All looked good. I quickly restarted the nginx process and the PHP-FPM supervisor, which changed nothing. Since huge timeouts like this can also be caused by connectivity problems inside an application, I restarted my local Redis instance as well. No effect.

I then checked what would happen if I accessed my website from within the server that it’s hosted on. Locally, it responded immediately. Using its public URL, the connection issue persisted.
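If you want to reproduce that kind of check, here's a rough sketch of how the timings can be pulled apart with PHP's curl extension: a slow connect with a fast transfer points at the network, not the application. The URLs are placeholders, and depending on your vhost setup the local request may need an explicit Host header.

```php
<?php

// Measure where the time goes for a single request: connecting, waiting for
// the first byte, and the total round trip. Purely a diagnostic sketch.
function timeRequest(string $url): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_NOBODY         => true,  // we only care about timings, not the body
        CURLOPT_TIMEOUT        => 120,
    ]);
    curl_exec($ch);

    $timings = [
        'connect_sec'    => curl_getinfo($ch, CURLINFO_CONNECT_TIME),
        'first_byte_sec' => curl_getinfo($ch, CURLINFO_STARTTRANSFER_TIME),
        'total_sec'      => curl_getinfo($ch, CURLINFO_TOTAL_TIME),
    ];
    curl_close($ch);

    return $timings;
}

// Run on the server itself: if the local call is instant but the public one
// spends its time in 'connect_sec', the app is fine and the network is not.
print_r(timeRequest('http://127.0.0.1'));   // bypasses DNS and Cloudflare
print_r(timeRequest('https://podscan.fm')); // full public path
```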

I knew then that this was probably an issue beyond my reach.

And funny enough, my stress levels went down. I knew there were still a few avenues I could go down to check, but I had the feeling that this was someone else’s doing, and it likely wasn’t intentional or malicious. There was no sign of a DDoS attack, nor were there resource issues.

To make sure, I logged into AWS RDS, where I keep my database, and checked its metrics. If anything, they had gone down. Which makes sense: fewer web requests making it to the server means fewer database calls.

With my own software stack working perfectly, I went one step up the ladder and looked into Cloudflare.

Or rather, I looked at Cloudflare. It was at this point that I took a breather, had a coffee, told my partner I was in firefighting mode, went back into my office, and considered that this might be part of a larger issue. I checked the status pages of all the services I use: Cloudflare, Hetzner, and AWS. I even looked at Twitter and Hacker News to see if this was a widespread problem.

No mentions of any issues.

My experience running SaaS businesses told me that when a stack suddenly behaves differently, some configuration was changed somewhere in that stack. Since I hadn’t changed anything, I looked into the upstream partners that serve the Podscan website: Cloudflare and Hetzner. Both have their own dedicated networks and connectivity rules. In the Cloudflare dashboard, I looked for notifications, warnings, or “you are over some kind of limit” messages. Nothing there. On Hetzner, it was the same. The server metrics looked good, and there was no warning anywhere.

And at that point, I was stumped.

Either someone up the chain had network issues they didn’t want to tell anyone about, or the traffic to my server (one of several Hetzner servers, and the only one of them that experienced these issues) was being artificially slowed down inside its origin network.

So I did what I always do when I have no idea what to do: I went to Twitter. I shared that I had this issue and didn’t know what to do. Within minutes, a lot of ideas came in, and one caught my eye: someone explained that this might be silent connection throttling by the hosting provider, something they had experienced before while using Hetzner servers for a scraping operation.

Now, Podscan isn’t technically a scraping tool, even though it does pull in a lot of RSS feeds for podcast analysis. But I wouldn’t put it past Hetzner to have automated detection systems that silently make it harder for scraping operations to succeed on their platform.

Here’s the thing: I’ll never know.

Because 10 minutes after I started looking into this particular possibility, things changed. I ran one more test using SpeedVitals to check Time to First Byte from several locations around the world. The first time I ran it, I got timeouts and 70+ second delays from all over the world. My website was effectively down.

And I felt surprisingly calm about it. Of course I was agitated; something I care a lot about wasn’t working, and soon my paying customers might notice. They hadn’t yet, because Podscan is an alerting tool, and the alerts still went out: it was the incoming web traffic that had slowed down dramatically. But this wouldn’t last forever.

I went upstairs, grabbed another hot beverage, and told Danielle about it. She, with the experience of having run a SaaS business with me, calmly told me that she knew I’d figure it out. I love her for that. How lucky am I to have a calm and measured partner who keeps me from spiraling out of control?

Back into the office I went, ready to keep working on this. I ran the Time to First Byte test again, and… it was back. Every location reported sub-second response times.

It was like someone had pulled the plug from the bathtub, and the water was flowing as if nothing had ever happened. I went back into my browser, and it was just like before. I went into the logs, and my transcription servers started to grab new candidates again, while finally being able to deposit the transcripts they had finished but couldn’t report back.

I breathed a big sigh of relief.

And immediately started working on moving away from Hetzner. Mind you, I had no proof that this was shadowbanning or silent throttling. It might just as well have been network congestion in the part of the data center my VPS was in. But I knew I had to diversify. I’d keep my Hetzner server in its current configuration as a backup, but I would move my main server to the cloud I had known and trusted for a while: AWS. After all, my database was already there.

With Laravel Forge, which I use for provisioning and orchestrating servers, and Laravel Envoyer, from which I deploy new versions, spinning up an instance on AWS was extremely simple. I needed to adjust a security group or two for Forge to be able to connect, but that was quickly done. Within 20 minutes, I had a fully functional copy of my main server running on AWS. I tested it under a subdomain of podscan.fm, and this morning, after it had been running idle for half a day, I finally made the switch through Cloudflare, remapping IPs from one server to the other. It was an absolute joy to see traffic slowly shift from the old server to the new one, which, through AWS’s routing magic, is also a much faster one. With my database in the same location as my application, round-trip time dropped significantly. Some of my queries were cut down to 20% of their prior duration. You can really feel it on the website, too.
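As an aside, that final IP swap doesn't have to happen in the dashboard; Cloudflare exposes the same DNS record update through its API. I did it in the UI, but a scripted version could look roughly like the sketch below, where the zone ID, record ID, token, and IP address are all placeholders.

```php
<?php

// Hypothetical sketch: repoint the apex A record to the new server via the
// Cloudflare v4 API. All identifiers and the IP address below are placeholders.
$zoneId   = 'YOUR_ZONE_ID';
$recordId = 'YOUR_DNS_RECORD_ID';
$token    = getenv('CLOUDFLARE_API_TOKEN');

$ch = curl_init("https://api.cloudflare.com/client/v4/zones/{$zoneId}/dns_records/{$recordId}");
curl_setopt_array($ch, [
    CURLOPT_CUSTOMREQUEST  => 'PUT',
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER     => [
        "Authorization: Bearer {$token}",
        'Content-Type: application/json',
    ],
    CURLOPT_POSTFIELDS     => json_encode([
        'type'    => 'A',
        'name'    => 'podscan.fm',
        'content' => '203.0.113.10', // new server's IP (example address)
        'proxied' => true,           // keep Cloudflare in front of the new origin
    ]),
]);

echo curl_exec($ch);
curl_close($ch);
```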

I am coming out of this quite horrible incident with renewed confidence in what I have built.

First off, the service never broke. It was unavailable, sure, but not because it was overloaded. It was underloaded, really. One massive insight I got from all this was that my choice to make every single request between my main server and my 24 transcription servers queue-based was a really good idea. Whenever a transcription is created, I don’t just send off an HTTP request back to my API to save it to the database. That HTTP request is wrapped in a Job, which runs on a queue and will be retried multiple times if it fails. Using Laravel Horizon and the queue retry and backoff parameters, every request is tried up to ten times, and the server waits ten minutes between those attempts. The final attempt waits for a full day. That way, things can crash or slow down, but the valuable transcription data is safe in a queue, ready to eventually be consumed.
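Here's a minimal sketch of what such a queued job can look like. The class name, endpoint, and payload are made up for illustration; only the retry and backoff mechanics mirror what I described above.

```php
<?php

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Illuminate\Support\Facades\Http;

// Hypothetical job that reports a finished transcript back to the main API.
class ReportTranscript implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    // Try up to ten times before giving up for good.
    public $tries = 10;

    public function __construct(
        public string $episodeId,
        public string $transcript
    ) {}

    // Wait ten minutes between attempts, and a full day before the last one.
    public function backoff(): array
    {
        return array_merge(array_fill(0, 8, 600), [86400]);
    }

    public function handle(): void
    {
        // If this request fails, the payload stays on the queue and is retried
        // on the schedule above instead of being lost.
        Http::timeout(30)
            ->post('https://podscan.fm/api/transcripts', [ // hypothetical endpoint
                'episode_id' => $this->episodeId,
                'transcript' => $this->transcript,
            ])
            ->throw();
    }
}
```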

I also enjoyed how easy it was to move from Hetzner (where I still run several parts of the business, like my search engine) to AWS. I made the absolute right choice in trusting the Laravel ecosystem with Forge and Envoyer. Deploying and making on-the-fly changes was comfortable, reliable, and functional.

Ultimately, I was glad I kept my stress levels under control, which allowed me to stay level-headed when facing a problem I could not immediately solve myself. I grew from this experience, and the slightly elevated but controlled stress of it all helped me work through the steps I knew I needed to take, calmly and without losing sight of the larger issue.

One thing I recommend is writing a post-mortem, either as a blog post (like this!) or in your own business documentation. I wrote a few more emergency Standard Operating Procedures right after the problem was resolved, and they will help future me deal with similar issues just as calmly.

And that’s the important part. You can’t control externalities; that’s in the very nature of the term. Cloudflare, Hetzner, Forge, AWS: any of them could do something, intentionally or unintentionally, that creates an issue for you. But running around in a panic won’t solve that. That’s the kind of stress that gets to you.

Instead, have a cup of tea, tell yourself that you got this, and then tackle it like the professional you are.
