In this series, I talk about how certain situations and incidents can evoke strong emotions and how to deal with them. I will explain how I reflected on and reframed those thoughts and feelings. I will then suggest actions for immediate and long-term improvements.
When I received sixteen emails within five seconds, I knew that something must have gone terribly wrong.
And I was right. There was a database disconnect, and my error tracking tool had caught it. The whole connection pool had disconnected, all eight connections, on both servers. Sixteen disconnected connections, at the same time.
This didn’t look good.
Two weeks prior, the database had done the same thing, over and over again, every few minutes. I had to reach out frantically to the company hosting our database servers, and they had this hilarious "we will reply within 24 hours" policy. The problems only stopped after a few hours.
It was going to happen again. I knew it.
I checked our monitoring tool. It looked like the servers had reconnected. So far, so good, but another disconnect was going to come.
I was waiting for the sound of more emails, paralyzed.
I waited for two minutes, then five.
After twenty minutes, I finally came to my senses. It was not going to happen.
And it never did. The disconnect turned out to be a routine failover to the second server.
Nothing malicious.
The only malicious thing was that expectation in my head. The thought that it would get worse, much worse than last time. The fear of it slipping out of my control, again.
It would take me a few years to be able to hear the sound of multiple emails arriving at the same time without panicking. I found a few ways to cope with this fear, and I’ll share how I reframed my perception of errors to serve the business instead of making me freeze in place.
A similar thing had happened earlier in the history of the business. We had built a browser integration for our service and had marketed it heavily towards our userbase. Our customers had relied on the extension as an integral part of their experience, and it had worked flawlessly for months. Until, of course, it didn’t.
One day, a customer reached out through Intercom, saying they were having trouble using our product. After a minute of talking to them, I figured out that the browser extension was the culprit: it was not working as expected, leaving this customer, who had never used our service without it, stranded.
More messages came piling in within a few minutes. Customers were asking if we knew about the service not working anymore. Most were conflating the SaaS product with the browser extension, and for them, the integration not working was the same as the whole product being unavailable.
I was starting to panic, trying to both build a fix and keep the customers informed that I was working on it. Luckily, I quickly found the culprit: a change in the web application we integrated with, which only took a few lines of code in the extension to fix. I deployed the change, customers started getting the automatic updates, and all was good.
But not with me. That hour of customer service messages left me traumatized. Whenever a customer reached out about the integration not working, I would panic. I would immediately think that I would need to fix something and deploy it to thousands of people. And I envisioned that one day, there would be a change I could not adapt the browser extension for. A change that would make it unworkable.
Of course, this was all in my mind. Any message of the extension not working would trigger this, even though it was much more likely that the customer just had it disabled in the browser or that their Chrome browser was starved for memory and had started glitching out. I always went to the worst place immediately, thinking that any customer reaching out would mean that all customers were affected.
It took me a lot of work to get rid of this thinking. The following reframing method worked for me.
Reframing Opportunity: Not Every Error Message Starts an Avalanche
I figured out that the trigger for this behavior was the sound of error emails and incoming conversations. That set me up to fear the worst, and if the error or conversation was related to the integration, my mind would confirm the suspicion and go to the darkest place.
By reframing the trigger, I was able to starve the fear of its power:
Error emails and customer service messages are a good thing, I told myself: they show that the product is being used. Every problem encountered or reported is an opportunity to harden the product.
I had to make sure I prevented alarm fatigue. I changed my Sentry settings to only send alerts for important things, and to batch them up. Using shifting time windows to capture and batch alerts reduces their overall count. There is very little reason to receive error messages every 10 seconds when you're well aware that something is going on. Better to have 10 minutes of debugging time to deal with it than to reinforce your fear a few times a minute.
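To make the batching idea concrete, here is a minimal sketch of a digest-style notifier. This is not Sentry's alert configuration (that lives in its alert-rule settings); the class and names are illustrative, assuming you route error events through your own notification layer.

```typescript
// Minimal sketch: collect error events and send a single digest per time window,
// instead of one notification per event. Names are illustrative, not Sentry's API.
type ErrorEvent = { message: string; occurredAt: Date };

class AlertBatcher {
  private buffer: ErrorEvent[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(
    private windowMs: number,
    private notify: (events: ErrorEvent[]) => void,
  ) {}

  // The first event after a quiet period opens a new window.
  add(event: ErrorEvent): void {
    this.buffer.push(event);
    if (this.timer === null) {
      this.timer = setTimeout(() => this.flush(), this.windowMs);
    }
  }

  // One digest covering everything seen during the window.
  private flush(): void {
    this.notify(this.buffer);
    this.buffer = [];
    this.timer = null;
  }
}

// Usage: at most one email per 10-minute window, no matter how many errors arrive.
const batcher = new AlertBatcher(10 * 60 * 1000, (events) => {
  console.log(`Digest: ${events.length} errors, first: "${events[0].message}"`);
});
batcher.add({ message: "Database connection lost", occurredAt: new Date() });
```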
I also implemented a reality check whenever I encountered these conversations. User error exists, and it becomes very likely once you reach a certain scale. Integrations are brittle, and computers are complicated. There are plenty of sources of error: different configurations, older and newer machines, novice and tech-savvy users. There is a good chance that the product works in general, just not for that one user.
Only when a certain quantity of complaints is received within a certain amount of time should you act. Two customers out of a few thousand talking about their integration not working over a day? Probably their problem. Twenty messages coming in within two minutes about the same problem? You might want to investigate this quickly.
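Here is a rough sketch of that "act only above a threshold" heuristic, assuming each incoming complaint can be tagged with a topic. The threshold, window, and names are placeholders, not a prescription.

```typescript
// Rough sketch: flag a topic for investigation once enough similar complaints
// arrive within a sliding time window. Threshold and window are placeholders.
class ComplaintMonitor {
  private reports = new Map<string, number[]>();

  constructor(private threshold: number, private windowMs: number) {}

  // Record a complaint; returns true when this topic crosses the threshold.
  report(topic: string, now: number = Date.now()): boolean {
    const recent = (this.reports.get(topic) ?? []).filter(
      (t) => now - t < this.windowMs,
    );
    recent.push(now);
    this.reports.set(topic, recent);
    return recent.length >= this.threshold;
  }
}

// Twenty messages about the same problem within two minutes? Investigate.
const monitor = new ComplaintMonitor(20, 2 * 60 * 1000);
if (monitor.report("browser-extension")) {
  console.log("Complaint spike detected: investigate the integration now.");
}
```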
Some problems cannot be avoided. Every SaaS product is full of complicated moving parts and highly interconnected dependencies, and things are never perfect. Have a plan in place to execute instead of frantically looking for something to do when such a thing happens.
An excellent way to do this when it comes to integrations with third-party services is to build early alarm systems that tell you when things change. Those systems usually detect changes before your customers do, and you can start adjusting your tools while your customers are still asleep. In our case, we built a system that would periodically check ETags and Last-Modified dates. That way, even the smallest change would prompt me to test our integrations proactively. If you want to be extra elaborate, build some content-diffing logic that shows you what changed. If the product you integrate with has releases, subscribe to their feeds or create some logic to check for new versions.
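As a sketch of such an early-warning check (the URL, polling interval, and notification hook are placeholders, and our actual system differed in the details), a small poller can compare the ETag and Last-Modified headers against the last values it saw:

```typescript
// Simplified sketch: poll a third-party page with a HEAD request and compare
// ETag / Last-Modified against the last known values. URL and alerting are placeholders.
const TARGET_URL = "https://thirdparty.example.com/app";

let lastEtag: string | null = null;
let lastModified: string | null = null;

async function checkForChanges(): Promise<void> {
  const response = await fetch(TARGET_URL, { method: "HEAD" });
  const etag = response.headers.get("etag");
  const modified = response.headers.get("last-modified");

  const changed =
    (lastEtag !== null && etag !== lastEtag) ||
    (lastModified !== null && modified !== lastModified);

  if (changed) {
    // Swap in an email, Slack message, or ticket in a real setup.
    console.warn("Third-party page changed; re-test the integration.", { etag, modified });
  }

  lastEtag = etag;
  lastModified = modified;
}

// Check every 15 minutes.
setInterval(() => void checkForChanges(), 15 * 60 * 1000);
```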
Finally, having a reliable monitoring system in place that reacts only to critical issues, and a plan to deal with them, will allow your mind to calm down and take care of the essential things. Monitor the parts that are important to your business, and alert only when vital functions are impeded.
The best way to stay calm is to understand that every error is an opportunity to make the product better. New problems will come up, but with every successfully solved issue, you will be able to deal with future issues more confidently.