If you happen to use Basecamp 3 to manage your projects, you might have noticed a huge outage on November 8th, 2018; it lasted almost 5 hours.
The issue was that they had not used bigint for the primary keys of their tables, so they ran out of IDs. The TL;DR of the solution, from David Heinemeier Hansson (DHH), creator of Ruby on Rails and founder and CTO at Basecamp:
We took half of our replicas offline, did the 3h migration, put them back online, will now be converting the other half of the fleet.
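To put numbers on "ran out of IDs", here is a minimal sketch in Python. It assumes the affected columns were signed 32-bit integers, which is the usual size of a non-bigint integer primary key; the post itself only says bigint was not used. The helper names and the sample figures are hypothetical and are not Basecamp's real numbers.

```python
# Largest value a signed 32-bit integer column can hold (assumed column type).
INT32_MAX = 2**31 - 1
# Largest value a signed 64-bit (bigint) column can hold.
INT64_MAX = 2**63 - 1


def remaining_ids(current_max_id, limit=INT32_MAX):
    """Return how many more IDs can be issued before hitting the limit."""
    return max(limit - current_max_id, 0)


def days_until_exhaustion(current_max_id, inserts_per_day, limit=INT32_MAX):
    """Estimate the days left at a steady insert rate (hypothetical helper)."""
    if inserts_per_day <= 0:
        raise ValueError("inserts_per_day must be positive")
    return remaining_ids(current_max_id, limit) / inserts_per_day


if __name__ == "__main__":
    # Hypothetical table: highest ID so far and average rows added per day.
    current_max_id = 2_147_000_000
    inserts_per_day = 100_000

    print("int32 ceiling:", INT32_MAX)
    print("IDs left:", remaining_ids(current_max_id))
    print("days left:", days_until_exhaustion(current_max_id, inserts_per_day))
    print("bigint ceiling:", INT64_MAX)
```

Running a check like this against the busiest tables now and then is one way to see the ceiling coming well before inserts start to fail.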
I’m not writing this to expose them or throw ** at them.
I’m writing this to applaud their communication and openness about the whole outage.
I’m writing this to show how over-communication, honesty, humility and clarity DO make a difference, especially in difficult situations.
To give you some context, the first notice on their Twitter account that something was going wrong came at 5:40 AM on November 8th:
Basecamp 3 is having trouble right now. Sorry about that! We're working on a fix and will keep you updated as we go.— Basecamp (@basecamp) November 8, 2018
Between that tweet and the moment everything was working again, there were 15 more tweets with constant updates! The last one came at 10:47 AM on November 8th, signed by DHH himself:
Basecamp 3 is back up at the moment. We had to switch to a backup set of caching servers, and they're holding up at the moment. It's obviously been touch and go, so not out of the woods yet. Pains us to ask for even more patience on such a trying day. So sorry 😢 ^DHH— Basecamp (@basecamp) November 8, 2018
All that information is a huge deal. You know they are working really hard to get everything up and going, and you might also know that outages can get really messy.
Despite all the chaos that was probably happening, they kept posting updates with specific details of the cause and the solution being applied on their Twitter account, status page and blog. Not only that, DHH was also posting more technical details about the outage, to the point of linking to the pull request that could have saved everything:
I'm not often ashamed of our work at @basecamp. But today is one such day. To be stuck in read-only mode for hours due to a failure to use bigint for our primary keys on every table is embarrassing. It's been the default in Rails since 5.1 🙈 https://t.co/FaGYDBrROU— DHH (@dhh) November 8, 2018
I find all this incredibly valuable and reassuring. Even though it was a really long outage, they handled each and every customer interaction gracefully. I could not get upset with them when so much information about the problem and the solution was being provided.
Hell, that morning I was even more productive because Basecamp remained read-only; I could check on what was pending on my side and just get to it with no distractions.
I’ve been part of, and the cause of, outages at my company, and it’s really stressful. And we don’t even handle the amount of traffic Basecamp 3 does.
So, as DHH puts it, this is a reminder to stay humble. We could be the next ones involved in a situation like this. We all make mistakes; that’s inevitable. But knowing how to properly communicate them is what matters in the long run.
Hope you have enjoyed this short ramble ❤️