Well, these are never fun...
It's not often you see me post a postmortem (this is the first, and hopefully the last!), but I think it's the fair, transparent, and principled thing to do after any long outage, however insignificant it may seem.
All times are in EST (Eastern Standard Time, GMT-5) unless otherwise stated.
On Saturday, December 9, we prepared for a routine upgrade of the Linux kernel to 4.15-rc2 and the subsequent upgrade to Ubuntu 17.10. This is usually a seamless process. We began Saturday night, and things were fine until the dist-upgrade went terribly wrong (an understatement) and hard-stopped halfway through. Never a good thing.
Around 5AM we began troubleshooting serious issues. Bot uptime and related services were unaffected, but much of the vital core functionality was not: all of our package tooling (apt and friends), systemd, and parts of the base system (such as init), along with a multitude of disk mounting and recognition errors on our NVMe software RAID 1 array. The problems were fairly uncommon in their specificity: because of the presumably failed installation, we had a mixture of bad symlinks, mismatched libraries, and outright missing files. Rebooting, however, did not freeze the server.
We have a lot of configuration, and by design our backups cover only data: we don't take hard snapshots of the operating system, just the main files and information. Knowing this, we did not want to reformat right away, since deploying our initial server configuration, security protocols, and packages takes a while.
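To make that trade-off concrete, here's a minimal sketch of a data-only backup in the spirit of what's described above. The paths and file names are hypothetical demo stand-ins (scratch directories), not our actual layout:

```shell
# Data-only backup sketch. SRC stands in for the real data root
# (bot databases, config exports); DEST stands in for the backup volume.
set -eu

SRC="$(mktemp -d)"                 # hypothetical: the data root in production
DEST="$(mktemp -d)"                # hypothetical: a separate backup disk
echo "token=example" > "$SRC/bot.conf"   # pretend data

STAMP="$(date +%Y%m%d-%H%M%S)"
mkdir -p "$DEST/$STAMP"
cp -a "$SRC/." "$DEST/$STAMP/"     # -a preserves permissions and timestamps
echo "backup written to $DEST/$STAMP"
```

Because only data is copied, a reformat still means reinstalling the OS and redeploying every config by hand, which is exactly why we hesitated to wipe the server.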
Our first step was to isolate the many package-manager errors, and we did so for all but one: systemd. This, however, was where things declined rapidly. Around 7AM Sunday, I attempted to replicate an init problem by renaming the critical directory /usr/sbin to /usr/sbin.backup and symlinking it to /usr/bin. This was meant to be a temporary test, but I accidentally ran the reboot command, and the server did not start.
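For the curious, here's a safe replay of that test confined to a throwaway directory tree rather than the real filesystem root; it shows why the reboot was fatal once /usr/sbin pointed at /usr/bin:

```shell
# Replay of the accidental change in a scratch directory. In the incident,
# /usr/sbin was renamed to /usr/sbin.backup and replaced by a symlink to
# /usr/bin, so init could no longer be found at boot.
set -eu
root="$(mktemp -d)"
mkdir -p "$root/usr/sbin" "$root/usr/bin"
touch "$root/usr/sbin/init"                  # stand-in for the real init

mv "$root/usr/sbin" "$root/usr/sbin.backup"  # the rename
ln -s "$root/usr/bin" "$root/usr/sbin"       # the temporary symlink

# The path still exists, but it now resolves into usr/bin, which has no init:
test ! -e "$root/usr/sbin/init" && echo "init would not be found on boot"
```

On a live system the fix is trivial before a reboot (move the directory back), but once the machine is down you can only repair it from a rescue environment.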
We all make mistakes, and I apologize for any downtime this has caused. However, here are the steps we have taken since the server went down:
We spent about four hours trying to fix the issue ourselves; finally, around 8AM EST on Sunday, we reached out to two professional companies and a contractor. One is widely regarded as a top recovery and security firm in the sysadmin industry; the other is a Linux systems firm that operates around the clock with Tier III technicians on call. We were simply running out of options and wanted more expertise before making the decision to reformat.
Because it was a Sunday morning, outside standard business hours, our options were limited, and we quickly went with what was available.
At around 9AM, access was securely granted to an engineer on a "CRITICAL" ticket. At that point we could only boot into a Debian 8 rescue partition through our datacenter; beyond that, the server was completely inaccessible.
Our original 1-2 hour quote turned into 4 hours of troubleshooting, which did not solve the issue.
We determined that a full restore of the server was the best option: even if we got it booting, a bodged install with missing libraries carried too many risks. After four hours of consultation and evaluation with the firm we contracted, we came to the unanimous decision to reformat the server.
I made and posted that call in the #announcements channel around 4PM EST on Sunday, and personally began the reinstallation of 17.10 with a manual kernel update to 4.15-rc2 right away, which went perfectly fine.
After the requisite disk checks, we formatted the server, which held over 100GB of data and bot information, plus thousands of configurations accumulated over the year, all of which we had to track through the reformat. Fortunately, we had backups (nine hours old at the time of restore).
At this point it was just a matter of putting everything back, and most of that is now done: a lot of custom security and config files, along with permission checks and matching. We also ran extra tests to ensure the hardware was in good order.
Almost everything is now online (everything is online as of 9AM, Dec 11). FredBoat was logged in but would not play due to a potential LavaLink socket bug: we upgraded to the newest build with the server update, while the old version still threw the error (this is fixed as of 9AM, Dec 11). Gnar-bot requires RethinkDB, which has to be built from source on Ubuntu 17.10; we set it up with Docker instead, and it should be up shortly once we rebuild the shadowJars.
At around 2PM we began the initial migration; the process took about four hours. We completed the setups (minus the bots above) for a seamless experience.
At around 6:30PM on December 10 we confirmed the final system integrity checks on our restoration, which was mostly manual.
We then began initializing and starting our bots, local servers, proxies, and other services, and by 7PM EST on December 10, over 95% of the server was back to normal.
Total downtime: roughly 12 to 14 hours.
Prevention and "What's Next?":
As of this post, most things are in order. A few things aren't 100% yet, like FredBoat's ws:// disconnection, which needs to be checked against bugs; I'll do that shortly. (It's done now as of 9AM EST, December 11; it was just a config permission error.)
We have taken major precautions to ensure this doesn't happen again: changing our update procedure checks, implementing better sandboxed environments, ensuring cleanups after updates, and increasing the frequency of system integrity checks. We've also begun building a more robust backup system; the one we had served us fine, but we always want to improve. On the procedural end, we made minor revisions to our basic plan with tighter reaction times (i.e., we won't wait as long as we did before deciding to reformat if need be) and a few other modifications.
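As an illustration of the kind of integrity check we now run more often, here's a hypothetical sketch (the directory and file are invented demo stand-ins for something like a real config tree) that hashes files into a manifest and later verifies them against it:

```shell
# Hypothetical integrity check: record SHA-256 hashes of a config tree into a
# manifest, then verify against it before and after risky operations.
set -eu

cfg="$(mktemp -d)"                       # demo stand-in for a config tree
printf 'Port 22\n' > "$cfg/sshd_config"  # pretend config file

manifest="$(mktemp)"
( cd "$cfg" && find . -type f -exec sha256sum {} + ) > "$manifest"

# Verification: sha256sum -c exits non-zero if any file changed or went missing,
# which is the kind of signal that would halt an upgrade before it proceeds.
( cd "$cfg" && sha256sum -c "$manifest" )
```

A check like this would have flagged the mixture of bad symlinks and missing files much earlier in the troubleshooting process.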
tl;dr: Quite a few internal lessons were learned from this incident, which is always the best result one can hope for from an isolated bad event.
Although this incident was certainly unwelcome, we had predetermined contingency plans for an event like this, and in that sense it went well: we recovered fairly quickly given the extent of the issue and its timing on an off-hours Sunday morning. Of course, we always keep backups, but we shouldn't have to run into these issues at all. At the end of the day, we hope you all understand, and this post should outline what we're doing to make sure it never happens again. Don't underestimate the effect of a bodged kernel and/or Linux update!
[NOTE] Please contact me right away (.vlexar#5320) if you see anything unusual with any of our bots/services, though I don't anticipate that being an issue at all; we ran many checks before putting things back online.
Takeaways and Final Thoughts
These are rare events: we don't update often, and we try our absolute best to do it very late at night or seamlessly (<2 minutes of downtime). In retrospect, this one was rushed, and I apologize for any interruptions it caused. We had never had an issue in over a year of operation before this one, so that should put things in some context!
For the technical bunch, here's some of the stuff we were dealing with earlier in the day before getting it evaluated: https://ubuntuforums.org/showthread.php?t=2379792
As we've noted, the changes we've made are significant improvements that ensure cleaner upgrades on our end with less clutter, along with a multitude of procedural changes. Thank you again for bearing with us, and we hope this is the last time you see a post like this. 🙂
Signing off --