As many of you are aware, Category5.TV suffered some unexpected downtime following Tuesday’s show.
Everything was on track to have Episode 252 up and ready for on-demand on our usual schedule. Normally, I get up early Wednesday morning and type out the show notes, then upload the files, and the system I created (V3) takes care of the rest (creation of show pages, RSS feeds, etc).
Well, I was working away on the show notes, and at 7:55am my time, the database server which powers V3 (our high-tech web site software) stopped responding entirely.
After confirming it wasn’t on my end, I contacted our database host, who determined that a RAID card failure had taken multiple drives in our RAID array down with it. Not good.
Here’s a response from our host:
My sincere apologies for the disruptive extended downtime of the MySQL server. The old hardware suffered multiple hard disk failures that made recovery impossible. Our admins have been working on getting a new server set up and configured with an operating system installed; they are now setting up the MySQL services and will be starting restores from backups shortly. We are doing all we can to expedite the process and get all services restored. The restores will be done from our daily backups, which are started each morning and complete by around 6am Pacific time. Unfortunately, any changes made after the backup was taken will not be recoverable. We will be implementing even more monitoring of individual hard disk health, so as to detect such issues before they happen and help prevent the need for unplanned moves to new hardware.
Around 12 hours after the outage began, they had a brand new server all spiffy and ready to install in the rack, and within several hours began restoring MySQL databases from Tuesday morning’s backup.
Unfortunately, since the backup was from Tuesday morning, we lost all the data for Episode 252. But nothing else.
Throughout the downtime, we kept viewers informed of our progress via our official Twitter feed @Category5TV – if you’re not already following us, make sure you do.
Since it was taking so long to migrate to the new hardware, I went about building a makeshift database server myself and put it online with Tuesday’s backup. This got the site and RSS feeds back online with everything except Episode 252.
By Thursday morning, all was well and the new database server was up and running with our Tuesday backup, so I re-typed what was needed to get Episode 252 into the system and set everything live.
What’d I learn? Our database server is a single point of failure. Yikes!
As you know, I’m all about redundancy, so I added a script to our V3 database engine that automatically syncs the database nightly to the makeshift unit I put online, and fails over to that server in the event that the main database server stops responding.
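The fail-over half of that is conceptually simple: try the main server, and if it doesn’t answer, use the backup. Here’s a rough sketch of the idea in Python (V3 isn’t actually written in Python, and the hostnames below are made up for illustration):

```python
import socket

# Hypothetical hostnames for illustration only; not our real config.
PRIMARY = ("db-main.example.com", 3306)
FAILOVER = ("db-backup.example.com", 3306)

def pick_db_server(servers, timeout=3.0, probe=None):
    """Return the first server in `servers` that responds.

    By default a server "responds" if it accepts a TCP connection
    within `timeout` seconds. The `probe` argument lets you swap in a
    different health check (or a fake one for testing).
    """
    if probe is None:
        def probe(host, port):
            try:
                socket.create_connection((host, port), timeout=timeout).close()
                return True
            except OSError:
                return False
    for host, port in servers:
        if probe(host, port):
            return (host, port)
    raise RuntimeError("no database server responded")
```

The site code would then call `pick_db_server([PRIMARY, FAILOVER])` before opening its MySQL connection, so a dead main server just means a slightly slower page load instead of a dead site.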
I also set the backups to run at midnight instead of 6am, to minimize the risk of losing a Tuesday night’s episode. This means that if we add new content to the site at 10pm on a Tuesday and the database server crashes at 1am the following morning, not only will nothing be lost, but the fail-over server will take over and continue hosting the site and RSS feeds until the main server comes back online.
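For the curious, moving a backup like that is usually just a one-line crontab change on the backup host — something along these lines (the script path is hypothetical):

```shell
# Previously: 0 6 * * * (6:00am daily)
# Now runs at midnight, before any overnight hardware drama:
0 0 * * * /usr/local/bin/backup-databases.sh
```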
So, if the same thing happens again, we’ll be ready for it.
We still have a couple of single points of failure. Our main host (the root of http://www.category5.tv/) could go down in the event of an Internet outage at our hosting facility, but we’re doing all we can within our means to make sure we’re online all the time and running zippy fast, even when tons of people are navigating our site or viewing episodes.
Thanks for your patience as we worked through everything. Enjoy the show!