Post-Mortem on Site Outage



Term
04-18-2014, 10:01 AM
Hey everyone.

We just wanted to update everyone on what occurred during the recent outage.

At around 7:30 PM EDT, the Weasyl main site, Redmine, and status page all went down, giving users a connection error. When we traced the issue back, we found that the RAID card on our main ESXi server had an ECC fault in its memory module, which safely took the array offline. It took us a while to get back into ESXi because all VMs had been migrated to a single host as part of a seamless server migration last week.

Once we gained access to ESXi, we checked the status of the array and got the host back online. After a bit of work in ESXi to re-attach the array, we continued on. Once we powered up the VMs, we found a memory config problem with the DB VM as well as a network config problem with our main app server. We were able to address these issues and get the site back up safely at around 10:30 PM EDT.

We’re not, however, just leaving things at that. We’re going to improve our ability to manage the VM cluster in case of another failure. We’ll also be looking into improving our network config and memory config, and creating some redundancy across arrays for VM storage.

We apologize again for the outage and thank you all for your patience as we addressed the issue.

Toshabi
04-18-2014, 01:22 PM
Thank you for your prompt resolution to the outage.

Nightpaws
04-19-2014, 06:02 PM
Thanks for the explanation! I was curious to know what'd been the cause of it.

Much more detail than I'd expected :)