Post mortem on a catastrophic data loss

Jul 02


I recently made a series of dumb, dumb, dumb mistakes that culminated in the loss of about a week’s worth of work. To extract something positive from this incident, I figured it would be good to do a post-mortem on exactly where I f’d up and what I learned, so that I might save others from making the same mistakes. BTW, this is only going to be mildly amusing/useful if you’re a geek; if you’re a layperson, stop reading now b/c your eyes will glaze over. One thing I will say: having restored things at this point, I have profound empathy for the couch surfer guy’s catastrophe and what he went through (I only lost a week of work; he lost 3 yrs with his “perfect storm”). Here’s what happened:

The context

So I’m admittedly terrible when it comes to attention to detail. The fact that at one point I programmed ColdFusion web applications and commerce systems from scratch that handled hundreds of thousands of dollars of people’s money still boggles my mind. The fact is I know just enough tech to be dangerous, and I try to leave the hardcore IT functions to others. In this particular situation, though, I was essentially working rogue to get a microsite up to test a new commercialization opp for Scratch Audio around the idea of facilitating online remix competitions. Using the free microinstance tier of hosting on Amazon EC2 and the WordPress JumpBox, I figured I could implement a site in a weekend, throw some quick traffic at it and determine fairly quickly if there was resonance around this idea.

I started on a Friday and did a marathon session of pulling together all the marketing while working on a local VM of the WordPress JumpBox. Given that I was working out of a cabin in the woods over a crummy connection, being able to develop against a local server was really handy. By Sunday evening I had a WordPress microsite that was about 85% of what it needed to be. I used the JumpBox backup procedure to extract the state to a local file on my desktop, shut down the VM, made a Time Machine backup of my laptop and did the 2hr drive back down to Phoenix feeling pretty good about things.

The next morning I woke up in Phoenix planning to use the JumpBox migration procedure to move my dev instance of the site to a live hosted scenario using the Amazon Free offering. I lit up a new instance on EC2 in minutes using the JumpBox launch widget (spiffy!), imported the backup file and checked the site. The page content was there but all theming was lost. Here’s where I made my first error…

The failure sequence

Now, I should have known this, having used JumpBoxes for the past five years, but the backup procedure explicitly excludes certain directories by default (and this is a sensible way for it to work). About half the work I had done in that marathon session was in making changes to the default theme. Had I installed a new theme and worked there, no problem… but alas, I made all my changes to the default theme, which was excluded from the backup. “No biggie, I’ll just grab the theme directory out of the local VM and use that.” Here comes mistake #2…
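In hindsight, a quick check for recent work sitting inside excluded directories would have flagged this before I ever left the cabin. Here is a minimal sketch of that idea in Python. To be clear about the assumptions: the function name, the `excluded_dirs` list and the example path are all made up for illustration, and I'm not reproducing the actual JumpBox exclusion list here.

```python
import os
from pathlib import Path


def unprotected_changes(root, excluded_dirs, since):
    """Return files under `root` changed after `since` that sit inside an excluded directory.

    root          -- directory the backup is supposed to cover
    excluded_dirs -- paths relative to root that the backup skips (hypothetical list)
    since         -- POSIX timestamp; files modified after it count as recent work
    """
    root = Path(root)
    excluded = [root / d for d in excluded_dirs]
    at_risk = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            # A file is unprotected if any of its parent directories is excluded.
            in_excluded = any(ex in path.parents for ex in excluded)
            if in_excluded and path.stat().st_mtime > since:
                at_risk.append(path.relative_to(root).as_posix())
    return sorted(at_risk)
```

Calling something like `unprotected_changes(site_root, ["themes/default"], friday_timestamp)` (both arguments hypothetical) would list every file edited since Friday that the backup is going to skip. An empty list means the recent work is at least inside the backed-up area.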

Earlier that morning I had realized that I’d mislabeled the directory the VM lived in as “June 2010 site changes” (yeah, I’m frequently about a year behind) and fixed the date in the name, thinking nothing of it. When I went to fire up the VM to grab the theme directory, it was VMware armageddon. The first message was a helpful “a needed file cannot be found” warning. “Oh, must be that I renamed the directory. I’ll just rename it back.” Enter a barrage of new errors informing me that various i’s were not dotted and t’s were not crossed. I spent the next 2hrs learning the intricacies of the .vmx file, changing various settings, sacrificing a chicken, throwing some salt over my shoulder, and finally was able to recover the VM (note the sage advice from @godber: make a backup of the entire VM dir before you do anything).
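That advice from @godber is easy to turn into a habit. A minimal sketch, assuming the VM is powered off and lives in a single directory (the `snapshot_dir` name and the `.backup-` suffix are just my own illustration):

```python
import shutil
import time
from pathlib import Path


def snapshot_dir(vm_dir):
    """Copy `vm_dir` to a timestamped sibling directory and return the copy's path."""
    src = Path(vm_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = src.with_name(f"{src.name}.backup-{stamp}")
    shutil.copytree(src, dest)  # raises FileExistsError if dest already exists
    return dest
```

The original directory is left untouched, so if a rename or a config edit goes sideways there is still a known-good copy sitting next to it.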

Anyways, with VM restored I was able to manually grab the excluded theme directory via SFTP and push that into the Amazon EC2 instance. Worked like a charm and the new site was live!

Aaaaand… here’s where I made mistake #3.

I configured S3 backups and breathed a (false, as it turned out) sigh of relief thinking “everything is on Amazon now and backups are in place. Nothing can go wrong.” Of course, my backups were still afflicted with the exact same problem that had forced me to retrieve the theme dir from the VM in the first place. (yeah, this is why I have no business in IT, folks ;-)

I trashed the VM on my laptop in order to save 6GB of disk space (I figured worst case I still had the Time Machine backup at the cabin). With the site working and eager to get some immediate feedback, I implemented an AdWords campaign. Over the course of that week I iterated the marketing and implemented various tracking scripts like Chartbeat, Crazy Egg, Analytics, Optimizely and AdWords conversion tracking. On Friday evening I implemented a StumbleUpon campaign thinking “okay let’s get a broad swathe of musicians looking at it and see if anything shakes out.” Closed the laptop lid, went to happy hour… bad idea. Turns out microinstances fall down under a load of 9 concurrent users on WordPress (and that’s even with Hypercache running). About an hour later I get a Chartbeat page saying the site has become unavailable. No biggie, I pause the SU campaign from my phone, pause the AdWords campaign and figure “I’ll just restart it in the morning and run it under a larger instance size.”

I wake up the next morning, open my AWS console and am greeted with the cheery message “you have no instances running.” “Umm, yeah, but what about the instance I was running last night that’s now unreachable?” Nothing. Worst case at this point, I thought, I had the S3 automated backup from the night before, so I had only lost a day’s worth of modifications. Wrong.

Upon inspection of the S3 backup I realize my automated daily backups suffer from the same (obvious) problem as the one I originally restored from, validating the nickname I earned in 1st grade: “absent minded professor.” < begin head slapping > Okay okay, worst case now I’ve lost changes back to Monday, but now I need to drive up to the cabin and pray that the backup file I have in the cloud on S3 will restore successfully into the VM, which will hopefully restore successfully from the Time Machine backup I have on the FireWire drive at the cabin (it was starting to feel like my data existed at the 4th level of Inception).

What was really puzzling, though, was how a slug of traffic to a microinstance could completely wipe it off the map. I would have expected the traffic to hang it, not obliterate it and outright eradicate the EBS volume with the data. Completely baffled by this, and with all hope lost of retrieving the EC2 microinstance, I happened to check the JumpBox GUI to see if I could access it from there. Miraculously it still showed the instance as active, although clicking the “Access” button just left the browser hanging. This didn’t jibe with what my AWS console was telling me, but at this point I shrugged and used the JumpBox GUI to terminate the instance. Mistake #4…

Turns out the instance was still there: it was just in the West region, and the AWS console defaults you to the East region. So the data on the EBS volume was still there and retrievable right up until the nanosecond I clicked that terminate button… < commence Seppuku >
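The check I should have run before touching that button is "show me every instance in every region." Here is a rough sketch of what that could look like in Python, assuming the boto3 library is installed and AWS credentials are configured. The function names are my own, the sample instance ID in the usage note is invented, and the sketch ignores pagination, so treat it as a starting point rather than a finished tool.

```python
def instances_by_region(responses):
    """Flatten DescribeInstances-style responses into {region: [(instance_id, state), ...]}.

    `responses` maps a region name to a dict shaped like the EC2 DescribeInstances
    result: {"Reservations": [{"Instances": [{"InstanceId": ..., "State": {"Name": ...}}]}]}
    Regions with no instances are left out.
    """
    found = {}
    for region, response in responses.items():
        rows = [
            (inst["InstanceId"], inst["State"]["Name"])
            for reservation in response.get("Reservations", [])
            for inst in reservation.get("Instances", [])
        ]
        if rows:
            found[region] = rows
    return found


def fetch_all_regions():
    """Query every region visible to the account. Needs boto3 and AWS credentials."""
    import boto3  # third-party; imported here so the parsing above works without it

    regions = boto3.client("ec2", region_name="us-east-1").describe_regions()["Regions"]
    return {
        r["RegionName"]: boto3.client("ec2", region_name=r["RegionName"]).describe_instances()
        for r in regions
    }
```

Running `instances_by_region(fetch_all_regions())` would return something shaped like `{"us-west-1": [("i-0123abcd", "running")]}` for an account whose only instance lives out west, which is exactly the answer the East-only console view was hiding from me.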

Given that the majority of the work that week had been in refining page content (which was protected by the S3 backups since it was stored in the database on the WP JumpBox), it wasn’t actually all that bad. I ended up driving back to the cabin, restoring the VM from Time Machine (which worked flawlessly), importing the latest S3 JumpBox backup into the local VM and using the WP Import/Export function plus some manual finagling to move the site. Since I remembered most of the changes I had made that week, it was a matter of reimplementing those and re-adding the various tracking scripts that were missing. In all, about 5hrs of duplicated effort to recreate everything in its new home.

I’m happy to report that remix.scratchaudio.com is now live on a server that can survive substantial traffic and we just had our first band signup yesterday.

What went right

For all that went wrong in this series of idiotic blunders on my part, here are some things that went right:

Time Machine appears to be effing bulletproof.
The JumpBox backups work flawlessly, provided you understand exactly what they’re backing up.

What I learned

Test your backup procedures with an actual fire drill where you have to use them to restore your data. You are almost guaranteed to learn something valuable from this exercise (even if it’s just the peace of mind of having done it, like changing a tire before you actually have a flat).
I have no business running servers ;-)
This is why services like Page.ly exist
Don’t delete stuff until you absolutely have to. I had 150GB of free space on my laptop and yet I felt like I needed to get rid of this 6GB VM once I was finished with it. Dumb. Keep it until you need to throw it away. There are utilities like Disk Inventory X that make it easy to clean out the cruft eventually.
Microinstances are handy, light-weight, disposable tools for dev/test but should never be used in production. They cannot handle any kind of load (mine fell over at 9 concurrent WordPress users, even with Hypercache running). Kimbro had actually told me this but it took experiencing it first-hand for it to sink in.
EC2 instances never just disappear; they’re still there even when they become unreachable via the web. When something seems fishy, stop, seek alternate explanations and get a second set of eyes on it rather than charging ahead and making the situation worse.
VMware VMs are surprisingly brittle: simply renaming the parent directory in which one resides unleashes a chain of events that makes it unusable. Given that product’s level of maturity, I’m shocked that they’re not more bulletproof. The good news is your data is still probably retrievable when things get moved around, but you will spend the next two hrs wading through config files to manually futz with parameters in order to get it working again.
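To make the fire drill lesson concrete, here is a minimal sketch of one in Python: restore the backup into a scratch directory and compare it, file by file, against the live data. It assumes the backup is a tar archive whose member paths are relative to the live directory. That is purely an assumption for illustration; I'm not describing the actual JumpBox backup format, and the function names are mine.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def file_hashes(root):
    """Map each file's path relative to `root` to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def restore_drill(archive_path, live_dir):
    """Restore `archive_path` into a scratch directory and compare it with `live_dir`.

    Returns (missing, changed): files absent from the restore, and files whose
    contents differ from the live copy.
    """
    live = file_hashes(live_dir)
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive_path) as tar:
            tar.extractall(scratch)  # only do this with archives you created yourself
        restored = file_hashes(scratch)
    missing = sorted(set(live) - set(restored))
    changed = sorted(k for k in live.keys() & restored.keys() if live[k] != restored[k])
    return missing, changed
```

If `missing` comes back non-empty, the backup is not covering something you care about. In my case the whole default theme directory would have shown up in that list on day one.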

Anyways, hopefully this writeup is useful and saves even one person from making some of the errors I did in this debacle.
