We started using HBase on EC2 back in 2009. Our data was important, and we wanted an option of restoring it. So we attached EBS volumes to our HBase nodes and configured our HBase and Hadoop installations to store all the data on the attached EBS volumes.
Then came the concept of EBS-backed instances. In those days we were still experimenting, and HBase was releasing new versions very frequently. We were already a few versions ahead of our original AMI for Hadoop and HBase, and we were also in the process of tuning our HBase/Hadoop cluster. Documenting all the changes after they were made to the installation, or creating a new image every time something changed, was very cumbersome. Instead, we figured that if we converted our nodes to EBS-backed instances, we wouldn’t have to do any of it. We would simply take a snapshot of the root device and restore it in case the volume failed.
And this worked happily for a few months. Then one day it suddenly stopped working.
There are many ways to restore EBS-backed instances from their snapshots. Here are all the ways I knew of:
1) Register the snapshot as an AMI and start an instance from the image.
2) Create a volume from your snapshot. Launch a similar EBS-backed instance, stop it, and swap in the new volume as the root device.
3) Create an AMI from a running instance. This causes the instance to reboot immediately, so it wasn’t an option for us. There was no way we could afford to reboot our master!
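For illustration, option 1 might look something like the following with today’s AWS CLI (which didn’t exist back then; we were on the old ec2-api-tools). All IDs here are placeholders, not real resources:

```shell
# Register the root-device snapshot as an AMI (all IDs are placeholders).
# --kernel-id and --ramdisk-id must match what the original instance used --
# which is exactly the information we were missing.
aws ec2 register-image \
    --name "hbase-master-restore" \
    --architecture x86_64 \
    --root-device-name /dev/sda1 \
    --block-device-mappings "DeviceName=/dev/sda1,Ebs={SnapshotId=snap-0123456789abcdef0}" \
    --kernel-id aki-12345678 \
    --ramdisk-id ari-12345678

# Then launch an instance from the resulting AMI.
aws ec2 run-instances --image-id ami-0123456789abcdef0 --instance-type m1.large
```

If the kernel or ramdisk ID passed to `register-image` is wrong, the instance launches but fails to boot, which is the failure mode described below.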
You have to know the kernel and ramdisk IDs if you want to go for option 1 or 2. You may think it’s a no-brainer – just use the metadata query tool and find out the kernel and ramdisk of the running instance. But not all instances have that metadata available to them! Our instances did not have ramdisk metadata available at all. When we contacted Amazon support, they told us that the instance was so old that there was simply no way to know which ramdisk it was using. That means you have to choose a ramdisk yourself. If the kernel or ramdisk you use to create the AMI from the snapshot is not compatible, your instance will not boot up correctly. And this is especially true in the case of Ubuntu images.
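The metadata lookup in question is a plain HTTP call from inside the instance (169.254.169.254 is the standard EC2 instance metadata endpoint); the sketch below shows the two relevant categories, and on instances like ours the ramdisk path simply returns nothing:

```shell
# From inside the instance: query the EC2 instance metadata service
# for the kernel and ramdisk the instance booted with.
curl -s http://169.254.169.254/latest/meta-data/kernel-id

# On our old instances this category was missing entirely,
# so the call returned no ramdisk ID at all.
curl -s http://169.254.169.254/latest/meta-data/ramdisk-id
```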
That’s what happened to us. It stopped working – somehow the kernel files were no longer available. Even though the ramdisk information was the part that was missing, it was the kernel that caused us the problem. Here is what Amazon support had to say about our problems:
“Your practice of taking snapshots and starting instances from those machines can work, as it has in the past, but will always be susceptible to kernel/ramdisk mismatches.”
“Our standard practice of creating an image (AMI) from a running instance (option 3 as described above) and launching instances from that AMI would avoid the problem you’re seeing with the mismatched/incompatible kernels.”
When we told Amazon that it’s not an option for us, as it causes the instance to reboot immediately, here is what they suggested:
“Have you considered writing data to an EBS volume that is separate from your root EBS volume? I’m just wondering if that’s a viable option as it wouldn’t require stopping or rebooting the instance.”
There lies the answer! We have a requirement of recreating the cluster in case we accidentally delete the entire dataset or lose our master. In such a case a reliable backup can only be taken if your HDFS data does not reside on the root devices. A reliable backup of the root device cannot be taken without rebooting the instance. Furthermore, it’s stored as an AMI, which means you have to create a new AMI every day and delete the old one. So to solve all of our problems, we needed both the HBase installation and the data stored on attached EBS volumes that are not the root devices.
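With the data on a separate attached volume, the backup becomes a plain volume snapshot and never touches the running instance. A sketch of that workflow, again with placeholder IDs, might be:

```shell
# Snapshot the attached data volume (ID is a placeholder) -- no reboot needed.
aws ec2 create-snapshot \
    --volume-id vol-0123456789abcdef0 \
    --description "nightly HDFS data backup"

# To restore: create a fresh volume from the snapshot and attach it
# to the replacement instance in place of the lost data volume.
aws ec2 create-volume --snapshot-id snap-0123456789abcdef0 --availability-zone us-east-1a
aws ec2 attach-volume --volume-id vol-0fedcba9876543210 \
    --instance-id i-0123456789abcdef0 --device /dev/sdf
```

In practice you would also want to flush or quiesce HBase before snapshotting, so the snapshot captures a consistent view of the data.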
It was news to us.
We had no choice. We decided to invest the time to convert our architecture to use attached EBS volumes, rather than waking up in the middle of the night to realize that we are not able to restore our backup!