A sunny Sunday morning. I am getting ready to go out with my wife when a Pingdom alert suddenly hits my phone. The website is down! I run to my Ubuntu desktop and hurriedly open Ylastic, the AWS console, Splunk and so on. The console shows nothing unusual. The instance running the website is shown as ‘up’. I try to ssh into the instance using its public DNS name. I can’t get to it! The security groups are in place. So what happened? Why is the instance not accessible?

Does this sound familiar? Should we blame Amazon for such occasional failures? Not really. In my opinion, the moment I signed up for a cloud service, I signed up for this kind of behavior. Cloud technologies are not perfect yet, and this kind of behavior is ‘normal’. I doubt that buying premium support would help much (I haven’t tried it, so this is only a guess).

People who build their systems in the AWS cloud should be prepared for such failures and design their systems to accommodate them. Here are my suggestions for minimizing the damage:

1) Architect your system to operate without most of the components

Some pieces might force the entire system to go down, but you may find that a lot of them don’t. For example, we cache all of our configuration data in memory; the permanent home for this data is our RDS database. The cache gives us better performance, and if the database goes down, our app keeps functioning. Some inserts are done in real time, but we have a JMX-based switch that turns them off, so the app can keep functioning as far as it can without the database. Also note that many systems have pieces that can go down without causing any damage. Identify them and be aware of them so that you don’t panic when they go down.
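To make the idea concrete, here is a minimal Python sketch (not our actual code) of an in-memory configuration cache that keeps serving the last known values when the database is unreachable, plus a runtime switch for turning off real-time inserts. All names in it (`ConfigCache`, `load_from_db`, `WriteSwitch`) are hypothetical, and the plain flag only stands in for the JMX switch mentioned above.

```python
import threading
from typing import Callable, Dict


class ConfigCache:
    """Serves configuration from memory; the database is only its permanent home."""

    def __init__(self, load_from_db: Callable[[], Dict[str, str]]):
        # Hypothetical loader, e.g. a function that queries the RDS database.
        self._load_from_db = load_from_db
        self._data: Dict[str, str] = {}
        self._lock = threading.Lock()

    def refresh(self) -> bool:
        """Reload from the database; on failure keep the last known good copy."""
        try:
            fresh = self._load_from_db()
        except Exception:
            return False  # database is down: keep serving the cached values
        with self._lock:
            self._data = dict(fresh)
        return True

    def get(self, key: str, default: str = "") -> str:
        with self._lock:
            return self._data.get(key, default)


class WriteSwitch:
    """Runtime flag standing in for a switch that disables real-time inserts."""

    def __init__(self) -> None:
        self.enabled = True
        self.skipped = 0

    def insert(self, do_insert: Callable[[], None]) -> bool:
        if not self.enabled:
            self.skipped += 1  # skip the write instead of failing the request
            return False
        do_insert()
        return True
```

The point of the design is that reads never touch the database directly, so a database outage only means the cache gets stale, and writes can be switched off by an operator instead of bringing requests down with them.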

2) Use Elastic Load Balancing for critical services.

Elastic Load Balancing combined with Auto Scaling helps keep a service available: the load balancer routes traffic only to healthy instances, and Auto Scaling replaces the ones that fail.

3) Use EBS volumes wherever data on the disk is important.

Use EBS-backed images wherever possible. It’s very easy to back up an EBS volume (by taking a snapshot), and it’s very easy to boot up a new instance from a backup. Furthermore, in many cases we have found that if an instance is unreachable, simply rebooting it is sufficient to bring it up again. An S3-backed (instance store) instance gives you no such safety net: it cannot be stopped and started again, and the data on its disk is lost if the instance fails or is terminated.

4) Duplicate your data (unless it’s in S3) on multiple machines.

Use a distributed file system. We use HBase, which runs on top of Hadoop, to make sure that the data is distributed and replicated across machines.

5) Create snapshots often

Do not hesitate to take snapshots every hour if you need to. Snapshots are incremental backups: every time you take one, only the changes since the previous snapshot are stored, which saves space (and cost).
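Hourly snapshots are easy to script. Below is a minimal Python sketch, assuming an `ec2` object that behaves like a boto3 EC2 client (its `create_snapshot` call takes `VolumeId` and `Description` and returns a dict containing `SnapshotId`). The function name `snapshot_volumes` and the description format are my own choices for illustration.

```python
from datetime import datetime, timezone
from typing import Any, Iterable, List


def snapshot_volumes(ec2: Any, volume_ids: Iterable[str]) -> List[str]:
    """Start a snapshot of each EBS volume and return the new snapshot IDs.

    `ec2` is assumed to behave like a boto3 EC2 client; it is passed in so
    the function can also be exercised with a stub.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%MZ")
    snapshot_ids = []
    for volume_id in volume_ids:
        response = ec2.create_snapshot(
            VolumeId=volume_id,
            Description=f"hourly backup of {volume_id} at {stamp}",
        )
        snapshot_ids.append(response["SnapshotId"])
    return snapshot_ids
```

With boto3 installed and credentials configured, this could be run from an hourly cron job as `snapshot_volumes(boto3.client("ec2"), volume_ids)`, where `volume_ids` is the list of volumes you care about.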

6) Use Amazon’s services wherever possible instead of using software installed on EC2 instances.

For example, use SQS and SNS for asynchronous and publish-subscribe communication wherever possible instead of installing and maintaining queue systems yourself.
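As an illustration of how little code a managed queue needs, here is a minimal Python sketch of a producer and a consumer, assuming an `sqs` object that behaves like a boto3 SQS client (`send_message`, `receive_message`, `delete_message`). The helper names `publish_event` and `drain_once`, and the choice of JSON for the message body, are assumptions of this sketch.

```python
import json
from typing import Any, Callable, Dict


def publish_event(sqs: Any, queue_url: str, event: Dict[str, Any]) -> str:
    """Send one JSON-encoded event to the queue and return its message ID."""
    response = sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(event))
    return response["MessageId"]


def drain_once(sqs: Any, queue_url: str,
               handle: Callable[[Dict[str, Any]], None]) -> int:
    """Receive one batch, process each message, and delete it only after success."""
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # SQS returns at most 10 messages per call
        WaitTimeSeconds=20,      # long polling instead of busy-waiting
    )
    processed = 0
    for message in response.get("Messages", []):
        handle(json.loads(message["Body"]))
        # Deleting after handling means a crash mid-way lets SQS redeliver it.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
        processed += 1
    return processed
```

Because a message is deleted only after it has been handled, a consumer instance that dies in the middle of its work does not lose the message; it becomes visible again and another consumer picks it up.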

7) Follow best practices

Follow the EC2 best practices described in Amazon’s whitepaper, Architecting for the Cloud: Best Practices. It’s a very useful guide.
