We have been running various experiments on our EC2 web-serving cluster to serve the most traffic at the lowest cost. I thought our experience would be useful to other people using EC2.

We have a web application running on nginx + Tomcat. We get roughly 200K requests per minute at peak and about 65K requests per minute at night. Since we host a web service and not a web page, most of our requests are servlet requests handled by Tomcat, not the faster static-file requests that nginx can serve on its own.
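To put those numbers in per-second terms, here is a quick back-of-the-envelope sketch. It assumes the load is spread evenly across two instances (the autoscaling minimum used in this post); the function and constant names are made up for illustration.

```python
# Rough conversion of the traffic figures from the post into per-second rates.
# Assumption: requests are spread evenly across the instances.

PEAK_RPM = 200_000   # requests per minute at peak (from the post)
NIGHT_RPM = 65_000   # requests per minute at night (from the post)
INSTANCES = 2        # autoscaling group minimum (from the post)


def per_second(requests_per_minute: int) -> float:
    """Convert a requests-per-minute figure to requests per second."""
    return requests_per_minute / 60


peak_rps = per_second(PEAK_RPM)
night_rps = per_second(NIGHT_RPM)
peak_rps_per_instance = peak_rps / INSTANCES

print(f"peak:  {peak_rps:.0f} req/s ({peak_rps_per_instance:.0f} per instance)")
print(f"night: {night_rps:.0f} req/s")
```

So each of two machines would have to absorb well over a thousand servlet requests per second at peak, if the split were perfectly even.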


We began our experiment with two m1.small machines. We set the autoscaling group minimum to two and set the Tomcat file handle limit to 65535.
We based our autoscaling metric on latency: we asked the autoscaling group to increase capacity if latency went above 0.75 seconds.
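The rule is easier to reason about as a toy model. The sketch below is not how EC2 Auto Scaling is implemented (real policies are driven by CloudWatch alarms and cooldown periods); it just assumes one instance is added whenever latency is above the threshold and one is removed otherwise, never dropping below the minimum. The step size of one and the sample latencies are assumptions for illustration.

```python
# Toy model of a latency-based scaling rule. Not the real Auto Scaling logic:
# the +1/-1 step size and the absence of cooldowns are simplifying assumptions.

LATENCY_THRESHOLD_S = 0.75  # threshold from the post
MIN_INSTANCES = 2           # group minimum from the post


def desired_capacity(current: int, latency_s: float) -> int:
    """Scale out by one above the threshold, otherwise scale in toward the minimum."""
    if latency_s > LATENCY_THRESHOLD_S:
        return current + 1
    return max(MIN_INSTANCES, current - 1)


def simulate(latencies: list[float], start: int = MIN_INSTANCES) -> list[int]:
    """Return the capacity after each latency sample."""
    capacity = start
    history = []
    for latency in latencies:
        capacity = desired_capacity(capacity, latency)
        history.append(capacity)
    return history


# Hypothetical latency samples with two short spikes.
samples = [0.3, 0.9, 0.5, 0.4, 1.1, 0.6]
print(simulate(samples))
```

With short, periodic spikes like these, the model keeps adding an instance and removing it again shortly afterwards.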

This didn’t work that well. Here is what happened:

1) We started getting CPU bursts during peak hours. CPU would jump to 100%, thereby pushing latency beyond 0.75 seconds periodically. Autoscaling would kick in and start a new server. Latency would fall back below 0.75 seconds within a few minutes, and autoscaling would cut capacity back to the minimum.

2) We also started getting nginx errors. Here is what the error said:

(24: Too many open files) while accepting new connection on 0.0.0.0:80

After some research, we realized that the default nginx settings were not meant for the kind of scale we were dealing with. We changed the following settings in /etc/nginx/nginx.conf:

worker_processes 4;
worker_rlimit_nofile 10240;
events {
worker_connections 8192;
}
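As a rough sanity check on these values, the sketch below works out the connection ceilings they imply. It relies on two points from the nginx documentation: worker_connections is a per-worker limit that counts all connections, including those to proxied servers, and worker_rlimit_nofile is a per-worker file descriptor limit. It also assumes nginx proxies each request to Tomcat, so every client uses about two connections; that is our reading of the setup, not something measured.

```python
# Back-of-the-envelope capacity check for the nginx settings above.
# Assumption: each proxied client uses two connections (client side + Tomcat side).

WORKER_PROCESSES = 4
WORKER_CONNECTIONS = 8192
WORKER_RLIMIT_NOFILE = 10240


def max_total_connections(processes: int, connections: int) -> int:
    """Upper bound on simultaneous connections across all workers."""
    return processes * connections


def max_proxied_clients(processes: int, connections: int) -> int:
    """Approximate client ceiling when every client also needs an upstream connection."""
    return processes * connections // 2


def fd_headroom(connections: int, rlimit_nofile: int) -> int:
    """Descriptors left per worker for logs, listening sockets, etc."""
    return rlimit_nofile - connections


total = max_total_connections(WORKER_PROCESSES, WORKER_CONNECTIONS)
clients = max_proxied_clients(WORKER_PROCESSES, WORKER_CONNECTIONS)
headroom = fd_headroom(WORKER_CONNECTIONS, WORKER_RLIMIT_NOFILE)

print(f"total connections: {total}, proxied clients: ~{clients}, fd headroom per worker: {headroom}")
```

The key point is that worker_rlimit_nofile stays above worker_connections, so a worker should not run out of file descriptors before it reaches its connection limit.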

The nginx errors stopped, and the CPU bursts evened out during the non-peak period. Here is how the graph looked right after we made the nginx change.

You can clearly see that the bursts stopped at one point (when we made the change). But the CPU bursts started coming back at peak time. This time, nginx was fine and there were no errors in the log.

This was a clear signal that m1.small could not handle that load (for our application). We decided to switch to c1.mediums. We had known all along that a c1.medium has 5 EC2 compute units whereas an m1.small has 1, but we had wanted to see how far m1.smalls could take us. The switch totally worked! The CPU bursts stopped. Autoscaling stopped kicking in. We can see the CPU going smoothly from 10% to 50% between non-peak and peak hours. This is what we wanted!

Obviously two c1.mediums cost more than two m1.smalls. But we believe we will be able to cope with much larger growth using c1.mediums, as the CPU is always hovering between 10 and 40%. In the long term, we expect it to save us money: we will need fewer machines, and we won't waste money on instances that start up for a few minutes and shut down when the bursts subside.
