Web serving in the cloud – our experiences with nginx and instance sizes
We have been running various experiments on our EC2 web serving cluster to serve the maximum traffic at the minimum cost. I thought our experience would be useful to other people using EC2.
Our web application runs on nginx + Tomcat. We get approximately 200k requests per minute at peak and about 65k requests per minute at night. Since we host a web service and not a web page, most of our requests are servlet requests handled by Tomcat, not the faster static file requests that nginx serves by itself.
We began our experiment with two m1.small machines. We set the autoscaling group minimum to two and set the Tomcat file handle limit to 65535.
We based our autoscaling metric on latency: we asked our autoscaling group to increase capacity if latency went beyond 0.75 seconds.
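To make the rule concrete, here is a minimal sketch of a latency threshold policy like the one described above. It is not our actual autoscaling configuration: the scale-down step, the `MAX_INSTANCES` cap, the function names, and the latency samples are all assumptions made for illustration.

```python
THRESHOLD_SECONDS = 0.75  # latency threshold from the post
MIN_INSTANCES = 2         # autoscaling group minimum from the post
MAX_INSTANCES = 4         # hypothetical upper bound, not stated in the post


def next_capacity(current, latency_seconds):
    """Return the instance count after one evaluation of the latency rule."""
    if latency_seconds > THRESHOLD_SECONDS:
        # Latency is over the threshold: add one instance, up to the cap.
        return min(current + 1, MAX_INSTANCES)
    # Latency is at or below the threshold: shrink toward the minimum.
    # (Assumed behavior; the real scale-down policy is not spelled out.)
    return max(current - 1, MIN_INSTANCES)


def simulate(latencies, start=MIN_INSTANCES):
    """Apply the rule to a series of latency samples and record capacity."""
    capacity = start
    history = []
    for latency in latencies:
        capacity = next_capacity(capacity, latency)
        history.append(capacity)
    return history


# Made-up latency samples in seconds, one per evaluation period.
print(simulate([0.3, 0.9, 0.4, 0.3, 1.1, 0.5]))  # [2, 3, 2, 2, 3, 2]
```

With short latency spikes, a rule like this adds an instance and removes it again soon after, so capacity keeps moving between two and three.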
This didn’t work that well. Here is what happened:
1) We started getting CPU bursts in peak hours. CPU would jump to 100%, thereby pushing latency beyond 0.75 seconds periodically. Autoscaling would kick in and start a new server. Latency would fall back below 0.75 seconds within a few minutes, and autoscaling would cut capacity back to the minimum.
2) We also started getting nginx errors. Here is what the error said:
(24: Too many open files) while accepting new connection on 0.0.0.0:80
After some research, we realized that the default nginx settings were not meant for the kind of scale we were dealing with. We changed the following settings in /etc/nginx/nginx.conf:
worker_processes 4;
worker_rlimit_nofile 10240;
events {
    worker_connections 8192;
}
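As a back-of-the-envelope check on these numbers, the sketch below works out the connection budget they imply. It relies on two facts about nginx: `worker_connections` is a per-worker limit that counts proxied (upstream) connections as well as client connections, and a worker cannot hold more connections than its open file limit. The function name and the "halve it when proxying" estimate are our own rough assumptions, not anything nginx reports.

```python
def nginx_connection_budget(worker_processes, worker_connections, worker_rlimit_nofile):
    """Estimate the connection capacity implied by three nginx settings."""
    # worker_connections is per worker, so the total scales with workers.
    total_connections = worker_processes * worker_connections
    # When proxying to a backend such as Tomcat, each client request can
    # hold two connections (client side and upstream side), so the number
    # of simultaneous clients is roughly half the total. Rough estimate.
    proxied_clients = total_connections // 2
    # Each connection needs a file descriptor; the rest of the per-worker
    # limit is headroom for log files, listening sockets and so on.
    fd_headroom = worker_rlimit_nofile - worker_connections
    return {
        "total_connections": total_connections,
        "proxied_clients": proxied_clients,
        "fd_headroom_per_worker": fd_headroom,
        "limit_is_sufficient": fd_headroom >= 0,
    }


# Values from the nginx.conf settings above.
budget = nginx_connection_budget(4, 8192, 10240)
for name, value in budget.items():
    print(name, value)
```

For the settings above this gives 32768 total connections, about 16384 proxied clients, and 2048 spare file descriptors per worker.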
The nginx errors stopped, and the CPU bursts evened out during the non-peak period. Here is how the graph looked right after we made the nginx change.
You can clearly see the bursts stop at one point (when we made the change). But the CPU bursts came back at peak time. This time, nginx was fine and there were no errors in the log.
This was a clear signal that m1.small was not performing well at that load (for our application). We decided to switch to c1.mediums. We knew that a c1.medium has 5 EC2 compute units whereas an m1.small has 1, but we had wanted to see how far m1.smalls could take us. The switch totally worked! The CPU bursts stopped. Autoscaling stopped kicking in. We can see the CPU going smoothly from 10% to 50% between non-peak and peak hours. This is what we wanted!
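To put the compute unit difference in perspective, here is a small arithmetic sketch using only the figures quoted in this post. It assumes traffic is spread evenly over exactly two instances and treats an EC2 compute unit as a linear measure of capacity, which is a simplification. The function name is made up for illustration.

```python
PEAK_RPM = 200_000   # peak requests per minute, from the post
NIGHT_RPM = 65_000   # night-time requests per minute, from the post
INSTANCES = 2        # autoscaling group minimum, from the post

# EC2 compute units per instance type, as quoted in the post.
ECU = {"m1.small": 1, "c1.medium": 5}


def requests_per_second_per_ecu(rpm, instance_type, count=INSTANCES):
    """Requests per second that each compute unit must absorb."""
    total_ecu = ECU[instance_type] * count
    return rpm / 60 / total_ecu


for instance_type in ECU:
    peak = requests_per_second_per_ecu(PEAK_RPM, instance_type)
    night = requests_per_second_per_ecu(NIGHT_RPM, instance_type)
    print(f"{instance_type}: peak {peak:.0f} req/s per ECU, night {night:.0f} req/s per ECU")
```

At peak, each compute unit on two m1.smalls has to absorb about 1667 requests per second, against about 333 on two c1.mediums.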
Obviously two c1.mediums cost more than two m1.smalls. But we believe that we will be able to cope with much larger growth using c1.mediums, as the CPU is always hovering between 10 and 40%. In the long term, we expect it to save us money: we will need fewer machines, and we won't waste money on instances that start for a few minutes and shut down when the bursts subside.


March 30th, 2010 at 9:34 am
I’ve seen CPU bursts with Openfire on an m1.small as well (running in Java). Have you investigated whether the bursts are because of load or because of some random bug?
March 30th, 2010 at 12:04 pm
Is the cost of the startup and teardown to cover a burst really that significant?
March 30th, 2010 at 12:46 pm
@gregory – We are building this system for much bigger traffic. In that case, the cost could be significant. Furthermore, it takes a few minutes for a server to start with the latest version of our application deployed on it. We didn’t want to lose traffic in those minutes.
March 30th, 2010 at 12:47 pm
@TK We have this application running on another hosted platform too, and there were no CPU bursts on that platform.
March 30th, 2010 at 11:25 pm
What software are you using to monitor the servers?
If it reports the number of open file handles, that may have shown some correlation with the CPU peaks, i.e. the file handle count jumps to a certain level and then the CPU peaks, showing that a limit on file handles was reached.
April 9th, 2010 at 3:36 pm
@gert – We use Splunk to monitor our servers. The CPU bursts did not stop after the file handle problem was fixed.
May 4th, 2010 at 5:10 am
using c1.mediums as the CPU is always hovering between 10 to 40%.
Would it be of concern to you / your setup that the CPU utilization is only 40%, or do you not care about server instance under-utilization?
May 5th, 2010 at 11:17 pm
Ideally we would like to avoid underutilization. That wasn’t our final utilization. BTW, what’s the ideal CPU load to aim for? Is it OK to run your servers at 60% utilization all the time? I am trying to find out what other people are doing.