At GumGum we are using autoscaling successfully. Choosing the right metrics for autoscaling is an ongoing process as your cluster and applications change. When we researched which metrics to use for autoscaling, we found very little literature in the blogosphere. That is why I decided to document our experiences with it.
To begin with, let’s look at an autoscaling trigger creation command:
as-create-or-update-trigger my-cpu \
  --auto-scaling-group myautoscalinggroup \
  --dimensions "AutoScalingGroupName=myautoscalinggroup" \
  --measure CPUUtilization --period 60 --statistic Average \
  --lower-threshold 20 --upper-threshold 50 \
  --breach-duration 300 \
  --upper-breach-increment 1 --lower-breach-increment "-1"
The dimensions parameter indicates that this trigger is created on the myautoscalinggroup autoscaling group. The measure indicates that the trigger is based on CPU utilization, and the period indicates that the trigger is evaluated every 60 seconds. If the upper threshold is breached (average CPU utilization stays above 50%) for 300 seconds, one additional instance is started. If the lower threshold is breached (average CPU utilization stays below 20%) for 300 seconds, one instance is removed.
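The period and breach-duration semantics can be sketched in a few lines of shell. This is a hypothetical illustration of the evaluation logic, not Amazon's implementation; the function name and sample values are ours:

```shell
#!/bin/sh
# Hypothetical sketch of the trigger semantics above: one CPU sample per
# 60-second period; a breach duration of 300 seconds means every sample in
# a 5-sample window must be past a threshold before any action is taken.
decide() {
  upper=50 lower=20 above=0 below=0 total=0
  for cpu in "$@"; do
    [ "$cpu" -gt "$upper" ] && above=$((above + 1))
    [ "$cpu" -lt "$lower" ] && below=$((below + 1))
    total=$((total + 1))
  done
  if   [ "$above" -eq "$total" ]; then echo "scale up by 1"
  elif [ "$below" -eq "$total" ]; then echo "scale down by 1"
  else echo "no action"
  fi
}

decide 55 61 58 63 57   # five 60s averages, all above 50 -> scale up by 1
decide 55 61 48 63 57   # one dip back under the threshold -> no action
```

Note that a single sample dropping back inside the band resets the breach, which is what absorbs short spikes.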
The metric on which an autoscaling trigger is built (the CloudWatch measure) is the most important decision you make when setting up an autoscaling cluster.
There are many metrics you can use to scale your cluster up or down: CPU utilization, latency, network bytes out, disk bytes read, and so on. You can set up autoscaling based on whatever data CloudWatch offers.
Latency and CPU utilization were the obvious choices for our web cluster. We started out with latency as the metric for autoscaling, but quickly realized it was not the right metric for us. Our application was making calls to some third-party web services, and problems with those services sometimes caused our latency to go up. In that case, adding more instances to the cluster didn’t help much. Furthermore, third-party calls made our latency spike unevenly, and we wanted our cluster to grow smoothly as traffic increased and shrink smoothly as traffic decreased.
We decided to switch to CPU utilization as the metric for autoscaling and started thinking about our upper and lower thresholds. We chose 50% as the upper threshold. It can take autoscaling a few minutes to kick in, so it was important for us to keep enough headroom (50%) for CPU to climb in those few minutes. In fact, by setting a breach duration of 300 seconds we ensured that autoscaling did not kick in immediately; this accommodated occasional traffic spikes that went down on their own. By looking at our average CPU utilization, we decided to keep the lower threshold at 20%. This worked well while we had a relatively small cluster (3 to 5 instances): our traffic pattern caused dramatic swings in average CPU utilization, so CPU went up to 50% quickly and came back down to 20% when traffic subsided.
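The headroom argument can be put in rough numbers. A back-of-the-envelope sketch with illustrative growth rates (not measured from our traffic): if average CPU sits right at the 50% trigger and load keeps compounding at g% per minute during the roughly five minutes it takes new capacity to come online, CPU reaches 50 * (1 + g/100)^5.

```shell
# Illustrative only: CPU after 5 minutes of compounding g%/minute growth,
# starting from the 50% trigger point.
for g in 5 10 15; do
  awk -v g="$g" 'BEGIN { printf "%d%%/min growth: 50%% -> %.1f%%\n", g, 50 * (1 + g/100)^5 }'
done
```

At 10% growth per minute the cluster would be near 80% CPU by the time new instances are serving traffic, which suggests why a trigger at 50% leaves a comfortable margin while a higher one might not.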
Recently our traffic increased dramatically and, as a result, our cluster grew tenfold: we were now running 40 instances! Our CTO observed that autoscaling was adding instances when they were necessary, but the cluster was not scaling down quickly enough. In a bigger cluster the average CPU was fairly stable and did not undergo the dramatic changes we saw in a smaller cluster, so our CTO adjusted the lower threshold to 40% CPU utilization. This worked, and the autoscaling cluster started scaling down.
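The effect of cluster size on the lower threshold can be sketched with simple arithmetic (illustrative numbers, not our production data): with total load held constant, removing one instance from an n-instance cluster raises the average CPU from A to A * n / (n - 1).

```shell
# Illustrative only: average CPU after removing one instance,
# assuming the total load stays constant.
awk 'BEGIN {
  printf "n=4,  A=20%%: after scale-down -> %.1f%%\n", 20 * 4 / 3    # big jump
  printf "n=40, A=40%%: after scale-down -> %.1f%%\n", 40 * 40 / 39  # barely moves
}'
```

In a 40-instance cluster a single scale-down nudges the average by about one point, so a 40% lower threshold can keep shedding instances gradually, whereas the large cluster's stable average would almost never dip to 20% in the first place.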
Below you can see the trigger scaling instances up and down over the course of a day; time is on the x axis and the number of instances on the y axis.
We hope our experiences are helpful to others implementing autoscaling. I am sure we will have to revisit these numbers when our cluster grows to hundreds of instances! I will write a new blog post describing my experiences then.