Adaptive Application Scaling for Improving Fault-Tolerance and Availability in the Cloud



In cloud environments, faults and run-time anomalies in the infrastructure can exhaust resources and degrade the performance of all applications that share them. A resource monitoring strategy alone is inadequate, since snapshots of resource usage cannot guarantee application performance. This paper outlines an approach that enables an application to leverage the vast capacity and elasticity of the cloud to mitigate the deleterious effects of resource exhaustion at a node. It models the application as a network of servers and, using queuing techniques, models the flow dynamics of request streams as continuous functions of time. The strategy is to compute, for each server, the mean flow rate and the mean holding time, and to use these to decide among: a) redirecting the flow to another server, b) requesting additional resources from the cloud infrastructure, c) spawning additional server instances, or d) combining server instances to conserve resources. This dynamic reconfiguration through scaling improves application fault-tolerance, availability, and resource utilization. © 2012 Alcatel-Lucent.
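As an illustration only, the per-server decision described above might be sketched as follows. The abstract does not give the decision rule, so everything concrete here is an assumption: the `ServerStats` fields, the `decide` function, the thresholds `high`, `low`, and `spare`, and the order in which the four actions are tried are all hypothetical. The one standard queuing fact the sketch relies on is that offered load equals mean flow rate multiplied by mean holding time; dividing it by an assumed capacity gives a utilization estimate.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Action(Enum):
    NONE = "no change"
    REDIRECT = "redirect the flow to another server"
    REQUEST_RESOURCES = "request additional resources"
    SPAWN = "spawn an additional server instance"
    COMBINE = "combine server instances"


@dataclass
class ServerStats:
    name: str
    mean_flow_rate: float     # requests per second arriving at the server
    mean_holding_time: float  # seconds a request occupies the server
    capacity: float           # assumed: concurrent requests the server can hold

    @property
    def utilization(self) -> float:
        # Offered load (flow rate * holding time) relative to capacity.
        return self.mean_flow_rate * self.mean_holding_time / self.capacity


def decide(server: ServerStats, peers: List[ServerStats],
           high: float = 0.8, low: float = 0.2, spare: float = 0.5,
           can_grow: bool = True) -> Action:
    """Pick one of the four actions using hypothetical thresholds."""
    rho = server.utilization
    if rho > high:
        # Overloaded: prefer a peer that still has spare headroom.
        if any(p.utilization < spare for p in peers):
            return Action.REDIRECT
        # Otherwise grow this server if the infrastructure allows it,
        # and fall back to a new instance if it does not.
        return Action.REQUEST_RESOURCES if can_grow else Action.SPAWN
    if rho < low and any(p.utilization < low for p in peers):
        # Two lightly loaded servers can be merged to conserve resources.
        return Action.COMBINE
    return Action.NONE


if __name__ == "__main__":
    busy = ServerStats("a", mean_flow_rate=90, mean_holding_time=0.1, capacity=10)
    idle = ServerStats("b", mean_flow_rate=20, mean_holding_time=0.1, capacity=10)
    print(decide(busy, [idle]).value)
```

In this sketch the cheapest action (redirecting) is tried first and spawning is the last resort; a real policy would likely also weigh the cost of each action and how the estimates change over time.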