
Monday 14 February 2011

On Soldiers and Servers

Cloud computing provides an unparalleled opportunity to reduce the risks involved in buying infrastructure, for everyone from small businesses to enterprises. One can requisition only the capacity that is needed, as and when it is needed. Deciding how much capacity to buy for any given moment requires understanding and reasoning about the uncertainties involved in traffic patterns.

Many things in life are uncertain. Managing this uncertainty is an important part of planning in real-world operations, often phrased in terms of risk (the expected benefit or cost across all outcomes). Risks to IT services come in many forms from the common — such as hard disk failure — to the unusual — such as large earthquakes or asteroid strikes. Probability theory and statistics provide tools for reasoning with uncertain situations and are commonly used to estimate and balance risk and so maximise the probability of a successful outcome.

Contracts also make risks explicit, along with who bears responsibility for them. Commercial contracts often state their conditions statistically, in the form of Service Level Agreements (SLAs). These agreements include requirements such as up-time exceeding a certain percentage, or a certain proportion of incidents or problems being dealt with within a given time.
By exploring the connection between the causes of death of a group of soldiers in the Prussian cavalry and traffic levels on web-servers, this post describes one way that probability theory may be applied to capacity planning, with the goal of meeting some SLA.

The Commonality of Rare Events

In 1898, a Russian statistician called Ladislaus Bortkiewicz was attempting to make sense of rare events. For our current purposes, a rare event is one that is individually unlikely but has a lot of opportunities to happen, such as mutations in DNA or winning the lottery. He was looking at the number of soldiers in the Prussian cavalry who were killed by horse-kicks; the probability of any given soldier being killed by a horse kick was low, but in the cavalry there were plenty of occasions on which one could be kicked to death by a horse (whether deserved or not). The question was: given statistical data about how many soldiers were killed each year on average, how can one estimate the probability that a given number would be killed by their horses in a year?
 
The tool that Bortkiewicz used, and the theme of his book ‘The Law of Small Numbers’, was the Poisson distribution, a probability distribution with one parameter: the average number of events in the given period.
Assuming an average of 10 soldiers were killed each year, the Poisson distribution can be plotted:

[Plot: the Poisson probability mass function with a mean of 10 events per year]

The horizontal axis is the number of events observed (all integers) and the vertical axis is the probability of that number occurring according to the distribution. As can be seen, the highest probability is associated with the mean number of events (ten), but there is a spread of other counts that have non-negligible probability.
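
For the curious, a plot like the one described can be generated with a few lines of Python. The sketch below assumes scipy and matplotlib are available, which is a choice of convenience rather than anything prescribed above.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson

mean_events = 10            # average number of events per year (the distribution's only parameter)
counts = np.arange(0, 25)   # possible event counts to evaluate

# Probability of observing each count under a Poisson distribution with mean 10
probabilities = poisson.pmf(counts, mean_events)

plt.bar(counts, probabilities)
plt.xlabel("Number of events observed")
plt.ylabel("Probability")
plt.title("Poisson distribution, mean = 10")
plt.show()
```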
The probability that a number of events, k, occurs when the mean is λ can be calculated as follows:

$$\Pr(k \mid \lambda) = \frac{\lambda^{k}}{k!} e^{-\lambda}.$$

The probability that six men were killed by their horses is then:

$$\Pr(6 \mid 10) = \frac{10^{6}}{6!} e^{-10} = 0.0631 \text{ (3 s.f.)}.$$
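
This figure is easy to check numerically. The sketch below evaluates the formula directly and also via scipy's Poisson probability mass function; the use of scipy is an assumption made purely for convenience.

```python
from math import exp, factorial

from scipy.stats import poisson

mean_deaths = 10   # average deaths per year
k = 6              # number of deaths we want the probability of

# Direct evaluation of the formula: lambda^k / k! * e^(-lambda)
print((mean_deaths ** k) / factorial(k) * exp(-mean_deaths))  # ~0.0631

# The same value from the Poisson probability mass function
print(poisson.pmf(k, mean_deaths))                            # ~0.0631
```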

The Poisson distribution has since been used to model many other situations, such as the spread of epidemics and hardware failure, all of which are ‘rare’ events in the sense above. Traffic to websites can also be modelled using the Poisson distribution; there is a large number of browsers active in any given period, and the probability of any one of them visiting a given site is relatively low. It is this that will allow us to answer some questions about traffic to a site that we are responsible for maintaining.

Estimating Required Capacity

After that lengthy tour, we return to our original problem: capacity planning for a certain load. We want to estimate the probability that our maximum capacity (in requests per second, rps) is greater than the load we’ll receive. All we can directly estimate is the expected amount of traffic that our server will receive (again in requests per second) and our maximum capacity (through load testing or other means).
Since web traffic may be treated as obeying a Poisson distribution, our problem can be stated as finding the probability that the observed load is less than or equal to our maximum capacity. This is the definition of the cumulative distribution function of the Poisson distribution, for maximum capacity k (an integer) and expected load λ:

$$\Pr(x \le k \mid \lambda) = \frac{\Gamma(k+1, \lambda)}{\Gamma(k+1)},$$

where Γ(k+1, λ) is the upper incomplete gamma function and Γ(k+1) = k!.

As an example, imagine that we are expecting 50 rps, and have a maximum capacity of 60 rps. The probability that the observed load is less than or equal to 60 rps is then 0.928 to three significant figures, unlikely to meet most commercial SLAs. If we increase our capacity, through improving the code or provisioning more machines, to 70 rps then the probability of being able to handle the observed load is now 0.997 (to three significant figures), which may be enough to meet our commitments.
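
These probabilities can be computed directly from the cumulative distribution function. The sketch below simply re-runs the numbers from the example above, again assuming scipy as a convenient (but by no means required) tool.

```python
from scipy.stats import poisson

expected_load = 50   # expected traffic, in requests per second

# Probability that the observed load stays within a given maximum capacity
print(poisson.cdf(60, expected_load))   # capacity of 60 rps: ~0.928
print(poisson.cdf(70, expected_load))   # capacity of 70 rps: ~0.997
```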

Conclusion

We have seen that probability theory and statistics can provide useful tools for capacity management. By modelling our situation using a simple probability distribution, we have gained an improved ability to quantify the risks involved in providing capacity for different levels of service. One can use this distribution to decide how much capacity to buy for any given level of demand, allowing one to use the cloud to adapt one’s infrastructure with confidence.
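
As a sketch of what that decision might look like in code, the percent-point function (the inverse of the cumulative distribution function) gives the smallest capacity that meets a chosen target. The 99.9% figure below is purely an illustrative assumption, as is the continued use of scipy.

```python
from scipy.stats import poisson

expected_load = 50   # expected traffic, in requests per second
sla_target = 0.999   # illustrative target: handle the full load 99.9% of the time

# Smallest capacity k (in rps) such that Pr(load <= k) >= sla_target
required_capacity = int(poisson.ppf(sla_target, expected_load))
print(required_capacity)
```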
Unsurprisingly, there are lots of opportunities for using these tools in other areas of service management. All IT infrastructure is uncertain, and it is only by embracing this uncertainty and working with it that we can mitigate the risks involved in IT strategy, design, deployment and operation. 


Joe Geldart
Senior Engineer
Cloudreach
