
Monday 14 February 2011

On Soldiers and Servers

Cloud computing provides an unparalleled opportunity to reduce the risks involved in buying infrastructure, for everyone from small businesses to enterprises. One can requisition only the capacity that is needed, as and when it is needed. Deciding how much capacity to buy at any given moment, however, requires understanding and reasoning about the uncertainties in traffic patterns.

Many things in life are uncertain. Managing this uncertainty is an important part of planning in real-world operations, often phrased in terms of risk (the expected benefit or cost across all outcomes). Risks to IT services come in many forms from the common — such as hard disk failure — to the unusual — such as large earthquakes or asteroid strikes. Probability theory and statistics provide tools for reasoning with uncertain situations and are commonly used to estimate and balance risk and so maximise the probability of a successful outcome.

Contracts, too, make risks explicit and assign responsibility for them. Commercial contracts often include conditions stated statistically, in the form of Service Level Agreements (SLAs). These agreements include requirements that up-time exceed a certain percentage, or that a certain proportion of incidents or problems be dealt with within a certain time.
By exploring the connection between the causes of death of a group of soldiers in the Prussian cavalry and traffic levels on web-servers, this post describes one way that probability theory may be applied to capacity planning, with the goal of meeting some SLA.

The Commonality of Rare Events

In 1898, a Russian statistician called Ladislaus Bortkiewicz was attempting to make sense of rare events. For our current purposes, a rare event is one that is individually unlikely but has many opportunities to happen, such as a mutation in DNA or a win on the lottery. He was looking at the number of soldiers in the Prussian cavalry who were killed by horse-kicks; the probability of any given soldier being killed by a horse kick was low, but in the cavalry there were plenty of occasions on which one could be kicked to death by a horse (whether deserved or not). The question was: given the statistical data we have about how many soldiers were killed each year on average, how can one estimate the probability that a given number would be killed by their horses in a given year?
 
The tool that Bortkiewicz used, and the theme of his book ‘The Law of Small Numbers’, was the Poisson distribution, a probability distribution with one parameter: the average number of events in the given period.
Assuming an average of 10 soldiers were killed each year, the Poisson distribution can be plotted:

[Figure: the Poisson distribution with a mean of 10, plotted as the probability of each possible event count.]
The horizontal axis is the number of events observed (all integers) and the vertical axis is the probability of that number occurring according to the distribution. As can be seen, the highest probability is associated with the mean number of events (ten), but there is a spread of other counts that have non-negligible probability.
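The plot can be recreated with a few lines of Python; the sketch below is only an illustration, using the standard library for the probabilities plus matplotlib for drawing, and the cut-off of 25 events on the horizontal axis is an arbitrary display choice.

    import math
    import matplotlib.pyplot as plt

    lam = 10  # average number of events per year (the distribution's one parameter)
    counts = range(26)  # event counts 0..25; the cut-off is arbitrary
    probs = [(lam ** k / math.factorial(k)) * math.exp(-lam) for k in counts]

    plt.bar(counts, probs)
    plt.xlabel("Number of events observed")
    plt.ylabel("Probability")
    plt.title("Poisson distribution with mean 10")
    plt.show()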
The probability that a number of events, k, occurs when the mean is λ can be calculated as follows:

\Pr(k \mid \lambda) = \frac{\lambda^{k}}{k!}\, e^{-\lambda}.

The probability that six men were killed by their horses is then:

\Pr(6 \mid 10) = \frac{10^{6}}{6!}\, e^{-10} = 0.0631 \text{ (3 s.f.)}.
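The same calculation is straightforward to script. A minimal sketch in Python, using only the standard library, that evaluates the Poisson probability mass function for the horse-kick example:

    import math

    def poisson_pmf(k, lam):
        """Probability of observing exactly k events when the mean is lam."""
        return (lam ** k / math.factorial(k)) * math.exp(-lam)

    # Probability that exactly six soldiers are killed in a year, given a mean of ten.
    print(round(poisson_pmf(6, 10), 4))  # 0.0631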

The Poisson distribution has since been used to model many other situations, such as the spread of epidemics and hardware failure, all of which are ‘rare’ events in the sense above. Traffic to websites can also be modelled using the Poisson distribution; there is a large number of browsers in any given period, and the probability of any one of them visiting a given site is relatively low. It is this that will allow us to answer some questions about traffic to a site that we have responsibility for maintaining.

Estimating Required Capacity

After that lengthy tour, we return to our original problem: capacity planning for a certain load. We want to estimate the probability that our maximum capacity (in requests per second, rps) is greater than the load we’ll receive. All we can directly estimate is the expected amount of traffic that our server will receive (again in requests per second) and our maximum capacity (through load testing or other means).
Since web traffic may be treated as obeying a Poisson distribution, our problem can be stated as finding the probability that the observed load is less than or equal to our maximum capacity. This is the definition of the cumulative distribution function of the Poisson distribution, for maximum capacity k (an integer) and expected load λ:

\Pr(x \le k \mid \lambda) = \frac{\Gamma(k+1, \lambda)}{\Gamma(k+1)},

where Γ(k+1, λ) is the upper incomplete gamma function; for integer k this is simply the sum of the Poisson probabilities for 0, 1, …, k events.

As an example, imagine that we are expecting 50 rps and have a maximum capacity of 60 rps. The probability that the observed load is less than or equal to 60 rps is then 0.928 to three significant figures, which is unlikely to meet most commercial SLAs. If we increase our capacity to 70 rps, through improving the code or provisioning more machines, then the probability of being able to handle the observed load rises to 0.997 (to three significant figures), which may be enough to meet our commitments.
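These figures can be reproduced by summing the Poisson probabilities up to the maximum capacity, which for integer k is equivalent to the incomplete-gamma expression above. The Python sketch below does exactly that, and also searches for the smallest capacity that meets a target probability; the 99.9% target in the example is an illustrative figure, not one taken from any particular SLA.

    import math

    def poisson_cdf(k, lam):
        """Probability that a Poisson-distributed load with mean lam is at most k."""
        return sum((lam ** i / math.factorial(i)) * math.exp(-lam) for i in range(k + 1))

    def required_capacity(lam, target):
        """Smallest capacity k (in rps) such that Pr(load <= k) >= target."""
        k = 0
        while poisson_cdf(k, lam) < target:
            k += 1
        return k

    print(round(poisson_cdf(60, 50), 3))  # ~0.928: 60 rps capacity against 50 rps expected load
    print(round(poisson_cdf(70, 50), 3))  # ~0.997: 70 rps capacity against the same load
    print(required_capacity(50, 0.999))   # capacity needed to hit an illustrative 99.9% target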

Conclusion

We have seen that probability theory and statistics can provide useful tools for capacity management. By modelling our situation using a simple probability distribution, we have gained an improved ability to quantify the risks involved in providing capacity for different levels of service. One can use this distribution to decide how much capacity to buy for any given level of demand, allowing one to use the cloud to adapt one’s infrastructure with confidence.
Unsurprisingly, there are lots of opportunities for using these tools in other areas of service management. All IT infrastructure is uncertain, and it is only by embracing this uncertainty and working with it that we can mitigate the risks involved in IT strategy, design, deployment and operation. 


Joe Geldart
Senior Engineer
Cloudreach

Friday 11 February 2011

Keep Calm and Carry On - Microsoft Windows License Activation in AWS

The AWS cloud has now become a familiar part of our cloud existence, but some of you may have come across a few problems that are not immediately obvious when you start using it. One particularly funny (as in ‘strange’, not ‘haha’) problem you can find in Amazon is the Genuine Advantage Program from Microsoft, which tells you that your instance’s licence is not valid.
First impressions on seeing this are usually, “Wasn't this supposed to be managed by Amazon?” and obviously “Why are they asking me to activate my Windows license?” The fact is that Amazon actually does take care of this problem.
Namely:
  • On first start-up, the Amazon Ec2WindowsActivate service registers the copy of Windows and sets the activation server link.
  • Later, the instance can reconfirm its licence by connecting to these servers.
At the time of writing these are:
us-east-1:
us-west-1:
eu-west-1:
ap-southeast-1:

All now seems pretty logical - but then... “Why am I getting the black wallpaper, and why doesn’t Windows re-activate?” The key point is that these DNS names can only be resolved by the internal Amazon DNS. So if you change your Windows DNS servers to some others (your Active Directory ones, for example), your server won’t be able to resolve them.

So far there are two solutions:
  • You can restore the Amazon DNS in the machine configuration, or
  • You can run the 'slmgr.vbs /skms <IP>' and 'slmgr.vbs /ato' commands manually (or scripted) from the command line, after resolving the DNS name against the Amazon DNS to get the actual internal IP your instance should point to (see the sketch below).
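For the second option, the two commands are easy to wrap in a small script. The sketch below is Python rather than any official tool, and the IP address in it is a placeholder: substitute the internal IP of your region’s activation endpoint, obtained by resolving its name against the Amazon DNS.

    import subprocess

    # Placeholder value: replace with the internal IP of your region's activation
    # endpoint, looked up against the Amazon DNS while it is still reachable.
    KMS_IP = "10.0.0.1"

    SLMGR = r"C:\Windows\System32\slmgr.vbs"

    # Point the Windows licensing service at the activation endpoint by IP...
    subprocess.check_call(["cscript", "//nologo", SLMGR, "/skms", KMS_IP])
    # ...then ask Windows to activate against it.
    subprocess.check_call(["cscript", "//nologo", SLMGR, "/ato"])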

Hope that helps all you cloud users.


Emilio Garcia
Senior Engineer

Wednesday 9 February 2011

More Clouds In The Sky

Behold, the latest additions to the Cloudreach family. It's no wonder we're scouring London for a new HQ: we've got Cloud Consultants working in cupboards, some in the kitchen fridge (both shelves), and we've locked one in a drawer... just need to remember which one!

New hires since mid December:

Friday 4 February 2011

The Cloud = High Performance Computing

The cloud is a perfectly fitting platform for solving many high-performance computing problems. It may actually be cheaper, and may return results faster, than traditional clusters, for both occasional tasks and periodic use.

For a number of years, science and analytics users have been using clusters for high-performance computing in areas such as bioinformatics, climate science, financial market predictions, data mining, finite element modelling etc. Companies working with vast amounts of data, such as Google, Yahoo! and Facebook, use vast dedicated clusters to crawl, index and search websites.

Dedicated Company Clusters
Often a company will own its own dedicated cluster for high-performance computations. Utilisation will likely be below 100% most of the time, as the cluster needs to be sized for peak demand, e.g. overnight analyses. The cluster will quickly become business-critical, and it may become difficult or prohibitively expensive to schedule longer maintenance shutdowns; hence the cluster may end up running on outdated software. If the cluster has grown in an ad-hoc fashion from a small start, a critical point will be reached where any further growth requires a disruptive hardware infrastructure upgrade and a software re-configuration or upgrade, i.e. a long shutdown. This may simply not be an option, or may carry an unacceptable risk.

Shared institutional clusters
In the case of a shared cluster (such as UK’s HECToR) the end users will likely face availability challenges:
  • There may not be enough task slots in the job pool for “surge” needs
  • Job queues may cause the job to wait for a few days
  • Often departments will need to watch monthly cluster utilisation quotas or face temporary black-listing from the job pool

Clusters Are Finite and Don’t Grow On Demand
Given the exponential growth of the data we process, our needs (e.g. an experiment in Next Generation Sequencing) may simply outgrow the pace at which the clusters themselves can grow.

The Cloud Alternative

For those who feel constrained by the above problems, Amazon Web Services offer a viable HPC alternative:
  • AWS Elastic Compute Cloud (EC2) brings on-demand instances
  • The recently (late 2010) introduced AWS Cluster Compute Instances are high-performance instances running inside a high-speed, low-latency sub-network
  • For loosely coupled, easily parallelised problems, AWS Elastic MapReduce offers Hadoop (version 0.20.2), Hive and Pig as a service, well integrated with the rest of the AWS stack such as S3 storage.
  • For tightly coupled problems, Message Passing Interface (MPI), OpenMP and similar technologies will benefit from the fast network.
  • For analyses requiring a central, clustered database, MySQL is offered as a service called the AWS Relational Database Service (RDS), with Oracle Database announced as coming next.
The Downside of The Cloud Approach: The Data Localisation Challenge (& Solutions)
The fact that the customer’s data (potentially in vast amounts) need to get to AWS over the public Internet is a limiting factor; often the customer’s own network is the actual bottleneck. There are two considerations to make:
  • Many NP-complete problems are actually above the rule-of-thumb break-even point for moving data over a slow link vs. the available CPU power (1 byte per 100,000 CPU cycles); a rough worked example follows this list.
  • Often the actual “big data” are the reference datasets, which are mostly static (e.g., in bioinformatics, reference genomes). AWS already hosts a number of public datasets. For others, it may make sense to send the first batch of data to AWS for upload on a physical medium by post, and later apply only incremental changes.
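As a rough illustration of that break-even rule of thumb, the Python sketch below compares the time to transfer a dataset over a given uplink with the time to process it at 100,000 CPU cycles per byte. The dataset size, link speed and clock rate are illustrative assumptions, not measurements.

    # Break-even sketch: transfer time vs. compute time for one dataset.
    # All figures below are illustrative assumptions, not measurements.
    dataset_bytes = 100e9        # 100 GB of input data
    uplink_bits_per_s = 100e6    # 100 Mbit/s office uplink
    cycles_per_byte = 100_000    # rule-of-thumb processing cost from the post
    cpu_hz = 2.5e9               # one 2.5 GHz core

    transfer_hours = dataset_bytes * 8 / uplink_bits_per_s / 3600
    compute_hours = dataset_bytes * cycles_per_byte / cpu_hz / 3600

    print(f"transfer: {transfer_hours:.1f} h, compute on one core: {compute_hours:.0f} h")
    # When compute time dwarfs transfer time, shipping the data to the cloud pays off.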
Martin Kochan
Cloud Developer
Cloudreach