
Friday 4 February 2011

The Cloud = High Performance Computing

The cloud is a natural fit for many high-performance computing problems. It can work out cheaper than a traditional cluster and may return results faster, for both occasional tasks and periodic workloads.

For a number of years, science and analytics users have relied on clusters for high-performance computing in areas such as bioinformatics, climate science, financial market prediction, data mining and finite element modelling. Companies working with vast amounts of data, such as Google, Yahoo! and Facebook, run large dedicated clusters to crawl, index and search websites.

Dedicated Company Clusters
Often a company will own its own dedicated cluster for high-performance computations. Utilisation will typically sit well below 100% most of the time, because the cluster has to be sized for peak demand, e.g. overnight analyses. The cluster tends to become business-critical quickly, making longer maintenance shutdowns difficult or prohibitive to schedule: as a result it may end up running outdated software. If the cluster has grown in an ad-hoc fashion from a very small start, it will eventually hit a critical point where any further growth requires a disruptive hardware infrastructure upgrade and a software re-configuration or upgrade, i.e. a long shutdown. That may simply not be an option, or it may carry unacceptable risk.

Shared institutional clusters
In the case of a shared cluster (such as the UK's HECToR), end users will likely face availability challenges:
  • There may not be enough task slots in the job pool to absorb "surge" needs
  • Job queues may leave a job waiting for several days
  • Departments often have to watch monthly cluster utilisation quotas or face temporary black-listing from the job pool

Clusters Are Finite and Don't Grow On Demand
Given the exponential growth of the data we process, our needs (e.g. an experiment in Next Generation Sequencing) may simply outgrow the pace at which clusters can expand.

The Cloud Alternative

For those who feel constrained by the above problems, Amazon Web Services offers a viable HPC alternative:
  • AWS Elastic Compute Cloud (EC2) provides on-demand instances
  • The recently introduced (late 2010) AWS Cluster Compute Instances are high-performance instances running inside a high-speed, low-latency sub-network (a launch sketch follows after this list)
  • For loosely coupled, easily parallelised problems, AWS Elastic MapReduce offers Hadoop (version 0.20.2), Hive and Pig as a service, well integrated with the rest of the AWS stack such as S3 storage (a streaming example follows below)
  • For tightly coupled problems, Message Passing Interface (MPI), OpenMP and similar technologies benefit from the fast network (an MPI sketch follows below)
  • For analyses requiring a central, clustered database, MySQL is offered as a service called AWS Relational Database Service (RDS), with Oracle Database announced as coming next
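
To give a flavour of the Cluster Compute option, the sketch below starts a small group of cc1.4xlarge instances inside a cluster placement group using the boto Python library. The AMI ID, key pair name and group size are placeholders, and the calls assume a boto release recent enough to support placement groups; treat this as a minimal sketch rather than production code.

# Sketch: launching AWS Cluster Compute Instances in a placement group with boto.
# The AMI ID, key name and instance count below are placeholders, not real values.
import boto

conn = boto.connect_ec2()  # credentials taken from the environment / boto config

# Cluster Compute Instances must be launched inside a "cluster" placement group
# to get the low-latency, high-bandwidth network between them.
conn.create_placement_group('hpc-demo', strategy='cluster')

reservation = conn.run_instances(
    'ami-xxxxxxxx',               # placeholder: a Cluster Compute (HVM) AMI
    min_count=4, max_count=4,     # placeholder: size of the compute cluster
    instance_type='cc1.4xlarge',  # the Cluster Compute instance type
    placement_group='hpc-demo',
    key_name='my-keypair')        # placeholder SSH key pair

for instance in reservation.instances:
    print("%s %s" % (instance.id, instance.state))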
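For the Elastic MapReduce case, here is a minimal Hadoop Streaming job written in Python: a word-count style mapper and reducer that EMR (or any Hadoop 0.20.x cluster) can run over input held in S3. The word-count task itself is just an illustrative stand-in for a loosely coupled analysis.

# mapper.py -- Hadoop Streaming mapper: emit "word<TAB>1" for every word on stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

# reducer.py -- Hadoop Streaming reducer: sum the counts for each word.
# Hadoop delivers the mapper output sorted by key, so equal words arrive adjacent.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)

if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

On EMR this pair would be submitted as a streaming step, with the two scripts and the input and output locations living in S3.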
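For the tightly coupled case, the fragment below is a minimal mpi4py sketch of the kind of code that benefits from the low-latency network: each rank computes a partial result and rank 0 combines them. The workload (summing a slice of numbers) is purely illustrative.

# Sketch: a trivially small MPI job using mpi4py.
# Run with something like: mpirun -np 8 python sum_example.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank handles its own slice of the (illustrative) problem.
local_sum = float(sum(range(rank * 1000, (rank + 1) * 1000)))

# Combine the partial results on rank 0 over the cluster's fast interconnect.
total = comm.reduce(local_sum, op=MPI.SUM, root=0)

if rank == 0:
    print("Total across %d ranks: %.0f" % (size, total))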

The Downside of the Cloud Approach: the Data Localisation Challenge (& Solutions)
The fact that the customer's data (potentially in vast amounts) need to get to AWS over the public Internet is a limiting factor, and often the customer's own network is the actual bottleneck. There are two considerations:
  • Many NP-complete problems sit well above the rule-of-thumb break-even point for moving data over a slow link versus the CPU power gained (roughly 1 byte per 100,000 CPU cycles); a worked example follows after this list
  • Often the actual "big data" are the reference datasets, which are mostly static (e.g. reference genomes in bioinformatics). AWS already hosts a number of public datasets; for others, it may make sense to send the initial bulk upload to AWS on physical media by post and afterwards apply only incremental changes over the network.
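
To make the break-even rule of thumb concrete, the back-of-the-envelope calculation below compares upload time with single-core compute time for a hypothetical job; the dataset size, link speed, clock rate and cycles-per-byte figures are all illustrative assumptions.

# Back-of-the-envelope check of the ~1 byte per 100,000 CPU cycles rule of thumb.
# All figures below are illustrative assumptions, not measurements.
dataset_bytes = 50 * 1024 ** 3      # 50 GB of input data
uplink_bits_per_sec = 100e6         # 100 Mbit/s office uplink
cpu_hz = 2.5e9                      # 2.5 GHz per core
cycles_per_byte = 500000            # how compute-heavy the analysis is

upload_hours = dataset_bytes * 8 / uplink_bits_per_sec / 3600.0
compute_core_hours = dataset_bytes * cycles_per_byte / cpu_hz / 3600.0

print("Upload time:      %.1f hours" % upload_hours)
print("Compute required: %.0f core-hours" % compute_core_hours)

# If the job needs well over ~100,000 cycles per byte, a one-off upload is cheap
# compared with the compute capacity gained by moving the work to the cloud.
if cycles_per_byte > 100000:
    print("Above break-even: shipping the data to the cloud is worthwhile")
else:
    print("Below break-even: the upload is likely to dominate")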
Martin Kochan
Cloud Developer
Cloudreach
