Wednesday, 6 June 2012

Auto Scaling and Chef Node Deregistration

Creating an infrastructure which scales dynamically according to the traffic volumes hitting a web site is a reality using Amazon Web Services (AWS). However, there are a number of key design points which differentiate a professional solution from an ad hoc collection of services.

One of these design principles for auto scaling an application is to strive to keep it stateless.

Sometimes this is not possible, for example when OpsCode Chef is used to configure and deploy code to instances created by an auto scaling process. In order to use Chef, a new instance first needs to register itself with a central Chef server and resolve credentials and SSH access so that code and commands can be executed.
The problem with registering auto scaling nodes is that when they are terminated, for one reason or another, the Chef server is left in a “dirty” state, with references to nodes which are no longer active.

Fortunately, the AWS team have enhanced their platform with many additional services complementing EC2, two of which help solve the problem described above.

When an action is triggered in an auto scaling group, it can generate a notification using the Amazon Simple Notification Service (SNS). SNS works across all the services offered by AWS, and an auto scaling group can be configured to generate an outbound notification for both scale-up and scale-down events.
SNS receives and publishes events using “topics”. A topic is a communication channel to send messages and subscribe to notifications. It provides an access point for publishers and subscribers to communicate with each other.

An SNS topic can be created very easily using the web UI, or using the command line:

sns-create-topic MyTopic

Once a topic for auto scaling events is created, the auto scaling group needs to know where to publish notifications. This can be achieved in one of two ways: either using the command line interface:

as-put-notification-configuration MyGroup --topic-arn arn:placeholder:MyTopic --notification-types autoscaling:EC2_INSTANCE_TERMINATE


or using CloudFormation:

"NotificationConfiguration" : {
   "TopicARN" : { "Ref" : "MyTopic" },
   "NotificationTypes" : [ "autoscaling:EC2_INSTANCE_TERMINATE" ]
},


Notifications by themselves are not very useful if nothing consumes them and acts upon them. A naive approach would be to use email or SMS alerts and manually remove instances using the Chef user interface. Fortunately we can do better than this, and instruct machines (or scripts) to perform this (boring!) task on our behalf.

The way to consume an SNS notification programmatically is to use the Amazon Simple Queue Service (SQS) as an endpoint for SNS. The SNS topic can be configured to act as a producer for an SQS queue.

Once SNS and SQS are linked together, it’s just a matter of writing a consumer which can read EC2_INSTANCE_TERMINATE messages and remove the node from Chef.

Given that Chef uses Ruby as the language to define “recipes”, I chose to use Fog to leverage the SQS RESTful API.

Before retrieving the messages in the queue, authentication needs to take place.

sqs = Fog::AWS::SQS.new(
        :aws_access_key_id => access_key_id,
        :aws_secret_access_key => secret_access_key,
        :region => "eu-west-1"
       )


If the authentication is successful, the receive_message method will return up to the number of messages specified as a parameter.

response = sqs.receive_message(QUEUE_URL, { 'Attributes' => [], 'MaxNumberOfMessages' => 10 })

From this point on, it is a matter of parsing the JSON message and, if the event is a termination notification, deleting the node.
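Note that the notification is JSON nested inside JSON: the SQS message body is the SNS envelope, whose "Message" field is itself a JSON string. The double decoding can be tried in isolation with a fabricated sample payload (the field names follow the auto scaling notification format; the values here are made up):

```ruby
require 'json'

# Fabricated sample: an SQS message body wrapping an SNS envelope,
# which in turn wraps the auto scaling notification as a JSON string.
sqs_body = JSON.generate(
  'Type'    => 'Notification',
  'Message' => JSON.generate(
    'Event'         => 'autoscaling:EC2_INSTANCE_TERMINATE',
    'EC2InstanceId' => 'i-0123abcd'
  )
)

envelope     = JSON.parse(sqs_body)            # first parse: the SNS envelope
notification = JSON.parse(envelope['Message']) # second parse: the notification itself

if notification['Event'] == 'autoscaling:EC2_INSTANCE_TERMINATE'
  puts "Deregistering #{notification['EC2InstanceId']} from Chef"
end
```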

messages = response.body['Message']
unless messages.empty?
  messages.each do |m|
    body = JSON.parse(m['Body'])
    message = JSON.parse(body["Message"])
    if message["Event"].include? AUTOSCALING_NOTIFICATION["Terminate"]


In order to keep the code simple, the node is removed by invoking knife from the command line, so that there is no need to handle authentication:

instance_id   = message["EC2InstanceId"]
delete_node   = "knife node delete #{instance_id} -y"
delete_client = "knife client delete #{instance_id} -y"
output = `#{delete_node}`
result = $?.success?
result = `#{delete_client}` if result
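One hardening worth noting: if the instance ID ever came from an untrusted source, it should be shell-escaped before being interpolated into a command. A minimal sketch using Ruby’s standard Shellwords module (the helper name is mine, not part of the script above):

```ruby
require 'shellwords'

# Hypothetical helper: build a knife delete command with the
# instance ID safely escaped for the shell.
def knife_delete_command(kind, instance_id)
  "knife #{kind} delete #{Shellwords.escape(instance_id)} -y"
end

puts knife_delete_command('node', 'i-0123abcd')
puts knife_delete_command('client', 'i-0123abcd')
```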

        
Once a message is processed successfully, it can be removed from the queue:

sqs.delete_message(QUEUE_URL, m['ReceiptHandle'])

In order to work in an enterprise production environment, the script needs a few additions, but this should be enough to get you started!

Alternatively, come and talk to the experts at Cloudreach.



Nicola Salvo
System Engineer
Cloudreach Limited

Friday, 11 February 2011

Keep Calm and Carry On - Microsoft Windows License Activation in AWS

The AWS Cloud has now become a familiar part of our cloud existence, but some of you may have come across a few problems that are not immediately obvious at the start. One particularly funny (as in ‘strange’, not ‘haha’) problem you can find in Amazon is the Genuine Advantage Program from Microsoft, which tells you that your instance’s licence is not valid.
First impressions on seeing this are usually, “Wasn't this supposed to be managed by Amazon?” and obviously “Why are they asking me to activate my Windows license?” The fact is that Amazon actually does take care of this problem.
Namely:
  • On first start-up, the Amazon Ec2WindowsActivate service registers the copy of Windows and sets the activation server link.
  • Later, the instance can reconfirm its licence by connecting to these servers.
At the time of writing these are:
us-east-1:
us-west-1:
eu-west-1:
ap-southeast-1:

All now seems pretty logical - but then... “Why am I getting the black wallpaper, and why doesn’t Windows re-activate?” The key point is that these DNS names can only be resolved by the internal Amazon DNS. So if you change your Windows DNS servers to some others (your Active Directory ones, for example), your server won’t be able to resolve them.

So far there are two solutions:
  • You can restore the Amazon DNS in the machine configuration, or
  • You can run the 'slmgr.vbs /skms IP' and 'slmgr.vbs /ato' commands manually (or scripted) in the CLI, after resolving the DNS request against the Amazon DNS to get the actual internal IP your instance should point to.
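The second option lends itself to scripting. A minimal Ruby sketch of the command construction (the helper name is an assumption of mine; the IP would come from resolving the regional activation hostname against the Amazon DNS beforehand, e.g. with Ruby’s Resolv::DNS):

```ruby
# Build the two slmgr invocations for a KMS server IP obtained beforehand
# (e.g. via Resolv::DNS.new(:nameserver => ['<amazon-dns-ip>'])).
# Hypothetical helper, not an official tool.
def kms_activation_commands(kms_ip)
  [
    "cscript //B C:\\Windows\\System32\\slmgr.vbs /skms #{kms_ip}",  # point at the KMS host
    "cscript //B C:\\Windows\\System32\\slmgr.vbs /ato"              # attempt activation
  ]
end

kms_activation_commands('10.0.0.1').each { |cmd| puts cmd }
```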

Hope that helps all you cloud users.


Emilio Garcia
Senior Engineer

Friday, 4 February 2011

The Cloud = High Performance Computing

The cloud is a perfect fit for many high-performance computing problems. It may actually be cheaper, and may offer a faster return of results, than traditional clusters, for both occasional tasks and periodic use.

For a number of years, science and analytics users have been using clusters for high-performance computing in areas such as bioinformatics, climate science, financial market predictions, data mining, finite element modelling etc. Companies working with vast amounts of data, such as Google, Yahoo! and Facebook, use vast dedicated clusters to crawl, index and search websites.

Dedicated company clusters
Often a company will own its own dedicated cluster for high-performance computations. Utilisation will likely be below 100% most of the time, as the cluster needs to be scaled for peak demand, e.g. overnight analyses. The cluster will likely become business-critical rapidly, and it may become difficult or prohibitive to schedule longer maintenance shutdowns: hence the cluster may end up running outdated software. If the cluster has grown in an ad hoc fashion from very small beginnings, there will come a critical point when any further growth requires a disruptive hardware infrastructure upgrade and software re-configuration or upgrade, i.e. a long shutdown. This may simply not be an option, or may carry an unacceptable risk.

Shared institutional clusters
In the case of a shared cluster (such as the UK’s HECToR), the end users will likely face availability challenges:
  • There may not be enough task slots in the job pool for “surge” needs
  • Job queues may cause a job to wait for a few days
  • Often departments will need to watch monthly cluster utilisation quotas or face temporary black-listing from the job pool
Clusters Are Finite and Don’t Grow On Demand
Given the exponential growth of the data we process, our needs (e.g. an experiment in Next Generation Sequencing) may simply outgrow the pace at which clusters can be expanded.

The Cloud Alternative

For those who feel constrained by the above problems, Amazon Web Services offer a viable HPC alternative:
  • AWS Elastic Compute Cloud (EC2) brings on-demand instances
  • The recently (late 2010) introduced AWS Cluster Compute Instances are high-performance instances running inside a high-speed, low-latency sub-network
  • For loosely coupled, easily parallelised problems, AWS Elastic MapReduce offers Hadoop (version 0.20.2), Hive and Pig as a service, well integrated into the rest of the AWS stack, such as S3 storage
  • For tightly coupled problems, Message Passing Interface, OpenMP and similar technologies will benefit from the fast network
  • For analyses requiring a central, clustered database, MySQL is offered as a service called the AWS Relational Database Service (RDS), with Oracle DB announced as coming next
The Downside of the Cloud Approach: The Data Localisation Challenge (& Solutions)
The fact that the customer’s data (potentially in vast amounts) needs to get to AWS over the public Internet is a limiting factor. Often the customer’s own network may be the actual bottleneck. There are two considerations to make:
  • Many NP-complete problems are actually above the rule-of-thumb break-even point for moving data over slow link vs. available CPU power (1 byte / 100,000 CPU cycles)
  • Often the actual “big data” are the reference datasets that are mostly static (e.g., in bioinformatics, reference genomes). AWS already hosts a number of public datasets. For others, it may make sense to send the first batch of data to AWS on a physical medium by post, and later apply only incremental changes.
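The rule of thumb in the first bullet can be made concrete with some back-of-the-envelope arithmetic (the link speed and clock rate below are illustrative assumptions, not figures from the text):

```ruby
# Break-even sketch: at 1 byte per 100,000 CPU cycles of work, compare how fast
# a link can deliver data with how fast a single core can chew through it.
CYCLES_PER_BYTE = 100_000           # rule of thumb from the text

link_bytes_per_sec = 10_000_000     # assumed ~100 Mbit/s link (~10 MB/s)
cpu_cycles_per_sec = 3_000_000_000  # assumed single 3 GHz core

# Bytes of input one core can process per second at the rule-of-thumb rate.
bytes_consumed_per_sec = cpu_cycles_per_sec / CYCLES_PER_BYTE  # 30,000 bytes/s

# The link delivers far more bytes per second than one core can consume, so for
# problems at or above this compute intensity the upload is not the bottleneck:
# a single link can keep hundreds of cloud cores busy.
puts link_bytes_per_sec / bytes_consumed_per_sec
```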
Martin Kochan
Cloud Developer
Cloudreach