CALL +44 (0)20 7183 3893
Blog

Wednesday, 6 June 2012

Auto Scaling and Chef Nodes Deregistration

Creating an infrastructure which scales dynamically according to the traffic volumes hitting a web site, is a reality using Amazon Web Services (AWS). However, there are a number of key design points, which differentiate a professional solution from an ad hoc collection of services.

One of these design principles for auto scaling an application is to strive keep it stateless.

Sometimes this is not possible, for example when OpsCode Chef is used to configure and deploy code to instances created via an auto scaling process. In order to use Chef on a new instance, it needs to first register itself first with a central Chef server and resolve credentials and ssh access so that code and commands can be executed.
The problem with registering autoscaling nodes, is that when those are terminated for one reason or the other, Chef Server is left in a “dirty” status, with references to nodes which are no longer active.

Fortunately, the AWS team have enhanced their platform with lots of additional services complementing EC2, two of which are useful to solve the problem described above.

When an action is triggered in an auto scaling group, this can generate a notification using the Amazon Simple Notification Service (SNS). SNS works across all the services offered by AWS, and an auto scaling group can be configured in order to generate an outbound notification for both scale up and scale down events.
SNS receives and publishes events using “topics”.  A topic is a communication channel to send messages and subscribe to notifications. It provides an access point for publishers and subscribers to communicate with each other.

An SNS topic can be created very easily using the web UI, or using the command line:

sns-create-topic MyTopic

Once a topic for autoscaling events is created, the auto scaling group needs to know where to publish notifications. This can be achieved in one of two ways; either using the command line interface:

as-put-notification-configuration MyGroup --topic-arn arn:placeholder:MyTopic --notification-types autoscaling: autoscaling:EC2_INSTANCE_TERMINATE topic-ARN MyTopic


or using CloudFormation:

"NotificationConfiguration" : {
             "TopicARN" : { "Ref" : "MyTopic" },
             "NotificationTypes" : [ "autoscaling:EC2_INSTANCE_TERMINATE"]
       },


Notifications by themselves are not very useful if something doesn’t consume them and act upon them. A naive approach would be to use email or SMS alerts and manually remove instances using the Chef user interface. Fortunately we can do better than this, and instruct machines (or scripts) to perform this (boring!) task on our behalf.

The method of consuming an SNS notification programmatically is to use Amazon Simple Queue Service (SQS) as an endpoint to SNS. The SNS topic can be configured in order to act as a producer for an SQS queue.

Once SNS and SQS are linked together, it’s just a matter of writing a consumer which is able to read EC2_INSTANCE_TERMINATE messages, and remove the node from Chef.

Given that Chef uses Ruby as language to define “recipes”, I choose to use FOG in order to leverage the SQS RESTful API.

Before retrieving the messages in the queue, authentication needs to take place.

sqs = Fog::AWS::SQS.new(
        :aws_access_key_id => access_key_id,
        :aws_secret_access_key => secret_access_key,
        :region => "eu-west-1"
       )


If the authentication is successful, the method receive_message will return a list containing  the number of messages specified as a parameter.

response = sqs.receive_message(QUEUE_URL, { 'Attributes' => [], 'MaxNumberOfMessages' => 10 })

From this point on, is a matter of parsing the JSON message, and if the event is a termination notification deleting the instance.

messages = response.body['Message']
unless messages.empty?
   messages.each do |m|
   body = JSON.parse(m['Body'])
   message = JSON.parse(body["Message"])
   if message["Event"].include? AUTOSCALING_NOTIFICATION["Terminate"]


In order to keep the code simple, the node is removed by invoking knife from the command line, so that there is no need to handle the authentication:

instance_id = message["EC2InstanceId"]
delete_node  = "knife node delete #{instance_id} -y"
delete_client= "knife client delete #{instance_id} -y"
output = `#{delete_node}`
result=$?.success?
if result == true
   result = `#{delete_client}`
end

        
Once a message is processed successfully, it can be removed from the queue:

sqs.delete_message(QUEUE_URL, m['ReceiptHandle'])

In order to work in an enterprise production environment, the script needs few additions, but this should be enough to get you started!

Alternatively, come and talk to the experts at Cloudreach.



Nicola Salvo
System Engineer
Cloudreach Limited

4 comments:

Peter Green said...

This is a very useful post, thanks for sharing: I hope to be able to implement this myself shortly!

Sunny said...

I have an environment where knife node list shows ec2 internal hostnames. How do we retrieve hostname of a terminated instance using the instance id.

Unknown said...

This seems like a lot of work. You can use this cookbook to clean up your chef server: https://supermarket.chef.io/cookbooks/scaledown_cleanup

It works well for my team and you can probably use it on more cloud platforms than just AWS.

Unknown said...

This seems like a lot of work. You can use this cookbook to clean up your chef server: https://supermarket.chef.io/cookbooks/scaledown_cleanup

It works well for my team and you can probably use it on more cloud platforms than just AWS.

Post a Comment

Pontus is ready and waiting to answer your questions