AWS offers great functionality with its Auto Scaling Groups (ASG), which let you stop worrying about the capacity of your infrastructure and leave scaling up and down to Amazon, based on criteria you define. An ASG manages only the number of EC2 nodes, which are built from the image you specify when defining the group. Depending on the infrastructure solution you want to deploy, this may or may not be a blocker.

“Normally” you’d use another piece of AWS functionality – Elastic Load Balancer (ELB) – attach it to your group, and the job is done. But what if for some reason you don’t want to use ELB and would rather manage your own load balancing setup, using for example haproxy?

Why on earth would you want to do that? I can think of a few reasons:

  • being in control of log format
  • “realtime” and easier log collection
  • advanced frontend/backend configuration
  • excuse to play with SNS and SQS offerings from Amazon

As the last item on that list suggests, one reward for skipping ELB is the chance to play with other AWS services – Simple Notification Service (SNS) and Simple Queue Service (SQS). On top of that you’ll need to exercise your coding skills, and I prefer Python.

Basic information flow would look as follows:


The ASG generates a notification on a scaling event and sends it to an SNS topic. An SQS queue is subscribed to that topic, so messages wait in line to be processed. A Python script polls the queue, modifies the haproxy config file accordingly, and removes the message from SQS.

First, you need to create the necessary components in your AWS account. Start with a new SNS topic; there’s only one required parameter – the name of the topic. On a side note, for debugging you might be interested in the options available under “Edit topic delivery policy” in the “Actions” menu.

Next, create an SQS queue. The default values would be fine, but you might want to tune them to your needs. In this scenario I’d focus on Default Visibility Timeout, and I’d make it significantly smaller if you have more than one instance of the Python script polling the queue.

Now it’s time to tie the two together and subscribe the queue to the SNS topic. Choose the option from the Queue Actions menu, then select your topic from the list.

Finally, go to your Auto Scaling Group and add a notification. Select your topic from the list and check the boxes next to launch and terminate. (A proper solution would also take possible errors into account, but that’s out of scope for this blog post.)

 

Everything should now be in place. To verify the AWS components easily, force your ASG to scale either up or down and observe your queue. If a new message shows up while polling, everything is OK and you can start looking at how to get from SQS to the haproxy config. Example message:

{
  "Type" : "Notification",
  "MessageId" : "4bd78f34-3162-5d9a-9fd3-c03de213b664",
  "TopicArn" : "arn:aws:sns:eu-west-1:123456789111: topic",
  "Subject" : "Auto Scaling: launch for group \" asg\"",
  "Message" : "{\"StatusCode\":\"InProgress\",\"Service\":\"AWS Auto Scaling\",\"AutoScalingGroupName\":\" asg\",\"Description\":\"Launching a new EC2 instance: i-2d125da0\",\"ActivityId\":\"b2446012-9583-43e9-96b9-3341b1edebc5\",\"Event\":\"autoscaling:EC2_INSTANCE_LAUNCH\",\"Details\":{\"Availability Zone\":\"eu-west-1a\",\"Subnet ID\":\"subnet-c76bf99e\"},\"AutoScalingGroupARN\":\"arn:aws:autoscaling:eu-west-1: 123456789111:autoScalingGroup:a4993b91-8827-47fc-8135-5a845c3f58b1:autoScalingGroupName/ asg\",\"Progress\":50,\"Time\":\"2016-01-08T15:09:37.851Z\",\"AccountId\":\"123456789111\",\"RequestId\":\"b2446012-9583-43e9-96b9-3341b1edebc5\",\"StatusMessage\":\"\",\"EndTime\":\"2016-01-08T15:09:37.851Z\",\"EC2InstanceId\":\"i-2d125da0\",\"StartTime\":\"2016-01-08T15:09:03.980Z\",\"Cause\":\"At 2016-01-08T15:08:45Z a user request executed policy Test changing the desired capacity from 1 to 3. At 2016-01-08T15:09:02Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 1 to 3.\"}",
  "Timestamp" : "2016-01-08T15:09:37.890Z"
}
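The interesting fields sit in a JSON document nested inside another JSON document, so unpacking takes two json.loads passes – which is what the polling script further down does. A minimal sketch, using an abridged, illustrative stand-in for the payload above:

```python
import json

# Abridged stand-in for the SQS message body shown above (values illustrative)
raw_body = json.dumps({
    "Type": "Notification",
    "Message": json.dumps({
        "Event": "autoscaling:EC2_INSTANCE_LAUNCH",
        "EC2InstanceId": "i-2d125da0",
    }),
})

outer = json.loads(raw_body)          # the SNS envelope arrives as a string
inner = json.loads(outer["Message"])  # the ASG payload is JSON inside JSON

# strip the "autoscaling:" prefix to get the bare event name
event = inner["Event"].split(":", 1)[1]
instance_id = inner["EC2InstanceId"]
```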

 

The most interesting part of this solution is the actual Python code that polls the queue and modifies the haproxy configuration accordingly. Of course you could be using nginx instead, or something completely different – the syntax of the added/removed values and the search expressions would have to be altered, but the general idea stays the same.
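For reference, the backend section the script manipulates might look something like this – the backend name, server naming scheme, and options here are illustrative, not prescribed:

```
backend api
  balance roundrobin
  server api-i-0a1b2c3d 10.0.1.10:80 check inter 3s
  server api-i-2d125da0 10.0.1.11:80 check inter 3s
```

The script keys off the `server api-<instance-id>` naming convention to find lines to add or remove.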

import boto.sqs
import boto.ec2
import json
import re

We’ll start with some imports: boto is a great module for interacting with AWS, we need json to parse notifications, and we’ll use re to build search expressions that get us to the proper line in the config file.

config_file = "path/to/haproxy.conf"

Next, a couple of small functions that let us find a line in the config file, replace it, or insert new content into it.

def replace_line(file_name, line_num, text):
    # line_num is a 0-based index into the file's lines
    with open(file_name, 'r') as f_r:
        lines = f_r.readlines()
    lines[line_num] = text
    with open(file_name, 'w') as f_w:
        f_w.writelines(lines)

def find_line(file_name, regex):
    lines_to_change = []
    with open(file_name, 'r') as f:
        for num, line in enumerate(f, 1):
            line = line.rstrip()
            for m in re.finditer(regex,line):
                lines_to_change.append(num)
    return lines_to_change

def edit_file(file_name, line_num, text):
    # insert text after the given 1-based line number
    with open(file_name, 'r') as f_r:
        contents = f_r.readlines()
    contents.insert(line_num, text)
    with open(file_name, 'w') as f_w:
        f_w.write("".join(contents))
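A subtle point worth checking: find_line returns 1-based line numbers (it enumerates from 1), while replace_line expects a 0-based list index, so callers must subtract one. A quick sanity check against a throwaway file, using simplified versions of the two helpers:

```python
import re
import tempfile

def find_line(file_name, regex):
    # 1-based line numbers of every line matching the regex
    with open(file_name) as f:
        return [num for num, line in enumerate(f, 1) if re.search(regex, line)]

def replace_line(file_name, line_num, text):
    # line_num is a 0-based index into the file's lines
    with open(file_name) as f:
        lines = f.readlines()
    lines[line_num] = text
    with open(file_name, "w") as f:
        f.writelines(lines)

# build a throwaway config with two backend servers
with tempfile.NamedTemporaryFile("w", suffix=".conf", delete=False) as tmp:
    tmp.write("backend api\n"
              "  server api-i-111 10.0.0.1:80 check\n"
              "  server api-i-222 10.0.0.2:80 check\n")
    path = tmp.name

hits = find_line(path, "api-i-222")   # 1-based: the third line
replace_line(path, hits[0] - 1, "")   # blank out that server line
with open(path) as f:
    remaining = f.read()
```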

The notification contains the instance ID of the new/destroyed node, but we need an IP to put into the config, hence a helper function that grabs the private IP based on the instance ID.

def get_ec2_private_ip(instance_id):
    conn_ec2 = boto.ec2.connect_to_region("eu-west-1")
    status = conn_ec2.get_only_instances(instance_ids=instance_id)
    while len(status) == 0:
        # sometimes getting the IP takes a bit longer than it takes
        # for the notification to go through
        print "server not ready yet, retrying"
        status = conn_ec2.get_only_instances(instance_ids=instance_id)
    return status[0].private_ip_address
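The loop above will hammer the EC2 API as fast as it can; in practice you’d want to cap the retries and sleep between attempts. A generic sketch of that pattern – flaky_lookup is a hypothetical stub standing in for the EC2 call, succeeding only once the "instance" is ready:

```python
import time

def retry(fn, attempts=5, delay=0.01):
    # call fn until it returns a truthy value or attempts run out
    for _ in range(attempts):
        result = fn()
        if result:
            return result
        time.sleep(delay)
    raise RuntimeError("gave up after %d attempts" % attempts)

calls = {"n": 0}

def flaky_lookup():
    # stub: succeeds on the third call, like an instance whose
    # private IP is not yet available right after launch
    calls["n"] += 1
    return "10.0.1.20" if calls["n"] >= 3 else None

ip = retry(flaky_lookup)
```

In the real script you’d pass a small lambda wrapping get_only_instances and use a delay of a few seconds.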

Finally, we need a function to connect to the queue and fetch a message.

def get_message_from_queue(queue):
    conn_sqs = boto.sqs.connect_to_region("eu-west-1")
    q = conn_sqs.get_queue(queue)
    # get the next message from the queue; return False if the queue
    # is empty at the time, but do not error out
    try:
        rs = q.get_messages()
        message = rs[0].get_body()
    except IndexError:
        return False
    message = json.loads(message)
    message = json.loads(message['Message'])
    # strip the "autoscaling:" prefix, leaving e.g. EC2_INSTANCE_LAUNCH
    event = str(message['Event'])[12:]
    instance_id = str(message['EC2InstanceId'])
    # remove the message from the queue
    q.delete_message(rs[0])
    return [event, instance_id]

To put it all together now:

# main execution point
if __name__ == '__main__':
    a = get_message_from_queue('int-asg-api-sqs-queue')
    if a:
        if a[0] == "EC2_INSTANCE_TERMINATE":
            print 'scaling down'
            # a[1] contains the instance ID
            line_num = find_line(config_file, 'api-' + a[1])
            # find_line returns 1-based numbers, replace_line takes a 0-based index
            replace_line(config_file, line_num[0] - 1, "")
            print line_num
        else:
            print 'scaling up'
            ec2_ip = get_ec2_private_ip(a[1])
            # find the line number of the last backend definition
            line_num = find_line(config_file, 'server (?=api)')[-1]
            # append the new backend entry after that line
            edit_file(config_file, line_num, "  server api-" + a[1] + " " + ec2_ip + ":80 check inter 3s\n")
    else:
        print "Queue empty"
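The 'server (?=api)' pattern is worth a second look: the lookahead matches "server " only when it is immediately followed by "api", so unrelated server lines are skipped, and taking [-1] picks the last matching backend line. A quick check against some illustrative config lines:

```python
import re

config_lines = [
    "backend api",
    "  server api-i-111 10.0.0.1:80 check inter 3s",
    "  server api-i-222 10.0.0.2:80 check inter 3s",
    "  server web-i-333 10.0.0.3:80 check inter 3s",
]

# same logic as find_line: 1-based numbers of lines matching the pattern
matches = [num for num, line in enumerate(config_lines, 1)
           if re.search(r"server (?=api)", line)]
last_backend = matches[-1]  # line to insert the new server entry after
```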

 

When executed, the script will connect to the SQS queue, poll for new messages, and then, depending on the message, either add or remove a server from the backend configuration. Of course this assumes you’d run it from the same machine that runs haproxy itself; otherwise you’d need to modify it a bit by adding remote execution, possibly using the fabric module.

The reasons why you might want to put such a solution in place were mentioned at the beginning, and honestly I think the only truly valid one was the excuse to play around with AWS and Python. This may be outside your normal interests as a WebOps person, but I hope you’ll enjoy cooking up Python code and learning something new about AWS besides EC2, ELB, and RDS ;-)

On a side note, I’d strongly suggest using some sort of service discovery mechanism, like Consul, to tie your load balancers to your auto scaling groups – but that’s a topic for another post.