Complex AWS Infrastructures in Minutes: Orchestrating Across Multiple Accounts

This article discusses using Python multi-threading to quickly orchestrate CloudFormation across multiple AWS accounts.

by Don Mills

It is often advantageous when designing AWS infrastructures to separate environments into discrete AWS accounts. These can act as natural boundaries for things like cost attribution or IAM permissions. For example, if you have a development team and a production team, you can create a separate account for each team, allowing them full access inside their own account but limited or no access in the other team's environment - without needing strict tagging controls to differentiate resources. You can then report on billing per account to quickly see what each team's resources cost.

But as an organization, it's also good to be able to stand up all your resources (especially foundational items like VPCs) at one time via CloudFormation. There's always the chance of a REALLY bad day where all your core infrastructure gets deleted or destroyed. You might also want to set up similar designs in multiple regions, and to have all that configuration code ready to stand up a new infrastructure in a new region at any time.

And maybe you want to do more advanced configurations, like peering VPCs in different accounts together and adding all the routes required to move traffic across the peering connections. This can be difficult to automate with CloudFormation alone, because if you stand up a VPC in one account, you need its VPC ID to feed into the peering request for the other account's VPC...

So that's a lot of hypotheticals in a short time. Let's focus on a real-world example to explain what I'm talking about.

The Problem

Last year SingleStone worked with a client that had some stringent security requirements. They wanted to use third-party firewalls and load balancers to control access to and from the internet for their entire environment. They wanted distinct user permissions that allowed teams lots of freedom in certain dedicated parts of the environment, but none in others. And they needed easy cost attribution to show spend for each of the separate teams.

What we designed for them was a multi-account infrastructure with consolidated billing. They had a Master account for billing, which controlled a "shared services" VPC that contained all the third-party firewalls and load balancers. Under that was a separate sub-account for each environment (one for dev, one for test, one for staging, and so on) that used cross-account VPC peering to connect to the Master VPC. That way each team could operate freely in their own area, but a central infrastructure team could control security policy at the perimeter of the environment.

The customer also wanted to build this environment twice, in two different geographic regions - with the possibility of a third region in the upcoming months (and maybe more to follow).

 

As we developed the CloudFormation to stand all this up, it quickly became apparent that it was impossible to automate entirely via pure CloudFormation. The best we could hope for was a manual process: "run this CloudFormation template in this account first, then write down these outputs, then switch to this account and enter those outputs as parameters for the next CloudFormation stack, then go back to the original account, accept the peering, and add the routes." And because of the multiple regions, we'd either have to hard-code the IP address ranges of every peered VPC in each template via a mapping (every range, per region) and do a region lookup, or have the customer manually enter them as parameters every time, for each stack in each account.

Not a very elegant solution.

A Light Bulb Shines

So we sat around thinking about how to automate this so that the customer could repeat this process as many times as they wanted with minimal or no intervention and get the correct result every time. And there was an epiphany!

We could write a script that the customer could run with their IAM user credentials in the Master account. This script could then run all the CloudFormation templates for the Master account, then take the outputs and feed them into the next account's CloudFormation - and so on until the entire environment was complete.

As we mapped out the flow, it would follow these steps:

  1. Use the IAM privileges of the user running the script to access the Master account
  2. Run the CloudFormation for the Master account
  3. Save the outputs in variables
  4. Switch to the next account
  5. Run the CloudFormation there, using the variables from (3) as parameters. Save the outputs as more variables.
  6. Set up the peering connection on that side. Save the peering ID.
  7. Set up the routes to the Master account IP address ranges, pointing to the peering connection
  8. Repeat this for all the other sub-accounts
  9. Switch back to the Master account
  10. Accept all the peering connections that were established, using the peering IDs saved in (6).
  11. Add the routes to all the sub-account IP address ranges, using the outputs from (5).
  12. Finished!

It's Been a Few Months - A Quick Digression

Now I am sure some people are looking at the above and are having one of two thoughts at this point:

a) Well you could have just used CloudFormation to accept the peering in the Master account as per the documentation here: http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/peer-with-vpc-in-another-account.html

True - if I were doing this today. But at the time, this capability did not exist in CloudFormation. Also, unless I hard-coded IP address ranges for peered VPCs in every account's template as discussed earlier (every range, per region) or forced the customer to enter them as parameters, I could not dynamically create the routes from one VPC to another (as I wouldn't know what ranges to point the routes at).

b) Well you could have used Terraform, which has the capability to assume roles.

Also true - if I were doing this today. Terraform did not have this capability at the time, and I would have needed multiple AWS providers configured (one for each account). That would also have required an IAM user in each account for this purpose. And of course Terraform has its own challenges around maintaining and sharing its state file.

Ok, back to it.

Let's Do Some Coding!

If it helps you to follow along, the entire completed source for this project is located here: https://github.com/DonMills/multiacct-CF-orchestrate/blob/master/multiacctcf.py

So we asked the customer whether they preferred Python or Ruby as a scripting language. Python was the answer (which was fine for me, because I love some Python).

We began to hack out a prototype that could do the steps listed above. First it had to be able to run CloudFormation - and it seemed like a good idea to validate the template before running it.

import boto3
import botocore.exceptions

try:
    cf = boto3.client('cloudformation', region_name=region)
    # Check the template before trying to run it
    validate = cf.validate_template(
        TemplateURL=cffile
    )
    cfgood = True
    print(
        "CloudFormation file %s validated successfully for account %s" %
        (cffile, acctname))
except botocore.exceptions.ClientError as e:
    print(
        "CloudFormation file %s validation failed for account %s with error: %s" %
        (cffile, acctname, e))
    cfgood = False

And if the template passes validation, we run the CloudFormation:

if cfgood:
    # "lock" is a shared threading.Lock used to keep print output tidy
    with lock:
        print(
            "Preparing to run CloudFormation file %s in account %s" %
            (cffile, acctname))
    # Kick off the stack and grab its ID; this template takes no parameters
    stackid = cf.create_stack(
        StackName=region + "-" + acctname,
        TemplateURL=cffile,
        Parameters=[],
        Tags=[
            {
                'Key': 'Purpose',
                'Value': 'Infrastructure'
            },
        ]
    )['StackId']
    with lock:
        print("StackID %s is running in account %s" %
              (stackid, acctname))

Then let's wait until it's done and let the user know the stack is finished:

waiter = cf.get_waiter('stack_create_complete')
waiter.wait(StackName=stackid)
with lock:
    print(
        "StackID %s completed creation in account %s" %
        (stackid, acctname))

Finally, let's grab the stack outputs:

stack = cf.describe_stacks(StackName=stackid)
for item in stack['Stacks'][0]['Outputs']:
    if item['OutputKey'] == "VPCId":
        vpcid = item["OutputValue"]
    elif item['OutputKey'] == "VPCCIDRBlock":
        cidrblock = item["OutputValue"]
    elif item['OutputKey'] == "RouteTableId":
        rtbid = item["OutputValue"]
    elif item['OutputKey'] == "InternalRouteTableA":
        rtbid_inta = item["OutputValue"]
    elif item['OutputKey'] == "InternalRouteTableB":
        rtbid_intb = item["OutputValue"]

Sweet!

So now we can do this in one account...what about the others?

Security Token Service is Awesome

The AWS Security Token Service (STS) is the service used to do things like "AssumeRole" and obtain temporary credentials. We want to use the Master account credentials of whoever is running the script, but have them "AssumeRole" into the sub-accounts to do things there.

So first we create a role in each sub-account (preferably with the same name in every account). We then give the user or group in the Master account that will be running this script permission to assume those roles.
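
As a rough illustration, here's what creating that role in a sub-account might look like with boto3. This is a minimal sketch, assuming a role name of MasterAcctRole (matching the AssumeRole call later in this post) and a placeholder Master account ID; the attached PowerUserAccess policy is just an example and should be scoped down for real use.

import json
import boto3

MASTER_ACCT = '111111111111'  # placeholder Master account ID

# Trust policy that lets principals in the Master account assume this role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::" + MASTER_ACCT + ":root"},
        "Action": "sts:AssumeRole"
    }]
}

# Run this with credentials for the sub-account
iam = boto3.client('iam')
iam.create_role(
    RoleName='MasterAcctRole',
    AssumeRolePolicyDocument=json.dumps(trust_policy)
)
# Example permissions only - scope down to what the CloudFormation run needs
iam.attach_role_policy(
    RoleName='MasterAcctRole',
    PolicyArn='arn:aws:iam::aws:policy/PowerUserAccess'
)

The Master account side then needs an IAM policy on the script's user or group allowing sts:AssumeRole on each of those role ARNs.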

We also create an S3 bucket in the Master account, and give the roles we just created permissions to that bucket via a bucket policy. That's where we'll keep all the CloudFormation templates.
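
Here's a hedged sketch of what that bucket policy might look like, applied with boto3 (the bucket name and account IDs are placeholders):

import json
import boto3

bucket = 'my-cf-templates-bucket'  # placeholder bucket name
role_arns = ['arn:aws:iam::222222222222:role/MasterAcctRole',
             'arn:aws:iam::333333333333:role/MasterAcctRole']  # one per sub-account

# Allow the sub-account roles to read the CloudFormation templates
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": role_arns},
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::" + bucket + "/*"
    }]
}

s3 = boto3.client('s3')
s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(bucket_policy))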

We store the account numbers (used in the "AssumeRole" call) and the location of each account's CloudFormation template in a list of Python dictionaries ["awsaccts"].

awsaccts = [{'acct': 'acct1ID',
             'name': 'master',
             'cffile': 'location of cloudformation file in S3'},
            {'acct': 'acct2ID',
             'name': 'dev',
             'cffile': 'location of cloudformation file in S3'},
            {'acct': 'acct3ID',
             'name': 'staging',
             'cffile': 'location of cloudformation file in S3'},
            {'acct': 'acct4ID',
             'name': 'test',
             'cffile': 'location of cloudformation file in S3'},
            {'acct': 'acct5ID',
             'name': 'QA',
             'cffile': 'location of cloudformation file in S3'}]

Then we'll use that info to make an STS call for an account to assume the role.

sts = boto3.client('sts')
role = sts.assume_role(
    RoleArn='arn:aws:iam::' + acct + ':role/MasterAcctRole',
    RoleSessionName='STSTest',
    DurationSeconds=900
)
accesskey = role["Credentials"]["AccessKeyId"]
secretkey = role["Credentials"]["SecretAccessKey"]
sessiontoken = role["Credentials"]["SessionToken"]
print(
    "successfully assumed STS role for account %s" %
    (acctname))

Note we've set the duration of the "AssumeRole" to 900 seconds (the minimum). If that's not long enough for the CloudFormation to finish, we'll have to increase it.

We can then take those temporary credentials and feed them into our CloudFormation calls. Here's our validate template part using temporary credentials:

cf = boto3.client('cloudformation',
                  aws_access_key_id=accesskey,
                  aws_secret_access_key=secretkey,
                  aws_session_token=sessiontoken,
                  region_name=region)
validate = cf.validate_template(
    TemplateURL=cffile
)
cfgood = True

So now we've got CloudFormation working across multiple accounts! But hey, I just had a thought...

Rocket Sauce

Our script currently runs in a linear order. We go into the Master account, do things, then go into the next account, do things, then into the next, and so on. It takes as long to run as the aggregate time of doing each of those actions.

But what if we could do all of it AT THE SAME TIME? Wouldn't that make it go so much faster? Then the whole run would only take as long as the slowest set of actions.

Well in the programming world there is a way to do multiple things at the same time. It's called multi-threading. Can we get some of that going in our script?

We want the Master account to run by itself first - that way we'll know the VPC ID to feed into the sub-accounts for peering. Then we want to run all the sub-accounts simultaneously, and wait until they are all finished before moving on.

And because we're going to be going so much faster, let's put some time calls in here so we can see how long everything is taking:

import threading
from time import ctime

threads = []   # holds the thread objects so we can join() them later
results = {}   # shared dictionary each run_cloudform call writes its outputs into

print(
    "[%s] Preparing to run CloudFormation across all AWS accounts" %
    ctime())
print("[%s] Preparing to run Master account CloudFormation" % ctime())
masteracct = list(
    (entry for entry in awsaccts if entry['name'] == 'master'))[0]
run_cloudform(
    masteracct['acct'],
    masteracct['name'],
    region,
    masteracct['cffile'],
    nopeer,
    results)
printdatamaster(results)
print("[%s] Preparing to spawn threads" % ctime())
subaccts = (entry for entry in awsaccts if entry['name'] != 'master')
##############################
# do the threading for subaccts
##############################
for entry in subaccts:
    t = threading.Thread(
        target=run_cloudform,
        args=(
            entry['acct'],
            entry['name'],
            region,
            entry['cffile'],
            nopeer,
            results))
    threads.append(t)
    t.start()
# Wait until every sub-account thread is finished
for t in threads:
    t.join()
print("[%s] All CloudFormations run!" % ctime())

So a few things to say here.

a) First, this blog post is not the place for a deep dive into Python multi-threading. But it works great in this case because the work is I/O-bound: each thread spends almost all of its time waiting on AWS API calls, so we never hit the infamous Python GIL issue. You just send the CloudFormation call out and check for when it's complete - which is very different from multiple threads all trying to crunch numbers at the same time. (There's a small illustration of this after these notes.)

b) You can see in the block of code under the comment that we loop over the sub-accounts ["subaccts"]. For each entry we create a thread that performs the "run_cloudform" function, and we feed in the specific arguments for that entry.

We have another list that holds all the thread objects we create ["threads"]. We loop through that list and call join() on each thread - which holds up our script until all the threads have completed their jobs.

And finally there's a Python dictionary ["results"] that stores all the outputs from everything. We'll refer to that in the next section.

c) Please take a moment to check out the generator expressions, like (entry for entry in awsaccts if entry['name'] != 'master'). I had originally written these with some (I thought) awesome functional programming filter() statements, but they didn't behave the same way in Python 3 (where filter() returns an iterator instead of a list) - and I wanted this script to run in every version of Python imaginable.
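
To illustrate point (a), here's a tiny self-contained sketch of why I/O-bound work threads well in Python, using sleep() as a stand-in for a slow AWS API call:

import threading
import time

def fake_api_call(name):
    # time.sleep releases the GIL while blocked, just like a network call
    time.sleep(2)
    print("%s finished" % name)

start = time.time()
threads = [threading.Thread(target=fake_api_call, args=("acct%d" % i,))
           for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Five simulated 2-second calls finish in roughly 2 seconds, not 10
print("elapsed: %.1f seconds" % (time.time() - start))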

VPC Peering

And now that we've gotten this far, let's attack the VPC peering and routing portion.

While we could feed the Master VPC's ID into each sub-account's CloudFormation as a parameter and make the peering request there, we wanted to give the user of this script the option to do the peering or skip it. So we made that a command line option (peer or don't peer). When peering is enabled, the script talks directly to the EC2 API to accomplish both sides of the peering.
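
For context, the command line handling might look something like this minimal argparse sketch; the flag names here are illustrative, not necessarily the ones the actual script uses:

import argparse

parser = argparse.ArgumentParser(
    description='Orchestrate CloudFormation across multiple AWS accounts')
parser.add_argument('--region', default='us-east-1',
                    help='AWS region to build in')
parser.add_argument('--nopeer', action='store_true',
                    help='skip VPC peering and route creation')
args = parser.parse_args()

region = args.region    # fed into every boto3 client
nopeer = args.nopeer    # passed to run_cloudform for each account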

So for a subaccount to make the peering request:

try:
    ec2 = boto3.client('ec2',
                       aws_access_key_id=accesskey,
                       aws_secret_access_key=secretkey,
                       aws_session_token=sessiontoken,
                       region_name=region)
    # Request peering from this sub-account's VPC to the Master VPC
    pcx = ec2.create_vpc_peering_connection(
        VpcId=vpcid,
        PeerVpcId=results['master']['VPCID'],
        PeerOwnerId='masteracctID'
    )
    pcxid = pcx['VpcPeeringConnection']['VpcPeeringConnectionId']
except botocore.exceptions.ClientError as e:
    print("VPC peering request failed for account %s with error: %s" %
          (acctname, e))

There's our ["results"] dictionary, where we are getting the vpcid of the Master VPC for the peering connection.

And then we add the routes into the subaccount:

route = ec2.create_route(
    DestinationCidrBlock=results['master']['CIDRblock'],
    VpcPeeringConnectionId=pcxid,
    RouteTableId=rtbid
)

Finally, we save all those outputs into our ["results"] dictionary:

results[acctname] = {
    "CIDRblock": cidrblock,
    "VPCID": vpcid,
    "PCXID": pcxid}

We do this for each of the subaccounts. Then at the end of the script, we swing back into the Master account and accept all the peerings, and add the corresponding routes there.

master = boto3.client('ec2',
                      region_name=region)
subaccts = (entry for entry in results if entry != "master")
for entry in subaccts:
    # Accept the peering request each sub-account created earlier
    pcx = master.accept_vpc_peering_connection(
        VpcPeeringConnectionId=results[entry]['PCXID']
    )
    print(
        "[%s] VPC Peering connection from %s with ID %s is status: %s" %
        (ctime(),
         entry,
         results[entry]['PCXID'],
         pcx['VpcPeeringConnection']['Status']['Code']))
    # Point the Master's internal route tables at each sub-account's CIDR
    for table in results['master']['RTBint']:
        route = master.create_route(
            DestinationCidrBlock=results[entry]['CIDRblock'],
            VpcPeeringConnectionId=results[entry]['PCXID'],
            RouteTableId=table
        )

Finishing Up

So that's the gist of it. We demonstrated this to the customer and in a live run it was able to build all the VPCs, peer them up, and add the routing in a little over 3 minutes.

I cleaned out the customer-specific information and put it up on GitHub here: https://github.com/DonMills/multiacct-CF-orchestrate/blob/master/multiacctcf.py

I don't expect this to work for you out of the box (you might have different outputs, etc.) - but I think it's a good foundation to build your own custom orchestration scripts on. If nothing else, it's a good primer on using the Python Boto3 library and some fundamental AWS services, such as STS.

Most importantly, I hope it gives inspiration - that as DevOps and Cloud engineers, we shouldn't be afraid to think outside the box and write our own tools that leverage APIs and techniques. It can only lead to better solutions, greater learning, and more value for the customer.

 

Don Mills
Cloud Security Architect