Managing Volatile Terraform Deployments with AWS CodeBuild and Step Functions



I was searching for an AWS-native solution that would deploy resources via Terraform, let them do some work, and then destroy them properly without leaving any orphaned resources. In addition, several of these Terraform deployments needed to be live at the same time.

The solution I found combines AWS CodeBuild, AWS Step Functions, a Terraform state managed in S3, and Terraform Workspaces.

Use Case

The solution can easily be applied to any problem that requires short-term resources to be created and destroyed based on your own logic. It is similar to AWS Auto Scaling; however, in my particular use case Auto Scaling wasn't applicable, as I needed more control over scaling events and wanted to add further automation steps to the workflow (e.g. setup automation that doesn't fit into instance user data). Additionally, Auto Scaling doesn't automate the creation of attached resources such as DNS records.

I also like the modularity of the approach. I am using it to build an autoscaling solution for SAP application servers: the servers should be booted on demand and configured dynamically. However, the automation logic doesn't really care about the actual resources that are deployed, so you can port this solution to other types of use cases.

High Level Process Diagram

Solution

I am using AWS Step Functions to orchestrate the deployment and destruction of the Terraform code. To allow multiple parallel deployments, a new Terraform workspace is created for each workflow execution. The state is stored centrally in S3. Deployment is done with AWS CodeBuild, which pulls the code from AWS CodeCommit.
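
The central state configuration is a standard S3 backend. A minimal sketch of what this could look like is shown below; bucket, key, and region are placeholder values, not taken from the actual setup:

terraform {
  backend "s3" {
    bucket               = "my-terraform-state-bucket"  # placeholder bucket name
    key                  = "volatile/terraform.tfstate" # placeholder state key
    region               = "eu-central-1"               # placeholder region
    workspace_key_prefix = "deployments"                # each workspace stores its state under this prefix
  }
}

With this backend, every Terraform workspace automatically gets its own state object in the bucket, which is what makes the parallel deployments possible.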

Components

Step Functions

AWS Step Functions provides a simple, low-code way to orchestrate complex workflows that use AWS services as individual steps. Almost every AWS API can be called directly as a step.

The following graph shows an example workflow that triggers a CodeBuild run and then waits for it to finish. If successful, it goes on to some additional configuration steps and would then start the actual job it is built for. If the deployment fails, it immediately triggers the destruction to ensure proper clean-up. Later on, I want to add logic that triggers the destruction after successful completion of the business function.

Step Functions have a maximum runtime of one year. Therefore, the solution is only applicable to resources with a lifetime shorter than a year. I like to use the term volatile to describe those.

Building the workflow isn't always straightforward. As the graphic shows, some loops and decision points are necessary to wait for the CodeBuild run to finish. Likewise, properly passing data between steps involves a learning curve, though it isn't a steep one.

Step Functions Graph
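
To make this more concrete, here is a minimal, hypothetical sketch of how the apply part of such a state machine could be defined in Terraform. The project name, buildspec path, IAM role, and state names are placeholders, and the real workflow contains more steps:

resource "aws_sfn_state_machine" "volatile_deployment" {
  name     = "volatile-terraform-deployment"   # placeholder name
  role_arn = aws_iam_role.step_functions.arn   # assumed IAM role resource

  definition = jsonencode({
    StartAt = "StartTerraformApply"
    States = {

      # Kick off the CodeBuild run that executes terraform apply
      StartTerraformApply = {
        Type     = "Task"
        Resource = "arn:aws:states:::codebuild:startBuild"
        Parameters = {
          ProjectName       = "terraform-deployer"    # placeholder CodeBuild project
          BuildspecOverride = "buildspec-apply.yaml"  # placeholder path to the deployment buildspec
          EnvironmentVariablesOverride = [{
            Name      = "WORKSPACE"
            Type      = "PLAINTEXT"
            "Value.$" = "$$.Execution.StartTime"      # execution start time used as workspace ID
          }]
        }
        ResultPath = "$.ApplyBuild"
        Next       = "WaitForBuild"
      }

      # Loop until the build has finished
      WaitForBuild = { Type = "Wait", Seconds = 30, Next = "GetBuildStatus" }

      GetBuildStatus = {
        Type       = "Task"
        Resource   = "arn:aws:states:::aws-sdk:codebuild:batchGetBuilds"
        Parameters = { "Ids.$" = "States.Array($.ApplyBuild.Build.Id)" }
        ResultPath = "$.BuildInfo"
        Next       = "BuildFinished"
      }

      BuildFinished = {
        Type = "Choice"
        Choices = [
          { Variable = "$.BuildInfo.Builds[0].BuildStatus", StringEquals = "IN_PROGRESS", Next = "WaitForBuild" },
          { Variable = "$.BuildInfo.Builds[0].BuildStatus", StringEquals = "SUCCEEDED", Next = "ConfigureResources" }
        ]
        Default = "StartTerraformDestroy"  # any failure goes straight to clean-up
      }

      # Placeholder for the additional configuration and business steps
      ConfigureResources = { Type = "Pass", Next = "StartTerraformDestroy" }

      # In the real workflow, a second startBuild task with the destruction buildspec follows here
      StartTerraformDestroy = { Type = "Pass", End = true }
    }
  })
}

If the explicit polling loop is not wanted, the managed codebuild:startBuild.sync integration can start a build and pause the workflow until it completes.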

AWS CodeBuild and AWS CodeCommit

Most of the actual work happens in AWS CodeBuild. It provides an engine that boots up a container and then performs certain predefined tasks. As the name suggests, it is normally used to build software artifacts from code; however, it is flexible enough to run the terraform apply and terraform destroy commands.

CodeBuild always operates on a Git repository and looks for a buildspec.yaml file. That file contains the instructions for either deployment or destruction and allows flexible parameter passing via Linux environment variables. Which buildspec file to use can be specified when starting the CodeBuild run.

Deployment

version: 0.2

env:
  variables:
    InstanceId: ""
  exported-variables:
    - terraform_output

phases:
  install:
    commands:
      - yum install -y yum-utils
      - yum-config-manager --add-repo https://rpm.releases.hashicorp.com/AmazonLinux/hashicorp.repo
      - yum -y install terraform
      
  pre_build:
    commands:
      - terraform init
      - terraform workspace select $WORKSPACE || terraform workspace new $WORKSPACE

  build:
    commands:
      - echo "using terraform workspace $WORKSPACE"
      - terraform apply -var-file="variables.tfvars" -auto-approve

  post_build:
    commands:
      - echo "terraform apply completed on workspace $WORKSPACE"
      - terraform_output=$(terraform output -json CodeBuildOutput)

Destruction

version: 0.2

phases:
  install:
    commands:
      - yum install -y yum-utils
      - yum-config-manager --add-repo https://rpm.releases.hashicorp.com/AmazonLinux/hashicorp.repo
      - yum -y install terraform
      
  pre_build:
    commands:
      - terraform init
      - terraform workspace select $WORKSPACE

  build:
    commands:
      - terraform destroy -var-file="variables.tfvars" -auto-approve

  post_build:
    commands:
      - terraform workspace select default
      - terraform workspace delete $WORKSPACE 

Terraform Module

The Terraform module that you plan to deploy should first be built and tested locally. Only after you have ensured that the logic works properly should you automate its deployment, as debugging is much easier locally. However, there are some things to consider when building the module.

Resource Naming

As multiple instances of the same resources will be created, the module needs to be able to handle this. It needs to use flexible resource names wherever resources require a unique name. One option for this is to use a timestamp value.
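
A minimal, hypothetical sketch of such dynamic naming (resource and variable names are made up for illustration):

locals {
  # Timestamp-based suffix that makes resource names unique per deployment
  name_suffix = formatdate("YYYYMMDDhhmmss", timestamp())
}

resource "aws_instance" "app_server" {
  ami           = var.ami_id         # hypothetical variable
  instance_type = var.instance_type  # hypothetical variable

  tags = {
    Name = "sap-app-${local.name_suffix}"  # unique name per deployment
  }
}

Keep in mind that timestamp() is re-evaluated on every run; the workspace name described below is a more stable alternative for the same purpose.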

Also, pay attention to resource dependencies. Certain mappings can only be made one-to-one; for instance, an Elastic IP address can only be attached to one network interface or instance at a time. If not handled correctly, such dependencies can lead to deployment errors that are difficult to debug.
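
One way to avoid such conflicts is to give each deployment its own copy of the dependent resource, as in this hypothetical sketch:

# Each deployment allocates its own Elastic IP instead of sharing one
resource "aws_eip" "app_server" {
  domain = "vpc"  # "vpc = true" on older AWS provider versions
}

resource "aws_eip_association" "app_server" {
  instance_id   = aws_instance.app_server.id  # instance from the naming example above
  allocation_id = aws_eip.app_server.id
}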

Terraform Workspaces

Terraform Workspaces allow multiple independent deployments of the same module, each with its own state, against the same backend. A workspace is identified by a simple string that can take any alphanumeric value, and the workspace name can be accessed from within the code to allow dynamic naming and workspace-specific logic.

The solution uses workspaces to identify individual deployments. This is needed because the destruction run wouldn't know which state to destroy unless it is given an identifier. As the workspace ID, we use the StartTime timestamp generated by AWS Step Functions, as it is consistent throughout the workflow.
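
Inside the module, the workspace name is available via terraform.workspace. The hypothetical snippet below uses it for workspace-specific values and feeds the CodeBuildOutput output that the deployment buildspec reads with terraform output -json:

locals {
  # The workspace name (the Step Functions StartTime) identifies this deployment
  deployment_id = terraform.workspace
}

# Consumed by the deployment buildspec via: terraform output -json CodeBuildOutput
output "CodeBuildOutput" {
  value = {
    workspace   = local.deployment_id
    instance_id = aws_instance.app_server.id  # hypothetical instance from the naming example
  }
}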

Conclusion

This was the first time that I used either of these services, and it was well worth learning them. Step Functions in particular offers immense potential for easy and straightforward automation. I will definitely use it to automate the dynamic start and stop of SAP application servers and will keep you posted with some blog posts about this endeavour.
