Successful migration from on-prem to AWS EKS with Auto Scaling

Introduction

We recently helped a Nasdaq-listed company move from their on-prem 42U colocation setup to AWS, significantly reducing their maintenance burden.

The customer's main goals going into the project were these:

  • We do not want to do operations; if we really have to, then as little as possible
  • We want a smoother deployment process
  • More dimensions of cost control
  • Easy audit control of who does what

First project meeting

Before our first project meeting, the customer was set on using Ansible against EC2 instances based on Ubuntu.

In the first meeting, after discussing their goals for an hour or two, we realised that this was not what they actually wanted. Why?

First off, they want to be as hands-off as possible, so why would they want to be spawning EC2 instances with Ansible?

At this point we asked if they had used Docker containers before, and while the concept wasn't new to them, they had no real experience with it.
So, after a short break, we demoed a Kubernetes cluster in our lab, showing how you create a Deployment in Kubernetes and how you set up virtual hosts/Ingresses and so on.
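To give a flavour of what such a demo covers, here is a minimal sketch of creating a Deployment with the official Kubernetes Python client. The names, image, replica count and namespace are illustrative assumptions, not what was actually shown.

    from kubernetes import client, config

    # Load credentials from the local kubeconfig (for EKS this is the file
    # generated later by `aws eks update-kubeconfig`).
    config.load_kube_config()

    apps = client.AppsV1Api()

    # A small nginx Deployment; names, labels and image are placeholders.
    deployment = client.V1Deployment(
        metadata=client.V1ObjectMeta(name="demo-web"),
        spec=client.V1DeploymentSpec(
            replicas=2,
            selector=client.V1LabelSelector(match_labels={"app": "demo-web"}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": "demo-web"}),
                spec=client.V1PodSpec(containers=[
                    client.V1Container(
                        name="web",
                        image="nginx:1.25",
                        ports=[client.V1ContainerPort(container_port=80)],
                    )
                ]),
            ),
        ),
    )
    apps.create_namespaced_deployment(namespace="default", body=deployment)

An Ingress object on top of a Service then gives you the virtual-host style routing mentioned above.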

Five minutes later the previous plan was entirely erased, and we were now setting up new goals around running an AWS EKS cluster.

This goes to show that while you might know one good way forward, there are probably five other paths to take, and you won't know which one fits you best until you have walked them all. This is where the experience of an AWS Partner comes in!

Setting up a PoC

Having discussed the goals and drafted an action plan for a PoC over a few more meetings, work now started to take place.

The action plan looked as follows:

  • Use SSO for developer and operations personnel logins, backed by their central user system
  • Heavily utilize AWS Organizations to divide costs between subsidiaries and restrict subaccounts to their allowed Regions and set of services
  • Trigger Lambda functions from CloudWatch that post to a Slack channel when someone performs a potentially sensitive action or when autoscaling events happen (see the sketch after this list)
  • Set up an AWS EKS cluster with autoscaling EC2 worker nodes
  • Set up Jenkins with Groovy pipelines that store container images in AWS ECR, together with automated deploys to EKS through Helm charts
  • Spawn relevant backend services such as AWS RDS Aurora MySQL and AWS ElastiCache
  • Set up monitoring based on a mix of CloudWatch, Systems Manager and Grafana with Prometheus
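As a rough illustration of the Slack notifications mentioned above, here is a minimal sketch of a Python Lambda handler that forwards an incoming CloudWatch event to a Slack incoming webhook. The SLACK_WEBHOOK_URL environment variable and the exact event fields are assumptions for illustration, not the customer's actual configuration.

    import json
    import os
    import urllib.request

    # Hypothetical incoming-webhook URL, stored as a Lambda environment variable.
    SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]

    def handler(event, context):
        # Summarise the CloudWatch event (e.g. an autoscaling launch or a
        # sensitive CloudTrail action) into a single Slack message.
        text = "AWS event: {} in {}: {}".format(
            event.get("detail-type", "unknown"),
            event.get("region", "unknown"),
            json.dumps(event.get("detail", {}))[:500],
        )
        payload = json.dumps({"text": text}).encode("utf-8")
        request = urllib.request.Request(
            SLACK_WEBHOOK_URL,
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request) as response:
            return {"status": response.status}

Wiring this up is then a matter of a CloudWatch Events rule with the Lambda function as its target.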

A few bumps

While most things were quite straightforward, there was one bump in the road in particular that I would like to mention.

While SSO is great in many situations, this customer had their user source in G-Suite, and connecting that through AWS IAM into the Kubernetes API did have its minor challenges.

First off, this wouldn’t have been possible (or at least would have been substantially more complicated) if aws-google-auth hadn’t been invented. This beauty gives you an SSO token for the AWS CLI, so you can use the CLI tool to access AWS APIs with your Google account from G-Suite.

A nifty thing is also the AWS CLI’s aws eks update-kubeconfig, which auto-generates a kubeconfig for you to use; very helpful if you’re going down this path yourself.
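If you want a quick sanity check that the Google-SSO credentials reach both AWS and the cluster, a small script along these lines works. It assumes aws-google-auth has already written temporary credentials to your AWS profile and that aws eks update-kubeconfig has generated the kubeconfig entry; the printed labels are just for illustration.

    import boto3
    from kubernetes import client, config

    # Assumes aws-google-auth has written temporary credentials to the default
    # AWS profile, and `aws eks update-kubeconfig --name <cluster>` has added
    # the cluster to the local kubeconfig.
    sts = boto3.client("sts")
    print("Authenticated to AWS as:", sts.get_caller_identity()["Arn"])

    # The kubeconfig entry generated for EKS reuses the same AWS credentials
    # to obtain a token for the Kubernetes API.
    config.load_kube_config()
    for node in client.CoreV1Api().list_node().items:
        print("Worker node visible via the Google identity:", node.metadata.name)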

The finished environment

The PoC is all in place and the environment has been scaled up a bit.

A few of their real applications have already been migrated and have been running smoothly for a while now.
Deploying is just a commit away: the change is deployed to test within a minute, and is only an approval away from stage and production.

Benefits so far for the customer:

  • Total cost control: all projects live in separate subaccounts, which show up as separate items on the invoice with consolidated billing
  • Run what you need, when you need it: their dev environment, including Jenkins, is turned off after hours and over weekends and started again in the mornings (see the sketch at the end of this post)
  • Cheap baseline usage: the base load of EKS worker nodes runs on reserved instance pricing, with autoscaled on-demand instances for peak loads
  • Better audit control: sensitive actions and autoscaling events alert operations personnel in their Slack channel through Lambda functions
  • But the benefit they are happiest about themselves: we do not need to do operations or replace broken hardware anymore, we just check the graphs to see that we have the right amount of capacity
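As a rough sketch of how an after-hours shutdown like that can be wired up, here is a scheduled Lambda that stops EC2 instances carrying a given tag. The tag key and value are assumptions for illustration, and the morning start-up job is simply the mirror image using start_instances.

    import boto3

    ec2 = boto3.client("ec2")

    def handler(event, context):
        # Find running instances tagged as part of the dev environment.
        # The tag key/value here are purely illustrative.
        reservations = ec2.describe_instances(
            Filters=[
                {"Name": "tag:environment", "Values": ["dev"]},
                {"Name": "instance-state-name", "Values": ["running"]},
            ]
        )["Reservations"]
        instance_ids = [
            instance["InstanceId"]
            for reservation in reservations
            for instance in reservation["Instances"]
        ]
        if instance_ids:
            # Triggered by a scheduled CloudWatch Events rule after office
            # hours; a mirror function calls start_instances in the morning.
            ec2.stop_instances(InstanceIds=instance_ids)
        return {"stopped": instance_ids}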