Low-Maintenance Kubernetes with EKS Auto Mode
Kubernetes is now a standard technology for high-availability clusters. This article steps you through an example project for setting up Kubernetes clusters on Amazon EKS with Infrastructure as Code. The EKS clusters use Auto Mode, which automates node scaling and updates, manages several components in the cluster, and simplifies cluster upgrades. The configuration is managed by Terraform and Flux.
The code for this project is published on both GitLab and GitHub.
Components of EKS Auto Mode #
EKS Auto Mode adds a number of components for maintenance and AWS integration to each cluster. AWS installs and updates these components on each cluster, so you can customise their configuration but do not need to carry out any work to use their features.
The most important components relate to the cluster nodes. EKS Auto Mode always uses Bottlerocket, a minimal Linux-based operating system that is specifically designed to be used for nodes. Karpenter reads the cluster configuration and automatically launches EC2 instances as needed. To ensure that security issues are resolved, Karpenter also automatically replaces older nodes. The node monitoring agent detects unhealthy nodes, so that they can be rebooted or replaced.
You can customise the behaviour of Karpenter by deploying configuration into the Kubernetes cluster.
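For example, you can add a custom Karpenter NodePool to restrict new nodes to Spot capacity and the arm64 architecture. This is a minimal sketch, assuming the built-in default NodeClass that EKS Auto Mode provides; the pool name is a placeholder:

```shell
kubectl apply -f - <<'EOF'
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-arm64               # placeholder name for the custom pool
spec:
  template:
    spec:
      nodeClassRef:
        group: eks.amazonaws.com # the NodeClass API that EKS Auto Mode provides
        kind: NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]
EOF
```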
EKS Auto Mode also provides components to integrate the clusters with AWS, such as the Application Load Balancer Controller and AWS VPC CNI. It includes CoreDNS for name resolution, but you must provide a method to register your cluster applications with DNS, such as ExternalDNS. EKS Auto Mode uses IAM roles to enable AWS access for identities in the Kubernetes cluster and supports the newer EKS Pod Identities, as well as IAM Roles for Service Accounts (IRSA).
This document explains the differences between IRSA and Pod Identities.
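With IRSA, you annotate a Kubernetes service account with the ARN of an IAM role that trusts the cluster's OIDC provider. A minimal sketch, with placeholder names and account ID:

```shell
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ServiceAccount
metadata:
  name: example-app            # placeholder service account name
  namespace: default
  annotations:
    # Placeholder role ARN; the role must trust the cluster's OIDC provider.
    eks.amazonaws.com/role-arn: arn:aws:iam::111111111111:role/example-app
EOF
```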
EKS Auto Mode does not provide observability components, so that you can install the logging and monitoring that is appropriate to your needs. For example, you can install Prometheus and Grafana on the cluster itself, or deploy the Datadog operator so that the cluster is monitored as part of a Datadog enterprise account, or use Amazon CloudWatch Observability to integrate the cluster with AWS monitoring services.
More About This Project #
The project uses a specific set of tools and patterns to set up and maintain your clusters. The main technologies are Terraform (TF) and Flux. The project also includes tasks for the Task runner. The tasks for TF are provided by my tooling for TF. Like this article, these tasks are opinionated, and are designed to minimise maintenance.
I refer to Terraform and OpenTofu as TF. The two tools work identically for the purposes of this article.
To make it a working example, the project deploys a Web application to each cluster. The podinfo application provides a Web interface and a REST API.
Design Decisions #
The general principles for this project are:
- Use one repository for all of the code
- Provide a configuration that can be quickly deployed with minimal changes. The code can be customised to add features or enhance security.
- Choose well-known and well-supported tools
- Support the deployment of multiple clusters for development and production
- Use AWS services wherever possible
- Use an Infrastructure as Code tool to manage the AWS resources that are needed to run each cluster
- Use automation on the cluster to control AWS resources that are used by the applications, so that there is a single point of control.
- Use GitOps to manage application configuration.
The combination of delegated control and GitOps means that the live configurations for applications are automatically synchronized with the copy in source control, and matching AWS resources are created and updated as needed.
The design principles lead to these specific technical choices:
- Integrate Kubernetes and AWS identities with the established IAM Roles for Service Accounts (IRSA) method, rather than the newer EKS Pod Identities. This document explains the differences.
- Use Amazon CloudWatch Observability for the clusters, as shown in the example command after this list. This automatically adds Fluent Bit for log capture.
- Use Flux to manage application configuration on the cluster
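In this project I expect the CloudWatch Observability add-on to be managed by the TF code, but the equivalent AWS CLI call looks like this (the cluster name is a placeholder):

```shell
# Install the CloudWatch Observability add-on on an existing cluster.
aws eks create-addon \
  --cluster-name your-eks-cluster-name \
  --addon-name amazon-cloudwatch-observability
```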
Out of Scope #
This article does not cover how to set up container registries or maintain container images. These will be specific to the applications that you run on your cluster.
This article also does not cover how to set up the requirements to run TF. You should always use remote TF state storage for live systems. By default, the example code uses S3 for remote state. I recommend that you store TF remote state outside of the cloud accounts that you use for working systems. When you use S3 for TF remote state, use a separate AWS account.
The TF tooling enables you to use local files for state instead of remote storage. Only use local state for testing. Local state means that the cloud resources can only be managed from a computer that has access to the state files.
Requirements #
Required Tools on Your Computer #
This project uses several command-line tools. You can install all of these tools on Linux or macOS with Homebrew.
The required command-line tools are:
- AWS CLI - brew install awscli
- Flux CLI - brew install flux
- Git - brew install git
- kubectl - brew install kubernetes-cli
- Task - brew install go-task
- Terraform - Use these installation instructions
Flux can use Helm to manage packages on your clusters, but you do not need to install the Helm command-line tool.
Version Control and Continuous Integration #
To automate operations, you need a Git repository that is available to your development workstation, the resources on your AWS accounts and your continuous integration system.
Flux updates the configuration of the cluster from this Git repository. This means that you do not need continuous integration to deploy changes. However, you should use continuous integration to test configurations before they are merged to the main branch of the repository and applied to the production cluster by Flux.
This example uses GitLab as the provider for Git hosting. GitLab also provides continuous integration services. You can use GitHub or other services for hosting and continuous integration instead of GitLab.
AWS Account Requirements #
You will require at least one AWS account to host an EKS cluster and other resources. I recommend that you store user accounts, backups and TF remote state in AWS accounts that are separate from the accounts that host the clusters.
You will need two IAM roles to deploy an EKS cluster with TF:
- An IAM role for Terraform
- An IAM role for human administrators
The example code defines a dev and prod configuration, so that you can have separate development and production clusters. These copies can be in the same or separate AWS accounts.
AWS Requirements for Each EKS Cluster #
EKS clusters have various network requirements. To avoid issues, each EKS cluster should have:
- A VPC
- Three subnets attached to the VPC, one per availability zone
- A DNS zone in Amazon Route 53
Each subnet should be a /24 or larger CIDR block. By default, every pod on a Kubernetes cluster uses an IP address from the VPC. This means that every node consumes up to four IP addresses for Elastic Network Interfaces, plus one IP address per pod that it hosts. For example, ten nodes that each run 20 pods could use roughly 10 × (4 + 20) = 240 addresses, which is close to the capacity of a /24 subnet.
Each subnet that will be used for load balancers must have tags that authorize the Kubernetes controller for AWS load balancers to use it. Subnets for public-facing Application Load Balancers must have the tag kubernetes.io/role/elb with a value of 1.
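If your network setup does not already apply these tags, you can add them with the AWS CLI. A sketch with a placeholder subnet ID:

```shell
# Tag a subnet so that the load balancer controller can place
# public-facing Application Load Balancers in it.
aws ec2 create-tags \
  --resources subnet-0123456789abcdef0 \
  --tags Key=kubernetes.io/role/elb,Value=1
```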
I recommend that you define a separate Route 53 zone for each cluster. Create these as child zones for a DNS domain that you own. This enables you to configure the ExternalDNS controller on a cluster to manage DNS records for applications on that cluster without enabling it to manage records on the parent DNS zone.
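For example, you might create a child zone with the AWS CLI and then delegate it from the parent zone. The domain name is a placeholder:

```shell
# Create a child zone for the dev cluster.
aws route53 create-hosted-zone \
  --name dev.example.com \
  --caller-reference "dev-cluster-$(date +%s)"
# Copy the NS records from the response into the parent zone (example.com)
# to delegate the child zone.
```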
1: Prepare Your Repository #
Clone or fork the example project to your own Git repository. To use the provided Flux configuration, use GitLab as the Git hosting provider. The example code for this project is published on both GitLab and GitHub.
Create a dev branch on the repository. The Flux configuration on development clusters will synchronize from this dev branch. The Flux configuration on production clusters will synchronize from the main branch.
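For example, after forking the project you might create and push the dev branch like this (the repository path is a placeholder):

```shell
git clone git@gitlab.com:your-group/your-fork.git
cd your-fork
git switch -c dev      # create the dev branch from main
git push -u origin dev
```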
2: Customise Configuration #
Next, change the configuration for your own infrastructure.
The relevant directories for configuration are:
- flux/apps/dev/ - Flux configuration for development clusters
- flux/apps/prod/ - Flux configuration for production clusters
- tf/contexts/dev/ - TF configuration for development clusters
- tf/contexts/prod/ - TF configuration for production clusters
Change each value that is marked as Required. In addition, specify the settings for the TF backend in the tf/contexts/context.json file for dev and prod.
The IAM principal that creates an EKS cluster is automatically granted membership of the system:masters group in that cluster. In our example code, this principal is the IAM role that TF uses. The TF code also enables administrator access on the cluster to the IAM role for human system administrators.
3: Set Credentials #
This process needs access to both AWS and your Git hosting provider.
To work with GitLab, set an access token as the environment variable GITLAB_TOKEN.
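For example (the token value is a placeholder; use a token with sufficient API access for your group or project):

```shell
export GITLAB_TOKEN="your-gitlab-access-token"
```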
If you are running the TF deployment from your own system, ensure that you have AWS credentials in your shell session:
eval $(aws configure export-credentials --format env --profile your-aws-profile)
If you want to use local TF state, you also need to set the environment variable TFT_REMOTE_BACKEND to false:
TFT_REMOTE_BACKEND=false
4: Deploy the Infrastructure with TF #
Run the tasks to initialise, plan and apply the TF code for each module. For example:
TFT_STACK=amc TFT_CONTEXT=dev task tft:init && task tft:plan && task tft:apply
Apply the modules in this order:
- amc-gitlab - Creates a deploy key on GitLab for Flux
- amc - Deploys a Kubernetes cluster on Amazon EKS
- amc-flux - Adds Flux to a Kubernetes Cluster with the GitLab deploy key
The apply that creates a cluster on EKS will take several minutes to complete.
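For example, you might apply all three stacks for the dev context with a small loop. This is a sketch that assumes the TFT_STACK and TFT_CONTEXT variables need to be set for each task invocation:

```shell
# Apply the stacks in dependency order for the dev context.
for stack in amc-gitlab amc amc-flux; do
  TFT_STACK="$stack" TFT_CONTEXT=dev task tft:init && \
  TFT_STACK="$stack" TFT_CONTEXT=dev task tft:plan && \
  TFT_STACK="$stack" TFT_CONTEXT=dev task tft:apply || break
done
```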
5: Register Your Cluster with Kubernetes Tools #
Use the AWS command-line tool to register the new cluster with your kubectl configuration.
If you are running the TF deployment from your own system, first ensure that you have AWS credentials in your shell session:
eval $(aws configure export-credentials --format env --profile your-aws-profile)
Run this command to add the cluster to your kubectl configuration:
aws eks update-kubeconfig --name your-eks-cluster-name
To set this cluster as the default context for your Kubernetes tools, run this command:
kubectl config use-context your-eks-cluster-arn
6: Test Your Cluster #
To test the connection to the API endpoint for the cluster, first assume the IAM role for human operators. Run this command to get the credentials:
aws sts assume-role --role-arn your-human-ops-role-arn --role-session-name human-ops-session
Set these values as environment variables, as shown in the sketch after this list:
- AccessKeyId -> AWS_ACCESS_KEY_ID
- SecretAccessKey -> AWS_SECRET_ACCESS_KEY
- SessionToken -> AWS_SESSION_TOKEN
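If you prefer to script this step, here is a sketch that exports the temporary credentials in one pass (it requires jq):

```shell
# Assume the human operations role and export the temporary credentials.
creds=$(aws sts assume-role \
  --role-arn your-human-ops-role-arn \
  --role-session-name human-ops-session \
  --query Credentials --output json)
export AWS_ACCESS_KEY_ID=$(echo "$creds" | jq -r .AccessKeyId)
export AWS_SECRET_ACCESS_KEY=$(echo "$creds" | jq -r .SecretAccessKey)
export AWS_SESSION_TOKEN=$(echo "$creds" | jq -r .SessionToken)
```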
Next, run this command to get a response from the cluster:
kubectl version
The command should return output like this:
Client Version: v1.32.3
Kustomize Version: v5.5.0
Server Version: v1.32.3-eks-bcf3d70
Once you can successfully connect to a cluster, you can use the flux command-line tool to work with Flux on that cluster. The example project provides tasks for this.
To check the current status of Flux on the cluster:
task flux:status
Flux checks the Git branches and applies changes to the cluster every few minutes. Use this task to trigger Flux on the cluster, rather than waiting for a scheduled run:
task flux:apply
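These tasks are thin wrappers; I assume they run flux CLI commands similar to the following (check the project's Taskfile for the exact commands):

```shell
flux check                                    # confirm that the Flux controllers are healthy
flux get kustomizations --all-namespaces      # show the sync status of each Kustomization
flux reconcile kustomization flux-system --with-source   # trigger an immediate sync from Git
```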
7: Going Further #
The code in the example project is a minimal configuration for an EKS Auto Mode cluster, along with a simple example Web application that is managed by Flux and Helm. You can use EKS add-ons or Flux to deploy additional applications and services on the clusters. Flux also provides a range of management capabilities, including automated update of container images and notifications.
The initial configuration is designed to work with minimal tuning. To harden the systems:
- Replace the generated IAM policies with custom policies.
- Disable public access to the cluster endpoint (see the example command after this list).
- Deploy the EKS clusters to private subnets and deploy the load balancers to public subnets.
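For example, public access to the cluster endpoint can be disabled with the AWS CLI, although in this project you would make the change in the TF configuration so that it is not reverted on the next apply (the cluster name is a placeholder):

```shell
# Restrict the Kubernetes API endpoint to private access only.
aws eks update-cluster-config \
  --name your-eks-cluster-name \
  --resources-vpc-config endpointPublicAccess=false,endpointPrivateAccess=true
```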
The current version of this project does not include continuous integration with GitLab. If you decide to use GitLab to manage changes, consider installing the GitLab cluster agent.
Extra: How the TF Code Works #
The tasks for TF are provided by my tooling template.
I have made several decisions in the example TF code for this project:
- The example code uses the EKS module from the terraform-modules project. This module enables you to deploy an EKS cluster by setting a relatively small number of values.
- We use a setting in the TF provider for AWS to apply tags on all AWS resources. This ensures that resources have a consistent set of tags with minimal code.
- To ensure that resource identifiers are unique, the TF code always constructs resource names in locals. The code for resources then uses these locals.
- The code supports TF test, the built-in testing framework for TF. You may decide to use other testing frameworks.
- The constructed names of AWS resources include a variant, which is set as a tfvar. The variant is either the name of the current TF workspace, or a random identifier for TF test runs.
Resources #
Amazon EKS #
- Official Amazon EKS Documentation
- EKS Workshop - Official AWS training for EKS
- Amazon EKS Auto Mode Workshop
- Amazon EKS Blueprints for Terraform
- Amazon EKS Auto Mode ENABLED - Build your super-powered cluster - A walk-through of EKS Auto Mode with TF