Saving Costs with Google Kubernetes Engine (GKE)

Another Kubernetes cost-saving blog

A common issue with Kubernetes is cost control. How do you prevent huge surprise bills, and how can you optimize your workloads and infrastructure to keep costs as low as possible? This blog post gives you a list of steps to reduce costs when using Google Kubernetes Engine (GKE) on Google Cloud, including all the technical details.

What is the common cause of high costs?

What causes high costs when using GKE? The answer is simple: provisioning more resources than your workloads consume. So the key to saving costs on GKE is reducing slack resources: stuff you ask for but don’t use. Okay, but how do we do that? These are the high-level steps we need to take:

  1. Make sure you request as many resources as you are going to use. No more, no less.
  2. Make sure your node pools match the requested resources.
  3. Make sure the cluster autoscaler can do its job.
  4. Profit!

Step 1: Requesting the right amount of resources

When you deploy your application, it runs in Kubernetes ‘pods’. These pods need memory and CPU, and sometimes even GPU or disk resources. We’ll focus on CPU and memory, since those are the most commonly over-provisioned. The required resources are specified in the ‘resources’ section of the pod YAML:

apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  containers:
  - name: app
    image: images.my-company.example/app:v4
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"

As you can see, we request 64Mi of memory and 250 millicores (a quarter of a CPU). These values should match what your app actually needs. If you request too much, Kubernetes reserves more than necessary, which may cause extra nodes to be spun up, costing you more money. Note that from a cost-savings perspective, “limits” are not relevant: the scheduler reserves node capacity based on requests, not limits.
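
To see how actual consumption compares to what you request, you can use kubectl top (backed by the metrics server, which GKE runs by default) and inspect the configured requests; the pod name below is the ‘frontend’ example from above:

# Current actual usage of the pod
kubectl top pod frontend

# The requests you configured
kubectl get pod frontend -o jsonpath='{.spec.containers[*].resources.requests}'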

Tuning these resource requests manually can be quite a task, especially if you have many applications in active development with changing resource requirements. Luckily, there are automated solutions: the Horizontal Pod Autoscaler (HPA) and the Vertical Pod Autoscaler (VPA). Deploy these alongside your app and the VPA will automatically adjust your pods’ resource requests based on historical usage metrics, while the HPA adds or removes replicas as load changes. Really cool!

Deploying HPA and VPA together
Most tutorials about HPA and VPA will tell you that using both HPA and VPA on one deployment is impossible because they would be in each other’s way: VPA trying to adjust CPU requests and HPA trying to scale based on CPU usage. However, there’s a solution: You can tell VPA only to adjust memory and still have an HPA scale on CPU usage. Here’s an example of a VPA that does that:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
  namespace: my-app
spec:
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      controlledResources:
      - memory
      controlledValues: RequestsOnly
      mode: Auto
    - containerName: istio-proxy
      mode: "Off"
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: Auto
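
To complete the picture, here is a minimal sketch of a matching HPA that scales the same ‘my-app’ Deployment on CPU utilization (the replica bounds and the 70% target are just illustrative values; use autoscaling/v2beta2 on clusters older than 1.23):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
  namespace: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70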

Alternatively, it also works well to scale your deployment with an HPA based on a custom metric, like GPU usage or HTTP requests per second, and let the VPA manage both CPU and memory.

Step 2: Make sure your node pools match the requested resources.

Now that we know our pods are tuned and requesting only what they need, the next step is to ensure our node pools consist of machines that match the workload as closely as possible. Doing this by hand is, again, a lot of work, and with changing demands you’re always behind. But, once more, Google comes to the rescue with an excellent feature called node auto-provisioning (NAP). It automatically creates node pools based on the requests of pods in the ‘pending’ state. In addition, it supports preemptible nodes, GPUs, custom machine types, custom node labels, and more.
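
If auto-provisioning isn’t enabled on your cluster yet, it can be turned on with gcloud; a minimal sketch with placeholder cluster name, zone, and cluster-wide resource limits (CPU in cores, memory in GB):

gcloud container clusters update my-cluster \
  --zone europe-west4-a \
  --enable-autoprovisioning \
  --min-cpu 1 --max-cpu 100 \
  --min-memory 1 --max-memory 1000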

Speaking of preemptible nodes: use them! This type of node is much cheaper (up to 60%) than regular nodes. The downside is that they only ‘last’ for a maximum of 24 hours. If your workload is resilient to failure (as it should be on Kubernetes), this should not be a problem. With GKE version 1.20 and later, your pods will even be shut down gracefully when the 24 hours are up. When using NAP, you can request preemptible nodes for your app like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
      tolerations:
      - effect: NoSchedule
        key: cloud.google.com/gke-preemptible
        operator: Equal
        value: "true"

Adding the toleration will signal NAP to create a node pool with preemptible machines for this workload.
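
Once such pods trigger a scale-up, the new nodes carry a well-known label you can filter on to verify that preemptible machines are actually being used:

kubectl get nodes -l cloud.google.com/gke-preemptible=true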

Step 3: Make sure the cluster autoscaler can do its job

Now our resource requests are perfect and our node pools are sized just right. And yet, it still seems like there are too many nodes in the cluster for the actual workload. Unfortunately, this common and expensive issue often goes undetected: by default, the cluster autoscaler is cautious when picking pods to evict in order to free up a node for scale-down. Too cautious, in fact: without some additional work on your part, it will rarely scale down at all. NOTE: starting with GKE version 1.22, the autoscaler will evict pods with local storage by default, with no additional configuration. Yay!

Fix 1: mark your pods as ‘safe to evict.’
When the autoscaler looks at pods, it will not allow pods with local storage to be evicted. For example, any pod with an ‘emptyDir’ volume will not be allowed to move. Nodes running such pods will never be scaled down, even when their usage is below the threshold. This is common when using Istio, since it injects a sidecar container with an emptyDir volume. The solution is to mark your pods as ‘safe to evict’, like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
      labels:
        app: nginx
… etc.
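
If you’d rather not edit the manifest, the same annotation can also be patched onto an existing Deployment’s pod template; a sketch using the example Deployment above:

kubectl patch deployment nginx-deployment --type merge \
  -p '{"spec":{"template":{"metadata":{"annotations":{"cluster-autoscaler.kubernetes.io/safe-to-evict":"true"}}}}}'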

Fix 2: Create Pod Disruption Budgets for system components
Another common cause of node pools not scaling down is the presence of system pods without a pod disruption budget (PDB). The autoscaler will refuse to evict system pods unless they have a PDB with enough allowed disruptions. Here’s an example of how the PDBs could look:

NAME                                           MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS
calico-node-vertical-autoscaler-pdb            0               N/A               1
calico-typha-autoscaler-pdb                    0               N/A               2
calico-typha-pdb                               1               N/A               1
event-exporter-gke-pdb                         0               N/A               1
kube-dns-autoscaler-pdb                        0               N/A               1
kube-dns-pdb                                   1               N/A               1
l7-default-backend-pdb                         0               N/A               1
metrics-server-pdb                             0               N/A               1
stackdriver-metadata-agent-cluster-level-pdb   0               N/A               1

Make sure to set minAvailable to ‘1’ for essential components like kube-dns, coredns, and calico-typha so your cluster won’t break when nodes are removed.
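
As a concrete sketch, a PDB for kube-dns could look like this (assuming GKE’s standard ‘k8s-app: kube-dns’ pod label; use policy/v1beta1 on clusters older than 1.21):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kube-dns-pdb
  namespace: kube-system
spec:
  minAvailable: 1
  selector:
    matchLabels:
      k8s-app: kube-dns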

Checking the autoscaler logs
Once you have applied the fixes above, the autoscaler should do its job properly and remove redundant nodes from your cluster. To check whether this is really happening, go to ‘Kubernetes Clusters’ -> <Your cluster> -> Logs -> Autoscaler Logs. Look for events of type ‘noDecisionStatus’ and read the description to find out why a scale-down could not take place. With the fixes above, this should no longer happen!
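
If you prefer querying the Logs Explorer directly, a filter along these lines should surface the same events (project ID and cluster name are placeholders):

resource.type="k8s_cluster"
resource.labels.cluster_name="my-cluster"
logName="projects/my-project/logs/container.googleapis.com%2Fcluster-autoscaler-visibility"
jsonPayload.noDecisionStatus:*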

Step 4: Profit!

Now that we’ve tuned everything and automated our resource tuning and provisioning, a nice decline in costs should be visible in the billing dashboard. I’ve seen costs drop by as much as 30–40% from these improvements alone. So what are you waiting for? Go save some costs!

Please feel free to share your thoughts and experiences!