Scaling workloads based on GPU utilization in GKE

With GPU-accelerated workloads becoming more common, scaling workloads based on CPU alone is no longer optimal: the CPU is often close to idle while the GPU is the real bottleneck. Instead, it’s better to scale on actual GPU usage. Here’s how to do just that on Google Cloud with Google Kubernetes Engine (GKE).


Step 1: Setting up the Google Cloud metrics adapter
First, we’ll need to install a component that exposes Google Cloud Monitoring metrics (formerly known as Stackdriver) to Kubernetes through the ‘external metrics API.’ Google has created the ‘Custom Metrics Stackdriver Adapter’ for this. Installing it can be as simple as:

$ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml

I strongly recommend reviewing all the resources that it creates, though. Or even better: deploy it with your favorite Infrastructure-as-Code tool. Mine is Terraform, as you’ll see below.
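
If you want a quick overview of what the manifest would create before applying it, a client-side dry run lists every resource without touching the cluster:

$ kubectl apply --dry-run=client -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml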

NOTE: by default, the adapter registers itself for both the ‘custom.metrics.k8s.io’ API and the ‘external.metrics.k8s.io’ API. The former can conflict with other metrics adapters that are already in use; in that case, remove the APIService with ‘name: v1beta1.custom.metrics.k8s.io’ from the manifest. We only need the external metrics API for GPU-based scaling (or any other standard GCP metric).
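
To see which metrics APIs are already registered in your cluster before installing (and, if you applied the manifest unmodified, to remove the conflicting registration afterwards), something like this should do; the exact output will vary per cluster:

$ kubectl get apiservices | grep metrics.k8s.io
$ kubectl delete apiservice v1beta1.custom.metrics.k8s.io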

After installation, we need to give the service account that the adapter uses permission to read Google Cloud metrics. I recommend using “Workload Identity” for this, since it’s the most secure and a relatively easy way to do it. Here’s how, using Terraform:

# Create a google service account for the adapter to use
resource "google_service_account" "custom-metrics-adapter" {
  account_id = "custom-metrics-adapter"
}

# Get the current project id from the terraform provider configuration
data "google_client_config" "gcp" {}

# Give the account the 'monitoring.viewer' role to allow it
# to read Google Cloud metrics
resource "google_project_iam_member" "cma-monitoring-reader" {
  project = data.google_client_config.gcp.project
  member  = google_service_account.custom-metrics-adapter.member
  role    = "roles/monitoring.viewer"
}

# Enable workload identity for the Google service account, binding it to
# the Kubernetes SA used by the stackdriver adapter. Note that I'm using
# the 'monitoring' namespace
resource "google_service_account_iam_member" "cma-wlid" {
  role               = "roles/iam.workloadIdentityUser"
  service_account_id = google_service_account.custom-metrics-adapter.id
  member             = "serviceAccount:${data.google_client_config.gcp.project}.svc.id.goog[monitoring/custom-metrics-stackdriver-adapter]"
}

All we need to do now is annotate the adapter’s existing Kubernetes service account with the email of the Google service account:

$ kubectl annotate serviceaccounts custom-metrics-stackdriver-adapter "iam.gke.io/gcp-service-account"="custom-metrics-adapter@<projectid>.iam.gserviceaccount.com" -n monitoring
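
One way to sanity-check the setup is to query the external metrics API directly for the metric we’ll use below. If the permissions are in place, this should return a (possibly empty) list of recent data points rather than an authorization error; the ‘default’ namespace here is just part of the API path, any namespace works:

$ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/default/kubernetes.io|container|accelerator|duty_cycle"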

Step 2: Using the external metric in a horizontal pod autoscaler
Now that the metrics are available in Kubernetes, we can use them to scale our workloads with a HorizontalPodAutoscaler:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-workload
spec:
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          name: kubernetes.io|container|accelerator|duty_cycle
          selector:
            matchLabels:
              resource.labels.container_name: gpu-container-name
        target:
          type: AverageValue
          averageValue: 80
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: "gpu-workload"

Some things to note:
1. The example uses the “autoscaling/v2” API, which allows us to scale based on more than one metric.
2. The first metric is the ‘External’ metric “accelerator/duty_cycle.” It is reported per container, so the selector filters on the container name (see the worked example after this list).
3. The second metric is the usual CPU-based one, so the workload still scales out if it ever becomes CPU-bound.
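
To make the duty cycle target concrete (the numbers here are made up): with an AverageValue target of 80, the HPA tries to keep the duty cycle averaged over all matching containers at or below 80%. If three replicas report duty cycles of 90, 95, and 85 (270 in total), the HPA computes ceil(270 / 80) = 4 and scales out to four replicas.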

That’s it! You should now see your HorizontalPodAutoscaler scale based on both GPU duty cycle and CPU:

$ kubectl get hpa
NAME         REFERENCE               TARGETS              MIN MAX
gpu-workload Deployment/gpu-workload 26/80 (avg), 47%/80% 1   10
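
If the TARGETS column shows ‘<unknown>’ instead of values, describing the HPA usually reveals the cause, such as missing permissions, a mistyped metric name, or no running containers matching the selector:

$ kubectl describe hpa gpu-workload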