Using Google Cloud Build with private GKE clusters

To isolate workloads from the dangers of the internet, deploying “private” GKE clusters is recommended: the Kubernetes API endpoint is only reachable from the VPC that the private cluster is peered to. For an excellent guide on deploying a private GKE cluster, see this article.

Having a private GKE cluster creates some challenges. This article will address one of them: how to connect Cloud Build (CI/CD deployment) to your private GKE cluster.

Connecting Cloud Build to GKE: why doesn’t it “just work”?
Maybe you’re thinking: both Cloud Build and GKE are Google products that have been around for a while, and GKE in private mode is common, so why doesn’t this work out of the box? That’s an excellent question, and I was a bit surprised myself to discover that it doesn’t. The problem is that Google uses VPC peerings for both GKE and Cloud Build, and VPC peerings are not ‘transitive.’ The GKE masters are deployed in a Google-managed VPC that is peered to your project’s VPC. If you use Cloud Build private pools, they are also created in a Google-managed (but different) VPC, peered to your VPC network. Since these peerings are not transitive, the Cloud Build VPC cannot talk to the GKE master VPC without some (rather nasty) trickery.
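
You can see this for yourself by listing the peerings on your VPC (the network name here is an assumption):

gcloud compute networks peerings list --network=my-network

This will typically show one peering created by GKE for the control plane and one created by servicenetworking.googleapis.com for the private pool, each pointing at a different Google-managed VPC. Routes learned over one peering are not re-advertised over the other, which is exactly why the two cannot reach each other directly.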

Is there a workaround?
The official “Google” way to make this work is documented here. That solution uses a site-to-site VPN between two projects to work around the non-transitive nature of VPC peerings: a site-to-site VPN between VPCs does advertise the peering routes, allowing connectivity to peered networks. However, this workaround feels very hacky, introduces a site-to-site VPN, and makes some assumptions about the VPC setup (requiring two VPCs, for example).

So what’s the alternative?
Well, you could, of course, consider using a different CI/CD solution that runs ‘inside’ GKE, such as Flux, Google Cloud Deploy, or a private GitLab instance. However, the easiest way I could find that doesn’t involve a complete CI/CD redesign is to deploy an HTTP(S) proxy into the VPC network and use that proxy from a Cloud Build private pool to talk to the GKE cluster. The steps below show how to deploy this solution.

Step 1: Deploy a proxy in your VPC
First, let’s deploy a small VM that will act as our proxy:

data "google_client_config" "default" {}

### Deploy a proxy VM ###
resource "google_compute_instance" "proxy" {
  machine_type              = "e2-small"
  name                      = "proxy"
  zone                      = data.google_client_config.default.zone
  allow_stopping_for_update = true
  metadata_startup_script   = file("${path.module}/install_proxy.sh")
  tags = ["proxy"]
  boot_disk {
    auto_delete = true
    initialize_params {
      image = "ubuntu-os-pro-cloud/ubuntu-pro-2404-lts"
    }
  }
  network_interface {
    network    = <reference to your vpc network>
    subnetwork = <reference to your subnetwork>
    access_config {}
  }
  shielded_instance_config {
    enable_secure_boot          = true
    enable_integrity_monitoring = true
  }
}

### Create a firewall rule to allow access to the proxy ###
resource "google_compute_firewall" "allow-proxy" {
  name          = "allow-proxy"
  description   = "Allow access to the proxy"
  network       = <reference to your VPC network>
  direction     = "INGRESS"
  source_ranges = ["10.0.0.0/8"] # This could be more limited
  allow {
    protocol = "tcp"
    ports    = ["3128"]
  }
  target_tags = ["proxy"]
}

### Create an internal DNS zone ###
resource "google_dns_managed_zone" "example_internal" {
  private_visibility_config {
    networks {
      network_url = <reference to your network>
    }
  }
  visibility = "private"
  dns_name   = "example.internal."
  name       = "example-internal"
}

### Publish the proxy IP in DNS ###
resource "google_dns_record_set" "proxy_internal" {
  managed_zone = google_dns_managed_zone.example_internal.name
  name         = "proxy.example.internal."
  type         = "A"
  rrdatas      = [google_compute_instance.proxy.network_interface.0.network_ip]
}
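
The <reference to your ...> placeholders depend on how your VPC is defined. As a minimal sketch, assuming an existing network called my-network and a subnet called my-subnet in europe-west4, they could be filled in with data sources:

data "google_compute_network" "vpc" {
  name = "my-network"
}

data "google_compute_subnetwork" "subnet" {
  name   = "my-subnet"
  region = "europe-west4"
}

and referenced as data.google_compute_network.vpc.id and data.google_compute_subnetwork.subnet.id.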

NOTE: the firewall source IP range could (and should) be limited to the servicenetworking range used by the Cloud Build private pool, rather than the broad 10.0.0.0/8.
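
For example, once the private service range from step 2 exists, the rule could reference that allocation directly (a sketch, assuming the google_compute_global_address.psc-range resource created in step 2):

  source_ranges = ["${google_compute_global_address.psc-range.address}/${google_compute_global_address.psc-range.prefix_length}"]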

install_proxy.sh will look something like this:

#!/bin/bash
# Install squid proxy
export DEBIAN_FRONTEND=noninteractive
apt-get -y update
apt-get -y install squid

cat << EOF > /etc/squid/squid.conf
acl localnet src 10.0.0.0/8
acl localnet src fc00::/7
acl localnet src fe80::/10
acl to_metadata dst 169.254.169.254

acl SSL_ports port 443
acl Safe_ports port 80
acl Safe_ports port 443
acl CONNECT method CONNECT

http_access deny !Safe_ports
http_access deny CONNECT !SSL_ports
http_access deny to_metadata
http_access allow localhost manager
http_access deny manager

include /etc/squid/conf.d/*

http_access allow localnet
http_access deny all
http_port 3128
visible_hostname proxy.example.internal

access_log none
EOF

systemctl enable squid.service
systemctl restart squid.service

IMPORTANT: This proxy configuration blocks access to the Google metadata server. This prevents builds that use the proxy from reaching the proxy VM’s service account credentials and, through them, any resources that account has access to.

That should start a proxy VM, create a firewall rule to allow access, and publish the internal IP in DNS as ‘proxy.example.internal’.
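
To verify the proxy from another VM inside the VPC, you could check that regular HTTPS traffic is tunnelled while the metadata server is refused (hostnames as created above; the expected behaviour assumes the squid configuration shown earlier):

# Should print an HTTP status from googleapis.com (rather than a connection error), proving the tunnel works
curl -sS -o /dev/null -w '%{http_code}\n' -x http://proxy.example.internal:3128 https://www.googleapis.com/

# Should print 403: squid denies requests to the metadata server
curl -sS -o /dev/null -w '%{http_code}\n' -x http://proxy.example.internal:3128 http://169.254.169.254/computeMetadata/v1/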

Step 2: Create a peering connection to the ‘services network’
This step creates the peering connection to the Google-hosted ‘services network’ where the Cloud Build private pool instances will live. If you created your private GKE cluster first, a peering connection may already exist.

### Create a private service peering address range ###
resource "google_compute_global_address" "psc-range" {
  name          = "private-service-ip-alloc"
  purpose       = "VPC_PEERING"
  address_type  = "INTERNAL"
  prefix_length = 16
  network       = <reference to your network>
}

### Create the service peering connection ###
resource "google_service_networking_connection" "psc" {
  network                 = <reference to your network>
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.psc-range.name]
}

### Enable DNS resolution for example.internal in the services network ###
resource "google_service_networking_peered_dns_domain" "dns-peering" {
  name       = "internal-dns-peering"
  network    = <reference to your network>
  dns_suffix = "example.internal."
}

The peered DNS domain allows the Google-managed VPC that hosts the Cloud Build pool to resolve records in our ‘example.internal’ private DNS zone.
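
You can confirm that the peered DNS domain was created with gcloud (the network name is an assumption):

gcloud services peered-dns-domains list --network=my-network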

Step 3: Create a private Cloud Build worker pool
Next, it’s time to create our private Cloud Build pool. Unfortunately, the API and Terraform resource for this are still in beta and ‘invite only’ at the time of writing, so we’ll have to resort to the ‘gcloud’ command-line tool or the console:

gcloud builds worker-pools create my-pool \
        --project=my-project \
        --region=europe-west4 \
        --peered-network=projects/my-project/global/networks/my-network \
        --worker-machine-type=e2-medium \
        --worker-disk-size=100
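
Once the pool is created, you can confirm it is up and peered to the right network:

gcloud builds worker-pools describe my-pool --region=europe-west4 --project=my-project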

Step 4: Adjust your Cloud Build jobs to use the private pool and proxy
Almost there! Everything is now set up to run Cloud Build jobs in our own pool and access private resources like GKE. Let’s deploy a test job to try it out:

$ mkdir tmp && cd tmp
$ cat << EOF > test.yaml
steps:
- name: 'gcr.io/cloud-builders/kubectl'
  args: ['get', 'nodes']
  env:
  - 'CLOUDSDK_COMPUTE_REGION=<your GKE cluster region>'
  - 'CLOUDSDK_CONTAINER_CLUSTER=<your GKE cluster name>'
  - 'no_proxy=google.internal,googleapis.com'
  - 'https_proxy=proxy.example.internal:3128'
options:
  workerPool:
    'projects/my-project/locations/europe-west4/workerPools/my-pool'
EOF
$ gcloud builds submit --config test.yaml --project my-project

If all goes well, this should output a list of nodes in your GKE cluster! If not, check that your proxy VM can reach the GKE control plane (is the proxy’s subnet listed in master_authorized_networks?).
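
If the build hangs or times out talking to the control plane, the usual culprit is master authorized networks. A minimal sketch of the relevant cluster setting in Terraform (the CIDR is an assumption; use the subnet your proxy VM lives in):

resource "google_container_cluster" "private" {
  # ... existing private cluster configuration ...

  master_authorized_networks_config {
    cidr_blocks {
      cidr_block   = "10.0.1.0/24" # subnet of the proxy VM (assumption)
      display_name = "proxy-subnet"
    }
  }
}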

Conclusion
As you can see, this is quite a bit of work to do something that should be very simple and (IMHO) work out of the box. Hopefully, Google will realize this at some point and create a better solution. Until then, I hope the above helps you get around this annoying issue quickly.