Autopilot cluster random 502s on deployment

VolkerOrange · 06-05-2023 05:28 AM

Hi,

We're using an Autopilot cluster on GKE with an HTTPS (+HTTP/2) load balancer.

For about a month now, we've been seeing an issue where, after deploying a new version of a workload, the load balancer does not connect to the new pod even once it is up and running. Everything is green and GKE reports no error states, but the load balancer shows "0 of 0" healthy in all regions.

Requests to the web service fail with 502 errors. We also have a TCP socket running, which works flawlessly, and all jobs on the pods are executed as well, so the pod itself is up and running fine.

When we either scale up the workload (e.g. to 2 replicas) or manually kill the unreachable pod, the replacement pod usually comes up and successfully connects to the load balancer.
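
For reference, here is a minimal sketch of the two workarounds as kubectl invocations wrapped in Python. The deployment name, pod name and namespace are placeholders (not our real ones), and the helper functions are only for illustration.

```python
import subprocess
from typing import List, Optional


def scale_cmd(deployment: str, replicas: int, namespace: str = "default") -> List[str]:
    """Build the kubectl command that scales a deployment (workaround 1)."""
    return ["kubectl", "scale", f"deployment/{deployment}",
            f"--replicas={replicas}", "-n", namespace]


def delete_pod_cmd(pod: str, namespace: str = "default") -> List[str]:
    """Build the kubectl command that deletes a single pod (workaround 2)."""
    return ["kubectl", "delete", "pod", pod, "-n", namespace]


def run(cmd: List[str], dry_run: bool = True) -> Optional[subprocess.CompletedProcess]:
    """Print the command when dry_run is set, otherwise execute it."""
    if dry_run:
        print(" ".join(cmd))
        return None
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "web" and "web-abc123" are hypothetical names.
    run(scale_cmd("web", 2))
    run(delete_pod_cmd("web-abc123"))
```

With `dry_run=True` (the default) this only prints the commands, so nothing is changed on the cluster.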

I haven't been able to find any errors, and I'm running out of ideas on how to fix this (other than moving off the Autopilot cluster to see if that's the cause).

This happens on both our dev and our production clusters.

Any idea what might be causing this and how to fix it?