Troubleshooting GKE Pods in CrashLoopBackOff: How to Start with Strace

In most real-world Kubernetes cases, infrastructure engineers don’t need to understand how a container works internally; that is usually the responsibility of the container image developers. Problems arise, however, once a container lands on Kubernetes, a complex deployment and orchestration system in which unexpected components can influence a container’s behavior.

Debugging unexpected container behavior is feasible on self-hosted Kubernetes, where engineers have direct access to the nodes and therefore better observability and testing options. However, if you are using GKE or another cloud provider’s managed Kubernetes engine, the situation becomes more complicated.

Here’s a problem I recently faced.

The Challenge: Debugging a Crashing vLLM Container on GKE

Recently, I deployed a Large Language Model (LLM) serving container on GKE as a Deployment (see [1] in the cross-references). When I ran the serving framework, vLLM, in a non-containerized environment, everything worked perfectly, but it consistently crashed when running inside a Kubernetes pod.

The most challenging aspect was that the application produced no stdout or log output at all. The only information I had came from kubectl, which showed that the container had exited with status code 1.

$ kubectl get pods
NAME                               READY   STATUS             RESTARTS   AGE
meta-deployment-789d98c8f7-4fsz4   0/1     CrashLoopBackOff   5          10m
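
For reference, this is roughly how I confirmed the exit code and the absence of output; both are standard kubectl commands, shown here as a sketch using the pod name from the listing above.

# Logs of the previous (crashed) container -- empty in my case, confirming nothing was written to stdout
$ kubectl logs meta-deployment-789d98c8f7-4fsz4 --previous

# The "Last State" section of the describe output shows the termination reason and exit code (1 here)
$ kubectl describe pod meta-deployment-789d98c8f7-4fsz4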

To better understand why this was happening, I tried the same container with different entry points and commands. The issue turned out to be highly correlated with the keyword "vllm": every time a command contained the string "vllm", it crashed (except when I used echo vllm). echo presumably survived because it is a shell built-in, whereas any command that actually exec'd a new binary crashed.
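
To give a flavor of those experiments, here is an illustrative sketch (not an exact transcript) of the kind of comparisons I made from a shell inside the same image, with the original entry point overridden:

# Works: echo is a shell built-in, so no new binary is exec'd
root@test:/$ echo vllm

# Crashes: an external binary is exec'd with "vllm" in its argument list
root@test:/$ python -m vllm.entrypoints.api_server --help

# Also crashes: even an unrelated binary appears affected once "vllm" shows up on the command line
root@test:/$ ls vllm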

With this context, I decided to use gdb to debug the container. To do that, I needed to rebuild the container image on top of the existing one. However, even running gdb --args python -m vllm.entrypoints.api_server .... still resulted in a crash. To move forward, I decided to rely on strace, a system-call tracing utility, to understand what was happening.
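
As a side note, the debug rebuild itself is straightforward: layer the tools on top of the existing serving image. A minimal sketch, assuming the base image is Debian/Ubuntu-based (the base tag comes from the Deployment in the cross-references; the target registry is a placeholder):

$ cat << EOF > Dockerfile.debug
# Base: the vLLM serving image used by the Deployment below
FROM us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20240821_1034_RC00
# Add gdb and strace for interactive debugging inside the container (assumes apt is available)
RUN apt-get update && apt-get install -y gdb strace
EOF
$ docker build -f Dockerfile.debug -t <your-registry>/pytorch-vllm-serve:debug .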

Here’s the tricky part: GKE’s default worker node OS image is Container-Optimized OS (COS), a minimal, locked-down Linux distribution based on the Chromium OS project. For example, it ships no package manager, and most writable paths on its file system are mounted non-executable. So how could I easily download and use strace?
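
If you want to see these restrictions for yourself, you can SSH into a node and inspect the mount flags; a rough sketch (node name and zone are placeholders, and the exact noexec layout may vary between COS releases):

# From your workstation
$ gcloud compute ssh gke-my-cluster-default-pool-xxxx --zone=us-central1-a

# On the node: most writable locations are mounted noexec, so a strace binary
# copied there cannot simply be executed
$ mount | grep noexec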

Step-by-Step Guide: Debugging GKE CrashLoopBackOff with Strace

Since I was already on a containerized platform, the simplest approach was to deploy another container and perform all the debugging operations from within it. This container needed to fulfill three criteria:

  1. The strace pod must be privileged.
  2. The strace pod must have the SYS_PTRACE Linux capability.
  3. The strace pod must be in the host PID namespace to view all processes.

Here’s an example of the pod manifest I used:

$ cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: strace
spec:
  hostPID: true
  containers:
  - name: ubuntu-container
    image: ubuntu:latest
    command: ["sleep"]
    args: ["infinity"]
    securityContext:
      privileged: true
      capabilities:
        add: ["SYS_PTRACE"]
EOF
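
One detail worth calling out: hostPID only exposes the processes of the node the strace pod is scheduled on, so it has to land on the same node as the crashing pod. A simple way to ensure that:

# Find the node where the crashing pod is running (NODE column)
$ kubectl get pod meta-deployment-789d98c8f7-4fsz4 -o wide

# Then add "nodeName: <that node>" under spec: in the manifest above before applying it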

After deploying the pod, I accessed its shell to check the PID of the target process I wanted to trace. The output below is an example (not from my actual pod). We can see that the container’s entry point, bash, has a PID of 4128.

$ kubectl exec -it strace -- bash
root@strace:/$ apt update && apt -y install strace
root@strace:/$ ps auxf
...
root        2863  0.0  0.0 1238252 13984 ?       Sl   15:05   0:00 /usr/bin/containerd-shim-runc-v2 -namespace moby -id eb73d524f84a7813897af328350daf417e26d91661596ce64a47002ce84e0218 -address /run/containerd/conta
root        2936  0.0  0.0  20436 10512 ?        Ss   15:05   0:00  \_ /pause
root        4128  1.3  0.2 3005172 88204 ?       Ssl  15:05   0:50  \_ /bash
...
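
Tip: when many containerd-shim processes are listed, the container ID reported by kubectl helps locate the right subtree. A quick sketch:

# From your workstation: print the target container's ID (containerd://<id>)
$ kubectl get pod meta-deployment-789d98c8f7-4fsz4 -o jsonpath='{.status.containerStatuses[0].containerID}'

# Inside the strace pod: find the shim launched with that ID and note its child's PID
root@strace:/$ ps auxf | grep -A2 <container-id-prefix>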

The next step was to attach strace to that process and follow any child processes it spawned, then re-run the vLLM entry point from the target pod’s shell to reproduce the issue.

# In strace pod
root@strace:/$ strace -f -p [PID] 2>&1 | tee strace.log # PID of the target container's entry point, e.g. 4128

# Go back to target pod's shell
root@meta-deployment-789d98c8f7-4fsz4:/$ python -m vllm.entrypoints.api_server ....

# Inside strace pod
wait(-1, 
fork(....

After that, I was able to examine the strace log to understand why and when my container was crashing.
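
When reading the log, I usually start from the terminal events and work backwards. A rough sketch of the kind of filtering I mean:

# Signals, failed syscalls, and the final exit are usually where the story is
$ grep -nE 'SIG|exit_group|= -1 ' strace.log | tail -n 50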

Summary

In this post, I described a tricky debugging scenario and the limitations GKE imposes when investigating it, and showed how to work around those limitations by running the powerful strace utility against a container on a GKE node. As for the root cause of this strange behavior, that will be covered in a later post; it is a different but interesting problem.

Cross-references

[1] vLLM deployment example from GCP Vertex Model Garden

apiVersion: apps/v1
kind: Deployment
metadata:
  name: meta-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: meta-server
  template:
    metadata:
      labels:
        app: meta-server
        ai.gke.io/model: Llama-3-1-8B-Instruct
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: model-garden
    spec:
      containers:
      - name: inference-server
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20240821_1034_RC00
        resources:
          requests:
            cpu: 8
            memory: 29Gi
            ephemeral-storage: 80Gi
            nvidia.com/gpu : 1
          limits:
            cpu: 8
            memory: 29Gi
            ephemeral-storage: 80Gi
            nvidia.com/gpu : 1
        command:
        args:
        - python
        - -m
        - vllm.entrypoints.api_server
        - --host=0.0.0.0
        - --port=7080
        - --swap-space=16
        - --gpu-memory-utilization=0.9
        - --max-model-len=32768
        - --trust-remote-code
        - --disable-log-stats
        - --model=gs://vertex-model-garden-public-us/llama3.1/Meta-Llama-3.1-8B-Instruct
        - --tensor-parallel-size=1
        - --max-num-seqs=12
        - --enforce-eager
        - --disable-custom-all-reduce
        - --enable-chunked-prefill
        env:
        - name: MODEL_ID
          value: "meta-llama/Llama-3.1-8B-Instruct"
        - name: DEPLOY_SOURCE
          value: "UI_NATIVE_MODEL"
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
---
apiVersion: v1
kind: Service
metadata:
  name: meta-service
spec:
  selector:
    app: meta-server
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 7080
