How do we know if Kubelet leaks Inotify watchers

Kubelet, the node agent of Kubernetes, constantly needs to monitor paths. It uses inotify to do so, which exposes it to the possibility of leaking inotify watchers. Recently, I observed a case where Kubelet hung because of enormous inotify usage. This post briefly discusses how I debugged the issue and located the problem.

Symptom

Recently, I faced an issue where Kubelet on a node reported errors about failing to create file watchers.

error creating file watcher: too many open files
error creating file watcher: no space left on device
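
These two messages typically map to the kernel's per-user inotify limits: creating an inotify instance fails with "too many open files" when the instance limit is hit, and adding a watch fails with "no space left on device" when the watch limit is hit. A quick way to check both limits on a node (a minimal sketch; the values shown are only examples):

# Inspect the per-user inotify limits
sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches

fs.inotify.max_user_instances = 128
fs.inotify.max_user_watches = 10000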

After a quick check, I found the node had fs.inotify.max_user_watches set to 10000, but the total number of inotify watches was beyond this value. (P.S. I am still not sure how the number of watches can exceed the cap.) To find out which process held the most watchers, I used the following command[^flbit_ino]:

echo -e "COUNT\tPID\tUSER\tCOMMAND"; sudo find /proc/[0-9]*/fdinfo -type f 2>/dev/null | sudo xargs grep ^inotify 2>/dev/null | cut -d/ -f 3 | uniq -c | sort -nr | { while read -rs COUNT PID; do echo -en "$COUNT\t$PID\t"; ps -p $PID -o user=,command=; done}

COUNT   PID USER    COMMAND
7491    8412    root /home/kubernetes/bin/kubelet --v=2 --cloud-provider=gce --experi
2620    1   root /sbin/init
....

Surprisingly, Kubelet had initiated more than 7000 inotify watchers. I suspected an inotify leak in Kubelet.

Leakage check

Clean Kubelet

To better understand the situation, I created a clean cluster on GKE with only one clean node. There were roughly 70 inotify watchers on it. After I created a single nginx pod, the number increased by 3. Theoretically, these 3 watchers are used by Kubelet to monitor changes on the rootfs, kube-api-access and the PodSandbox. But to verify that, we need to check in more detail which inodes are monitored by Kubelet.
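
To reproduce that before/after comparison, summing the inotify entries across Kubelet's file descriptors is enough. A minimal sketch, assuming the GKE kubelet binary path; run it once before and once after creating the pod:

# Count inotify entries across all of kubelet's fds
KPID=$(pgrep -o -f /home/kubernetes/bin/kubelet)
sudo find /proc/${KPID}/fdinfo -type f 2>/dev/null | sudo xargs grep ^inotify 2>/dev/null | wc -l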

Check inotify file descriptors

To do so, let’s take a look at how to track a single inotify file descriptor. Opening the process’s fdinfo folder, we can examine each entry to find an inotify fd.

# Find kubelet pid
ps -aux | grep kubelet
KPID=2430

# Pick an example fd
sudo ls /proc/${KPID}/fdinfo

0 1 10 11 12 13 14 2 3 4 5 6 7 8 9

...

sudo cat /proc/${KPID}/fdinfo/8

pos:    0
flags:  02004000
mnt_id: 15
ino:    1057
inotify wd:1 ino:3f327 sdev:800001 mask:fc6 ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:27f30300e5059ea2

This is quite confusing, so I relied on man proc[^man_proc] to understand each piece. For the given fd, the information we need sits in the last line: an inotify entry that represents one file or folder being monitored. The most useful field is ino:3f327, the inode number of the target file (in hexadecimal). Next to it, sdev:800001 is the ID of the device the inode sits on, also in hex.
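
As a side note, sdev can be decoded into the major:minor pair that shows up elsewhere in /proc: the kernel's dev_t keeps the minor number in the low 20 bits and the major number above them, so sdev:800001 corresponds to 8:1, i.e. /dev/sda1. A small sketch of the conversion:

# Decode sdev (hex) into major:minor device numbers
sdev=800001
dec=$((16#${sdev}))
echo "major=$((dec >> 20)) minor=$((dec & 0xFFFFF))"

major=8 minor=1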

Using lsblk, I can see there is only one disk in use on the node, so finding the target file is easy.

# Cast the hex inode number to decimal
ino=3f327
dec="$((16#${ino}))"

# Find the target file by inode number (debugfs works on the block device directly)
sudo debugfs -R "ncheck ${dec}" /dev/sda1 2>/dev/null

debugfs 1.46.5 (30-Dec-2021)
Inode   Pathname
258855  /etc/srv/kubernetes/pki/ca-certificates.crt
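
To double-check the mapping, the inode number reported by stat on that path should match the decimal value we just computed (a quick sanity check against the output above):

# Cross-check: the inode number should be 258855
sudo stat -c '%i %n' /etc/srv/kubernetes/pki/ca-certificates.crt

258855 /etc/srv/kubernetes/pki/ca-certificates.crt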

Putting all the steps above into one script (please see the updated version in the appendix), I can retrieve all the target files, which helps to understand whether there is a real leak. I also count how many times each unique inode is watched, which shows which inodes are monitored multiple times.

cat << 'EOF' | sudo tee test.sh
# Kubelet pid is passed in as the first argument
PID=$1
echo "kubelet pid="${PID}
in_fds=$(find /proc/${PID}/fdinfo -type f 2>/dev/null | xargs grep ^inotify | cut -d " " -f 3 | cut -d ":" -f 2)
echo ${in_fds}
echo "Count: $(find /proc/${PID}/fdinfo -type f 2>/dev/null | xargs grep ^inotify | wc -l)"

uniq_fds=$(echo "${in_fds[@]}" | sort | uniq)
echo ${uniq_fds}

while read -r element;
do
  count=$(echo "${in_fds[@]}" | grep -o "${element}" | wc -l)
  dec="$((16#${element}))"
  loc="debugfs -R 'ncheck ${dec}' /dev/sda1"
  loc=$(eval $loc 2>/dev/null | tail -1 | cut -d " " -f 4)
  printf "%-6s %-10s %-6s %s\n" "${element}" "${dec}" "${count}" "${loc}"
done <<< "${uniq_fds}"
EOF

sudo bash test.sh ${KPID}

kubelet pid=2430
3f327 3f321 ...
Count: 120
1 10b 1259 128a ...
1   1   72  Inode   Pathname
10b 267 1   267 /etc/systemd/system/multi-user.target.wants/snapd.service
...

The results consist of the following parts:

  • One line with the Kubelet pid
  • One line with all target inode numbers (in hex)
  • One line with the total count of inotify entries (120)
  • One line with the sorted unique inode numbers
  • One line per unique inode after that, showing the hex inode number, its decimal value, how many times it is watched, and the debugfs output (the decimal inode again plus the target file path).

I ran the same script on the problematic node, and it showed the following result. In summary, most Kubelet watchers were targeting ino:1, with 6649 watches on it, which looked like a leak because there were only about 150 pods on this node. Unfortunately, debugfs couldn't find the target files, so the output showed only the meaningless header "Inode Pathname".

kubelet pid=8412
...
Count: 7491
...
1   1   6649    Inode Pathname

Bad apple

Why can't debugfs help anymore? The reason is simple: each pod is set up with its own rootfs and mounts, which means the watched files live on different filesystems with their own independent inode numbering, while debugfs only looks at /dev/sda1. There are other ways to resolve them; I chose the common tool find to locate files by inode number.

sudo find / -inum 1 2>/dev/null
/home/kubernetes/containerized_mounter/rootfs/dev
/home/kubernetes/containerized_mounter/rootfs/proc
...
/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/pods/5325873d-f2a0-48df-83e2-0b911df2f77f/volumes/kubernetes.io~projected/kube-api-access-227jg
...
/dev
/boot/efi
...

This makes things easy, because I can compare the pod IDs in these paths against the pods currently running to find out whether any terminated pods are still being watched. And indeed, it showed that some non-existent pods were somehow still being watched.
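
A rough way to do that comparison is sketched below. It assumes kubectl access to the cluster (the running-pod list can also be narrowed with --field-selector spec.nodeName=<node>), and the sed pattern for extracting pod UIDs is mine:

# Pod UIDs that still show up in watched paths under /var/lib/kubelet/pods
watched=$(sudo find / -inum 1 2>/dev/null | sed -n 's|.*/var/lib/kubelet/pods/\([0-9a-f-]*\).*|\1|p' | sort -u)

# Pod UIDs known to the API server
running=$(kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.uid}{"\n"}{end}' | sort -u)

# UIDs that are still watched but no longer running
comm -23 <(echo "$watched") <(echo "$running")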

What to expect next

  • [How Kubelet leaked inotify watchers?]()
  • [debugfs]()

Appendix

Updated script to get inotify watchers initiated by kubelet

Thanks to yujuhong@ for the momentum and help in finishing the updated script.

PID=$(echo $(ps -aux | grep "/home/kubernetes/bin/kubelet" | grep -v grep | head -1) | cut -d " " -f 2)
echo "Kubelet Pid:" ${PID}

inums_raw=$(find /proc/${PID}/fdinfo -type f 2>/dev/null | xargs grep ^inotify)
# echo ${inums_raw}
echo "Count: $(find /proc/${PID}/fdinfo -type f 2>/dev/null | xargs grep ^inotify | wc -l)"

while read -r line;
do
        reg="ino:([0-9a-f]*) sdev:([0-9a-f]*)"
        if [[ ${line} =~ $reg ]]; then
                ino="${BASH_REMATCH[1]}"
                sdev="${BASH_REMATCH[2]}"
                # echo $ino $sdev
        else
                echo "wrong line: ${line}"
                continue
        fi

        sdev_in_dec=$((16#$sdev))
        # dev_t encoding: major number in the high bits, minor number in the low 20 bits
        major=$((sdev_in_dec >> 20))
        minor=$((sdev_in_dec & 0xFFFFF))
        # echo "${major}:${minor}"

        in_fds_sdev+=("${ino}-${major}:${minor}")
done <<< "${inums_raw}"

uniq_pairs=($(printf '%s\n' "${in_fds_sdev[@]}" | sort -u))
echo "Unique target" ${#uniq_pairs[@]}

printf "%-10s %-10s %-6s %s\n" "INUM" "DEV" "COUNT" "TARGET"
for pair in "${uniq_pairs[@]}"
do
        count=$(echo "${in_fds_sdev[@]}" | grep -o "${pair}" | wc -l)
        fd_hex=$(echo ${pair} | cut -d "-" -f 1)
        dev=$(echo ${pair} | cut -d "-" -f 2)
        fd_dec="$((16#${fd_hex}))"

        mount_info=$(grep -w "${dev}" /proc/$PID/mountinfo | head -1)
        if [[ -z $mount_info ]]; then
                echo "Can't find mount info for" $dev
        else
                tmpfs_path=$(echo $mount_info | cut -d " " -f 5)
                # echo $tmpfs_path
                # Stay on this filesystem: inode numbers are only unique per filesystem
                loc=$(find ${tmpfs_path} -xdev -inum ${fd_dec} 2>/dev/null)
                printf "%-10s %-10s %-6s %s\n" "${fd_dec}" "${dev}" "${count}" "${loc}"
        fi
done

References

[^flbit_ino]: Fluentbit error “cannot adjust chunk size” on GKE

[^list_ino]: Listing the files that are being watched by inotify instances
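
[^man_proc]: proc(5), the Linux manual page for the /proc filesystem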

