As the node agent of open-source Kubernetes, Kubelet constantly needs to monitor paths on the node. It does so with inotify, which means Kubelet can leak inotify watchers. Recently I observed a case where Kubelet hung because of enormous inotify usage. This post briefly walks through how I debugged the process and located the problem.
Symptom
Recently, I faced an issue where Kubelet on a node reported errors about failing to create file watchers:
error creating file watcher: too many open files
error creating file watcher: no space left on device
A quick check showed that the node's max_user_watches was set to 10000, yet the total number of inotify watches exceeded this value. (P.S. I'm still not sure how the watch count can go beyond the cap.)
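For reference, these limits can be inspected, and temporarily raised, with sysctl. A minimal sketch (the value 65536 is just an example; raising the cap only buys time if there is a real leak):

# Current per-user inotify limits
sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances
# Temporarily raise the watch limit (not persisted across reboots)
sudo sysctl -w fs.inotify.max_user_watches=65536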
To find out which process held the most watchers, I used the following command[^flbit_ino]:
echo -e "COUNT\tPID\tUSER\tCOMMAND"; sudo find /proc/[0-9]*/fdinfo -type f 2>/dev/null | sudo xargs grep ^inotify 2>/dev/null | cut -d/ -f 3 | uniq -c | sort -nr | { while read -rs COUNT PID; do echo -en "$COUNT\t$PID\t"; ps -p $PID -o user=,command=; done} COUNT PID USER COMMAND 7491 8412 root /home/kubernetes/bin/kubelet --v=2 --cloud-provide=gce --experi 2620 1 root /sbin/init ....
Surprisingly, Kubelet alone held more than 7000 inotify watchers. I suspected an inotify leak in Kubelet.
Leakage check
Clean Kubelet
To better understand the situation, I created a clean cluster on GKE with a single clean node. Roughly 70 inotify watchers were present. I then created a single nginx pod, and the number increased by 3. Theoretically, these 3 watchers are used by Kubelet to monitor changes on the pod's rootfs, its kube-api-access volume, and its PodSandbox.
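To watch that number move while a pod is created or deleted, a quick count of Kubelet's inotify marks is enough. A minimal sketch, assuming the process is literally named kubelet so pgrep can find it:

# Count inotify watch entries held by kubelet
KPID=$(pgrep -o -x kubelet)
sudo find /proc/${KPID}/fdinfo -type f 2>/dev/null | sudo xargs grep '^inotify' 2>/dev/null | wc -l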
But to verify that, we need to look more closely at which inodes Kubelet actually monitors.
Check inotify file descriptors
To do so, let's take a look at how to track a single inotify file descriptor. Inside a process's fdinfo directory under /proc, we can examine each entry to find an inotify fd.
# Find kubelet pid
ps -aux | grep kubelet
KPID=2430

# Pick an example fd
sudo ls /proc/${KPID}/fdinfo
0  1  10  11  12  13  14  2  3  4  5  6  7  8  9 ...

sudo cat /proc/${KPID}/fdinfo/8
pos:    0
flags:  02004000
mnt_id: 15
ino:    1057
inotify wd:1 ino:3f327 sdev:800001 mask:fc6 ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:27f30300e5059ea2
This is quite cryptic, so I relied on man proc to understand each piece of it. For a given fd, the information we need sits in the last line: an inotify entry that represents one file or directory being monitored. The most useful fields are ino:3f327, the inode number of the target file (in hexadecimal), and sdev:800001, the ID of the device the inode sits on (also in hex).
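That sdev value can be split back into the familiar major:minor device pair. A small sketch, assuming the kernel's internal dev_t layout where the major number sits above the 20 minor bits:

# Decode sdev into major:minor
sdev=800001
echo "$(( 0x${sdev} >> 20 )):$(( 0x${sdev} & 0xfffff ))"
# -> 8:1, which lsblk (or /proc/self/mountinfo) maps to /dev/sda1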
Using lsblk, I could see there was only one disk in use on the node, so finding the target file should be easy.
# Cast to decimal
ino=3f327
dec="$((16#${ino}))"

# Find the target file
sudo debugfs -R "ncheck ${dec}" /dev/sda1 2>/dev/null
debugfs 1.46.5 (30-Dec-2021)
Inode   Pathname
258855  /etc/srv/kubernetes/pki/ca-certificates.crt
Putting all the steps above into one script (please see the updated version in the appendix), I can retrieve all the target files, which helps determine whether there is a real leak. I also count how often each unique inode appears, which shows which inodes are monitored multiple times.
cat << 'EOF' | sudo tee -a test.sh
PID=2430   # kubelet pid found above
echo "kubelet pid=${PID}"
in_fds=$(find /proc/${PID}/fdinfo -type f 2>/dev/null | xargs grep ^inotify | cut -d " " -f 3 | cut -d ":" -f 2)
echo ${in_fds}
echo "Count: $(find /proc/${PID}/fdinfo -type f 2>/dev/null | xargs grep ^inotify | wc -l)"
uniq_fds=$(echo "${in_fds[@]}" | sort | uniq)
echo ${uniq_fds}
while read -r element; do
  count=$(echo "${in_fds[@]}" | grep -o "${element}" | wc -l)
  dec="$((16#${element}))"
  loc="debugfs -R 'ncheck ${dec}' /dev/sda1"
  loc=$(eval $loc 2>/dev/null | tail -1 | cut -d " " -f 4)
  printf "%-6s %-10s %-6s %s\n" "${element}" "${dec}" "${count}" "${loc}"
done <<< "${uniq_fds}"
EOF

sudo bash test.sh
kubelet pid=2430
3f327 3f321 ...
Count: 120
1 10b 1259 128a ...
1      1          72     Inode Pathname
10b    267        1      267 /etc/systemd/system/multi-user.target.wants/snapd.service
...
The result consists of the following parts:
- One line with the Kubelet pid
- One line with all target inode numbers (in hex)
- One line with the total number of watchers (120)
- One line with the sorted unique inode numbers
- Then one line per unique inode, showing the hex inode number, its decimal value, how many times it is watched, and the target file path resolved by debugfs (prefixed once more by the decimal inode number).
I ran the same script on the problematic node, and it gave the result below. In summary, most Kubelet watchers were targeting ino:1, with 6649 such entries, which strongly suggested a leak, since there were only about 150 pods on this node. Unfortunately, debugfs couldn't resolve any target files, so the output only showed the meaningless string "Inode Pathname".
kubelet pid=8412
...
Count: 7491
...
1      1          6649   Inode Pathname
Bad apple
Why can't debugfs help anymore? The reason is simple: each pod's containers run against their own root filesystems. This means the watched inodes reside on different filesystems, each with its own independent inode numbering. There are other ways to resolve them, but I chose the most common tool, find with its -inum test, as shown below.
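As a quick sanity check before hunting down the paths, grouping Kubelet's watch entries by their sdev field should surface many distinct device IDs if the ino:1 targets really live on separate filesystems. A rough sketch, reusing the Kubelet pid 8412 from above:

# Tally watch entries per device ID
sudo find /proc/8412/fdinfo -type f 2>/dev/null \
  | sudo xargs grep -h '^inotify' 2>/dev/null \
  | grep -o 'sdev:[0-9a-f]*' | sort | uniq -c | sort -nr | head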
sudo find / -inum 1 2>/dev/null
/home/kubernetes/containerized_mounter/rootfs/dev
/home/kubernetes/containerized_mounter/rootfs/proc
...
/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/pods/5325873d-f2a0-48df-83e2-0b911df2f77f/volumes/kubernetes.io~projected/kube-api-access-227jg
...
/dev
/boot/efi
...
This made things easier: I could simply take the pod IDs from those paths and compare them with the pods actually running, to see whether any terminated pods were still being watched. And it did show that some no-longer-existing pods were still being watched somehow.
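Here is a rough sketch of that comparison. The 36-character UID pattern and the .../pods/<uid>/... path layout are assumptions based on the find output above, and the kubectl call can run from any machine with cluster access:

# Pod UIDs referenced by leftover watch targets
sudo find / -inum 1 2>/dev/null \
  | grep -o 'pods/[0-9a-f-]\{36\}' | cut -d/ -f2 | sort -u > watched_uids.txt

# Pod UIDs the API server still knows about
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.uid}{"\n"}{end}' \
  | sort -u > live_uids.txt

# Watched but no longer alive: likely the leaked watchers
comm -23 watched_uids.txt live_uids.txt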
What to expect next
- [How Kubelet leaked inotify watchers?]()
- [debugfs]()
Appendix
Updated script to get inotify watchers initiated by kubelet
Thanks to yujuhong@ for the momentum and help in finishing the updated script.
PID=$(echo $(ps -aux | grep "/home/kubernetes/bin/kubelet" | head -1) | cut -d " " -f 2)
echo "Kubelet Pid:" ${PID}

inums_raw=$(find /proc/${PID}/fdinfo -type f 2>/dev/null | xargs grep ^inotify)
# echo ${inums_raw}
echo "Count: $(find /proc/${PID}/fdinfo -type f 2>/dev/null | xargs grep ^inotify | wc -l)"

# Collect (inode, major:minor) pairs from every inotify entry.
while read -r line; do
  reg="ino:([0-9a-f]*) sdev:([0-9a-f]*)"
  if [[ ${line} =~ $reg ]]; then
    ino="${BASH_REMATCH[1]}"
    sdev="${BASH_REMATCH[2]}"
    # echo $ino $sdev
  else
    echo "wrong line"
  fi
  # The kernel dev_t keeps the major number above the 20 minor bits.
  sdev_in_dec=$((16#$sdev))
  major=$((sdev_in_dec >> 20))
  minor=$((sdev_in_dec & 0xfffff))
  # echo "${major}:${minor}"
  in_fds_sdev+=("${ino}-${major}:${minor}")
done <<< "${inums_raw}"

# Deduplicate the pairs (one per line so sort/uniq actually work).
uniq_pairs=($(printf "%s\n" "${in_fds_sdev[@]}" | sort | uniq))
echo "Unique target" ${#uniq_pairs[@]}

printf "%-10s %-10s %-6s %s\n" "INUM" "DEV" "COUNT" "TARGET"
for pair in "${uniq_pairs[@]}"; do
  count=$(printf "%s\n" "${in_fds_sdev[@]}" | grep -o "${pair}" | wc -l)
  fd_hex=$(echo ${pair} | cut -d "-" -f 1)
  dev=$(echo ${pair} | cut -d "-" -f 2)
  fd_dec="$((16#${fd_hex}))"
  # Map the device to a mount point via kubelet's mountinfo,
  # then look the inode up inside that mount.
  mount_info=$(grep ${dev} /proc/$PID/mountinfo)
  if [[ -z $mount_info ]]; then
    echo "Can't find mount info for" $dev
  else
    tmpfs_path=$(echo $mount_info | cut -d " " -f 5)
    # echo $tmpfs_path
    loc=$(find ${tmpfs_path} -inum ${fd_dec} 2>/dev/null)
    printf "%-10s %-10s %-6s %s\n" "${fd_dec}" "${dev}" "${count}" "${loc}"
  fi
done
References
[^flbit_ino]: Fluentbit error “cannot adjust chunk size” on GKE
[^list_ino]: Listing the files that are being watched by inotify instances