Recently we experienced a local network glitch during which some NFS remote shares became unavailable. When that happened, node_exporter started creating extra threads and host load rose to ~1000. The host was still reachable and no huge latency was observed, but the system logs claimed node_exporter was out of file sockets (defaults to 1000). Our OS is RHEL7.
During the event, `ls /failed/mount` or `df -h /failed/mount` just hung indefinitely, returning nothing.
It looks like on each scrape a new batch of threads was started, querying the filesystems again and getting stuck again. Could we have a mutex on a per-mount basis that prevents starting another check for that FS and simply reports empty data? This would help avoid bad issues like:
- Indefinite waits on NFS mounts when the remote host is unavailable.
- Stuck syscalls on FUSE mounts (like SSHFS) when the userland process has already died but the in-kernel part of the mount is stuck and can't be used until cleaned up.
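The per-mount guard described above could be sketched roughly like this in Go (node_exporter's language). This is only an illustration of the idea, not node_exporter's actual code: `mountGuard`, `stat`, and the 100 ms timeout are made-up names and values. A scrape skips any mount whose previous check is still in flight, so a dead NFS server can wedge at most one goroutine per mount instead of one per scrape:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// mountGuard tracks which mount points have a filesystem check still in
// flight, so a stuck mount is probed at most once instead of once per scrape.
type mountGuard struct {
	mu       sync.Mutex
	inFlight map[string]bool
}

func newMountGuard() *mountGuard {
	return &mountGuard{inFlight: make(map[string]bool)}
}

// stat runs statFn for a mount point. If a previous call for the same mount
// is still blocked (e.g. on an unreachable NFS server), it returns false
// immediately so the caller can report empty data for that mount.
func (g *mountGuard) stat(mount string, statFn func(string)) bool {
	g.mu.Lock()
	if g.inFlight[mount] {
		g.mu.Unlock()
		return false // previous check still stuck; skip this mount
	}
	g.inFlight[mount] = true
	g.mu.Unlock()

	done := make(chan struct{})
	go func() {
		statFn(mount) // may block indefinitely on a dead mount
		g.mu.Lock()
		g.inFlight[mount] = false
		g.mu.Unlock()
		close(done)
	}()

	select {
	case <-done:
		return true
	case <-time.After(100 * time.Millisecond):
		return false // timed out; inFlight stays set until statFn returns
	}
}

func main() {
	g := newMountGuard()
	// Healthy mount: the check returns quickly.
	fmt.Println(g.stat("/data", func(string) {}))
	// Stuck mount: the check blocks, the guard times out,
	// and subsequent scrapes skip the mount without blocking.
	block := make(chan struct{})
	fmt.Println(g.stat("/failed/mount", func(string) { <-block }))
	fmt.Println(g.stat("/failed/mount", func(string) { <-block }))
	close(block)
}
```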
Thanks in advance.
We fixed this in 0.17.0. There is now a mutex that watches for stuck mounts.
EDIT: I also recommend filtering out NFS and other non-local filesystems in your configuration. These are better monitored from the server side rather than from the client's filesystem view.
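For anyone looking for the concrete flag: filtering by filesystem type can be done on the node_exporter command line. The exact regex below is just an example; check `node_exporter --help` for your version, since the flag name has changed across releases (older releases use `--collector.filesystem.ignored-fs-types`, newer ones `--collector.filesystem.fs-types-exclude`):

```shell
# Example (adjust the regex to taste): exclude NFS and FUSE mounts
# from the filesystem collector so the client never stats them.
node_exporter --collector.filesystem.ignored-fs-types='^(nfs|nfs4|fuse\..*)$'
```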