node_exporter is causing high load average when remote shares fail #1259

Closed
kworr opened this issue Feb 11, 2019 · 3 comments

kworr commented Feb 11, 2019

Hello.

Recently we experienced a local network glitch during which some remote NFS shares became unavailable. When that happened, node_exporter started creating extra threads and the host load rose to ~1000. The host was still reachable and no huge latency was observed, though the system logs claimed node_exporter had run out of file sockets (the limit defaults to 1000). Our OS is RHEL7.

During the event, `ls /failed/mount` or `df -h /failed/mount` just hung indefinitely, returning nothing.

It looks like each scrape started a new batch of threads, which queried the filesystems again and got stuck again. Could we have a per-mount mutex that prevents starting another check for that filesystem and just reports empty data instead (see the sketch after the list below)? This would help avoid nasty situations like:

  1. An indefinite wait on NFS mounts when the remote host is not available.
  2. A stuck syscall on fusefs mounts (like SSHFS) when the userland process has already died but the in-kernel part of the mount is stuck and cannot be used until it is cleaned up.
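Roughly something like this (a rough Go sketch of the idea only, not node_exporter's actual code; the `usage` helper, the in-flight map and the one-second timeout are all made up for illustration):

```go
// Sketch: run each statfs in its own goroutine with a deadline, and remember
// which mounts are still "in flight" so later scrapes skip them instead of
// piling up more stuck threads.
package main

import (
	"fmt"
	"sync"
	"time"

	"golang.org/x/sys/unix"
)

var (
	mu       sync.Mutex
	inFlight = map[string]bool{} // mounts whose statfs call has not returned yet
)

// usage returns (used, total, ok). ok is false when the mount is stuck or
// unreadable and the caller should export no data for it.
func usage(mountpoint string, timeout time.Duration) (used, total uint64, ok bool) {
	mu.Lock()
	if inFlight[mountpoint] {
		mu.Unlock()
		return 0, 0, false // previous call still hanging: don't stack another one
	}
	inFlight[mountpoint] = true
	mu.Unlock()

	type result struct {
		st  unix.Statfs_t
		err error
	}
	ch := make(chan result, 1)
	go func() {
		var r result
		r.err = unix.Statfs(mountpoint, &r.st)
		ch <- r
		mu.Lock()
		inFlight[mountpoint] = false // only clear once the syscall actually returns
		mu.Unlock()
	}()

	select {
	case r := <-ch:
		if r.err != nil {
			return 0, 0, false
		}
		bs := uint64(r.st.Bsize)
		return (r.st.Blocks - r.st.Bfree) * bs, r.st.Blocks * bs, true
	case <-time.After(timeout):
		return 0, 0, false // leave the goroutine parked; inFlight blocks retries
	}
}

func main() {
	if used, total, ok := usage("/", time.Second); ok {
		fmt.Printf("/: %d of %d bytes used\n", used, total)
	} else {
		fmt.Println("/: mount is stuck or unreadable, skipping")
	}
}
```

The point is just that a scrape never blocks on a mount whose previous statfs call has not returned yet; it reports nothing for that mount and moves on.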

Thanks in advance.


SuperQ commented Feb 12, 2019

This is why load average is a bad metric. 😄

We fixed this in 0.17.0. There is now a mutex that watches for stuck mounts.

EDIT: I also recommend filtering out NFS and other non-local filesystems in your configuration. They are better monitored from the server side than through the client's view of the filesystem.
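For example, extending the filesystem collector's ignore pattern with `--collector.filesystem.ignored-fs-types=` set to a regexp that also matches `nfs`, `nfs4` and `fuse.sshfs` (flag name from the 0.17.x era; later releases renamed these flags, so check `node_exporter --help` for your version).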

discordianfish commented

@SuperQ Should this have been fixed by #1166? That should effectively mitigate this.


SuperQ commented Feb 26, 2019

Yes, that should also help.

SuperQ closed this as completed Feb 26, 2019