Changed the way that stuck NFS mounts are handled. #997

Merged · 1 commit into prometheus:master · Jul 14, 2018

Conversation

@mknapphrt (Contributor, Author)

This PR is meant to handle some of the issues mentioned in #868 and #244. What this attempts to do is stop monitoring an NFS mount if it doesn't return from the stat call. For each mount, a channel is created that acts as a lock for the mount point. If, when scraped, the channel can't accept, it means the previous call to stat the mount never returned, so the mount point is skipped over. Once that "stuck" mount point recovers, it will write to the channel and monitoring will resume for that mount point.
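
A very rough sketch of that channel-as-lock idea (illustrative only; lockFor and statMountPoint are hypothetical names, and concurrent access to the map itself is ignored here for brevity):

```go
package collector

import (
	"log"
	"syscall"
)

// One channel of capacity 1 per mount point; a token in the channel means
// "free to stat". NOTE: access to this map would itself need synchronization
// in a concurrently scraped exporter; omitted here to keep the sketch short.
var stuckMountLocks = map[string]chan struct{}{}

func lockFor(mountPoint string) chan struct{} {
	ch, ok := stuckMountLocks[mountPoint]
	if !ok {
		ch = make(chan struct{}, 1)
		ch <- struct{}{} // start out unlocked
		stuckMountLocks[mountPoint] = ch
	}
	return ch
}

// statMountPoint stats the mount point unless a previous stat is still
// pending, in which case the mount is treated as stuck and skipped.
func statMountPoint(mountPoint string) {
	lock := lockFor(mountPoint)
	select {
	case <-lock:
		// The previous stat returned, so it is safe to stat again. This call
		// may still hang on a stuck NFS mount, blocking only this scrape.
		var buf syscall.Statfs_t
		if err := syscall.Statfs(mountPoint, &buf); err != nil {
			log.Printf("statfs on %q failed: %v", mountPoint, err)
		}
		lock <- struct{}{} // release once the call finally returns
	default:
		// No token available: the previous stat never came back. Skip the
		// mount; monitoring resumes once the pending stat returns.
		log.Printf("mount point %q appears stuck, skipping", mountPoint)
	}
}
```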

I know one of the biggest issues with this kind of approach is that the exporter is supposed to be stateless, and this will introduce some state. But I feel the benefit of being able to keep monitoring the mounts is worth it. I'm no expert on Prometheus exporters, though, so any opinions would be appreciated.

Signed-off-by: Mark Knapp mknapp@hudson-trading.com

@SuperQ (Member) commented Jul 11, 2018

At first glance, this doesn't look thread-safe. The exporter needs to be able to handle concurrent requests for metrics.

@mknapphrt (Contributor, Author)

Could you elaborate on why it doesn't look thread-safe? I kept that in mind while I was putting this together, and as far as my understanding goes, channels are thread-safe.

@SuperQ (Member) commented Jul 11, 2018

Normally we do these kinds of things with a mutex; I'm not an expert in channels. /cc @juliusv, what do you think about this?

@mknapphrt (Contributor, Author)

I had originally done it with mutexes, but I wanted to avoid any blocking if possible, which I think can be done using channels instead. If it matches standard practice better, I can create a new PR with mutexes instead.

@SuperQ requested a review from juliusv · July 11, 2018 16:32

@SuperQ (Member) commented Jul 11, 2018

I will take a look at this soon. For now, please add a DCO sign-off (git commit -s) to your commits. The Buildkite error was an intermittent issue with GitHub; retrying.

@juliusv (Member) commented Jul 11, 2018

Yeah, this doesn't seem quite right to me. Besides the philosophical question of introducing such state, there can be two scrapes at the same time, and if they both try to stat the FS at roughly the same time, only one of them will return metrics for it, even though nothing is "stuck".

You'd have to explicitly track statfs calls that are really stuck, i.e. taking longer than some timeout value, and only block those.

But even then, I don't think it's good practice to fail silently and just drop metrics. There should at least be some metric that indicates that a given FS is experiencing collection errors, or something like that.

@mknapphrt (Contributor, Author)

Sorry about the DCO sign-off, I made a quick change on GitHub and didn't do it there.

What would be an appropriate timeout value to use? And the metric isn't dropped completely silently, it's reported as a device error.

@mknapphrt (Contributor, Author)

As far as mutex vs channels, would this approach be better? https://github.com/mknapphrt/node_exporter/blob/stuckmountmutex/collector/filesystem_linux.go

@juliusv (Member) commented Jul 11, 2018

> What would be an appropriate timeout value to use?

Not sure, maybe 30s?

> And the metric isn't dropped completely silently, it's reported as a device error.

Ah sorry, I missed that. Great.

> As far as mutex vs channels, would this approach be better?

Yeah, I don't think we need channels here.

But we should first define, on a high level, how to treat "stuck" FSes before talking about the exact implementation. So let's say one statfs call takes >30s; it would then globally mark that filesystem as stuck, and others would avoid stat-ing it. When would it be marked unstuck, if ever? When the Collect() call that initially marked it stuck does complete after some time? Or would it be marked unstuck unconditionally after some time, even if that particular Collect() is still stuck on stat-ing it? Probably the former? In that case, it could be implemented in the following way:

  • In Collect(), skip devices that are already marked as stuck (in a set of type map[string]struct{} protected by a mutex, both of which unfortunately have to be global because collectors are newly constructed on every scrape).
  • In parallel to doing the statfs call, spin off a watcher goroutine that either terminates without action when the statfs call finishes on time or marks the FS as stuck when it takes too long. (good termination communicated via closed channel, bad via timeout timer channel, and differentiated by select).
  • In the main goroutine, always mark an FS as unstuck after the statfs call has succeeded (delete from stuck set).

There's a race here between the second and third step where you want to ensure that if you detect a timeout, but the statfs call just finishes at that moment and marks it unstuck, you don't then still mark it as stuck, because then it will never be marked unstuck again. That would also have to be coordinated / locked.
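
As an illustration only (not the code that was eventually merged), a rough Go sketch of that scheme might look like this, assuming a 30s timeout and a hypothetical statWithTimeout helper; stuckMounts, stuckMountsMtx, and stuckMountWatcher follow the naming suggested later in the review:

```go
package collector

import (
	"sync"
	"syscall"
	"time"
)

var (
	stuckMounts    = make(map[string]struct{})
	stuckMountsMtx = &sync.Mutex{}
)

const mountTimeout = 30 * time.Second

// stuckMountWatcher marks the mount point as stuck if the statfs call has not
// finished before the timeout. The success channel is closed (under the same
// mutex) once statfs returns.
func stuckMountWatcher(mountPoint string, success chan struct{}) {
	select {
	case <-success:
		// statfs finished in time; nothing to do.
	case <-time.After(mountTimeout):
		stuckMountsMtx.Lock()
		select {
		case <-success:
			// statfs finished just as the timer fired. Do not mark the mount
			// as stuck, or it would never be marked unstuck again.
		default:
			stuckMounts[mountPoint] = struct{}{}
		}
		stuckMountsMtx.Unlock()
	}
}

// statWithTimeout skips mount points already marked as stuck, and otherwise
// stats them while a watcher goroutine enforces the timeout.
func statWithTimeout(mountPoint string) (buf syscall.Statfs_t, ok bool) {
	stuckMountsMtx.Lock()
	if _, stuck := stuckMounts[mountPoint]; stuck {
		stuckMountsMtx.Unlock()
		return buf, false // skip: a previous statfs is still hanging
	}
	stuckMountsMtx.Unlock()

	success := make(chan struct{})
	go stuckMountWatcher(mountPoint, success)

	err := syscall.Statfs(mountPoint, &buf) // may block for a long time

	// Close success and clear the stuck flag under the same mutex the watcher
	// uses, which avoids the race described above.
	stuckMountsMtx.Lock()
	close(success)
	delete(stuckMounts, mountPoint)
	stuckMountsMtx.Unlock()

	return buf, err == nil
}
```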

@mknapphrt (Contributor, Author)

@juliusv Would you recommend making a different PR with those specs or just modifying this one?

@SuperQ (Member) commented Jul 12, 2018

You can just force-push your branch to do any fixups you need, no need for a separate PR.

@mknapphrt (Contributor, Author)

Would it be best to just hardcode the 30 second timeout, or add it as a flag?
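
For illustration, a flag along these lines could be defined with kingpin, the flag library node_exporter uses; the flag name, help text, and default below are hypothetical:

```go
package collector

import "gopkg.in/alecthomas/kingpin.v2"

// Hypothetical flag for the stuck-mount timeout discussed above.
var mountTimeoutFlag = kingpin.Flag(
	"collector.filesystem.mount-timeout",
	"How long to wait for a mount point to respond before treating it as stuck.",
).Default("30s").Duration()
```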

@grobie (Member) commented Jul 12, 2018 via email

Review comment (Member) on:

    )

    var stuckMounts = make(map[string]struct{})
    var mutex = &sync.Mutex{}

Rather than giving it a generic type name, name a mutex after what it is protecting, like stuckMountsMtx.

Review comment (Member) on:

    }
    mutex.Unlock()

    // The success channel is use do tell the "watcher" that the stat

is use -> is used

Review comment (Member) on:

    success := make(chan struct{})
    // Lock is used to ensure that a mount point isn't labelled as stuck
    // even after success as there may be a race condition if a stat call
    // finished at the same time the timeout procs.

"at the same time the timeout procs"... hmm, somehow that sentence fragment doesn't parse for me?

Review comment (Member) on:

    // watcher listens on the given success channel and if the channel closes
    // then the watcher does nothing. If instead the timeout is reached, the
    // mount point that is being watched is marked as stuck.
    func watcher(mountPoint string, success chan struct{}, lock chan struct{}) {

I'd call this stuckMountWatcher or something that makes its purpose clearer.

Review comment (Member) on:

    // Timed out, mark mount as stuck
    mutex.Lock()
    select {
    case <-lock:

I'm not sure we need this second lock channel. Since we are already holding the mutex here, can't we just check again under the mutex that the success channel hasn't been closed yet?

Member: (Though I think that requires moving the closing of the success channel into the mutex-protected section in GetStats() too.)

@mknapphrt (Author): I think you're right. For some reason I had convinced myself of a case where it wouldn't work like that, but I don't remember what it was, and I don't see a way it wouldn't work now.

Review comment (Member) on:

    stuckMountsMtx.Unlock()

    // The success channel is used do tell the "watcher" that the stat
    // finished successfully.The channel is closed on success.

nit: ".The" -> ". The"

Review comment (Member) on:

    stuckMountsMtx.Lock()
    select {
    case <-success:
    //Success came in just after the timeout was reached, don't label the mount as stuck

style nit: add space after "//"

@juliusv (Member) commented Jul 13, 2018

Looks great to me now, aside from the last nits.

@juliusv (Member) commented Jul 13, 2018

@mknapphrt Ah sorry, the DCO check is still not passing because some of the commits don't have a sign-off line. Could you squash them all into one, with a sign-off line?

…turn, it will stop being queried until it returns.

Fixed spelling mistakes.

Update transport_generic.go

Changed to a mutex approach instead of channels and added a timeout before declaring a mount stuck.

Removed unnecessary lock channel and clarified some var names.

Fixed style nits.

Signed-off-by: Mark Knapp <mknapp@hudson-trading.com>

@mknapphrt (Contributor, Author)

How would I go about checking why Buildkite failed?

@SuperQ (Member) commented Jul 14, 2018

Buildkite is failing because it's out of disk space on a couple of the platform builds.

@juliusv (Member) commented Jul 14, 2018

Yeah, I believe we can ignore the Buildkite error here.

👍 Thanks!

@juliusv juliusv merged commit 09b4305 into prometheus:master Jul 14, 2018

@SuperQ (Member) commented Jul 15, 2018

Next up we might want some metrics here to provide a stuck status.

@juliusv (Member) commented Jul 15, 2018

@SuperQ This is already tracked in a node_filesystem_device_error metric with value 1 and FS labels. Is that sufficient?
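
For reference, the exposition for such a failure might look roughly like this (the label values are made up and the HELP text is paraphrased):

```
# HELP node_filesystem_device_error Whether an error occurred while getting statistics for the given device.
# TYPE node_filesystem_device_error gauge
node_filesystem_device_error{device="nfs-server:/export",fstype="nfs",mountpoint="/mnt/data"} 1
```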

@SuperQ (Member) commented Jul 15, 2018

Sounds fine.

oblitorum pushed a commit to shatteredsilicon/node_exporter that referenced this pull request Apr 9, 2024
…turn, it will stop being queried until it returns. (prometheus#997)
