Changed the way that stuck NFS mounts are handled. #997

Merged · 1 commit into prometheus:master · Jul 14, 2018

Conversation

@mknapphrt (Contributor, Author)

This PR is meant to handle some of the issues mentioned in #868 and #244. What this attempts to do is stop monitoring an NFS mount if it doesn't return from the stat call. For each mount, a channel is created that acts as a lock for the mount point. If, when scraped, the channel can't accept, it means the previous call to stat the mount never returned, so the mount point is skipped over. Once that "stuck" mount point recovers, it will write to the channel and monitoring will resume for that mount point.
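
A very rough sketch of that channel-as-lock idea (illustrative only; lockFor and statMountPoint are hypothetical names, and concurrent access to the map itself is ignored here for brevity):

```go
package collector

import (
	"log"
	"syscall"
)

// One channel of capacity 1 per mount point; a token in the channel means
// "free to stat". NOTE: access to this map would itself need synchronization
// in a concurrently scraped exporter; omitted here to keep the sketch short.
var stuckMountLocks = map[string]chan struct{}{}

func lockFor(mountPoint string) chan struct{} {
	ch, ok := stuckMountLocks[mountPoint]
	if !ok {
		ch = make(chan struct{}, 1)
		ch <- struct{}{} // start out unlocked
		stuckMountLocks[mountPoint] = ch
	}
	return ch
}

// statMountPoint stats the mount point unless a previous stat is still
// pending, in which case the mount is treated as stuck and skipped.
func statMountPoint(mountPoint string) {
	lock := lockFor(mountPoint)
	select {
	case <-lock:
		// The previous stat returned, so it is safe to stat again. This call
		// may still hang on a stuck NFS mount, blocking only this scrape.
		var buf syscall.Statfs_t
		if err := syscall.Statfs(mountPoint, &buf); err != nil {
			log.Printf("statfs on %q failed: %v", mountPoint, err)
		}
		lock <- struct{}{} // release once the call finally returns
	default:
		// No token available: the previous stat never came back. Skip the
		// mount; monitoring resumes once the pending stat returns.
		log.Printf("mount point %q appears stuck, skipping", mountPoint)
	}
}
```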

I know one of the biggest issues with this kind of approach is that the exporter is supposed to be stateless, and this will introduce some state. But I feel the benefit of being able to keep monitoring the mounts is worth it. I'm no expert on Prometheus exporters, though, so any opinions would be appreciated.

Signed-off-by: Mark Knapp mknapp@hudson-trading.com

@SuperQ (Member) commented Jul 11, 2018

At first glance, this doesn't look thread-safe. The exporter needs to be able to handle concurrent requests for metrics.

@mknapphrt (Contributor, Author)

Could you elaborate on why it doesn't look thread-safe? I kept that in mind while I was putting this together, and as far as my understanding goes, channels are thread-safe.

@SuperQ (Member) commented Jul 11, 2018

Normally we do these kinds of things with a mutex; I'm not an expert in channels. /cc @juliusv, what do you think about this?

@mknapphrt (Contributor, Author)

I had originally done it with mutexes, but I wanted to avoid any blocking if possible, which I think can be done using channels instead. If it matches standard practice better, I can create a new PR with mutexes instead.

@SuperQ requested a review from juliusv · July 11, 2018 16:32

@SuperQ (Member) commented Jul 11, 2018

I will take a look at this soon. For now, please add a DCO sign-off (git commit -s) to your commits. The Buildkite error was an intermittent issue with GitHub; retrying.

@juliusv (Member) commented Jul 11, 2018

Yeah, this doesn't seem quite right to me. Besides the philosophical question of introducing such state, there can be two scrapes at the same time, and if they both try to stat the FS at roughly the same time, only one of them will return metrics for it, even though nothing is "stuck".

You'd have to explicitly track statfs calls that are really stuck, i.e. taking longer than some timeout value, and only block those.

But even then, I don't think it's good practice to fail silently and just drop metrics. There should at least be some metric that indicates that a given FS is experiencing collection errors, or something like that.

@mknapphrt (Contributor, Author)

Sorry about the DCO sign-off, I made a quick change on GitHub and didn't do it there.

What would be an appropriate timeout value to use? And the metric isn't dropped completely silently, it's reported as a device error.

@mknapphrt (Contributor, Author)

As far as mutex vs channels, would this approach be better? https://github.com/mknapphrt/node_exporter/blob/stuckmountmutex/collector/filesystem_linux.go

@juliusv (Member) commented Jul 11, 2018

> What would be an appropriate timeout value to use?

Not sure, maybe 30s?

> And the metric isn't dropped completely silently, it's reported as a device error.

Ah sorry, I missed that. Great.

> As far as mutex vs channels, would this approach be better?

Yeah, I don't think we need channels here.

But we should first define, on a high level, how to treat "stuck" FSes before talking about the exact implementation. So let's say one statfs call takes >30s; it would then globally mark that filesystem as stuck, and others would avoid stat-ing it. When would it be marked unstuck, if ever? When the Collect() call that initially marked it stuck does complete after some time? Or would it be marked unstuck unconditionally after some time, even if that particular Collect() is still stuck on stat-ing it? Probably the former? In that case, it could be implemented in the following way:

  • In Collect(), skip devices that are already marked as stuck (in a set of type map[string]struct{} protected by a mutex, both of which unfortunately have to be global because collectors are newly constructed on every scrape).
  • In parallel to doing the statfs call, spin off a watcher goroutine that either terminates without action when the statfs call finishes on time or marks the FS as stuck when it takes too long. (good termination communicated via closed channel, bad via timeout timer channel, and differentiated by select).
  • In the main goroutine, always mark an FS as unstuck after the statfs call has succeeded (delete from stuck set).

There's a race here between the second and third step where you want to ensure that if you detect a timeout, but the statfs call just finishes at that moment and marks it unstuck, you don't then still mark it as stuck, because then it will never be marked unstuck again. That would also have to be coordinated / locked.
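
As an illustration only (not the code that was eventually merged), a rough Go sketch of that scheme might look like this, assuming a 30s timeout and a hypothetical statWithTimeout helper; stuckMounts, stuckMountsMtx, and stuckMountWatcher follow the naming suggested later in the review:

```go
package collector

import (
	"sync"
	"syscall"
	"time"
)

var (
	stuckMounts    = make(map[string]struct{})
	stuckMountsMtx = &sync.Mutex{}
)

const mountTimeout = 30 * time.Second

// stuckMountWatcher marks the mount point as stuck if the statfs call has not
// finished before the timeout. The success channel is closed (under the same
// mutex) once statfs returns.
func stuckMountWatcher(mountPoint string, success chan struct{}) {
	select {
	case <-success:
		// statfs finished in time; nothing to do.
	case <-time.After(mountTimeout):
		stuckMountsMtx.Lock()
		select {
		case <-success:
			// statfs finished just as the timer fired. Do not mark the mount
			// as stuck, or it would never be marked unstuck again.
		default:
			stuckMounts[mountPoint] = struct{}{}
		}
		stuckMountsMtx.Unlock()
	}
}

// statWithTimeout skips mount points already marked as stuck, and otherwise
// stats them while a watcher goroutine enforces the timeout.
func statWithTimeout(mountPoint string) (buf syscall.Statfs_t, ok bool) {
	stuckMountsMtx.Lock()
	if _, stuck := stuckMounts[mountPoint]; stuck {
		stuckMountsMtx.Unlock()
		return buf, false // skip: a previous statfs is still hanging
	}
	stuckMountsMtx.Unlock()

	success := make(chan struct{})
	go stuckMountWatcher(mountPoint, success)

	err := syscall.Statfs(mountPoint, &buf) // may block for a long time

	// Close success and clear the stuck flag under the same mutex the watcher
	// uses, which avoids the race described above.
	stuckMountsMtx.Lock()
	close(success)
	delete(stuckMounts, mountPoint)
	stuckMountsMtx.Unlock()

	return buf, err == nil
}
```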

@mknapphrt (Contributor, Author)

@juliusv Would you recommend making a different PR with those specs or just modifying this one?

@SuperQ (Member) commented Jul 12, 2018

You can just force-push your branch to do any fixups you need, no need for a separate PR.

@mknapphrt (Contributor, Author)

Would it be best to just hardcode the 30 second timeout, or add it as a flag?
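
For illustration, a flag along these lines could be defined with kingpin, the flag library node_exporter uses; the flag name, help text, and default below are hypothetical:

```go
package collector

import "gopkg.in/alecthomas/kingpin.v2"

// Hypothetical flag for the stuck-mount timeout discussed above.
var mountTimeoutFlag = kingpin.Flag(
	"collector.filesystem.mount-timeout",
	"How long to wait for a mount point to respond before treating it as stuck.",
).Default("30s").Duration()
```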

@grobie (Member) commented Jul 12, 2018 via email

Review comment (Member) on:

    )

    var stuckMounts = make(map[string]struct{})
    var mutex = &sync.Mutex{}

Rather than giving it a generic type name, name a mutex after what it is protecting, like stuckMountsMtx.

Review comment (Member) on:

    }
    mutex.Unlock()

    // The success channel is use do tell the "watcher" that the stat

is use -> is used

Review comment (Member) on:

    success := make(chan struct{})
    // Lock is used to ensure that a mount point isn't labelled as stuck
    // even after success as there may be a race condition if a stat call
    // finished at the same time the timeout procs.

"at the same time the timeout procs"... hmm, somehow that sentence fragment doesn't parse for me?

Review comment (Member) on:

    // watcher listens on the given success channel and if the channel closes
    // then the watcher does nothing. If instead the timeout is reached, the
    // mount point that is being watched is marked as stuck.
    func watcher(mountPoint string, success chan struct{}, lock chan struct{}) {

I'd call this stuckMountWatcher or something that makes its purpose clearer.

Review comment (Member) on:

    // Timed out, mark mount as stuck
    mutex.Lock()
    select {
    case <-lock:

I'm not sure we need this second lock channel. Since we are already holding the mutex here, can't we just check again under the mutex that the success channel hasn't been closed yet?

Member: (Though I think that requires moving the closing of the success channel into the mutex-protected section in GetStats() too.)

@mknapphrt (Author): I think you're right. For some reason I had convinced myself of a case where it wouldn't work like that, but I don't remember what it was, and I don't see a way it wouldn't work now.

Review comment (Member) on:

    stuckMountsMtx.Unlock()

    // The success channel is used do tell the "watcher" that the stat
    // finished successfully.The channel is closed on success.

nit: ".The" -> ". The"

Review comment (Member) on:

    stuckMountsMtx.Lock()
    select {
    case <-success:
    //Success came in just after the timeout was reached, don't label the mount as stuck

style nit: add space after "//"

@juliusv (Member) commented Jul 13, 2018

Looks great to me now, aside from the last nits.

@juliusv (Member) commented Jul 13, 2018

@mknapphrt Ah sorry, the DCO check is still not passing because some of the commits don't have a sign-off line. Could you squash them all into one, with a sign-off line?

…turn, it will stop being queried until it returns.

Fixed spelling mistakes.

Update transport_generic.go

Changed to a mutex approach instead of channels and added a timeout before declaring a mount stuck.

Removed unnecessary lock channel and clarified some var names.

Fixed style nits.

Signed-off-by: Mark Knapp <mknapp@hudson-trading.com>

@mknapphrt (Contributor, Author)

How would I go about checking why Buildkite failed?

@SuperQ (Member) commented Jul 14, 2018

Buildkite is failing because it's out of disk space on a couple of the platform builds.

@juliusv (Member) commented Jul 14, 2018

Yeah, I believe we can ignore the Buildkite error here.

👍 Thanks!

@juliusv juliusv merged commit 09b4305 into prometheus:master Jul 14, 2018

@SuperQ (Member) commented Jul 15, 2018

Next up we might want some metrics here to provide a stuck status.

@juliusv (Member) commented Jul 15, 2018

@SuperQ This is already tracked in a node_filesystem_device_error metric with value 1 and FS labels. Is that sufficient?
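
For reference, the exposition for such a failure might look roughly like this (the label values are made up and the HELP text is paraphrased):

```
# HELP node_filesystem_device_error Whether an error occurred while getting statistics for the given device.
# TYPE node_filesystem_device_error gauge
node_filesystem_device_error{device="nfs-server:/export",fstype="nfs",mountpoint="/mnt/data"} 1
```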

@SuperQ (Member) commented Jul 15, 2018

Sounds fine.

oblitorum pushed a commit to shatteredsilicon/node_exporter that referenced this pull request Apr 9, 2024
…turn, it will stop being queried until it returns. (prometheus#997)
