Changed the way that stuck NFS mounts are handled. #997
Conversation
At first glance, this doesn't look threadsafe. The exporter needs to be able to handle concurrent simultaneous requests for metrics.
Could you elaborate why it doesn't look thread safe? Because I kept that in mind while I was putting this together and as far as my understanding goes, channels are thread safe.
Normally we do these kinds of things with a mutex, I'm not an expert in channels. /cc @juliusv What do you think about this?
I had originally done it with mutexes, but I wanted to avoid any blocking if possible, which I think can be done using channels instead. If it matches standard practices better I can create a new PR with mutexes instead.
I will take a look at this soon. For now, please DCO sign-off (
Yeah, this doesn't seem quite right to me. Besides philosophical questions of introducing such state, there can be two scrapes at the same time and if they both try to stat the FS at roughly the same time, only one of them will return metrics for it, although nothing is "stuck". You'd have to explicitly track statfs calls that are really stuck, i.e. taking longer than some timeout value, and only block those. But even then I don't think it's a good practice to just fail silently and just drop metrics. There should be at least some metric that indicates that a given FS is experiencing collection errors, or something like that.
Sorry about the DCO signoff, I made a quick change on GitHub and didn't do it there. What would be an appropriate timeout value to use? And the metric isn't dropped completely silently, it's reported as a device error.
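(For context, a collection-error metric of the kind suggested here can be exported as a constant gauge via client_golang. The metric name, labels, and helper below are illustrative assumptions, not quotes from the exporter's source.)

```go
package collector

import "github.com/prometheus/client_golang/prometheus"

// deviceErrorDesc describes a gauge set to 1 when statistics for a
// filesystem could not be collected, e.g. because its mount is stuck.
var deviceErrorDesc = prometheus.NewDesc(
	"node_filesystem_device_error",
	"Whether an error occurred while getting statistics for the given device.",
	[]string{"device", "mountpoint", "fstype"}, nil,
)

// reportDeviceError emits the error sample for one filesystem instead of
// silently dropping its metrics.
func reportDeviceError(ch chan<- prometheus.Metric, device, mountPoint, fsType string) {
	ch <- prometheus.MustNewConstMetric(
		deviceErrorDesc, prometheus.GaugeValue, 1,
		device, mountPoint, fsType,
	)
}
```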
As far as mutex vs channels, would this approach be better? https://github.com/mknapphrt/node_exporter/blob/stuckmountmutex/collector/filesystem_linux.go
Not sure, maybe 30s?
Ah sorry, I missed that. Great.
Yeah, I don't think we need channels here. But we should first define on a high-level how to treat "stuck" FSes before talking about the exact implementation. So let's say one statfs call takes >30s, then it would globally mark that filesystem as stuck, and others would avoid stat-ing it. When would it be marked unstuck, if ever? When the stuck statfs call eventually returns?
There's a race here between the second and third step where you want to ensure that if you detect a timeout, but the statfs call just finishes at that moment and marks it unstuck, you don't then still mark it as stuck, because then it will never be marked unstuck again. That would also have to be coordinated / locked.
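To make that high-level flow concrete, here is a rough sketch of what the collection side could look like, assuming a 30s timeout and hypothetical names (stuckMounts, stuckMountsMtx, stuckMountWatcher) rather than the PR's actual code; the matching stuckMountWatcher is sketched further down in this thread.

```go
package collector

import (
	"sync"

	"golang.org/x/sys/unix"
)

var (
	// stuckMounts records mount points whose statfs call exceeded the timeout.
	stuckMounts    = make(map[string]struct{})
	stuckMountsMtx = &sync.Mutex{}
)

// statWithStuckDetection stats one mount point, skipping it while it is
// marked stuck and clearing the mark once the call eventually returns.
func statWithStuckDetection(mountPoint string) (*unix.Statfs_t, bool) {
	stuckMountsMtx.Lock()
	if _, stuck := stuckMounts[mountPoint]; stuck {
		stuckMountsMtx.Unlock()
		return nil, false // still stuck: report a device/collection error metric instead
	}
	stuckMountsMtx.Unlock()

	// success is closed when statfs returns, telling the watcher to stand down.
	success := make(chan struct{})
	go stuckMountWatcher(mountPoint, success)

	buf := new(unix.Statfs_t)
	err := unix.Statfs(mountPoint, buf)

	stuckMountsMtx.Lock()
	close(success)                  // closed under the mutex so it cannot race the watcher's timeout
	delete(stuckMounts, mountPoint) // the call came back, so unmark the mount
	stuckMountsMtx.Unlock()

	return buf, err == nil
}
```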
@juliusv Would you recommend making a different PR with those specs or just modifying this one?
You can just force-push your branch to do any fixups you need, no need for a separate PR.
Would it be best to just hardcode the 30 second timeout, or add it as a flag?
I'd not add flags unless there are divergent user needs requiring them. The goal should be that exporters work out of the box for everyone.
collector/filesystem_linux.go
Outdated
)

var stuckMounts = make(map[string]struct{})
var mutex = &sync.Mutex{}
Rather than giving it a generic type name, name a mutex after what it is protecting, like stuckMountsMtx.
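For example, applying that suggested rename, the declarations above could read:

```go
var (
	stuckMounts    = make(map[string]struct{})
	stuckMountsMtx = &sync.Mutex{}
)
```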
collector/filesystem_linux.go
Outdated
}
mutex.Unlock()

// The success channel is use do tell the "watcher" that the stat
is use -> is used
collector/filesystem_linux.go
Outdated
success := make(chan struct{})
// Lock is used to ensure that a mount point isn't labelled as stuck
// even after success as there may be a race condition if a stat call
// finished at the same time the timeout procs.
"at the same time the timeout procs"... hmm somehow that sentence fragment doesn't parse for me?
collector/filesystem_linux.go
Outdated
// watcher listens on the given success channel and if the channel closes
// then the watcher does nothing. If instead the timeout is reached, the
// mount point that is being watched is marked as stuck.
func watcher(mountPoint string, success chan struct{}, lock chan struct{}) {
I'd call this stuckMountWatcher or something that makes its purpose clearer.
collector/filesystem_linux.go
Outdated
// Timed out, mark mount as stuck
mutex.Lock()
select {
case <-lock:
I'm not sure we need this second lock channel. Since we are already holding the mutex here, can't we just check again under the mutex that the success channel hasn't been closed yet?
(Though I think that requires moving the closing of the success channel into the mutex-protected section in GetStats() too.)
I think you're right. For some reason I had convinced myself of a case where it wouldn't work like that, but I don't remember what it was and I don't see a way it wouldn't work now.
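For illustration, a watcher along the lines just agreed on, re-checking the success channel under the mutex instead of using a second lock channel, might look like the sketch below (30s timeout, a "time" import, and the stuckMounts/stuckMountsMtx declarations are carried over from the earlier sketch; this is not the PR's final code).

```go
// stuckMountWatcher marks mountPoint as stuck if the statfs call it is
// watching has not finished within the timeout. The caller closes success
// (while holding stuckMountsMtx) as soon as statfs returns.
func stuckMountWatcher(mountPoint string, success chan struct{}) {
	select {
	case <-success:
		// statfs returned in time; nothing to do.
	case <-time.After(30 * time.Second):
		// Timed out. Re-check success under the mutex: statfs may have
		// completed just as the timer fired, and in that case the mount
		// must not be marked stuck, or it would never be unmarked.
		stuckMountsMtx.Lock()
		select {
		case <-success:
			// Success arrived just after the timeout; leave the mount alone.
		default:
			stuckMounts[mountPoint] = struct{}{}
		}
		stuckMountsMtx.Unlock()
	}
}
```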
collector/filesystem_linux.go
Outdated
stuckMountsMtx.Unlock()

// The success channel is used do tell the "watcher" that the stat
// finished successfully.The channel is closed on success.
nit: ".The" -> ". The"
collector/filesystem_linux.go
Outdated
stuckMountsMtx.Lock()
select {
case <-success:
//Success came in just after the timeout was reached, don't label the mount as stuck
style nit: add space after "//"
Looks great to me now besides last nits.
@mknapphrt Ah sorry, the DCO check is still not passing because some of the commits don't have a signoff line. Could you squash it all into one, with signoff line?
…turn, it will stop being queried until it returns. Fixed spelling mistakes. Update transport_generic.go Changed to a mutex approach instead of channels and added a timeout before declaring a mount stuck. Removed unnecessary lock channel and clarified some var names. Fixed style nits. Signed-off-by: Mark Knapp <mknapp@hudson-trading.com>
How would I go about checking why buildkite failed?
Buildkite is failing because it's out of disk space on a couple of the platform builds.
Yeah, I believe we can ignore the buildkite error here. 👍 Thanks!
Next up we might want some metrics here to provide a stuck status.
@SuperQ This is already tracked in a
Sounds fine.
…turn, it will stop being queried until it returns. (prometheus#997) Fixed spelling mistakes. Update transport_generic.go Changed to a mutex approach instead of channels and added a timeout before declaring a mount stuck. Removed unnecessary lock channel and clarified some var names. Fixed style nits. Signed-off-by: Mark Knapp <mknapp@hudson-trading.com>
This PR is meant to handle some of the issues mentioned in #868 and #244. What this attempts to do is stop monitoring an NFS mount if it doesn't return from the stat call. For each mount a channel is created that acts like a lock for the mount point. If, when scraped, the channel can't accept, it means that the previous call to stat the mount never returned, so the mount point is skipped over. Once that "stuck" mount point recovers, it will write to the channel and monitoring will resume for that mount point.
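A rough sketch of that per-mount "channel as lock" idea is given below; the names (mountBusy, statMountGuarded) and exact mechanics are illustrative and do not claim to match this PR's code. Note that the map access itself is unsynchronized here, which is part of the thread-safety concern raised in the review.

```go
package collector

import "golang.org/x/sys/unix"

// mountBusy holds, per mount point, a 1-buffered channel used as a
// non-blocking lock: if a send cannot be accepted, the previous stat call
// for that mount has not returned yet.
var mountBusy = map[string]chan struct{}{}

// statMountGuarded stats mountPoint unless a previous stat call for it is
// still outstanding, in which case the mount is skipped for this scrape.
func statMountGuarded(mountPoint string) (*unix.Statfs_t, bool) {
	busy, ok := mountBusy[mountPoint]
	if !ok {
		busy = make(chan struct{}, 1)
		mountBusy[mountPoint] = busy
	}

	select {
	case busy <- struct{}{}:
		// No stat call outstanding: do the stat. This may block if the
		// mount is stuck, but later scrapes will then skip it.
		buf := new(unix.Statfs_t)
		err := unix.Statfs(mountPoint, buf)
		<-busy // the mount responded: release so monitoring resumes
		return buf, err == nil
	default:
		// The channel can't accept: the previous stat never returned; skip.
		return nil, false
	}
}
```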
I know one of the biggest issues with this kind of approach was that the exporter is supposed to be stateless and this will introduce some state, but I feel the benefits of being able to monitor the mounts are worth it. I'm not a Prometheus exporter expert, though, so any opinions would be appreciated.
Signed-off-by: Mark Knapp mknapp@hudson-trading.com