client stalls when accessing a data dir that is already in use #6348

Merged
jazzyfresh merged 1 commit into master on Sep 24, 2019

Conversation

jazzyfresh
Contributor

@jazzyfresh jazzyfresh commented Sep 18, 2019

Overview

When two clients run with the same data_dir, the second one stalls indefinitely instead of failing.

Behavior

Before: Client setup stalls

Screenshot from 2019-09-18 11-43-20

After: Client setup fails

Screenshot from 2019-09-20 10-14-45
Screenshot from 2019-09-20 10-14-59

Reproduction

Usually, two clients with the same config fail on a port conflict first, but you can reach this stalled state in two ways:

  • Have the clients use different ports but the same data dir
  • Have a Consul service stanza (this puts the client into a weird state, to be documented in another ticket)

Implementation

  • Configure boltdb to use a 5 second timeout when opening the Nomad state store in the data dir
  • Special-case the error message for bolt.ErrTimeout to suggest that another client may be running on the same data_dir

Todo

  • I'm not sure how to improve the error message to indicate that the client tried to access a data dir that was already in use
  • Submit e2e tests in a separate pull request

@jazzyfresh jazzyfresh self-assigned this Sep 18, 2019
@jazzyfresh jazzyfresh added this to the 0.10.1 milestone Sep 18, 2019
Member

@schmichael schmichael left a comment


https://godoc.org/github.com/boltdb/bolt#Options

Unfortunately this doesn't sound like it works on Windows. Still a step in the right direction!

@nickethier
Member

It looks like the client exits from that error here:

    if err := c.setupAgent(config, logger, logOutput, inmem); err != nil {
        logGate.Flush()
        return 1
    }

Perhaps we could log an error after that logGate.Flush() so it shows up at the end. Or, even better, we could move the error that's already logged up into this block so it's logged after the flush.

@notnoop
Contributor

notnoop commented Sep 18, 2019

@nickethier That is an excellent suggestion - I've hit this before.

This is a good incremental improvement for sure. We can follow up by taking a Nomad process file lock with a cross-platform library (e.g. https://github.com/gofrs/flock) early in the initialization process. This can handle Windows and ensures that we never even attempt to start Consul goroutines in the first place.

As for improving the error message, I'd suggest dropping the "failed to create database: " prefix and special-casing the bolt.ErrTimeout error with something more descriptive (e.g. "timed out while opening database; is there another nomad process running?").

@jazzyfresh jazzyfresh merged commit a7c41a5 into master Sep 24, 2019
@jazzyfresh jazzyfresh deleted the b-agent-stalls branch September 24, 2019 23:28
@jazzyfresh jazzyfresh modified the milestones: 0.10.1, 0.10.0 Sep 25, 2019

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 30, 2023