Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-connect to Nomad if connection is lost #134

Merged
merged 4 commits into from
May 20, 2020
Merged

Re-connect to Nomad if connection is lost #134

merged 4 commits into from
May 20, 2020

Conversation

lgfa29
Copy link
Contributor

@lgfa29 lgfa29 commented May 20, 2020

This PR fixes an issue where the Autoscaler would fail to re-connect to the Nomad server if the connection was dropped.

The main cause for this issue is that the policyIDsErrCh was not being consumed, and so the goroutine monitoring the list of policy IDs would block in case of an error.

I also uncovered some more subtle ways that the Autoscaler could fail to re-connect properly:

  • policy handlers would exit in case of connection error and never be re-created.
  • the policy list ID monitor would silently continue if the Autoscler re-connected to a Nomad cluster where the list of policies were different but the modify index had the same value.

To fix these problems I made a few assumptions:

  • Policy handlers only exit when their context is closed. They will keep looping even if an error occurs.
  • The policy manager is responsible for stopping handlers in case of an unrecoverable errors.
  • The policy manager will re-run if it exists until its context is closed.

Closes #119

@lgfa29 lgfa29 added this to the 0.0.2 milestone May 20, 2020
@lgfa29 lgfa29 requested review from cgbaker and jrasell May 20, 2020 02:46
@lgfa29 lgfa29 self-assigned this May 20, 2020
Copy link
Member

@jrasell jrasell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.
One question about the change in the MonitorIDs func that I am interested to understand.

policy/nomad/source.go Outdated Show resolved Hide resolved
policy/manager.go Show resolved Hide resolved
policy/handler.go Show resolved Hide resolved
Copy link
Contributor

@cgbaker cgbaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cursory examination is all i have bandwidth for, but this looks good

@lgfa29 lgfa29 merged commit e6614b8 into master May 20, 2020
@lgfa29 lgfa29 deleted the f-gh-119 branch May 20, 2020 22:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Autoscaler fails to re-connect to Nomad cluster
3 participants