Manager: a stateful group plugin with leader detection #283
Conversation
Signed-off-by: David Chung <david.chung@docker.com>
Current coverage is 70.24% (diff: 33.33%)

```
@@            master     #283   diff @@
==========================================
  Files           25       25
  Lines         1366     1368     +2
  Methods          0        0
  Messages         0        0
  Branches         0        0
==========================================
  Hits           961      961
- Misses         326      327     +1
- Partials        79       80     +1
```
Please sign your commits following these rules:

```shell
$ git clone -b "stateful-group" git@github.com:chungers/infrakit.git somewhere
$ cd somewhere
$ git rebase -i HEAD~N   # N = number of commits in this PR
# editor opens
# change each 'pick' to 'edit'
# save the file and quit
$ git commit --amend -s --no-edit
$ git rebase --continue  # and repeat the amend for each commit
$ git push -f
```

Amending updates the existing PR. You DO NOT need to open a new one.
@@ -0,0 +1,67 @@

```go
package docker
```
This seems redundant with `infrakit/example/flavor/swarm/docker.go`. Any reason you're not using that code?
This code is meant to be reused (hence in a `util` package). I am planning to change the example swarm flavor to use this instead of including a client in its own example package.
Sounds good to me. Can you do it in this PR?
Updated with versioned package for Docker client and changed the example to use the same client as a library.
## Running

```shell
$ infrakit-manager -h
```
Consider starting with `$ make binaries`. Should `infrakit-manager` be prefixed with `build/`?
Done
```go
const (
	// LeaderFileEnvVar is the environment variable that may be used to customize the plugin leader detection
	LeaderFileEnvVar = "INFRAKIT_LEADER_FILE"
```
Can you move these env variables to command line args? They will be easier to discover if presented in help text.
```go
log.Infoln("Connect to docker", host, "err=", err)
if err != nil {
	log.Error(err)
	os.Exit(1)
```
In a recent patch, I made some tweaks so that returning errors from `RunE` CLI functions works as expected. From now on it should be fine to replace

```go
log.Error(err)
os.Exit(1)
```

with

```go
return err
```
Changed
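The pattern being agreed on can be sketched without cobra; `run` below stands in for a `RunE` body, and the error message is made up for illustration:

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// run plays the role of a cobra RunE function: instead of logging and
// calling os.Exit inside the command body, it returns the error.
func run(fail bool) error {
	if fail {
		return errors.New("cannot connect to docker") // illustrative error
	}
	return nil
}

func main() {
	// One top-level spot turns a returned error into an exit code,
	// which is roughly what cobra does for RunE functions.
	if err := run(false); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("ok")
}
```

Keeping `os.Exit` out of command bodies also makes the functions testable, since a test can inspect the returned error instead of the process dying.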
```go
pollInterval := 5 * time.Second
filename := defaultLeaderFile()
storeDir := defaultStoreDir()
```
Nit: how would you feel about adopting this pattern for var defaults:

```go
var storeDir string
...
cmd.Flags().StringVar(&storeDir, "store-dir", defaultStoreDir(), ...)
```

Personally, I find that easier to follow because the variable is not both the input and the output. This has the added benefit of making it easier to share variable definitions with common defaults.
I personally don't like uninitialized variables and having to look for flag bindings to know what the default values are -- if the value is even bound as a flag at all.
In any case, I will make the change, assuming it's just a convention.
I hear you, and I don't think the design of the cobra API makes it easy to structure well.
Another problem resulting from the initializing approach here is you have two places to check for the default - the variable and flag definition to ensure that it was wired into the default parameter. I've had to correct a few places already where the latter was missing, leading to confusion.
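A minimal sketch of the convention being proposed, using the standard `flag` package for self-containment (cobra's pflag binding looks the same); the default value returned by `defaultStoreDir` here is a made-up example:

```go
package main

import (
	"flag"
	"fmt"
)

// defaultStoreDir returns an assumed default; in InfraKit this would be
// derived from the user's home directory.
func defaultStoreDir() string { return "/var/lib/infrakit" }

// parseStoreDir declares the variable uninitialized and wires the default
// into the flag binding, so there is exactly one place to look for it.
func parseStoreDir(args []string) string {
	var storeDir string
	fs := flag.NewFlagSet("manager", flag.ContinueOnError)
	fs.StringVar(&storeDir, "store-dir", defaultStoreDir(), "directory for state storage")
	_ = fs.Parse(args)
	return storeDir
}

func main() {
	fmt.Println(parseStoreDir(nil))                               // default applies
	fmt.Println(parseStoreDir([]string{"--store-dir", "/tmp/s"})) // flag overrides
}
```

The variable is write-only for the flag machinery and read-only for the rest of the program, which is the separation the comment above argues for.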
```go
func (m *manager) Start() (<-chan struct{}, error) {
	m.lock.Lock()
	if m.running != nil {
		m.lock.Unlock()
```
I'm a bit uneasy with the ad hoc locking starting to show up. This is possibly only worth dealing with later, but it would be nice to establish a straightforward convention for lock management, strongly favoring pairing lock acquisition with a deferred unlock, like the pattern used elsewhere in this file:

```go
func ... {
	m.lock.Lock()
	defer m.lock.Unlock()
}
```

What do you think?
Changed to use an init function that returns true only when the first caller caused initialization of the channel variable. Another concurrent caller of Start won't proceed.
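That change can be sketched as below; the `manager` struct is trimmed to just the fields needed for the pattern, and `initRunning` is an assumed name for the init function described:

```go
package main

import (
	"fmt"
	"sync"
)

// manager is a stand-in with only the fields relevant to the pattern.
type manager struct {
	lock    sync.Mutex
	running chan struct{}
}

// initRunning returns true only for the caller that actually created the
// channel; a concurrent second caller of Start gets false and does not
// proceed. Lock and deferred unlock are paired at the top of the function.
func (m *manager) initRunning() bool {
	m.lock.Lock()
	defer m.lock.Unlock()
	if m.running != nil {
		return false
	}
	m.running = make(chan struct{})
	return true
}

func main() {
	m := &manager{}
	fmt.Println(m.initRunning()) // first caller initializes: true
	fmt.Println(m.initRunning()) // already started: false
}
```

Because the unlock is deferred, every return path releases the lock, which avoids the early-unlock branch that prompted the comment.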
```go
// unwatch without first watching. If the manager crashes and comes back up again,
// there will be no possibility to unwatch existing groups.
m.lock.Lock()
config := m.currentConfig
```
Some functions have access to a function-scoped `GlobalSpec` as well as `m.currentConfig`, which creates potential for bugs from not using the right spec. Given that this is the only read of `m.currentConfig`, and that you could ask the plugin for the list of groups being watched, it seems like you can get away without `currentConfig`.
I agree that querying is better than an in-memory copy; as I noted in another comment, using the in-memory copy to notify the downstream plugins won't work if the manager exits and comes back up. So I am changing to querying the proxied group plugin for now. Note that this isn't the general case, where a global config contains multiple groups backed by different group plugin implementations (instead of just the single one proxied here). Constructing the global spec would then require querying all the group plugins (imagine one that's group-default and another backed by some autoscaling group) to assemble the global collection of groups and their specs.

Anyway, I added a TODO comment and am changing to querying the backend plugin for now.
```go
if err != nil {
	log.Warningln("Error watching group:", spec.ID, "Err=", err)
```
I suggest making this fatal, and I'd argue the same for the 'Already watching' case above, to be conservative against corner cases for now.
Changing to warning. I really don't like to panic or use a fatal, which would cause the process to exit. In this case, if it's already watching there's no adverse effect; Watch should be idempotent, I'd imagine. I think a log or warning should be enough.
That's the problem - `watch` is not idempotent (it does not initiate a rolling update), so in the current behavior this signals a potential problem.
Thinking more on this, while this condition does indicate a state mismatch that the user must reconcile, making this fatal could result in a state that's difficult to reconcile. This is because the next attempt to restart this daemon would try to watch groups, and likely fail again indefinitely.
My personal view on this... if the Group's Watch operation isn't idempotent, then it's the proxied group plugin that should panic, not this layer. This manager's concerns are only:
- persist user's spec for multiple groups
- determine if it's the leader
- activate plugins (including install -- TBD later)
- update / watch on the groups
```go
if usr, err := user.Current(); err == nil {
	home = usr.HomeDir
}
return home
```
Nit: restructure for brevity and greater readability:

```go
if usr, err := user.Current(); err == nil {
	return usr.HomeDir
}
return os.Getenv("HOME")
```
Done
```go
backend := &backend{
	id: "group",
}
```
I think the shared mutable state here makes the code harder to read than it should be. Can you instead convert the `PersistentPostRunE` to a function you call, passing a `backend`?
Done
…example to use the same client code Signed-off-by: David Chung <david.chung@docker.com>
```go
cli.SetLogLevel(logLevel)
},
PersistentPostRunE: func(c *cobra.Command, args []string) error {
	return runMain(backend)
```
Incremental readability improvement on my end: eliminate the `PersistentPostRunE` and have the subcommands call `runMain()`.
Done.
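A rough sketch of the resulting shape; `backend` and `runMain` follow the names in the discussion, but the struct fields and body are assumptions for illustration:

```go
package main

import "fmt"

// backend is a minimal stand-in for the command's shared state (the real
// struct would carry plugin name, poll interval, store directory, etc.).
type backend struct {
	id string
}

// runMain takes the backend as an explicit parameter rather than a variable
// captured by a PersistentPostRunE closure, so the data flow is visible at
// each call site.
func runMain(b *backend) error {
	fmt.Println("running backend:", b.id)
	return nil
}

func main() {
	// Each subcommand calls runMain directly instead of relying on a
	// persistent post-run hook firing after it returns.
	if err := runMain(&backend{id: "group"}); err != nil {
		fmt.Println("error:", err)
	}
}
```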
```go
group_rpc.PluginServer(manager),
func() error {
	log.Infoln("Stopping manager")
	manager.Stop()
```
Consider moving this to after `<-stopped` below for better readability.
LGTM once outstanding comments are addressed. Let's make sure to not slip on tests - there should be a quick follow-up.
Signed-off-by: David Chung <david.chung@docker.com>
This PR combines the earlier PRs for state storage and leader detection to create a manager (entity + cmd binary). The manager offers a Group interface and can record state in a swarm or a local file.

There's a new executable, `infrakit-manager`, that can be started as a point of contact for users. This binary handles leader detection. See `cmd/manager/README.md` for details.

Signed-off-by: David Chung david.chung@docker.com