Related: #628
Are you suggesting that unit state is (effectively) stored with a key that's unitname+machineid instead of just unitname?
@jonboulle you got it
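A minimal sketch of the difference between the two key layouts, assuming a hypothetical `/fleet/state` prefix (fleet's real etcd layout may differ). With one key per unit, whichever agent writes last wins; with one key per unit+machine, each agent reporting on the same unit writes to its own key.

```go
package main

import (
	"fmt"
	"path"
)

// keyPrefix is a HYPOTHETICAL namespace used only for illustration.
const keyPrefix = "/fleet/state"

// legacyStateKey stores a single state per unit, so concurrent reporters
// from different machines overwrite each other.
func legacyStateKey(unitName string) string {
	return path.Join(keyPrefix, unitName)
}

// perMachineStateKey scopes the state to the reporting machine, so two
// agents reporting on the same unit never share a key.
func perMachineStateKey(unitName, machineID string) string {
	return path.Join(keyPrefix, unitName, machineID)
}

func main() {
	fmt.Println(legacyStateKey("web.service"))
	fmt.Println(perMachineStateKey("web.service", "abc123"))
	fmt.Println(perMachineStateKey("web.service", "def456"))
}
```

Readers that want "the" state of a unit would then list the unit's directory and combine the per-machine entries.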
Capturing our discussion, my proposal is to just use the existing
@bcwaldon sanity check on this approach?
-if isKeyNotFound(err) {
-	err = nil
+if err != nil && isKeyNotFound(err) {
+	return err
Why are we now returning an error on KeyNotFound?
Any feedback?
-//TODO: Handle the error generated by unmarshal
-unmarshal(resp.Node.Value, &usm)
+if err := unmarshal(resp.Node.Value, &usm); err != nil {
+	log.Errorf("Error unmarshalling UnitState: %v", err)
UnitState(%s)
Added
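A minimal sketch of the suggested log message, assuming the stored value is JSON and using a hypothetical, trimmed-down `unitStateModel`. It uses the standard library's `log.Printf` in place of fleet's `log.Errorf`, and returns `ok=false` rather than a partially filled struct when decoding fails.

```go
package main

import (
	"encoding/json"
	"log"
)

// unitStateModel is a HYPOTHETICAL subset of the state stored in etcd.
type unitStateModel struct {
	LoadState   string `json:"loadState"`
	ActiveState string `json:"activeState"`
	SubState    string `json:"subState"`
	MachineID   string `json:"machineID"`
}

// parseUnitState decodes a stored state; on failure it logs which unit's
// state could not be decoded and reports ok=false.
func parseUnitState(unitName, raw string) (unitStateModel, bool) {
	var usm unitStateModel
	if err := json.Unmarshal([]byte(raw), &usm); err != nil {
		log.Printf("Error unmarshalling UnitState(%s): %v", unitName, err)
		return unitStateModel{}, false
	}
	return usm, true
}

func main() {
	if usm, ok := parseUnitState("web.service", `{"activeState":"active","machineID":"abc123"}`); ok {
		log.Printf("web.service is %s on %s", usm.ActiveState, usm.MachineID)
	}
	parseUnitState("broken.service", `{not json`)
}
```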
LGTM
shipit
This used to be an issue but turned into a PR and now I can't reopen. @jonboulle maybe you should stop using the fancy 'convert issue to PR' thing
fleet does a fine job of reporting unit state in the simple cases, but adding any complexity to the unit lifecycle causes incorrect unit state to be published to etcd.
For example, starting and destroying a single unit results in all states being published properly. Now imagine setting an ExecStop option that takes a long time to finish (e.g. ExecStop=/usr/bin/sleep 10s). Start the unit, unload it, and immediately start it again. If the unit is scheduled to a different machine the second time it is started, an active state will be published, but 10s later an inactive state from the first machine will overwrite it.
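The race above can be sketched with a toy last-writer-wins map standing in for etcd (an assumption; no real etcd calls are made). The reports arrive in the order described: machine B starts the rescheduled unit, then machine A's slow ExecStop finishes. Under a single key the stale `inactive` wins; under per-machine keys both facts survive.

```go
package main

import (
	"fmt"
	"sort"
)

// report is one state update from an agent (hypothetical type).
type report struct {
	Machine string
	State   string
}

// applySingleKey writes every report to the same key, so the last one wins.
func applySingleKey(unit string, reports []report) map[string]string {
	s := map[string]string{}
	for _, r := range reports {
		s[unit] = r.State
	}
	return s
}

// applyPerMachineKey gives each agent its own key for the unit.
func applyPerMachineKey(unit string, reports []report) map[string]string {
	s := map[string]string{}
	for _, r := range reports {
		s[unit+"/"+r.Machine] = r.State
	}
	return s
}

func main() {
	const unit = "slow-stop.service"
	reports := []report{
		{"machine-b", "active"},   // unit rescheduled and started on B
		{"machine-a", "inactive"}, // A's 10s ExecStop finishes afterwards
	}
	fmt.Println("single key:", applySingleKey(unit, reports))

	per := applyPerMachineKey(unit, reports)
	keys := make([]string, 0, len(per))
	for k := range per {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s = %s\n", k, per[k])
	}
}
```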
We need to stop assuming that only one agent can report state for a given unit at any time. It is critical to know whether a unit is still running in some capacity on a node where it shouldn't be.
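Once state is kept per machine, spotting a unit that is still running where it shouldn't be becomes a simple scan. A minimal sketch, assuming a hypothetical map of machine ID to systemd ActiveState and treating anything other than `inactive` or `failed` (for example `deactivating` during a slow ExecStop) as "running in some capacity":

```go
package main

import (
	"fmt"
	"sort"
)

// strayMachines returns, sorted, every machine other than target that still
// reports the unit in a state other than inactive or failed.
func strayMachines(target string, states map[string]string) []string {
	var stray []string
	for machine, state := range states {
		if machine == target {
			continue
		}
		if state != "inactive" && state != "failed" {
			stray = append(stray, machine)
		}
	}
	sort.Strings(stray)
	return stray
}

func main() {
	states := map[string]string{
		"machine-a": "deactivating", // slow ExecStop still running
		"machine-b": "active",       // where the scheduler placed the unit
		"machine-c": "inactive",
	}
	fmt.Println(strayMachines("machine-b", states))
}
```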
We could take systemd's lead here: it lets you visualize processes across a number of systemd-nspawn containers like so: