Scope unit state by machine #638

jonboulle · 2014-07-18T22:25:40Z

fleet does a fine job at reporting unit state in the simple cases, but adding any complexity to the unit lifecycle causes mis-publishing of unit state into etcd.

For example, starting and destroying a single unit results in all states being published properly. Now imagine setting an ExecStop option that takes a long time to finish (e.g. /usr/bin/sleep 10s). Start the unit, unload it, and immediately start it again. If the unit is scheduled to a different machine the second time it is started, an active state will be published, but 10s later an inactive state from the first machine will overwrite it.
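To make the overwrite concrete, here is a minimal Go sketch of the scenario. It is not fleet's Registry code: the UnitState struct, the two map-backed registries, and the machine IDs machine-a and machine-b are all hypothetical stand-ins, and an in-memory map takes the place of etcd.

```go
package main

import "fmt"

// UnitState is a simplified, hypothetical stand-in for fleet's unit state.
type UnitState struct {
	MachineID   string
	ActiveState string
}

// flatRegistry keys state by unit name only, so reports from different
// machines overwrite one another.
type flatRegistry map[string]UnitState

func (r flatRegistry) Save(unit string, s UnitState) { r[unit] = s }

// scopedRegistry keys state by unit name and then machine ID, so each
// machine's report is kept separately.
type scopedRegistry map[string]map[string]UnitState

func (r scopedRegistry) Save(unit string, s UnitState) {
	if r[unit] == nil {
		r[unit] = make(map[string]UnitState)
	}
	r[unit][s.MachineID] = s
}

// replay applies the two reports in the order described: the new machine
// reports active first, then the old machine's slow ExecStop finishes and
// it reports inactive.
func replay(save func(string, UnitState)) {
	save("foo.service", UnitState{MachineID: "machine-b", ActiveState: "active"})
	save("foo.service", UnitState{MachineID: "machine-a", ActiveState: "inactive"})
}

func main() {
	flat := flatRegistry{}
	replay(flat.Save)
	fmt.Println("flat:", flat["foo.service"].ActiveState) // flat: inactive

	scoped := scopedRegistry{}
	replay(scoped.Save)
	fmt.Println("machine-a:", scoped["foo.service"]["machine-a"].ActiveState) // machine-a: inactive
	fmt.Println("machine-b:", scoped["foo.service"]["machine-b"].ActiveState) // machine-b: active
}
```

With the flat layout the late inactive report wins, even though the unit is running on the second machine. With state scoped by machine, both reports survive and a reader can tell which machine each one came from.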

We need to stop treating a unit as if only one agent can possibly report state for it at a given time. It is incredibly important to know if a unit is still running in some capacity on a node when it shouldn't be.

We can take systemd's lead here. It allows you to visualize processes across a bunch of systemd-nspawn containers like so:

$ systemctl --recursive list-units fleet.service
UNIT                 LOAD   ACTIVE SUB     DESCRIPTION
fleet.service        loaded active running fleet
smoke0:fleet.service loaded active running fleet.service
smoke1:fleet.service loaded active running fleet.service
smoke2:fleet.service loaded active running fleet.service
smoke3:fleet.service loaded active running fleet.service

bcwaldon · 2014-07-09T22:54:29Z

Related: #628

jonboulle · 2014-07-09T23:01:57Z

Are you suggesting that unit state is (effectively) stored with a key that's unitname+machineid instead of just unitname?

bcwaldon · 2014-07-09T23:02:08Z

@jonboulle you got it

jonboulle · 2014-07-17T23:47:18Z

Capturing our discussion: my proposal is to just use the existing MACHINE field for this duplication, rather than the prefix, to ensure that a unit named foo:bar.service cannot be confused with unit bar.service running on machine foo.
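A small Go sketch of why the prefix form is ambiguous and a separate machine field is not. The row type and the prefixed helper are hypothetical and exist only for this illustration; they are not fleet or fleetctl code.

```go
package main

import "fmt"

// row is a hypothetical list-units entry with the machine kept in its own field.
type row struct {
	Unit    string
	Machine string
}

// prefixed renders a row in the "machine:unit" style, folding the machine
// into the unit name.
func prefixed(r row) string {
	if r.Machine == "" {
		return r.Unit
	}
	return r.Machine + ":" + r.Unit
}

func main() {
	// A unit literally named "foo:bar.service" with no machine recorded...
	a := row{Unit: "foo:bar.service"}
	// ...and "bar.service" running on machine "foo".
	b := row{Unit: "bar.service", Machine: "foo"}

	fmt.Println(prefixed(a) == prefixed(b)) // true: the prefix form collides
	fmt.Println(a == b)                     // false: separate fields stay distinct
}
```

Keeping the machine in its own field means no parsing of the unit name is needed to recover it, and unit names that contain a colon remain unambiguous.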

jonboulle · 2014-07-18T22:28:42Z

@bcwaldon sanity check on this approach?
(first effort at getting Registry test coverage above measly single digits!)

bcwaldon · 2014-07-18T22:33:50Z

registry/unit_state.go

-	if isKeyNotFound(err) {
-		err = nil
+	if err != nil && isKeyNotFound(err) {
+		return err


Why are we now returning an error on KeyNotFound?

jonboulle · 2014-07-20T08:21:05Z

Any feedback?

bcwaldon · 2014-07-20T16:00:44Z

registry/unit_state.go

-	//TODO: Handle the error generated by unmarshal
-	unmarshal(resp.Node.Value, &usm)
+	if err := unmarshal(resp.Node.Value, &usm); err != nil {
+		log.Errorf("Error unmarshalling UnitState: %v", err)


UnitState(%s)

bcwaldon · 2014-07-20T16:04:28Z

LGTM

bcwaldon · 2014-07-20T17:17:33Z

shipit

bcwaldon · 2014-07-28T18:44:44Z

This used to be an issue but turned into a PR and now I can't reopen. @jonboulle maybe you should stop using the fancy 'convert issue to PR' thing

bcwaldon added refactor labels Jul 9, 2014

sukrit007 mentioned this pull request Jul 10, 2014

Serialize systemd jobs properly #646

Closed

This was referenced Jul 11, 2014

Move unit event generation into fleet #651

Merged

fleetctl list-units does not expose necessary data #663

Closed

Expose target machine/state fields in list-units #664

Merged

jonboulle mentioned this pull request Jul 17, 2014

Supporting changes to prepare for global units #679

Closed

6 tasks

jonboulle changed the title ~~Unit state mismanagement~~ Scope unit state by machine Jul 17, 2014

jonboulle self-assigned this Jul 17, 2014

bcwaldon reviewed Jul 18, 2014

jonboulle mentioned this pull request Jul 18, 2014

Compare existing UnitState before removing from Registry #465

Closed

bcwaldon reviewed Jul 20, 2014

registry: dual-publish UnitStates and add associated tests

413da7f

jonboulle added a commit that referenced this pull request Jul 20, 2014

Merge pull request #638 from jonboulle/638

0bd72d8

Scope unit state by machine

jonboulle merged commit 0bd72d8 into coreos:master Jul 20, 2014

jonboulle deleted the 638 branch July 20, 2014 17:23

bcwaldon mentioned this pull request Jul 28, 2014

Scope unit state by machine #722

Closed
