This repository has been archived by the owner on Jan 30, 2020. It is now read-only.

agent should recover from local state before starting reconciliation #993

Open
jonboulle opened this issue Oct 22, 2014 · 2 comments

Comments

@jonboulle
Contributor

As touched on in #720 and #866, the agent does not truly recover its state on start-up before it starts reconciling:

core-01 ~ # systemctl kill -s SIGKILL fleet
core-01 ~ # Oct 21 23:03:30 core-01 systemd[1]: fleet.service: main process exited, code=killed, status=9/KILL
Oct 21 23:03:30 core-01 systemd[1]: Unit fleet.service entered failed state.

core-01 ~ # systemctl status foo
● foo.service
   Loaded: loaded (/run/fleet/units/foo.service; linked-runtime)
   Active: active (running) since Tue 2014-10-21 22:54:43 UTC; 8min ago
 Main PID: 1864 (sleep)
   CGroup: /system.slice/foo.service
           └─1864 /bin/sleep 999999999

Oct 21 22:58:17 core-01 systemd[1]: Started foo.service.

core-01 ~ # Oct 21 23:03:40 core-01 systemd[1]: fleet.service holdoff time over, scheduling restart.
Oct 21 23:03:40 core-01 systemd[1]: Stopping fleet daemon...
Oct 21 23:03:40 core-01 systemd[1]: Starting fleet daemon...
Oct 21 23:03:40 core-01 systemd[1]: Started fleet daemon.
Oct 21 23:03:40 core-01 fleetd[1978]: INFO fleet.go:58: Starting fleet version 0.8.3+git
Oct 21 23:03:40 core-01 fleetd[1978]: INFO fleet.go:162: No provided or default config file found - proceeding without
Oct 21 23:03:40 core-01 fleetd[1978]: INFO server.go:153: Establishing etcd connectivity
Oct 21 23:03:40 core-01 fleetd[1978]: INFO server.go:164: Starting server components
Oct 21 23:03:40 core-01 fleetd[1978]: INFO manager.go:262: Writing systemd unit foo.service (41b)
Oct 21 23:03:40 core-01 fleetd[1978]: INFO manager.go:198: Instructing systemd to reload units
Oct 21 23:03:40 core-01 fleetd[1978]: INFO reconcile.go:309: AgentReconciler completed task: type=LoadUnit job=foo.service reason="unit scheduled here but not loaded"
Oct 21 23:03:40 core-01 fleetd[1978]: INFO manager.go:134: Triggered systemd unit foo.service start: job=8409
Oct 21 23:03:40 core-01 fleetd[1978]: INFO reconcile.go:309: AgentReconciler completed task: type=StartUnit job=foo.service reason="unit currently loaded but desired state is launched"

core-01 ~ # systemctl status foo.service
● foo.service
   Loaded: loaded (/run/fleet/units/foo.service; linked-runtime)
   Active: active (running) since Tue 2014-10-21 22:54:43 UTC; 10min ago
 Main PID: 1864 (sleep)
   CGroup: /system.slice/foo.service
           └─1864 /bin/sleep 999999999

Oct 21 22:58:17 core-01 systemd[1]: Started foo.service.
Oct 21 23:03:40 core-01 systemd[1]: Started foo.service.

It just so happens that the LoadUnit/StartUnit operations are idempotent, so foo.service is not interrupted and there is an illusion of continuity. But really, we should not be invoking LoadUnit/StartUnit and writing the unit to disk again.
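To make the difference concrete, here is a minimal sketch of the idea, not fleet's actual code. The types `unitState` and `task` and the functions `recoverLocalState` and `reconcile` are hypothetical names invented for illustration; the only things taken from the logs above are the task types and their reason strings. It assumes the agent can obtain the set of units systemd already has loaded and running, and seeds its view from that before the first reconciliation pass.

```go
package main

import "fmt"

// unitState is a hypothetical, simplified view of one unit on this machine.
type unitState struct {
	Loaded bool
	Active bool
}

// task mirrors the fields visible in the AgentReconciler log lines.
type task struct {
	Type   string
	Unit   string
	Reason string
}

// recoverLocalState stands in for querying local state (for example, the
// units systemd already knows about) and copying it into the agent's view.
// How that state is actually obtained is an assumption left out of the sketch.
func recoverLocalState(local map[string]unitState) map[string]unitState {
	view := make(map[string]unitState, len(local))
	for name, st := range local {
		view[name] = st
	}
	return view
}

// reconcile compares the units that should be launched here against the
// agent's current view and returns the tasks needed to converge.
func reconcile(desired []string, current map[string]unitState) []task {
	var tasks []task
	for _, name := range desired {
		st := current[name] // zero value if the agent has never seen the unit
		if !st.Loaded {
			tasks = append(tasks, task{"LoadUnit", name, "unit scheduled here but not loaded"})
		}
		if !st.Active {
			tasks = append(tasks, task{"StartUnit", name, "unit currently loaded but desired state is launched"})
		}
	}
	return tasks
}

func main() {
	desired := []string{"foo.service"}
	local := map[string]unitState{"foo.service": {Loaded: true, Active: true}}

	// Restart without recovery: the agent's view is empty, so it redoes the work.
	fmt.Println("without recovery:", len(reconcile(desired, map[string]unitState{})), "tasks")

	// Restart with recovery: the view matches reality, so nothing is scheduled.
	fmt.Println("with recovery:", len(reconcile(desired, recoverLocalState(local))), "tasks")
}
```

With an empty view the sketch schedules both LoadUnit and StartUnit, matching the log above; with the recovered view it schedules nothing, which is the behaviour this issue asks for.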

@jonboulle
Contributor Author

Relatedly:

  • we should start tracking the unit state of these units at the point of recovery; right now, we only add them to the set of subscribed units because we happen to call LoadUnit on them again
  • ideally we could leverage the unit state that the UnitStateGenerator/UnitStatePublisher collect and cache, rather than fetching it anew in each reconciliation.
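Both points could be served by one shared structure. The following is a rough sketch under stated assumptions: `unitStateCache` and its `Subscribe`, `Update`, and `Get` methods are hypothetical and are not the real UnitStateGenerator/UnitStatePublisher API. The idea is that recovery subscribes units explicitly, the state collector writes into the cache, and the reconciler reads from it instead of fetching state anew.

```go
package main

import (
	"fmt"
	"sync"
)

// unitStateCache is a hypothetical cache shared by the state collector
// (writer) and the reconciler (reader).
type unitStateCache struct {
	mu     sync.RWMutex
	states map[string]string // unit name -> last observed active state
}

func newUnitStateCache() *unitStateCache {
	return &unitStateCache{states: make(map[string]string)}
}

// Subscribe starts tracking a unit. Calling it at recovery time makes the
// subscription explicit rather than a side effect of calling LoadUnit again.
func (c *unitStateCache) Subscribe(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.states[name]; !ok {
		c.states[name] = "" // tracked, state not yet observed
	}
}

// Update records the latest state of a subscribed unit and reports whether
// the value was stored. Updates for untracked units are ignored.
func (c *unitStateCache) Update(name, active string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.states[name]; !ok {
		return false
	}
	c.states[name] = active
	return true
}

// Get returns the cached state and whether the unit is tracked at all.
func (c *unitStateCache) Get(name string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	st, ok := c.states[name]
	return st, ok
}

func main() {
	cache := newUnitStateCache()

	// At recovery: subscribe every unit found in local state.
	for _, name := range []string{"foo.service"} {
		cache.Subscribe(name)
	}

	// The state collector publishes what it sees.
	cache.Update("foo.service", "active")
	cache.Update("bar.service", "active") // not subscribed, so ignored

	// The reconciler reads the cached value instead of querying again.
	st, tracked := cache.Get("foo.service")
	fmt.Println(st, tracked)
}
```

A read-write mutex is used here because, in this sketch, the collector and reconciler would run in separate goroutines; whether that matches the agent's real concurrency model is an assumption.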

@bcwaldon
Contributor

Introducing more of a startup process for the Agent could address #1003.
