This repository has been archived by the owner on Jan 30, 2020. It is now read-only.

fleet agent does not compare contents of units in reconciler #866

Merged
jonboulle merged 2 commits into coreos:master from 866_agent_reconciler
Oct 24, 2014

Conversation

jonboulle
Contributor

  1. fleetctl start a unit
  2. wait for agent to pick it up
  3. fleetctl destroy unit
  4. quickly fleetctl start unit with different contents

Assuming the unit gets scheduled back to the same host (easily reproducible using global units), the fleet agent will neither deploy the new contents nor restart the service.

@bcwaldon bcwaldon added the bug label Sep 4, 2014
@bcwaldon bcwaldon added this to the v0.8.1 milestone Sep 4, 2014
@jonboulle
Contributor

Oh yes. Might be time to tweak the UnitManager interface: GetUnitStates() already exposes this information, but Units() needs to be reworked (or gutted).

@jonboulle
Contributor

Naming/structs are a little awkward and can probably be improved.

@bcwaldon
Contributor Author

LGTM

@jonboulle
Contributor

Well that was easy.

@bcwaldon
Contributor Author

@jonboulle wait, how does this work with #720?

@jonboulle
Contributor

@bcwaldon Not particularly well; I imagine it will forcibly unload/reload the units. Any ideas?

@bcwaldon
Contributor Author

Yeah, fails as predicted:

Oct 20 18:41:05 core-01 systemd[1]: Starting fleet daemon...
Oct 20 18:41:05 core-01 systemd[1]: Started fleet daemon.
Oct 20 18:41:05 core-01 fleetd[1215]: INFO fleet.go:42: Starting fleet version 0.8.3+git
Oct 20 18:41:05 core-01 fleetd[1215]: INFO fleet.go:146: No provided or default config file found - proceeding without
Oct 20 18:41:05 core-01 fleetd[1215]: INFO server.go:137: Establishing etcd connectivity
Oct 20 18:41:05 core-01 fleetd[1215]: INFO server.go:148: Starting server components
Oct 20 18:41:05 core-01 fleetd[1215]: INFO engine.go:170: Engine leadership acquired
Oct 20 18:41:17 core-01 fleetd[1215]: INFO engine.go:256: Scheduled Unit(hello.service) to Machine(590993edfe3c4ebfa8a1013b6bbdcd13)
Oct 20 18:41:17 core-01 fleetd[1215]: INFO reconciler.go:147: EngineReconciler completed task: {Type: AttemptScheduleUnit, JobName: hello.service, MachineID: 590993edfe3c4ebfa8a1013b6bbdcd13, Reason: "target state launched and unit not scheduled"}
Oct 20 18:41:18 core-01 fleetd[1215]: INFO manager.go:220: Writing systemd unit hello.service (119b)
Oct 20 18:41:18 core-01 fleetd[1215]: INFO reconcile.go:293: AgentReconciler completed task: type=LoadUnit job=hello.service reason="unit scheduled here but not loaded"
Oct 20 18:41:18 core-01 fleetd[1215]: INFO manager.go:80: Triggered systemd unit hello.service start: job=2373
Oct 20 18:41:18 core-01 fleetd[1215]: INFO reconcile.go:293: AgentReconciler completed task: type=StartUnit job=hello.service reason="unit currently loaded but desired state is launched"
Oct 20 18:41:44 core-01 systemd[1]: fleet.service: main process exited, code=killed, status=9/KILL
Oct 20 18:41:44 core-01 systemd[1]: Unit fleet.service entered failed state.
Oct 20 18:41:44 core-01 systemd[1]: Starting fleet daemon...
Oct 20 18:41:54 core-01 systemd[1]: fleet.service holdoff time over, scheduling restart.
Oct 20 18:41:54 core-01 systemd[1]: Stopping fleet daemon...
Oct 20 18:41:54 core-01 systemd[1]: Starting fleet daemon...
Oct 20 18:41:54 core-01 systemd[1]: Started fleet daemon.
Oct 20 18:41:54 core-01 fleetd[1285]: INFO fleet.go:42: Starting fleet version 0.8.3+git
Oct 20 18:41:54 core-01 fleetd[1285]: INFO fleet.go:146: No provided or default config file found - proceeding without
Oct 20 18:41:54 core-01 fleetd[1285]: INFO server.go:137: Establishing etcd connectivity
Oct 20 18:41:54 core-01 fleetd[1285]: INFO server.go:148: Starting server components
Oct 20 18:41:54 core-01 fleetd[1285]: INFO engine.go:170: Engine leadership acquired
Oct 20 18:41:54 core-01 fleetd[1285]: INFO manager.go:91: Triggered systemd unit hello.service stop: job=2689
Oct 20 18:41:54 core-01 fleetd[1285]: INFO manager.go:233: Removing systemd unit hello.service
Oct 20 18:41:54 core-01 fleetd[1285]: INFO reconcile.go:293: AgentReconciler completed task: type=UnloadUnit job=hello.service reason="unit loaded but hash differs to expected"
Oct 20 18:41:59 core-01 fleetd[1285]: INFO manager.go:220: Writing systemd unit hello.service (119b)
Oct 20 18:41:59 core-01 fleetd[1285]: INFO reconcile.go:293: AgentReconciler completed task: type=LoadUnit job=hello.service reason="unit scheduled here but not loaded"
Oct 20 18:41:59 core-01 fleetd[1285]: INFO manager.go:80: Triggered systemd unit hello.service start: job=2690
Oct 20 18:41:59 core-01 fleetd[1285]: INFO reconcile.go:293: AgentReconciler completed task: type=StartUnit job=hello.service reason="unit currently loaded but desired state is launched"

We obviously need to address #720 before this PR can merge. I think the most correct thing we can do here is to stop caching unit hashes in the UnitManager and pull them directly from the filesystem.

@jonboulle
Contributor

@bcwaldon Sounds expensive. How about just doing that on startup?

@bcwaldon
Contributor Author

@jonboulle That would be fine. Clearly someone could change the unit file in /var/run/fleet/units while fleetd is running, but that is not exactly a supported operation in the first place.

@bcwaldon bcwaldon modified the milestones: v0.9.0, v0.8.4 Oct 20, 2014
@bcwaldon
Contributor Author

@jonboulle code in #987 addresses the problem

@bcwaldon
Contributor Author

@jonboulle good to go here

This incorporates hashes into the decision when agent reconciler is
calculating what tasks should be performed.
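
To illustrate what the commit message describes, here is a simplified sketch of a task calculation that takes the unit hash into account. The types `desiredUnit`, `loadedUnit`, and `task` and the function `calculateTasks` are hypothetical and do not mirror fleet's actual structs. Three of the reason strings are copied from the journal output earlier in this thread; the "not scheduled here" reason is invented for completeness.

```go
package main

import "fmt"

// desiredUnit is a hypothetical view of what the cluster wants on this machine.
type desiredUnit struct {
	Hash     string
	Launched bool
}

// loadedUnit is a hypothetical view of what is currently loaded locally.
type loadedUnit struct {
	Hash   string
	Active bool
}

// task describes one action the reconciler should perform.
type task struct {
	Type, Unit, Reason string
}

// calculateTasks compares desired and current state for a single unit.
// A hash mismatch yields only an unload; the next pass then sees the unit
// as not loaded and schedules the load and start.
func calculateTasks(name string, want *desiredUnit, have *loadedUnit) []task {
	if want == nil {
		if have != nil {
			// Reason string invented for this sketch.
			return []task{{"UnloadUnit", name, "unit loaded but not scheduled here"}}
		}
		return nil
	}
	if have != nil && have.Hash != want.Hash {
		return []task{{"UnloadUnit", name, "unit loaded but hash differs to expected"}}
	}
	var tasks []task
	active := false
	if have == nil {
		tasks = append(tasks, task{"LoadUnit", name, "unit scheduled here but not loaded"})
	} else {
		active = have.Active
	}
	if want.Launched && !active {
		tasks = append(tasks, task{"StartUnit", name, "unit currently loaded but desired state is launched"})
	}
	return tasks
}

func main() {
	want := &desiredUnit{Hash: "new-contents", Launched: true}

	// Pass 1: the old contents are still loaded, so only an unload is issued.
	for _, t := range calculateTasks("hello.service", want, &loadedUnit{Hash: "old-contents", Active: true}) {
		fmt.Printf("type=%s job=%s reason=%q\n", t.Type, t.Unit, t.Reason)
	}
	// Pass 2: nothing is loaded any more, so the unit is loaded and started.
	for _, t := range calculateTasks("hello.service", want, nil) {
		fmt.Printf("type=%s job=%s reason=%q\n", t.Type, t.Unit, t.Reason)
	}
}
```

Without the hash comparison, the first pass would see a loaded, active unit with the right name and issue no tasks at all, which is the behaviour reported at the top of this thread.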
@bcwaldon
Contributor Author

@jonboulle I see you pushed a new commit. Anything notable?

@jonboulle
Contributor

No, just rebasing. I haven't tested to my satisfaction yet. The behaviour I described in #720 is still the case and I don't think it should be, but that is something to address elsewhere.

@bcwaldon
Contributor Author

@jonboulle We are still setting the current states to inactive on startup, if that's what you're referencing. Unless there's something else, I suggest we file that as a separate bug and move on.

@jonboulle
Contributor

Yes pretty much - we are still loading the unit again when we shouldn't be. Will chase it up tomorrow.

@jonboulle
Contributor

#993

@jonboulle
Contributor

@bcwaldon updated to optimistically load/launch units

	filter.Add(u)
}

units, err := a.um.GetUnitStates(filter)
Contributor

Rather than stealing the variable units, can we just call this states?

Contributor Author

Then we're stealing the variable states. Naming is hard.

Contributor

Oh, I didn't see that. systemdStates?

	filter.Add(u)
}

units, err := a.um.GetUnitStates(filter)
if err != nil {
	return nil, fmt.Errorf("failed fetching unit states from UnitManager: %v", err)
Contributor

Is it necessary to add this additional context to the error? What will the underlying error text look like?

Contributor Author

Well, it'll be straight from dbus

Contributor

@jonboulle spell that out for me. Does that mean it's just going to say something like "connection error"? Or is the underlying library nice enough to say "dbus communication error: timed out waiting for response"?

Contributor Author

It will either say something arbitrary taken directly from dbus, or it will be one of a few slightly nicer godbus errors like "dbus: invalid method name".
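
For reference, a small sketch of what the wrapped error would look like to an operator. The functions `fetchStates` and `reconcile` are hypothetical stand-ins; only the wrapping format string and the example godbus message are taken from this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// fetchStates is a hypothetical stand-in for the UnitManager call; it
// always fails with the godbus-style message quoted in the discussion.
func fetchStates() error {
	return errors.New("dbus: invalid method name")
}

// reconcile wraps the low-level error with the context string from the diff,
// so the log line says which operation failed as well as why.
func reconcile() error {
	if err := fetchStates(); err != nil {
		return fmt.Errorf("failed fetching unit states from UnitManager: %v", err)
	}
	return nil
}

func main() {
	// Prints: failed fetching unit states from UnitManager: dbus: invalid method name
	fmt.Println(reconcile())
}
```

The prefix matters most when the underlying text is an arbitrary dbus string, since that alone would not indicate which fleet operation failed.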

@bcwaldon
Contributor Author

stunning

@jonboulle
Contributor

Updated to address comments.

@bcwaldon
Contributor Author

@jonboulle shipit

jonboulle added a commit that referenced this pull request Oct 24, 2014
fleet agent does not compare contents of units in reconciler
@jonboulle jonboulle merged commit bf2966e into coreos:master Oct 24, 2014
@jonboulle jonboulle deleted the 866_agent_reconciler branch October 24, 2014 17:07
2 participants