
Ordered task execution #1134

Merged
merged 1 commit into coreos:master from the catasktrophe branch on Feb 27, 2015

Conversation

bcwaldon
Contributor

On master, a given fleet agent reconciles its current state against the desired state by generating a set of taskChain objects. A taskChain is a series of tasks that should be executed against a given unit. For example, a new unit is started on an agent with a taskChain containing a LoadUnit and a StartUnit task. A unit file is replaced with a chain of UnloadUnit, LoadUnit, StartUnit. The flaw here, however, is that there isn't any ordering between the execution of the taskChains themselves. So if you have unit foo.service that BindsTo bar.service, fleet will race to start them both, and sometimes foo.service will fail with "No such file or directory" since bar.service may not have actually been written to disk yet.

The alternative proposed here is to throw the taskChain concept out the window in favor of an ordered set of tasks, where each task is executed in serial. Before starting execution, the tasks are sorted such that all LoadUnit tasks are executed before StartUnit tasks. This fixes the "No such file or directory" problems many folks have been running into by making sure all unit files are available on disk before taking any actions on them. The loss of concurrency should not be an issue as fleet already uses the DBus APIs for job control and doesn't actually block on completion.
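To make the ordering concrete, here is a minimal sketch of the idea in Go. The LoadUnit/StartUnit/UnloadUnit names come from the description above; everything else (the task struct, the run function, the unit names) is illustrative rather than fleet's actual implementation.

// A minimal sketch (not fleet's actual code): gather the pending tasks,
// sort them so unit-file loads land before starts, then run them serially.
package main

import (
	"fmt"
	"sort"
)

type taskType int

const (
	UnloadUnit taskType = iota // remove the unit file and unload it from systemd
	LoadUnit                   // write the unit file to disk and load it into systemd
	StartUnit                  // ask systemd to start an already-loaded unit
)

type task struct {
	Type taskType
	Unit string
}

// run sorts the tasks by type and executes them one at a time, so no
// StartUnit can run before every LoadUnit has completed.
func run(tasks []task) {
	sort.SliceStable(tasks, func(i, j int) bool { return tasks[i].Type < tasks[j].Type })
	for _, t := range tasks {
		// In fleet the dispatch itself is a non-blocking DBus call, so serial
		// execution here does not mean waiting for each unit to finish starting.
		fmt.Printf("executing task %d on %s\n", t.Type, t.Unit)
	}
}

func main() {
	// Without sorting, foo.service could be started before bar.service's unit
	// file exists on disk, reproducing the "No such file or directory" race.
	run([]task{
		{StartUnit, "foo.service"},
		{LoadUnit, "foo.service"},
		{StartUnit, "bar.service"},
		{LoadUnit, "bar.service"},
	})
}

Sorting stably by task type keeps the relative order of operations on the same unit intact, so the per-unit sequencing the taskChains used to provide is preserved while the cross-unit race goes away.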

Fix #900
Fix #997
Fix #1003
Fix #1127

Agents need to execute all unit file load operations before
attempting to start anything. The taskChain approach did not
provide this safety. An ordered list of tasks gives us what
we need and greatly simplifies the codebase.
return
}

go func() {
Contributor

Why is it now safe to get rid of the goroutine here?

Contributor Author

It's sort of a long story. The short version is that it used to be safe to spin off a goroutine here because the TaskManager would reject new taskChains for units that already have work in progress. Now that the TaskManager does not track the in-flight tasks (and therefore is no longer threadsafe), the reconciler should not interact with the TaskManager concurrently (which this goroutine would allow).
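As a hypothetical illustration of that constraint (taskManager, doTask, and reconcile below are stand-in names, not fleet's real API): once the manager stops tracking in-flight work and loses thread safety, the reconciler has to drive it from a single goroutine rather than spawning one per task.

// Sketch only; types and names are illustrative.
package main

type task struct{ unit, action string }

// taskManager no longer tracks in-flight tasks, so it is not safe for
// concurrent use; callers must invoke it from one goroutine at a time.
type taskManager struct{}

func (tm *taskManager) doTask(t task) error {
	// Dispatch the job over DBus; this returns without waiting for the
	// unit to actually finish starting or stopping.
	return nil
}

// reconcile calls the manager directly instead of wrapping the call in
// `go func() { ... }()`, so every interaction stays on a single goroutine.
func reconcile(tm *taskManager, tasks []task) {
	for _, t := range tasks {
		if err := tm.doTask(t); err != nil {
			continue // a real reconciler would log the error and move on
		}
	}
}

func main() {
	reconcile(&taskManager{}, []task{{"foo.service", "start"}})
}

Whether to drop the goroutine or add locking inside the manager is a design choice; this PR takes the simpler route of keeping all TaskManager access on the reconciler's goroutine.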

Contributor Author

Independent of the architectural change being made here, this goroutine could be removed from master with minimal impact on the operation of the Agent.

@bcwaldon
Contributor Author

Moving forward with this and cutting a v0.9.1 release.

bcwaldon added a commit that referenced this pull request Feb 27, 2015
bcwaldon merged commit a35ee29 into coreos:master Feb 27, 2015
bcwaldon deleted the catasktrophe branch February 27, 2015 21:43
@bcwaldon
Contributor Author

bcwaldon commented Mar 2, 2015

fleet v0.9.1 has been released to the CoreOS Alpha channel. It will be rolled out to Beta and Stable later this week. If anyone who was affected by this bug can validate the fix on Alpha, we would appreciate it.

@rynbrd

rynbrd commented Mar 2, 2015

We've got the update on prod. So far so good!

@bcwaldon
Contributor Author

bcwaldon commented Mar 2, 2015

@bluedragonx thanks for the verification!

@rufman

rufman commented Mar 3, 2015

I'm still seeing the error "failed to load: No such file or directory" on the Alpha channel.

@rufman

rufman commented Mar 3, 2015

Running systemctl daemon-reload, destroying the service, and restarting works (which didn't work most of the time in v0.9.0). I am starting 3 services at the same time (2 Node containers and an nginx container). Sometimes both Node containers start successfully, other times only one of them does. The nginx container (loaded last) has never started, unless I run daemon-reload and then manually start the service.

@bcwaldon
Contributor Author

bcwaldon commented Mar 3, 2015

@rufman Would you please share the exact logs and unit files that you are using to reproduce this bug in a new GitHub issue? And are you confident you are running v0.9.1?

@rufman

rufman commented Mar 3, 2015

Yes, I double-checked that fleet -version is 0.9.1.

@guruvan

guruvan commented Mar 8, 2015

@bcwaldon I just updated some hosts to 607.0 with fleet 0.9.1 and had no luck: rebooting hosts with running units leaves those units not started, newly started units just load dead, and units consistently need a daemon-reload or fail with "no such file or directory".

Today I noticed that fleetctl is no longer reliably destroying services from the command line. I had to manually rm /run/fleet/units/the-offending@unit.service.

This issue has me essentially managing every service by hand and logging directly into the host to use systemctl. My options are narrowing: revert back two or three previous stable releases, or find another solution. My network has been in a state of rolling outages for weeks now, and it has needed constant babysitting since implementing 607.0.

This is happening with any of over 50 unit files on a 15-20 host network, and sanitizing and sifting the logs is a huge chore. Please tell me exactly what you need so I can find it; otherwise I'm literally looking for a needle in a hayfield.

@bcwaldon
Contributor Author

bcwaldon commented Mar 8, 2015

@guruvan I'm more than happy to help debug your issue. Could you share any logs and the unit files themselves? Please file a separate issue regarding the fleetctl problem, with explicit repro steps and debug logs from fleetctl (use the --debug flag).

@patrickbcullen

I am seeing the same bug with 0.9.1 as with 0.9.0. Every other destroy/start cycle causes a failure, but only if you use a template and reuse the unit id (i.e. myunit@.service and myunit@1.service).

It should be pretty easy to recreate this on your own. If you cannot, let me know.

@timfallmk

@bcwaldon I am consistently seeing the same issue with 0.9.1. I can share any logs and unit files you would like.

@bcwaldon
Contributor Author

Just released https://github.com/coreos/fleet/releases/tag/v0.9.2 with another major bug fixed. Please re-verify any unexpected behavior with those binaries. They should be available in Alpha on Thursday.

@simonvanderveldt

I'm not sure if I should create a new issue for this, but is it correct that one still has to execute fleet commands manually in the right order to make starting dependent units work?
In our case we have a monotonic .timer unit matching a .service unit, and the only way to get this to work is to use a timer based on OnUnitActiveSec with a MachineOf=app.service property, and then execute the following commands:

fleetctl start -no-block app.timer
fleetctl start app.service

The -no-block is necessary because otherwise fleetctl hangs indefinitely: it can't load the timer unit while there is no service to schedule it next to.
And when the service unit is started first and the timer after it, the timer doesn't work, apparently because it doesn't pick up the service unit's start time if the timer is started after the service.

This might be related to #1697
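For reference, a minimal sketch of the timer unit described in the comment above. The app.timer/app.service names and the OnUnitActiveSec/MachineOf settings come from that comment; the interval and description are illustrative.

# app.timer (illustrative sketch)
[Unit]
Description=Periodic trigger for app.service

[Timer]
# Monotonic timer: fires relative to when app.service last became active
OnUnitActiveSec=5min

[X-Fleet]
# Schedule this timer on the same machine as the service it triggers
MachineOf=app.service

With MachineOf, fleet schedules the timer onto the same machine as app.service, which is why the service has to exist before the timer can be loaded, and hence the -no-block workaround above.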

@dongsupark
Contributor

@simonvanderveldt Thanks for the report. But this PR was merged long ago and has been inactive for 17 months; it's unlikely that anyone will come back here and read this comment.
Can you please create a new issue, or just leave a comment on #1697?

@simonvanderveldt

@dongsupark thanks for the response, I'll create a new issue!
