
Ordered task execution #1134

Merged
merged 1 commit into coreos:master from the catasktrophe branch on Feb 27, 2015

Conversation

bcwaldon
Contributor

On master, a given fleet agent reconciles its current state against the desired state by generating a set of taskChain objects. A taskChain is a series of tasks that should be executed against a given unit. For example, a new unit is started on an agent with a taskChain containing a LoadUnit and a StartUnit task. A unit file is replaced with a chain of UnloadUnit, LoadUnit, StartUnit. The flaw here, however, is that there isn't any ordering between the execution of the taskChains themselves. So if you have unit foo.service that BindsTo bar.service, fleet will race to start them both, and sometimes foo.service will fail with "No such file or directory" since bar.service may not have actually been written to disk yet.

The alternative proposed here is to throw the taskChain concept out the window in favor of an ordered set of tasks, where each task is executed in serial. Before starting execution, the tasks are sorted such that all LoadUnit tasks are executed before StartUnit tasks. This fixes the "No such file or directory" problems many folks have been running into by making sure all unit files are available on disk before taking any actions on them. The loss of concurrency should not be an issue as fleet already uses the DBus APIs for job control and doesn't actually block on completion.
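To make the ordering concrete, here is a minimal sketch of the idea in Go. The LoadUnit/StartUnit/UnloadUnit names come from the description above; everything else (the task struct, the run function, the unit names) is illustrative rather than fleet's actual implementation.

// A minimal sketch (not fleet's actual code): gather the pending tasks,
// sort them so unit-file loads land before starts, then run them serially.
package main

import (
	"fmt"
	"sort"
)

type taskType int

const (
	UnloadUnit taskType = iota // remove the unit file and unload it from systemd
	LoadUnit                   // write the unit file to disk and load it into systemd
	StartUnit                  // ask systemd to start an already-loaded unit
)

type task struct {
	Type taskType
	Unit string
}

// run sorts the tasks by type and executes them one at a time, so no
// StartUnit can run before every LoadUnit has completed.
func run(tasks []task) {
	sort.SliceStable(tasks, func(i, j int) bool { return tasks[i].Type < tasks[j].Type })
	for _, t := range tasks {
		// In fleet the dispatch itself is a non-blocking DBus call, so serial
		// execution here does not mean waiting for each unit to finish starting.
		fmt.Printf("executing task %d on %s\n", t.Type, t.Unit)
	}
}

func main() {
	// Without sorting, foo.service could be started before bar.service's unit
	// file exists on disk, reproducing the "No such file or directory" race.
	run([]task{
		{StartUnit, "foo.service"},
		{LoadUnit, "foo.service"},
		{StartUnit, "bar.service"},
		{LoadUnit, "bar.service"},
	})
}

Sorting stably by task type keeps the relative order of operations on the same unit intact, so the per-unit sequencing the taskChains used to provide is preserved while the cross-unit race goes away.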

Fix #900
Fix #997
Fix #1003
Fix #1127

Agents need to execute all unit file load operations before
attempting to start anything. The taskChain approach did not
provide this safety. An ordered list of tasks gives us what
we need and greatly simplifies the codebase.
return
}

go func() {
Contributor

Why is it now safe to get rid of the goroutine here?

Contributor Author

It's sort of a long story. The short version is that it used to be safe to spin off a goroutine here because the TaskManager would reject new taskChains for units that already have work in progress. Now that the TaskManager does not track the in-flight tasks (and therefore is no longer threadsafe), the reconciler should not interact with the TaskManager concurrently (which this goroutine would allow).
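As a hypothetical illustration of that constraint (taskManager, doTask, and reconcile below are stand-in names, not fleet's real API): once the manager stops tracking in-flight work and loses thread safety, the reconciler has to drive it from a single goroutine rather than spawning one per task.

// Sketch only; types and names are illustrative.
package main

type task struct{ unit, action string }

// taskManager no longer tracks in-flight tasks, so it is not safe for
// concurrent use; callers must invoke it from one goroutine at a time.
type taskManager struct{}

func (tm *taskManager) doTask(t task) error {
	// Dispatch the job over DBus; this returns without waiting for the
	// unit to actually finish starting or stopping.
	return nil
}

// reconcile calls the manager directly instead of wrapping the call in
// `go func() { ... }()`, so every interaction stays on a single goroutine.
func reconcile(tm *taskManager, tasks []task) {
	for _, t := range tasks {
		if err := tm.doTask(t); err != nil {
			continue // a real reconciler would log the error and move on
		}
	}
}

func main() {
	reconcile(&taskManager{}, []task{{"foo.service", "start"}})
}

Whether to drop the goroutine or add locking inside the manager is a design choice; this PR takes the simpler route of keeping all TaskManager access on the reconciler's goroutine.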

Contributor Author

Independent of the architectural change being made here, this goroutine could be removed from master with minimal impact on the operation of the Agent.

@bcwaldon
Contributor Author

Moving forward with this and cutting a v0.9.1 release.

bcwaldon added a commit that referenced this pull request Feb 27, 2015
bcwaldon merged commit a35ee29 into coreos:master Feb 27, 2015
bcwaldon deleted the catasktrophe branch February 27, 2015 21:43
@bcwaldon
Contributor Author

bcwaldon commented Mar 2, 2015

fleet v0.9.1 has been released to the CoreOS Alpha channel. It will be rolled out to Beta and Stable later this week. If anyone who was affected by this bug can validate the fix on Alpha, we would appreciate it.

@rynbrd

rynbrd commented Mar 2, 2015

We've got the update on prod. So far so good!

@bcwaldon
Contributor Author

bcwaldon commented Mar 2, 2015

@bluedragonx thanks for the verification!

@rufman

rufman commented Mar 3, 2015

I'm still seeing the error "failed to load: No such file or directory" on the Alpha channel.

@rufman

rufman commented Mar 3, 2015

Running systemctl daemon-reload, destroying the service, and restarting works (which didn't work most of the time in v0.9.0). I am starting 3 services at the same time (2 Node containers and an nginx container). Sometimes both Node containers start successfully, other times only one of them does. The nginx container (loaded last) has never started, unless I run daemon-reload and then manually start the service.

@bcwaldon
Contributor Author

bcwaldon commented Mar 3, 2015

@rufman Would you please share the exact logs and unit files that you are using to reproduce this bug in a new GitHub issue? And are you confident you are running v0.9.1?

@rufman

rufman commented Mar 3, 2015

Yes, I double-checked that fleet -version is 0.9.1.

@guruvan

guruvan commented Mar 8, 2015

@bcwaldon I just updated some hosts to 607.0 with fleet 0.9.1 and had no luck: rebooting hosts with running units leaves those units not started, newly started units just load dead, and units consistently need a daemon-reload or fail with "no such file or directory".

Today I noticed that fleetctl is no longer reliably destroying services from the command line. I had to manually rm /run/fleet/units/the-offending@unit.service.

This issue has me essentially managing every service by hand and logging directly into the host to use systemctl. My options are narrowing: revert back two or three previous stable releases, or find another solution. My network has been in a state of rolling outages for weeks now, and it has needed constant babysitting since implementing 607.0.

This is happening with any of over 50 unit files on a 15-20 host network, and sanitizing and sifting the logs is a huge chore. Please tell me exactly what you need so I can find it; otherwise I'm literally looking for a needle in a hayfield.

@bcwaldon
Contributor Author

bcwaldon commented Mar 8, 2015

@guruvan I'm more than happy to help debug your issue. Could you share any logs and the unit files themselves? Please file a separate issue regarding the fleetctl problem, with explicit repro steps and debug logs from fleetctl (use the --debug flag).

@patrickbcullen

I am seeing the same bug with 0.9.1 as with 0.9.0. Every other destroy/start cycle causes a failure, but only if you use a template and reuse the unit id (i.e. myunit@.service and myunit@1.service).

It should be pretty easy to recreate this on your own. If you cannot, let me know.

@timfallmk

@bcwaldon I am consistently seeing the same issue with 0.9.1. I can share any logs and unit files you would like.

@bcwaldon
Contributor Author

Just released https://github.com/coreos/fleet/releases/tag/v0.9.2 with another major bug fixed. Please re-verify any unexpected behavior with those binaries. They should be available in Alpha on Thursday.

@simonvanderveldt

I'm not sure if I should create a new issue for this, but is it correct that one still has to execute fleet commands manually in the right order to make starting dependent units work?
In our case we have a monotonic .timer unit matching a .service unit, and the only way to get this to work is to use a timer based on OnUnitActiveSec with a MachineOf=app.service property, and then execute the following commands:

fleetctl start -no-block app.timer
fleetctl start app.service

The -no-block is necessary because otherwise fleetctl hangs indefinitely: it can't load the timer unit while there is no service to schedule it next to.
And when the service unit is started first and the timer after it, the timer doesn't work, apparently because it doesn't pick up the service unit's start time if the timer is started after the service.

This might be related to #1697
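For reference, a minimal sketch of the timer unit described in the comment above. The app.timer/app.service names and the OnUnitActiveSec/MachineOf settings come from that comment; the interval and description are illustrative.

# app.timer (illustrative sketch)
[Unit]
Description=Periodic trigger for app.service

[Timer]
# Monotonic timer: fires relative to when app.service last became active
OnUnitActiveSec=5min

[X-Fleet]
# Schedule this timer on the same machine as the service it triggers
MachineOf=app.service

With MachineOf, fleet schedules the timer onto the same machine as app.service, which is why the service has to exist before the timer can be loaded, and hence the -no-block workaround above.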

@dongsupark
Contributor

@simonvanderveldt Thanks for the report. But this PR was merged long ago and has been inactive for 17 months; it's unlikely that anyone will come back here and read this comment.
Can you please create a new issue, or just leave a comment on #1697?

@simonvanderveldt

@dongsupark thanks for the response, I'll create a new issue!
