Race condition in unloading units #1216

patrickbcullen · 2015-05-07T17:14:04Z

I have noticed a race condition in production where a job is supposed to be unloaded but never stops.

I think I found the bug here (line 98 in 7a64877):

a.registry.ClearUnitHeartbeat(unitName)

I have an example where the systemd unit file is gone but the unit is still running. This happened because the kill command in the systemd unit file did not succeed. Since this code does not check for an error, it goes ahead and removes all the systemd files. With the files gone, I can no longer stop the unit through fleet, because fleet cannot issue the systemd stop for a unit whose file it has already deleted.

bcwaldon · 2015-07-09T21:53:14Z

@patrickbcullen Can you provide an explicit set of repro steps for this?

In Agent.unloadUnit(), if systemdUnitManager.TriggerStop() returns any error, do not unload systemd units. Otherwise the unit could get into a state where the unit cannot be stopped via fleet, because the unit file was already removed. Fixes coreos#1216

bcwaldon added the bug and reviewed/needs more information labels Jul 9, 2015

jonboulle added kind/bug and removed bug labels Sep 24, 2015

dongsupark mentioned this issue Jul 20, 2016

systemd,agent: unload unit only when TriggerStop() runs successfully #1646

Merged

dongsupark closed this as completed in #1646 Jul 22, 2016
