Skip to content
This repository has been archived by the owner on Jan 30, 2020. It is now read-only.

ExecStop / ExecStopPost not executed when fleetctl destroy is used #1000

Open
flantel opened this issue Oct 24, 2014 · 26 comments
Open

ExecStop / ExecStopPost not executed when fleetctl destroy is used #1000

flantel opened this issue Oct 24, 2014 · 26 comments

Comments

@flantel
Copy link

flantel commented Oct 24, 2014

Hi

Using fleetctl version 0.8.3 and CoreOS stable, if I stop a unit with 'fleetctl destroy unit.service' the ExecStop and ExecStopPost are not executed. If I use 'fleetctl stop ...' they are.

Is this expected behaviour? I would have expected (and I think this was the case with earlier versions) that destroy would execute StopExec and StopExecPost.

Thanks.

-Barry Flanagan

@bcwaldon
Copy link
Contributor

@flantel fleet should allow systemd to run whatever Exec* commands are defined. I'll look into this

@bcwaldon bcwaldon added the bug label Oct 25, 2014
@bcwaldon
Copy link
Contributor

So with this unit:

[Service]
ExecStartPre=/usr/bin/echo "ExecStartPre"
ExecStart=/usr/bin/bash -c "echo \"ExecStart\"; sleep 10000"
ExecStartPost=/usr/bin/echo "ExecStartPost"

ExecStop=/usr/bin/echo "ExecStop"
ExecStopPost=/usr/bin/echo "ExecStopPost"

I see the following from systemctl status foo@2.service after a fleetctl start foo@2.service; fleetctl destroy foo@2.service:

$ systemctl status foo@2.service
● foo@2.service
   Loaded: not-found (Reason: No such file or directory)
   Active: inactive (dead)

Oct 25 00:48:49 core-01 systemd[1]: Starting foo@2.service...
Oct 25 00:48:49 core-01 echo[1277]: ExecStartPre
Oct 25 00:48:49 core-01 bash[1279]: ExecStart
Oct 25 00:48:49 core-01 echo[1280]: ExecStartPost
Oct 25 00:48:49 core-01 systemd[1]: Started foo@2.service.
Oct 25 00:49:04 core-01 systemd[1]: Stopping foo@2.service...
Oct 25 00:49:04 core-01 echo[1293]: ExecStop
Oct 25 00:49:04 core-01 systemd[1]: Stopped foo@2.service.
Oct 25 00:49:04 core-01 echo[1295]: ExecStopPost

@flantel Can you share more details of your units? Maybe your ExecStop is taking a "long time" and systemctl is killing it forcefully?

@flantel
Copy link
Author

flantel commented Oct 25, 2014

@bcwaldon Thanks for looking at this. Your test unit works as you demonstrate. However, the following unit, which loads a container, does not.

[Unit]
Description=Test
[Service]
ExecStartPre=/usr/bin/echo "ExecStartPre"
ExecStart=/usr/bin/docker run --rm --name test ubuntu /bin/bash -c "sleep 10000"
ExecStop=/usr/bin/docker stop test
ExecStopPost=/usr/bin/echo "ExecStopPost"

With the above, running fleetctl destroy test.service results in the following:

# systemctl  status test.service
● test.service
   Loaded: not-found (Reason: No such file or directory)
   Active: failed (Result: exit-code) since Sat 2014-10-25 08:20:41 UTC; 12s ago
 Main PID: 8561 (code=exited, status=255)

Oct 25 08:20:16 coreos-01 echo[8558]: ExecStartPre
Oct 25 08:20:16 coreos-01 systemd[1]: Started Test.
Oct 25 08:20:30 coreos-01 systemd[1]: Stopping Test...
Oct 25 08:20:41 coreos-01 docker[8589]: test
Oct 25 08:20:41 coreos-01 systemd[1]: test.service: main process exited, code=exited, status=255/n/a
Oct 25 08:20:41 coreos-01 systemd[1]: Stopped test.service.
Oct 25 08:20:41 coreos-01 systemd[1]: Unit test.service entered failed state.

Thanks.

@bcwaldon
Copy link
Contributor

@flantel Ah, I think I figured it out. You can reproduce this directly by emulating fleetd directly with systemctl:

$ sudo systemctl stop --no-block foo@1.service && sudo rm /etc/systemd/system/foo@1.service && sudo systemctl daemon-reload

$ journalctl -lf -ufoo@1.service
Nov 14 18:11:30 core-01 systemd[1]: Starting foo@1.service...
Nov 14 18:11:30 core-01 systemd[1]: Started foo@1.service.
Nov 14 18:11:30 core-01 echo[1655]: ExecStartPre
Nov 14 18:11:47 core-01 systemd[1]: Stopping foo@1.service...
Nov 14 18:11:57 core-01 docker[1709]: test
Nov 14 18:11:57 core-01 systemd[1]: foo@1.service: main process exited, code=exited, status=255/n/a
Nov 14 18:11:57 core-01 systemd[1]: Stopped foo@1.service.
Nov 14 18:11:57 core-01 systemd[1]: Unit foo@1.service entered failed state.

If the unit is no longer loaded in systemd when the ExecStop finishes, it's ExecStopPost commands won't get triggered. fleetd v0.8.3 doesn't wait for the ExecStop to complete before removing the unit file from disk and calling daemon-reload.

In master (soon to be v0.9.0), fleetd does not call daemon-reload unless it absolutely needs to. I attempted the same fleetctl start/fleetctl destroy test on master, and ExecStopPost was executed as expected. Given that observation, I think your bug is already solved in master.

@flantel
Copy link
Author

flantel commented Nov 14, 2014

On 14 November 2014 18:27, Brian Waldon notifications@github.com wrote:

@flantel https://github.com/flantel Ah, I think I figured it out. You
can reproduce this directly by emulating fleetd directly with systemctl:

$ sudo systemctl stop --no-block foo@1.service && sudo rm /etc/systemd/system/foo@1.service && sudo systemctl daemon-reload

$ journalctl -lf -ufoo@1.service
Nov 14 18:11:30 core-01 systemd[1]: Starting foo@1.service...
Nov 14 18:11:30 core-01 systemd[1]: Started foo@1.service.
Nov 14 18:11:30 core-01 echo[1655]: ExecStartPre
Nov 14 18:11:47 core-01 systemd[1]: Stopping foo@1.service...
Nov 14 18:11:57 core-01 docker[1709]: test
Nov 14 18:11:57 core-01 systemd[1]: foo@1.service: main process exited, code=exited, status=255/n/a
Nov 14 18:11:57 core-01 systemd[1]: Stopped foo@1.service.
Nov 14 18:11:57 core-01 systemd[1]: Unit foo@1.service entered failed state.

If the unit is no longer loaded in systemd when the ExecStop finishes,
it's ExecStopPost commands won't get triggered. fleetd v0.8.3 doesn't
wait for the ExecStop to complete before removing the unit file from disk
and calling daemon-reload.

In master (soon to be v0.9.0), fleetd does not call daemon-reload unless
it absolutely needs to. I attempted the same fleetctl start/fleetctl
destroy test on master, and ExecStopPost was executed as expected.
Given that observation, I think your bug is already solved in master.

That's great to know. Thanks very much for the update Brian!

Regards,

-Barry Flanagan


Reply to this email directly or view it on GitHub
#1000 (comment).

@bcwaldon
Copy link
Contributor

Closing as v0.9.0 has been released. Reopen if this is still an issue.

@philk
Copy link
Contributor

philk commented Apr 29, 2015

It looks like we're actually running into this still on fleetctl 0.9.2

journalctl logs

Apr 28 23:41:28 ip-10-0-1-135.us-west-2.compute.internal fleetd[24918]: INFO manager.go:140: Triggered systemd unit vor@64a51e9-1.service stop: job=1461615
Apr 28 23:41:28 ip-10-0-1-135.us-west-2.compute.internal systemd[1]: Stopping Vor...
Apr 28 23:41:28 ip-10-0-1-135.us-west-2.compute.internal fleetd[24918]: INFO manager.go:261: Removing systemd unit vor@64a51e9-1.service
Apr 28 23:41:28 ip-10-0-1-135.us-west-2.compute.internal fleetd[24918]: INFO manager.go:184: Instructing systemd to reload units
Apr 28 23:41:28 ip-10-0-1-135.us-west-2.compute.internal systemd[1]: Reloading.
Apr 28 23:41:28 ip-10-0-1-135.us-west-2.compute.internal systemd[1]: Unit type .busname is not supported on this system.
Apr 28 23:41:28 ip-10-0-1-135.us-west-2.compute.internal fleetd[24918]: INFO reconcile.go:332: AgentReconciler completed task: type=UnloadUnit job=vor@64a51e9-1.service reason="unit loaded but not scheduled here"
Apr 28 23:41:28 ip-10-0-1-135.us-west-2.compute.internal fleetd[24918]: INFO reconcile.go:332: AgentReconciler completed task: type=ReloadUnitFiles job=N/A reason="always reload unit files"

The unit file looks like this (hiding our service discovery info)

[Unit]
Description=Vor
After=docker.service
Requires=docker.service

[Service]
TimeoutStartSec=0
EnvironmentFile=/etc/environment
ExecStartPre=/bin/bash -c "# docker pull from our registry"
ExecStart=/bin/bash -c "\
    /usr/bin/docker run --name vor-%i \
    # a bunch of environment variables and ports in here \
    internal_registry/vor:$(echo %i | cut -f1 -d'-')"
ExecStartPost=-/bin/bash -c "# some etcdctl set commands that work fine"
ExecStop=-/bin/bash -c "# a couple of etcdctl remove commands in here that work fine"
ExecStop=/usr/bin/docker stop vor-%i
ExecStopPost=-/usr/bin/docker kill vor-%i
ExecStopPost=-/usr/bin/docker rm -f vor-%i
Restart=always
RestartSec=10s

[X-Fleet]
MachineMetadata=type=cluster
Conflicts=vor@*.service

In our case we have an internal service that isn't responding to the SIGTERM sent by docker stop, fleetctl removes the unit file and reloads the daemon before the docker stop 10 second time out and that causes ExecStopPost to never run.

We're fixing the SIGTERM issue in our system since that's just sloppy but it feels like fleet should be able to guarantee (reasonably) that if it says a service is gone there isn't some process or container just lingering on the system indefinitely.

Running fleetctl stop waiting for 10-15 seconds (until docker stop times out and sends SIGKILL) then fleetctl destroy works fine as well so we can work around this for now. Just seems like some pretty unexpected behavior.

@dhawal55
Copy link

I'm also seeing the same issue on fleet v0.9.2.

This is how my unit file looks:

[Unit]
Description=Vulcand registration
Requires=etcd.service
After=etcd.service

[Service]
RemainAfterExit=no

ExecStartPre=/usr/bin/etcdctl set /vulcand/frontends/test1/frontend \
  '{"BackendId":"test", \
    "Route":"Host(`example.com`) && Path(`/health`)", \
    "Type":"http"}'
ExecStart=/usr/bin/etcdctl set /vulcand/frontends/test2/frontend \
  '{"BackendId":"test", \
    "Route":"Host(`example.com`) && Path(`/.*`)", \
    "Type":"http"}'


ExecStop=/usr/bin/etcdctl rm --recursive /vulcand/frontends/test1
ExecStopPost=/usr/bin/etcdctl rm --recursive /vulcand/frontends/test2

[Install]
WantedBy=multi-user.target

fleetctl destroy only removes the test1 frontend and leaves test2.
As mentioned by @philk, fleetctl stop and then destroy works fine.

@xied75
Copy link

xied75 commented Jul 30, 2015

Definitely worth re-opening:

I got two nginx one front one mid, the front is called for both stop and rm, the mid on the other hand, is only called for stop, and for the mid it did a reload.

v0.9.2

@cusspvz
Copy link

cusspvz commented Sep 18, 2015

+1 for reopening

We are using ExecStopPost and it isn't being triggered with fleetctl destroy.

fleet 0.10.2

cc @bcwaldon @fernandoneto

@fernandoneto
Copy link

+1 for reopening

@timkoopmans
Copy link

+1

@cusspvz
Copy link

cusspvz commented Sep 24, 2015

@bcwaldon seems that post owner doesn't show recent activity. Could you please reopen to avoid creating a new one?

@jonboulle jonboulle reopened this Oct 1, 2015
@brmakana
Copy link

brmakana commented Dec 9, 2015

Any news on this? Definitely still seeing the problem in 0.10.2

@shayts7
Copy link

shayts7 commented Jan 24, 2016

+1, We see the same problem with fleet version 0.11.5

@stmuraka
Copy link

+1 also reproduced with version 0.11.5

@kayrus
Copy link
Contributor

kayrus commented Jan 29, 2016

Can you guys post your fleet unit files? We have to reproduce the issue.

@shayts7
Copy link

shayts7 commented Jan 31, 2016

Sure. Here is mine (fleet version 0.11.5):

# Copyright 2015 HP Software
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

[Unit]
Description=Gaia Result Upload Service
Documentation=https://github.com/gaia-adm/result-upload-service

After=registrator.service
Wants=registrator.service

[Service]
# Let processes take awhile to start up (for first run Docker containers)
# Large start timeout is to allow for pulling down Docker images from Registry
TimeoutStartSec=20min
TimeoutStopSec=15

# Change killmode from "control-group" to "none" to let Docker remove
# work correctly.
KillMode=none

# Get CoreOS environmental variables
EnvironmentFile=/etc/environment

# Restart policy
Restart=on-failure
RestartSec=20s

# Pre-start and Start
## Directives with "=-" are allowed to fail without consequence
ExecStartPre=-/bin/mkdir /upload
ExecStartPre=-/usr/bin/docker stop result-upload-service
ExecStartPre=-/usr/bin/docker rm result-upload-service
ExecStartPre=/usr/bin/docker run --rm=true -v /opt/bin:/target gaiaadm/volume-tools-${ENVIRON_LOCATION}
ExecStartPre=/opt/bin/gaia-mount.sh result-upload-${ENVIRON}

ExecStart=/bin/bash -a -c '/usr/bin/docker run \
--name result-upload-service
-v $(/opt/bin/gaia-get-mount-path.sh result-upload-${ENVIRON}):/upload \
gaiaadm/result-upload-service:54-master'

# Stop
ExecStop=/usr/bin/docker stop result-upload-service
ExecStop=-/opt/bin/gaia-unmount.sh result-upload-${ENVIRON}
ExecStopPost=-/opt/bin/gaia-unmount.sh result-upload-${ENVIRON}

[Install]
WantedBy=multi-user.target

@kayrus kayrus self-assigned this Feb 1, 2016
@kayrus
Copy link
Contributor

kayrus commented Feb 1, 2016

@tixxdz

Steps to reproduce.

Unit file myapp.service:

[Service]
ExecStart=/usr/bin/docker run --name busybox --rm busybox sh -c 'while true; do echo "Hello, world"; sleep 1; done'
ExecStop=-/usr/bin/docker stop busybox
ExecStopPost=/usr/bin/echo 'trigger ExecStopPost'

fleetctl start myapp.service then run and watch for output journalctl -f -u myapp.service.

Run fleetctl stop myapp.service and you'll see trigger ExecStopPost message.

Then run fleetctl stop myapp.service again and then fleetctl destroy myapp.service. You won't notice trigger ExecStopPost.

Stop logs:

Feb 01 13:20:52 coreos1 systemd[1]: Started myapp.service.
Feb 01 13:20:53 coreos1 docker[1242]: Hello, world
...
Feb 01 13:20:57 coreos1 systemd[1]: Stopping myapp.service...
Feb 01 13:21:07 coreos1 docker[1280]: busybox
Feb 01 13:21:07 coreos1 systemd[1]: myapp.service: Main process exited, code=exited, status=137/n/a
Feb 01 13:21:07 coreos1 systemd[1]: Stopped myapp.service.
Feb 01 13:21:07 coreos1 systemd[1]: myapp.service: Unit entered failed state.
Feb 01 13:21:07 coreos1 systemd[1]: myapp.service: Failed with result 'exit-code'.
Feb 01 13:21:07 coreos1 echo[1311]: trigger ExecStopPost

Destroy logs:

Feb 01 13:22:35 coreos1 systemd[1]: Started myapp.service.
Feb 01 13:22:36 coreos1 docker[1367]: Hello, world
...
Feb 01 13:22:40 coreos1 systemd[1]: Stopping myapp.service...
Feb 01 13:22:41 coreos1 docker[1367]: Hello, world
...
Feb 01 13:22:50 coreos1 docker[1402]: busybox
Feb 01 13:22:50 coreos1 systemd[1]: myapp.service: Main process exited, code=exited, status=137/n/a
Feb 01 13:22:50 coreos1 systemd[1]: Stopped myapp.service.
Feb 01 13:22:50 coreos1 systemd[1]: myapp.service: Unit entered failed state.
Feb 01 13:22:50 coreos1 systemd[1]: myapp.service: Failed with result 'exit-code'.

@kayrus
Copy link
Contributor

kayrus commented Feb 1, 2016

@tixxdz
same effect with multiple ExecStop's. Looks like systemd runs only first ExecStop.

@kayrus
Copy link
Contributor

kayrus commented Feb 2, 2016

We can reproduce similar behavior using these bash commands:

submit and schedule unit:

[Unit]
Description=MyApp
After=docker.service
Requires=docker.service

[Service]
TimeoutStartSec=0
ExecStartPre=-/usr/bin/docker kill busybox1
ExecStartPre=-/usr/bin/docker rm busybox1
ExecStartPre=/usr/bin/docker pull busybox
ExecStart=/usr/bin/docker run --name busybox1 busybox /bin/sh -c "while true; do echo Hello World2; sleep 1; done"
ExecStop=/usr/bin/docker stop busybox1
ExecStopPost=/usr/bin/echo Hello from stop
sudo cp myapp.service /etc/systemd/system && sudo systemctl daemon-reload && sudo systemctl start myapp

stop unit, destroy it, reload systemd (in parallel):

sudo systemctl stop myapp.service & sudo rm /etc/systemd/system/myapp.service & sudo systemctl daemon-reload

When systemd reloads stopping units - it doesn't take into consideration defined ExecStopPost= and ExecStop= hooks.

Fleet side fix should contain code which will wait for the unit "stopped" status, then destroy it.

As for systemd, looks like this bug relates to this issue: systemd/systemd#518

@tixxdz
Copy link
Contributor

tixxdz commented Feb 9, 2016

Updates:

We are working on this right now, basically this is a bug in both fleet and systemd. In systemd this is a well known issue. In systemd when doing a "daemon-reload" some "Exec*" may just be skipped since units are replaced in memory and previous commands are not there any more. Please see below.

Summary:

  • fleet

On fleet side when doing a destroy we should do our best to wait for at least a stop, but currently I'm new to fleet so my understanding and from reading this documentation https://coreos.com/fleet/docs/latest/using-the-client.html
"The destroy command does two things: Instruct systemd on the host machine to stop the unit, deferring to systemd completely for any custom stop directives (i.e. ExecStop option in the unit file).
Remove the unit file from the cluster, making it impossible to start again until it has been re-submitted.
Once a unit is destroyed, state will continue to be reported for it in fleetctl list-units. Only once the unit has stopped will its state be removed."

So either we wait for units to stop then do the destroy, or we could add "--block" or "--no-block" switch to destroy command. With "--block" we will be backward compatible but I don't like it. With "--no-block" we will have to change the behaviour and make destory block by default unless you set the "--no-block" switch which is also the same in systemd when you do "systemctl --no-block daemon-reload" and of course use it are your own risk but we have to add some debug messages and make fleet and systemd a bit smarter. However even with this change there will still be bugs here and there, the main issue is when X (unrelated unit, user or whatever) triggers a "daemon-reload" in systemd, if you happen to have changed unit Y and didn't want to do a "daemon-reload" cause you know that some "Exec*" directives will be skipped... then you are also out of luck! yeh X just messed up with Y...

  • systemd

The related issue is here: systemd/systemd#518

So it's not only related to fleet destroy it may happen at anytime, we can improve fleet as noted above but we have to add at least some debug logs and make systemd smarter.

So we could check in systemd what section did change or save the line numbers and continue... we are trying to figure out the best solution, at least doing systemctl status changed.service should also tell you that not only you need a "daemon-reload" but perhaps you also need to do a stop unit before that.. and also make it log every time daemon-reload is triggered and some "Exec*" where lost not executed etc.

Anyway just to say that we are working on this. Thanks!

@jonboulle
Copy link
Contributor

This is a pretty fundamental issue in systemd that is very difficult for us to try mitigate in fleet itself.

For posterity, the suggested workaround for now is to run fleetctl stop foo && fleetctl destroy foo

@arthur-c
Copy link

I just noticed ExecStopPost is not played if I stop a fleet unit and immediatly destroy it after fleetctl stop. The workaround fleetctl stop && fleetctl destroy should be fleetctl stop && sleep x && fleetctl destroy

@antrik
Copy link
Contributor

antrik commented Apr 20, 2016

That last observation might be related to #1025 I guess?...

@jonboulle
Copy link
Contributor

This is basically blocked by systemd/systemd#518

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests