This repository has been archived by the owner on Jan 30, 2020. It is now read-only.

Service periodically restarted #1402

Closed
eduardBM opened this issue Jan 14, 2016 · 25 comments

Comments

@eduardBM

Hi,

I see something similar to #1366 on my setup:

  • 3 nodes, Ubuntu 16.04, etcd 2.2.3, fleetd 0.11.5
  • started service with "X-Fleet : MachineID="
  • I'm trying to run/manage a service on the third node
  • every 15-20 seconds the service is reloaded:
    Log:
    Jan 14 14:39:24 o11n203 fleetd[4494]: DEBUG reconcile.go:257: Desired hash "5672ec4bdaefa511f7617e25c40a50fd6814bde1" differs to current hash 08e8f21aa9788333ec99d7eed7ef9754a33fb724 of Job(ovs-snmp@10.130.11.203.service) - unloading
    Jan 14 14:39:24 o11n203 fleetd[4494]: DEBUG reconcile.go:321: AgentReconciler attempting tasks [{UnloadUnit unit loaded but hash differs to expected %!s(_job.Unit=&{ovs-snmp@10.130.11.203.service {map[Unit:map[After:[ovs-watcher-framework.service] Description:[ovs snmp server] Requires:[ovs-watcher-framework.service]] Service:map[ExecStart:[/usr/bin/python2 /opt/OpenvStorage/ovs/extensions/snmp/ovssnmpserver.py --port 161] Restart:[on-failure] RestartSec:[5] TimeoutStopSec:[60] Type:[simple] Environment:[PYTHONPATH=/opt/OpenvStorage] WorkingDirectory:[/opt/OpenvStorage]] Install:map[WantedBy:[multi-user.target]] X-Fleet:map[MachineID:[93aae5bf7736d103238cb3b0569655ae]]] [0xc82086ef60 0xc82086efc0 0xc82086f020 0xc82086f080 0xc82086f0e0 0xc82086f140 0xc82086f1a0 0xc82086f200 0xc82086f260 0xc82086f2c0 0xc82086f320 0xc82086f3b0]} })} {UnloadUnit unit loaded but hash differs to expected %!s(_job.Unit=&{ovs-support-agent@10.130.11.203.service {map[Unit:map[Description:[Open vStorage support agent]] Service:map[Type:[simple] Environment:[PYTHONPATH=/opt/OpenvStorage] ExecStart:[/usr/bin/python2 /opt/OpenvStorage/ovs/extensions/support/agent.py] Restart:[on-failure] TimeoutStopSec:[3600]] Install:map[WantedBy:[multi-user.target]] X-Fleet:map[MachineID:[93aae5bf7736d103238cb3b0569655ae]]] [0xc82018b770 0xc82018b7d0 0xc82018b830 0xc82018b890 0xc82018b8f0 0xc82018b950 0xc82018b9b0 0xc82018ba40]} })} {LoadUnit unit scheduled here but not loaded %!s(*job.Unit=&{ovs-snmp@10.130.11.203.service {map[Unit:map[Description:[ovs snmp server] Requires:[ovs-watcher-framework.service] After:[ovs-watcher-framework.service]] Service:map[Environment:[PYTHONPATH=/opt/OpenvStorage] WorkingDirectory:[/opt/OpenvStorage] ExecStart:[/usr/bin/python2 /opt/OpenvStorage/ovs/extensions/snmp/ovssnmpserver.py --port 161] Restart:[on-failure] RestartSec:[5] TimeoutStopSec:[60] Type:[simple]] Install:map[WantedBy:[multi-user.target]] X-Fleet:map[MachineID:[93aae5bf7736d103238cb3b0569655ae]]] 
[0xc82086ef60 0xc82086efc0 0xc82086f020 0xc82086f080 0xc82086f0e0 0xc82086f140 0xc82086f1a0 0xc8208
    Jan 14 14:39:24 o11n203 fleetd[4494]: INFO manager.go:138: Triggered systemd unit ovs-snmp@10.130.11.203.service stop: job=165352
    Jan 14 14:39:24 o11n203 fleetd[4494]: INFO manager.go:259: Removing systemd unit ovs-snmp@10.130.11.203.service

I looked at #720, but I don't see a way to fix this.
Any ideas?

@eduardBM
Author

Hash seems to "flip" between values:
root@o11n201:~# fleetctl list-unit-files
UNIT HASH DSTATE STATE TARGET
...
ovs-snmp@10.130.11.203.service 08e8f21 launched launched 93aae5bf.../10.130.11.203
...
couple of seconds later
UNIT HASH DSTATE STATE TARGET
...
ovs-snmp@10.130.11.203.service 5672ec4 launched launched 93aae5bf.../10.130.11.203
...

Etcd values:
root@o11n201:~# etcdctl get /_coreos.com/fleet/states/ovs-snmp@10.130.11.203.service/93aae5bf7736d103238cb3b0569655ae
{"loadState":"loaded","activeState":"active","subState":"running","machineState":{"ID":"93aae5bf7736d103238cb3b0569655ae","PublicIP":"","Metadata":null,"Version":""},"unitHash":"bc168196b66ccff42597d530515839d93ecab098"}
root@o11n201:~# etcdctl get /_coreos.com/fleet/states/ovs-snmp@10.130.11.203.service/93aae5bf7736d103238cb3b0569655ae
{"loadState":"loaded","activeState":"active","subState":"running","machineState":{"ID":"93aae5bf7736d103238cb3b0569655ae","PublicIP":"","Metadata":null,"Version":""},"unitHash":"08e8f21aa9788333ec99d7eed7ef9754a33fb724"}

@eduardBM
Author

Hash "flipping" values:

root@o11n201:~# fleetctl list-unit-files --full
UNIT                HASH                        DSTATE      STATE       TARGET
ovs-snmp@10.130.11.201.service  d0103a37319c73e0616f609cb84923ab23970857    launched    launched    5f3cf828f82fdf825465c1355698e4d1/10.130.11.201
root@o11n201:~# fleetctl list-unit-files --full
UNIT                HASH                        DSTATE      STATE       TARGET
ovs-snmp@10.130.11.201.service  498cd185046e77fd643b3b85b3d319a254445bc6    launched    launched    5f3cf828f82fdf825465c1355698e4d1/10.130.11.201
root@o11n201:~# fleetctl list-unit-files --full
UNIT                HASH                        DSTATE      STATE       TARGET
ovs-snmp@10.130.11.201.service  f1630e5a6280f52dc2d842d384569e1c36617bd7    launched    launched    5f3cf828f82fdf825465c1355698e4d1/10.130.11.201
root@o11n201:~# fleetctl list-unit-files --full
UNIT                HASH                        DSTATE      STATE       TARGET
ovs-snmp@10.130.11.201.service  d0103a37319c73e0616f609cb84923ab23970857    launched    launched    5f3cf828f82fdf825465c1355698e4d1/10.130.11.201
root@o11n201:~# fleetctl list-unit-files --full
UNIT                HASH                        DSTATE      STATE       TARGET
ovs-snmp@10.130.11.201.service  80a4c6aaf80cf181a67694622971159953a44940    launched    launched    5f3cf828f82fdf825465c1355698e4d1/10.130.11.201
root@o11n201:~# fleetctl list-unit-files --full
UNIT                HASH                        DSTATE      STATE       TARGET
ovs-snmp@10.130.11.201.service  d0103a37319c73e0616f609cb84923ab23970857    launched    launched    5f3cf828f82fdf825465c1355698e4d1/10.130.11.201
root@o11n201:~# fleetctl list-unit-files --full
UNIT                HASH                        DSTATE      STATE       TARGET
ovs-snmp@10.130.11.201.service  80a4c6aaf80cf181a67694622971159953a44940    launched    launched    5f3cf828f82fdf825465c1355698e4d1/10.130.11.201
root@o11n201:~# fleetctl list-unit-files --full
UNIT                HASH                        DSTATE      STATE       TARGET
ovs-snmp@10.130.11.201.service  f1630e5a6280f52dc2d842d384569e1c36617bd7    launched    launched    5f3cf828f82fdf825465c1355698e4d1/10.130.11.201
root@o11n201:~# fleetctl list-unit-files --full
UNIT                HASH                        DSTATE      STATE       TARGET
ovs-snmp@10.130.11.201.service  f1630e5a6280f52dc2d842d384569e1c36617bd7    launched    launched    5f3cf828f82fdf825465c1355698e4d1/10.130.11.201
root@o11n201:~# fleetctl list-unit-files --full
UNIT                HASH                        DSTATE      STATE       TARGET
ovs-snmp@10.130.11.201.service  f1630e5a6280f52dc2d842d384569e1c36617bd7    launched    launched    5f3cf828f82fdf825465c1355698e4d1/10.130.11.201
root@o11n201:~# fleetctl list-unit-files --full
UNIT                HASH                        DSTATE      STATE       TARGET
ovs-snmp@10.130.11.201.service  f1630e5a6280f52dc2d842d384569e1c36617bd7    launched    launched    5f3cf828f82fdf825465c1355698e4d1/10.130.11.201
root@o11n201:~# fleetctl list-unit-files --full
UNIT                HASH                        DSTATE      STATE       TARGET
ovs-snmp@10.130.11.201.service  f1630e5a6280f52dc2d842d384569e1c36617bd7    launched    launched    5f3cf828f82fdf825465c1355698e4d1/10.130.11.201
root@o11n201:~# fleetctl list-unit-files --full
UNIT                HASH                        DSTATE      STATE       TARGET
ovs-snmp@10.130.11.201.service  d0103a37319c73e0616f609cb84923ab23970857    launched    launched    5f3cf828f82fdf825465c1355698e4d1/10.130.11.201

@eduardBM
Author

Unit file keeps changing layout

root@o11n203:~# fleetctl cat ovs-snmp@10.130.11.201.service
[Service]
Type=simple
Environment=PYTHONPATH=/opt/OpenvStorage
WorkingDirectory=/opt/OpenvStorage
ExecStart=/usr/bin/python2 /opt/OpenvStorage/ovs/extensions/snmp/ovssnmpserver.py --port 161
Restart=on-failure
RestartSec=5
TimeoutStopSec=60

[Install]
WantedBy=multi-user.target

[X-Fleet]
MachineID=5f3cf828f82fdf825465c1355698e4d1

[Unit]
Description=ovs snmp server
Requires=ovs-watcher-framework.service
After=ovs-watcher-framework.service
root@o11n203:~# fleetctl cat ovs-snmp@10.130.11.201.service
[Unit]
Description=ovs snmp server
Requires=ovs-watcher-framework.service
After=ovs-watcher-framework.service

[Service]
Type=simple
Environment=PYTHONPATH=/opt/OpenvStorage
WorkingDirectory=/opt/OpenvStorage
ExecStart=/usr/bin/python2 /opt/OpenvStorage/ovs/extensions/snmp/ovssnmpserver.py --port 161
Restart=on-failure
RestartSec=5
TimeoutStopSec=60

[Install]
WantedBy=multi-user.target

[X-Fleet]
MachineID=5f3cf828f82fdf825465c1355698e4d1
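
The two listings above contain the same sections and options; only the position of [Unit] differs. If the unit hash is a digest over the serialized text (an assumption here, although the 40-hex-character values in this thread are consistent with SHA-1), reordering alone is enough to change it. A minimal Go sketch of that effect follows. The helper contentHash and the shortened unit texts are made up for illustration and are not fleet code:

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// contentHash returns the hex SHA-1 digest of the raw unit text.
// Hypothetical helper for illustration; not taken from fleet.
func contentHash(unit string) string {
	return fmt.Sprintf("%x", sha1.Sum([]byte(unit)))
}

// Two serializations of the same (shortened) unit. The sections and
// options are identical; only the section order differs.
const unitFirst = "[Unit]\nDescription=ovs snmp server\n\n[Service]\nType=simple\n"
const serviceFirst = "[Service]\nType=simple\n\n[Unit]\nDescription=ovs snmp server\n"

func main() {
	fmt.Println("unit first:   ", contentHash(unitFirst))
	fmt.Println("service first:", contentHash(serviceFirst))
	fmt.Println("hashes equal: ", contentHash(unitFirst) == contentHash(serviceFirst))
}
```

A byte-level digest cannot tell a semantic change from a cosmetic one, so any component that compares such hashes would treat a reordered unit as a different unit.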

@jonboulle jonboulle added this to the v0.12.0 milestone Jan 19, 2016
@kayrus
Contributor

kayrus commented Jan 26, 2016

@eduardBM Can you provide your etcd configuration, i.e. the output of etcdctl member list and etcdctl cluster-health?

@eduardBM
Author

@kayrus
Hi
Retested with a single node (to exclude issues with the cluster) and on Ubuntu 15.10 (to exclude issues with the Ubuntu development version, 16.04).
Reproduced:

  • Ubuntu 15.10, single node
  • installed:
wget  http://launchpadlibrarian.net/223781734/fleet_0.11.5+dfsg-1_amd64.deb
apt-get install golang-go -y
dpkg -i fleet_0.11.5+dfsg-1_amd64.deb
git clone https://github.com/cnelson/python-fleet
cd python-fleet && python setup.py install
  • created service
fleetctl list-units
UNIT                        MACHINE             ACTIVE      SUB
...
ovs-support-agent@10.130.11.201.service     e0ed4c9b.../10.130.11.201   active      running
...
  • unit file
fleetctl list-unit-files
ovs-support-agent@10.130.11.201.service     a73d54f launched    launched    e0ed4c9b.../10.130.11.201
...
  • couple of seconds later
ovs-support-agent@10.130.11.201.service     50dc3f0 launched    launched    e0ed4c9b.../10.130.11.201

Requested information:

etcdctl cluster-health
member c13b2ac7dd38d0e5 is healthy: got healthy result from http://10.130.11.201:2379
cluster is healthy
etcdctl member list
c13b2ac7dd38d0e5: name=unqjMrez0JhNaaCk peerURLs=http://10.130.11.201:2380 clientURLs=http://10.130.11.201:2379

Thanks,
Eduard

@kayrus
Contributor

kayrus commented Feb 10, 2016

@eduardBM how did you submit this unit?

@jonboulle
Contributor

fleet currently uses go-systemd @ cf3cdf77462baaad163ad2d5d1984b9c1b493701:

"Rev": "cf3cdf77462baaad163ad2d5d1984b9c1b493701"

coreos/go-systemd@cf3cdf7

The first thing to check is whether that version of go-systemd guaranteed stable serialisation of units.

(There have been a number of changes since then, e.g. coreos/go-systemd@6654289 and coreos/go-systemd@3130945, but that won't be relevant until #1375 lands)

@eduardBM
Author

@kayrus
I used both the CLI command "fleetctl load X.service" (where X.service is a local file containing the unit) and the Python client (https://github.com/cnelson/python-fleet) "fleet_client.create_unit... ".

@eduardBM
Author

@jonboulle
I'm not sure how to check that.

@jonboulle
Contributor

@eduardBM sorry, that was a note for @kayrus or @tixxdz to look into it, not you :-)

@kayrus
Contributor

kayrus commented Feb 10, 2016

@eduardBM can you reproduce this behavior with a plain fleetctl start unitfile.service?

@eduardBM
Author

@kayrus, yes, I was able to reproduce it with the CLI command.
I will reinstall and retest tomorrow with the latest code.
Later edit:
reinstalled Ubuntu 15.10 with the fleet 0.11.5 deb package

root@o11n203:~# fleetctl load ovs-snmp.service 
Unit ovs-snmp.service inactive

Unit ovs-snmp.service loaded on c1a32b6f.../10.130.11.203
root@o11n203:~# 
root@o11n203:~# fleetctl start ovs-snmp.service 
WARNING: Unit ovs-snmp.service in registry differs from local unit file ovs-snmp.service
Unit ovs-snmp.service launched on c1a32b6f.../10.130.11.203

root@o11n203:~# fleetctl status ovs-snmp.service 
● ovs-snmp.service - ovs snmp server
   Loaded: loaded (/run/fleet/units/ovs-snmp.service; enabled; vendor preset: enabled)
...snip...

root@o11n203:~# fleetctl list-unit-files
UNIT            HASH    DSTATE      STATE       TARGET
ovs-snmp.service    e89d030 launched    launched    c1a32b6f.../10.130.11.203
root@o11n203:~# fleetctl list-unit-files
UNIT            HASH    DSTATE      STATE       TARGET
ovs-snmp.service    9359df7 launched    launched    c1a32b6f.../10.130.11.203

Unit HASH changes every couple of seconds.

@kayrus
Contributor

kayrus commented Feb 11, 2016

Here is how I reproduced the issue on an Ubuntu 16.04 VM:

$ apt-get install build-essential golang-go python-dev pipexec fleet etcd -y
$ dpkg -i fleet_0.11.5+dfsg-1_amd64.deb etcd_2.2.3+dfsg-1_amd64.deb
$ git clone https://github.com/cnelson/python-fleet
$ cd python-fleet && python setup.py install
$ systemctl start fleet
$ fleetctl list-machines --full
MACHINE                                 IP              METADATA
22e41ca31da38ea956ade49269567301        192.168.122.147  -
$ cat app.service
[Unit]
Description=simple app

[Service]
Type=simple
ExecStart=/bin/bash -c 'while true; do echo Hello, World; sleep 1; done'
Restart=on-failure
RestartSec=5
TimeoutStopSec=60

[Install]
WantedBy=multi-user.target

[X-Fleet]
MachineID=22e41ca31da38ea956ade49269567301
$ fleetctl start app
Unit app.service inactive
Unit app.service launched on 22e41ca3.../192.168.122.147
$ fleetctl start app
WARNING: Unit app.service in registry differs from local unit file app.service
$ fleetctl start app
WARNING: Unit app.service in registry differs from local unit file app.service
$ fleetctl start app
$ fleetctl start app
$ fleetctl start app
WARNING: Unit app.service in registry differs from local unit file app.service
$ sha1sum app.service
fb30d997678396b3938b82ffcef1b5ac4d214e0a  app.service
$ etcdctl ls --recursive /_coreos.com/fleet/unit
/_coreos.com/fleet/unit/39e247ee21fae23bf0f1a907d33f249d6f90d27f

And the normal behavior should be:

$ sha1sum app.service
fb30d997678396b3938b82ffcef1b5ac4d214e0a  app.service
$ etcdctl ls --recursive /_coreos.com/fleet/unit
/_coreos.com/fleet/unit/fb30d997678396b3938b82ffcef1b5ac4d214e0a
$ etcdctl get /_coreos.com/fleet/job/app.service/object | sed -r 's/.*\[([,0-9]+)\].*/\1/g' | tr ',' '\n' | xargs -I{} printf "%02x" {} && echo
fb30d997678396b3938b82ffcef1b5ac4d214e0a
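
The pipeline above takes the hash stored in the job object as a list of decimal byte values and prints it as hex, so it can be compared with the sha1sum output. The same conversion can be sketched in Go. The function name bytesListToHex and the sample unit text are assumptions made for this example, and the sketch only handles the comma-separated list, not the surrounding JSON:

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
)

// bytesListToHex converts a comma-separated list of decimal byte values,
// such as "251,48,10", into a zero-padded lowercase hex string.
// Hypothetical helper for illustration; not part of fleet or etcdctl.
func bytesListToHex(list string) (string, error) {
	parts := strings.Split(list, ",")
	raw := make([]byte, 0, len(parts))
	for _, p := range parts {
		// bitSize 8 rejects anything outside 0..255.
		n, err := strconv.ParseUint(strings.TrimSpace(p), 10, 8)
		if err != nil {
			return "", fmt.Errorf("invalid byte value %q: %w", p, err)
		}
		raw = append(raw, byte(n))
	}
	return hex.EncodeToString(raw), nil
}

func main() {
	// Made-up unit text; any content works for the round trip.
	unit := []byte("[Unit]\nDescription=simple app\n")
	sum := sha1.Sum(unit)

	// Render the digest as a decimal list, the form the sed expression extracts.
	decimals := make([]string, len(sum))
	for i, b := range sum {
		decimals[i] = strconv.Itoa(int(b))
	}

	got, err := bytesListToHex(strings.Join(decimals, ","))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("decoded:", got)
	fmt.Println("matches sha1 of content:", got == hex.EncodeToString(sum[:]))
}
```

Zero-padding each byte matters: a value such as 10 has to become "0a", otherwise the result is shorter than 40 characters and never matches.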

I've added extra debug code to the Ubuntu package:

if err == nil && luf.Hash() != suf.Hash() {
  stderr("ERROR: %s", err)
  stderr("WARNING: %s != %s", luf.Hash(), suf.Hash())
  stderr("WARNING: Unit %s in registry differs from local unit file %s", su.Name, loc)
  return
}

And here is the result:

fleetctl start app.service 
ERROR: %!s(<nil>)
WARNING: 55651717f8a83c2815ff428b328bf26000135988 != 28887a7ede907207d9ea8a8e8dbda1d9d98b1e8a
WARNING: Unit app.service in registry differs from local unit file app.service
root@ubuntu1:~# fleetctl start app.service 
ERROR: %!s(<nil>)
WARNING: 28887a7ede907207d9ea8a8e8dbda1d9d98b1e8a != 28887a7ede907207d9ea8a8e8dbda1d9d98b1e8a
WARNING: Unit app.service in registry differs from local unit file app.service
root@ubuntu1:~# fleetctl start app.service 
ERROR: %!s(<nil>)
WARNING: 39e247ee21fae23bf0f1a907d33f249d6f90d27f != 28887a7ede907207d9ea8a8e8dbda1d9d98b1e8a
WARNING: Unit app.service in registry differs from local unit file app.service
root@ubuntu1:~# fleetctl start app.service 
ERROR: %!s(<nil>)
WARNING: 55651717f8a83c2815ff428b328bf26000135988 != 55651717f8a83c2815ff428b328bf26000135988
WARNING: Unit app.service in registry differs from local unit file app.service

Sometimes when you submit a new unit, fleet just returns this:

$ fleetctl start app
2016/02/11 11:50:54 WARN fleetctl.go:801: Error retrieving Unit(app.service) from Registry: googleapi: got HTTP response code 500 with body: {"error":{"code":500,"message":""}}

It looks like the desired unit's structure is serialized in a different order on every loop. This bug occurs only in the Ubuntu Xenial build; if you build fleet in a Docker environment using ./build-docker, it works as expected.

@antrik
Contributor

antrik commented Feb 23, 2016

Now that go-systemd has been bumped in fleet master, could you please check whether this issue persists?

@antrik
Contributor

antrik commented Feb 23, 2016

Oh, I think this wasn't entirely clear from previous discussion (at least not to me): The problem manifests only with the Ubuntu packaging for the (yet unreleased) "xenial" distribution (i.e. 16.04). A "manual" build (using docker) works fine, even on Ubuntu xenial.

Considering that, should we still consider it part of the release milestone?...

@eduardBM
Author

@antrik
I will check first thing in the morning if this issue persists.
For clarification: I was using the fleet 0.11.5 deb package on Ubuntu 15.10, so if a new package is available I can test with that; otherwise I need to wait until one is available.

@antrik
Contributor

antrik commented Feb 23, 2016

@eduardBM no, there is no new package; just an update in the upstream master branch. Also, it turns out this is actually unlikely to change anything...

BTW, is this issue known to the Ubuntu packagers? It seems to me that in this instance they might actually be more qualified to investigate it...

@eduardBM
Author

@antrik
If there's no new package then I have to build fleet manually from git (I'll try that later).

I'm not sure whether this issue is known to the Ubuntu packagers. How do I get their attention?

@kayrus
Contributor

kayrus commented Feb 24, 2016

@eduardBM we still don't know exactly what causes this bug (code in fleet, Go dependencies or something else), but you are free to report a bug here: https://bugs.launchpad.net/ubuntu/+source/fleet/+filebug

@antrik
Contributor

antrik commented Feb 24, 2016

@eduardBM you can use the "reportbug" command line tool, or the web interface kayrus linked if you prefer. Either way, be sure to include a link to this discussion :-)

As for building it manually, this should help if you need a working version urgently. (Use the "build-docker" script for a known working result.) Otherwise you might want to wait for feedback from the Ubuntu packagers...

@kayrus
Contributor

kayrus commented Apr 11, 2016

I've just compiled and tested this fleet package:
https://packages.debian.org/sid/fleet
with the golang-github-coreos-go-systemd-dev=v5 and golang-dbus-dev=v3 dependencies.

The bug does not occur with this build.

The default go-systemd package in Ubuntu Xenial is v3. It looks like go-systemd should be bumped.

@kayrus
Contributor

kayrus commented Apr 11, 2016

@eduardBM go-systemd v4 has a fix for the serialization order: coreos/go-systemd@3130945

@kayrus
Contributor

kayrus commented Apr 11, 2016

@kayrus
Contributor

kayrus commented Apr 12, 2016

Consider the matter closed

@kayrus kayrus closed this as completed Apr 12, 2016
@eduardBM
Author

@kayrus Thanks for your help, I'll keep an eye on the other bug report.
