Skip to content
This repository has been archived by the owner on Jan 30, 2020. It is now read-only.

usability: garbage collection in /_coreos.com/fleet/job registry directory #1456

Open
kayrus opened this issue Mar 2, 2016 · 6 comments
Open

Comments

@kayrus
Copy link
Contributor

kayrus commented Mar 2, 2016

steps to reproduce:

  1. check etcd registry:
$ etcdctl ls --recursive /_coreos.com
/_coreos.com/fleet
/_coreos.com/fleet/machines
/_coreos.com/fleet/machines/78b98c5dfe468973b93e098a5a5829ea
/_coreos.com/fleet/machines/78b98c5dfe468973b93e098a5a5829ea/object
/_coreos.com/fleet/machines/f8f4ec94d4193d2f160e8bd22ed360be
/_coreos.com/fleet/machines/f8f4ec94d4193d2f160e8bd22ed360be/object
/_coreos.com/fleet/machines/0903835ba13faf7f62cf5d8ddc7812de
/_coreos.com/fleet/machines/0903835ba13faf7f62cf5d8ddc7812de/object
/_coreos.com/fleet/engine
/_coreos.com/fleet/engine/version
/_coreos.com/fleet/lease
/_coreos.com/fleet/lease/engine-leader
  1. create template unit:
[Unit]
Description=My Service

[Service]
TimeoutStartSec=0
ExecStart=/bin/sh -c "trap 'exit 0' INT TERM; while true; do echo Hello World %i; sleep 10; done"
  1. submit and start units:
$ fleetctl submit hello@.service          
Unit hello@.service inactive
$ fleetctl start hello@{1..10}.service      
Unit hello@1.service inactive
Unit hello@2.service inactive
Unit hello@3.service inactive
Unit hello@4.service inactive
Unit hello@5.service inactive
Unit hello@6.service inactive
Unit hello@7.service inactive
Unit hello@8.service inactive
Unit hello@9.service inactive
Unit hello@10.service inactive
Unit hello@2.service launched
Unit hello@4.service launched
Unit hello@5.service launched
Unit hello@1.service launched on 78b98c5d.../192.168.122.188
Unit hello@3.service launched on f8f4ec94.../192.168.122.135
Unit hello@6.service launched on f8f4ec94.../192.168.122.135
Unit hello@7.service launched on 78b98c5d.../192.168.122.188
Unit hello@8.service launched on 0903835b.../192.168.122.166
Unit hello@10.service launched on f8f4ec94.../192.168.122.135
Unit hello@9.service launched on 0903835b.../192.168.122.166
  1. destroy units:
$ fleetctl list-unit-files --no-legend | awk '/hello@[0-9]+/ {print "/usr/bin/fleetctl destroy "$1}' | sh -s
Destroyed hello@1.service
Destroyed hello@10.service
Destroyed hello@2.service
Destroyed hello@3.service
Destroyed hello@4.service
Destroyed hello@5.service
Destroyed hello@6.service
Destroyed hello@7.service
Destroyed hello@8.service
Destroyed hello@9.service
  1. list etcd registry, some units are still there, but fleet doesn't see them:
$ etcdctl ls --recursive --sort /_coreos.com/fleet/job
/_coreos.com/fleet/job/hello@10.service
/_coreos.com/fleet/job/hello@3.service
/_coreos.com/fleet/job/hello@5.service
/_coreos.com/fleet/job/hello@5.service/target
/_coreos.com/fleet/job/hello@7.service
/_coreos.com/fleet/job/hello@7.service/target
/_coreos.com/fleet/job/hello@8.service
/_coreos.com/fleet/job/hello@8.service/job-state
/_coreos.com/fleet/job/hello@9.service
$ fleetctl list-unit-files
UNIT    HASH    DSTATE  STATE   TARGET
$ fleetctl list-units     
UNIT    MACHINE ACTIVE  SUB
  1. systemd still has some unit files in memory:
$ systemctl status hello@1
● hello@1.service
   Loaded: not-found (Reason: No such file or directory)
   Active: inactive (dead)

Mar 02 13:00:19 coreos1 sh[10821]: Hello World 1
Mar 02 13:00:29 coreos1 sh[10821]: Hello World 1
Mar 02 13:00:39 coreos1 sh[10821]: Hello World 1
Mar 02 13:00:49 coreos1 sh[10821]: Hello World 1
Mar 02 13:00:59 coreos1 sh[10821]: Hello World 1
Mar 02 13:01:09 coreos1 sh[10821]: Hello World 1
Mar 02 13:01:19 coreos1 sh[10821]: Hello World 1
Mar 02 13:01:23 coreos1 systemd[1]: Stopping My Service...
Mar 02 13:01:23 coreos1 systemd[1]: Stopped My Service.
Mar 02 13:01:23 coreos1 sh[10821]: Terminated

relates to #1344, #1433 PRs.

/cc @jonboulle @tixxdz @antrik @reneengelhard

@mischief
Copy link
Contributor

mischief commented Mar 2, 2016

i dont think systemctl status hello@1 shows that systemd has the unit 'in memory' it shows that the journald contains some entries for _SYSTEMD_UNIT=hello@1.service.

@kayrus
Copy link
Contributor Author

kayrus commented Mar 2, 2016

@mischief ok, thus we can skip that.

@kayrus
Copy link
Contributor Author

kayrus commented Mar 11, 2016

In addition when you destroy unit, it still exists inside the etcd registry, i.e. /_coreos.com/fleet/unit/79699e32624ad46c366ef37de89006ae9da617cd

This function doesn't remove hash: https://github.com/coreos/fleet/blob/master/registry/job.go#L302

probably it should. @jonboulle why this function doesn't remove hashedUnitPath?

getUnitByHash is also called by dirToUnit and getUnitFromObjectNode.

@dongsupark
Copy link
Contributor

@kayrus Thanks for filing this bug and following it up. During looking into other issues, I stumbled across this case as well. Next week I'm going to have a look.

@dongsupark
Copy link
Contributor

@kayrus It turns out, this is apparently not a new topic. And someone already submitted a patch, which has never been merged. #1291
Frankly speaking, I don't see a reason not to merge it. From the current design, units are stored with hash of the unit content as name in the registry. Hash collision could happen regardless of that change.

@tixxdz
Copy link
Contributor

tixxdz commented Mar 18, 2016

@kayrus @mischief not sure I got you, but as you see from that example "systemctl status" shows the state of unit and if it's active sure it's in memory!! and of course journald entries can be old and do not reflect that current state...

dongsupark pushed a commit to endocode/fleet that referenced this issue Mar 18, 2016
So far each command "fleetctl destroy unit" has removed job entries from
the etcd registry, under /_coreos.com/fleet/job. But it has not removed
its unit file, under /_coreos.com/fleet/unit. As a result, fleet left
lots of garbages in the etcd registry, so users had to manually clean
them up.

So this patch gets unit contents deleted actually from etcd registry
when DestroyUnit() gets called. To avoid potential hash collisions,
it first fetches a list of units from registry, to check there's any
duplicated entry. Only if no duplicated unit is found, fleetd actually
deletes the unit from registry.

Fixes: coreos#1456
Fixes: coreos#1290
Reference: coreos#1291
dongsupark pushed a commit to endocode/fleet that referenced this issue Mar 21, 2016
So far each command "fleetctl destroy unit" has removed job entries from
the etcd registry, under /_coreos.com/fleet/job. But it has not removed
its unit file, under /_coreos.com/fleet/unit. As a result, fleet left
lots of garbages in the etcd registry, so users had to manually clean
them up.

So this patch gets unit contents deleted actually from etcd registry
when DestroyUnit() gets called. To avoid potential hash collisions,
it first fetches a list of units from registry, to check there's any
duplicated entry. Only if no duplicated unit is found, fleetd actually
deletes the unit from registry.

Fixes: coreos#1456
Fixes: coreos#1290
Reference: coreos#1291
dongsupark pushed a commit to endocode/fleet that referenced this issue Mar 24, 2016
So far each command "fleetctl destroy unit" has removed job entries from
the etcd registry, under /_coreos.com/fleet/job. But it has not removed
its unit file, under /_coreos.com/fleet/unit. As a result, fleet left
lots of garbages in the etcd registry, so users had to manually clean
them up.

So this patch gets unit contents deleted actually from etcd registry
when DestroyUnit() gets called. To avoid potential hash collisions,
it first fetches a list of units from registry, to check there's any
duplicated entry. Only if no duplicated unit is found, fleetd actually
deletes the unit from registry.

Fixes: coreos#1456
Fixes: coreos#1290
Reference: coreos#1291
dongsupark pushed a commit to endocode/fleet that referenced this issue Mar 24, 2016
So far each command "fleetctl destroy unit" has removed job entries from
the etcd registry, under /_coreos.com/fleet/job. But it has not removed
its unit file, under /_coreos.com/fleet/unit. As a result, fleet left
lots of garbages in the etcd registry, so users had to manually clean
them up.

So this patch gets unit contents deleted actually from etcd registry
when DestroyUnit() gets called. To avoid potential hash collisions,
it first fetches a list of units from registry, to check there's any
duplicated entry. Only if no duplicated unit is found, fleetd actually
deletes the unit from registry.

Fixes: coreos#1456
Fixes: coreos#1290
Reference: coreos#1291
dongsupark pushed a commit to endocode/fleet that referenced this issue Apr 4, 2016
So far each command "fleetctl destroy unit" has removed job entries from
the etcd registry, under /_coreos.com/fleet/job. But it has not removed
its unit file, under /_coreos.com/fleet/unit. As a result, fleet left
lots of garbages in the etcd registry, so users had to manually clean
them up.

So this patch gets unit contents deleted actually from etcd registry
when DestroyUnit() gets called. To avoid potential hash collisions,
it first fetches a list of units from registry, to check there's any
duplicated entry. Only if no duplicated unit is found, fleetd actually
deletes the unit from registry.

Fixes: coreos#1456
Fixes: coreos#1290
Reference: coreos#1291
dongsupark pushed a commit to endocode/fleet that referenced this issue Apr 19, 2016
So far each command "fleetctl destroy unit" has removed job entries from
the etcd registry, under /_coreos.com/fleet/job. But it has not removed
its unit file, under /_coreos.com/fleet/unit. As a result, fleet left
lots of garbages in the etcd registry, so users had to manually clean
them up.

So this patch gets unit contents deleted actually from etcd registry
when DestroyUnit() gets called. To avoid potential hash collisions,
it first fetches a list of units from registry, to check there's any
duplicated entry. Only if no duplicated unit is found, fleetd actually
deletes the unit from registry.

Fixes: coreos#1456
Fixes: coreos#1290
Reference: coreos#1291
dongsupark pushed a commit to endocode/fleet that referenced this issue Apr 25, 2016
So far each command "fleetctl destroy unit" has removed job entries from
the etcd registry, under /_coreos.com/fleet/job. But it has not removed
its unit file, under /_coreos.com/fleet/unit. As a result, fleet left
lots of garbages in the etcd registry, so users had to manually clean
them up.

So this patch gets unit contents deleted actually from etcd registry
when DestroyUnit() gets called. To avoid potential hash collisions,
it first fetches a list of units from registry, to check there's any
duplicated entry. Only if no duplicated unit is found, fleetd actually
deletes the unit from registry.

Fixes: coreos#1456
Fixes: coreos#1290
Reference: coreos#1291
dongsupark pushed a commit to endocode/fleet that referenced this issue Apr 29, 2016
So far each command "fleetctl destroy unit" has removed job entries from
the etcd registry, under /_coreos.com/fleet/job. But it has not removed
its unit file, under /_coreos.com/fleet/unit. As a result, fleet left
lots of garbages in the etcd registry, so users had to manually clean
them up.

So this patch gets unit contents deleted actually from etcd registry
when DestroyUnit() gets called. To avoid potential hash collisions,
it first fetches a list of units from registry, to check there's any
duplicated entry. Only if no duplicated unit is found, fleetd actually
deletes the unit from registry.

Fixes: coreos#1456
Fixes: coreos#1290
Reference: coreos#1291
dongsupark pushed a commit to dongsupark/fleet that referenced this issue May 26, 2016
So far each command "fleetctl destroy unit" has removed job entries from
the etcd registry, under /_coreos.com/fleet/job. But it has not removed
its unit file, under /_coreos.com/fleet/unit. As a result, fleet left
lots of garbages in the etcd registry, so users had to manually clean
them up.

So this patch gets unit contents deleted actually from etcd registry
when DestroyUnit() gets called. To avoid potential hash collisions,
it first fetches a list of units from registry, to check there's any
duplicated entry. Only if no duplicated unit is found, fleetd actually
deletes the unit from registry.

Fixes: coreos#1456
Fixes: coreos#1290
Reference: coreos#1291
dongsupark pushed a commit to endocode/fleet that referenced this issue Jul 21, 2016
So far each command "fleetctl destroy unit" has removed job entries from
the etcd registry, under /_coreos.com/fleet/job. But it has not removed
its unit file, under /_coreos.com/fleet/unit. As a result, fleet left
lots of garbages in the etcd registry, so users had to manually clean
them up.

So this patch gets unit contents deleted actually from etcd registry
when DestroyUnit() gets called. To avoid potential hash collisions,
it first fetches a list of units from registry, to check there's any
duplicated entry. Only if no duplicated unit is found, fleetd actually
deletes the unit from registry.

Fixes: coreos#1456
Fixes: coreos#1290
Reference: coreos#1291
dongsupark pushed a commit to endocode/fleet that referenced this issue Aug 12, 2016
So far each command "fleetctl destroy unit" has removed job entries from
the etcd registry, under /_coreos.com/fleet/job. But it has not removed
its unit file, under /_coreos.com/fleet/unit. As a result, fleet left
lots of garbages in the etcd registry, so users had to manually clean
them up.

So this patch gets unit contents deleted actually from etcd registry
when DestroyUnit() gets called. To avoid potential hash collisions,
it first fetches a list of units from registry, to check there's any
duplicated entry. Only if no duplicated unit is found, fleetd actually
deletes the unit from registry.

Fixes: coreos#1456
Fixes: coreos#1290
Reference: coreos#1291
dongsupark pushed a commit to endocode/fleet that referenced this issue Aug 31, 2016
So far each command "fleetctl destroy unit" has removed job entries from
the etcd registry, under /_coreos.com/fleet/job. But it has not removed
its unit file, under /_coreos.com/fleet/unit. As a result, fleet left
lots of garbages in the etcd registry, so users had to manually clean
them up.

So this patch gets unit contents deleted actually from etcd registry
when DestroyUnit() gets called. To avoid potential hash collisions,
it first fetches a list of units from registry, to check there's any
duplicated entry. Only if no duplicated unit is found, fleetd actually
deletes the unit from registry.

Fixes: coreos#1456
Fixes: coreos#1290
Reference: coreos#1291
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants