
Deployment objects are not garbage collected #3244

Closed

wuub opened this issue Sep 18, 2017 · 17 comments

Comments

@wuub
Contributor

wuub commented Sep 18, 2017

Nomad version

0.6.3

Issue

As mentioned in #3157, during normal use the number of objects returned by nomad deployment list increases without any apparent bound.

Reproduction steps

  1. Use a Nomad cluster for a while.
$ nomad deployment list | wc -l            
1224
  2. Cry.

While I do not see any performance or stability hit as of now, I'm starting to get interested in knowing where we will start to observe some kind of cluster degradation. Never? In 5 minutes? At 2K/10K/100K/1M/10M deployments?
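
For anyone who wants to cross-check that count against the HTTP API directly, a rough sketch (assumes NOMAD_ADDR points at a server and jq is available; note the CLI count above includes a header line):

$ # total number of deployment objects known to the cluster
$ curl -s $NOMAD_ADDR/v1/deployments | jq length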

@hsmade
Contributor

hsmade commented Sep 21, 2017

I have 100K deployments, and pulling them from the API returns about 100MB of data... (0.6.0 here)
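
A quick way to measure roughly how much data that listing returns (a sketch; assumes NOMAD_ADDR is set to your server address):

$ # size in bytes of the full deployments listing
$ curl -s $NOMAD_ADDR/v1/deployments | wc -c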

@dadgar
Contributor

dadgar commented Sep 21, 2017

Can you all share a sample deployment that won't GC?

@wuub
Contributor Author

wuub commented Sep 21, 2017

@dadgar are you sure GC for deployments is implemented? Because AFAICT all of ours are just piling up, and there's nothing special about them: 99% have a super simple update block with a single max_parallel statement.

@dadgar
Contributor

dadgar commented Sep 21, 2017

Yes, it is implemented. A deployment won't be garbage collected while there is a running allocation referencing it.
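
If you want to check whether a particular deployment is still pinned by a running allocation, the deployment allocations endpoint should show this (a sketch; <deployment-id> is a placeholder, and jq is assumed to be available):

$ # list allocations referencing a deployment, with their client status
$ curl -s $NOMAD_ADDR/v1/deployment/allocations/<deployment-id> | jq '.[] | {ID, ClientStatus}'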

@wuub
Contributor Author

wuub commented Sep 21, 2017

That's interesting :). I'll try to see if any of our job specs trigger this on a dev cluster.

Is the GC event-driven, triggering immediately after the last allocation is removed, or do I have to wait for a periodic clean-up?

@dadgar
Contributor

dadgar commented Sep 21, 2017

You would have to wait for a periodic clean-up, or you can run curl -XPUT http://127.0.0.1:4646/v1/system/gc
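
For example, to confirm that a forced GC actually reclaims deployments (assumes NOMAD_ADDR is set; the counts include a header line):

$ nomad deployment list | wc -l
$ # force a garbage collection cycle on the servers
$ curl -XPUT $NOMAD_ADDR/v1/system/gc
$ nomad deployment list | wc -l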

@jippi
Contributor

jippi commented Sep 21, 2017

In hashi-ui that's also a button under System :)

@wuub
Contributor Author

wuub commented Sep 21, 2017

Thanks. I'll try to investigate deeper first thing tomorrow morning.

@wuub
Contributor Author

wuub commented Sep 22, 2017

Soooo. I sent curl -XPUT $NOMAD_ADDR/v1/system/gc to our prod cluster, and the deployments list shrank significantly (~1500 -> 159) (cc: @hsmade)

The only jobs that have more than one deployment share a job version, so it's most likely that the newer ones were just moving a subset of allocations due to node failure.

BUT.

Any reason why GC is not running on its own?

EDIT/UPDATE:

After the one forced GC, no other cleanup has run for the past 7h+:

183cb4ea  jobjobjob-stg   21   failed      Failed due to unhealthy allocations
056cb247  jobjobjob-stg   20   successful  Deployment completed successfully
83b31064  jobjobjob-stg   18   cancelled   Cancelled because job is stopped
1c020421  jobjobjob-stg   16   failed      Failed due to unhealthy allocations
112b1b50  jobjobjob-stg   14   failed      Failed due to unhealthy allocations
1c80f6dc  jobjobjob-stg   13   successful  Deployment completed successfully
834da81a  jobjobjob-stg   13   successful  Deployment completed successfully
ca3894e6  jobjobjob-stg   13   successful  Deployment completed successfully
f7fa35b9  jobjobjob-stg   13   successful  Deployment completed successfully

@dadgar
Contributor

dadgar commented Sep 25, 2017

@wuub That is the bug :( The loop creating the GC jobs wasn't doing it for the deployments. Will get a fix out soon!

@hsmade Can you run the force and check it clears a bunch of your deployments?
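
Until the fix ships, a stopgap is to force GC on a schedule, e.g. a cron entry on one of the servers (a sketch; assumes the default local API address and no ACLs):

# force a Nomad GC cycle every 30 minutes
*/30 * * * * curl -s -X PUT http://127.0.0.1:4646/v1/system/gc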

@wuub
Contributor Author

wuub commented Sep 25, 2017

Great news :) thank you @dadgar

@hsmade
Contributor

hsmade commented Sep 25, 2017

@dadgar it does clean out when forcing GC, thx!

@dadgar
Contributor

dadgar commented Sep 25, 2017

@hsmade Sweet! Thanks both of you! Will be fixed in 0.7!

@hsmade
Contributor

hsmade commented Sep 25, 2017

BTW.. a note on the huge data transfer I saw: this was mainly caused by us configuring reserved ports (20000-32000) because of a misunderstanding of the docs (we've fixed the docs since then). If you have 64 clients, and each of them produces about 1MB of data for just the port reservations, that adds up :)
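
For reference, a rough way to see how much a single node contributes to that transfer (a sketch; <node-id> is a placeholder taken from nomad node-status, and NOMAD_ADDR is assumed to be set):

$ # size in bytes of one node's API object, including its materialized reserved ports
$ curl -s $NOMAD_ADDR/v1/node/<node-id> | wc -c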

@dadgar
Contributor

dadgar commented Sep 25, 2017

@hsmade Yeah, that is definitely something we are aware of; we would like to make it a simple pair of integers for a range rather than materializing each port!

@hsmade
Contributor

hsmade commented Sep 26, 2017

hehe, yes :)
Luckily, I don't actually need it, so that was a simple(ish) fix. Apart from the rolling restart :P (Which works great: I can just restart my cluster without affecting / stopping my containers. Kudos for that!!)

@github-actions

github-actions bot commented Dec 7, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 7, 2022