
Deployment objects are not garbage collected #3244

Closed

wuub opened this issue Sep 18, 2017 · 17 comments

Comments

@wuub
Contributor

wuub commented Sep 18, 2017

Nomad version

0.6.3

Issue

As mentioned in #3157, during normal use the number of objects returned by nomad deployment list increases without any apparent bound.

Reproduction steps

  1. Use a Nomad cluster for a while.
$ nomad deployment list | wc -l            
1224
  2. Cry.

While I do not see any performance or stability hit as of now, I'm starting to get interested in knowing where we will start to observe some kind of cluster degradation. Never? In 5 minutes? At 2K/10K/100K/1M/10M deployments?
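
For anyone who wants to cross-check that count against the HTTP API directly, a rough sketch (assumes NOMAD_ADDR points at a server and jq is available; note the CLI count above includes a header line):

$ # total number of deployment objects known to the cluster
$ curl -s $NOMAD_ADDR/v1/deployments | jq length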

@hsmade
Contributor

hsmade commented Sep 21, 2017

I have 100K deployments, and pulling them from the API returns about 100MB of data... (0.6.0 here)
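
A quick way to measure roughly how much data that listing returns (a sketch; assumes NOMAD_ADDR is set to your server address):

$ # size in bytes of the full deployments listing
$ curl -s $NOMAD_ADDR/v1/deployments | wc -c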

@dadgar
Contributor

dadgar commented Sep 21, 2017

Can you all share a sample deployment that won't GC?

@wuub
Contributor Author

wuub commented Sep 21, 2017

@dadgar are you sure GC for deployments is implemented? Because AFAICT all of ours are just piling up, and there's nothing special about them: 99% have a super simple update block with a single max_parallel statement.

@dadgar
Contributor

dadgar commented Sep 21, 2017

Yes, it is implemented. A deployment won't be garbage collected while there is a running allocation referencing it.
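
If you want to check whether a particular deployment is still pinned by a running allocation, the deployment allocations endpoint should show this (a sketch; <deployment-id> is a placeholder, and jq is assumed to be available):

$ # list allocations referencing a deployment, with their client status
$ curl -s $NOMAD_ADDR/v1/deployment/allocations/<deployment-id> | jq '.[] | {ID, ClientStatus}'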

@wuub
Contributor Author

wuub commented Sep 21, 2017

That's interesting :). I'll try to see if any of our job specs trigger this on a dev cluster.

Is the GC event-driven, triggering immediately after the last allocation is removed, or do I have to wait for a periodic clean-up?

@dadgar
Contributor

dadgar commented Sep 21, 2017

You would have to wait for a periodic clean-up, or you can run curl -XPUT http://127.0.0.1:4646/v1/system/gc
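
For example, to confirm that a forced GC actually reclaims deployments (assumes NOMAD_ADDR is set; the counts include a header line):

$ nomad deployment list | wc -l
$ # force a garbage collection cycle on the servers
$ curl -XPUT $NOMAD_ADDR/v1/system/gc
$ nomad deployment list | wc -l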

@jippi
Contributor

jippi commented Sep 21, 2017

In hashi-ui that's also a button under System :)

@wuub
Contributor Author

wuub commented Sep 21, 2017

Thanks. I'll try to investigate deeper first thing tomorrow morning.

@wuub
Contributor Author

wuub commented Sep 22, 2017

Soooo. I sent curl -XPUT $NOMAD_ADDR/v1/system/gc to our prod cluster, and the deployments list shrank significantly (~1500 -> 159) (cc: @hsmade)

The only jobs that have more than one deployment share a job version, so it's most likely that the newer ones were just moving a subset of allocations due to node failure.

BUT.

Any reason why GC is not running on its own?

EDIT/UPDATE:

After the one forced GC, no other cleanup has run for the past 7h+:

183cb4ea  jobjobjob-stg   21   failed      Failed due to unhealthy allocations
056cb247  jobjobjob-stg   20   successful  Deployment completed successfully
83b31064  jobjobjob-stg   18   cancelled   Cancelled because job is stopped
1c020421  jobjobjob-stg   16   failed      Failed due to unhealthy allocations
112b1b50  jobjobjob-stg   14   failed      Failed due to unhealthy allocations
1c80f6dc  jobjobjob-stg   13   successful  Deployment completed successfully
834da81a  jobjobjob-stg   13   successful  Deployment completed successfully
ca3894e6  jobjobjob-stg   13   successful  Deployment completed successfully
f7fa35b9  jobjobjob-stg   13   successful  Deployment completed successfully

@dadgar
Contributor

dadgar commented Sep 25, 2017

@wuub That is the bug :( The loop creating the GC jobs wasn't doing it for the deployments. Will get a fix out soon!

@hsmade Can you run the force and check it clears a bunch of your deployments?
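
Until the fix ships, a stopgap is to force GC on a schedule, e.g. a cron entry on one of the servers (a sketch; assumes the default local API address and no ACLs):

# force a Nomad GC cycle every 30 minutes
*/30 * * * * curl -s -X PUT http://127.0.0.1:4646/v1/system/gc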

@wuub
Contributor Author

wuub commented Sep 25, 2017

Great news :) thank you @dadgar

@hsmade
Contributor

hsmade commented Sep 25, 2017

@dadgar it does clean out when forcing GC, thx!

@dadgar
Contributor

dadgar commented Sep 25, 2017

@hsmade Sweet! Thanks both of you! Will be fixed in 0.7!

@hsmade
Contributor

hsmade commented Sep 25, 2017

BTW.. a note on the huge data transfer I saw: this was mainly caused by us configuring reserved ports (20000-32000) because of a misunderstanding of the docs (we've fixed the docs since then). If you have 64 clients, and each of them produces about 1MB of data for just the port reservations, that adds up :)
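
For reference, a rough way to see how much a single node contributes to that transfer (a sketch; <node-id> is a placeholder taken from nomad node-status, and NOMAD_ADDR is assumed to be set):

$ # size in bytes of one node's API object, including its materialized reserved ports
$ curl -s $NOMAD_ADDR/v1/node/<node-id> | wc -c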

@dadgar
Contributor

dadgar commented Sep 25, 2017

@hsmade Yeah, that is definitely something we are aware of; we would like to make it a simple pair of integers for a range rather than materializing each port!

@hsmade
Contributor

hsmade commented Sep 26, 2017

hehe, yes :)
Luckily, I don't actually need it, so that was a simple(ish) fix. Apart from the rolling restart :P (Which works great: I can just restart my cluster without affecting / stopping my containers. Kudos for that!!)

@github-actions

github-actions bot commented Dec 7, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 7, 2022