
Nomad doesn't make GC for manually stopped batch allocations #4532

Closed
tantra35 opened this issue Jul 25, 2018 · 5 comments
Comments

tantra35 (Contributor) commented Jul 25, 2018

Nomad version

Nomad v0.8.4 (dbee1d7)

Issue

If we stop a batch job and then launch it again via nomad run, the old allocations are not garbage-collected; a manual GC doesn't help either.

Reproduction steps

For example, we have this test job:

job "test"
{
        datacenters = ["test"]
        type = "batch"

        constraint
        {
                attribute = "${attr.kernel.name}"
                value = "linux"
        }

        constraint
        {
                attribute = "${node.unique.name}"
                operator = "="
                value = "dockerworker-1"
        }

        task "diamondbcapacitycollector"
        {
                driver = "exec"

                config
                {
                        command = "sleep"
                        args = ["600"]
                }

                logs
                {
                        max_files = 3
                        max_file_size = 10
                }

                resources
                {
                        cpu = 100
                        memory = 300
                }
        }
}

Then we launch it with nomad run ./test.nomad, stop it (nomad stop test), run it again, stop it again, and finally run it one more time; the full command sequence is shown below.
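For reference, the exact sequence of commands is:

nomad run ./test.nomad
nomad stop test
nomad run ./test.nomad
nomad stop test
nomad run ./test.nomad

After that, the job is in the following state: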

vagrant@consulnomad-1:~$ nomad status test
ID            = test
Name          = test
Submit Date   = 2018-07-25T16:34:18+03:00
Type          = batch
Priority      = 50
Datacenters   = test
Status        = dead
Periodic      = false
Parameterized = false

Summary
Task Group                 Queued  Starting  Running  Failed  Complete  Lost
diamondbcapacitycollector  0       0         0        0       3         0

Allocations
ID        Node ID   Task Group                 Version  Desired  Status    Created     Modified
7657bd07  2b5a3435  diamondbcapacitycollector  4        run      complete  10m43s ago  39s ago
490ccb3f  2b5a3435  diamondbcapacitycollector  2        stop     complete  11m30s ago  11m8s ago
324a29b9  2b5a3435  diamondbcapacitycollector  0        stop     complete  13m30s ago  11m8s ago

After these manipulations we try to run a manual GC (we tried it 2-3 times):

curl -XPUT http://localhost:4646/v1/system/gc

And nothing happens; the old allocations are not cleaned up. Only if we fully stop the batch job does GC clean up all of its allocations. All of this was done on a test stand, but in the real environment we see situations like the following:

ID            = bobrovnik-a-jupiternoteBook
Name          = bobrovnik-a-jupiternoteBook
Submit Date   = 2018-07-23T15:32:57+03:00
Type          = batch
Priority      = 50
Datacenters   = jupiterhub
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group                   Queued  Starting  Running  Failed  Complete  Lost
bobrovnik-a-jupiternoteBook  0       0         1        0       3         0

Allocations
ID        Node ID   Task Group                   Version  Desired  Status    Created   Modified
10450ef1  13a464de  bobrovnik-a-jupiternoteBook  5        run      running   2d1h ago  2d1h ago
e790b6ff  f31e7019  bobrovnik-a-jupiternoteBook  3        stop     complete  5d4h ago  2d2h ago
b4ea1f29  f31e7019  bobrovnik-a-jupiternoteBook  2        stop     complete  5d4h ago  2d2h ago
1d50255b  bab9fa3c  bobrovnik-a-jupiternoteBook  0        stop     complete  5d7h ago  2d2h ago
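
For reference, automatic (periodic) GC of terminal jobs and evaluations is governed by the server GC thresholds. A minimal sketch of the relevant server stanza (the values shown are the assumed defaults, not taken from this cluster):

server {
  enabled = true

  # Assumed defaults: how old a terminal job/evaluation must be before the
  # periodic garbage collector will remove it.
  job_gc_threshold  = "4h"
  eval_gc_threshold = "1h"
}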
tantra35 (Contributor, Author) commented

@preetapan Is this expected behavior?

boardwalk commented

I'm seeing this problem. While the job isn't registered anymore, I can still see the allocations via the /v1/allocations endpoint. I've run a manual GC, and even a normal GC should have caught them at this point. Normally I'd be fine just ignoring them, but there are a lot of them (the evaluation endpoint gives me back roughly 500 MiB or more of JSON) from a bug I hit over the weekend that caused a huge number of evaluations of a periodic job, to the point where some of my Nomad instances are using significant amounts of memory:

dsma01:
62992  1-02:15:37  5.1 /home/fds/dsotm/FDSdsotm_nomad_0.8.7.4/bin/nomad agent -config=/tmp/zADXVfkr.hcl
dsma02:
60951 14-21:53:53  1.1 /home/fds/dsotm/FDSdsotm_nomad_0.8.7.4/bin/nomad agent -config=/tmp/EwLX7he3.hcl
dsma03:
18985 14-21:46:39  8.5 /home/fds/dsotm/FDSdsotm_nomad_0.8.7.4/bin/nomad agent -config=/tmp/rGAA6KeG.hcl
dsmb01:
24854  5-20:21:14  0.7 /home/fds/dsotm/FDSdsotm_nomad_0.8.7.4/bin/nomad agent -config=/tmp/RuRF9NaD.hcl
dsmb02:
58087  5-20:19:38  0.2 /home/fds/dsotm/FDSdsotm_nomad_0.8.7.4/bin/nomad agent -config=/tmp/6GicXcyG.hcl
dsmb03:
22497  5-20:18:47  0.6 /home/fds/dsotm/FDSdsotm_nomad_0.8.7.4/bin/nomad agent -config=/tmp/02o2Ftgc.hcl
dsmdeva01:
19877  1-17:53:39 10.1 /home/fds/dsotm/FDSdsotm_nomad_0.8.7.2/bin/nomad agent -config=/tmp/g6mOCwgO.hcl
dsmdeva02:
22086 41-00:52:08 11.9 /home/fds/dsotm/FDSdsotm_nomad_0.8.7.2/bin/nomad agent -config=/tmp/23i9LvVR.hcl
dsmstagea01:
32023  1-01:03:05 39.3 /home/fds/dsotm/FDSdsotm_nomad_0.8.7.5/bin/nomad agent -config=/tmp/Hd7QNY5f.json
dsmstagea02:
32262  1-01:24:39 34.1 /home/fds/dsotm/FDSdsotm_nomad_0.8.7.2/bin/nomad agent -config=/tmp/i3IQuPdS.hcl
dsmstagea03:
23177    01:00:26 26.4 /home/fds/dsotm/FDSdsotm_nomad_0.8.7.8/bin/nomad agent -config=/tmp/EEve4AXR.json
dsmstageb01:
27174 41-00:53:23 43.5 /home/fds/dsotm/FDSdsotm_nomad_0.8.7.2/bin/nomad agent -config=/tmp/dhz6FVaZ.hcl
dsmstageb02:
21987 41-00:53:06 23.9 /home/fds/dsotm/FDSdsotm_nomad_0.8.7.2/bin/nomad agent -config=/tmp/KpBf5Jed.hcl
dsmstageb03:
53397  1-00:14:41 40.0 /home/fds/dsotm/FDSdsotm_nomad_0.8.7.2/bin/nomad agent -config=/tmp/n5PAOczw.hcl

The float number is the percentage of memory in use on the VM (out of 8 or 16 GiB). This is with vanilla Nomad 0.8.7.

The evaluations look like this:

[{"AnnotatePlan":false,"BlockedEval":"","ClassEligibility":null,"CreateIndex":2900362,"DeploymentID":"","EscapedComputedClass":false,"FailedTGAllocs":null,"ID":"000205df-5f0b-c0f6-f76a-9712aabadb06","JobID":"stage-a-restart-services/periodic-1552197600","JobModifyIndex":2900361,"LeaderACL":"","ModifyIndex":2900362,"Namespace":"default","NextEval":"","NodeID":"","NodeModifyIndex":0,"PreviousEval":"","Priority":50,"QueuedAllocations":null,"QuotaLimitReached":"","SnapshotIndex":0,"Status":"pending","StatusDescription":"","TriggeredBy":"periodic-job","Type":"batch","Wait":0,"WaitUntil":"0001-01-01T00:00:00Z"},
 {"AnnotatePlan":false,"BlockedEval":"","ClassEligibility":null,"CreateIndex":1846653,"DeploymentID":"","EscapedComputedClass":false,"FailedTGAllocs":null,"ID":"00022322-9920-7679-4c7c-b475bcd92eb9","JobID":"stage-a-restart-services/periodic-1552197600","JobModifyIndex":1846652,"LeaderACL":"","ModifyIndex":1846653,"Namespace":"default","NextEval":"","NodeID":"","NodeModifyIndex":0,"PreviousEval":"","Priority":50,"QueuedAllocations":null,"QuotaLimitReached":"","SnapshotIndex":0,"Status":"pending","StatusDescription":"","TriggeredBy":"periodic-job","Type":"batch","Wait":0,"WaitUntil":"0001-01-01T00:00:00Z"},
 {"AnnotatePlan":false,"BlockedEval":"","ClassEligibility":null,"CreateIndex":2387713,"DeploymentID":"","EscapedComputedClass":false,"FailedTGAllocs":null,"ID":"00027399-797d-7551-3bea-3c4d2ed6e851","JobID":"stage-a-restart-services/periodic-1552197600","JobModifyIndex":2387712,"LeaderACL":"","ModifyIndex":2387713,"Namespace":"default","NextEval":"","NodeID":"","NodeModifyIndex":0,"PreviousEval":"","Priority":50,"QueuedAllocations":null,"QuotaLimitReached":"","SnapshotIndex":0,"Status":"pending","StatusDescription":"","TriggeredBy":"periodic-job","Type":"batch","Wait":0,"WaitUntil":"0001-01-01T00:00:00Z"},
 {"AnnotatePlan":false,"BlockedEval":"","ClassEligibility":null,"CreateIndex":1562539,"DeploymentID":"","EscapedComputedClass":false,"FailedTGAllocs":null,"ID":"00028a09-2d6a-cb2b-b7b3-ade83240fcf3","JobID":"stage-a-restart-services/periodic-1552197600","JobModifyIndex":1562538,"LeaderACL":"","ModifyIndex":1562539,"Namespace":"default","NextEval":"","NodeID":"","NodeModifyIndex":0,"PreviousEval":"","Priority":50,"QueuedAllocations":null,"QuotaLimitReached":"","SnapshotIndex":0,"Status":"pending","StatusDescription":"","TriggeredBy":"periodic-job","Type":"batch","Wait":0,"WaitUntil":"0001-01-01T00:00:00Z"},
 {"AnnotatePlan":false,"BlockedEval":"","ClassEligibility":null,"CreateIndex":1685747,"DeploymentID":"","EscapedComputedClass":false,"FailedTGAllocs":null,"ID":"0002906c-5f97-0dfc-f862-7f9785b08810","JobID":"stage-a-restart-services/periodic-1552197600","JobModifyIndex":1685746,"LeaderACL":"","ModifyIndex":1685747,"Namespace":"default","NextEval":"","NodeID":"","NodeModifyIndex":0,"PreviousEval":"","Priority":50,"QueuedAllocations":null,"QuotaLimitReached":"","SnapshotIndex":0,"Status":"pending","StatusDescription":"","TriggeredBy":"periodic-job","Type":"batch","Wait":0,"WaitUntil":"0001-01-01T00:00:00Z"},
 {"AnnotatePlan":false,"BlockedEval":"","ClassEligibility":null,"CreateIndex":1369536,"DeploymentID":"","EscapedComputedClass":false,"FailedTGAllocs":null,"ID":"0002aa6c-92ed-ab22-adc9-a5f3f9f2913b","JobID":"stage-a-restart-services/periodic-1552197600","JobModifyIndex":1369534,"LeaderACL":"","ModifyIndex":1369536,"Namespace":"default","NextEval":"","NodeID":"","NodeModifyIndex":0,"PreviousEval":"","Priority":50,"QueuedAllocations":null,"QuotaLimitReached":"","SnapshotIndex":0,"Status":"pending","StatusDescription":"","TriggeredBy":"periodic-job","Type":"batch","Wait":0,"WaitUntil":"0001-01-01T00:00:00Z"},
 {"AnnotatePlan":false,"BlockedEval":"","ClassEligibility":null,"CreateIndex":2887201,"DeploymentID":"","EscapedComputedClass":false,"FailedTGAllocs":null,"ID":"0002bac9-1f4a-0230-739f-9a7c6670b2d4","JobID":"stage-a-restart-services/periodic-1552197600","JobModifyIndex":2887200,"LeaderACL":"","ModifyIndex":2887201,"Namespace":"default","NextEval":"","NodeID":"","NodeModifyIndex":0,"PreviousEval":"","Priority":50,"QueuedAllocations":null,"QuotaLimitReached":"","SnapshotIndex":0,"Status":"pending","StatusDescription":"","TriggeredBy":"periodic-job","Type":"batch","Wait":0,"WaitUntil":"0001-01-01T00:00:00Z"},
 {"AnnotatePlan":false,"BlockedEval":"","ClassEligibility":null,"CreateIndex":2310569,"DeploymentID":"","EscapedComputedClass":false,"FailedTGAllocs":null,"ID":"0002f2e9-4a12-c6bc-6a9f-a4981e4e91e1","JobID":"stage-a-restart-services/periodic-1552197600","JobModifyIndex":2310568,"LeaderACL":"","ModifyIndex":2310569,"Namespace":"default","NextEval":"","NodeID":"","NodeModifyIndex":0,"PreviousEval":"","Priority":50,"QueuedAllocations":null,"QuotaLimitReached":"","SnapshotIndex":0,"Status":"pending","StatusDescription":"","TriggeredBy":"periodic-job","Type":"batch","Wait":0,"WaitUntil":"0001-01-01T00:00:00Z"},
...

I tried removing $DATA_DIR/server/raft but it looks like it was just re-replicated. Does anyone have an idea on how to clean this up?
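
As a rough way to size the backlog, one can count the stuck evaluations for a single job straight from the API (a sketch; it assumes jq is available and the default API address, and filters on the JobID field visible in the output above):

curl -s http://localhost:4646/v1/evaluations \
  | jq '[.[] | select(.JobID | startswith("stage-a-restart-services"))] | length'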

stale bot commented Jun 10, 2019

Hey there

Since this issue hasn't had any activity in a while, we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at it.

Thanks!


stale bot commented Jul 10, 2019

This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem 👍

github-actions bot commented

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 21, 2022