
Nomad doesn't make GC for manually stopped batch allocations #4532

Closed
tantra35 opened this issue Jul 25, 2018 · 5 comments
Comments

tantra35 (Contributor) commented Jul 25, 2018

Nomad version

Nomad v0.8.4 (dbee1d7)

Issue

If we stop a batch job and then launch it again via nomad run, the old allocations are not garbage-collected; a manual GC doesn't help either.

Reproduction steps

For example, we have this test job:

job "test"
{
        datacenters = ["test"]
        type = "batch"

        constraint
        {
                attribute = "${attr.kernel.name}"
                value = "linux"
        }

        constraint
        {
                attribute = "${node.unique.name}"
                operator = "="
                value = "dockerworker-1"
        }

        task "diamondbcapacitycollector"
        {
                driver = "exec"

                config
                {
                        command = "sleep"
                        args = ["600"]
                }

                logs
                {
                        max_files = 3
                        max_file_size = 10
                }

                resources
                {
                        cpu = 100
                        memory = 300
                }
        }
}

Then we launch it with nomad run ./test.nomad, stop it (nomad stop test), run it again, stop it again, and finally run it one more time; the full command sequence is shown below.
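For reference, the exact sequence of commands is:

nomad run ./test.nomad
nomad stop test
nomad run ./test.nomad
nomad stop test
nomad run ./test.nomad

After that, the job is in the following state: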

vagrant@consulnomad-1:~$ nomad status test
ID            = test
Name          = test
Submit Date   = 2018-07-25T16:34:18+03:00
Type          = batch
Priority      = 50
Datacenters   = test
Status        = dead
Periodic      = false
Parameterized = false

Summary
Task Group                 Queued  Starting  Running  Failed  Complete  Lost
diamondbcapacitycollector  0       0         0        0       3         0

Allocations
ID        Node ID   Task Group                 Version  Desired  Status    Created     Modified
7657bd07  2b5a3435  diamondbcapacitycollector  4        run      complete  10m43s ago  39s ago
490ccb3f  2b5a3435  diamondbcapacitycollector  2        stop     complete  11m30s ago  11m8s ago
324a29b9  2b5a3435  diamondbcapacitycollector  0        stop     complete  13m30s ago  11m8s ago

After these manipulations we try to run a manual GC (we tried it 2-3 times):

curl -XPUT http://localhost:4646/v1/system/gc

And nothing happens; the old allocations are not cleaned up. Only if we fully stop the batch job does GC clean up all of its allocations. All of this was done on a test stand, but in the real environment we see situations like the following:

ID            = bobrovnik-a-jupiternoteBook
Name          = bobrovnik-a-jupiternoteBook
Submit Date   = 2018-07-23T15:32:57+03:00
Type          = batch
Priority      = 50
Datacenters   = jupiterhub
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group                   Queued  Starting  Running  Failed  Complete  Lost
bobrovnik-a-jupiternoteBook  0       0         1        0       3         0

Allocations
ID        Node ID   Task Group                   Version  Desired  Status    Created   Modified
10450ef1  13a464de  bobrovnik-a-jupiternoteBook  5        run      running   2d1h ago  2d1h ago
e790b6ff  f31e7019  bobrovnik-a-jupiternoteBook  3        stop     complete  5d4h ago  2d2h ago
b4ea1f29  f31e7019  bobrovnik-a-jupiternoteBook  2        stop     complete  5d4h ago  2d2h ago
1d50255b  bab9fa3c  bobrovnik-a-jupiternoteBook  0        stop     complete  5d7h ago  2d2h ago
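
For reference, automatic (periodic) GC of terminal jobs and evaluations is governed by the server GC thresholds. A minimal sketch of the relevant server stanza (the values shown are the assumed defaults, not taken from this cluster):

server {
  enabled = true

  # Assumed defaults: how old a terminal job/evaluation must be before the
  # periodic garbage collector will remove it.
  job_gc_threshold  = "4h"
  eval_gc_threshold = "1h"
}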
tantra35 (Contributor, Author) commented

@preetapan Is this expected behavior?

boardwalk commented

I'm seeing this problem. While the job isn't registered anymore, I can still see the allocations via the /v1/allocations endpoint. I've run a manual GC, and even a normal GC should have caught them at this point. Normally I'd be fine just ignoring them, but there are a lot of them (the evaluation endpoint gives me back roughly 500 MiB or more of JSON) from a bug I hit over the weekend that caused a huge number of evaluations of a periodic job, to the point where some of my Nomad instances are using significant amounts of memory:

dsma01:
62992  1-02:15:37  5.1 /home/fds/dsotm/FDSdsotm_nomad_0.8.7.4/bin/nomad agent -config=/tmp/zADXVfkr.hcl
dsma02:
60951 14-21:53:53  1.1 /home/fds/dsotm/FDSdsotm_nomad_0.8.7.4/bin/nomad agent -config=/tmp/EwLX7he3.hcl
dsma03:
18985 14-21:46:39  8.5 /home/fds/dsotm/FDSdsotm_nomad_0.8.7.4/bin/nomad agent -config=/tmp/rGAA6KeG.hcl
dsmb01:
24854  5-20:21:14  0.7 /home/fds/dsotm/FDSdsotm_nomad_0.8.7.4/bin/nomad agent -config=/tmp/RuRF9NaD.hcl
dsmb02:
58087  5-20:19:38  0.2 /home/fds/dsotm/FDSdsotm_nomad_0.8.7.4/bin/nomad agent -config=/tmp/6GicXcyG.hcl
dsmb03:
22497  5-20:18:47  0.6 /home/fds/dsotm/FDSdsotm_nomad_0.8.7.4/bin/nomad agent -config=/tmp/02o2Ftgc.hcl
dsmdeva01:
19877  1-17:53:39 10.1 /home/fds/dsotm/FDSdsotm_nomad_0.8.7.2/bin/nomad agent -config=/tmp/g6mOCwgO.hcl
dsmdeva02:
22086 41-00:52:08 11.9 /home/fds/dsotm/FDSdsotm_nomad_0.8.7.2/bin/nomad agent -config=/tmp/23i9LvVR.hcl
dsmstagea01:
32023  1-01:03:05 39.3 /home/fds/dsotm/FDSdsotm_nomad_0.8.7.5/bin/nomad agent -config=/tmp/Hd7QNY5f.json
dsmstagea02:
32262  1-01:24:39 34.1 /home/fds/dsotm/FDSdsotm_nomad_0.8.7.2/bin/nomad agent -config=/tmp/i3IQuPdS.hcl
dsmstagea03:
23177    01:00:26 26.4 /home/fds/dsotm/FDSdsotm_nomad_0.8.7.8/bin/nomad agent -config=/tmp/EEve4AXR.json
dsmstageb01:
27174 41-00:53:23 43.5 /home/fds/dsotm/FDSdsotm_nomad_0.8.7.2/bin/nomad agent -config=/tmp/dhz6FVaZ.hcl
dsmstageb02:
21987 41-00:53:06 23.9 /home/fds/dsotm/FDSdsotm_nomad_0.8.7.2/bin/nomad agent -config=/tmp/KpBf5Jed.hcl
dsmstageb03:
53397  1-00:14:41 40.0 /home/fds/dsotm/FDSdsotm_nomad_0.8.7.2/bin/nomad agent -config=/tmp/n5PAOczw.hcl

The float number is the percentage of memory in use on the VM (out of 8 or 16 GiB). This is with vanilla Nomad 0.8.7.

The evaluations look like this:

[{"AnnotatePlan":false,"BlockedEval":"","ClassEligibility":null,"CreateIndex":2900362,"DeploymentID":"","EscapedComputedClass":false,"FailedTGAllocs":null,"ID":"000205df-5f0b-c0f6-f76a-9712aabadb06","JobID":"stage-a-restart-services/periodic-1552197600","JobModifyIndex":2900361,"LeaderACL":"","ModifyIndex":2900362,"Namespace":"default","NextEval":"","NodeID":"","NodeModifyIndex":0,"PreviousEval":"","Priority":50,"QueuedAllocations":null,"QuotaLimitReached":"","SnapshotIndex":0,"Status":"pending","StatusDescription":"","TriggeredBy":"periodic-job","Type":"batch","Wait":0,"WaitUntil":"0001-01-01T00:00:00Z"},
 {"AnnotatePlan":false,"BlockedEval":"","ClassEligibility":null,"CreateIndex":1846653,"DeploymentID":"","EscapedComputedClass":false,"FailedTGAllocs":null,"ID":"00022322-9920-7679-4c7c-b475bcd92eb9","JobID":"stage-a-restart-services/periodic-1552197600","JobModifyIndex":1846652,"LeaderACL":"","ModifyIndex":1846653,"Namespace":"default","NextEval":"","NodeID":"","NodeModifyIndex":0,"PreviousEval":"","Priority":50,"QueuedAllocations":null,"QuotaLimitReached":"","SnapshotIndex":0,"Status":"pending","StatusDescription":"","TriggeredBy":"periodic-job","Type":"batch","Wait":0,"WaitUntil":"0001-01-01T00:00:00Z"},
 {"AnnotatePlan":false,"BlockedEval":"","ClassEligibility":null,"CreateIndex":2387713,"DeploymentID":"","EscapedComputedClass":false,"FailedTGAllocs":null,"ID":"00027399-797d-7551-3bea-3c4d2ed6e851","JobID":"stage-a-restart-services/periodic-1552197600","JobModifyIndex":2387712,"LeaderACL":"","ModifyIndex":2387713,"Namespace":"default","NextEval":"","NodeID":"","NodeModifyIndex":0,"PreviousEval":"","Priority":50,"QueuedAllocations":null,"QuotaLimitReached":"","SnapshotIndex":0,"Status":"pending","StatusDescription":"","TriggeredBy":"periodic-job","Type":"batch","Wait":0,"WaitUntil":"0001-01-01T00:00:00Z"},
 {"AnnotatePlan":false,"BlockedEval":"","ClassEligibility":null,"CreateIndex":1562539,"DeploymentID":"","EscapedComputedClass":false,"FailedTGAllocs":null,"ID":"00028a09-2d6a-cb2b-b7b3-ade83240fcf3","JobID":"stage-a-restart-services/periodic-1552197600","JobModifyIndex":1562538,"LeaderACL":"","ModifyIndex":1562539,"Namespace":"default","NextEval":"","NodeID":"","NodeModifyIndex":0,"PreviousEval":"","Priority":50,"QueuedAllocations":null,"QuotaLimitReached":"","SnapshotIndex":0,"Status":"pending","StatusDescription":"","TriggeredBy":"periodic-job","Type":"batch","Wait":0,"WaitUntil":"0001-01-01T00:00:00Z"},
 {"AnnotatePlan":false,"BlockedEval":"","ClassEligibility":null,"CreateIndex":1685747,"DeploymentID":"","EscapedComputedClass":false,"FailedTGAllocs":null,"ID":"0002906c-5f97-0dfc-f862-7f9785b08810","JobID":"stage-a-restart-services/periodic-1552197600","JobModifyIndex":1685746,"LeaderACL":"","ModifyIndex":1685747,"Namespace":"default","NextEval":"","NodeID":"","NodeModifyIndex":0,"PreviousEval":"","Priority":50,"QueuedAllocations":null,"QuotaLimitReached":"","SnapshotIndex":0,"Status":"pending","StatusDescription":"","TriggeredBy":"periodic-job","Type":"batch","Wait":0,"WaitUntil":"0001-01-01T00:00:00Z"},
 {"AnnotatePlan":false,"BlockedEval":"","ClassEligibility":null,"CreateIndex":1369536,"DeploymentID":"","EscapedComputedClass":false,"FailedTGAllocs":null,"ID":"0002aa6c-92ed-ab22-adc9-a5f3f9f2913b","JobID":"stage-a-restart-services/periodic-1552197600","JobModifyIndex":1369534,"LeaderACL":"","ModifyIndex":1369536,"Namespace":"default","NextEval":"","NodeID":"","NodeModifyIndex":0,"PreviousEval":"","Priority":50,"QueuedAllocations":null,"QuotaLimitReached":"","SnapshotIndex":0,"Status":"pending","StatusDescription":"","TriggeredBy":"periodic-job","Type":"batch","Wait":0,"WaitUntil":"0001-01-01T00:00:00Z"},
 {"AnnotatePlan":false,"BlockedEval":"","ClassEligibility":null,"CreateIndex":2887201,"DeploymentID":"","EscapedComputedClass":false,"FailedTGAllocs":null,"ID":"0002bac9-1f4a-0230-739f-9a7c6670b2d4","JobID":"stage-a-restart-services/periodic-1552197600","JobModifyIndex":2887200,"LeaderACL":"","ModifyIndex":2887201,"Namespace":"default","NextEval":"","NodeID":"","NodeModifyIndex":0,"PreviousEval":"","Priority":50,"QueuedAllocations":null,"QuotaLimitReached":"","SnapshotIndex":0,"Status":"pending","StatusDescription":"","TriggeredBy":"periodic-job","Type":"batch","Wait":0,"WaitUntil":"0001-01-01T00:00:00Z"},
 {"AnnotatePlan":false,"BlockedEval":"","ClassEligibility":null,"CreateIndex":2310569,"DeploymentID":"","EscapedComputedClass":false,"FailedTGAllocs":null,"ID":"0002f2e9-4a12-c6bc-6a9f-a4981e4e91e1","JobID":"stage-a-restart-services/periodic-1552197600","JobModifyIndex":2310568,"LeaderACL":"","ModifyIndex":2310569,"Namespace":"default","NextEval":"","NodeID":"","NodeModifyIndex":0,"PreviousEval":"","Priority":50,"QueuedAllocations":null,"QuotaLimitReached":"","SnapshotIndex":0,"Status":"pending","StatusDescription":"","TriggeredBy":"periodic-job","Type":"batch","Wait":0,"WaitUntil":"0001-01-01T00:00:00Z"},
...

I tried removing $DATA_DIR/server/raft but it looks like it was just re-replicated. Does anyone have an idea on how to clean this up?
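
As a rough way to size the backlog, one can count the stuck evaluations for a single job straight from the API (a sketch; it assumes jq is available and the default API address, and filters on the JobID field visible in the output above):

curl -s http://localhost:4646/v1/evaluations \
  | jq '[.[] | select(.JobID | startswith("stage-a-restart-services"))] | length'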

stale bot commented Jun 10, 2019

Hey there

Since this issue hasn't had any activity in a while, we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at it.

Thanks!


stale bot commented Jul 10, 2019

This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem 👍

github-actions bot commented

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 21, 2022