
Nomad does not clean old periodic tasks #10456

Open
TheSpbra1n opened this issue Apr 27, 2021 · 13 comments

@TheSpbra1n

Nomad version

Nomad v1.0.2 (4c1d4fc)

Operating system and Environment details

Codename: xenial

Issue

I set JobGCThreshold = 48h in the config:

root@nomad-01:/home/user# curl -s  'http://localhost:4646/v1/agent/self' | jq . | grep 'JobGCThreshold'
      "JobGCThreshold": "48h"

But nomad alloc status still shows an allocation that finished 6 days ago:

root@nomad-01:/home/user# nomad alloc status 118cf459-8aa2-fbf2-6e61-77ed211cb531
ID                  = 118cf459-8aa2-fbf2-6e61-77ed211cb531
Eval ID             = 7ee96439
Name                = correct_suspended_tickets/periodic-1619002800.correct_suspended_tickets[0]
Node ID             = 263fdfd7
Node Name           = docker-15
Job ID              = correct_suspended_tickets/periodic-1619002800
Job Version         = 9
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = run
Desired Description = <none>
Created             = 6d6h ago
Modified            = 2h42m ago

Task "correct_suspended_tickets" is "dead"
Task Resources
CPU      Memory   Disk     Addresses
100 MHz  512 MiB  300 MiB

Task Events:
Started At     = 2021-04-21T11:01:42Z
Finished At    = 2021-04-21T11:01:44Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2021-04-27T17:35:21+03:00  Killing     Sent interrupt. Waiting 5s before force killing
2021-04-27T17:35:18+03:00  Received    Task received by client
2021-04-26T21:20:15+03:00  Killing     Sent interrupt. Waiting 5s before force killing
2021-04-26T21:20:12+03:00  Received    Task received by client
2021-04-26T01:04:06+03:00  Killing     Sent interrupt. Waiting 5s before force killing
2021-04-26T01:04:04+03:00  Received    Task received by client
2021-04-24T08:20:08+03:00  Killing     Sent interrupt. Waiting 5s before force killing
2021-04-24T08:20:00+03:00  Received    Task received by client
2021-04-22T18:38:15+03:00  Killing     Sent interrupt. Waiting 5s before force killing
2021-04-21T14:01:44+03:00  Terminated  Exit Code: 0

Nomad Client logs (if appropriate)

In the client logs, I see that GC ran, but the allocation is actually still present:

root@docker-15:/home/user# journalctl -u nomad --since "12:00" | grep 118cf459
Apr 27 17:35:18 docker-15 nomad[2838]:     2021-04-27T17:35:18.624+0300 [INFO]  client.gc: marking allocation for GC: alloc_id=118cf459-8aa2-fbf2-6e61-77ed211cb531
Apr 27 17:35:21 docker-15 nomad[2838]:     2021-04-27T17:35:21.710+0300 [INFO]  client.gc: garbage collecting allocation: alloc_id=118cf459-8aa2-fbf2-6e61-77ed211cb531 reason="number of allocations (251) is over the limit (200)"
Apr 27 17:35:21 docker-15 nomad[2838]:     2021-04-27T17:35:21.716+0300 [INFO]  client.gc: marking allocation for GC: alloc_id=118cf459-8aa2-fbf2-6e61-77ed211cb531
Apr 27 17:35:21 docker-15 nomad[2838]:     2021-04-27T17:35:21.721+0300 [DEBUG] client.gc: alloc garbage collected: alloc_id=118cf459-8aa2-fbf2-6e61-77ed211cb531
root@docker-15:/home/user# date
Tue Apr 27 20:41:41 MSK 2021
root@docker-15:/home/user# nomad alloc status 118cf459
ID                  = 118cf459-8aa2-fbf2-6e61-77ed211cb531
Eval ID             = 7ee96439
...
@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Apr 27, 2021
@tgross
Member

tgross commented Apr 27, 2021

Hi @TheSpbra1n! This one is tricky!

The log message you're seeing is about GC on the client node, where we clean up the allocation directory on the client. The nomad alloc status is querying the server's view of the world. That JobGCThreshold config value controls that server state.

But what I'm seeing in your nomad alloc status output is that this periodic job is getting placed every few hours, and that it has Desired Status = run. So it looks like the server expects it to be running, which is why it's not GC'ing it. Can you share a redacted version of the job spec? That might help us narrow down the behavior you're seeing.
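For reference, a minimal sketch of where these two knobs live in the agent configuration (assuming an HCL agent config file; the values here are illustrative, matching what the original report shows):

```hcl
# Server side: how long a dead job must have been terminal before the
# servers garbage collect it. This governs what `nomad alloc status`
# (the server's view) can still see.
server {
  enabled          = true
  job_gc_threshold = "48h"
}

# Client side: the "over the limit (200)" log message above comes from
# this setting. Once a node holds more terminal allocations than
# gc_max_allocs, the client starts cleaning up its local allocation
# directories -- independently of the server-side job GC.
client {
  enabled       = true
  gc_max_allocs = 200
}
```

The key point is that the two GC processes are independent: the client log lines only prove the allocation directory was removed on the node, not that the server's record of the allocation was collected.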

@tgross tgross added theme/batch Issues related to batch jobs and scheduling stage/waiting-reply labels Apr 27, 2021
@tgross tgross self-assigned this Apr 27, 2021
@tgross tgross moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Apr 27, 2021
@TheSpbra1n
Author

Thanks for the answer.
Here is the job spec:

# job name
job "correct_suspended_tickets" {
  region      = "global"
  datacenters = ["dc1", "dc2"]

  type = "batch"

  periodic {
    cron      = "00,30 * * * * *"
    time_zone = "Europe/Moscow"
    prohibit_overlap = true
  }

  constraint {
    attribute = "${node.class}"
    value     = "nomad1"
  }

  meta {
    # Application version
    APP_VERSION       = "0.0.1"
    START_SCRIPT = "xxx"
  }

  group "correct_suspended_tickets" {
    count = 1

    task "correct_suspended_tickets" {
      kill_signal = "SIGTERM"
      driver = "docker"

      config {
        image        = "docker-registry/container:${NOMAD_META_APP_VERSION}"
        advertise_ipv6_address = true
        network_mode = "bridge"
        command = "${NOMAD_META_START_SCRIPT}"
      }

      resources {
        memory = 512 # MB
      }
    }
  }
}

@tgross
Member

tgross commented Apr 28, 2021

Thanks for the jobspec @TheSpbra1n. That spec looks fine to me, which leads me back to wondering why the Desired Status = run still. The code that does job GC has some special casing around the status for periodic jobs (ref). We did fix a bug in the setting of status for periodic jobs in #10145 and I'm wondering if that could be at play.

If you do a nomad job status correct_suspended_tickets, does the Children Job Summary section have the correct counts? If not, you may have run into the bug described in #10145. That patch will ship in the upcoming Nomad 1.1.0 beta (very soon), and we're planning on backporting it to Nomad 1.0.5 as well.

@TheSpbra1n
Author

After nomad system gc, I see:

root@nomad-01:/home/user# nomad system gc
root@nomad-01:/home/user# nomad job status correct_suspended_tickets
ID                   = correct_suspended_tickets
Name                 = correct_suspended_tickets
Submit Date          = 2021-04-05T19:17:04+03:00
Type                 = batch
Priority             = 50
Datacenters          = dc1,dc2
Namespace            = default
Status               = running
Periodic             = true
Parameterized        = false
Next Periodic Launch = 2021-04-29T20:30:00+03:00 (4m16s from now)

Children Job Summary
Pending  Running  Dead
4328     0        29485

No instances of periodic job found
root@nomad-01:/home/user#

I think the "Pending" count has a strange value.

@TheSpbra1n
Author

Will wait for 1.0.5, thanks!

@tgross tgross added stage/needs-verification Issue needs verifying it still exists and removed stage/waiting-reply labels Apr 29, 2021
@tgross
Member

tgross commented Apr 29, 2021

Thanks for that job status output @TheSpbra1n. That definitely looks suspiciously like the #10145 bug. I'm going to mark this issue for verifying once 1.0.5 ships. Thanks again!

@tgross
Member

tgross commented Jun 3, 2021

Hi @TheSpbra1n! Have you had a chance to see if 1.0.5+ fixes the issue for you?

@tgross
Member

tgross commented Jul 7, 2021

Doing some issue cleanup and I'm going to close this one out. If you do get a chance to test with Nomad 1.0.5+ or later and it doesn't fix the problem, please feel free to reopen.

@tgross tgross closed this as completed Jul 7, 2021
Nomad - Community Issues Triage automation moved this from In Progress to Done Jul 7, 2021
@TheSpbra1n
Author

Thanks!
We have planned the upgrade but haven't done it yet.

@TheSpbra1n
Author

Hi @tgross! We upgraded Nomad to 1.1.2:

# nomad version
Nomad v1.1.2 (60638a086ef9630e2a9ba1e237e8426192a44244)

But the problem remains:

# curl -s  'http://localhost:4646/v1/agent/self' | jq . | grep 'JobGCThreshold'
      "JobGCThreshold": "48h",
# nomad status correct_suspended_tickets/periodic-1627318800
ID            = correct_suspended_tickets/periodic-1627318800
Name          = correct_suspended_tickets/periodic-1627318800
Submit Date   = 2021-07-26T20:00:00+03:00
Type          = batch
Priority      = 50
Datacenters   = dc1,dc2
Namespace     = default
Status        = dead
Periodic      = false
Parameterized = false

Summary
Task Group                 Queued  Starting  Running  Failed  Complete  Lost
correct_suspended_tickets  0       0         0        0       1         0

Allocations
ID        Node ID   Task Group                 Version  Desired  Status    Created    Modified
08a5c3d8  32e62217  correct_suspended_tickets  9        run      complete  2d22h ago  1d4h ago

Alloc status:

# nomad alloc status 08a5c3d8
ID                  = 08a5c3d8-f25d-0b56-7b67-03212862552c
Eval ID             = 438f32ce
Name                = correct_suspended_tickets/periodic-1627318800.correct_suspended_tickets[0]
Node ID             = 32e62217
Node Name           = docker-02
Job ID              = correct_suspended_tickets/periodic-1627318800
Job Version         = 9
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = run
Desired Description = <none>
Created             = 2d22h ago
Modified            = 1d4h ago

Task "correct_suspended_tickets" is "dead"
Task Resources
CPU      Memory   Disk     Addresses
100 MHz  512 MiB  300 MiB

Task Events:
Started At     = 2021-07-26T17:01:16Z
Finished At    = 2021-07-26T17:01:18Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2021-07-26T20:01:18+03:00  Terminated  Exit Code: 0
2021-07-26T20:01:16+03:00  Started     Task started by client
2021-07-26T20:00:01+03:00  Driver      Downloading image
2021-07-26T20:00:01+03:00  Task Setup  Building Task Directory
2021-07-26T20:00:01+03:00  Received    Task received by client
#

Children job count:

# nomad status correct_suspended_tickets | grep -A3 'Children Job Summary'
Children Job Summary
Pending  Running  Dead
8325     -3       33830

@TheSpbra1n
Author

#11017

@lgfa29
Contributor

lgfa29 commented Aug 19, 2021

Re-opening since the problem persists in v1.1.2

@lgfa29 lgfa29 reopened this Aug 19, 2021
Nomad - Community Issues Triage automation moved this from Done to Needs Triage Aug 19, 2021
@lgfa29 lgfa29 added stage/needs-investigation and removed stage/needs-verification Issue needs verifying it still exists labels Aug 19, 2021
@tgross tgross removed their assignment Nov 3, 2021
@tgross tgross moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Jan 10, 2022
@kemko

kemko commented Nov 28, 2022

This issue still seems reproducible in v1.4.2.
