ProhibitOverlap not respected after a restart #14505

Closed · sammynx opened this issue Sep 8, 2022 · 7 comments
Labels: hcc/pty, theme/batch, type/bug

sammynx commented Sep 8, 2022

Nomad version

Output from nomad version
Nomad v1.3.1 (2b054e3)

Operating system and Environment details

Linux 4.19.0-21-cloud-amd64 #1 SMP Debian 4.19.249-2 (2022-06-30) x86_64 GNU/Linux

Issue

I have a periodic job with ProhibitOverlap=true in its periodic stanza. While an instance of the job was running, the machine crashed; when it came back up, Nomad resumed the periodic job that had been running. But when the next period was triggered, Nomad started a second instance while the first was still running.

I expected Nomad to wait until the running instance had finished.

Reproduction steps

Expected Result

Actual Result

Job file (if appropriate)

Used periodic config

        "Periodic": {
            "Enabled": true,
            "ProhibitOverlap": true,
            "Spec": "51 * * * *",
            "SpecType": "cron",
            "TimeZone": "UTC"
        },
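
For reference, here is a sketch of the same settings written as an HCL periodic block (an equivalent rendering based on the documented jobspec attribute names, not taken from the original job file):

    periodic {
      cron             = "51 * * * *"   # same schedule as above
      prohibit_overlap = true           # a second run should not start while one is still active
      time_zone        = "UTC"
    }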

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

DerekStrickland (Contributor) commented

Thanks for reporting this issue @sammynx. Just to clarify, is this what happened?

  • You had an instance of the job running (run 1)
  • The entire machine crashed - not just the Nomad agent process
  • The machine restarted
  • Nomad restarted
  • Nomad did not resume the previous job run (run 1)
  • Nomad scheduled a new run (run 2)
  • Before run 2 could finish, a new run (run 3) started

Is this accurate? Do you have any logs you can provide, and your full jobspec (with any sensitive data removed)?

sammynx (Author) commented Sep 14, 2022

This is what happened:

  • one instance of a periodic job was running (run 1)
  • the entire machine crashed
  • the machine restarted
  • Nomad restarted
  • Nomad resumed all running jobs, including run 1
  • Nomad scheduled a new run for the periodic job (run 2)
  • while run 1 was still running, Nomad started run 2

This is the entire jobspec:

{
  "Region": "global",
  "Namespace": "default",
  "ID": "20b946ca-e52f-47e8-a695-8480b19577d6",
  "Name": "20b946ca-e52f-47e8-a695-8480b19577d6",
  "Type": "batch",
  "Priority": 50,
  "AllAtOnce": false,
  "Datacenters": [
    "dc1"
  ],
  "Constraints": null,
  "Affinities": null,
  "TaskGroups": [
    {
      "Name": "group1",
      "Count": 1,
      "Constraints": null,
      "Affinities": null,
      "Tasks": [
        {
          "Name": "20b946ca-e52f-47e8-a695-8480b19577d6",
          "Driver": "docker",
          "User": "",
          "Lifecycle": null,
          "Config": {
            "image": "image"1",
            "network_mode": "host"
          },
          "Constraints": null,
          "Affinities": null,
          "Env": {},
          "Services": null,
          "Resources": {
            "CPU": 100,
            "Cores": 0,
            "MemoryMB": 32,
            "MemoryMaxMB": 300,
            "DiskMB": 0,
            "Networks": null,
            "Devices": null,
            "IOPS": 0
          },
          "RestartPolicy": {
            "Interval": 600000000000,
            "Attempts": 3,
            "Delay": 15000000000,
            "Mode": "fail"
          },
          "Meta": null,
          "KillTimeout": 5000000000,
          "LogConfig": {
            "MaxFiles": 2,
            "MaxFileSizeMB": 3
          },
          "Artifacts": null,
          "Vault": null,
          "Templates": null,
          "DispatchPayload": null,
          "VolumeMounts": null,
          "Leader": false,
          "ShutdownDelay": 0,
          "KillSignal": "",
          "Kind": "",
          "ScalingPolicies": null
        }
      ],
      "Spreads": null,
      "Volumes": null,
      "RestartPolicy": {
        "Interval": 600000000000,
        "Attempts": 3,
        "Delay": 60000000000,
        "Mode": "fail"
      },
      "ReschedulePolicy": {
        "Attempts": 0,
        "Interval": 86400000000000,
        "Delay": 5000000000,
        "DelayFunction": "constant",
        "MaxDelay": 0,
        "Unlimited": false
      },
      "EphemeralDisk": {
        "Sticky": false,
        "Migrate": false,
        "SizeMB": 100
      },
      "Update": null,
      "Migrate": null,
      "Networks": null,
      "Meta": null,
      "Services": null,
      "ShutdownDelay": null,
      "StopAfterClientDisconnect": null,
      "MaxClientDisconnect": null,
      "Scaling": null,
      "Consul": {
        "Namespace": ""
      }
    }
  ],
  "Update": {
    "Stagger": 0,
    "MaxParallel": 0,
    "HealthCheck": "",
    "MinHealthyTime": 0,
    "HealthyDeadline": 0,
    "ProgressDeadline": 0,
    "Canary": 0,
    "AutoRevert": false,
    "AutoPromote": false
  },
  "Multiregion": null,
  "Spreads": null,
  "Periodic": {
    "Enabled": true,
    "Spec": "14 8 * * *",
    "SpecType": "cron",
    "ProhibitOverlap": true,
    "TimeZone": "UTC"
  },
  "ParameterizedJob": null,
  "Reschedule": null,
  "Migrate": null,
  "Meta": {},
  "ConsulToken": "",
  "VaultToken": "",
  "Stop": false,
  "ParentID": "",
  "Dispatched": false,
  "DispatchIdempotencyToken": "",
  "Payload": null,
  "ConsulNamespace": "",
  "VaultNamespace": "",
  "NomadTokenID": "",
  "Status": "running",
  "StatusDescription": "",
  "Stable": false,
  "Version": 0,
  "SubmitTime": 1663140700759458661,
  "CreateIndex": 2511456,
  "ModifyIndex": 2511456,
  "JobModifyIndex": 2511456
}
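
For anyone trying to reproduce this, below is a hypothetical minimal jobspec along the same lines, written as an HCL sketch; the job name, image, schedule, and sleep duration are placeholders rather than values from the report above. The task deliberately outlives its period, so with ProhibitOverlap honored the next run should wait for the first to finish.

    # Hypothetical reproduction jobspec -- all names and values are placeholders.
    job "prohibit-overlap-repro" {
      datacenters = ["dc1"]
      type        = "batch"

      periodic {
        cron             = "*/5 * * * *"  # fire every 5 minutes
        prohibit_overlap = true           # a second run should not start while one is active
      }

      group "group1" {
        task "sleep" {
          driver = "docker"

          config {
            image   = "busybox:1.36"
            command = "sleep"
            args    = ["600"]             # sleep 10 minutes, outliving the 5-minute period
          }

          resources {
            cpu    = 100
            memory = 32
          }
        }
      }
    }

Run the job, hard-reset the machine while an instance is still sleeping, and after the restart check whether the next period launches a second allocation alongside the resumed one.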

DerekStrickland (Contributor) commented

Linking this to #11052. This is possibly a duplicate.

DerekStrickland (Contributor) commented

@sammynx Do you have the logs available, and if so can you tell if a leader election was in progress?

JNProtzman commented Jan 25, 2023

Hi @DerekStrickland, I was hoping to comment on #11052, but it's been locked.

I'd be interested in working on this bug. I'm not sure if @mikenomitch is still involved, but per his comment I'd be happy to get ramped up and try to resolve this issue. Let me know if I can help!

tgross (Member) commented Feb 13, 2023

@JNProtzman, it turns out that Derek has moved on to greener pastures. But you're definitely welcome to take a look if you're interested! Thanks!

tgross moved this from Triaging to Needs Roadmapping in Nomad - Community Issues Triage Feb 13, 2023
tgross added the theme/batch label Feb 13, 2023
jrasell assigned lgfa29 and jrasell and unassigned lgfa29 Mar 3, 2023

jrasell (Member) commented Apr 4, 2023

I believe this issue is now resolved by the work done in #16583, which will be released and backported in the near future, so I am closing this issue. If the bug persists, please feel free to reopen it and we will continue to investigate.

jrasell closed this as completed Apr 4, 2023
Nomad - Community Issues Triage automation moved this from Needs Roadmapping to Done Apr 4, 2023