ProhibitOverlap not respected after a restart #14505

Closed · sammynx opened this issue Sep 8, 2022 · 7 comments
Labels: hcc/pty, theme/batch, type/bug

sammynx commented Sep 8, 2022

Nomad version

Output from nomad version
Nomad v1.3.1 (2b054e3)

Operating system and Environment details

Linux 4.19.0-21-cloud-amd64 #1 SMP Debian 4.19.249-2 (2022-06-30) x86_64 GNU/Linux

Issue

I have a periodic job with ProhibitOverlap=true in its periodic stanza. While an instance of the job was running, the machine crashed; when it came back up, Nomad resumed the periodic job that had been running. But when the next period was triggered, Nomad started a second instance while the first was still running.

I expected Nomad to wait until the running instance had finished.

Reproduction steps

Expected Result

Actual Result

Job file (if appropriate)

Used periodic config

        "Periodic": {
            "Enabled": true,
            "ProhibitOverlap": true,
            "Spec": "51 * * * *",
            "SpecType": "cron",
            "TimeZone": "UTC"
        },
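
For reference, here is a sketch of the same settings written as an HCL periodic block (an equivalent rendering based on the documented jobspec attribute names, not taken from the original job file):

    periodic {
      cron             = "51 * * * *"   # same schedule as above
      prohibit_overlap = true           # a second run should not start while one is still active
      time_zone        = "UTC"
    }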

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

DerekStrickland (Contributor) commented

Thanks for reporting this issue @sammynx. Just to clarify, is this what happened?

  • You had an instance of the job running (run 1)
  • The entire machine crashed - not just the Nomad agent process
  • The machine restarted
  • Nomad restarted
  • Nomad did not resume the previous job run (run 1)
  • Nomad scheduled a new run (run 2)
  • Before run 2 could finish, a new run (run 3) started

Is this accurate? Do you have any logs you can provide, and your full jobspec (with any sensitive data removed)?

sammynx (Author) commented Sep 14, 2022

This is what happened:

  • one instance of a periodic job was running (run 1)
  • the entire machine crashed
  • the machine restarted
  • Nomad restarted
  • Nomad resumed all running jobs, including run 1
  • Nomad scheduled a new run for the periodic job (run 2)
  • while run 1 was still running, Nomad started run 2

This is the entire jobspec:

{
  "Region": "global",
  "Namespace": "default",
  "ID": "20b946ca-e52f-47e8-a695-8480b19577d6",
  "Name": "20b946ca-e52f-47e8-a695-8480b19577d6",
  "Type": "batch",
  "Priority": 50,
  "AllAtOnce": false,
  "Datacenters": [
    "dc1"
  ],
  "Constraints": null,
  "Affinities": null,
  "TaskGroups": [
    {
      "Name": "group1",
      "Count": 1,
      "Constraints": null,
      "Affinities": null,
      "Tasks": [
        {
          "Name": "20b946ca-e52f-47e8-a695-8480b19577d6",
          "Driver": "docker",
          "User": "",
          "Lifecycle": null,
          "Config": {
            "image": "image"1",
            "network_mode": "host"
          },
          "Constraints": null,
          "Affinities": null,
          "Env": {},
          "Services": null,
          "Resources": {
            "CPU": 100,
            "Cores": 0,
            "MemoryMB": 32,
            "MemoryMaxMB": 300,
            "DiskMB": 0,
            "Networks": null,
            "Devices": null,
            "IOPS": 0
          },
          "RestartPolicy": {
            "Interval": 600000000000,
            "Attempts": 3,
            "Delay": 15000000000,
            "Mode": "fail"
          },
          "Meta": null,
          "KillTimeout": 5000000000,
          "LogConfig": {
            "MaxFiles": 2,
            "MaxFileSizeMB": 3
          },
          "Artifacts": null,
          "Vault": null,
          "Templates": null,
          "DispatchPayload": null,
          "VolumeMounts": null,
          "Leader": false,
          "ShutdownDelay": 0,
          "KillSignal": "",
          "Kind": "",
          "ScalingPolicies": null
        }
      ],
      "Spreads": null,
      "Volumes": null,
      "RestartPolicy": {
        "Interval": 600000000000,
        "Attempts": 3,
        "Delay": 60000000000,
        "Mode": "fail"
      },
      "ReschedulePolicy": {
        "Attempts": 0,
        "Interval": 86400000000000,
        "Delay": 5000000000,
        "DelayFunction": "constant",
        "MaxDelay": 0,
        "Unlimited": false
      },
      "EphemeralDisk": {
        "Sticky": false,
        "Migrate": false,
        "SizeMB": 100
      },
      "Update": null,
      "Migrate": null,
      "Networks": null,
      "Meta": null,
      "Services": null,
      "ShutdownDelay": null,
      "StopAfterClientDisconnect": null,
      "MaxClientDisconnect": null,
      "Scaling": null,
      "Consul": {
        "Namespace": ""
      }
    }
  ],
  "Update": {
    "Stagger": 0,
    "MaxParallel": 0,
    "HealthCheck": "",
    "MinHealthyTime": 0,
    "HealthyDeadline": 0,
    "ProgressDeadline": 0,
    "Canary": 0,
    "AutoRevert": false,
    "AutoPromote": false
  },
  "Multiregion": null,
  "Spreads": null,
  "Periodic": {
    "Enabled": true,
    "Spec": "14 8 * * *",
    "SpecType": "cron",
    "ProhibitOverlap": true,
    "TimeZone": "UTC"
  },
  "ParameterizedJob": null,
  "Reschedule": null,
  "Migrate": null,
  "Meta": {},
  "ConsulToken": "",
  "VaultToken": "",
  "Stop": false,
  "ParentID": "",
  "Dispatched": false,
  "DispatchIdempotencyToken": "",
  "Payload": null,
  "ConsulNamespace": "",
  "VaultNamespace": "",
  "NomadTokenID": "",
  "Status": "running",
  "StatusDescription": "",
  "Stable": false,
  "Version": 0,
  "SubmitTime": 1663140700759458661,
  "CreateIndex": 2511456,
  "ModifyIndex": 2511456,
  "JobModifyIndex": 2511456
}
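
For anyone trying to reproduce this, below is a hypothetical minimal jobspec along the same lines, written as an HCL sketch; the job name, image, schedule, and sleep duration are placeholders rather than values from the report above. The task deliberately outlives its period, so with ProhibitOverlap honored the next run should wait for the first to finish.

    # Hypothetical reproduction jobspec -- all names and values are placeholders.
    job "prohibit-overlap-repro" {
      datacenters = ["dc1"]
      type        = "batch"

      periodic {
        cron             = "*/5 * * * *"  # fire every 5 minutes
        prohibit_overlap = true           # a second run should not start while one is active
      }

      group "group1" {
        task "sleep" {
          driver = "docker"

          config {
            image   = "busybox:1.36"
            command = "sleep"
            args    = ["600"]             # sleep 10 minutes, outliving the 5-minute period
          }

          resources {
            cpu    = 100
            memory = 32
          }
        }
      }
    }

Run the job, hard-reset the machine while an instance is still sleeping, and after the restart check whether the next period launches a second allocation alongside the resumed one.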

DerekStrickland (Contributor) commented

Linking this to #11052. This is possibly a duplicate.

DerekStrickland (Contributor) commented

@sammynx Do you have the logs available, and if so can you tell if a leader election was in progress?

JNProtzman commented Jan 25, 2023

Hi @DerekStrickland, I was hoping to comment on #11052, but it's been locked.

I'd be interested in working on this bug. I'm not sure if @mikenomitch is still involved, but per his comment I'd be happy to get ramped up and try to resolve this issue. Let me know if I can help!

tgross (Member) commented Feb 13, 2023

@JNProtzman, it turns out that Derek has moved on to greener pastures. But you're definitely welcome to take a look if you're interested! Thanks!

tgross moved this from Triaging to Needs Roadmapping in Nomad - Community Issues Triage Feb 13, 2023
tgross added the theme/batch label Feb 13, 2023
jrasell assigned lgfa29 and jrasell and unassigned lgfa29 Mar 3, 2023

jrasell (Member) commented Apr 4, 2023

I believe this issue is now resolved by the work done in #16583, which will be released and backported in the near future, so I am closing this issue. If the bug persists, please feel free to reopen it and we will continue to investigate.

jrasell closed this as completed Apr 4, 2023
Nomad - Community Issues Triage automation moved this from Needs Roadmapping to Done Apr 4, 2023