System Scheduler use new Update stanza and Deployments #4740

aaroncline · 2018-10-01T19:41:29Z

Nomad version

Nomad v0.8.4 (dbee1d7)

Operating system and Environment details

CentOS 7
Consul v1.0.7
fabiolb 1.5.6
3 nomad clients
3 nomad servers

Issue

When deploying Fabio using the system scheduler and the exec driver, Nomad does not seem to respect the Update section hierarchy between the job and group sections.

Also, it does not seem as though Nomad treats this as a "deployment". No deployment ID is available in the job submission evaluation.

Reproduction steps

Use the job file below to launch fabio into an environment. Alter the force_job_restart epoch ENV and redeploy and you should see all fabio executions stop at essentially the same time. There is also no deployment ID which is how we track successful deployments on our service scheduled tasks. If you then change the job Update section to match the Group Update section, the tasks will be staggered appropriately.
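As a rough illustration of the "alter the force_job_restart epoch ENV" step, here is a minimal Python sketch that rewrites that environment variable in a job payload shaped like the job file in this issue. The helper name `bump_restart_epoch` is hypothetical and not part of Nomad; the sketch only edits the payload, and resubmitting it is left to whatever tooling you already use.

```python
import copy
import time


def bump_restart_epoch(payload, epoch=None):
    """Return a copy of the job payload with force_job_restart set to a new epoch.

    Hypothetical helper: it only touches the Env map of each task and
    leaves the input payload unmodified.
    """
    epoch = int(time.time()) if epoch is None else int(epoch)
    updated = copy.deepcopy(payload)
    for group in updated["Job"].get("TaskGroups") or []:
        for task in group.get("Tasks") or []:
            env = task.get("Env") or {}
            env["force_job_restart"] = str(epoch)
            task["Env"] = env
    return updated


# Trimmed stand-in for the job file in this issue (most fields omitted).
payload = {
    "Job": {
        "ID": "fabio",
        "Type": "system",
        "TaskGroups": [
            {
                "Name": "devops",
                "Tasks": [
                    {
                        "Name": "devops_fabio_exec",
                        "Env": {"force_job_restart": "1538421216"},
                    }
                ],
            }
        ],
    }
}

changed = bump_restart_epoch(payload, epoch=1538424816)
```

Changing only this value is enough to make the task definition differ from the running version, which is what triggers the replacement of the allocations on resubmission.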

Job file (if appropriate)

{
    "Job": {
        "AllAtOnce": false,
        "Constraints": null,
        "CreateIndex": 635411,
        "Datacenters": [
            "us-east-1"
        ],
        "Dispatched": false,
        "ID": "fabio",
        "JobModifyIndex": 855141,
        "Meta": null,
        "Migrate": null,
        "ModifyIndex": 855141,
        "Name": "fabio",
        "Namespace": "default",
        "ParameterizedJob": null,
        "ParentID": "",
        "Payload": null,
        "Periodic": null,
        "Priority": 50,
        "Region": "aws",
        "Reschedule": null,
        "Stable": false,
        "Status": "running",
        "StatusDescription": "",
        "Stop": false,
        "SubmitTime": 1538421216716838569,
        "TaskGroups": [
            {
                "Constraints": null,
                "Count": 1,
                "EphemeralDisk": {
                    "Migrate": false,
                    "SizeMB": 300,
                    "Sticky": false
                },
                "Meta": null,
                "Migrate": null,
                "Name": "devops",
                "ReschedulePolicy": null,
                "RestartPolicy": {
                    "Attempts": 2,
                    "Delay": 15000000000,
                    "Interval": 1800000000000,
                    "Mode": "fail"
                },
                "Tasks": [
                    {
                        "Artifacts": [
                            {
                                "GetterMode": "any",
                                "GetterOptions": {
                                    "checksum": "sha256:2dfe26aaa74b659a0e595654eb8f9247d947cbf652cbebe03fd8133c2851cb4a"
                                },
                                "GetterSource": "https://github.com/fabiolb/fabio/releases/download/v1.5.6/fabio-1.5.6-go1.9.2-linux_amd64",
                                "RelativeDest": "local/"
                            }
                        ],
                        "Config": {
                            "command": "fabio-1.5.6-go1.9.2-linux_amd64"
                        },
                        "Constraints": null,
                        "DispatchPayload": null,
                        "Driver": "exec",
                        "Env": {
                            "force_job_restart": "1538421216"
                        },
                        "KillSignal": "",
                        "KillTimeout": 5000000000,
                        "Leader": false,
                        "LogConfig": {
                            "MaxFileSizeMB": 10,
                            "MaxFiles": 10
                        },
                        "Meta": null,
                        "Name": "devops_fabio_exec",
                        "Resources": {
                            "CPU": 200,
                            "DiskMB": 0,
                            "IOPS": 0,
                            "MemoryMB": 512,
                            "Networks": [
                                {
                                    "CIDR": "",
                                    "Device": "",
                                    "DynamicPorts": null,
                                    "IP": "",
                                    "MBits": 10,
                                    "ReservedPorts": [
                                        {
                                            "Label": "fabio_9999",
                                            "Value": 9999
                                        },
                                        {
                                            "Label": "fabio_9998",
                                            "Value": 9998
                                        }
                                    ]
                                }
                            ]
                        },
                        "Services": [
                            {
                                "AddressMode": "auto",
                                "CanaryTags": null,
                                "CheckRestart": null,
                                "Checks": [
                                    {
                                        "AddressMode": "",
                                        "Args": null,
                                        "CheckRestart": null,
                                        "Command": "",
                                        "GRPCService": "",
                                        "GRPCUseTLS": false,
                                        "Header": null,
                                        "Id": "",
                                        "InitialStatus": "",
                                        "Interval": 10000000000,
                                        "Method": "",
                                        "Name": "service: \"fabio\" check",
                                        "Path": "",
                                        "PortLabel": "fabio_9999",
                                        "Protocol": "",
                                        "TLSSkipVerify": true,
                                        "Timeout": 5000000000,
                                        "Type": "tcp"
                                    }
                                ],
                                "Id": "",
                                "Name": "fabio",
                                "PortLabel": "fabio_9999",
                                "Tags": null
                            }
                        ],
                        "ShutdownDelay": 0,
                        "Templates": null,
                        "User": "",
                        "Vault": null
                    }
                ],
                "Update": {
                    "MaxParallel": 1,
                    "Stagger": 30000000000
                }
            }
        ],
        "Type": "system",
        "Update": {
            "MaxParallel": 0,
            "Stagger": 0
        },
        "VaultToken": "",
        "Version": 7
    }
}

aaroncline · 2018-10-01T20:17:49Z

I misreported initially and have made some edits. This actually appears to be a bug in the hierarchy of the Group and Job Update stanzas. According to your docs, the Group stanza should have the higher precedence. https://www.nomadproject.io/docs/job-specification/update.html
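To make the expected precedence concrete, here is a small Python sketch of the behaviour the reporter expects: job-level Update values act as defaults and any value set at the group level wins. This is an illustration of the expectation described in this issue, not Nomad's actual merge code; the function name `effective_update` and the convention that `None` means "unset" are assumptions made for the example.

```python
def effective_update(job_update, group_update):
    """Merge two Update stanzas, with the group stanza taking precedence.

    Illustrative only: None is treated as "not set" at the group level.
    """
    merged = dict(job_update or {})
    for key, value in (group_update or {}).items():
        if value is not None:
            merged[key] = value
    return merged


# Values taken from the job file in this issue (Stagger is in nanoseconds).
job_update = {"MaxParallel": 0, "Stagger": 0}
group_update = {"MaxParallel": 1, "Stagger": 30_000_000_000}

result = effective_update(job_update, group_update)
```

With the values from the job file, the expected effective settings are `MaxParallel` 1 and a 30 second `Stagger`, whereas the observed behaviour matches the job-level values of 0.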

dadgar · 2018-10-02T00:09:01Z

@aaroncline Hey Aaron,

The system job currently doesn't support the new update system using deployments. You can see the callout here: https://www.nomadproject.io/docs/job-specification/update.html

I am going to rename the issue to reflect this

ricbartm · 2019-04-11T14:39:35Z

Hello @dadgar. We have a use case where we want to deploy a custom job on every node of a pool of nodes distributed around the globe, and we thought the system scheduler was the best fit for it. Nevertheless, given that the new deployments and update stanza configurations are not being honoured, we may need to work around it by using the service scheduler, some job constraints to avoid multiple copies of the same job being deployed on the same node, and some automation to adjust the overall job count to match our cluster size as it grows or shrinks. This is doable, but far from ideal.

That said, this issue was opened a long time ago and has had very little activity. Is there any chance that you could share what the plans are for this? I'd like to set some expectations (even if the answer is "we don't have plans for this") so that we can make the most informed decision about it.

Finally, a shout-out to other folks, and especially to @aaroncline: how did you finally work around this issue for your use case?
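The workaround described in this comment (service scheduler, a constraint against co-locating copies, and a count that tracks cluster size) might look roughly like the Python sketch below. It assumes a job payload shaped like the one in the original report and uses Nomad's `distinct_hosts` constraint operator at the group level; the helper name `as_service_workaround` and the idea of passing in `node_count` from external automation are assumptions for illustration.

```python
import copy


def as_service_workaround(payload, node_count):
    """Return a copy of a system job payload reshaped as a service job.

    Hypothetical helper: one allocation per node is approximated with
    Count == node_count plus a distinct_hosts constraint on each group.
    """
    updated = copy.deepcopy(payload)
    job = updated["Job"]
    job["Type"] = "service"
    distinct = {"LTarget": "", "RTarget": "true", "Operand": "distinct_hosts"}
    for group in job.get("TaskGroups") or []:
        constraints = list(group.get("Constraints") or [])
        if distinct not in constraints:
            constraints.append(distinct)
        group["Constraints"] = constraints
        # node_count must be kept in sync with the cluster by external automation.
        group["Count"] = node_count
    return updated


# Trimmed stand-in for the job file in this issue.
payload = {
    "Job": {
        "ID": "fabio",
        "Type": "system",
        "TaskGroups": [{"Name": "devops", "Count": 1, "Constraints": None}],
    }
}

result = as_service_workaround(payload, node_count=3)
```

The drawback noted in the comment still applies: unlike a system job, nothing here reacts to nodes joining or leaving, so the count has to be recomputed and the job resubmitted whenever the cluster size changes.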

calavera · 2020-05-28T17:17:27Z

@dadgar we're investigating using Nomad at Netlify for a large heterogeneous deployment. Solving this issue would help us tremendously to decide whether to use Nomad. Is there anything we can do to help it move forward? The documentation says that this will be fixed in "future releases", but it'd be great to know whether you have more specific plans to address it.

schmichael · 2020-05-28T18:48:38Z

That's super exciting @calavera! nomadproject.io itself uses Netlify, so it would be exciting to be "self-hosted" in a way.

Unfortunately this feature is not planned for the upcoming 0.11.x or 0.12.0 releases. It is absolutely in our queue for prioritization after 0.12.0, but I don't want to make any promises at this time. Would it be possible to elaborate on your use case in case there's a workaround we could help provide?

I'll try to update this issue when it's prioritized on our roadmap.

This PR implements a new "System Batch" scheduler type. Jobs can make use of this new scheduler by setting their type to 'sysbatch'. As the name implies, sysbatch can be thought of as a hybrid between system and batch jobs: it is for running short-lived jobs intended to run on every compatible node in the cluster. As with batch jobs, sysbatch jobs can also be periodic and/or parameterized dispatch jobs. A sysbatch job is considered complete when it has been run on all compatible nodes until reaching a terminal state (success or failed on retries). Feasibility and preemption are governed the same as with system jobs. In this PR, the update stanza is not yet supported. The update stanza is still limited in functionality for the underlying system scheduler, and is not useful yet for sysbatch jobs. Further work in #4740 will improve support for the update stanza and deployments. Closes #2527

apkrymov · 2022-07-14T13:58:22Z

@schmichael Any updates? We really need this feature.
Our use case is the same as the one @ricbartm mentioned. We need to deploy a service based on host constraints, not on a fixed count of replicas in the cluster. So the system scheduler fits perfectly, but we cannot control the deployment process due to the current Update stanza limitations.

ebarriosjr · 2022-09-26T09:52:30Z

@schmichael any updates? We could also really use this feature.
Thanks!

schmichael · 2022-10-04T20:37:53Z

Unfortunately no updates. Sorry for letting this slip. Definitely still something we want to do, but I don't want to keep overpromising and underdelivering on timelines. 😬

axsuul · 2022-10-22T13:23:10Z

Same here, we make heavy use of system jobs and really need a way to do rolling updates for them.

hyungjic · 2024-05-24T22:59:17Z

@schmichael Any updates on this issue?😄

aaroncline changed the title ~~System Scheduler Does Not Rotate Deployments or provide Deployment ID~~ System Scheduler Does Not Respect Update Stanza Hierarchy Oct 1, 2018

dadgar changed the title ~~System Scheduler Does Not Respect Update Stanza Hierarchy~~ System Scheduler use new Update stanza and Deployments Oct 2, 2018

dadgar added type/enhancement theme/deployments theme/system-scheduler labels Oct 2, 2018

preetapan mentioned this issue Nov 14, 2018

Rolling updates doens;t work well for system jobs #4786

Closed

yishan-lin mentioned this issue Jun 29, 2020

Batch System Jobs #2527

Closed

tgross added the stage/accepted label ("Confirmed, and intend to work on. No timeline commitment though.") Aug 24, 2020

shoenig mentioned this issue Nov 9, 2020

core: implement system batch scheduler #9160

Merged

tgross mentioned this issue Feb 1, 2021

Nomad does not wait until allocations become healthy. #9915

Closed

tgross added this to Needs Roadmapping in Nomad - Community Issues Triage Feb 12, 2021

tgross removed this from Needs Roadmapping in Nomad - Community Issues Triage Mar 4, 2021

schmichael mentioned this issue Apr 28, 2023

Add jitter or spread to periodic job scheduling #17024

Open
