
Unable to change default SchedulerAlgorithm via API #8070

Closed
idrennanvmware opened this issue May 28, 2020 · 8 comments


idrennanvmware commented May 28, 2020

Nomad version

Nomad 0.11.1

Operating system and Environment details

Photon OS 3

Issue

We have a cluster that has been stood up and bootstrapped with ACLs (no default scheduler configuration was set at the time of creation).

The result of the API call GET /v1/operator/scheduler/configuration, referenced here: https://www.nomadproject.io/api-docs/operator/#update-scheduler-configuration, is:

{"SchedulerConfig":{"PreemptionConfig":{"SystemSchedulerEnabled":false,"BatchSchedulerEnabled":false,"ServiceSchedulerEnabled":false},"CreateIndex":5,"ModifyIndex":167},"Index":167,"LastContact":0,"KnownLeader":true}

When calling PUT or POST on http://localhost:4646/v1/operator/scheduler/configuration?cas=167 we receive:

{"Updated":true,"Index":177}

but when re-querying /v1/operator/scheduler/configuration we receive:
{"SchedulerConfig":{"PreemptionConfig":{"SystemSchedulerEnabled":true,"BatchSchedulerEnabled":false,"ServiceSchedulerEnabled":false},"CreateIndex":5,"ModifyIndex":179},"Index":179,"LastContact":0,"KnownLeader":true}

The modification index has incremented as expected but we aren't seeing the value of the scheduler being updated.

The request body for the PUT is:

{
  "SchedulerAlgorithm": "spread",
  "PreemptionConfig": {
    "SystemSchedulerEnabled": true,
    "BatchSchedulerEnabled": false,
    "ServiceSchedulerEnabled": false
  }
}
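For reference, the read-modify-write flow against this endpoint can be sketched as follows. This is a minimal Python sketch, not the tooling used here; the address constant, token handling, and helper names are my own, and only the endpoint path and cas query parameter come from the API docs linked above.

```python
# Sketch of the check-and-set (cas) update flow for the Nomad operator API.
import json
import urllib.request

NOMAD_ADDR = "http://localhost:4646"  # placeholder address


def cas_url(base, index):
    """Build the update URL guarded by the current ModifyIndex."""
    return f"{base}/v1/operator/scheduler/configuration?cas={index}"


def build_payload(algorithm, preemption):
    """Request body for PUT /v1/operator/scheduler/configuration."""
    return json.dumps({
        "SchedulerAlgorithm": algorithm,
        "PreemptionConfig": preemption,
    })


def update_scheduler_config(token, algorithm, preemption):
    """Read the current config, then write it back using its ModifyIndex as cas."""
    get = urllib.request.Request(
        f"{NOMAD_ADDR}/v1/operator/scheduler/configuration",
        headers={"X-Nomad-Token": token},
    )
    with urllib.request.urlopen(get) as resp:
        current = json.load(resp)
    index = current["SchedulerConfig"]["ModifyIndex"]

    put = urllib.request.Request(
        cas_url(NOMAD_ADDR, index),
        data=build_payload(algorithm, preemption).encode(),
        headers={"X-Nomad-Token": token, "Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(put) as resp:
        return json.load(resp)  # e.g. {"Updated": true, "Index": ...}
```

The cas guard means the write only succeeds if the configuration has not changed since the read, which is why the ModifyIndex from the GET matters.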


cgbaker commented May 28, 2020

I tested this using the payload you included above and it worked fine. I'm a little curious as to why the ModifyIndex on the scheduler configuration has changed from the 177 returned from your PUT /v1/operator/scheduler/configuration to the 179 listed in the follow-up query. One explanation is that another call has been made to change the scheduler configuration. Can you perhaps provide server logs? Also, can you give some more details about your cluster? How many servers, plus the output from nomad server members?


idrennanvmware commented May 28, 2020

Hi @cgbaker

This is a 3-server cluster (the ACL tokens are transient, so I don't mind them being here). The output of nomad server members is:

root [ /home/vagrant ]# /usr/local/bin/nomad server members
Name                 Address        Port  Status  Leader  Protocol  Build   Datacenter  Region
one-photon.global    192.168.50.91  4648  alive   false   2         0.11.1  dev-setup   global
three-photon.global  192.168.50.93  4648  alive   true    2         0.11.1  dev-setup   global
two-photon.global    192.168.50.92  4648  alive   false   2         0.11.1  dev-setup   global

The Nomad server logs don't seem to change before or after these operations; here are the logs:

2020-05-28T16:06:02.652Z [WARN] agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/opt/nomad/plugins
2020-05-28T16:06:02.653Z [INFO] agent: detected plugin: name=exec type=driver plugin_version=0.1.0
2020-05-28T16:06:02.653Z [INFO] agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
2020-05-28T16:06:02.653Z [INFO] agent: detected plugin: name=java type=driver plugin_version=0.1.0
2020-05-28T16:06:02.653Z [INFO] agent: detected plugin: name=docker type=driver plugin_version=0.1.0
2020-05-28T16:06:02.653Z [INFO] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
2020-05-28T16:06:02.653Z [INFO] agent: detected plugin: name=nvidia-gpu type=device plugin_version=0.1.0
2020-05-28T16:06:02.657Z [INFO] nomad.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:192.168.50.91:4647 Address:192.168.50.91:4647} {Suffrage:Voter ID:192.168.50.92:4647 Address:192.168.50.92:4647} {Suffrage:Voter ID:192.168.50.93:4647 Address:192.168.50.93:4647}]"
2020-05-28T16:06:02.658Z [INFO] nomad: serf: EventMemberJoin: one-photon.global 192.168.50.91
2020-05-28T16:06:02.659Z [INFO] nomad: starting scheduling worker(s): num_workers=2 schedulers=[service, batch, system, _core]
2020-05-28T16:06:02.659Z [INFO] client: using state directory: state_dir=/opt/nomad/client
2020-05-28T16:06:02.659Z [INFO] client: using alloc directory: alloc_dir=/opt/nomad/alloc
2020-05-28T16:06:02.659Z [INFO] nomad.raft: entering follower state: follower="Node at 192.168.50.91:4647 [Follower]" leader=
2020-05-28T16:06:02.660Z [INFO] client.fingerprint_mgr.cgroup: cgroups are available
2020-05-28T16:06:02.661Z [INFO] nomad: serf: Attempting re-join to previously known node: two-photon.global: 192.168.50.92:4648
2020-05-28T16:06:02.661Z [INFO] nomad: adding server: server="one-photon.global (Addr: 192.168.50.91:4647) (DC: dev-setup)"
2020-05-28T16:06:02.663Z [INFO] nomad: serf: EventMemberJoin: two-photon.global 192.168.50.92
2020-05-28T16:06:02.664Z [INFO] nomad: serf: EventMemberJoin: three-photon.global 192.168.50.93
2020-05-28T16:06:02.664Z [WARN] nomad: memberlist: Refuting a suspect message (from: one-photon.global)
2020-05-28T16:06:02.664Z [INFO] nomad: serf: Re-joined to previously known node: two-photon.global: 192.168.50.92:4648
2020-05-28T16:06:02.664Z [INFO] nomad: adding server: server="two-photon.global (Addr: 192.168.50.92:4647) (DC: dev-setup)"
2020-05-28T16:06:02.664Z [INFO] nomad: adding server: server="three-photon.global (Addr: 192.168.50.93:4647) (DC: dev-setup)"
2020-05-28T16:06:02.664Z [INFO] client.fingerprint_mgr.consul: consul agent is available
2020-05-28T16:06:02.670Z [INFO] client.fingerprint_mgr.vault: Vault is available
2020-05-28T16:06:02.725Z [INFO] nomad.vault: successfully renewed token: next_renewal=35h59m59.999992142s
2020-05-28T16:06:03.161Z [WARN] nomad.raft: failed to get previous log: previous-index=265 last-index=261 error="log not found"
2020-05-28T16:06:06.676Z [INFO] client.plugin: starting plugin manager: plugin-type=csi
2020-05-28T16:06:06.676Z [INFO] client.plugin: starting plugin manager: plugin-type=driver
2020-05-28T16:06:06.676Z [INFO] client.plugin: starting plugin manager: plugin-type=device
2020-05-28T16:06:06.687Z [INFO] client: started client: node_id=f4572c35-7aab-9d58-64ea-d1d72f814649
2020-05-28T16:06:06.712Z [INFO] client: node registration complete
2020-05-28T16:06:11.927Z [INFO] client: node registration complete

Here is the first curl call and result:

curl --header "X-Nomad-Token: f176d883-d010-6918-fc5a-003f7a7b6688" --request GET http://localhost:4646/v1/operator/scheduler/configuration

{"SchedulerConfig":{"PreemptionConfig":{"SystemSchedulerEnabled":true,"BatchSchedulerEnabled":false,"ServiceSchedulerEnabled":false},"CreateIndex":5,"ModifyIndex":270},"Index":270,"LastContact":0,"KnownLeader":true}

Then the next call:
curl -X PUT -H "Content-Type: application/json" --header "X-Nomad-Token: f176d883-d010-6918-fc5a-003f7a7b6688" -d @temp.json http://localhost:4646/v1/operator/scheduler/configuration?cas=270

{"Updated":true,"Index":277}

And the final call:

curl --header "X-Nomad-Token: f176d883-d010-6918-fc5a-003f7a7b6688" --request GET http://localhost:4646/v1/operator/scheduler/configuration
{"SchedulerConfig":{"PreemptionConfig":{"SystemSchedulerEnabled":true,"BatchSchedulerEnabled":false,"ServiceSchedulerEnabled":false},"CreateIndex":5,"ModifyIndex":277},"Index":277,"LastContact":0,"KnownLeader":true}

These calls are made back to back with no jobs scheduled and the cluster running idle.
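For anyone following along, the ModifyIndex used for the cas parameter can be pulled straight out of the GET response. A minimal sketch using the exact response shown above, which also makes visible that the 0.11.1 response carries no SchedulerAlgorithm field at all:

```python
import json

# The GET response from this thread (Nomad 0.11.1).
response = json.loads(
    '{"SchedulerConfig":{"PreemptionConfig":{"SystemSchedulerEnabled":true,'
    '"BatchSchedulerEnabled":false,"ServiceSchedulerEnabled":false},'
    '"CreateIndex":5,"ModifyIndex":270},"Index":270,"LastContact":0,"KnownLeader":true}'
)

config = response["SchedulerConfig"]
cas_index = config["ModifyIndex"]               # the value to pass as ?cas=
has_algorithm = "SchedulerAlgorithm" in config  # absent on 0.11.1
```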

Here is a server config:

log_level = "INFO"
data_dir = "/opt/nomad"
log_file = "/opt/nomad/nomad.log"
log_rotate_max_files = 5
log_rotate_bytes = 10000000
disable_update_check = true
datacenter = "dev-setup"
bind_addr = "0.0.0.0"

advertise {
  http = "192.168.50.92"
  rpc  = "192.168.50.92"
  serf = "192.168.50.92"
}

consul {
  address = "127.0.0.1:8500"
  token   = "028dfee4-eef0-8a4a-6260-53e0e5148c3a"
}

server {
  enabled = "true"
  default_scheduler_config {
    scheduler_algorithm = "spread"
  }
  encrypt = "31hdzqUGwY97/mz8fv6wKg=="
  bootstrap_expect = "3"
}

plugin "docker" {
  config {
    auth {
      config = "/etc/docker/config.json"
    }
  }
}

NOTE: The server config stanza looked as follows at the time the cluster was created:

server {
  enabled = "true"
  encrypt = "31hdzqUGwY97/mz8fv6wKg=="
  bootstrap_expect = "3"
}

@idrennanvmware
Copy link
Contributor Author

idrennanvmware commented May 28, 2020

I ran the command a few more times and I can confirm that if I set the file to

{
  "SchedulerAlgorithm": "spread",
  "PreemptionConfig": {
    "SystemSchedulerEnabled": true,
    "BatchSchedulerEnabled": false,
    "ServiceSchedulerEnabled": false
  }
}

and then switch to

{
  "SchedulerAlgorithm": "spread"
}

then I see the preemption values change, but I never see the SchedulerAlgorithm value in the response.
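One plausible explanation for the preemption flags flipping when PreemptionConfig is omitted (this is my assumption, not something confirmed in the thread): the server decodes the body into a full SchedulerConfig struct, so any field left out of the payload falls back to its zero value and is written as-is. The effect can be mimicked like this:

```python
# Mimics Go-style JSON decoding where omitted fields take zero values (assumption).
PREEMPTION_ZERO_VALUES = {
    "SystemSchedulerEnabled": False,
    "BatchSchedulerEnabled": False,
    "ServiceSchedulerEnabled": False,
}


def decode_preemption(payload):
    """Fill in any preemption flags missing from the payload with their zero value."""
    sent = payload.get("PreemptionConfig", {})
    return {key: sent.get(key, zero) for key, zero in PREEMPTION_ZERO_VALUES.items()}


# Full payload: the flag survives the round trip.
full = decode_preemption({"SchedulerAlgorithm": "spread",
                          "PreemptionConfig": {"SystemSchedulerEnabled": True}})

# Bare payload: the omitted flags decode to false, overwriting the stored values.
bare = decode_preemption({"SchedulerAlgorithm": "spread"})
```

If that assumption holds, the safe pattern is to always GET the current config and send the whole object back with only the desired field changed.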


idrennanvmware commented May 28, 2020

@cgbaker I switched the servers over to debug logging and grabbed the leader logs for the entire duration. Maybe this sheds some light, but nothing stands out to me.

leaderlog.txt

edit: I also tried declaring the temp.json file contents as

{
  "SchedulerConfig": {
    "SchedulerAlgorithm": "spread",
    "PreemptionConfig": {
      "SystemSchedulerEnabled": true,
      "BatchSchedulerEnabled": false,
      "ServiceSchedulerEnabled": false
    }
  }
}

But I got similar behavior, except now I don't see the expected change in SystemSchedulerEnabled (so my original temp.json behaves better): the save succeeds, but the GET shows no change.

{"SchedulerConfig":{"PreemptionConfig":{"SystemSchedulerEnabled":false,"BatchSchedulerEnabled":false,"ServiceSchedulerEnabled":false},"CreateIndex":5,"ModifyIndex":54},"Index":54,"LastContact":0,"KnownLeader":true}

@idrennanvmware

Realized we are on 0.11.1, not 0.11.2, so I updated the main post accordingly.

@idrennanvmware

@cgbaker is it possible this is simply a response missing some content (SchedulerAlgorithm)? Since we're seeing the boolean value for SystemSchedulerEnabled flip back and forth, it got me wondering. Did you see the value in your tests?

@idrennanvmware

Problem between the keyboard and chair :(

We are running 0.11.1, and the SchedulerAlgorithm feature was only added in 0.11.2. Sorry!

#7810
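In other words, the field simply does not exist in the 0.11.1 API. A quick gate for this kind of version-dependent field might look like the sketch below; it assumes plain x.y.z version strings and is not part of any Nomad client library.

```python
def supports_scheduler_algorithm(version):
    """SchedulerAlgorithm landed in Nomad 0.11.2 (see #7810); earlier builds lack it."""
    parts = tuple(int(p) for p in version.split("."))
    return parts >= (0, 11, 2)
```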


github-actions bot commented Nov 6, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 6, 2022