Rolling updates don't work well for system jobs #4786
Comments
Perhaps one way to reproduce is to drain a node in the middle of a rolling update. But before, when we saw this behavior, we didn't do any node drain.
@tantra35 system jobs can only have a count of 1. Do you have a full job spec you can post here?
@preetapan Sorry, I misspelled it. I mean a job running on 3 instances. Here is the job spec:
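The original spec wasn't captured in this thread. A minimal sketch of what a 3-node system job with a 30-minute rolling-update stagger might look like (the job name, driver, image, and resource values are illustrative assumptions, not from the issue):

```hcl
# Hypothetical sketch only; the real job spec was not preserved in the thread.
job "example" {
  datacenters = ["dc1"]
  type        = "system"   # runs on every eligible client node

  update {
    stagger      = "30m"   # wait 30 minutes between allocation updates
    max_parallel = 1       # update one allocation at a time
  }

  group "app" {
    task "app" {
      driver = "docker"
      config {
        image = "example/app:latest"   # placeholder image
      }
      resources {
        cpu    = 100
        memory = 500   # reduced stepwise (500 -> 400 -> 300 -> 200) to trigger updates
      }
    }
  }
}
```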
@preetapan @cgbaker We tested with the following job definition. Then, to test rolling updates, we reduced memory from 500 to 400, then to 300, then to 200, and saw the following: the distance between allocation launches is not 30 minutes, but much less.
https://www.nomadproject.io/docs/job-specification/update.html

You probably care about the `update` stanza settings documented there.
@jippi Yes, you're right: for the service job type the new update block works. On our test stand we can reproduce, with 100% certainty, the situation where one of the nodes is draining and the system scheduler restarts all allocations within a short time, without waiting for the required timeouts. We also found that rolling updates for system jobs misbehave.
To demonstrate what actually goes wrong, we launched nomad with debug logs on a test stand. The test stand configuration is as follows: 3 nomad servers and 4 nomad clients. We launched a system job with the following update stanza:
then made some change in it and launched the updated job,
then on some node we did a node drain. As expected, one of the allocations was updated, but too early (in this job, rolling updates should happen at 30-minute intervals, one allocation at a time). Then we undrained the node,
and again one of the allocations was updated too early. So in the end we have the following picture:
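The drain/undrain steps above can be sketched as shell commands. This is a hedged reconstruction (the exact commands were not preserved in the thread); `<node-id>` is a placeholder, and flag syntax may differ between Nomad versions:

```shell
# Launch the system job, then re-run it with a changed spec to start a rolling update
nomad job run example.nomad
# ...edit example.nomad (e.g. reduce memory), then:
nomad job run example.nomad

# Drain one of the client nodes in the middle of the rolling update
nomad node drain -enable -yes <node-id>

# Later, undrain it; in the reported behavior another allocation
# updates immediately, instead of waiting out the 30m stagger
nomad node drain -disable -yes <node-id>
```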
The distance between allocation updates is much less than 30 minutes. But I must say that in our production a similar situation happens without any node drain (I mention the drain here only as a way to reproduce). The key reason why this happens is understandable: the nomad server should ignore all remaining allocations and not update them if that is not needed. But if any event happens in the cluster affecting the nodes where the allocations are placed (for example node flapping due to a bad network, which triggers node-update evaluations; this now happens often in AWS environments), allocations that are due to be rolling-updated will be relaunched too early. So rolling updates partially work wrong (they behave unstably compared with previous versions of nomad).
@tantra35 sorry for the late reply. This is a known issue with the system scheduler. We are tracking improvements to the system scheduler in #4740, where we will bring all the recent enhancements to deployments for batch and service jobs to the system scheduler. I am closing this issue in favor of #4740.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
0.8.6
We have a system job with the following rolling update config:
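The config block itself was not captured in the thread. Given the 30-minute interval and one-at-a-time updates described below, it presumably looked something like this (a sketch, not the original):

```hcl
# Presumed shape of the rolling-update config; only the 30m stagger
# is stated explicitly in the issue, max_parallel = 1 is an assumption.
update {
  stagger      = "30m"
  max_parallel = 1
}
```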
The job has one task group with a count of 3. When we update it, sometimes nomad doesn't wait 30m between task restarts and updates all tasks at once. In the output above we can see that the distance between task restarts is much less than it should be (30m).