
Batch jobs scheduled multiple times when node goes down, regardless if drained or just stopped #1050

Closed
g0t4 opened this issue Apr 7, 2016 · 12 comments

Comments


g0t4 commented Apr 7, 2016

Nomad version

Nomad v0.3.1

Operating system and Environment details

centos 7 3.10.0-327.10.1.el7.x86_64
Docker version 1.10.3, build 20f81dd

Issue

I have batch jobs with 50 tasks. If I take a node down while it is processing work and then bring a new node up, the work from the node that went down tends to be run multiple times, sometimes up to 50 times, seemingly endlessly. Even as the previous runs complete successfully, Nomad keeps scheduling new allocations; I have to stop the job to get it to stop.

I ran into this issue both by just shutting down a node and by draining a node, so it seems to be a problem in either case.

Is something going wrong with evaluations if a node goes down?

Shouldn't we be able to lose a node and have the processing eventually move to another machine?

I don't have count set on the tasks, so I don't know why Nomad would run multiple copies of them.

Reproduction steps

Launch a batch job with work running on multiple nodes. Take one node down and bring up a new node in its place. Bringing up a new node may not even be necessary; that's just something auto scaling does in my setup.
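
A rough sketch of that flow as CLI steps (the job file name and node ID are placeholders; on Nomad 0.3.x the drain command is `nomad node-drain`):

```sh
# Rough sketch of the repro flow (job file name and node ID are placeholders):
nomad run repro.hcl                  # submit the batch job; work spreads across clients
nomad status repro                   # confirm allocations are running on multiple nodes

# Take one node down: either stop the machine/agent outright, or drain it first
nomad node-drain -enable <node-id>   # 0.3.x command name for draining a client

# Bring up a replacement client (auto scaling does this in my case), then watch
# the job's allocations; the lost node's work keeps getting rescheduled
nomad status repro
```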

g0t4 closed this as completed Apr 7, 2016
g0t4 reopened this Apr 7, 2016

g0t4 commented Apr 7, 2016

g0t4 changed the title from "Batch jobs scheduled multiple times when node goes down" to "Batch jobs scheduled multiple times when node goes down, regardless if drained or just stopped" Apr 7, 2016

dadgar commented Apr 11, 2016

@g0t4 Could you please share the Nomad server logs from when this happens, along with the job file? How many clients do you have running?

I was not able to reproduce using the instructions. I used two clients running a Docker container in batch mode and killed one of the clients.


g0t4 commented Apr 12, 2016

I was able to reproduce with a generic batch job composed of sleeps; the job file is attached:
repro.hcl.txt

Here's a video explaining the steps to reproduce: https://youtu.be/Pm0nQtqQlWk

Hope this helps; let me know if you want anything else. The video includes a dump of the server logs from that run, in case you want to see them without trying the repro on your end.
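
For anyone skimming the thread, a job along the lines of that attachment would look roughly like the sketch below. This is illustrative only (the names, driver, and resource values are guesses, and the thread doesn't show whether each sleep lives in its own group or shares one); the real file is repro.hcl.txt above.

```hcl
# Illustrative sketch only -- not the actual repro.hcl.txt attachment.
# A batch job where each unit of work is its own group/task and no count
# is set anywhere, so the scheduler spreads the work across client nodes.
job "repro" {
  datacenters = ["dc1"]
  type        = "batch"

  group "sleep-1" {
    task "sleep" {
      driver = "exec"
      config {
        command = "/bin/sleep"
        args    = ["300"]
      }
      resources {
        cpu    = 100
        memory = 64
      }
    }
  }

  group "sleep-2" {
    task "sleep" {
      driver = "exec"
      config {
        command = "/bin/sleep"
        args    = ["300"]
      }
      resources {
        cpu    = 100
        memory = 64
      }
    }
  }

  # ...more groups defined the same way, up to the ~50 units of work
  # described in the original report...
}
```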


dadgar commented Apr 12, 2016

@g0t4 I just watched the video. That does look like a nasty bug! I will pull down that file tomorrow and try the repro. I think one difference when I tried to reproduce it earlier today was that I had a task group with one task and the count set to ~20, whereas you have each task defined separately.
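
For contrast, the count-based shape described here would look roughly like the following. This is illustrative only, not the actual test job; the per-task shape is the sketch in the previous comment.

```hcl
# Illustrative only: one task group scaled with count, the shape that did
# not reproduce the bug in the earlier attempt.
job "batch-count" {
  datacenters = ["dc1"]
  type        = "batch"

  group "work" {
    count = 20

    task "worker" {
      driver = "docker"
      config {
        image   = "alpine"
        command = "sleep"
        args    = ["300"]
      }
      resources {
        cpu    = 100
        memory = 64
      }
    }
  }
}
```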


g0t4 commented Apr 12, 2016

Glad the video helped! I figured it was an easier way to explain what was going on :)

Let me know if I can help with anything else; I'd be happy to test things when you have updates.



dadgar commented Apr 12, 2016

Hey Wes,

Do you want to try this branch and make sure the fix works:
#1086

Thanks,
Alex
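
For anyone following along, testing a fix branch like this means building Nomad from the PR's head ref. A rough sketch, with the local branch name and make target as assumptions rather than anything stated in the thread:

```sh
# Rough sketch of pulling down and building the fix branch from PR #1086
# (the local branch name and make target are assumptions, not from the thread):
git clone https://github.com/hashicorp/nomad.git
cd nomad
git fetch origin pull/1086/head:gh-1050-fix
git checkout gh-1050-fix
make dev             # or: go build -o bin/nomad .
bin/nomad version    # confirm you're running the patched build
```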



g0t4 commented Apr 13, 2016

Will do, thanks!



g0t4 commented Apr 13, 2016

Works with the mock job file that I sent you; if I get a chance this week I'll try to swap this build into my test cluster and put real work against it. How stable is this branch?



g0t4 commented Apr 13, 2016

Thanks a bunch for fixing this so quickly!



dadgar commented Apr 13, 2016

@g0t4 thanks for testing that! I would just wait for 0.3.2-RC, which should be out soon!


g0t4 commented Apr 13, 2016

Fantastic, looking forward to it!


@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Dec 23, 2022