bug in auto healing #2069

Closed
OferE opened this issue Dec 8, 2016 · 9 comments

Comments

@OferE

OferE commented Dec 8, 2016

Nomad version

Nomad v0.5.0

Operating system and Environment details

not relevant

Issue

When the Nomad agent on a client machine is removed, the tasks on that machine are moved to other machines in the cluster, which is fine, but the tasks on the lost machine are not stopped.

Reproduction steps

Just launch a job on 2 machines and kill the agent on one of the machines.

@dadgar
Contributor

dadgar commented Dec 8, 2016

Hey,

If you bring the agent back up on one of the machines it will realize the work has been migrated and kill the tasks. Once the agent is dead though, there is nothing Nomad can do to clean it up as there is no Nomad process running.

If that answers your question please close this issue. If not let me know and happy to answer any questions.
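
To illustrate the behavior described above (a restarted agent notices that work has been migrated and kills the leftover tasks), here is a minimal sketch of that kind of reconciliation. It is not Nomad's actual code: the function name, the `desired` mapping, and the `"run"`/`"stop"` status strings are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def allocations_to_stop(local_running: List[str], desired: Dict[str, str]) -> List[str]:
    """Return the locally running allocations that should no longer run here.

    local_running: IDs of allocations whose executors the restarted agent
                   found still alive on this machine.
    desired:       hypothetical map of allocation ID -> desired status as
                   reported by the servers ("run" or "stop"). An ID missing
                   from the map is treated as no longer assigned to this node.
    """
    return [alloc_id for alloc_id in local_running if desired.get(alloc_id) != "run"]


# Example: "a1" was migrated to another machine while the agent was down.
running = ["a1", "a2"]
server_view = {"a1": "stop", "a2": "run"}
print(allocations_to_stop(running, server_view))  # prints ['a1']
```

The point of the sketch is that the decision needs the servers' view of the cluster, which only a running agent can obtain.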

@OferE
Author

OferE commented Dec 8, 2016

The tasks themselves run under the nomad executable. This executable could verify that no agent is responding and kill the underlying task.
BTW, great tool. Amazing tool.

@dadgar
Contributor

dadgar commented Dec 8, 2016

@OferE Thanks for the kind words :)

As for the nomad executor, it is a dumb shim. The agent is the one with the logic for talking to the servers. We do not want to spread complexity to all parts of the system (that's how it becomes unreliable). When the agent comes back it will tell the executor to clean up.

@OferE
Author

OferE commented Dec 8, 2016

It's your choice, but I would handle this case, as it leaves a mess in the cluster:
Consul service discovery keeps displaying the tasks and resolving DNS for the "lost" tasks.
It's not that difficult to solve.

It's not that critical, but it seems more elegant.

@OferE
Author

OferE commented Dec 8, 2016

There is also another important use case:
Consider the case where you have a task that does something periodically.
If you stop the job, you might expect the periodic task to stop.

In the case where the "bug" has happened, the periodic task will continue and cause some mess...

All of this is rare of course, since Nomad is a stable piece of SW.
But if you want perfection....

Anyway, if you don't consider this a bug, I will close.
I just wanted to help a bit.

Again, amazing project!

@OferE
Author

OferE commented Dec 8, 2016

Regarding the stability of Nomad: at large scale, rare things can happen, as I'm sure you know.
Even things that are not under the control of the Nomad SW, such as unstable VMs etc.

This is something i would handle :-)

@dadgar
Contributor

dadgar commented Dec 8, 2016

Yeah we expect failures and have designed the agent to reattach to the existing executors and take the correct action. I appreciate your interest in the project! For the above mentioned reasons I am going to close the issue.

Thanks,
Alex

@dadgar dadgar closed this as completed Dec 8, 2016
@OferE
Author

OferE commented Dec 8, 2016

sure :-)
I will try to work around this bug myself by monitoring the agent process and running killall on the other nomad processes when no agent is present.
I just hope that when the agent returns things will not crash...
I will post my findings here in case someone else is interested (I doubt it :-) )
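
For anyone trying the same workaround, a rough sketch of such a watchdog might look like the following. It rests on assumptions not confirmed in this thread: that the agent shows up in the process table as `nomad agent ...`, that the per-task shims show up as `nomad executor ...`, and that sending SIGTERM to a shim is an acceptable way to stop it. All function names are hypothetical, and whether terminating the shim also stops the task it supervises may depend on the driver.

```python
import os
import signal
import subprocess
import time
from typing import List, Tuple

Process = Tuple[int, str]  # (pid, full command line)


def list_processes() -> List[Process]:
    """Snapshot the process table via `ps` (empty headers suppress the header row)."""
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "args="],
        capture_output=True, text=True, check=True,
    ).stdout
    procs = []
    for line in out.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) == 2 and parts[0].isdigit():
            procs.append((int(parts[0]), parts[1]))
    return procs


def _is_nomad(cmdline: str, subcommand: str) -> bool:
    # Assumption: the binary is named "nomad" and the subcommand is argv[1].
    argv = cmdline.split()
    return len(argv) >= 2 and os.path.basename(argv[0]) == "nomad" and argv[1] == subcommand


def is_agent(cmdline: str) -> bool:
    return _is_nomad(cmdline, "agent")


def is_executor(cmdline: str) -> bool:
    return _is_nomad(cmdline, "executor")


def orphaned_executors(procs: List[Process]) -> List[int]:
    """PIDs of executor shims to stop; empty whenever an agent is running."""
    if any(is_agent(cmd) for _, cmd in procs):
        return []
    return [pid for pid, cmd in procs if is_executor(cmd)]


def watch(interval: float = 10.0) -> None:
    """Poll forever and terminate executor shims while no agent is present."""
    while True:
        for pid in orphaned_executors(list_processes()):
            try:
                os.kill(pid, signal.SIGTERM)
            except ProcessLookupError:
                pass  # the process exited between the snapshot and the kill
        time.sleep(interval)
```

Calling `watch()` from a service supervised independently of Nomad would start the loop. How a returning agent reacts to executors that were killed this way is not covered here and would need testing.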

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 17, 2022