Nomad in small clusters #2176

Closed
kak-tus opened this issue Jan 10, 2017 · 5 comments

Comments

@kak-tus

kak-tus commented Jan 10, 2017

I have two Nomad clusters: a production cluster at work with plenty of nodes, and a second, "just for fun" cluster for my private services.

The first cluster, with many nodes, works without any trouble.

The small cluster has only 3 nodes. Each node runs both a server and a client and is placed in a different datacenter (so I have 1 region and 3 DCs in Nomad terminology).
This small cluster does not cope well with network lag.
Leaders are re-elected frequently (every 1-2 days) because of temporary network lag, which by itself is not too bad.
The bad part: after a leader re-election, Nomad begins to restart every job in the cluster.

But I like Nomad and I want to use it in my small cluster.
Here is how I see this problem being fixed:

  1. Maybe it would be enough to add some configurable Raft timeouts.
  2. Or maybe a "restart timeout" option for services would be good, so that after a leader re-election Nomad waits for this timeout before restarting anything (I could set it to 10-30 minutes, which is fine for my use case).
  3. Or maybe an option to block job restarts (so Nomad does not restart jobs after an election, but still restarts them after Vault token regeneration or after a job failure).
@dadgar
Contributor

dadgar commented Jan 10, 2017

Hey @kak-tus,

Nomad will not restart jobs just because of a leader election. What is the ping time between the servers, and can you share the logs of the servers/clients after said leader transition and job restarts?

@kak-tus
Author

kak-tus commented Jan 10, 2017

@dadgar Hm, you are right. As I remember, 0.5.0 was more stable. Maybe something changed in 0.5.1 or 0.5.2, or maybe the network stability changed.

Normal ping between servers: 1.5-2.5 ms.

Aggregated log of the whole cluster (c1, c2, c3 in the log are the nodes):
https://gist.github.com/kak-tus/6b1301572b608e41d68d09b4a676d4b1
At 05:11:17 the network lag begins, and at 05:14:35 the containers begin to restart.

@kak-tus
Author

kak-tus commented Jan 12, 2017

I reverted to 0.5.0 and will watch how the cluster behaves.

@dadgar
Contributor

dadgar commented Jan 31, 2017

@kak-tus I am going to close this, as Nomad does not behave in the way described in the issue. Further, the logs do show large latency between the servers. It may have just been a transient network issue.

@dadgar dadgar closed this as completed Jan 31, 2017
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 16, 2022