Nomad in small clusters #2176

kak-tus · 2017-01-10T14:00:03Z

I have two Nomad clusters: a production cluster at work with plenty of nodes, and a second, "just for fun" cluster for my private services.

The first cluster, with many nodes, works without any trouble.

The small cluster has only 3 nodes. Each node runs both a server and a client, and each node is in a different datacenter (so in Nomad terminology I have 1 region and 3 DCs).
This small cluster does not tolerate network lag well.
The servers re-elect a leader frequently (every 1-2 days) because of temporary network lag, which by itself is not very bad.
The bad part: after a leader re-election, Nomad begins to restart every job in the cluster.

But I like Nomad and I want to use it in my small cluster.
Here is how I see possible fixes for this problem:

Maybe it would be enough to make some Raft timeouts configurable.
Or maybe a "restart timeout" option for services would work: after a leader re-election, Nomad would wait for this timeout before restarting anything (I could set it to 10-30 minutes, which is fine for my use case).
Or maybe an option to block job restarts (so Nomad does not restart jobs after an election, but still restarts them after Vault token regeneration or after a job failure).

dadgar · 2017-01-10T18:37:22Z

Hey @kak-tus,

Nomad will not restart jobs just because of a leader election. What is the ping time between the servers? Can you share the logs of the servers and clients from around that leader transition and the job restarts?

kak-tus · 2017-01-10T19:05:02Z

@dadgar Hm, you are right. As I remember, 0.5.0 was more stable. Maybe something changed in 0.5.1 or 0.5.2, but it may also be that the network stability changed.

Normal ping between the servers is 1.5-2.5 ms.

Aggregated log of the whole cluster (c1, c2, c3 in the log are the nodes):
https://gist.github.com/kak-tus/6b1301572b608e41d68d09b4a676d4b1
At 05:11:17 the network lag begins, and at 05:14:35 containers begin to restart.

kak-tus · 2017-01-12T13:36:53Z

I reverted to 0.5.0 and will watch how the cluster behaves.

dadgar · 2017-01-31T19:16:59Z

@kak-tus I am going to close this, as Nomad does not behave in the way described in the issue. Further, the logs do show large latency between the servers. It may have just been a transient network issue.

github-actions · 2022-12-16T02:12:29Z

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

dadgar added the stage/waiting-reply label Jan 10, 2017

kak-tus mentioned this issue Jan 12, 2017

Kill Allocations when client is disconnected from servers #2185 (Closed)

dadgar closed this as completed Jan 31, 2017

github-actions bot locked as resolved and limited conversation to collaborators Dec 16, 2022
