server: revisit cluster initialization #5974
Can't we do something with the persistent storage? A node knows whether it has already been initialized.
-- Tobias
Persistent storage will not always be an option, not when its implementation is much slower than local SSDs. In Terraform, we currently only pass a single join host, mostly because the Terraform functions are a bit lacking. In general, though, flag discrepancies between nodes are poorly supported, although now that some of our flags can be set from environment variables it may not be quite as bad. Still, I think we need a better story here.
Yeah, I'm also not entirely sure how we ideally want to run the thing. Centralized storage can be a real pain, and maybe using node-local storage and never re-using it (i.e. starting new nodes as opposed to restarting old ones) is the way to go for these containerized cloud deployments? See also #5967. For Terraform we should be in the clear though, right? If we manage to pass one host, that's also enough, as long as we don't pass an empty list to a newly minted node when there are other nodes out there (though ideally of course we'd pass all we know about at the time of creation). Is that really difficult to do?
The first node should be started without --join. I think the right way to set up a cluster in a Kubernetes-style environment would be to start one node with no --join flag and let it bootstrap the cluster. Alternately, we could start all the nodes with --join and provide some explicit way to trigger bootstrapping on one of them.
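Roughly, the first approach with today's flags would look like this (insecure mode; node1 through node3 are placeholder hostnames, each running on its own machine):

```sh
# Bootstrap node: no --join flag, so it initializes a new cluster.
cockroach start --insecure --host=node1

# Remaining nodes point at the bootstrapped node (or at the full node list).
cockroach start --insecure --host=node2 --join=node1:26257
cockroach start --insecure --host=node3 --join=node1:26257
```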
@bdarnell In Kubernetes, replication controllers can "adopt" an existing pod. So one way to accomplish the above is to start up the bootstrap pod manually, then start a replication controller to restart the pod if it dies, but add the --join flag.
@mberhault, @a-robinson Is there anything we should consider changing here before 1.0?
I think we should. Either a "marker file" or an endpoint to trigger the bootstrap process on a single node would be good; varying arguments tend to be a pain.
I'd also love for us to improve this before 1.0, but haven't taken the time to investigate other options yet. I wouldn't block 1.0 on it, but I would try pretty hard to squeeze it in.
I'd also like to prioritize this for 1.0; the current approach is really awkward for non-toy deployments.
Given that we're not going to be able to remove the current behavior of no --join meaning "bootstrap a new cluster", what is the concrete suggestion here?
The concrete suggestion is #14251. I'm very dissatisfied with the current approach and will try to get this in for 1.0.
This is unfortunately trickier than it looks: a node doesn't bind any of its ports until it has either bootstrapped or talked to a node that has. This prevents us from offering an init RPC, because the server won't be listening to receive it. We'd need to:
A) start listening on the RPC port immediately;
B) ensure that operations other than the init/bootstrap RPC behave reasonably (block? fail cleanly?);
C) do something about clients that assume the server is ready to serve as soon as it starts listening on its network port (including, in particular, haproxy in our recommended configuration).
This is definitely not happening for tomorrow's beta, and I think it's going to need to wait for 1.1. Possible workarounds:
- Use a unix signal like SIGUSR1 instead of an RPC (is this feasible in the kinds of deployment environments the init RPC is intended to help?).
- Use a separate dedicated port for the init RPC, so clusters intending to use this feature would pass --join=node1:26257,node2:26257 --init-port=9999 and pass that same port to the init client (sketched below). It's clunky, but it avoids our major concern about the first node being started with different flags at different times.
- Start the HTTP server early but not the gRPC/pgwire server, and do the init command over HTTP. This is probably a good idea in its own right (to let the admin UI be served from nodes that are unable to fully initialize themselves and provide diagnostic info), so this may be a better idea than re-ordering startup for the main port (although it would also prevent us from ever merging the HTTP port back into the common port).
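As a sketch of the second workaround's UX only (the --init-port flag and the init subcommand shown here are hypothetical, not shipped syntax):

```sh
# Hypothetical dedicated-port workaround; every node gets identical flags,
# including the extra init port, and none of them bootstraps on its own.
cockroach start --insecure --join=node1:26257,node2:26257 --init-port=9999

# Run once, against any node's init port, to bootstrap the cluster.
cockroach init --insecure --host=node1:9999
```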
Another option is to punt entirely. There are issues with the current startup process, but the alternatives aren't issue-free and represent non-trivial changes, both from an engineering perspective and as a somewhat surprising UX change for existing users.
Done in #16371.
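For reference, the flow this enables looks roughly like the following (insecure mode; node1 through node3 are placeholder hostnames, each on its own machine): every node starts with the same --join list, and a one-time cockroach init call bootstraps the cluster.

```sh
# All nodes get identical flags; none of them bootstraps on its own.
cockroach start --insecure --host=node1 --join=node1:26257,node2:26257,node3:26257
cockroach start --insecure --host=node2 --join=node1:26257,node2:26257,node3:26257
cockroach start --insecure --host=node3 --join=node1:26257,node2:26257,node3:26257

# Run once, against any node, to initialize the cluster.
cockroach init --insecure --host=node1
```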
To initialize a cluster, we need to start a node with an empty --join. While this is fine in manual deployments, anything that uses automated tools (e.g. supervisor) or containers (Kubernetes) will have a hard time due to the different flags for each node.
If we manage to hack the configuration system into letting us specify an empty --join on the first node only, we still run into issues if that node gets recreated: in that case, it will be initializing a new cluster. We have a few possibilities, which fall into two categories: