
server: revisit cluster initialization #5974

Closed
mberhault opened this issue Apr 11, 2016 · 14 comments

@mberhault (Contributor)
To initialize a cluster, we need to start one node with an empty --join flag.
While this is fine in manual deployments, anything that uses automated tools (e.g. supervisor) or containers (Kubernetes) will have a hard time due to the different flags for each node.

If we manage to hack the configuration system into letting us specify an empty --join on the first node only, we still run into issues if that node gets recreated: in that case, it will initialize a brand-new cluster.

We have a few possibilities, which fall into two categories:

  • Provide a previously generated blob that tells a node which cluster to join. We still need some node to create the first range.
  • Similar to the above, but the blob is stored externally. This would add a dependency. Bleh!
  • Issue an RPC to a node telling it to initialize. This would require starting the server early if we're not already initialized, and we would have to investigate what happens when the RPC is called on multiple nodes.
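The startup decision being debated here can be sketched as a small pure function (this is illustrative pseudologic with hypothetical names, not CockroachDB's actual code):

```go
package main

import "fmt"

// startupAction sketches what a node might do at startup given its
// local state and --join flag. The dangerous case is a recreated node
// with no local state and an empty --join: it bootstraps a new cluster.
func startupAction(hasLocalState bool, joinAddrs []string) string {
	switch {
	case hasLocalState:
		// Already part of a cluster: rejoin it, regardless of --join.
		return "rejoin existing cluster"
	case len(joinAddrs) > 0:
		// Fresh node with peers to contact: join their cluster.
		return "join via peers"
	default:
		// Fresh node, empty --join: bootstrap a brand-new cluster.
		return "bootstrap new cluster"
	}
}

func main() {
	fmt.Println(startupAction(false, nil))
	fmt.Println(startupAction(false, []string{"node1:26257"}))
	fmt.Println(startupAction(true, nil))
}
```

The flag-discrepancy problem is visible here: only the very first invocation of the very first node should ever hit the "bootstrap new cluster" branch, yet the flags alone can't distinguish that case from a recreated node.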
@tbg (Member)

tbg commented Apr 11, 2016

Can't we do something with the persistent storage? A node knows whether it's initialized or not. All we need is to be able to start and stop a first "special" pod, and then everything is symmetric. The --join flag should only decide on bootstrapping if there is no local state. So, assuming we can somehow seed the cluster, we can pass the same join list to everyone. I'm not familiar enough with Kubernetes; maybe this isn't how things can work.

For Terraform, we don't seem to have an issue: if you regenerate the first node, you presumably have other nodes set up, and those would be passed to the newly generated node's join flag?


@mberhault (Contributor, Author)

Persistent storage will not always be an option, especially when its implementation is much slower than local SSDs. Some deployments may have it, but many definitely don't, and it's certainly not a given on bare metal.

In Terraform, we currently pass only a single join host. This is mostly due to Terraform's functions being a bit lacking.

In general, though, flag discrepancies between nodes are poorly supported. Now that some of our flags can be set from environment variables, it may not be quite as bad, but I still think we need a better story here.

@tbg (Member)

tbg commented Apr 11, 2016

Yeah, I'm also not entirely sure how we ideally want to run the thing. Centralized storage can be a real pain and maybe using node-local storage and never re-using it (i.e. starting new nodes as opposed to restarting old ones) is the way to go for these containerized cloud deployments? See also #5967.

For Terraform we should be in the clear though, right? If we manage to pass one host, that's enough, as long as we don't pass an empty list to a newly minted node when there are other nodes out there (though ideally, of course, we'd pass all the nodes we know about at the time of creation). Is that really difficult to do?

@bdarnell (Contributor)

The first node should be started without --join the first time, but it should generally have a --join value on restarts so that A) it will never try to re-bootstrap itself even if its storage has gone missing and B) it will be able to rejoin the cluster even if IP addresses have changed.

I think the right way to set up a cluster in a Kubernetes-style environment would be to start one node with no --join (and persistent storage), then immediately take it down and start all the nodes with the real --join value. As I said in #4027, we could add an option like --stop-after-bootstrap to start to take the guesswork out of taking the node down.

Alternately, we could start all the nodes with --join from the beginning. The cluster would be stuck and unable to start, but we could accept a special Bootstrap RPC in this state so that the cluster would be bootstrapped on the target node. Then you'd just start one set of processes with Kubernetes and run the cockroach client locally to trigger the bootstrap.

@petermattis (Collaborator)

@bdarnell In Kubernetes, replication controllers can "adopt" an existing pod. So one way to accomplish the above is to start the bootstrap pod manually, then start a replication controller that restarts the pod if it dies but adds the --join flag to the new invocation. Every other node would have an associated replication controller that used the --join flag from the get-go.

@mberhault mberhault changed the title revisit cluster initialization server: revisit cluster initialization Apr 11, 2016
@petermattis petermattis added this to the Later milestone Feb 22, 2017
@petermattis (Collaborator)

@mberhault, @a-robinson Is there anything we should consider changing here before 1.0?

@mberhault (Contributor, Author)

I think we should. Either a "marker file" or an endpoint to trigger the bootstrap process on a single node would be good; varying arguments between nodes tend to be a pain.

@petermattis petermattis modified the milestones: 1.0, Later Feb 22, 2017
@a-robinson (Contributor)

I'd also love for us to improve this before 1.0, but haven't taken the time to investigate other options yet. I wouldn't block 1.0 on it but would try pretty hard to squeeze it in.

@bdarnell (Contributor)

I'd also like to prioritize this for 1.0; the current approach is really awkward for non-toy deployments.

@spencerkimball (Member)

Given that we're not going to be able to remove the current behavior (no --join flag on an uninitialized node causes a cluster init), I'm unconvinced this is going to make it into 1.0. Does anyone have a concrete suggestion, or just general dissatisfaction with the current approach?

@bdarnell (Contributor)

The concrete suggestion is #14251. I'm very dissatisfied with the current approach and will try to get this in for 1.0.

@spencerkimball spencerkimball assigned bdarnell and unassigned mberhault Mar 30, 2017
@bdarnell (Contributor)

This is unfortunately trickier than it looks: a node doesn't bind any of its ports until it has either bootstrapped or talked to a node that has. This prevents us from offering an init RPC, because the server won't be listening to receive it. We'd need to A) start listening on the RPC port immediately, B) ensure that operations other than the init/bootstrap RPC behave reasonably (block? fail cleanly?), and C) do something about clients that assume the server is ready to serve as soon as it starts listening on its network port (including, in particular, haproxy in our recommended configuration). This is definitely not happening for tomorrow's beta, and I think it's going to need to wait for 1.1.

Possible workarounds:

  • Use a Unix signal like SIGUSR1 instead of an RPC (is this feasible in the kinds of deployment environments the init RPC is intended to help?).
  • Use a separate dedicated port for the init RPC, so clusters intending to use this feature would pass --join=node1:26257,node2:26257 --init-port=9999 and pass that same port to the init client. It's clunky, but it avoids our major concern about the first node being started with different flags at different times.
  • Start the HTTP server early but not the gRPC/pgwire server, and do the init command over HTTP. This is probably a good idea in its own right (to let the admin UI be served from nodes that are unable to fully initialize themselves, and to provide diagnostic info), so it may be a better idea than reordering startup for the main port (although it would also prevent us from ever merging the HTTP port back into the common port).

@spencerkimball (Member)

spencerkimball commented Apr 19, 2017 via email

@bdarnell bdarnell modified the milestones: 1.1, 1.0 Apr 19, 2017
@bdarnell (Contributor)

bdarnell commented Aug 7, 2017

Done in #16371

@bdarnell bdarnell closed this as completed Aug 7, 2017