
server: revisit cluster initialization #5974

Closed
mberhault opened this issue Apr 11, 2016 · 14 comments

@mberhault (Contributor)
To initialize a cluster, we need to start one node with an empty --join flag.
While this is fine in manual deployments, anything that uses automated tools (e.g. supervisor) or containers (Kubernetes) will have a hard time due to the different flags for each node.

If we manage to hack the configuration system into letting us specify an empty --join on the first node only, we still run into issues if that node gets recreated: in that case, it will initialize a brand-new cluster.

We have a few possibilities, which fall into two categories:

  • Provide a previously generated blob that tells a node which cluster to join. We still need some node to create the first range.
  • Similar to the above, but the blob is stored externally. This would add a dependency. Bleh!
  • Issue an RPC to a node telling it to initialize. This would require starting the server early if we're not already initialized, and we would have to investigate what happens when the RPC is called on multiple nodes.
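The startup decision being debated here can be sketched as a small pure function (this is illustrative pseudologic with hypothetical names, not CockroachDB's actual code):

```go
package main

import "fmt"

// startupAction sketches what a node might do at startup given its
// local state and --join flag. The dangerous case is a recreated node
// with no local state and an empty --join: it bootstraps a new cluster.
func startupAction(hasLocalState bool, joinAddrs []string) string {
	switch {
	case hasLocalState:
		// Already part of a cluster: rejoin it, regardless of --join.
		return "rejoin existing cluster"
	case len(joinAddrs) > 0:
		// Fresh node with peers to contact: join their cluster.
		return "join via peers"
	default:
		// Fresh node, empty --join: bootstrap a brand-new cluster.
		return "bootstrap new cluster"
	}
}

func main() {
	fmt.Println(startupAction(false, nil))
	fmt.Println(startupAction(false, []string{"node1:26257"}))
	fmt.Println(startupAction(true, nil))
}
```

The flag-discrepancy problem is visible here: only the very first invocation of the very first node should ever hit the "bootstrap new cluster" branch, yet the flags alone can't distinguish that case from a recreated node.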
@tbg (Member)

tbg commented Apr 11, 2016

Can't we do something with the persistent storage? A node knows whether it's initialized or not. All we need is to be able to start and stop a first "special" pod, and then everything is symmetric. The --join flag should only decide on bootstrapping if there is no local state. So, assuming we can somehow seed the cluster, we can pass the same join list to everyone. I'm not familiar enough with Kubernetes; maybe this isn't how things can work.

For Terraform, we don't seem to have an issue: if you regenerate the first node, you presumably have other nodes set up, and those would be passed to the newly generated node's join flag?


@mberhault (Contributor, Author)

Persistent storage will not always be an option, especially when its implementation is much slower than local SSDs. Some deployments may have it, but many definitely don't, and it's certainly not a given on bare metal.

In Terraform, we currently pass only a single join host. This is mostly due to Terraform's functions being a bit lacking.

In general, though, flag discrepancies between nodes are poorly supported. Now that some of our flags can be set from environment variables, it may not be quite as bad, but I still think we need a better story here.

@tbg (Member)

tbg commented Apr 11, 2016

Yeah, I'm also not entirely sure how we ideally want to run the thing. Centralized storage can be a real pain and maybe using node-local storage and never re-using it (i.e. starting new nodes as opposed to restarting old ones) is the way to go for these containerized cloud deployments? See also #5967.

For Terraform we should be in the clear though, right? If we manage to pass one host, that's enough, as long as we don't pass an empty list to a newly minted node when there are other nodes out there (though ideally, of course, we'd pass all the nodes we know about at the time of creation). Is that really difficult to do?

@bdarnell (Contributor)

The first node should be started without --join the first time, but it should generally have a --join value on restarts so that A) it will never try to re-bootstrap itself even if its storage has gone missing and B) it will be able to rejoin the cluster even if IP addresses have changed.

I think the right way to set up a cluster in a Kubernetes-style environment would be to start one node with no --join (and persistent storage), then immediately take it down and start all the nodes with the real --join value. As I said in #4027, we could add an option like --stop-after-bootstrap to start to take the guesswork out of taking the node down.

Alternately, we could start all the nodes with --join from the beginning. The cluster would be stuck and unable to start, but we could accept a special Bootstrap RPC in this state so that the cluster would be bootstrapped on the target node. Then you'd just start one set of processes with Kubernetes and run the cockroach client locally to trigger the bootstrap.

@petermattis (Collaborator)

@bdarnell In Kubernetes, replication controllers can "adopt" an existing pod. So one way to accomplish the above is to start the bootstrap pod manually, then start a replication controller that restarts the pod if it dies but adds the --join flag to the new invocation. Every other node would have an associated replication controller that used the --join flag from the get-go.

@mberhault mberhault changed the title revisit cluster initialization server: revisit cluster initialization Apr 11, 2016
@petermattis petermattis added this to the Later milestone Feb 22, 2017
@petermattis (Collaborator)

@mberhault, @a-robinson Is there anything we should consider changing here before 1.0?

@mberhault (Contributor, Author)

I think we should. Either a "marker file" or an endpoint to trigger the bootstrap process on a single node would be good; varying arguments between nodes tend to be a pain.

@petermattis petermattis modified the milestones: 1.0, Later Feb 22, 2017
@a-robinson (Contributor)

I'd also love for us to improve this before 1.0, but haven't taken the time to investigate other options yet. I wouldn't block 1.0 on it but would try pretty hard to squeeze it in.

@bdarnell (Contributor)

I'd also like to prioritize this for 1.0; the current approach is really awkward for non-toy deployments.

@spencerkimball (Member)

Given that we're not going to be able to remove the current behavior (no --join flag on an uninitialized node causes a cluster init), I'm unconvinced this is going to make it into 1.0. Does anyone have a concrete suggestion, or just general dissatisfaction with the current approach?

@bdarnell (Contributor)

The concrete suggestion is #14251. I'm very dissatisfied with the current approach and will try to get this in for 1.0.

@spencerkimball spencerkimball assigned bdarnell and unassigned mberhault Mar 30, 2017
@bdarnell (Contributor)

This is unfortunately trickier than it looks: a node doesn't bind any of its ports until it has either bootstrapped or talked to a node that has. This prevents us from offering an init RPC, because the server won't be listening to receive it. We'd need to A) start listening on the RPC port immediately, B) ensure that operations other than the init/bootstrap RPC behave reasonably (block? fail cleanly?), and C) do something about clients that assume the server is ready to serve as soon as it starts listening on its network port (including, in particular, haproxy in our recommended configuration). This is definitely not happening for tomorrow's beta, and I think it's going to need to wait for 1.1.

Possible workarounds:

  • Use a Unix signal like SIGUSR1 instead of an RPC (is this feasible in the kinds of deployment environments the init RPC is intended to help?).
  • Use a separate dedicated port for the init RPC, so clusters intending to use this feature would pass --join=node1:26257,node2:26257 --init-port=9999 and pass that same port to the init client. It's clunky, but it avoids our major concern about the first node being started with different flags at different times.
  • Start the HTTP server early but not the gRPC/pgwire server, and do the init command over HTTP. This is probably a good idea in its own right (to let the admin UI be served from nodes that are unable to fully initialize themselves, and to provide diagnostic info), so it may be a better idea than reordering startup for the main port (although it would also prevent us from ever merging the HTTP port back into the common port).

@spencerkimball (Member)

spencerkimball commented Apr 19, 2017 via email

@bdarnell bdarnell modified the milestones: 1.1, 1.0 Apr 19, 2017
@bdarnell (Contributor)

bdarnell commented Aug 7, 2017

Done in #16371

@bdarnell bdarnell closed this as completed Aug 7, 2017