-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
server,cli: Support Init Command #16371
server,cli: Support Init Command #16371
Conversation
e528993
to
809b910
Compare
Reviewed 2 of 2 files at r1, 6 of 6 files at r7. pkg/cli/flags.go, line 277 at r7 (raw file):
Why? I think pkg/cli/init.go, line 24 at r7 (raw file):
Delete before merging. pkg/cli/init.go, line 47 at r7 (raw file):
We probably want to remove these prints and logs before merging. pkg/server/init.go, line 33 at r7 (raw file):
s/mux/mu/ If the mutex only protects
But if it's held for the entire duration of pkg/server/init.go, line 42 at r7 (raw file):
If we're sharing the grpcServer with the rest of the Server, how can we be confident that the other services haven't been registered yet? The initServer could also remain after the initialization phase, which I'd prefer to avoid. Can we pass in an RPCContext and make a new grpc.Server instead? pkg/server/init.go, line 51 at r7 (raw file):
Remove these braces, they don't mean anything (in Go, braces define variable scopes but not pkg/server/init.go, line 53 at r7 (raw file):
We should probably set waiting to false when this method exits (in a deferred function), just as an extra sanity check. pkg/server/server.go, line 542 at r7 (raw file):
Ah, this is a clever approach to the problem. I'd rather not expose the initServer at all if we're already bootstrapped, though. That would mean keeping pkg/server/server.go, line 599 at r7 (raw file):
Yeah, it seems like we should close all of them or none of them here. I'm not sure why we only close this one. pkg/server/server.go, line 792 at r7 (raw file):
This could be pkg/server/serverpb/init.proto, line 1 at r7 (raw file):
s/2016/2017/ Comments from Reviewable |
809b910
to
0113003
Compare
I'm taking over this PR; I've added some tests and done a little cleaning up. There's still a little more I want to do (mainly figure out why the acceptance tests fail if I run the bootstrap against a node other than zero), but I think it's pretty much ready to go. |
Review status: 1 of 16 files reviewed at latest revision, 14 unresolved discussions, some commit checks failed. pkg/cli/init.go, line 30 at r10 (raw file):
Reminder to fill these in. pkg/cli/init.go, line 44 at r10 (raw file):
Maybe tighten this up a small bit:
pkg/server/server.go, line 581 at r10 (raw file):
Comments from Reviewable |
dadba7b
to
ee3bf84
Compare
The reason this was failing is that nodes don't start their gossip server until they're initialized. I've made an attempt to changes this, but it runs into difficulties at the RPC layer (need to stop and restart the rpc server to switch from init to regular mode, and make sure that all the long-lived gossip connections either stay connected or stop and restart cleanly) and inside gossip (the server side of gossip has some assumptions that the cluster has been initialized before the server starts) So we can either I'm leaning towards option A. @a-robinson, would that interfere with the use of the |
Option A is likely the easiest, but it feels a bit unsatisfactory. I'm a little out of date with regards to the node initialization sequence. Can you expand what is going wrong with sending the |
Can you clarify what you mean by "nodes that are join targets, instead of an arbitrary node"? Does this mean that your join configuration effectively needs to be a DAG where all nodes can reach the node that you send the |
In the current implementation, a node is either serving the init method or the regular rpc interface, but not both at the same time. When node zero (the join target) is initialized, it switches to the regular rpc interface and the other nodes can join it. When another node is initialized, it tries to join node zero, but it can't because the gossip interface is not active yet. Option B is to enable the gossip interface earlier (and deal with the issues that arise because we have not previously had an active gossip server with an unknown cluster ID); option C is to handle this by extending the initialization rpc interface.
Not just a DAG. In fact adding a cycle is one way to fix the problem. If for example all your nodes point to nodes 1, 2 and 3 as join targets, then you must The status quo has similar constraints on the join graph: every node must be able to reach the node that was started without a join flag. The new system requires that every node be able to reach the |
Ah, of course, it needs to be fully connected, not acyclic (as in #13027). Option A does feel a bit half-hearted, but it's just imposing a restriction that we could later open up to improve usability. This is how it'll effect each orchestration tool:
I assume that for option C what you're proposing is that when a node gets initialized it will effectively forward on its new cluster ID via an RPC on the |
Yes, that's the idea. It's a more isolated solution than option B, but there's still a little plumbing complexity to use the gossip bootstrap targets in this new context (and it gets tricky if the join target is a load balancer or DNS alias, although now that I think about it those cases would probably also have difficulty forming a fully connected graph). |
What happened to the goal from the RFC of verifying that the specified number nodes are connected? That would appear to not work without something a little more sophisticated (like option B). |
We decided to drop that from the RFC because it seemed like unnecessary complexity. |
That's what I thought, then I got confused by seeing it still in the summary of the RFC. I'll change it.
Is that so? Would a node that needs to be initialized have any bootstrap targets other than what was specified on the command line? Anyways, I take it this is ready to review? If so I'll take a look at it within the next day or so. I'd like to see option C get implemented if we have time, but it seems like it'll just be additive so discussing it shouldn't block reviewing option A if that's ready. |
Just the ones on the command line. It's just a little annoying to use them because we have to work outside the regular connection pool (which wants heartbeats to work, etc). Anyway, yes, this is ready to review. We can implement option C later if we have time and/or it seems to be useful. |
Nice, this turned out very clean! Reviewed 1 of 2 files at r1, 1 of 7 files at r6, 2 of 8 files at r8, 3 of 8 files at r9, 4 of 9 files at r10, 6 of 6 files at r11. pkg/cli/init.go, line 33 at r11 (raw file):
The pkg/server/init.go, line 33 at r7 (raw file): Previously, bdarnell (Ben Darnell) wrote…
It looks like this never got changed, but it should pkg/server/init.go, line 56 at r11 (raw file):
Not really needed/idiomatic pkg/server/init.go, line 94 at r11 (raw file):
Isn't there an inherent race between when we start serving and when we start awaiting bootstrap? It'd be safer to just initialize pkg/server/init.go, line 95 at r11 (raw file):
Since this is going to be an error that new users may see when they're failing to set up their cluster properly, something with a little more explanation would be nice. pkg/server/init.go, line 99 at r11 (raw file):
s/Node/node for consistency with the rest of our log messages pkg/server/node.go, line 333 at r11 (raw file):
This comment isn't really applicable anymore. pkg/server/server.go, line 775 at r11 (raw file):
s/No/no pkg/server/server.go, line 804 at r11 (raw file):
Did you mean to leave this in? The error should already be getting printed out by the caller, shouldn't it? Comments from Reviewable |
ee3bf84
to
6dce56a
Compare
Review status: 13 of 16 files reviewed at latest revision, 22 unresolved discussions, some commit checks failed. pkg/server/init.go, line 33 at r7 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done pkg/server/init.go, line 56 at r11 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done pkg/server/init.go, line 94 at r11 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Renamed the variable to pkg/server/init.go, line 95 at r11 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Changed to "bootstrap called after initialization complete". Better suggestions welcome. pkg/server/init.go, line 99 at r11 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done pkg/server/node.go, line 333 at r11 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Removed pkg/server/server.go, line 581 at r10 (raw file): Previously, petermattis (Peter Mattis) wrote…
Done pkg/server/server.go, line 775 at r11 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done pkg/server/server.go, line 804 at r11 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Removed Comments from Reviewable |
Reviewed 3 of 3 files at r12. pkg/cli/init.go, line 33 at r11 (raw file): Previously, a-robinson (Alex Robinson) wrote…
ping pkg/server/init.go, line 95 at r11 (raw file): Previously, bdarnell (Ben Darnell) wrote…
I might tack a "cluster has already been initialized" on the end, but this works. pkg/server/init.go, line 78 at r12 (raw file):
s/false/true/ Comments from Reviewable |
6dce56a
to
86d5e41
Compare
Review status: 14 of 16 files reviewed at latest revision, 19 unresolved discussions, some commit checks failed. pkg/cli/init.go, line 33 at r11 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done pkg/server/init.go, line 95 at r11 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done. Note that there's probably a lot more work to be done on error messages here. This error is only returned on a very narrow race condition; if init is called on a node that is fully initialized and transitioned to serving mode, the error that comes back is something like a 404 because the init server is not even running. pkg/server/init.go, line 78 at r12 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Oops. Done Comments from Reviewable |
--join flag is passed. Use cmux to initially swallow all connections into the `initL` which is served by the initServer. Only the Init grpc interface is registered so all other connections will get automatically closed. Once cockroach is ready to serve, flip the initL matcher to match nothing, effectively removing the initServer. Add `init` cli command.
86d5e41
to
639bdc0
Compare
init_command
reviewers: @bdarnell, @mberhault