
prod: mechanisms are needed to allow cockroach to be deployed in Kubernetes #5967

Closed
linuxerwang opened this issue Apr 10, 2016 · 30 comments

@linuxerwang

I tried to deploy cockroach in Kubernetes (k8s) this weekend, but failed miserably.

The goal: use k8s' replication controller to establish a self-recoverable cockroach cluster.

First of all, I set up an NFS persistent volume (PV) for each cockroach instance. Since k8s currently has no ability to automatically allocate a persistent volume to a specific pod, I had to create a distinct PV and replication controller (with replicas = 1) for each instance.

Then I created a k8s service for the first instance. So far, it worked fine: I could access the admin UI and manipulate the database from outside of k8s.

Problems now emerge:

  1. In k8s each pod has its own VIP. When a pod dies for any reason, the replication controller creates a new pod with a different VIP. The new cockroach instance then hangs because the data stored in NFS tells it there is another initial instance.
  2. To add the 2nd, 3rd, ... instances, each of them needs to know the initial instance's IP address to join. Since the VIP of the initial instance changes every time it is recreated, it's impossible to set this up correctly.

So new mechanisms are needed to create cockroach cluster. A few alternative solutions:

  1. Create a standalone registry program and let cockroach instances register with it on startup. The program can hide behind a service. The program can even have a few replicas, and the service can load-balance requests to them. Make it so that when an instance dies its registration expires in the registry; thus the cluster stays dynamic.
  2. Make the other instances join the initial one via a domain name instead of an IP, so the service name can be stably used as the join target.
  3. Let cockroach identify instances not by their IPs, but by a user-provided instance ID.

Solution #1 involves much more work but is easier for users to set up. #2 and #3 are workarounds: easier to implement but somewhat harder to set up.
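Solution #2 could be sketched as a plain Kubernetes Service whose stable DNS name serves as the join target (all names and values below are hypothetical, not from a tested config):

```yaml
# Hypothetical Service: the name "cockroach" becomes a stable DNS entry,
# usable as --join=cockroach:26257 no matter which pod VIPs back it.
apiVersion: v1
kind: Service
metadata:
  name: cockroach
spec:
  selector:
    db: cockroach        # assumed label set on every cockroach pod
  ports:
  - port: 26257
    targetPort: 26257
```

Because the name resolves regardless of the current pod IPs, a recreated pod could rejoin without knowing the initial instance's VIP.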

@petermattis
Collaborator

@linuxerwang I experimented with setting CockroachDB up on kubernetes about a week ago.

My understanding is that a replication controller is useful if you want to dynamically adjust the number of replicas. Since you need a one-to-one mapping from the pod to the persistent volume, I don't see what benefit the replication controller is providing. I'm pretty sure pods restart automatically without a replication controller, though please correct me if I'm mistaken about that.

In my kubernetes experiment, I used the pod's IP as the address to bind on (i.e. the argument to the --host flag). This works fine as the IP address, while not stable across pod restarts, is also not stored in the range addressing data. The range addressing data contains node IDs which are translated to IP addresses using info that is gossiped. The one place we do store the address is in the gossip info, but I think that we maintain a map from node ID to address there, so even if the IP address for a node changes we should be fine. @spencerkimball can you confirm?

Every pod in my cockroach cluster was tagged with the label db=cockroach. And I created a cockroach service that used db=cockroach as the selector. The result was that the hostname cockroach was visible to all pods and would DNS round-robin amongst the pods in the cluster. This allowed me to use --join=cockroach:26257 as the join address.

Hope this helps. Definitely some more work to be done here. It would be nice to figure out how to use a replication controller to manage the cluster size, but that seems to require some ability to dole out persistent volumes. I wonder if the kubernetes folks are working on this. Note that the cassandra example uses an ephemeral volume for data storage.

@linuxerwang
Author

Hi, petermattis

Thank you for your reply.

My understanding is that a replication controller is useful if you want to dynamically adjust the number of replicas.

This is not true. The Kubernetes documentation recommends defining an RC even if it has only one replica, because when a pod dies the RC will automatically create a new one to replace it.

I'm pretty sure pods restart automatically without a replication controller

Not true, as I mentioned above. Without an RC, a pod will not restart automatically.

Your way of creating the cluster is okay in a dev environment, but in production, pods might die and the new pods will have completely different IPs. I also created the cluster manually and tested it by killing the initial pod. What I saw is that the new pod couldn't join the cluster.

I agree that k8s at present lacks the ability to pin PVs to pods (PetSet; yes, they are working on it, probably in the next release). It would be great when we get it. But right now the problem is that cockroach relies on IP addresses to form the cluster. Even if we had PetSet, the problem would still be there.

The problem with --join=cockroach:26257 is that in the end cockroach tries to resolve the name to an actual IP, so it requires that the initial node exists and is accessible.

The cassandra example is far from enough for production use. If a pod dies, it takes a long time for the replacement pod to fetch the data it needs.

Thank you for explaining about the gossip info and range addressing data.

@petermattis
Collaborator

Ah, it looks like pods will restart themselves upon process failure, but a replication controller is needed to restart a pod upon node failure. See http://kubernetes.io/docs/user-guide/pod-states/. So I was kind of right with my statement, but you are more correct in that a replication controller is required.

Yes, using --join=cockroach:26257 does imply that one of the nodes needs to be up. Also note that the first node started in the cluster needs to be started without a --join flag (this is the bootstrap node). In my experimentation, this seemed to work fine: I started the first node without a --join flag, then added additional nodes, all specifying --join=cockroach:26257.
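As a sketch of the two invocation variants described here (the binary path and the --host wiring are assumptions, not taken from a tested manifest), the container commands might look like:

```yaml
# Bootstrap pod: started without --join; performs one-time cluster init.
command: ["/cockroach/cockroach", "start", "--host=$(POD_IP)"]

# Every additional pod: the stable service name is the join target.
command: ["/cockroach/cockroach", "start", "--host=$(POD_IP)", "--join=cockroach:26257"]
```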

@linuxerwang
Author

@petermattis: I guess you were trying to say that processes in a pod will restart automatically if they fail. That's true. Pods use liveness probes to detect whether a process in a container has terminated and restart it automatically. If a pod itself is dead, it is the job/RC/daemon-set's responsibility to create a replacement pod.

@tbg
Member

tbg commented Apr 11, 2016

Is it also useful to run CockroachDB without persistent storage (but with node-local storage which is not preserved across pod restarts)? Meaning that when a pod dies, its data dies and the replication controller simply adds a new node (which joins the remaining cluster and recovers its data over the network). If pod death is relatively infrequent, that might be preferable because it lets nodes use their local data without constraining where future incarnations can live, and it removes the compatibility problems above.

@spencerkimball
Member

@petermattis: yes if a node restarts with a new IP address, everything will continue to work fine.

@linuxerwang
Author

@petermattis: I worry about the data recovery time when a new node replaces an old one. Suppose we have 10 nodes and one of them dies because of heavy load; the new node will not be available for quite a long time while its data heals from the other nodes. The other 9 nodes will then have more load to handle, and one of them dies ... in the end, no instance is alive.

Another concern: although pod death is relatively infrequent, there is still a possibility that enough nodes die that parts of the data are lost completely. For enterprise use, I don't want to lose ANY data for that reason.

@bdarnell
Contributor

I agree with @linuxerwang; there is too much risk of multiple failures at the same time to run with no persistent storage. (In the event of a power outage, will kubernetes even try to find the "non-persistent" storage of old jobs, or will everything just be restarted from scratch?)

The first node is started without --join the first time, but in an environment where IPs can change, the --join flag must be added to it when it is restarted. Otherwise it may be disconnected from the cluster if all the IPs have changed, and it won't be able to reconnect until one of the other nodes connects to it through the DNS alias.

@tbg
Member

tbg commented Apr 14, 2016

The first node is started without --join the first time, but in an environment where IPs can change the --join flag must be added to it when it is restarted.

I was picturing the first pod "seeded" with an ephemeral node (back to ./cockroach init, I know) and then started "for real" with the same symmetrical flags all other nodes will get for the lifetime of the cluster (in particular, --join=cockroach:26257 with the DNS service as @petermattis mentioned above). I don't know whether that's something that can work with kubernetes, though.

I agree with @linuxerwang; there is too much risk of multiple failures at the same time to run with no persistent storage

Ok, but the persistent storage options are pretty bad (NAS, NFS, GlusterFS, ...), and it's pretty insane (?) to run Cockroach on any of them.
What's needed could be something like a best-effort persistent volume: try your best to colocate a Pod with one of the available best-effort volumes from a pool; if that doesn't pan out, add a new volume to the pool and start the Pod colocated with it (the volumes here are just local storage).

None of these problems are Cockroach-specific. They're issues Kubernetes (and, as far as I remember, all the others) needs to solve. The "sane" way of running under the constraints of a) using local storage and b) sticking to what the API gives us is, I think, what I suggested above (until upstream gets there), even though at the moment it won't survive a complete blackout (or even just all processes dying at the same time, which also loses the storage).

(In the event of a power outage will kubernetes even try to find the "non-persistent" storage of old jobs or will everything just be restarted from scratch?)

I think it'll all be gone. Pods don't survive an outage (you'll get a new Pod from the ReplicationController).

@bdarnell
Contributor

Ok, but the persistent storage options are pretty bad (NAS, NFS, GlusterFS, ...), and it's pretty insane (?) to run Cockroach on any of them.
What's needed could be something like best-effort persistent volume

Yeah, this is the problem. The container deployment systems are designed primarily for stateless services, with all persistent state managed separately (e.g. GFS didn't run on borg; it was a peer to borg). And where they're adding persistence support, they're aiming to support naive unreplicated services by fully replicating the persistent filesystem. Some sort of in-between, best-effort persistence is needed for any of these container platforms to be a recommendable option for running cockroachdb, but I haven't seen any signs that anything like this is in the works.

@tbg
Member

tbg commented Apr 14, 2016

There is flocker, and we should keep an eye out (probably even chime in) for the discussions in kubernetes/kubernetes#7562 (which links in kubernetes/kubernetes#598) and also those on other databases which will have the same problems, for example kubernetes/kubernetes#24030.

@petermattis
Collaborator

@tschottdorf FYI, Spanner/BigTable run on top of clustered storage (GFS/Colossus). Google Persistent Disks (and presumably Amazon EBS) add an extra layer of overhead, but we could avoid that if we went directly to the underlying data store (Google Cloud Storage or Amazon S3). I'm not saying we should do this, but we could.

@bprashanth

I haven't had a chance to play around with cockroach yet but plan to make it work with petset. If someone more familiar with deploying cockroach (@linuxerwang?) could write up something like kubernetes/kubernetes#23828, it would be great.

I'm looking more for deployment/scaling/failover patterns shared across databases than for how you might deploy the database onto kube given today's idioms.

So new mechanisms are needed to create cockroach cluster. A few alternative solutions:

Petset should provide identity: kubernetes/kubernetes#24030 (comment)

@linuxerwang
Author

Hi, @bprashanth

I am probably not the best fit for this task: a) I am not on the cockroach team, and b) for the next few quarters I cannot dedicate enough time to it.

@petermattis has been working on k8s with cockroach and has made good progress. He would be a better choice as the owner, and I could assist him.

@petermattis
Collaborator

@bprashanth CockroachDB is designed to be simple to deploy and scale. To initialize a cluster you need to bring up a single "bootstrap" node. This node performs one time initialization such as the allocation of the cluster UUID and initialization of some system metadata. Additional nodes are added to the system by pointing them to an existing node in the cluster (e.g. the bootstrap node) using the --join flag. Once joined to the cluster, a node persists the list of other nodes in the cluster, so that disappearance of the node it initially talked to does not cause any problems.

When a node joins the cluster it is internally allocated a node ID from a monotonically increasing sequence that starts at 1. These node IDs are used internally for all addressing. When a node joins the cluster it broadcasts its node ID (which is stored on persistent disk) and IP address to the rest of the nodes. Cockroach should only have a minor hiccup if the IP address for a node changes.

The nodes in a cluster actively monitor each other to determine if a node dies "permanently" at which point recovery of the data from other nodes in the cluster begins. Currently, that recovery kicks off when a node has been unresponsive for 5 minutes.

I'm happy to answer any questions you might have. I think CockroachDB has some of the simplest deployment characteristics for a distributed storage system, yet there is still an impedance mismatch for container engines like kubernetes which expect identical replicas.

@bprashanth

Thanks! Some high-level questions.

When a node joins the cluster it broadcasts its node ID (which is stored on persistent disk) and IP address to the rest of the nodes. Cockroach should only have a minor hiccup if the IP address for a node changes.

Does it accept DNS/hostnames instead of IPs? Providing stable IPs is significantly harder.

Additional nodes are added to the system by pointing them to an existing node in the cluster (e.g. the bootstrap node) using the --join flag.

What happens if I just bring all the nodes up with --join=first-node, but the others come up before the first? Will they wait?

The nodes in a cluster actively monitor each other to determine if a node dies "permanently" at which point recovery of the data from other nodes in the cluster begins. Currently, that recovery kicks off when a node has been unresponsive for 5 minutes.

Is it quorum based?

Will I run into problems like the following: in a cluster of 3 where 1 is lagging and 2 are in quorum, I add 2 nodes at the same time, the first 2 die, and the new 2 form a quorum with the lagging node?

Should we always scale one at a time? Is there some way to add a node as a non-voting member first?

Additional nodes are added to the system by pointing them to an existing node in the cluster (e.g. the bootstrap node) using the --join flag.

Do they download state over the network from the one node they're pointed at, or is it more like p2p? (In which case, would that one node take a perf hit? This is common in galera.) I.e., do we need to be careful about rsyncing data offline before joining a node, so it only downloads an incremental snapshot of transactions?

@petermattis
Collaborator

Does it accept DNS/hostnames instead of IPs? Providing stable IPs is significantly harder.

Yes, DNS/hostnames are accepted, but I was making the point that they aren't necessary. If the IP address for a node changes when the node restarts (I hope it can't change while a node is alive), then the system will quickly start using the new address.

What happens if I just bring all the nodes up with --join=first-node, but the others come up before the first? Will they wait?

Yes, the other nodes will wait.

Is it quorum based?

Yes, the system is quorum based. You can configure the number of replicas, with 3 being the default. Note that the number of replicas is different from the number of nodes in the cluster. For example, you can have a 10-node cluster and 3-way replication; adding an additional node doesn't affect the amount of replication. Also note that replication occurs at a granularity finer than nodes. This has pluses and minuses: the big plus is that when a node goes down, the data it contained has replicas spread throughout the cluster, and the entire cluster can participate in recovery. The downside is that if a quorum of nodes goes down simultaneously, you'll have data unavailability for some fraction of the data in the cluster.

Will I run into problems like: in a cluster of 3 where 1 is lagging and 2 are in quorum, if I add 2 nodes at the same time, and the first 2 die, and the new 2 form quorum with the lagging node?

Yes, that will definitely be a problem. You need the 2 original nodes to remain up until data is replicated to the 2 new nodes.

Should we always scale one at a time? Is there some way to add a node as a non-voting member first?

There is no need to scale one at a time, but there is a need to decommission nodes gracefully. Currently, if you take down any 2 nodes in a cluster simultaneously (no matter what the cluster size), you can expect loss of quorum for some fraction of the data. There are techniques to mitigate this, but we haven't implemented them yet.

Do they download state over the network from the one node they're pointed at, or is it more like p2p?

The system is more p2p, but that isn't quite correct either. The node (or nodes, if DNS round-robin is used) that the --join flag points to is simply used to connect to the cluster; it does not affect how data is replicated. The unit of replication is similar to hbase/cassandra/bigtable, where small (64MB) ranges of data are replicated, not entire nodes. The replicas for these ranges are spread throughout the cluster. When a new node joins, other nodes in the cluster will gradually rebalance part of their data onto the new, less used node.

@therc

therc commented Apr 18, 2016

@bprashanth the quickest way to describe CockroachDB is like D (http://www.pdl.cmu.edu/SDI/2012/101112.html), Colossus and Spanner in one binary. I am not sure if a PetSet is always the best way to run it. If e.g. you want one cluster-wide instance (I'm not sure if that's recommended), it might make more sense as a DaemonSet, perhaps only constrained to specific machine types if the cluster is not homogeneous. Then the admin/user would run the bootstrap node once, as @tschottdorf mentioned, to bring the dormant nodes to life. BTW, even D runs on Borg, although with special hooks.

@linuxerwang
Author

Running cockroach as a DaemonSet is fine for me. But even so, the VIP of the pod might change over time, right?

@smarterclayton

The network identity work Prashanth mentioned will ensure that there is a unique DNS name correlated to a unique disk instance, under either DaemonSet or PetSet. We won't guarantee that two instances never think they have the same name at the same time, but in the presence of a fenced disk (AWS/GCE) or a cluster reconfigurer (which will be able to rely on observing a predictable sequence of cluster membership changes and applying them from the outside), we can at least ensure there is a consistent winner. There is still a concern about joining a split brain where the "first node" is partitioned during cluster initialization, but if we can wait for a quorum we can at least reduce the window.

Right now we're recommending DaemonSets for data-gravity workloads (make local disks pets), while PetSet would be for workloads that can handle network-attached storage.

@borg286

borg286 commented May 20, 2016

Would it work to have a bootstrapping replication controller behind a bootstrapping service with a fixed cluster IP, like skydns does?
https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/dns/skydns-svc.yaml.in#L13
The non-bootstrapping instances would then seed from the bootstrapping service. Calls to this bootstrapping service would either hang or get retried until the bootstrapping pod comes up.
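A minimal sketch of that idea, modeled on the linked skydns manifest (the IP is illustrative and would need to fall inside the cluster's service CIDR; all names are assumptions):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: cockroach-bootstrap
spec:
  clusterIP: 10.0.0.50         # fixed, known in advance; illustrative value
  selector:
    role: cockroach-bootstrap  # assumed label on the bootstrap pod
  ports:
  - port: 26257
```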

@petermattis petermattis changed the title Mechanisms are needed to allow cockroach to be deployed in Kubernetes prod: mechanisms are needed to allow cockroach to be deployed in Kubernetes Jun 29, 2016
@josephjacks

@petermattis since K8s 1.3 has been cut with alpha support for PetSets, perhaps this next month could be a great time to make progress here? I am very interested in this.

@tbg
Member

tbg commented Jul 3, 2016

@josephjacks Thanks for the ping, I just picked this up to play around a bit. Doubt I'll make it all the way, but I'll post some progress here.

@tbg
Member

tbg commented Jul 4, 2016

Here's the promised update - it mostly just works (though it took me a while due to "first real contact" with kubernetes and the usual DNS/networking issues).

See https://github.com/cockroachdb/cockroach/compare/tschottdorf/kubernetes-petset and the rendered README. Nothing fancy (in particular one would want to add health checks and the like), but it works.

As pointed out above, a daemon set might be the way to go due to (presumably) better support for local storage, but I assume some folks will want to run a PetSet regardless.
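For readers unfamiliar with the alpha API: a PetSet manifest in Kubernetes 1.3 had roughly this shape (a minimal sketch with assumed field values; the actual manifests are in the branch linked above):

```yaml
apiVersion: apps/v1alpha1
kind: PetSet
metadata:
  name: cockroachdb
spec:
  serviceName: cockroachdb   # headless Service giving each pod a stable DNS name
  replicas: 3
  template:
    metadata:
      labels:
        app: cockroachdb
    spec:
      containers:
      - name: cockroachdb
        image: cockroachdb/cockroach
        ports:
        - containerPort: 26257
```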

cc @bprashanth. (also re: is this worth landing somewhere?)

@bprashanth

Great! Yes please add it as an example. Open up a pr in the main kubernetes repo under https://github.com/kubernetes/kubernetes/tree/master/examples and tag me.

If you think it's actually stable we can even start e2e testing it, we already do so for a few petsets like zookeeper. Daemonsets will work if you're not actually trying to use containers for the main reason most people use containers (multi-tenancy).

Regarding storage, I assume you ran it on a cloud provider that doesn't have a dynamic provisioner? We're writing provisioners for gluster/ceph that will work on bare metal. If you have more feature requests that'd make spanner easier, please update kubernetes/kubernetes#260 or kubernetes/kubernetes#18016.

@bprashanth

If you have more feature requests that'd make spanner easier please update kubernetes/kubernetes#260 or kubernetes/kubernetes#18016.

bleh, i meant cockroachdb :)

@tbg
Member

tbg commented Jul 4, 2016

Regarding storage, I assume you ran it on a cloudprovider that doesn't have a dynamic provisioner?

I was only able to run it locally via minikube (is there a way to run 1.3 on google container engine?).
Local storage would be nice. But I guess all databasey-looking pieces of software want that, so the thread is probably saturated.

Opened a PR: kubernetes/kubernetes#28446

@tbg
Member

tbg commented Jul 4, 2016

Oh, and e2e tests would be nice. From a CockroachDB perspective, there's really nothing happening here that wouldn't happen in our other tests (maybe some slight differences in the networking layer), so it'd be good to give that a shot (though I'm sure reality will find a way to break things).

@a-robinson a-robinson self-assigned this Aug 22, 2016
k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this issue Sep 10, 2016
Automatic merge from submit-queue

Productionize the cockroachdb example a little more

Includes:
* A service for clients to use
* Readiness/liveness probes
* An extended graceful termination period
* Automatic prometheus monitoring (when prometheus is configured to watch for annotations on services, as in [CoreOS's recent blog post](https://coreos.com/blog/prometheus-and-kubernetes-up-and-running.html), for example)

I'm leaving the management of certs to future work, but if anyone that sees this needs help with them in the meantime, don't hesitate to reach out.

Successor to #28446

@bprashanth - if you're still interested in / open to an e2e test (as mentioned in cockroachdb/cockroach#5967 (comment)), let me know and I'll put one together. If so, I assume you'd want it as part of the `petset` test group rather than the `examples` tests?

cc @tschottdorf 

**Release note**:
```release-note
NONE
```
@a-robinson
Contributor

I think this can now be closed given our docs and blog post.

That's not to say we're done -- I plan to continue improving things (e.g. helm charts, tests, multi-datacenter deployments, etc.), but cockroachdb can definitely be deployed in kubernetes now.

@bprashanth

Nice! thanks @a-robinson
