prod: mechanisms are needed to allow cockroach to be deployed in Kubernetes #5967
@linuxerwang I experimented with setting CockroachDB up on kubernetes about a week ago. My understanding is that a replication controller is useful if you want to dynamically adjust the number of replicas. Since you need a one-to-one mapping from the pod to the persistent volume, I don't see what benefit the replication controller is providing. I'm pretty sure pods restart automatically without a replication controller, though please correct me if I'm mistaken about that. In my kubernetes experiment, I used the pod's IP as the address to bind on (i.e. as the argument to the bind-address flag). Every pod in my cockroach cluster was tagged with a common label. Hope this helps. Definitely some more work to be done here. It would be nice to figure out how to use a replication controller to manage the cluster size, but that seems to require some ability to dole out persistent volumes. I wonder if the kubernetes folks are working on this. Note that the cassandra example uses an ephemeral volume for data storage.
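For illustration, a pod along those lines might look like this minimal sketch (the names, the `app: cockroach` label, the `--insecure`/`--host` flags, and the `hostname -i` trick for discovering the pod IP are my assumptions, not quoted from the setup described above):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cockroach-1
  labels:
    app: cockroach            # common label across all cluster pods
spec:
  containers:
  - name: cockroach
    image: cockroachdb/cockroach
    command:
    - /bin/sh
    - -c
    # Bind on the pod's own IP, discovered at container start.
    - exec /cockroach start --insecure --host=$(hostname -i)
    ports:
    - containerPort: 26257    # inter-node (gRPC) traffic
    - containerPort: 8080     # admin UI
```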
Hi petermattis, thank you for your reply.
This is not true. The Kubernetes documentation recommends defining an RC even if it has only one replica, because when a pod dies the RC will automatically create a new one to replace it.
Not true, as I mentioned above. Without an RC, pods will not restart automatically. Your way of creating the cluster is okay in a dev environment, but in production, pods might die and the replacement pods will have completely different IPs. I also created the cluster manually and tested it by killing the initial pod. What I saw is that the new pod can't join the cluster. I agree that k8s at present lacks the ability to pin a PV to a pod (PetSet — yes, they are working on it, probably in the next release). It would be great if we got it. But right now the problem is that cockroach relies on IP addresses to create the cluster, so even if we had PetSet, the problem would still be there. The problem with --join=cockroach:26257 is that in the end cockroach tries to resolve the name to an actual IP, so it requires the initial node to exist and be accessible. The cassandra example falls far short of production use: if a pod died, it would take a long time for the replacement pod to fetch the data it needs. Thank you for explaining the gossip info and range addressing data.
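For context, a `--join=cockroach:26257` address would typically be backed by a Kubernetes Service named `cockroach`, roughly like this sketch (the name and selector are assumptions); as the comment notes, resolution only helps if some pod behind the name is already up and reachable:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: cockroach        # resolvable in-cluster as "cockroach"
spec:
  clusterIP: None        # headless: DNS returns the pod IPs directly
  selector:
    app: cockroach       # matches the label carried by every cockroach pod
  ports:
  - name: grpc
    port: 26257
```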
Ah, it looks like pods will restart themselves upon process failure, but a replication controller is needed to restart a pod upon node failure. See http://kubernetes.io/docs/user-guide/pod-states/. So I was kind of right with my statement, but you are more correct in that a replication controller is required.
petermattis: I guess you were trying to say that processes in a pod will restart automatically if they fail. That's true. Pods use liveness probes to detect whether a process in a container has terminated, and restart it automatically. If a pod itself is dead, it's the job's/RC's/daemon-set's responsibility to create a replacement pod.
Is it also useful to run CockroachDB without persistent storage (but with node-local storage that is not preserved across pod restarts)? Meaning that when a pod dies, its data dies with it and the replication controller simply adds a new node (which joins the remaining cluster and recovers its data over the network). If pod death is relatively infrequent, that might be preferable because it lets nodes use their local data without constraining where future incarnations can live, and it removes the compatibility problems above.
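A minimal sketch of that ephemeral variant, assuming an `emptyDir` volume for the node-local store (all names and the image are illustrative):

```yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: cockroach
spec:
  replicas: 3                  # the RC replaces any pod that dies
  selector:
    app: cockroach
  template:
    metadata:
      labels:
        app: cockroach
    spec:
      containers:
      - name: cockroach
        image: cockroachdb/cockroach
        volumeMounts:
        - name: data
          mountPath: /cockroach-data
      volumes:
      - name: data
        emptyDir: {}           # node-local; discarded with the pod
```

A replacement pod then starts empty and re-replicates its data from the surviving nodes.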
@petermattis: yes, if a node restarts with a new IP address, everything will continue to work fine.
@petermattis: I worry about the data recovery time when a new node replaces an old one. Suppose we have 10 nodes and one of them dies because of heavy load; the new node will not be available for quite a long time because it has to heal its data from the other nodes, so the other 9 nodes will have more load to handle and one of them dies ... in the end no instance is alive. Another concern: although pod death is relatively infrequent, there is still the possibility that enough nodes die that parts of the data are lost completely. For enterprise use, I don't want to lose ANY data for that reason.
I agree with @linuxerwang; there is too much risk of multiple failures at the same time to run with no persistent storage. (In the event of a power outage, will kubernetes even try to find the "non-persistent" storage of old jobs, or will everything just be restarted from scratch?) The first node is started without --join.
I was picturing the first pod "seeded" with an ephemeral node (back to
Ok, but the persistent storage options are pretty bad (NAS, NFS, GlusterFS, ...), and it's pretty insane (?) to run Cockroach on any of them. None of these problems are Cockroach-specific; they're issues Kubernetes (and, as far as I remember, all the others) needs to solve. The "sane" way of running within the constraints of a) using local storage and b) sticking to what the API gives us is, I think, what I suggested above (until upstream gets there), even though at the moment it won't survive a complete blackout (or even just all processes dying at the same time; that also loses the storage).
I think it'll all be gone. Pods don't survive an outage (you'll get a new pod from the ReplicationController).
Yeah, this is the problem. The container deployment systems are designed primarily for stateless services, with all persistent state managed separately (e.g. GFS didn't run on Borg; it was a peer to Borg). And where they're adding persistence support, they're aiming to support naive unreplicated services by fully replicating the persistent filesystem. Some sort of in-between, best-effort persistence is needed for any of these container platforms to be a recommendable option for running cockroachdb, but I haven't seen any signs that anything like this is in the works.
There is flocker, and we should keep an eye out for (and probably even chime in on) the discussions in kubernetes/kubernetes#7562 (which links to kubernetes/kubernetes#598) and also those on other databases that will have the same problems, for example kubernetes/kubernetes#24030.
@tschottdorf FYI, Spanner/BigTable run on top of clustered storage (GFS/Colossus). Google Persistent Disks (and presumably Amazon EBS) add an extra layer of overhead, but we could avoid that if we went directly to the underlying data store (Google Cloud Storage or Amazon S3). I'm not saying we should do this, but we could.
I haven't had a chance to play around with cockroach yet but plan to make it work with PetSet. If someone more familiar with deploying cockroach (@linuxerwang?) could write up something like kubernetes/kubernetes#23828, that would be great. I'm looking more for deployment/scaling/failover patterns shared across databases than for how you might deploy the database onto kube given today's idioms.
Petset should provide identity: kubernetes/kubernetes#24030 (comment)
Hi @bprashanth, I am probably not the best fit for this task: a) I am not working on the cockroach team, and b) for the next few quarters I cannot dedicate enough of my time to it. @petermattis has been working on k8s with cockroach and has made good progress. He would be a better choice as the owner, and I could assist him.
@bprashanth CockroachDB is designed to be simple to deploy and scale. To initialize a cluster you need to bring up a single "bootstrap" node. This node performs one-time initialization such as the allocation of the cluster UUID and initialization of some system metadata. Additional nodes are added to the system by pointing them at an existing node in the cluster (e.g. the bootstrap node) using the --join flag.

When a node joins the cluster it is internally allocated a node ID from a monotonically increasing sequence that starts at 1. These node IDs are used internally for all addressing. When a node joins the cluster it broadcasts its node ID (which is stored on persistent disk) and IP address to the rest of the nodes. Cockroach should only have a minor hiccup if the IP address for a node changes. The nodes in a cluster actively monitor each other to determine if a node has died "permanently", at which point recovery of the data from other nodes in the cluster begins. Currently, that recovery kicks off when a node has been unresponsive for 5 minutes.

I'm happy to answer any questions you might have. I think CockroachDB has some of the simplest deployment characteristics for a distributed storage system, yet there is still an impedance mismatch with container engines like kubernetes, which expect identical replicas.
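As a concrete sketch of that flow (the store paths and the `--insecure`/`--host` flags are illustrative of the CLI of this era, not a canonical invocation):

```sh
# Bootstrap node: the first start initializes the cluster
# (cluster UUID, system metadata, node ID 1).
cockroach start --insecure --store=/data --host=<node1-ip>

# Every additional node joins by pointing at an existing member.
cockroach start --insecure --store=/data --host=<node2-ip> \
  --join=<node1-ip>:26257
```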
Thanks! Some high-level questions:
Does it accept DNS names/hostnames instead of IPs? Providing stable IPs is significantly harder.
What happens if I just bring all the nodes up with --join=first-node, but the others come up before the first — will they wait?
Is it quorum based? Will I run into problems like: in a cluster of 3 where 1 is lagging and 2 are in quorum, if I add 2 nodes at the same time, and the first 2 die, and the new 2 form a quorum with the lagging node? Should we always scale one at a time? Is there some way to add a node as a non-voting member first?
Do they download state over the network from the one node they're pointed at, or is it more like p2p? (In which case, would that one node take a perf hit? This is common in Galera.) I.e., do we need to be careful about rsyncing data offline before joining a node, so it only downloads an incremental snapshot of transactions.
@bprashanth the quickest way to describe CockroachDB is like D (http://www.pdl.cmu.edu/SDI/2012/101112.html), Colossus, and Spanner in one binary. I am not sure a PetSet is always the best way to run it. If, e.g., you want one cluster-wide instance (I'm not sure whether that's recommended), it might make more sense as a DaemonSet, perhaps constrained to specific machine types if the cluster is not homogeneous. Then the admin/user would run the bootstrap node once, as @tschottdorf mentioned, to bring the dormant nodes to life. BTW, even D runs on Borg, although with special hooks.
Running cockroach with a DaemonSet is fine for me. But even so, the VIP of the pod might change over time, right?
The network identity work Prashanth mentioned will ensure that there is a unique DNS name correlated with a unique disk instance, under either DaemonSet or PetSet. We won't guarantee that two instances might not think they have the same name at the same time, but in the presence of a fenced disk (AWS/GCE) or a cluster reconfigurer (which will be able to rely on observing a predictable sequence of cluster membership changes and applying them from the outside), we can at least ensure there is a consistent winner. The concern is joining a split brain where the "first node" is partitioned during cluster initialization; if we can wait for a quorum, we can at least reduce the window. Right now we're recommending DaemonSets for data-gravity workloads (make local disks pets), while PetSet would be for workloads that can handle network-attached storage.
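To make that concrete: under a PetSet governed by a headless service, each pod gets a stable DNS name of the form `<pod>.<service>.<namespace>.svc.cluster.local`, so joins can reference names that survive rescheduling. A sketch, assuming a PetSet and service both named `cockroachdb` in the `default` namespace:

```sh
# Pods get stable, ordinal hostnames regardless of which node they land on:
#   cockroachdb-0.cockroachdb.default.svc.cluster.local
#   cockroachdb-1.cockroachdb.default.svc.cluster.local
# A new node can therefore join via a name rather than a pod IP:
cockroach start --insecure \
  --join=cockroachdb-0.cockroachdb.default.svc.cluster.local:26257
```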
Would having a bootstrapping replication controller that has a bootstrapping service with a fixed cluster IP (like skydns does) work?
@petermattis since K8s 1.3 has been cut with alpha support for PetSets, perhaps this next month could be a great time to make progress here? I am very interested in this.
@josephjacks Thanks for the ping, I just picked this up to play around a bit. Doubt I'll make it all the way, but I'll post some progress here.
Here's the promised update: it mostly just works (though it took me a while, due to "first real contact" with kubernetes and the usual DNS/networking issues). See https://github.com/cockroachdb/cockroach/compare/tschottdorf/kubernetes-petset and the rendered README. Nothing fancy (in particular, one would want to add health checks and the like), but it works. As pointed out above, a daemon set might be the way to go due to (presumably) better support for local storage, but I assume some folks will want to run a PetSet regardless. cc @bprashanth (also re: is this worth landing somewhere?)
Great! Yes, please add it as an example. Open up a PR in the main kubernetes repo under https://github.com/kubernetes/kubernetes/tree/master/examples and tag me. If you think it's actually stable we can even start e2e testing it; we already do so for a few petsets like zookeeper. Daemonsets will work if you're not actually trying to use containers for the main reason most people use containers (multi-tenancy). Regarding storage, I assume you ran it on a cloud provider that doesn't have a dynamic provisioner? We're writing provisioners for gluster/ceph that will work on bare metal. If you have more feature requests that'd make spanner easier, please update kubernetes/kubernetes#260 or kubernetes/kubernetes#18016.
bleh, I meant cockroachdb :)
I was only able to run it locally via …
Opened a PR: kubernetes/kubernetes#28446
Oh, and e2e tests would be nice. From a CockroachDB perspective, there's really nothing happening here that wouldn't happen in our other tests (maybe some slight differences in the networking layer), so it'd be good to give that a shot (though I'm sure reality will find a way to break things).
Automatic merge from submit-queue

Productionize the cockroachdb example a little more

Includes:
* A service for clients to use
* Readiness/liveness probes
* An extended graceful termination period
* Automatic prometheus monitoring (when prometheus is configured to watch for annotations on services, as in [CoreOS's recent blog post](https://coreos.com/blog/prometheus-and-kubernetes-up-and-running.html), for example)

I'm leaving the management of certs to future work, but if anyone that sees this needs help with them in the meantime, don't hesitate to reach out.

Successor to #28446

@bprashanth - if you're still interested in / open to an e2e test (as mentioned in cockroachdb/cockroach#5967 (comment)), let me know and I'll put one together. If so, I assume you'd want it as part of the `petset` test group rather than the `examples` tests?

cc @tschottdorf

**Release note**:
```release-note
NONE
```
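For reference, the probes and monitoring hook described in that PR follow this general pattern (the ports, paths, annotation keys, and service name below are illustrative of the pattern, not copied from the merged manifests):

```yaml
# Fragment for the cockroachdb container spec: probe the HTTP endpoint.
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
---
# Client-facing Service, annotated so an annotation-driven Prometheus
# setup (as in the CoreOS post) discovers and scrapes it.
apiVersion: v1
kind: Service
metadata:
  name: cockroachdb-public
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
spec:
  selector:
    app: cockroachdb
  ports:
  - name: http
    port: 8080
```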
Nice! Thanks @a-robinson
I tried to deploy cockroach in Kubernetes (k8s) this weekend, but failed miserably.
The goal: use k8s' replication controller to establish a self-recoverable cockroach cluster.
First of all, I set up an NFS persistent volume (PV) for each cockroach instance. Since k8s currently has no capability to automatically allocate a persistent volume to a specific pod, I had to create a distinct PV and replication controller (with replicas = 1) for each instance, roughly as sketched below.
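The per-instance wiring looks roughly like this (one PV/claim/RC trio per node; the names, sizes, and NFS server address here are placeholders, not the original manifests):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: cockroach-pv-1            # one PV per cockroach instance
spec:
  capacity:
    storage: 10Gi
  accessModes: ["ReadWriteOnce"]
  nfs:
    server: nfs.example.com       # placeholder NFS server
    path: /exports/cockroach-1
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cockroach-data-1
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: ReplicationController
metadata:
  name: cockroach-1
spec:
  replicas: 1                     # exactly one pod per persistent volume
  template:
    metadata:
      labels:
        app: cockroach
        instance: "1"
    spec:
      containers:
      - name: cockroach
        image: cockroachdb/cockroach
        volumeMounts:
        - name: data
          mountPath: /cockroach-data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: cockroach-data-1
```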
Then I created a k8s service for the first instance. So far, it worked fine: I could access the admin UI and manipulate the database from outside of k8s.
Problems now emerge:
So new mechanisms are needed to create a cockroach cluster. A few alternative solutions:
Solution #1 involves much more work, but is easier for users to set up. #2 and #3 are workarounds: easier to implement but somewhat harder to set up.