Use of pzoo and zoo as persistent/ephemeral storage nodes #123
It was introduced in #34 and discussed in #26 (comment). The case for this has weakened though, with increased support for dynamic volume provisioning across different Kubernetes setups, and with this setup being used for heavier workloads. I'd prefer if the two statefulsets could simply be scaled up and down individually. For example, if you're on a single zone you don't have the volume portability issue. In a setup like #118 with local volumes, however, it's quite difficult to ensure quorum capabilities on single node failure. Unfortunately Zookeeper's configuration is static prior to 3.5, which is in development. Adapting to initial scale would be doable, I think. For example the init script could use the Kubernetes API to read the …
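For illustration only, a minimal sketch of how an init script could do that. The statefulset name `zoo`, the config path, and the availability of `curl`/`jq` plus a service account allowed to read statefulsets are all assumptions, not taken from this repo's actual manifests:

```bash
#!/bin/bash
# Hypothetical init-script fragment: size the ZooKeeper server list from the
# StatefulSet's desired replica count instead of hard-coding it.
set -e

SA=/var/run/secrets/kubernetes.io/serviceaccount
NAMESPACE=$(cat $SA/namespace)
TOKEN=$(cat $SA/token)

# Assumes the pod's service account may read statefulsets in its namespace.
REPLICAS=$(curl -s --cacert $SA/ca.crt -H "Authorization: Bearer $TOKEN" \
  "https://kubernetes.default.svc/apis/apps/v1/namespaces/$NAMESPACE/statefulsets/zoo" \
  | jq -r '.spec.replicas')

# One server.N entry per replica, addressed through the headless service.
for i in $(seq 0 $((REPLICAS - 1))); do
  echo "server.$((i + 1))=zoo-$i.zoo:2888:3888" >> /etc/zookeeper/zoo.cfg
done
```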
Hi @solsson. I have the same confusion. I've read your comment on the other issue, but it didn't help me understand why two nodes are using emptyDir rather than persistent volumes. Could you elaborate a little more on the scenarios where they would be useful? How does it compare to using persistent volumes for all 5 nodes? I'm running my Kubernetes cluster on AWS, with 6 worker nodes spread across 3 availability zones. Thanks.
Good that you question this. The complexity should be removed if it can't be motivated. I'm certainly prepared to switch to all-persistent Zookeeper.

The design goal was to make the persistence layer as robust as the services layer. Probably not as robust as bucket stores or 3rd-party hosted databases, but the same uptime as your frontend is good enough. Thus workloads will have to migrate in the face of lost availability zones, like non-stateful apps will certainly do with Kubernetes. I recall https://medium.com/spire-labs/mitigating-an-aws-instance-failure-with-the-magic-of-kubernetes-128a44d44c14 and "a sense of awe watching the automatic mitigation".

Unless you have a volume type that can migrate, the problem is that stateful pods will only start in the zone where the volume was provisioned. With both 5- and 7-node zk across 3 zones, if a zone with 2 or 3 zk pods respectively goes out, you're -1 pod away from losing a majority of your zk. My assumption is that a lost majority means your service goes down. A zone outage can be extensive, as in the AWS case above, and due to zk's static configuration you can't reconfigure to adapt to the situation, as that would cause the -1. With kafka brokers you can throw money at the problem: increase your replication factor. With zk you can't. Or maybe you can, with scale=9?
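To spell out the arithmetic behind that "-1": a 5-node ensemble needs 3 voters for a majority and a 7-node ensemble needs 4. Spread over 3 zones (2+2+1 or 3+2+2), losing the zone holding 2 or 3 pods leaves exactly 3 or 4 alive, so any single further pod failure drops the ensemble below majority and quorum is lost.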
@solsson I've tried to rephrase the reason for having pzoo and zoo below. Let me know what you think. AFAICT, there are at least two types of failures against which there should be some protection:

- AZ failures: If there are 3 AZs, the 5 ZK pods are spread across these 3 AZs. If an AZ goes down, there is little benefit to having 5 ZK pods, since the AZ that went down could take out 2 of them, leaving the ZK cluster 1 more failure away from being unavailable. The situation would be the same if there were only 3 ZK pods and 1 AZ went down.
- Software errors: Each pod could go down by itself, and having 5 ZK nodes helps because the ensemble can tolerate 2 individual pod failures (instead of 1 in the 3-ZK case).

While having only 3 EBS volumes instead of 5 does keep costs low, to avoid confusion it would be better to have a single statefulset of pzoo with 5 nodes.
@shrinandj I think I agree at this stage. What would be even better, in particular now (unlike in the k8s 1.2 days) that support for automatic volume provisioning can be expected, would be to support scaling of the zookeeper statefulset(s). That way everyone can decide for themselves, and we can default to 5 persistent pods. Should be quite doable in the init script, by retrieving the desired number of replicas with kubectl. I'd be happy to accept PRs for such things.
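A sketch of just the retrieval step, assuming the statefulsets keep their current `pzoo`/`zoo` names and the init container has kubectl (or equivalent API access) with permission to read statefulsets:

```bash
# Desired (not current) replica counts, from which the init script
# could generate the ensemble's server list.
PZOO_REPLICAS=$(kubectl get statefulset pzoo -o jsonpath='{.spec.replicas}')
ZOO_REPLICAS=$(kubectl get statefulset zoo -o jsonpath='{.spec.replicas}')
echo "ensemble size: $((PZOO_REPLICAS + ZOO_REPLICAS))"
```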
Can you elaborate a bit on that?
What changes are required in the init script?
Sounds like a good summary, and my ideas for how are sketchy at best. Sadly(?) this repo has come of age already and needs to consider backwards compatibility. Hence we might want a multi-step solution:
@solsson I understand that the steps mentioned above are needed due to backwards compatibility, but in case I want 5 …
@AndresPineros You'll also need to change the …
See #191 (comment) for the suggested way forward.
Hi,
In Zookeeper we have the notion of persistent/ephemeral nodes, but I'm struggling to understand why these concepts have been used here in terms of persistent volumes in K8s.
Can someone elaborate a bit further on what the objectives are for this intentional configuration?
Thanks.