Copy volumes across AZ #923
Volume cloning (#207) should help. The user has to be the one to initiate the cloning, though, by setting dataSource. PVC:PV binding is 1:1, so without dataSource the driver has no way of knowing that two PVCs want the same data. How it would look: you have an existing Pod_A+PVC_A+PV_A in zone A. You create a Pod_B+PVC_B and it gets scheduled in zone B, where PVC_B has a dataSource referencing PVC_A.
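As a rough illustration, here is a minimal sketch of what PVC_B might look like, assuming PVC-to-PVC cloning support as proposed in #207; the names, storage class, and size are all placeholders, not something the driver defines:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-b                    # PVC_B, whose pod lands in zone B
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ebs-sc       # assumed EBS CSI StorageClass
  resources:
    requests:
      storage: 100Gi
  dataSource:
    kind: PersistentVolumeClaim  # clone the data of PVC_A
    name: pvc-a
```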
Unfortunately, requiring the cluster admin to manually clone volumes doesn't solve the issue with Cluster Autoscaler, where cloning really needs to be automatic. In that sense, perhaps CA or a third-party tool could detect this situation (a volume needing to be moved) and handle it automatically. But ideally, if this could be handled at the storage level, that would be the most seamless solution.
I thought EBS CSI only supported the access modes defined in aws-ebs-csi-driver/pkg/driver/controller.go, lines 38 to 45 (at 990cf33).
To clarify, are you saying that if a pod gets moved to a new node, there's no way to know the original volume the PVC should reference without specifying it explicitly (e.g. in dataSource)?
Yes, but if you have two PVCs referring to the same volume, they will work as long as their Pods are scheduled to the same Node. The access modes have to be read in terms of Nodes, not Pods, i.e. RWO means the volume can be read/written by one Node. For example, if you want to use the same volume in different namespaces, then you need PVC_A+PV_A and PVC_B+PV_B, where PV_A and PV_B both refer to vol-1234, and both PVC_A's and PVC_B's Pods must be scheduled to the same Node. CSI is less ambiguous than the Kubernetes API; there is more than just RWO/ROX/RWX, and in CSI terms EBS is "SINGLE_NODE_MULTI_WRITER". https://github.com/container-storage-interface/spec/blob/master/spec.md#nodepublishvolume Anyway, all of this isn't super relevant to the issue, but I wanted to clarify.
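For illustration, a hedged sketch of one of those statically provisioned PVs pointing at the shared volume; only vol-1234 comes from the example above, and the name, size, and zone are assumptions:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-a                      # PV_B would look the same apart from its name
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce               # RWO: one Node at a time, not one Pod
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: vol-1234        # the shared, pre-existing EBS volume
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.ebs.csi.aws.com/zone
              operator: In
              values:
                - us-east-1a      # assumed zone of vol-1234
```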
Yes, there is no way to know because there's no explicit event/API for the storage driver to know that a Pod (using a PV that the storage driver created) has been rescheduled to a different node in a different zone. The closest we have is
I am not that familiar with the Cluster Autoscaler use case, but I guess it depends on the application being scaled: does it need the replicas to read the same (point-in-time) data? Of course, even if the driver could automatically do the copy, the data would diverge for the volumes in different zones. If the data doesn't need to be the same, the StatefulSet should use a PVC template so that a new volume gets provisioned wherever the replica gets scheduled.
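For instance, a minimal StatefulSet sketch using volumeClaimTemplates so each replica gets its own dynamically provisioned volume in whatever zone it lands in; the names, image, and storage class here are illustrative assumptions:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example-app
spec:
  serviceName: example-app
  replicas: 2
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: public.ecr.aws/docker/library/busybox:latest
          command: ["sleep", "infinity"]
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:            # one PVC per replica, provisioned where the replica is scheduled
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: ebs-sc   # assumed EBS CSI StorageClass
        resources:
          requests:
            storage: 10Gi
```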
Thank you for writing all that up.
This is in line with what I was envisioning. Some kind of controller that sees you need a volume from a different zone and automatically copies/moves it instead of the scheduler/provisioner/autoscaler/whatever just hanging or erroring out, which is more or less what happens today.
Technically, the issue arises any time you have a pod with a dependency on a specific AZ. But I think EBS is by far the biggest reason pods become coupled to specific AZs. If a node hosting a persistent volume terminates for whatever reason, you'll need to launch a new instance in the same AZ to reattach that volume (assuming you don't already have such an instance available). But there's no way to force a multi-AZ ASG to give you an instance in a specific AZ, so tools like Cluster Autoscaler can't reliably launch the needed nodes. This AWS blog post provides more info: https://aws.amazon.com/blogs/containers/amazon-eks-cluster-multi-zone-auto-scaling-groups/
For the use case of moving a previously running StatefulSet pod to a new node, the data would need to be preserved, but multiple replicas don't need to be supported.
Yes, if you're fine with data not being preserved then there's no issue. But for something like a DB running on a node that just died, you want to be able to move the existing EBS volume as quickly as possible to your new instance without losing all your data.
I think, then, it's up to AWS ASG to provide a way to launch in a specific zone, or for Cluster Autoscaler to come up with some hacky workaround (like suspending ASG processes, launching an instance in a specific zone, attaching it to the ASG, and resuming ASG processes?), or for us to implement volume cloning and then for some other tool to automate the dataSource/volume cloning process. Or for EBS to provide cross-zone replication magic. The CSI driver, being so ignorant of the Kubernetes API, operates at too low a level to solve this on its own. :-/
I understand now that the issue must be that the scheduler is hanging because it is unable to find a Node to satisfy the Pod+PVC+PV. My hypothetical AttachVolume solution is realistically useless because, assuming the scheduler is working correctly, the topology/node affinity API will ensure the scheduler never puts a Pod in the "wrong" zone in the first place. For the provisioner, the problem of provisioning PVs in the "wrong" zone is solved by ensuring WaitForFirstConsumer is set in the StorageClass.
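For reference, a minimal StorageClass sketch with delayed binding, assuming the EBS CSI provisioner; the name and parameters are illustrative:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-sc                              # illustrative name
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer     # delay provisioning until the pod is scheduled, so the PV lands in the right zone
parameters:
  type: gp3                                 # assumed volume type
```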
Yep. Any of those solutions would probably work. AWS is in the best position to fix this, either at the ASG or EBS level. Barring that, it's a complicated issue that cuts across K8s systems (storage, CA, scheduling). I think that's why I haven't seen a good solution beyond "don't use EBS with multi-AZ ASGs."
Thanks. I thought this may be the case, but I figured it was worth bringing up for discussion, especially since a solution where the CSI driver magically moves your volume when it sees a cross-AZ request would be pretty neat UX. But if a solution doesn't work at the CSI level, then I understand. Perhaps a dedicated controller, separate from the CSI driver, makes more sense?
Right, something must detect that a pod can't be scheduled because there are zero available nodes that satisfy its EBS volume requirement, and then act accordingly. Easier said than done. The Cluster Autoscaler docs also suggest creating one ASG in each zone and running one Cluster Autoscaler per zone: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider/aws#common-notes-and-gotchas
@k8s-triage-robot: Closing this issue.
Is your feature request related to a problem?/Why is this needed
The fact that EBS volumes are tied to a specific availability zone is a well-documented issue (see https://docs.aws.amazon.com/eks/latest/userguide/managed-node-groups.html).
Oftentimes, the explosion in the number of ASGs that the use of EBS requires is untenable, and users are forced to use alternative, cross-AZ storage backends like EFS.
Absent AWS adding support for cross-AZ EBS volumes similar to GCP's Regional Persistent Disks, it would be nice if the EBS CSI driver could copy volumes across availability zones.
Describe the solution you'd like in detail
When a pod in AZ `a` makes a claim to a volume in AZ `b`, the EBS CSI driver could take a snapshot of the volume and then restore it into a new volume in the correct AZ.
Describe alternatives you've considered
Additional context
A classic example of the issue with EBS + Cluster Autoscaler is CA needing to place a StatefulSet's pod on a new node. The pod must be placed in the same AZ as the existing EBS volume, but CA can't force the node's ASG to launch an instance in a particular AZ. If the EBS volume were automatically moved when this happened, the issue would be solved.
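A hedged sketch of the snapshot-then-restore flow the issue proposes, expressed with the standard CSI snapshot APIs; every name, class, and size here is an illustrative assumption, and nothing in this sketch is automated by the driver today:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: db-data-snap                 # snapshot of the volume stranded in zone a
spec:
  volumeSnapshotClassName: ebs-vsc   # assumed VolumeSnapshotClass for the EBS CSI driver
  source:
    persistentVolumeClaimName: db-data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data-zone-b               # new claim; WaitForFirstConsumer places it with the pod in zone b
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ebs-sc           # assumed EBS CSI StorageClass
  resources:
    requests:
      storage: 100Gi
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: db-data-snap
```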