
Copy volumes across AZ #923

Closed
gabegorelick opened this issue Jun 4, 2021 · 11 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@gabegorelick

Is your feature request related to a problem?/Why is this needed

The fact that EBS volumes are tied to a specific Availability Zone is a well-documented issue. From https://docs.aws.amazon.com/eks/latest/userguide/managed-node-groups.html:

If you are running a stateful application across multiple Availability Zones that is backed by Amazon EBS volumes and using the Kubernetes Cluster Autoscaler, you should configure multiple node groups, each scoped to a single Availability Zone. In addition, you should enable the --balance-similar-node-groups feature.
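
For reference, enabling that flag amounts to one extra argument on the cluster-autoscaler container. A rough sketch (the image tag, cluster name, and discovery tags below are placeholders, not anything from this issue):

    # Hypothetical fragment of a cluster-autoscaler Deployment spec; the flag from
    # the quoted docs is --balance-similar-node-groups.
    containers:
      - name: cluster-autoscaler
        image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.2  # example image/tag
        command:
          - ./cluster-autoscaler
          - --cloud-provider=aws
          - --balance-similar-node-groups
          - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster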

Oftentimes, the resulting explosion in the number of ASGs that EBS requires is untenable, and users are forced to use alternative, cross-AZ storage backends like EFS.

Absent AWS adding support for cross-AZ EBS volumes similar to GCP's Regional Persistent Disks, it would be nice if the EBS CSI driver could copy volumes across availability zones.

Describe the solution you'd like in detail
When a pod in AZ a claims a volume in AZ b, the EBS CSI driver could take a snapshot of the volume and then restore it into a new volume in the correct AZ.

Describe alternatives you've considered

  • Separate ASG for each AZ. Leads to a proliferation of ASGs. It also makes it harder for the ASGs to operate, since scaling decisions need to be made across all the "sub-ASGs".
  • Use cross-AZ storage like EFS or Lustre. Those are very different from EBS (they're distributed file systems vs EBS's block storage), with different price and performance characteristics. It's a shame to have to use them just so Cluster Autoscaler can function.

Additional context
A classic example of the issue with EBS + Cluster Autoscaler is CA needing to place a StatefulSet's pod on a new node. The pod must be placed in the same AZ as the existing EBS volume, but CA can't force the node's ASG to launch an instance in a particular AZ. If the EBS volume were automatically moved when this happened, the issue would be solved.

@wongma7
Contributor

wongma7 commented Jun 4, 2021

When a pod in AZ a claims a volume in AZ b, the EBS CSI driver could take a snapshot of the volume and then restore it into a new volume in the correct AZ.

Volume cloning (#207) should help. The user has to be the one to initiate the cloning, though, by setting dataSource. PVC:PV binding is 1:1, so without dataSource the driver has no way of knowing if two PVCs want the same data. How it would look is: You have an existing Pod_A+PVC_A+PV_A in zone A. You create a Pod_B+PVC_B and it gets scheduled in zone B, where PVC_B has dataSource: PVC_A. Then the driver creates PV_B such that it's a copy of PV_A.
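
If/when that lands, a minimal sketch of what PVC_B might look like (the names, size, and StorageClass below are purely illustrative):

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: pvc-b                    # the claim that gets scheduled into zone B
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: ebs-sc       # assumed EBS CSI StorageClass
      resources:
        requests:
          storage: 10Gi              # must be at least the size of the source claim
      dataSource:
        kind: PersistentVolumeClaim
        name: pvc-a                  # the existing claim backing PV_A in zone A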

@gabegorelick
Author

The user has to be the one to initiate the cloning though by setting dataSource.

Unfortunately, requiring the cluster admin to manually clone volumes doesn't solve the Cluster Autoscaler issue, where any cloning really needs to be automatic. In that sense, perhaps CA or a third-party tool could detect this situation (a volume needing to be moved) and handle it automatically. But ideally, handling this at the storage level would be the most seamless solution.

without dataSource the driver has no way of knowing if 2 PVCs want the same data

I thought EBS CSI only supported ReadWriteOnce? Or is this more relevant in the future if you support EBS multi-attach?

    // volumeCaps represents how the volume could be accessed.
    // It is SINGLE_NODE_WRITER since EBS volume could only be
    // attached to a single node at any given time.
    volumeCaps = []csi.VolumeCapability_AccessMode{
        {
            Mode: csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER,
        },
    }

How it would look is: You have an existing Pod_A+PVC_A+PV_A in zone A. You create a Pod_B+PVC_B and it gets scheduled in zone B, where PVC_B has dataSource: PVC_A. Then the driver creates PV_B such that it's a copy of PV_A.

To clarify, are you saying that if a pod gets moved to a new node, there's no way to know the original volume the PVC should reference without specifying it in dataSource?

@AndyXiangLi AndyXiangLi added the kind/feature Categorizes issue or PR as related to a new feature. label Jun 7, 2021
@wongma7
Contributor

wongma7 commented Jun 7, 2021

I thought EBS CSI only supported ReadWriteOnce?

Yes, but if you have two PVCs referring to the same volume, they will work as long as their pods are scheduled to the same Node. The access modes have to be read in terms of Nodes, not Pods, i.e. RWO means the volume can be read/written by one Node. For example, if you want to use the same volume in different namespaces, then you need PVC_A+PV_A and PVC_B+PV_B, where PV_A and PV_B both refer to vol-1234, and both PVC_A's and PVC_B's Pods must be scheduled to the same Node.
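
Purely for illustration, that two-namespace case would look roughly like two statically created PVs pointing at the same volume handle (the namespaces and claim names are made up):

    # PV_A, bound by PVC_A in namespace ns-a
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: pv-a
    spec:
      capacity:
        storage: 10Gi
      accessModes:
        - ReadWriteOnce
      csi:
        driver: ebs.csi.aws.com
        volumeHandle: vol-1234        # same underlying EBS volume as pv-b
      claimRef:
        namespace: ns-a
        name: pvc-a
    ---
    # PV_B, bound by PVC_B in namespace ns-b; both pods must land on the same node
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: pv-b
    spec:
      capacity:
        storage: 10Gi
      accessModes:
        - ReadWriteOnce
      csi:
        driver: ebs.csi.aws.com
        volumeHandle: vol-1234
      claimRef:
        namespace: ns-b
        name: pvc-b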

CSI is less ambiguous than the Kubernetes API; there is more than just RWO/ROX/RWX, and in CSI terms EBS is "SINGLE_NODE_MULTI_WRITER". https://github.com/container-storage-interface/spec/blob/master/spec.md#nodepublishvolume

Anyway, all this isn't super relevant to the issue, but I wanted to clarify.

To clarify, are you saying that if a pod gets moved to a new node, there's no way to know the original volume the PVC should reference without specifying it in dataSource?

Yes, there is no way to know because there's no explicit event/API for the storage driver to know that a Pod (using a PV that the storage driver created) has been rescheduled to a different node in a different zone.

The closest we have is

  1. volume cloning/dataSource API: as described, the user explicitly requests that a volume gets copied, which could mean the user intends to copy it for use in another zone but not necessarily.
  2. ControllerPublish/Attach event: if the driver sees that a vol-A referred to by PV_A exists in zone A and receives a request to attach it to an instance in zone B, then theoretically it could automatically copy the volume to vol-B in zone B and attach that instead of returning the "wrong zone" error. But the driver is operating at a lower level than the Kubernetes API: it doesn't know whether the Pod using PV_A is new or being rescheduled, whether it's a StatefulSet replica, etc., and even if it did, it has no way to edit PV_A to refer to vol-B instead of vol-A depending on what zone the Pod lands in.

I am not that familiar with the Cluster Autoscaler use case. But I guess it depends on the application being scaled: do the replicas need to read the same (point-in-time) data? Of course, even if the driver could automatically do the copy, the data would diverge between the volumes in different zones. If the data doesn't need to be the same, the statefulset should use a PVC template so that a new volume gets provisioned wherever the replica gets scheduled.
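
A rough sketch of that PVC-template approach (every name, image, and size here is illustrative, and it assumes a WaitForFirstConsumer StorageClass):

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: example
    spec:
      serviceName: example
      replicas: 2
      selector:
        matchLabels:
          app: example
      template:
        metadata:
          labels:
            app: example
        spec:
          containers:
            - name: app
              image: busybox                  # placeholder image
              command: ["sleep", "3600"]
              volumeMounts:
                - name: data
                  mountPath: /data
      volumeClaimTemplates:
        - metadata:
            name: data
          spec:
            accessModes: ["ReadWriteOnce"]
            storageClassName: ebs-sc          # assumed EBS CSI StorageClass
            resources:
              requests:
                storage: 10Gi                 # each replica gets its own volume in its own zone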

@gabegorelick
Author

Thank you for writing all that up.

ControllerPublish/Attach event: if the driver sees that a vol-A referred to by PV_A exists in zone A and receives a request to attach it to an instance in zone B, then theoretically it could automatically copy the volume to vol-B in zone B and attach that instead of returning the "wrong zone" error.

This is in line with what I was envisioning: some kind of controller that sees you need a volume from a different zone and automatically copies/moves it, instead of the scheduler/provisioner/autoscaler/whatever just hanging or erroring out, which is more or less what happens today.

I am not that familiar with the cluster autoscaler use case.

Technically, the issue arises any time you have a pod with a dependency on a specific AZ. But I think EBS is by far the biggest reason pods become coupled to specific AZs.

If a node hosting a persistent volume terminates for whatever reason, you'll need to launch a new instance in the same AZ to reattach that volume (assuming you don't already have such an instance available). But there's no way to force a multi-AZ ASG to give you an instance in a specific AZ, so tools like Cluster Autoscaler can't reliably launch the needed nodes.

This AWS blog post provides more info: https://aws.amazon.com/blogs/containers/amazon-eks-cluster-multi-zone-auto-scaling-groups/

is it needed for replicas to read the same (point in time) data

For the use case of moving a previously running StatefulSet pod to a new node, the data would need to be preserved, but multiple replicas don't need to be supported.

If it's not needed for the data to be the same, the statefulset should use a PVC template so that a new volume gets provisioned wherever the replica gets scheduled.

Yes, if you're fine with data not being preserved, then there's no issue. But for something like a DB running on a node that just died, you want to be able to move the existing EBS volume to your new instance as quickly as possible, without losing all your data.

@wongma7
Contributor

wongma7 commented Jun 7, 2021

I think then it's up to AWS ASG to provide a way to launch in a specific zone, or for Cluster Autoscaler to come up with some hacky workaround (like suspend ASG processes, launch an instance in a specific zone, attach it to the ASG, resume ASG processes?), or for us to implement volume cloning and then for some other tool to automate the dataSource/volume cloning process. Or for EBS to provide cross-zone replication magic. The CSI driver, being so ignorant of the Kubernetes API, is at too low a level to solve it on its own. :-/

instead of the scheduler/provisioner/autoscaler/whatever just hanging or erroring out, which is more or less what happens today.

I understand now that the issue must be that the scheduler is hanging because it is unable to find a Node to satisfy the Pod+PVC+PV. My hypothetical AttachVolume solution is realistically useless because, assuming the scheduler is working correctly, the topology/node affinity API will ensure the scheduler never puts a Pod in the "wrong zone" in the first place.

For the provisioner, the problem of provisioning PVs in the "wrong" zone is solved by ensuring WaitForFirstConsumer is set in the StorageClass.
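
For completeness, a minimal example of such a StorageClass (the class name and volume type are illustrative):

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: ebs-sc
    provisioner: ebs.csi.aws.com
    volumeBindingMode: WaitForFirstConsumer   # delay binding until the pod is scheduled
    parameters:
      type: gp3                               # example EBS volume type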

@gabegorelick
Author

I think then it's up to AWS ASG to provide a way to launch in a specific zone, or for Cluster Autoscaler to come up with some hacky workaround (like suspend ASG processes, launch an instance in a specific zone, attach it to the ASG, resume ASG processes?), or for us to implement volume cloning and then for some other tool to automate the dataSource/volume cloning process. Or for EBS to provide cross-zone replication magic.

Yep. Any of those solutions would probably work. AWS is in the best position to fix this, either at the ASG or EBS level. Barring that, it's a complicated issue that cuts across K8s systems (storage, CA, scheduling). I think that's why I haven't seen a good solution beyond "don't use EBS with multi-AZ ASGs."

The CSI driver, being so ignorant of the Kubernetes API, is at too low a level to solve it on its own. :-/

Thanks. I thought this may be the case, but I figured it was worth bringing up for discussion, especially since a solution where the CSI driver magically moves your volume when it sees a cross-AZ request would be pretty neat UX. But if a solution doesn't work at the CSI level, then I understand. Perhaps a dedicated controller, separate from the CSI driver, makes more sense?

@wongma7
Contributor

wongma7 commented Jun 7, 2021

Right, something must detect that a pod can't be scheduled because there are zero available nodes to satisfy its EBS volume requirement and then act accordingly. Easier said than done.

The Cluster Autoscaler docs also suggest creating one ASG in each zone and running one Cluster Autoscaler in each: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider/aws#common-notes-and-gotchas

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 5, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 5, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
