Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Volume topology aware scheduling design #1054

Merged
merged 2 commits into from
Oct 4, 2017

Conversation

msau42
Copy link
Member

@msau42 msau42 commented Sep 13, 2017

Proposal for a smarter scheduler that influences PV binding.

Part of kubernetes/enhancements#121

/sig storage
/sig scheduling
/cc @kubernetes/sig-storage-proposals @kubernetes/sig-scheduling-proposals

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. kind/design Categorizes issue or PR as related to design. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 13, 2017
@msau42
Copy link
Member Author

msau42 commented Sep 13, 2017

/assign @davidopp @jsafrane @saad-ali @thockin

preemption. Preempting a pod that uses a PV will not free up capacity on that
node because the PV lifecycle is independent of the Pod’s lifecycle.

##### Other scheduler predicates
Copy link
Contributor

@resouer resouer Sep 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please also evaluate if this new predicate need special care in factory.go according to: https://github.com/kubernetes/kubernetes/blob/master/plugin/pkg/scheduler/algorithmprovider/defaults/defaults.go#L70-L76

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the PVC or PV objects change, then this new predicate needs to be invalidated. Also if the Node labels change. I'll add that to the predicate section.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, thanks!

and prebinding them to the PVCs. The PV controller completes the binding
transaction, handling it as a prebound PV scenario.

Prebound PVCs and PVs will still immediately be bound by the PV controller.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you mean by this? if they were prebound - why do they need binding?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user only needs to prebind one side, either the PVC or the PV. And then the PV controller finishes binding the other direction.

So if the user set PV.ClaimRef, then the PV controller will set PVC.volumeName.
And if the user set PVC.volumeName, then the PV controller will set PV.ClaimRef.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the user set PVC.volumeName, then the PV controller will set PV.ClaimRef.

That should not be the case... a PVC user cannot target a specific PV by name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just tried it and it works. I created 2 PVs of different sizes, and then created a PVC that requests the size of the smaller PV, but specifies the name of the larger PV. And the volume got bound to the specified PV name. From my understanding, it's been this way since the beginning.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my understanding, it's been this way since the beginning.

I don't think so. It looks like that was added in 1.3 in https://github.com/kubernetes/kubernetes/pull/24331/files#diff-3aaafc34911c234537e0076b5ed40b37R202. I would not expect a namespaced user to be able to target a particular PV.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regardless, it's been like this for many releases now. If we want to change it to not allow prebinding PVCs, then it can be considered separately from this proposal. This proposal is not changing the current prebinding behavior.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liggitt, I thought it was common practice - a Pod author can set pod.spec.nodeName to bypass scheduling. This is the same thing.

Copy link
Member

@liggitt liggitt Sep 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a Pod author can set pod.spec.nodeName to bypass scheduling

that is prevented in some clusters. regarding PV/PVC, as long as precedence is given to the PV's claim ref, and a reference by a PVC doesn't allow it to be bound to a PV that doesn't match the PVC's resource requests, that seems ok.

following use cases:
* While a Pod is running, a user deletes the PVC and PV, and then recreates them.
* With the current behavior, the PVC will rebind to the same or different PV.
* With this new behavior, the PVC will not rebind because binding would only
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On EBS at least, since you can't delete a volume that is mounted and in use. Deleting a dynamically bound PVC results in PV being "Released" (I think). So when user creates a new PVC with same name (because user does not want to change pod's yaml), a new PV is created. So we get 2 PVs (one released and one bound) mapped to same PVC.

Copy link
Member Author

@msau42 msau42 Sep 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not all storage providers may be able to detect if a volume is in use before deleting. But assuming it can detect and fail the deletion, let's break this up into 2 scenarios actually:

  1. Delete PVC, PV is released, but not deleted. Recreate PVC.
    If the container restarts, the old volume is still used in the new container. If the Pod restarts, then it goes back through scheduling, and we rebind the PVC at that time. Pod termination works because we only need the PV, not the PVC object. So no regression here.

  2. Delete both PVC and PV objects. Recreate PVC and PV objects.
    Pod termination still works because PV object is still in the actual state. If kubelet restarts, volume reconstruction will recreate the PV object. No regression here.

So actually, I made a wrong assumption that we needed the PVC object to be bound for Pod teardown. kubelet volume manager only needs the PV object, and it can reconstruct it too if kubelet crashes. Of course, this also assumes that the volume plugin implements reconstruction correctly.

I'll remove this section unless we can think of another scenario that won't work with the new design. I'll leave the discussion of if we want to prevent PVC and PV deletion when it's still in use to a separate issue.

4. Find best matching PV for the PVC where PV topology is satisfied by the Node.
5. Temporarily cache this PV in the PVC object, keyed by Node, for fast
processing later in the priority and bind functions.
6. Return true if all PVCs are matched.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if matched PVs are deleted or PVC is resized since scheduler started matching? I am just trying to think, there appears to be a delay between when PVs/node is matched and when binding happens. Does the pod go through MatchUnbounPVCs again if something changed since match began?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same window occurs today in the PV controller. If the PV object changes, then when we try to modify it, the update will fail because we're trying to update an older version. If it fails, then we fail scheduling and retry.

If the PVC gets resized, then the binding will still happen, and afterwards, the resize controller will kick in.


```
BindUnboundPVCs(pod *v1.Pod, node *v1.Node) (err error)
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we check again if PVC constraints are still met? Can we clarify how this function accesses PV cache - because from what I see it only get pod and hence it must only have pvc name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to iron out the exact details, but I'm thinking of keeping an in-memory state in the PVC object with the chosen PV name.

Before binding, we can do some sanity checks again, but even then there can still be a race between when you check and when you actually update the object. There, we rely on the api update to fail if the PV object has changed in the meantime.

An alternative approach is to only trigger the new behavior only for volumes that
have topology constraints. A new annotation can be added to the StorageClass to
indicate that its volumes have constraints and will need to use the new
topology-aware scheduling logic.
Copy link
Contributor

@NickrenREN NickrenREN Sep 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is to say: when pv controller synchronized unbound pvc, it will always have to check storageclass's annotation to see whether it should go to the claim.Spec.VolumeName == "" block ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the alternative is to have the trigger be in the storageclass, so you need to check for the annotation every time you handle unbound volumes to determine if you should delay binding or not.


In addition, this new design will cause functional regression for the
following use cases:
* While a Pod is running, a user deletes the PVC and PV, and then recreates them.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing special happens when a Pod is running and someone deletes PV or PVC today - the Pod is still running having the old PV mounted. Kubelet will clean up the mount properly (it should not need the PV for it) and A/D controller should detach the volume too (and it's a bug if it does not).

So everything is fine until someone stops the pod and tries to start a new one with the same PVC (or PVC with the same name). With this PR, the Pod needs to go through scheduler and its PVCs will get bound to something and pods is started, so user does not observe anything wrong.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I realized kubelet doesn't need the PVC or PV object so it's ok.

With the current PV controller, it has similar behavior, except the binding happens earlier instead of waiting for the Pod to be rescheduled.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you unmount the Pods storage and it doesn't die? What application can tolerate having a volume unmounted from under it without killing the underlying application.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The volume does not get unmounted while the Pod is still running, even if the PVC/PV are deleted. Unmount only happens when the Pod is terminated.

However, if PVC is deleted, and reclaim policy is Delete, then PV controller will immediately try to delete the volume. Luckily, nothing bad happens because it depends on the storage provider to prevent volume deletion while the disk is still attached to a node.

But this is potentially a problem for local storage, which doesn't have that protection. I need to think about if there's a reliable way to detect and prevent it (in a separate issue).

be triggered when the Pod has to go through scheduling. This could cause
problems with Pod volume cleanup because the PVC will no longer be bound.
* But deleting the PVC and PV (especially both) while a Pod is still using it
is also dangerous and should be prevented in the first place. There could be
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should prevent it somehow, but it's for separate PR / proposal. See kubernetes/kubernetes#45143

Looking into the future, with the potential for NUMA-aware scheduling, you could
have a sub-scheduler on each node to handle the pod scheduling within a node. It
could make sense to have the volume binding as part of this sub-scheduler, to make
sure that the volume selected will have NUMA affinity with the rest of the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know much about NUMA, but it sounds odd to me that a volume (local or remote) would have different "distance" (speed) to different CPUs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Newer CPUs have PCI-e slots attached directly to the IO controller per socket. I found a random diagram showing the architecture here.

So this is definitely something to consider for PCI-e based local SSDs. I agree, remote storage doesn't matter here.

concrete proposals in this area yet.

### Binding multiple PVCs in one transaction
For the alpha phase, this is not required. Since the scheduler is serialized, a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR assumes that it's only the scheduler who binds PVs/PVCs and they are mostly static. There is no recovery mechanism in case of failed write.

PV controller is very robust in this regard and expects that any PV or PVC can be changed or deleted at any time - users are unpredictable and they can change their mind or they run external controllers that add / remove annotations/labels to objects. It's not that uncommon that saving a PV or PVC fails.

There should be at least some design how to make the scheduler safe from failed saves. We might paint ourselves into a corner in beta otherwise.

FYI, multi-write transactions would help a lot here but they were rejected in kubernetes/kubernetes#27548

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a single PV binding, if the PV API update fails, we'll fail scheduling and retry. So I think at least for a single PV, we are just as robust as the current PV controller.

The challenge is how to create a transaction for binding multiple PVCs together, and being able to rollback if only a subset of them bind successfully. It has the same rollback challenges as you mentioned below, and I think we could design a common rollback mechanism to handle both scenarios.

I feel like it's ok to not handle this in alpha, because even today, the PV controller doesn't have any idea of multiple PVCs being related. So this design is at least at feature parity, if not slightly better.

The internal and external provisioning APIs need to be updated to take in a node
parameter.


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I miss a chapter about external schedulers. Are they still supported? Should they support binding too? How are they going to compete over unbound PVs in a consistent way?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is definitely going to make it harder to run your own scheduler, and I think this is also one of the reasons why the scheduling sig added the scheduler extender interface, so that you can still use the default scheduler + your custom extensions. In general, the scheduling team has been adding many advanced scheduling features to the default scheduler that it's becoming harder to run your own already.

If you don't use the default scheduler, and you use persistent volumes, then you'll have to pull this code and integrate it into your custom scheduler. Of course, we will have to give notice well in advance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought more about it, if two pods use the same unbound PVC and two different schedulers bind this PVC to a different PV (i.e. one saves pv1.spec.claimRef = pvc and the second one pv2.spec.claimRef = pvc), PV controller is smart enough to recover from this conflict by un-binding one of these PVs, i.e. it clears say pv1.spec.claimRef=nil (and deletes it when necessary).

If both schedulers implement step "9. **NEW:** If PVC binding or provisioning is required, we do NOT AssumePod ... then **wait until the PVCs are bound successfully** or encounter failure" then the PVC gets eventually bound to one of these PVs. Not necessarily the one that a particular scheduler has chosen, but that should not be a problem, right?

So, we're fine as long as both schedulers implement the rules you specified. They may generate some noise by provisioning an unnecessary PV or do extra binding, but PV controller should clean up this mess.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the analysis. It should be fine, but the only potential problem is if the 2nd scheduler bound the PVC in a way that cannot satisfy the all the predicates in the 1st scheduler. In general, we should only recommend that one scheduler handle PVC binding.

Can @kubernetes/sig-scheduling-pr-reviews comment on the practice of running two schedulers against the same Pod? Is that a supported configuration, or do we only support a model where:

  • You run your own scheduler by itself
  • You run default scheduler + your custom extensions

### Recovering from kubelet rejection of pod
Again, for alpha phase, manual intervention will be required. For beta, we’ll
have to consider some sort of rollback mechanism for more robustness. Perhaps
this rollback mechanism can also be used for binding multiple PVCs in one
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be very careful about this rollback. It may be necessary in some cases, but it may destroy data in others.

Consider this flow:

  • Two pods want to use the same RWX PVC.
  • Scheduler binds the first pod to a node and binds its PVC to a LocalVolume there.
  • Scheduler binds the second pod to the same node (PVC is already bound there)
  • Kubelet rejects the first pod.
  • Kubelet starts the second pod.
  • Scheduler must not roll back anything!

Again, some design here would be nice. Should we mark the PVCs as "clean" during binding and "dirty" when they're admitted by kubelet? Kubelet must make sure that it marked them as dirty before any pod containers are started and scheduler must not roll back any dirty PVCs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly. We need a way to know if it's safe to rollback a binding, and in my initial brainstorm, I was thinking of adding some additional state to the PVC/PV to mark as clean and dirty.

I think the rollback also needs to change the state machine too. We shouldn't need to go move to Retain or Released state during a rollback since the volume wasn't used. Although maybe if the volume was dynamically provisioned, then we should still delete it during the rollback.

Also this potentially could bring back the same issues that we had with the state changes for recycling? I'm not familiar with what the exact issues were.

Copy link
Member

@jsafrane jsafrane Sep 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that recycler (and deleter) is safe here, PV controller should be robust enough already.

On the other hand, any rollback in recycler scheduler may conflict with binding in PV controller. Scheduler sets pv.spec.claimRef and leaves the rest to PV controller, right? PV controller will save pvc.spec.volumeName, update status and some annotations. In case the scheduler wants to revert the binding, clearing pv.spec.claimRef won't be enough, it needs to clear everything. And it will race with the PV controller trying to fix the binding, it tries to fix the status and binding if something overwrites it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yah we will need to add some new state so that the PV controller will not try to fix the binding on a rollback.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

edit: rollback in recycler -> rollback in scheduler

```
1. For static PV binding:
1. Prebind the PV by updating the `PersistentVolume.ClaimRef` field.
2. If the prebind fails, revert the cache updates.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is pod.spec.nodeName already set? Will it be reverted? It probably should.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't tell if you mean the pod's node name is set in the API. If so, that's irreversible

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We won't let the pod reach node binding/assume until all PVCs are bound.

But I thought of a potential problem. If a user sets pod.spec.NodeName, does it bypass the scheduling? Then it would not trigger binding code and PVCs would never bind. Hmmmmm...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another hole is DaemonSets, which does its own scheduling.

```
1. For static PV binding:
1. Prebind the PV by updating the `PersistentVolume.ClaimRef` field.
2. If the prebind fails, revert the cache updates.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand it correctly, at least in alpha it won't roll back any previously saved PVs, right? Does it then revert this binding from cache? It should probably stay there.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We won't rollback any API updates. But it will rollback any cache-only updates made in AssumePVCs. I'm thinking the rollback can be done by reloading the latest API object into the cache.

6. Return true if all PVCs are matched.
7. If there are still unmatched PVCs, check if dynamic provisioning is possible.
For this alpha phase, the provisioner is not topology aware, so the predicate
will just return true if there is a provisioner specified (internal or external).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this mean ?

if there is a provisioner specified

Do you mean the pvc provisioner annotation is set ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means that:

  1. The StorageClass exists.
  2. The provisioner field is set.

scheduler assumes and binds the Pod to a Node.
11. Kubelet starts the Pod.

This new workflow will have the scheduler handle unbound PVCs by choosing PVs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moving the binding responsibility from the binder controller into the scheduler means you no longer have a single writer, and that all external schedulers must also implement PVC binding, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if you use a custom scheduler, and not the default scheduler, then you will have to pull in the PV matching and binding logic into your own scheduler. We can try to modularize the library calls as much as possible to make it easier to import the methods. Note that we don't pull in the entire binding state machine into the scheduler. The scheduler's responsibilities are to:

  1. Find an appropriate PV for unbound PVCs
  2. Initiate (but not complete) the binding by setting the PV.ClaimRef.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scheduler's responsibilities are to:

Find an appropriate PV for unbound PVCs
Initiate (but not complete) the binding by setting the PV.ClaimRef.

the split of binding responsibilities definitely makes the PV binder more difficult to reason about...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about this?

  • The scheduler handles choosing an appropriate PV for the PVC based on all the pod's requirements. It's like a user manually picking their volumes, except automated.
  • The PV controller handles the mechanics of binding the PVC and PV together, and the cleanup lifecycle.

Instead, a new bind function, BindPVCs, will be called asynchronously, passing
in the selected node. The bind function will prebind the PV to the PVC, or
trigger dynamic provisioning, and then wait until the PVCs are bound
successfully or encounter failure. Then, it always sends the Pod through the
Copy link
Member

@bsalamat bsalamat Sep 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, scheduler revisits scheduling of the pod to the node in a second round. What if the pod is no longer schedulable on the node, because, for example, a different pod is already scheduled there and the node no longer has enough resources. Doesn't this cause PV stranding and starvation of the pod?

Copy link
Member Author

@msau42 msau42 Sep 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that is possible with this design (and also is the current behavior). One possible fix that we're thinking of for beta, is to implement a PVC binding "rollback" if the volumes have not been mounted by any pods yet.

Also in general for local storage once it's bound, since your node choices get constrained to 1, we are recommending that those pods have high priority to avoid starvation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think having a way of "unbinding" volumes is necessary, but even then the logic would be a bit complex. It will be something like this:

  1. Try to find a node for the pod.
  2. Bind the volumes.
  3. Try to schedule the pod again (probably by recording which node was picked at step 1 and try that node)
  4. If step 3 fails, unbind the volumes and go to 1 (after wiping the previously chosen node)

Step 3 is not an orthodox way of scheduling. It probably needs some changes to the guts of scheduler.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need to remember though, that rolling back a volume binding is a lot harder and should be less common than rolling back pod bindings. Once a pod starts using the volume we cannot unbind it, even if scheduling fails.

For zonal storage, failing scheduling after binding is more rare because you have more nodes to choose from. For local storage, it's more likely since there's only one node, so your pods should be higher priority to force preemption

* The more constraints you add to your pod, the less flexible it becomes
in terms of placement. Because of this, tightly constrained storage, such as
local storage, is only recommended for specific use cases, and the pods should
have high priority.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/high priority./higher scheduling priority than unconstrained pods./


## Problem
Volumes can have topology constraints that restrict the set of nodes that the
volume is accessible from. For example, a GCE PD can only be accessed from a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/the volume is accessible from./can mount them./

alternatively /can access them./

## Problem
Volumes can have topology constraints that restrict the set of nodes that the
volume is accessible from. For example, a GCE PD can only be accessed from a
single zone, and a local disk can only be accessible from a single node. In the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/accessible/accessed/

1. How close the PVC’s requested capacity and PV’s capacity are.
2. Matching static PVs is preferred over dynamic provisioning.

TODO (beta): figure out weights and exact calculation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming that multiple node fit the constraints and are all equal in weight, is the ranked order of these nodes random or deterministically based on some identifier?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Today, the scheduler does a round robin selection of nodes with equal priority by incrementing an index.

1. For each Node, get the cached PV matches for the Pod’s PVCs.
2. Compute a priority score for the Node using the following factors:
1. How close the PVC’s requested capacity and PV’s capacity are.
2. Matching static PVs is preferred over dynamic provisioning.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are static PVs preferred over dynamic?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's assuming that you had specifically statically created the PVs for your application.


TODO: flow chart

The proposed new workflow for volume binding is:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this assume that workload spreading across failure domains will be achieved by anti-affinity predicates?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

* The scheduler itself only handles one pod at a time. It’s possible the
two pods may not run at the same time either, so there’s no guarantee that you
will know both pod’s requirements at once.
* For two+ pods simultaneously sharing a PVC, this scenario may require an
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This applies to two Pods sharing a PVC bound to a local volume correct? If the volume is remote and zonal than the only issue would be if you wanted to mount read-many volumes across zones. I would think specifying that combination of affinity predicate and storage would be a user error. For two Pods sharing the same local volume, couldn't a user achieve the same results by combining the the containers of both into a single Pod? If all the containers must reference the same local storage, then they must be scheduled on the same Node anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'm mainly referring to local volumes, and I agree, the simplest scenario is to combine all your containers into one Pod so that they will be scheduled together. But for whatever reason, if a user may want two different Pods (maybe they're not running simultaneously), then this is trying to say that we can't guarantee scheduling of the second Pod if its scheduling constraints are not compatible with the first Pod.

We definitely want to support the use case of specifying pod anti-affinity/affinity along with local storage, and having the volume binding take that into account (spread my pods and local storage across nodes/racks/zones).

@msau42
Copy link
Member Author

msau42 commented Sep 29, 2017

/assign @bsalamat

to be bound. The bound assumption needs to be changed in order to work with
this new workflow.

TODO: how to handle race condition of PVCs becoming bound in the middle of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you mentioned, running PVC binding in parallel with the scheduler will introduce a race condition and ordering of the predicates will not address the problem. IIUC, synchronization will be needed and synchronization could significantly impact scheduler's performance.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One possible way to deal with this is to mark the Pod at the very beginning of scheduling if any PVCs are unbound. Then at the end, we can check for this to determine if we need to send the Pod through a second pass.

schedulable, and it does this by re-evaluating predicates without the lower
priority Pod.

If we had put the MatchUnboundPVCs predicate at the end, then pod preemption
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can possibly be addressed by registering extra functions that preemption logic should call to filter nodes further. This might be an easier change compared to the current proposal.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think that would eliminate the need for some of the caching from the current design. But it will still require a second pass afterwards.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right. The second pass will still be needed.

clearly communicate these changes to the community before changing the behavior
by default in GA.

### Integrating binding with scheduling
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From software engineering point of view, mixing the responsibilities of modules and removing the isolation of responsibilities is not a great practice. With this design, the scheduler process would perform volume binding which is not one of the usual responsibilities of a cluster scheduler. Other than removing the modularity, this could introduce runtime problems as well. For example, a null dereference in volume binding logic will cause a scheduler crash. High memory or CPU consumption in volume binding logic could cause scheduler to run out of memory or starve.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me clarify that I am not saying that we shouldn't move forward with this design, but when we consider pros and cons of various approaches, this should be considered as a negative point of this design.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we do need a tighter coupling (somehow) between scheduler and volume binding in order to address the problems I outlined in the beginning. At a high level:

  • Scheduler needs to be aware of a volume's topology wrt the node when it is scheduling a pod
  • The volume binding decision needs to respect the pod's other scheduling constraints

This design proposes a very tight coupling by putting the volume binding directly into the scheduler. You are correct in that it's expanding the scheduler's responsibilities to more than just Pod->Node binding. The major risk I see with this current approach is that it's trying to add in the volume binding using the existing scheduling mechanics, which may be designed with the assumption that Pod->Node binding is the only thing the scheduler does. So this approach breaks when you consider the scenario where volumes are unbound, but the user has already manually scheduled the Pod to a Node. Fixing that I think is going to require some fundamental changes in the scheduler, which could greatly increase the scope of this design. But it may be something still worth considering, especially if more objects also need to integrate their binding with scheduling. I know @vishh and @yguo0905 are also looking into extending the scheduler for binding other resources (although may not be as complex as volume binding), and they may also run into the same issues as me.

I did also consider an alternate approach with loose coupling, where it tries to keep scheduler and PV controller separate. It's similar to this alternative discussion. The basic idea is the PV controller waits for the Pod.Spec.NodeName to be set before it does the binding. However, even though the binding process is decoupled from the scheduler, the binding decision is much more prone to split brain problems. We still need a new scheduler predicate so that the scheduler can pick an appropriate node, but because the binding happens completely asynchronously in the PV controller, it's possible that the actual binding will pick a different volume, or fail completely. Then you need a way to tell the scheduler to revert its PV cache update, and also a way for the PV controller to tell kubelet to reject the Pod so that it can be sent back through scheduling. So while this approach keeps the binding process and code separate, I found that it:

  • is more prone to split brain failures due to scheduler and PV controller trying to make the same decisions, but operating on different and out-of-sync caches.
  • required very tight, complex, and delicate coordination between scheduler, PV controller, and kubelet processes to handle recovering from these split brain failures.

I am open to revisiting this alternative if you think it looks more promising than the current approach, but from my initial investigation, it seemed much more complex and fragile.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this design, the scheduler process would perform volume binding which is not one of the usual responsibilities of a cluster scheduler.

I am not sure what "cluster scheduler" is, however with this PR we're moving from Pod scheduler to Pod and PV scheduler. And I am sure more resources will come in future.

E.g. setting pod.spec.nodeName should not bypass the scheduler as it should schedule the storage too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could benefit from extending the scheduler to support multiple object scheduling/binding in a modular manner. Each module would need to:

  • Given a Pod object, evaluate if it needs to run through the scheduler
  • Register assume/bind functions
  • Register rollback function

Of course, there is a ton of detail that needs to be worked out

@msau42 msau42 force-pushed the storage-topology-design branch from 5ad3af2 to 9ec7619 Compare October 3, 2017 21:17
…e approach and its downsides, add more details about binding rollback using PV taints
@msau42 msau42 force-pushed the storage-topology-design branch from 9ec7619 to 6e486a3 Compare October 3, 2017 21:21
@msau42
Copy link
Member Author

msau42 commented Oct 3, 2017

Updated to add downsides of proposed approach, more details about the alternative approach, and an idea on how to handle PV binding rollback.

@bsalamat
Copy link
Member

bsalamat commented Oct 4, 2017

As discussed offline and in order to unblock SIG Storage, I LGTM this PR. We will hopefully have a better mechanism in scheduler that allows an easier and cleaner way of performing post scheduling tasks, such as volume or TPU binding.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 4, 2017
@k8s-github-robot
Copy link

Automatic merge from submit-queue.

@k8s-github-robot k8s-github-robot merged commit f20eb8c into kubernetes:master Oct 4, 2017
@msau42
Copy link
Member Author

msau42 commented Oct 5, 2017

Ah sorry, there were still some TODO items that I still need to sort out, especially regarding how to handle scheduler bypass scenarios. Please continue commenting/discussing.

@bsalamat
Copy link
Member

bsalamat commented Oct 5, 2017

Yeah, I was not expecting it to be auto-merged upon my lgtm. Anyway, you may want to send another PR to address the remaining items.

MadhavJivrajani pushed a commit to MadhavJivrajani/community that referenced this pull request Nov 30, 2021
Automatic merge from submit-queue.

Volume topology aware scheduling design

Proposal for a smarter scheduler that influences PV binding.

Part of kubernetes/enhancements#121

/sig storage
/sig scheduling
/cc @kubernetes/sig-storage-proposals @kubernetes/sig-scheduling-proposals
danehans pushed a commit to danehans/community that referenced this pull request Jul 18, 2023
Responsibility disclosures and reporting policies should be available in all Istio repos.
Currently we have it only in https://github.com/istio/istio/security

Signed-off-by: Faseela K <faseela.k@est.tech>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/design Categorizes issue or PR as related to design. lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.