Add caching overview
msau42 committed Nov 1, 2017
1 parent 31ed587 commit ec4f521
Showing 1 changed file with 31 additions and 18 deletions.
49 changes: 31 additions & 18 deletions contributors/design-proposals/storage/volume-topology-scheduling.md
@@ -160,8 +160,7 @@ mark the PVs with the chosen PVCs.
9. **NEW:** If PVC binding or provisioning is required, we do NOT AssumePod.
Instead, a new bind function, BindPVCs, will be called asynchronously, passing
in the selected node. The bind function will prebind the PV to the PVC, or
trigger dynamic provisioning, and then wait until the PVCs are bound
successfully or encounter failure. Then, it always sends the Pod through the
trigger dynamic provisioning. Then, it always sends the Pod through the
scheduler again for reasons explained later.
10. When a Pod makes a successful scheduler pass once all PVCs are bound, the
scheduler assumes and binds the Pod to a Node.
@@ -217,7 +216,7 @@ MatchUnboundPVCs(pod *v1.Pod, node *v1.Node) (canBeBound bool, err error)
decreasing requested capacity.
3. Walk through all the PVs.
4. Find best matching PV for the PVC where PV topology is satisfied by the Node.
5. Temporarily cache this PV in the PVC object, keyed by Node, for fast
5. Temporarily cache this PV choice for the PVC per Node, for fast
processing later in the priority and bind functions.
6. Return true if all PVCs are matched.
7. If there are still unmatched PVCs, check if dynamic provisioning is possible.
@@ -226,8 +225,6 @@ will just return true if there is a provisioner specified in the StorageClass
(internal or external).
8. Otherwise return false.
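
The static matching steps above can be sketched in Go roughly as follows. This is only an illustration under stated assumptions: the `PVC`, `PV`, and `Node` types are simplified, hypothetical stand-ins for the API objects, topology is reduced to a single `Zone` string, and "best matching" is assumed to mean the smallest PV that satisfies the request. The dynamic provisioning check in step 7 is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// PVC, PV and Node are simplified, hypothetical stand-ins for the API types.
type PVC struct {
	Name    string
	Request int64 // requested capacity
}

type PV struct {
	Name     string
	Capacity int64
	Zone     string // assumption: topology reduced to one zone label
}

type Node struct {
	Name string
	Zone string
}

// matchUnboundPVCs returns the PV chosen for each PVC on the given node and
// whether every PVC found a match. The returned map is what a per-node cache
// could store for the later priority and bind functions.
func matchUnboundPVCs(pvcs []PVC, pvs []PV, node Node) (map[string]string, bool) {
	// Sort a copy of the PVCs by decreasing requested capacity.
	sorted := append([]PVC(nil), pvcs...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Request > sorted[j].Request
	})

	chosen := make(map[string]string) // PVC name -> PV name
	used := make(map[string]bool)     // PVs already picked for this pod
	for _, pvc := range sorted {
		best := -1
		// Walk through all the PVs and keep the smallest one that fits.
		for i, pv := range pvs {
			if used[pv.Name] || pv.Zone != node.Zone || pv.Capacity < pvc.Request {
				continue
			}
			if best == -1 || pv.Capacity < pvs[best].Capacity {
				best = i
			}
		}
		if best == -1 {
			return chosen, false // an unmatched PVC remains
		}
		used[pvs[best].Name] = true
		chosen[pvc.Name] = pvs[best].Name
	}
	return chosen, true
}

func main() {
	pvcs := []PVC{{"data", 5}, {"logs", 10}}
	pvs := []PV{{"pv-small", 5, "zone-a"}, {"pv-big", 10, "zone-a"}, {"pv-far", 100, "zone-b"}}
	chosen, ok := matchUnboundPVCs(pvcs, pvs, Node{"node-1", "zone-a"})
	fmt.Println(chosen, ok)
}
```

Processing the largest request first keeps a small PVC from taking the only PV large enough for a bigger one.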

TODO: caching format and details

##### Priority
After all the predicates run, there is a reduced set of Nodes that can fit a
Pod. A new priority function will rank the remaining nodes based on the
@@ -263,11 +260,11 @@ AssumePVCs(pod *v1.Pod, node *v1.Node) (pvcBindingRequired bool, err error)
1. Get the cached matching PVs for the PVCs on that Node.
2. Validate the actual PV state.
3. Mark PV.ClaimRef in the PV cache.
4. Cache the PVs that need binding in the Pod object.
3. For in-tree and external dynamic provisioning:
1. Nothing.
1. Cache the PVCs that need provisioning in the Pod object.
4. Return true.


##### Bind
If AssumePVCs returns pvcBindingRequired, then the BindPVCs function is called
as a goroutine. Otherwise, we can continue with assuming and binding the Pod
@@ -283,10 +280,6 @@ BindUnboundPVCs(pod *v1.Pod, node *v1.Node) (err error)
1. For static PV binding:
1. Prebind the PV by updating the `PersistentVolume.ClaimRef` field.
2. If the prebind fails, revert the cache updates.
3. Otherwise, wait for the PVCs to be bound, PVC/PV object is deleted, or
PV.ClaimRef field is cleared. TODO: what if there's mix of static and dynamic?
Maybe we shouldn't wait and just always send it back through the scheduler, and
have a fast path through the predicate if binding is in progress.
2. For in-tree and external dynamic provisioning:
1. Set `annStorageProvisioner` on the PVC.
3. Send Pod back through scheduling, regardless of success or failure.
@@ -295,13 +288,6 @@ order to evaluate other volume predicates that require the PVC to be bound, as
described below.
2. In the case of failure, we want to retry binding/provisioning.
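
A rough Go sketch of the non-waiting bind flow for static PVs is shown here. It covers prebinding, reverting the cache on failure, and always requeueing the Pod. The `binding` and `hooks` types, the function-valued fields, and `errConflict` are hypothetical names introduced for illustration, and the `annStorageProvisioner` step for dynamic provisioning is omitted.

```go
package main

import (
	"errors"
	"fmt"
)

// binding pairs a PVC with the PV chosen for it (names only; hypothetical).
type binding struct{ PVC, PV string }

// hooks bundles the side effects so the flow can be exercised without an
// API server. All three are assumptions made for this sketch.
type hooks struct {
	prebind func(b binding) error // API update of PersistentVolume.ClaimRef
	revert  func(b binding)       // undo the cached (assumed) update
	requeue func(pod string)      // send the Pod back through scheduling
}

// errConflict is a made-up error used only by the demo below.
var errConflict = errors.New("conflict")

// bindUnboundPVCs prebinds each PV and never waits for the binding to
// complete. The Pod is requeued on both success and failure.
func bindUnboundPVCs(pod string, bindings []binding, h hooks) error {
	defer h.requeue(pod)
	for i, b := range bindings {
		if err := h.prebind(b); err != nil {
			// Revert the cache for the failed PV and for every PV whose
			// API update was not attempted.
			for _, rest := range bindings[i:] {
				h.revert(rest)
			}
			return fmt.Errorf("prebinding %s to %s: %w", b.PV, b.PVC, err)
		}
	}
	return nil
}

func main() {
	h := hooks{
		prebind: func(b binding) error {
			if b.PV == "pv-2" {
				return errConflict
			}
			return nil
		},
		revert:  func(b binding) { fmt.Println("revert", b.PV) },
		requeue: func(pod string) { fmt.Println("requeue", pod) },
	}
	err := bindUnboundPVCs("pod-1", []binding{{"pvc-1", "pv-1"}, {"pvc-2", "pv-2"}}, h)
	fmt.Println("error:", err)
}
```

Deferring the requeue call keeps the "regardless of success or failure" rule in one place.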

Note that for dynamic provisioning, we do not wait for the PVCs to be bound, so
the Pod will be sent through scheduling repeatedly until the PVCs are bound.
This is because there is no function call for dynamic provisioning, so if we
did wait, we could be waiting forever for the PVC to bind. It’s possible that
in the meantime, a user could create a PV that satisfies the PVC and doesn’t
need the dynamic provisioner anymore.

TODO: The PV controller has a high resync frequency. Do we need something
similar for the scheduler too?

@@ -353,6 +339,33 @@ evaluates the node affinity against the node’s labels to determine if the pod
can be scheduled on that node. If the volume is not bound, this predicate can
be ignored, as the binding logic will take into account the PV node affinity.

##### Caching
There are two new caches needed in the scheduler.

The first cache is for handling the PV/PVC API binding updates occurring
asynchronously with the main scheduler loop. `AssumePVCs` needs to store
the updated API objects before `BindUnboundPVCs` makes the API update, so
that future binding decisions will not choose any assumed PVs. In addition,
if the API update fails, the cached updates need to be reverted and restored
with the actual API object. The cache will return either the cached-only
object or the informer object, whichever is newer. Informer updates
will always override the cached-only object. The new predicate and priority
functions must get the objects from this cache instead of from the informer cache.
This cache only stores pointers to objects and most of the time will only
point to the informer object, so the memory footprint per object is small.
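
One possible shape for this first cache is sketched here in Go. It is only a sketch under assumptions: `PV` is a simplified stand-in for the API object, and the `AssumeCache` type and its method names are hypothetical, not taken from the scheduler code. Each entry holds two pointers, so an entry that has not been assumed points at the informer object twice.

```go
package main

import (
	"fmt"
	"sync"
)

// PV is a simplified, hypothetical stand-in for v1.PersistentVolume.
type PV struct {
	Name     string
	ClaimRef string // name of the PVC this PV is (pre)bound to, "" if none
}

// entry keeps pointers only: the object to hand out and the last object
// received from the informer.
type entry struct {
	latest *PV // informer object, or a cached-only (assumed) object
	api    *PV // last informer object, used when reverting
}

// AssumeCache sits in front of the informer cache.
type AssumeCache struct {
	mu      sync.Mutex
	entries map[string]*entry
}

func NewAssumeCache() *AssumeCache {
	return &AssumeCache{entries: make(map[string]*entry)}
}

// OnInformerUpdate always overrides any cached-only object.
func (c *AssumeCache) OnInformerUpdate(pv *PV) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[pv.Name] = &entry{latest: pv, api: pv}
}

// Assume stores an updated copy before the API update is made, so later
// binding decisions do not choose the same PV.
func (c *AssumeCache) Assume(pv *PV) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[pv.Name]
	if !ok {
		return fmt.Errorf("pv %q not found in cache", pv.Name)
	}
	e.latest = pv
	return nil
}

// Restore reverts an assumed object to the informer object, for example
// after a failed API update.
func (c *AssumeCache) Restore(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[name]; ok {
		e.latest = e.api
	}
}

// Get returns the object that predicates and priorities should use.
func (c *AssumeCache) Get(name string) (*PV, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[name]
	if !ok {
		return nil, false
	}
	return e.latest, true
}

func main() {
	c := NewAssumeCache()
	c.OnInformerUpdate(&PV{Name: "pv-1"})
	_ = c.Assume(&PV{Name: "pv-1", ClaimRef: "pvc-a"})
	pv, _ := c.Get("pv-1")
	fmt.Println("assumed claim:", pv.ClaimRef)
	c.Restore("pv-1")
	pv, _ = c.Get("pv-1")
	fmt.Println("claim after restore:", pv.ClaimRef)
}
```

The mutex is there because the bind goroutine and the main scheduler loop would both touch the cache.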

The second cache is for storing temporary state as the Pod goes from
predicates to priorities and then assume. This all happens serially, so
the cache can be cleared at the beginning of each pod scheduling loop. This
cache is used for:
* Indicating if all the PVCs are already bound at the beginning of the pod
scheduling loop. This is to handle situations where volumes may have become
bound in the middle of processing the predicates. We need to ensure that
all the volume predicates are fully run once all PVCs are bound.
* Caching the per-node PV matching decisions that the predicate made. This is
an optimization to avoid walking through all the PVs again in the priority and
assume functions.
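
A minimal Go sketch of this second cache follows, assuming a single scheduling loop so that no locking is needed. The `podBindingCache` type, the `binding` pair, and the method names are hypothetical and exist only for illustration.

```go
package main

import "fmt"

// binding pairs a PVC with the PV chosen for it (names only; hypothetical).
type binding struct{ PVC, PV string }

// podBindingCache holds temporary state for the Pod currently being
// scheduled. The loop is serial, so there is no mutex.
type podBindingCache struct {
	allBound bool                 // were all PVCs bound when the loop began?
	matches  map[string][]binding // predicate decisions, keyed by node name
}

// Reset clears the cache at the beginning of each pod scheduling loop and
// records whether all PVCs were already bound at that point.
func (c *podBindingCache) Reset(allBound bool) {
	c.allBound = allBound
	c.matches = make(map[string][]binding)
}

// SetMatches stores the predicate's PV choices for one node.
func (c *podBindingCache) SetMatches(node string, b []binding) {
	c.matches[node] = b
}

// Matches lets the priority and assume functions reuse the predicate's
// choices instead of walking through all the PVs again.
func (c *podBindingCache) Matches(node string) ([]binding, bool) {
	b, ok := c.matches[node]
	return b, ok
}

func main() {
	var c podBindingCache
	c.Reset(false)
	c.SetMatches("node-1", []binding{{"pvc-a", "pv-1"}})
	b, ok := c.Matches("node-1")
	fmt.Println(b, ok, c.allBound)
}
```

Recording `allBound` once at reset time means a volume that becomes bound in the middle of the predicates does not change how the rest of that loop treats the Pod.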

#### Performance and Optimizations
Let:
* N = number of nodes