fix: support scale from zero in the proxy scheduler #203
This came up while testing a setup with a target cluster that autoscales a GPU pool from zero and a pod requesting GPU resources. The pod remains pending with:
This is because the virtual node representing the target cluster doesn't have the GPU capacity populated (it can't fetch it anyway, because the target cluster has 0 nodes). The expectation was that the proxy-scheduler would still create the pod chaperon, the autoscaler would kick in, and the rest of the binding process would happen, but that wasn't the case.
On investigation, it seems that the `Filter` extension point in the proxy scheduler never gets to execute. The reason is that the scheduler configuration uses the `multiPoint` extension point. To quote the docs:

> This means that all of `NodePorts`, `PodTopologySpread`, `VolumeBinding`, `NodeResourcesFit`, and other plugins are executed.

The `NodeResourcesFit` plugin would reject the pod at the `PreFilter` step.

There's a few things I attempted here:

1. Implementing the `PreFilter` extension point in the custom scheduler and returning `nil, framework.Success`. This won't work because all `PreFilter` plugins must return success or the pod gets rejected, so if any of the default plugins fail, the pod is rejected. It seems we need to explicitly disable the failing plugin.
2. Instead of the `multiPoint` config, explicitly define the extension points we implement in the plugin in the config (`preFilter`, `filter`, `reserve`, `preBind`, `score`) and enable the proxy plugin for them. That way, the default plugins won't be added. The only downside is that for every new extension point we add, we'd need to modify the config. I don't foresee these changing often.
3. Disabling `NodeResourcesFit` explicitly (and potentially other default plugins). This PR just does it for `NodeResourcesFit` to enable scaling from zero, but I think we could expand the list to cover all of them.

I'm indifferent between 2 and 3. To me, it seems that the purpose of the proxy scheduler is to handle the chaperoning and report the status back from the real schedulers in the target cluster, so 2 might also make sense here if we don't care about any of the default scheduler plugins.
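For concreteness, a minimal sketch of what option 3 could look like in the scheduler configuration — the profile name `proxy-scheduler` and plugin name `ProxyScheduler` are assumptions for illustration, not the actual names registered by this repo:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: proxy-scheduler    # assumed profile name
    plugins:
      multiPoint:
        enabled:
          - name: ProxyScheduler      # assumed plugin name for the proxy plugin
        disabled:
          - name: NodeResourcesFit    # skip resource-fit checks so the chaperon is created
```

With `NodeResourcesFit` disabled, the pod isn't rejected at `PreFilter` just because the virtual node reports no GPU allocatable, so the chaperon can be created and the target cluster's autoscaler can scale the pool up from zero.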
I believe this is also what #202 is seeing: since they're attempting to scale from 0, the v-node will never see the GPU allocatable, and the `PreFilter` step will fail on `NodeResourcesFit` in the proxy scheduler.