
fix: support scale from zero in the proxy scheduler #203

Merged
merged 1 commit into admiraltyio:master on May 26, 2024

Conversation

@marwanad (Contributor) commented Jan 3, 2024

This came up while testing a setup with a target cluster that autoscales a GPU pool from zero and a pod requesting GPU resources. The pod remains pending with:

    Warning  FailedScheduling   3m52s       admiralty-proxy     0/5 nodes are available: 1 Insufficient nvidia.com/gpu

This is because the virtual node representing the target cluster doesn't have the GPU capacity populated (it can't fetch it anyway, because the target cluster has 0 nodes). The expectation is that the proxy scheduler would still create the pod chaperon, the autoscaler would kick in, and the rest of the binding process would happen, but that isn't the case.

On investigation, it seems that the Filter extension point in the proxy scheduler never gets to execute. The reason is that the scheduler configuration uses the multiPoint extension point. To quote the docs:

Starting from kubescheduler.config.k8s.io/v1beta3, all default plugins are enabled internally through MultiPoint.

This means that all of NodePorts, PodTopologySpread, VolumeBinding, NodeResourcesFit, and other default plugins are executed. The NodeResourcesFit plugin would reject the pod at the PreFilter step.
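
For reference, the current config is roughly along these lines (a sketch from memory; the exact API version and values in the chart may differ):

    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    profiles:
      - schedulerName: admiralty-proxy
        plugins:
          multiPoint:
            enabled:
              - name: proxy # added at every extension point the plugin implements,
                            # on top of all default plugins enabled internally via MultiPoint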

There are a few things I attempted here:

  1. Implement the PreFilter extension point in the custom scheduler and return nil, framework.Success. This doesn't work because every PreFilter plugin must return success or the pod gets rejected, so if any of the default plugins fails, the pod is rejected regardless of what our plugin returns. It seems we need to explicitly disable the failing plugin.
  2. Instead of using the multiPoint config, explicitly define in the config the extension points our plugin implements (preFilter, filter, reserve, preBind, score) and enable the proxy plugin for each of them (see the sketch after this list). That way, the default plugins won't be added. The only downside is that for every new extension point we implement, we'd need to modify the config. I don't foresee these changing often.
  3. Disable NodeResourcesFit explicitly (and potentially other default plugins). This PR only does it for NodeResourcesFit, to enable scaling from zero, but I think we could expand the list to cover all of them.
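
For illustration, option 2 could look roughly like the sketch below (untested). The scheduler and plugin names follow the existing config; the '*' disable under multiPoint and the re-enabled PrioritySort/DefaultBinder entries are my guess at what's needed to drop the other defaults while still keeping a working queueSort and bind step:

    profiles:
      - schedulerName: admiralty-proxy
        plugins:
          multiPoint:
            disabled:
              - name: '*'            # drop the defaults that MultiPoint would enable internally
          queueSort:
            enabled:
              - name: PrioritySort   # a queueSort plugin is still required
          bind:
            enabled:
              - name: DefaultBinder  # a bind plugin is still required
          preFilter:
            enabled:
              - name: proxy
          filter:
            enabled:
              - name: proxy
          reserve:
            enabled:
              - name: proxy
          preBind:
            enabled:
              - name: proxy
          score:
            enabled:
              - name: proxy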

I'm indifferent between 2 and 3. To me, the purpose of the proxy scheduler is to handle the chaperoning and report the status back from the real schedulers in the target cluster, so 2 might also make sense here if we don't care about any of the default scheduler plugins.

I believe this is also what #202 is seeing: since they're attempting to scale from 0, the virtual node will never see the GPU allocatable, and the PreFilter step will fail on NodeResourcesFit in the proxy scheduler.

@marwanad changed the title from "fix: Support scale from zero" to "fix: support scale from zero in the proxy scheduler" on Jan 3, 2024
@marwanad (Contributor, Author) commented Jan 3, 2024

@adrienjt curious what your thoughts are, or if you see a better way to support this use case.

@adrienjt (Contributor)

Thank you for this PR.

I don't think that the NodeResourcesFit plugin fails at the PreFilter step, because it appears to only save data and cannot return any error.

However, the order of the Filter plugins matters, and it's possible that the NodeResourcesFit plugin runs before our plugin.

So we could reorder the plugins, giving our plugin a chance to send a candidate, and the NodeResourcesFit Filter plugin would eventually succeed when resources are reconciled on the virtual node (candidates survive scheduling cycles). We could also disable the NodeResourcesFit Filter step (not very useful as you noted), but I wouldn't want to disable the whole plugin, because we actually need the Score step to implement the LeastAllocated/MostAllocated bin-packing strategies.

Indeed, we need to keep a lot of the default plugins, so the scheduler config needs to be crafted carefully, ideally without repeating the default config, to reduce maintenance cost.

I think (not tested) that to make the proxy plugin Filter step run first, the config would look like this:

    profiles:
      - schedulerName: admiralty-proxy
        plugins:
          multiPoint:
            enabled:
              - name: proxy
          filter:
            enabled:
              - name: proxy

And to disable the NodeResourcesFit Filter step, the config would look like this:

    profiles:
      - schedulerName: admiralty-proxy
        plugins:
          multiPoint:
            enabled:
              - name: proxy
          filter:
            disabled:
              - name: NodeResourcesFit

@adrienjt adrienjt enabled auto-merge May 26, 2024 20:24
@adrienjt adrienjt merged commit b704bb7 into admiraltyio:master May 26, 2024
6 checks passed