
[bug] Pipelines generated from kfp 2.10 ignore accelerator #11374

Closed
strangiato opened this issue Nov 13, 2024 · 6 comments · Fixed by #11373

@strangiato

When executing or compiling a pipeline using the 2.10 kfp SDK with the following configuration:

task.set_accelerator_type("nvidia.com/gpu").set_accelerator_limit("1")

The pipeline server ignores the GPU option, and the task is scheduled without the GPU in its resource configuration.

This appears to be a breaking change introduced in 2.10.

Environment

  • How do you deploy Kubeflow Pipelines (KFP)?
    Red Hat OpenShift AI

  • KFP version:

  • KFP SDK version:

$ pip list | grep kfp  
kfp                      2.10.0
kfp-pipeline-spec        0.4.0
kfp-server-api           2.3.0

Steps to reproduce

  1. Create a python virtual environment
python -m venv venv
source venv/bin/activate
  2. Install kfp 2.10
pip install kfp==2.10
  3. Create the following pipeline with the file name acc-test.py
from kfp import dsl, compiler

@dsl.component()
def empty_component():
    pass

@dsl.pipeline(name='pipeline-accel')
def pipeline_accel():
    task = empty_component()
    task.set_accelerator_type("nvidia.com/gpu").set_accelerator_limit("1")

if __name__ == "__main__":
    compiler.Compiler().compile(pipeline_accel, 'pipeline.yaml')
  4. Compile the pipeline
python acc-test.py
  5. Upload the pipeline and trigger an execution.

The pods created for the step will not include the nvidia.com/gpu resource in the pod spec, and the pod will be scheduled on a non-GPU node.
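
As a quick sanity check, here is a minimal sketch (not part of the original reproduction, and assuming the compiled IR layout shown under Materials and reference below) that prints the accelerator block for each executor in the compiled pipeline.yaml:

import yaml

with open("pipeline.yaml") as f:
    spec = yaml.safe_load(f)

# Executors live under deploymentSpec.executors in the compiled IR.
for name, executor in spec.get("deploymentSpec", {}).get("executors", {}).items():
    accelerator = executor.get("container", {}).get("resources", {}).get("accelerator", {})
    # kfp 2.9 emits  {'count': '1', 'type': 'nvidia.com/gpu'}
    # kfp 2.10 emits {'resourceCount': '1', 'resourceType': 'nvidia.com/gpu'}
    print(name, accelerator)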

Expected result

The pod should include the resources definition for the GPU and should be scheduled on a GPU node.

resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1

Materials and reference

It looks like the bug was likely introduced in:
#11097

When compiling the pipeline with 2.10, it renders the following:

        resources:
          accelerator:
            resourceCount: '1'
            resourceType: nvidia.com/gpu

With older versions, such as 2.9, it renders the following:

        resources:
          accelerator:
            count: '1'
            type: nvidia.com/gpu
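
Until a patched SDK release is available, one possible stopgap is to post-process the compiled YAML and map the new key names back to the old ones before uploading. This is an untested sketch, assuming the 2.3 server only understands the older count/type keys:

import yaml

with open("pipeline.yaml") as f:
    spec = yaml.safe_load(f)

for executor in spec.get("deploymentSpec", {}).get("executors", {}).values():
    accelerator = executor.get("container", {}).get("resources", {}).get("accelerator")
    if not accelerator:
        continue
    # Rename the 2.10 keys back to the names the older server expects.
    if "resourceType" in accelerator:
        accelerator["type"] = accelerator.pop("resourceType")
    if "resourceCount" in accelerator:
        accelerator["count"] = accelerator.pop("resourceCount")

with open("pipeline.yaml", "w") as f:
    yaml.safe_dump(spec, f)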

Fix in progress:

#11373

Impacted by this bug? Give it a 👍.

@HumairAK
Collaborator

Thanks Trevor. This is now resolved, and we'll do a patch release on 2.10 to pull in these changes.

@vanHavel

Could it be that the same issue also applies to CPU requests and limits?
After updating to kfp SDK 2.10, we saw pods running with no resources on kfp server 2.3.
It seems that the keys for CPU/RAM requests and limits in the generated pipeline spec have changed as well (probably in #11097).
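
For reference, a minimal sketch (analogous to the accelerator reproducer above, not taken from a real workload) that should show the same behavior: compile it with 2.9 and 2.10 and compare the CPU/memory request and limit keys in the generated deploymentSpec.

from kfp import dsl, compiler

@dsl.component()
def empty_component():
    pass

@dsl.pipeline(name='pipeline-cpu-mem')
def pipeline_cpu_mem():
    task = empty_component()
    # Same pattern as the accelerator case, but for CPU and memory.
    task.set_cpu_request('500m').set_cpu_limit('1')
    task.set_memory_request('512Mi').set_memory_limit('1G')

if __name__ == '__main__':
    compiler.Compiler().compile(pipeline_cpu_mem, 'pipeline-cpu-mem.yaml')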

@gregsheremeta
Contributor

Could it be that the same issue also applies to CPU requests and limits?

yep. Good catch. 83dcf1a

Any interest in submitting a fix? You can use https://github.com/kubeflow/pipelines/pull/11373/files as an example.

@vanHavel

I took a brief look at this; unfortunately, it doesn't seem to be as simple as just bringing back the old fields.
The data type has also changed from numbers to strings, so I believe one would need to bring back the old validation/conversion logic as well.
I'm afraid I won't have enough time to work on this, and I also lack some of the context on why these things were changed and don't want to break something else downstream.

@rimolive
Member

I opened #11390 as a follow-up for the CPU/memory requests/limits; I'll see if I can work on a fix very soon.

@gregsheremeta
Contributor

I took a brief look at this

@vanHavel thanks for trying! I appreciate it 😄
