
[bug] Pipelines generated from kfp 2.10 ignore accelerator #11374

Closed
strangiato opened this issue Nov 13, 2024 · 6 comments · Fixed by #11373

@strangiato

When executing or compiling a pipeline using the 2.10 kfp SDK with the following configuration:

task.set_accelerator_type("nvidia.com/gpu").set_accelerator_limit("1")

The pipeline server ignores the GPU option, and the task is scheduled without the GPU in its resource configuration.

This appears to be a breaking change introduced in 2.10.

Environment

  • How do you deploy Kubeflow Pipelines (KFP)?
    Red Hat OpenShift AI

  • KFP version:

  • KFP SDK version:

$ pip list | grep kfp  
kfp                      2.10.0
kfp-pipeline-spec        0.4.0
kfp-server-api           2.3.0

Steps to reproduce

  1. Create a python virtual environment
python -m venv venv
source venv/bin/activate
  2. Install kfp 2.10
pip install kfp==2.10
  3. Create the following pipeline with the file name acc-test.py
from kfp import dsl, compiler

@dsl.component()
def empty_component():
    pass

@dsl.pipeline(name='pipeline-accel')
def pipeline_accel():
    task = empty_component()
    task.set_accelerator_type("nvidia.com/gpu").set_accelerator_limit("1")

if __name__ == "__main__":
    compiler.Compiler().compile(pipeline_accel, 'pipeline.yaml')
  4. Compile the pipeline
python acc-test.py
  5. Upload the pipeline and trigger an execution.

The pods created for the step will not include the nvidia.com/gpu resource in the pod spec, and the pod will be scheduled on a non-GPU node.
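
As a quick sanity check, here is a minimal sketch (not part of the original reproduction, and assuming the compiled IR layout shown under Materials and reference below) that prints the accelerator block for each executor in the compiled pipeline.yaml:

import yaml

with open("pipeline.yaml") as f:
    spec = yaml.safe_load(f)

# Executors live under deploymentSpec.executors in the compiled IR.
for name, executor in spec.get("deploymentSpec", {}).get("executors", {}).items():
    accelerator = executor.get("container", {}).get("resources", {}).get("accelerator", {})
    # kfp 2.9 emits  {'count': '1', 'type': 'nvidia.com/gpu'}
    # kfp 2.10 emits {'resourceCount': '1', 'resourceType': 'nvidia.com/gpu'}
    print(name, accelerator)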

Expected result

The pod should include the resources definition for the GPU and should be scheduled on a GPU node.

resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1

Materials and reference

It looks like the bug was likely introduced in:
#11097

When compiling the pipeline with 2.10, it renders the following:

        resources:
          accelerator:
            resourceCount: '1'
            resourceType: nvidia.com/gpu

With older versions, such as 2.9, it renders the following:

        resources:
          accelerator:
            count: '1'
            type: nvidia.com/gpu
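
Until a patched SDK release is available, one possible stopgap is to post-process the compiled YAML and map the new key names back to the old ones before uploading. This is an untested sketch, assuming the 2.3 server only understands the older count/type keys:

import yaml

with open("pipeline.yaml") as f:
    spec = yaml.safe_load(f)

for executor in spec.get("deploymentSpec", {}).get("executors", {}).values():
    accelerator = executor.get("container", {}).get("resources", {}).get("accelerator")
    if not accelerator:
        continue
    # Rename the 2.10 keys back to the names the older server expects.
    if "resourceType" in accelerator:
        accelerator["type"] = accelerator.pop("resourceType")
    if "resourceCount" in accelerator:
        accelerator["count"] = accelerator.pop("resourceCount")

with open("pipeline.yaml", "w") as f:
    yaml.safe_dump(spec, f)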

Fix in progress:

#11373

Impacted by this bug? Give it a 👍.

@HumairAK
Collaborator

Thanks Trevor. This is now resolved, and we'll do a patch release on 2.10 to pull in these changes.

@vanHavel

Could it be that the same issue also applies to CPU requests and limits?
After updating to kfp SDK 2.10, we saw pods running with no resources on kfp server 2.3.
It seems that the keys for CPU/RAM requests and limits in the generated pipeline spec have changed as well (probably in #11097).
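
For reference, a minimal sketch (analogous to the accelerator reproducer above, not taken from a real workload) that should show the same behavior: compile it with 2.9 and 2.10 and compare the CPU/memory request and limit keys in the generated deploymentSpec.

from kfp import dsl, compiler

@dsl.component()
def empty_component():
    pass

@dsl.pipeline(name='pipeline-cpu-mem')
def pipeline_cpu_mem():
    task = empty_component()
    # Same pattern as the accelerator case, but for CPU and memory.
    task.set_cpu_request('500m').set_cpu_limit('1')
    task.set_memory_request('512Mi').set_memory_limit('1G')

if __name__ == '__main__':
    compiler.Compiler().compile(pipeline_cpu_mem, 'pipeline-cpu-mem.yaml')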

@gregsheremeta
Contributor

Could it be that the same issue also applies to CPU requests and limits?

yep. Good catch. 83dcf1a

Any interest in submitting a fix? You can use https://github.com/kubeflow/pipelines/pull/11373/files as an example.

@vanHavel

I took a brief look at this; unfortunately, it doesn't seem to be as simple as just bringing back the old fields.
The data type has also changed from numbers to strings, so I believe one would need to bring back the old validation/conversion logic as well.
I'm afraid I won't have enough time to work on this, and I also lack some of the context on why these things were changed and don't want to break something else downstream.

@rimolive
Member

I opened #11390 as a follow-up for the CPU/memory requests/limits; I'll see if I can work on a fix very soon.

@gregsheremeta
Contributor

I took a brief look at this

@vanHavel thanks for trying! I appreciate it 😄
