
UAT test scenario to use GPU #128

Closed
nishant-dash opened this issue Oct 3, 2024 · 3 comments · Fixed by #139
Labels
enhancement (New feature or request)

Comments

@nishant-dash

Context

Can we have a UAT test (or set of tests) that runs a test workload utilizing one or more GPUs? Ideally this would cover both a notebook and a pipeline job.
It may be tricky to make it generic enough to run in any environment out of the box, but a test with minimal assumptions is better than no test at all.

This would be very helpful when running validation on various cloud deployments.

What needs to get done

  • A UAT test (or set of tests) that runs a test workload utilizing one or more GPUs.
  • Ideally, both a notebook and a pipeline job.

Definition of Done

Working UATs for notebooks and pipelines that successfully test GPU utilization (a minimal check is sketched below).
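
A minimal sketch, assuming TensorFlow with CUDA support is available in the notebook image, of the kind of check the notebook UAT could run; the tensor sizes and messages are illustrative only:

```python
import tensorflow as tf

# Fail fast if no GPU is visible to TensorFlow inside the notebook pod.
gpus = tf.config.list_physical_devices("GPU")
assert gpus, "UAT expects at least one visible GPU"

# Run a small matmul on the first GPU to confirm it is actually usable,
# not just advertised by the device plugin.
with tf.device("/GPU:0"):
    a = tf.random.uniform((1024, 1024))
    b = tf.random.uniform((1024, 1024))
    c = tf.matmul(a, b)

print("GPU matmul OK, result shape:", c.shape)
```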

nishant-dash added the enhancement (New feature or request) label on Oct 3, 2024

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6358.

This message was autogenerated

@orfeas-k
Contributor

orfeas-k commented Nov 1, 2024

Thank you for the proposal @nishant-dash. We will use this issue to implement the notebook itself. First, though, we will have to enable the driver so the notebook can run in an automated way, which means this will be worked on after #130, #131, and #132.

EDIT: The design should be linked to the KF113 epic.

@orfeas-k
Contributor

The proposed notebook in #139 uses the kfp SDK to create an experiment and run it (a rough sketch follows this list). The pipeline:

  • Schedules its runs on a GPU node. That means that if no NVIDIA GPU is available, the run's pod remains Pending, causing the test to time out and fail.
  • Runs code that uses a TensorFlow function to detect whether it can find a GPU on the node. If it doesn't find one, it raises an error, causing the run and therefore the test to fail.
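
A minimal sketch of what such a pipeline could look like, assuming the kfp v2 SDK, an in-cluster KFP endpoint reachable by Client(), and GPUs exposed via the NVIDIA device plugin as the nvidia.com/gpu resource; names such as gpu-check and gpu-uat are illustrative and not taken from #139:

```python
from kfp import dsl, Client


@dsl.component(base_image="tensorflow/tensorflow:2.15.0-gpu")
def check_gpu() -> str:
    """Raise if TensorFlow cannot see a GPU inside the pod."""
    import tensorflow as tf

    gpus = tf.config.list_physical_devices("GPU")
    if not gpus:
        raise RuntimeError("No GPU visible to TensorFlow inside the pod")
    return gpus[0].name


@dsl.pipeline(name="gpu-check")
def gpu_check_pipeline():
    task = check_gpu()
    # Request one GPU so the pod is only scheduled on a GPU node; without
    # such a node the pod stays Pending and the UAT eventually times out.
    task.set_accelerator_type("nvidia.com/gpu")
    task.set_accelerator_limit(1)


if __name__ == "__main__":
    client = Client()  # assumes in-cluster connection details
    client.create_run_from_pipeline_func(
        gpu_check_pipeline, arguments={}, experiment_name="gpu-uat"
    )
```

The GPU request and the in-component check cover the two failure modes described above: an unschedulable pod times out, and a schedulable pod without a visible GPU fails the run outright.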
