Add e2e test for train API #2199
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
Pull Request Test Coverage Report for Build 12425333047 (Coveralls)
@andreyvelich No problem at all. We can also keep the e2e test for the train API in the integration tests, but skip it when gang scheduling is used. Which way do you think is better?
Yes, that may require fewer changes. Let's just log in the tests that we don't run this test with gang scheduling.
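The skip-and-log approach agreed on above could look roughly like the following. This is only a minimal sketch, not the code from this PR: the environment variable name `GANG_SCHEDULER_NAME` and the helper names `gang_scheduling_enabled` and `run_train_api_e2e` are assumptions made for illustration, and the real test suite may detect gang scheduling differently.

```python
import logging
import os

logger = logging.getLogger(__name__)

# Hypothetical environment variable name; the real test suite may use a different key.
GANG_SCHEDULER_ENV_KEY = "GANG_SCHEDULER_NAME"


def gang_scheduling_enabled(env=None) -> bool:
    """Return True when a non-empty gang scheduler name is configured."""
    env = os.environ if env is None else env
    return bool(env.get(GANG_SCHEDULER_ENV_KEY, "").strip())


def run_train_api_e2e(test_body, env=None) -> bool:
    """Run test_body unless gang scheduling is enabled; return True if it ran."""
    if gang_scheduling_enabled(env):
        # Log the reason instead of failing, so the CI output explains the gap.
        logger.warning(
            "Skipping train API e2e test: it is not run with gang scheduling."
        )
        return False
    test_body()
    return True
```

In a pytest-based suite, the same check could instead call `pytest.skip(...)` with the same message, so the skip shows up in the test report.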
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Hezhi (Helen) Xie <hezxie@ucdavis.edu>
@andreyvelich The e2e test for PyTorchJob is still failing due to an image pull backoff error. I suspect it might be caused by insufficient disk space. I think we have two options:
Which approach do you think is better?
Can you try separating them to see whether the issue is resolved?
@andreyvelich I've separated the e2e test for the train API and now it works. Please review when you have time.
Thank you for this. Overall LGTM, just a small comment.
/assign @deepanker13 @kubeflow/wg-training-leads @Electronic-Waste
/lgtm
New changes are detected. LGTM label has been removed.
I've updated the Kubernetes version to |
Basically LGTM. I left some comments for you @helenxie-bit
    strategy:
      fail-fast: false
      matrix:
        kubernetes-version: ["v1.31.4"]
Shall we align the Kubernetes version with the other CI tests? For example:
training-operator/.github/workflows/test-example-notebooks.yaml
Lines 16 to 18 in 69094e1
    matrix:
      kubernetes-version: ["v1.28.7", "v1.29.2", "v1.30.6"]
      python-version: ["3.9", "3.10", "3.11"]
What this PR does / why we need it:
Add an e2e test in test_e2e_train_api.py for the train API.
Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #
Checklist: