Add e2e test for train API #2199

Open · wants to merge 112 commits into base: master

Changes from 108 commits (112 commits total)
15b6cb0
add e2e test for train API
helenxie-bit Aug 9, 2024
daa0054
fix peft import error
helenxie-bit Aug 9, 2024
8d4af90
update settings of the job
helenxie-bit Aug 9, 2024
86c31c8
fix format
helenxie-bit Aug 9, 2024
01870e2
fix format
helenxie-bit Aug 9, 2024
17f3c33
fix error detection
helenxie-bit Aug 9, 2024
0685dc7
resolve conflict
helenxie-bit Aug 9, 2024
83de64b
resolve conflict
helenxie-bit Aug 9, 2024
f954f2d
resolve conflict
helenxie-bit Aug 9, 2024
ff48154
fix format
helenxie-bit Aug 9, 2024
304db5d
fix NoneType error
helenxie-bit Aug 9, 2024
486154d
fix format
helenxie-bit Aug 9, 2024
016c41d
test bug
helenxie-bit Aug 9, 2024
1e7bd23
find bug
helenxie-bit Aug 11, 2024
1aced61
find bug
helenxie-bit Aug 11, 2024
3100aae
find bug
helenxie-bit Aug 11, 2024
e5b9061
add storage_config
helenxie-bit Aug 11, 2024
ffb0685
fix format
helenxie-bit Aug 11, 2024
dc1b48a
reduce pvc size
helenxie-bit Aug 12, 2024
8894517
set storage_config
helenxie-bit Aug 12, 2024
36872d7
set storage_config
helenxie-bit Aug 12, 2024
7dd8d40
set storage_config
helenxie-bit Aug 12, 2024
60c322d
set storage_config
helenxie-bit Aug 12, 2024
dd970ab
use gpu
helenxie-bit Aug 12, 2024
10bbfa0
use gpu
helenxie-bit Aug 12, 2024
d47d6a6
use gpu
helenxie-bit Aug 12, 2024
4ccd4a7
fix 'set_device' error
helenxie-bit Aug 12, 2024
0750322
add timeout error
helenxie-bit Aug 15, 2024
5ca0923
fix format
helenxie-bit Aug 15, 2024
387eb84
fix format
helenxie-bit Aug 15, 2024
9cc5429
fix format
helenxie-bit Aug 15, 2024
8a537ad
fix typo
helenxie-bit Aug 26, 2024
e508ef4
update e2e test for train api
helenxie-bit Aug 29, 2024
788359b
add num_labels
helenxie-bit Aug 29, 2024
9b4222e
update pip install
helenxie-bit Aug 29, 2024
d75938d
check disk space
helenxie-bit Aug 29, 2024
1148bc8
change sequence of e2e tests
helenxie-bit Aug 29, 2024
d29a85d
add clean-up after each e2e test of pytorchjob
helenxie-bit Aug 29, 2024
82ea9be
update cleanup function
helenxie-bit Aug 30, 2024
b45f9f7
update cleanup function
helenxie-bit Aug 30, 2024
a204746
update cleanup function-add check disk
helenxie-bit Aug 30, 2024
2d8f8b1
check docker volumes
helenxie-bit Aug 30, 2024
c748d0e
update cleanup function
helenxie-bit Aug 30, 2024
a68e182
update cleanup function
helenxie-bit Aug 30, 2024
227129e
check docker directory
helenxie-bit Aug 30, 2024
79e9e32
update pip install and 'num_workers'
helenxie-bit Aug 30, 2024
b7dbf5c
update pip install and 'num_workers'
helenxie-bit Aug 30, 2024
1f639a7
update pip install
helenxie-bit Aug 30, 2024
8322730
change the value of 'clean_pod_policy'
helenxie-bit Aug 30, 2024
ed10574
change the value of 'update cleanup function
helenxie-bit Aug 30, 2024
50ed9e8
update cleanup function
helenxie-bit Aug 30, 2024
b2cd27a
update cleanup function
helenxie-bit Aug 31, 2024
3af5d87
check docker volumes
helenxie-bit Aug 31, 2024
1a0eff3
check docker volumes
helenxie-bit Aug 31, 2024
604265a
stop the controller and restart it again to clean up
helenxie-bit Aug 31, 2024
a4f848f
update cleanup function
helenxie-bit Aug 31, 2024
3e86e90
update cleanup function
helenxie-bit Aug 31, 2024
558330b
update cleanup function
helenxie-bit Aug 31, 2024
d4ed2d8
separate e2e test for train api
helenxie-bit Sep 3, 2024
7a2ae05
fix format
helenxie-bit Sep 3, 2024
9efcce5
fix parameter of namespace
helenxie-bit Sep 3, 2024
a443ea2
fix format
helenxie-bit Sep 3, 2024
85fd8e6
reduce resources
helenxie-bit Sep 3, 2024
1a0c455
separate e2e test for train API
helenxie-bit Sep 3, 2024
afe4240
remove go setup
helenxie-bit Sep 3, 2024
250b830
adjust the version of k8s
helenxie-bit Sep 3, 2024
c5b39a4
move test file to new place
helenxie-bit Sep 3, 2024
fa99a92
fix typos
helenxie-bit Sep 4, 2024
f0d8cc4
rerun tests
helenxie-bit Sep 4, 2024
d2c3cac
update install packages
helenxie-bit Sep 21, 2024
c3f04c3
Merge remote-tracking branch 'upstream/master' into add-e2e-test-for-…
helenxie-bit Sep 21, 2024
9f42449
build and verify images of storage-intializer and trainer
helenxie-bit Sep 21, 2024
bb406ce
fix image build error
helenxie-bit Sep 21, 2024
f0b6b38
fix image build error
helenxie-bit Sep 21, 2024
45eb7e0
check disk space
helenxie-bit Sep 21, 2024
f217794
make 'setup-storage-initializer-and-trainer' executable
helenxie-bit Sep 21, 2024
083e155
separate step of loading images
helenxie-bit Sep 21, 2024
dc74844
check disk space after loading image
helenxie-bit Sep 21, 2024
de18ef0
clean up and check disk space
helenxie-bit Sep 21, 2024
ef8742c
prune docker build cache
helenxie-bit Sep 21, 2024
1eb3ef1
prune docker build cache
helenxie-bit Sep 21, 2024
1e407a5
adjust sequence of building and loading images
helenxie-bit Sep 21, 2024
7519559
move working directory
helenxie-bit Sep 21, 2024
f5d63c4
delete moving working directory
helenxie-bit Sep 22, 2024
08c8562
fix format
helenxie-bit Sep 22, 2024
d2ae542
use 'docker system prune'
helenxie-bit Sep 24, 2024
09fc8a9
make the format of the commands to be consistent
helenxie-bit Sep 24, 2024
a27e1a2
update base image
helenxie-bit Dec 17, 2024
59d8582
update base image
helenxie-bit Dec 17, 2024
581d2bc
update base image
helenxie-bit Dec 17, 2024
1140a11
delete unnecessary space clear and check code
helenxie-bit Dec 17, 2024
82de69a
merge e2e test for train api into integration tests
helenxie-bit Dec 17, 2024
f50094a
resolve conflict in integration tests
helenxie-bit Dec 17, 2024
5efaf3b
check for timeout error
helenxie-bit Dec 17, 2024
13ae587
fix name of trainer image
helenxie-bit Dec 18, 2024
dd4c2be
fix env of building storage initializer image
helenxie-bit Dec 18, 2024
b21bedd
clean format
helenxie-bit Dec 18, 2024
1669055
skip e2e test for train API when use scheduling
helenxie-bit Dec 18, 2024
3d91dfc
Update name of fileholder
helenxie-bit Dec 18, 2024
ba7297a
fix format
helenxie-bit Dec 18, 2024
16645f9
separate e2e test for train API
helenxie-bit Dec 19, 2024
5ec175f
fix format
helenxie-bit Dec 19, 2024
b7986e6
move test script
helenxie-bit Dec 19, 2024
267cbe8
update path to test script
helenxie-bit Dec 19, 2024
617dba3
update path to test script
helenxie-bit Dec 19, 2024
b5ae618
rerun tests
helenxie-bit Dec 19, 2024
9b997a3
rerun tests
helenxie-bit Dec 19, 2024
6e8f3f7
rerun tests
helenxie-bit Dec 19, 2024
f4bb238
update kubernetes version
helenxie-bit Dec 20, 2024
775ff67
update kubernetes version
helenxie-bit Dec 20, 2024
fc21273
rerun tests
helenxie-bit Dec 20, 2024
4c51b76
rerun tests
helenxie-bit Dec 20, 2024
61 changes: 61 additions & 0 deletions .github/workflows/e2e-test-train-api.yaml
@@ -0,0 +1,61 @@
name: E2E Test with train API
on:
  - pull_request

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  e2e-test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        kubernetes-version: ["v1.28.7"]
        python-version: ["3.9", "3.10", "3.11"]
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup E2E Tests
        uses: ./.github/workflows/setup-e2e-test
        with:
          kubernetes-version: ${{ matrix.kubernetes-version }}
          python-version: ${{ matrix.python-version }}

      - name: Build trainer
        run: |
          ./scripts/gha/build-trainer.sh
        env:
          TRAINER_CI_IMAGE: kubeflowtraining/trainer:test

      - name: Load trainer
        run: |
          kind load docker-image ${{ env.TRAINER_CI_IMAGE }} --name ${{ env.KIND_CLUSTER }}
        env:
          KIND_CLUSTER: training-operator-cluster
          TRAINER_CI_IMAGE: kubeflowtraining/trainer:test

      - name: Build storage initializer
        run: |
          ./scripts/gha/build-storage-initializer.sh
        env:
          STORAGE_INITIALIZER_CI_IMAGE: kubeflowtraining/storage-initializer:test
          TRAINER_CI_IMAGE: kubeflowtraining/trainer:test

      - name: Load storage initializer
        run: |
          kind load docker-image ${{ env.STORAGE_INITIALIZER_CI_IMAGE }} --name ${{ env.KIND_CLUSTER }}
        env:
          KIND_CLUSTER: training-operator-cluster
          STORAGE_INITIALIZER_CI_IMAGE: kubeflowtraining/storage-initializer:test

      - name: Run tests
        run: |
          pip install pytest
          python3 -m pip install -e sdk/python[huggingface]
          pytest -s sdk/python/test/e2e-fine-tune-llm/test_e2e_pytorch_fine_tune_llm.py --log-cli-level=debug
        env:
          STORAGE_INITIALIZER_IMAGE: kubeflowtraining/storage-initializer:test
          TRAINER_TRANSFORMER_IMAGE_DEFAULT: kubeflowtraining/trainer:test
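The workflow's steps can be replayed locally against a kind cluster. A minimal sketch that only prints each CI step as the command it would run (actually executing them requires docker, kind, and pytest); the image tags and cluster name come from the workflow above:

```shell
# Mirror the CI env vars (values from the workflow above).
TRAINER_CI_IMAGE=kubeflowtraining/trainer:test
STORAGE_INITIALIZER_CI_IMAGE=kubeflowtraining/storage-initializer:test
KIND_CLUSTER=training-operator-cluster

# Print each CI step; drop the echo to execute the steps for real.
echo "TRAINER_CI_IMAGE=${TRAINER_CI_IMAGE} ./scripts/gha/build-trainer.sh"
echo "kind load docker-image ${TRAINER_CI_IMAGE} --name ${KIND_CLUSTER}"
echo "STORAGE_INITIALIZER_CI_IMAGE=${STORAGE_INITIALIZER_CI_IMAGE} TRAINER_CI_IMAGE=${TRAINER_CI_IMAGE} ./scripts/gha/build-storage-initializer.sh"
echo "kind load docker-image ${STORAGE_INITIALIZER_CI_IMAGE} --name ${KIND_CLUSTER}"
echo "STORAGE_INITIALIZER_IMAGE=${STORAGE_INITIALIZER_CI_IMAGE} TRAINER_TRANSFORMER_IMAGE_DEFAULT=${TRAINER_CI_IMAGE} pytest -s sdk/python/test/e2e-fine-tune-llm/test_e2e_pytorch_fine_tune_llm.py --log-cli-level=debug"
```

Note that the build steps and the load steps must agree on the tag, and the test step re-exports the same tags under the names the SDK reads (`STORAGE_INITIALIZER_IMAGE`, `TRAINER_TRANSFORMER_IMAGE_DEFAULT`).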
2 changes: 1 addition & 1 deletion .github/workflows/integration-tests.yaml
@@ -68,7 +68,7 @@ jobs:
      - name: Run tests
        run: |
          pip install pytest
-         python3 -m pip install -e sdk/python; pytest -s sdk/python/test --log-cli-level=debug --namespace=default
+         python3 -m pip install -e sdk/python; pytest -s sdk/python/test/e2e --log-cli-level=debug --namespace=default
        env:
          GANG_SCHEDULER_NAME: ${{ matrix.gang-scheduler-name }}
24 changes: 24 additions & 0 deletions scripts/gha/build-storage-initializer.sh
@@ -0,0 +1,24 @@
#!/bin/bash

# Copyright 2024 The Kubeflow Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# The script is used to build the Kubeflow Storage Initializer image.


set -o errexit
set -o nounset
set -o pipefail

docker build sdk/python/kubeflow/storage_initializer -t ${STORAGE_INITIALIZER_CI_IMAGE} -f sdk/python/kubeflow/storage_initializer/Dockerfile
24 changes: 24 additions & 0 deletions scripts/gha/build-trainer.sh
@@ -0,0 +1,24 @@
#!/bin/bash

# Copyright 2024 The Kubeflow Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# The script is used to build the Kubeflow Trainer image.


set -o errexit
set -o nounset
set -o pipefail

docker build sdk/python/kubeflow/trainer -t ${TRAINER_CI_IMAGE} -f sdk/python/kubeflow/trainer/Dockerfile.cpu
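Because both build scripts set `nounset`, invoking one without its image-tag variable aborts immediately instead of producing an untagged image. A small sketch of that behavior (`DEMO_IMAGE` is a hypothetical stand-in for `TRAINER_CI_IMAGE`, left unset on purpose):

```shell
# With nounset, expanding an unset variable fails the subshell, which is
# what build-trainer.sh does if TRAINER_CI_IMAGE is not exported.
if ( set -o nounset; : "${DEMO_IMAGE}" ) 2>/dev/null; then
  result="variable set"
else
  result="unset variable rejected"
fi
echo "${result}"
```

In CI the `env:` block on each workflow step supplies the variable, so the scripts never hit this path.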
18 changes: 18 additions & 0 deletions sdk/python/kubeflow/trainer/Dockerfile.cpu
@@ -0,0 +1,18 @@
# Use an official Python runtime as a parent image
FROM python:3.11

# Set the working directory in the container
WORKDIR /app

# Copy the requirements.txt file into the container
COPY requirements.txt /app/requirements.txt

# Install any needed packages specified in requirements.txt
RUN pip install --no-cache-dir torch==2.5.1
RUN pip install --no-cache-dir -r requirements.txt

# Copy the Python package and its source code into the container
COPY . /app

# Run hf_llm_training.py via torchrun when the container launches
ENTRYPOINT ["torchrun", "hf_llm_training.py"]
96 changes: 96 additions & 0 deletions sdk/python/test/e2e-fine-tune-llm/test_e2e_pytorch_fine_tune_llm.py
@@ -0,0 +1,96 @@
# Copyright 2024 kubeflow.org.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import logging

import transformers
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceDatasetParams,
    HuggingFaceModelParams,
    HuggingFaceTrainerParams,
)
from kubeflow.training import TrainingClient, constants
from peft import LoraConfig

import test.e2e.utils as utils

logging.basicConfig(format="%(message)s")
logging.getLogger("kubeflow.training.api.training_client").setLevel(logging.DEBUG)

TRAINING_CLIENT = TrainingClient(job_kind=constants.PYTORCHJOB_KIND)


def test_sdk_e2e_create_from_train_api(job_namespace="default"):
    JOB_NAME = "pytorchjob-from-train-api"

    # Use the test case from the fine-tuning API tutorial.
    # https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/
    TRAINING_CLIENT.train(
        name=JOB_NAME,
        namespace=job_namespace,
        # BERT model URI and type of Transformer to train it.
        model_provider_parameters=HuggingFaceModelParams(
            model_uri="hf://google-bert/bert-base-cased",
            transformer_type=transformers.AutoModelForSequenceClassification,
            num_labels=5,
        ),
        # To save test time, use 8 samples from the Yelp dataset.
        dataset_provider_parameters=HuggingFaceDatasetParams(
            repo_id="yelp_review_full",
            split="train[:8]",
        ),
        # Specify HuggingFace Trainer parameters.
        trainer_parameters=HuggingFaceTrainerParams(
            training_parameters=transformers.TrainingArguments(
                output_dir="test_trainer",
                save_strategy="no",
                evaluation_strategy="no",
                do_eval=False,
                disable_tqdm=True,
                log_level="info",
                num_train_epochs=1,
            ),
            # Set LoRA config to reduce the number of trainable parameters.
            lora_config=LoraConfig(
                r=8,
                lora_alpha=8,
                lora_dropout=0.1,
                bias="none",
            ),
        ),
        num_workers=1,
        num_procs_per_worker=1,
        resources_per_worker={
            "gpu": 0,
            "cpu": 2,
            "memory": "10G",
        },
        storage_config={
            "size": "10Gi",
            "access_modes": ["ReadWriteOnce"],
        },
    )

    logging.info(f"List of created {TRAINING_CLIENT.job_kind}s")
    logging.info(TRAINING_CLIENT.list_jobs(job_namespace))

    try:
        utils.verify_job_e2e(TRAINING_CLIENT, JOB_NAME, job_namespace, wait_timeout=900)
    except Exception as e:
        utils.print_job_results(TRAINING_CLIENT, JOB_NAME, job_namespace)
        TRAINING_CLIENT.delete_job(JOB_NAME, job_namespace)
        raise Exception(f"PyTorchJob created from train API E2E test failed. Exception: {e}") from e

    utils.print_job_results(TRAINING_CLIENT, JOB_NAME, job_namespace)
    TRAINING_CLIENT.delete_job(JOB_NAME, job_namespace)
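The test above duplicates the `print_job_results`/`delete_job` cleanup in both the `except` branch and the success path; the same intent can be written once with `try`/`finally`. A self-contained sketch of that pattern, where `verify_job` and `delete_job` are hypothetical stand-ins for `utils.verify_job_e2e` and `TRAINING_CLIENT.delete_job`:

```python
deleted = []  # records cleanup calls so the behavior is observable


def verify_job(name):
    # Stand-in for utils.verify_job_e2e: raises when the job did not succeed.
    if name == "bad-job":
        raise RuntimeError("job failed")


def delete_job(name):
    # Stand-in for TRAINING_CLIENT.delete_job: records that cleanup ran.
    deleted.append(name)


def run_e2e(name):
    try:
        verify_job(name)
    except Exception as e:
        raise Exception(f"PyTorchJob E2E test failed: {e}") from e
    finally:
        delete_job(name)  # runs exactly once, on success and on failure


run_e2e("good-job")
try:
    run_e2e("bad-job")
except Exception:
    pass
print(deleted)  # ['good-job', 'bad-job']
```

With `finally`, the cleanup cannot be skipped by an early exception, and there is a single place to extend (e.g. to also dump job results) rather than two.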