Adding Training image needed for train api #1963

Merged: 14 commits, Jan 11, 2024

deepanker13
Contributor

What this PR does / why we need it:

  1. Added the training script that the PyTorchJob will use for the train API.
  2. Added the GitHub workflow to build and publish the image on pull request.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Partially fixes "Train/Fine-tune API Proposal for LLMs" #1945

Checklist:

  • Docs included if any changes are user facing

@deepanker13 changed the title from "Adding training image creation code" to "Adding Training image needed for train api" on Dec 12, 2023
coveralls commented Dec 12, 2023

Pull Request Test Coverage Report for Build 7493861403

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.02%) to 42.885%

Totals Coverage Status
Change from base Build 7491927722: -0.02%
Covered Lines: 3755
Relevant Lines: 8756

💛 - Coveralls

@deepanker13
Contributor Author

@andreyvelich @tenzen-y if it is good to go, can we merge this?

Member

@tenzen-y left a comment

otherwise lgtm

Member

@tenzen-y left a comment

Then, can you update the following line?

platforms: ${{ matrix.platforms }}

platforms: linux/amd64,linux/arm64,linux/ppc64le

Member

@tenzen-y left a comment

@deepanker13 Thanks!
/lgtm

/assign @andreyvelich

Member

@andreyvelich left a comment

Thank you @deepanker13!
I left a few comments

@@ -0,0 +1,18 @@
# Use an official Pytorch runtime as a parent image
FROM nvcr.io/nvidia/pytorch:23.12-py3

Member

Do we need to use the PyTorch image from NVIDIA for this trainer?
Would it be better to use the official PyTorch image, similar to what we use in the SDK?
docker.io/pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime

Contributor Author

as suggested by @tenzen-y
#1963 (comment)

Member

I see. @tenzen-y, do you know if PyTorch has an official image we can use that is supported on all platforms?

Member

@andreyvelich If I remember correctly, PyTorch doesn't provide multi-architecture images with GPU support, so we need to use the official NVIDIA images.

Comment on lines 64 to 69
def setup_peft_model(model, lora_config):
    # Set up the PEFT model
    lora_config = LoraConfig(**json.loads(lora_config))
    print(lora_config)
    model = get_peft_model(model, lora_config)
    return model

Member

Are we always going to have a PEFT config for this trainer?
@johnugeorge @deepanker13

Contributor Author

The LoRA config can be omitted by the user; this is handled by setting an empty LoRA config as the default value in the data class.

Member

Sounds good. @deepanker13, should we verify whether lora_config is set?
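
One way the check raised above could look is sketched here. It is an illustration only, not the code merged in this PR: `parse_lora_config` and `maybe_setup_peft_model` are hypothetical names, and the sketch assumes the config arrives as a JSON string (as in the quoted `setup_peft_model`) and that an empty object means "not set". The PEFT call itself is passed in as a callback so the sketch runs without the library installed.

```python
import json
from typing import Any, Callable, Optional


def parse_lora_config(raw: Optional[str]) -> Optional[dict]:
    """Return the decoded LoRA settings, or None when nothing was supplied."""
    if raw is None or not raw.strip():
        return None
    settings = json.loads(raw)
    if not isinstance(settings, dict):
        raise ValueError("lora_config must decode to a JSON object")
    # Assumption: an empty object ("{}") is treated the same as "not set".
    return settings or None


def maybe_setup_peft_model(
    model: Any,
    raw_lora_config: Optional[str],
    wrap: Callable[[Any, dict], Any],
) -> Any:
    """Apply `wrap` only when a non-empty LoRA config is present."""
    settings = parse_lora_config(raw_lora_config)
    if settings is None:
        # No PEFT config: return the base model unchanged.
        return model
    return wrap(model, settings)
```

In the training script, `wrap` could be something like `lambda m, s: get_peft_model(m, LoraConfig(**s))`, mirroring the two calls in the quoted diff.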

parser.add_argument("--transformer_type", help="model transformer type")
parser.add_argument("--model_dir", help="directory containing model")
parser.add_argument("--dataset_dir", help="directory contaning dataset")
parser.add_argument("--dataset_name", help="dataset name")

Member

Do we add the dataset_name argument for users who want to use this Trainer without the SDK client?
I am asking because in the SDK client we always download the dataset in the storage initializer and store it in the Trainer volume,
so we don't need to provide the name.

Contributor Author

in the same dataset_dir there can be multiple datasets, right?

Member

But can we use the train API to download more than one dataset?
E.g. in your example, you just download the ultrachat_10k dataset.

Contributor Author

yes, if I run with a different datasetname, it will work fine.
@andreyvelich

Member

Yeah, but every API execution creates a new PyTorchJob, and a new Trainer image is spun up.
So the dataset always corresponds to a single name, doesn't it?
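
The trade-off in this thread can be illustrated with a small sketch in which `--dataset_name` is optional. This is only an assumption about how the argument might be consumed: `resolve_dataset_path` is a hypothetical helper, and joining the directory with the name is a guess at the layout, not something the PR confirms.

```python
import argparse
import os
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="trainer arguments (sketch)")
    parser.add_argument("--model_dir", help="directory containing model")
    parser.add_argument("--dataset_dir", help="directory containing dataset")
    # Optional in this sketch: only needed if dataset_dir may hold several datasets.
    parser.add_argument("--dataset_name", default=None, help="dataset name")
    return parser


def resolve_dataset_path(dataset_dir: str, dataset_name: Optional[str]) -> str:
    """Hypothetical helper: pick the directory to load the dataset from."""
    if dataset_name:
        return os.path.join(dataset_dir, dataset_name)
    # Single-dataset case: the volume itself is the dataset location.
    return dataset_dir


def parse(argv: List[str]) -> str:
    args = build_parser().parse_args(argv)
    return resolve_dataset_path(args.dataset_dir, args.dataset_name)
```

With this shape, a job started through the SDK client could omit the name, while a user running the Trainer directly could still select one dataset out of several.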

client.train(
    name="hf-test",
    num_workers=2,
    num_procs_per_worker=0,

Member

Why is this value 0?

Contributor Author

For CPU-only training.

Member

Hmm, but can torchrun be used with CPUs?
E.g. maybe I want to run torchrun --nproc-per-node=2 where I use 2 CPUs per node.
cc @johnugeorge

Member

Yes, it can run on CPUs.
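
As a rough illustration of the CPU case discussed above, the sketch below builds the `torchrun` command line for a given number of processes per node. `torchrun_command` is a hypothetical helper, not part of the SDK, and it does not show how the SDK maps `num_procs_per_worker` onto `torchrun`. The training script itself would also need a CPU-capable backend such as gloo.

```python
from typing import List


def torchrun_command(script: str, nproc_per_node: int, nnodes: int = 1) -> List[str]:
    """Build a torchrun invocation; works the same for CPU or GPU workers."""
    if nproc_per_node < 1:
        # Assumption of this sketch: at least one process per node is launched.
        raise ValueError("nproc_per_node must be at least 1")
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc-per-node={nproc_per_node}",
        script,
    ]
```

For example, `torchrun_command("train.py", 2)` corresponds to the `torchrun --nproc-per-node=2` case mentioned in the comment, with two processes sharing one node's CPUs.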

@google-oss-prow bot removed the lgtm label on Jan 11, 2024
@deepanker13
Contributor Author

deepanker13 commented Jan 11, 2024

Tested the GPU training example in examples/sdk/train_api.ipynb.
Uploading Screenshot 2024-01-12 at 1.30.42 AM.png…

Member

@andreyvelich left a comment

That's amazing, thank you @deepanker13!
/lgtm
/assign @johnugeorge

@johnugeorge
Member

/approve
Thanks Deepanker

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deepanker13, johnugeorge

@google-oss-prow bot merged commit e10733e into kubeflow:master on Jan 11, 2024
35 checks passed