As part of the Kubeflow Training V2 work, we should design and implement a custom Trainer to fine-tune the LLMs that we plan to support via TrainingRuntimes in Kubeflow upstream.
We should discuss whether the LLM Trainer implementation should use native PyTorch APIs or HuggingFace Transformers.
The Trainer should allow users to configure LoRA, QLoRA, FSDP, and other important settings.
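To make the discussion concrete, below is a minimal, hypothetical sketch of the knobs the Trainer could expose if we went the HuggingFace route: LoRA/QLoRA via `peft` and `bitsandbytes`, and FSDP through `TrainingArguments`. The model name, hyperparameter values, and the `build_trainer` helper are illustrative assumptions, not a proposed API, and combining QLoRA with FSDP in practice needs more configuration than shown here.

```python
# Hypothetical sketch (not the final API): LoRA, QLoRA, and FSDP knobs with
# HuggingFace Transformers + peft. Model name, hyperparameters, and the
# build_trainer helper are placeholders for illustration only.
import torch
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    Trainer,
    TrainingArguments,
)


def build_trainer(model_name: str, train_dataset):
    # QLoRA: load the base model with 4-bit NF4 quantization.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name, quantization_config=bnb_config
    )

    # LoRA: train low-rank adapters instead of the full weight matrices.
    lora_config = LoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)

    # FSDP: shard parameters and gradients across workers; shown here only to
    # illustrate where the knob lives, not a validated QLoRA+FSDP recipe.
    args = TrainingArguments(
        output_dir="/tmp/llm-trainer",
        per_device_train_batch_size=1,
        num_train_epochs=1,
        fsdp="full_shard auto_wrap",
    )
    return Trainer(model=model, args=args, train_dataset=train_dataset)
```

Whether these knobs surface as TrainingRuntime parameters or as an SDK-level config object is exactly the question this issue should settle.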
We are experimenting with some PyTorch-native and Transformers APIs to design this Trainer.
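For comparison, the PyTorch-native path would look roughly like the sketch below: wrapping the model in FSDP directly and bootstrapping the process group from the environment variables that torchrun (or the Kubeflow launcher) sets. The `build_fsdp_model` helper, default model name, and `min_num_params` threshold are illustrative assumptions only.

```python
# Hypothetical sketch of the PyTorch-native alternative: shard a causal-LM
# model with FSDP, initializing the process group from launcher-provided
# environment variables (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT).
import functools

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy
from transformers import AutoModelForCausalLM


def build_fsdp_model(model_name: str = "gpt2") -> FSDP:
    """Load a model and shard it across ranks with FSDP."""
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = AutoModelForCausalLM.from_pretrained(model_name)
    # Auto-wrap submodules above a size threshold so each shard stays small.
    wrap_policy = functools.partial(
        size_based_auto_wrap_policy, min_num_params=1_000_000
    )
    return FSDP(
        model,
        auto_wrap_policy=wrap_policy,
        device_id=torch.cuda.current_device(),
    )
```

The trade-off to discuss is that the Transformers `Trainer` gives us LoRA/QLoRA/FSDP wiring mostly for free, while the native path gives finer control over sharding and the training loop.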
Useful resources:
- Part of: #2170
- Design Doc: initial design doc from @Electronic-Waste where we can brainstorm ideas: https://docs.google.com/document/d/1a4xWGVWZo43QKv8tIomoK_XHzBMC_byXBnDb0104htQ/edit?tab=t.0
cc @saileshd1402 @deepanker13 @kubeflow/wg-training-leads
Love this feature?
Give it a 👍. We prioritize the features with the most 👍.