KEP-2170: Create model and dataset initializers #2303

Merged

Changes from 3 commits
8 changes: 8 additions & 0 deletions .github/workflows/publish-core-images.yaml
@@ -30,6 +30,14 @@ jobs:
dockerfile: cmd/training-operator.v2alpha1/Dockerfile
platforms: linux/amd64,linux/arm64,linux/ppc64le
tag-prefix: v2alpha1
- component-name: model-initiailizer-v2
@deepanker13 (Contributor), Oct 24, 2024:
There is a typo in multiple places; it should be "initializer".

@andreyvelich (Member, Author):
Great catch! Let me fix that.

dockerfile: cmd/initiailizer_v2/model/Dockerfile
platforms: linux/amd64,linux/arm64
tag-prefix: v2
- component-name: dataset-initiailizer-v2
dockerfile: cmd/initiailizer_v2/dataset/Dockerfile
platforms: linux/amd64,linux/arm64
tag-prefix: v2
- component-name: kubectl-delivery
dockerfile: build/images/kubectl-delivery/Dockerfile
platforms: linux/amd64,linux/arm64,linux/ppc64le
4 changes: 2 additions & 2 deletions .gitignore
@@ -10,8 +10,8 @@ cover.out
.vscode/
__debug_bin

# Compiled python files.
*.pyc
# Python chache files
andreyvelich marked this conversation as resolved.
__pycache__/

# Emacs temporary files
*~
13 changes: 13 additions & 0 deletions cmd/initiailizer_v2/dataset/Dockerfile
@@ -0,0 +1,13 @@
FROM python:3.11-alpine

WORKDIR /workspace

# Copy the required Python modules.
COPY cmd/initiailizer_v2/dataset/requirements.txt .
COPY sdk/python/kubeflow sdk/python/kubeflow
COPY pkg/initiailizer_v2 pkg/initiailizer_v2

# Install the needed packages.
RUN pip install -r requirements.txt

ENTRYPOINT ["python", "-m", "pkg.initiailizer_v2.dataset"]
1 change: 1 addition & 0 deletions cmd/initiailizer_v2/dataset/requirements.txt
@@ -0,0 +1 @@
huggingface_hub==0.23.4
13 changes: 13 additions & 0 deletions cmd/initiailizer_v2/model/Dockerfile
@@ -0,0 +1,13 @@
FROM python:3.11-alpine

WORKDIR /workspace

# Copy the required Python modules.
COPY cmd/initiailizer_v2/model/requirements.txt .
Contributor:
Why is the folder named cmd? It contains Dockerfiles and requirements.txt.

@andreyvelich (Member, Author):
We want to be consistent across all Training Operator components, like we do in Katib: https://github.com/kubeflow/katib/tree/master/cmd
E.g.:
cmd - contains binaries/Dockerfiles for execution.
pkg - contains the actual backend.

COPY sdk/python/kubeflow sdk/python/kubeflow
COPY pkg/initiailizer_v2 pkg/initiailizer_v2

# Install the needed packages.
RUN pip install -r requirements.txt

ENTRYPOINT ["python", "-m", "pkg.initiailizer_v2.model"]
1 change: 1 addition & 0 deletions cmd/initiailizer_v2/model/requirements.txt
@@ -0,0 +1 @@
huggingface_hub==0.23.4
31 changes: 31 additions & 0 deletions pkg/initiailizer_v2/dataset/__main__.py
@@ -0,0 +1,31 @@
import logging
import os
from urllib.parse import urlparse

import pkg.initiailizer_v2.utils.utils as utils
from pkg.initiailizer_v2.dataset.huggingface import HuggingFace

logging.basicConfig(
format="%(asctime)s %(levelname)-8s [%(filename)s:%(lineno)d] %(message)s",
datefmt="%Y-%m-%dT%H:%M:%SZ",
level=logging.INFO,
)

if __name__ == "__main__":
logging.info("Starting dataset initialization")

try:
storage_uri = os.environ[utils.STORAGE_URI_ENV]
except Exception as e:
logging.error("STORAGE_URI env variable must be set.")
raise e

match urlparse(storage_uri).scheme:
# TODO (andreyvelich): Implement more dataset providers.
case utils.HF_SCHEME:
hf = HuggingFace()
hf.load_config()
hf.download_dataset()
case _:
logging.error("STORAGE_URI must have the valid dataset provider")
raise Exception
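
For illustration, here is a quick sketch of how urlparse splits an hf:// URI, which is what the match statement above dispatches on (the repo id below is a placeholder, not from this PR):

from urllib.parse import urlparse

parsed = urlparse("hf://username/my-dataset")
print(parsed.scheme)                # "hf" -> matches utils.HF_SCHEME
print(parsed.netloc + parsed.path)  # "username/my-dataset"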
9 changes: 9 additions & 0 deletions pkg/initiailizer_v2/dataset/config.py
@@ -0,0 +1,9 @@
from dataclasses import dataclass
from typing import Optional


# TODO (andreyvelich): This should be moved under Training V2 SDK.
@dataclass
class HuggingFaceDatasetConfig:
storage_uri: str
access_token: Optional[str] = None
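
The initializer populates this dataclass from environment variables whose names are the upper-cased field names (see get_config_from_env in pkg/initiailizer_v2/utils/utils.py below). A minimal sketch of that mapping, assuming placeholder values:

import os

from pkg.initiailizer_v2.dataset.config import HuggingFaceDatasetConfig

# Placeholder env setup for illustration only.
os.environ["STORAGE_URI"] = "hf://username/my-dataset"
os.environ["ACCESS_TOKEN"] = "hf_xxx"

config = HuggingFaceDatasetConfig(
    storage_uri=os.environ["STORAGE_URI"],
    access_token=os.getenv("ACCESS_TOKEN"),
)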
42 changes: 42 additions & 0 deletions pkg/initiailizer_v2/dataset/huggingface.py
@@ -0,0 +1,42 @@
import logging
from urllib.parse import urlparse

import huggingface_hub

import pkg.initiailizer_v2.utils.utils as utils

# TODO (andreyvelich): This should be moved to SDK V2 constants.
import sdk.python.kubeflow.storage_initializer.constants as constants
from pkg.initiailizer_v2.dataset.config import HuggingFaceDatasetConfig

logging.basicConfig(
format="%(asctime)s %(levelname)-8s [%(filename)s:%(lineno)d] %(message)s",
datefmt="%Y-%m-%dT%H:%M:%SZ",
level=logging.INFO,
)


class HuggingFace(utils.DatasetProvider):

def load_config(self):
config_dict = utils.get_config_from_env(HuggingFaceDatasetConfig)
logging.info(f"Config for HuggingFace dataset initiailizer: {config_dict}")
self.config = HuggingFaceDatasetConfig(**config_dict)

def download_dataset(self):
Contributor:
Do we have unit tests for these, or would that be taken care of in e2e? Downloading models from HF had issues previously, so it would be helpful to have it tested.

Contributor:
Never mind, just realized we have #2305.

storage_uri_parsed = urlparse(self.config.storage_uri)
dataset_uri = storage_uri_parsed.netloc + storage_uri_parsed.path

logging.info(f"Downloading dataset: {dataset_uri}")
logging.info("-" * 40)

if self.config.access_token:
huggingface_hub.login(self.config.access_token)

huggingface_hub.snapshot_download(
repo_id=dataset_uri,
repo_type="dataset",
local_dir=constants.VOLUME_PATH_DATASET,
@deepanker13 (Contributor), Oct 24, 2024:
To speed things up, should we set max_workers equal to the number of files being downloaded? Currently it downloads 8 files in parallel.

@andreyvelich (Member, Author), Oct 25, 2024:
Do we have any benchmarks showing that setting max_workers to the number of files speeds up download time?
What if we don't have enough CPUs for all the concurrent threads?

)

logging.info("Dataset has been downloaded")
33 changes: 33 additions & 0 deletions pkg/initiailizer_v2/model/__main__.py
@@ -0,0 +1,33 @@
import logging
import os
from urllib.parse import urlparse

import pkg.initiailizer_v2.utils.utils as utils
from pkg.initiailizer_v2.model.huggingface import HuggingFace

logging.basicConfig(
format="%(asctime)s %(levelname)-8s [%(filename)s:%(lineno)d] %(message)s",
datefmt="%Y-%m-%dT%H:%M:%SZ",
level=logging.INFO,
)

if __name__ == "__main__":
logging.info("Starting pre-trained model initialization")

try:
storage_uri = os.environ[utils.STORAGE_URI_ENV]
except Exception as e:
logging.error("STORAGE_URI env variable must be set.")
raise e

match urlparse(storage_uri).scheme:
# TODO (andreyvelich): Implement more model providers.
case utils.HF_SCHEME:
hf = HuggingFace()
hf.load_config()
hf.download_model()
case _:
logging.error(
f"STORAGE_URI must have the valid model provider. STORAGE_URI: {storage_uri}"
)
raise Exception
9 changes: 9 additions & 0 deletions pkg/initiailizer_v2/model/config.py
@@ -0,0 +1,9 @@
from dataclasses import dataclass
from typing import Optional


# TODO (andreyvelich): This should be moved under Training V2 SDK.
@dataclass
class HuggingFaceModelInputConfig:
storage_uri: str
access_token: Optional[str] = None
44 changes: 44 additions & 0 deletions pkg/initiailizer_v2/model/huggingface.py
@@ -0,0 +1,44 @@
import logging
from urllib.parse import urlparse

import huggingface_hub

import pkg.initiailizer_v2.utils.utils as utils

# TODO (andreyvelich): This should be moved to SDK V2 constants.
import sdk.python.kubeflow.storage_initializer.constants as constants
from pkg.initiailizer_v2.model.config import HuggingFaceModelInputConfig

logging.basicConfig(
format="%(asctime)s %(levelname)-8s [%(filename)s:%(lineno)d] %(message)s",
datefmt="%Y-%m-%dT%H:%M:%SZ",
level=logging.INFO,
)


class HuggingFace(utils.ModelProvider):

def load_config(self):
config_dict = utils.get_config_from_env(HuggingFaceModelInputConfig)
logging.info(f"Config for HuggingFace model initiailizer: {config_dict}")
self.config = HuggingFaceModelInputConfig(**config_dict)

def download_model(self):
storage_uri_parsed = urlparse(self.config.storage_uri)
model_uri = storage_uri_parsed.netloc + storage_uri_parsed.path

logging.info(f"Downloading model: {model_uri}")
logging.info("-" * 40)

if self.config.access_token:
huggingface_hub.login(self.config.access_token)

# TODO (andreyvelich): We should verify these patterns for different models.
huggingface_hub.snapshot_download(
repo_id=model_uri,
local_dir=constants.VOLUME_PATH_MODEL,
allow_patterns=["*.json", "*.safetensors", "*.model"],

Contributor:
For "*.safetensors" and "*.model" in the allow_patterns, I would say it works in most cases.
However, a model like mistralai/Mistral-7B-Instruct-v0.3 has consolidated.safetensors and tokenizer.model.v3 (the so-called v3 format in Mistral; see their download example below):

snapshot_download(repo_id="mistralai/Mistral-7B-Instruct-v0.3", allow_patterns=["params.json", "consolidated.safetensors", "tokenizer.model.v3"], local_dir=mistral_models_path)

In this case, downloading the above Mistral model with the current allow_patterns gives a download size of 29 GB (double the actual size of 14.5 GB). You probably need some logic to handle the Mistral model.

@andreyvelich (Member, Author):
Thanks for sharing! Let me add it to the TODO list.

ignore_patterns=["*.msgpack", "*.h5", "*.bin"],
@saileshd1402 (Contributor):
We can consider ignoring ".pt" and ".pth" files as well.

@andreyvelich (Member, Author):
@saileshd1402 Do we know what files have the .pt and .pth extensions?

@saileshd1402 (Contributor), Oct 24, 2024:
They store model weights similarly to .safetensors and .bin but are outdated now. For example, in meta-llama/Llama-3.1-8B there is a folder called "/original" where consolidated.00.pth stores the same data as the ".safetensors" files but is outdated. I would like to know if others think those files are important, though.

@andreyvelich (Member, Author):
@lizzzcai do you have any thoughts on whether we can exclude .pth and .pt files from the model download?
I noticed that you suggested adding the same ignore_patterns to the KServe storage initializer: kserve/kserve#3584 (comment)

@lizzzcai (Contributor):
Hi @andreyvelich, I don't see an issue with excluding .pth, .pt, and .bin, as safetensors should be the preferred format and it has better security compared to the others. This assumes the model being downloaded provides the safetensors format (most popular models should).

If you want to support multiple formats, you can check how vLLM supports it.

@andreyvelich (Member, Author):
I see, thanks for the info!
I think in the future we should follow the vLLM approach.

)

logging.info("Model has been downloaded")
Empty file.
37 changes: 37 additions & 0 deletions pkg/initiailizer_v2/utils/utils.py
@@ -0,0 +1,37 @@
import os
from abc import ABC, abstractmethod
from dataclasses import fields
from typing import Dict

STORAGE_URI_ENV = "STORAGE_URI"
HF_SCHEME = "hf"


class ModelProvider(ABC):
@abstractmethod
def load_config(self):
raise NotImplementedError()

@abstractmethod
def download_model(self):
raise NotImplementedError()


class DatasetProvider(ABC):
@abstractmethod
def load_config(self):
raise NotImplementedError()

@abstractmethod
def download_dataset(self):
raise NotImplementedError()


# Get DataClass config from the environment variables.
# Env names must be equal to the DataClass parameters.
def get_config_from_env(config) -> Dict[str, str]:
Contributor:
Do you want some typing hints for config?

@andreyvelich (Member, Author):
Actually, I wasn't able to find a Python type that can represent a DataClass, as described here: https://stackoverflow.com/questions/54668000/type-hint-for-an-instance-of-a-non-specific-dataclass
Any ideas on how we can add the type hint for it, @kannon92?

Contributor:
Ah, I see. Then it's fine to go without.

Contributor:
I'm not much of a Python dev these days, so I don't know about the typing for data classes.

Contributor:
Shall we use something like this?

config: Union[DataClassA, DataClassB]

@andreyvelich (Member, Author):
I tried this, it won't work. E.g., I can see this error from PyLance:

Argument of type "type[HuggingFaceDatasetConfig]" cannot be assigned to parameter "config" of type "HuggingFaceDatasetConfig" in function "get_config_from_env"
  "type[type]" is not assignable to "type[HuggingFaceDatasetConfig]"

@deepanker13 (Contributor), Oct 25, 2024:

import os
from dataclasses import dataclass, fields
from typing import Dict, Optional, Union

@dataclass
class HuggingFaceModelInputConfig:
    storage_uri: str
    access_token: Optional[str] = None


def get_config_from_env(config: Union[HuggingFaceModelInputConfig]) -> Dict[str, str]:
    config_from_env = {}
    for field in fields(config):
        config_from_env[field.name] = os.getenv(field.name.upper())

    return config_from_env

cf1 = get_config_from_env(HuggingFaceModelInputConfig)
print(cf1)

This is working, @andreyvelich.

@andreyvelich (Member, Author):
@deepanker13 What type validator do you use in your IDE?
My VSCode complains with the error that I added above.
I am using the PyLance VSCode extension: https://github.com/microsoft/pylance-release

config_from_env = {}
for field in fields(config):
config_from_env[field.name] = os.getenv(field.name.upper())

return config_from_env
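
On the typing thread above, one sketch that avoids the PyLance error is to annotate the parameter as the dataclass type itself rather than an instance (a judgment call, not verified against this repo's tooling):

import os
from dataclasses import fields
from typing import Any, Dict, Optional, Type

def get_config_from_env(config: Type[Any]) -> Dict[str, Optional[str]]:
    # `config` is the dataclass type, e.g. HuggingFaceDatasetConfig;
    # dataclasses.fields accepts either the class or an instance.
    return {f.name: os.getenv(f.name.upper()) for f in fields(config)}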