Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support downloading models from OCI Registries #1756

Merged
merged 4 commits into from
Jul 24, 2024

Conversation

jaideepr97
Copy link
Member

@jaideepr97 jaideepr97 commented Jul 17, 2024

Check if repository supplied during ilab model download meets the URL structure of an OCI registry, if so leverage skopeo to copy image layers into cache, and apply mapping logic to move the model files into the permanent models directory. This PR also refactors existing ilab model download code to introduce base and implementation classes to support multiple downloading backends (HF vs OCI)

assumes skopeo is installed on the system

based off of #1687 (comment)

Issue resolved by this Pull Request:
Resolves #1687

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the
    conventional commits.
  • Changelog updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

@mergify mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Jul 17, 2024
@mergify mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Jul 18, 2024
@mergify mergify bot added the ci-failure PR has at least one CI failure label Jul 18, 2024
@mergify mergify bot removed the ci-failure PR has at least one CI failure label Jul 18, 2024
@jaideepr97 jaideepr97 marked this pull request as ready for review July 18, 2024 16:24
@mergify mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Jul 18, 2024
@jaideepr97 jaideepr97 changed the title [WIP] feat: Support downloading models from OCI Registries feat: Support downloading models from OCI Registries Jul 18, 2024
@mergify mergify bot added the ci-failure PR has at least one CI failure label Jul 18, 2024
@mergify mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Jul 18, 2024
Copy link
Contributor

@cdoern cdoern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good so far. As for tests. I'd stick with e2e for this one.

The best way is probably to mock up a local registry? and then pass that in as the place to download from. Or point it to some registry we know for now will always be there.

unit will be hard but you can test the specific functions individually. You can modify the existing download tests to still work for the HF path though

@@ -541,3 +541,20 @@ def split_hostport(hostport: str) -> tuple[str, int]:
if not hostname or not port:
raise ValueError(f"Invalid host-port string: '{hostport}'")
return hostname, port


def is_oci_repo(repo_url: str) -> bool:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that check for a single string, I'd remove this function and do the check with an if statement.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just made this a function because our oci check is very simple right now, but i thought we may need to make this more robust in the future, so we could develop this logic later and it would be more testable this way

@jeremyeder
Copy link
Member

I was able to download models using this branch. Thanks!

Downloading model from OCI registry: docker://quay.io/ai-lab/models/granite-7b-lab@latest to /home/jeder/.local/share/instructlab/models...
Getting image source signatures
Copying blob 88bc90d34dff done   |
Copying blob 5d96ed7903c0 done   |
Copying blob e8285413be69 done   |
Copying blob 59bbfb6b312b done   |
Copying blob 078e999c62c3 done   |
Copying blob f36673df9c7f done   |
Copying blob da5c4bd66dca done   |
Copying blob 9e556afd4421 done   |
Copying blob 6c507cde1b53 done   |
Copying blob afa302f67a27 done   |
Copying blob 72ce22c1313f done   |
Copying blob d16bf783cb66 done   |
Copying blob b4cbcd39f603 done   |
Copying config 44136fa355 done   |
Writing manifest to image destination```

Copy link
Contributor

@leseb leseb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with the general state of the PR. The last thing I'd like to add (despite the existing comment) is a small functional test in scripts/functional-tests.sh. It's also almost time to squash your commits. Thanks!

@jaideepr97
Copy link
Member Author

@leseb I was thinking about adding an e2e test in scripts/basic-workflow-tests.sh instead, so that the completion of the e2e test would automatically validate that the model was downloaded correctly and in a usable state
Lmk how that sounds to you

@leseb
Copy link
Contributor

leseb commented Jul 19, 2024

@leseb I was thinking about adding an e2e test in scripts/basic-workflow-tests.sh instead, so that the completion of the e2e test would automatically validate that the model was downloaded correctly and in a usable state Lmk how that sounds to you

Does that mean not testing HF in e2e anymore? I believe the HF download is our main entry point so we should test it by default.

Any functional test is good! They both validate the model in different ways, and the scripts/basic-workflow-tests.sh is more advanced, takes more time to run, and is not free. IMO, if download + serve + chat is sufficient (it should be). I'm leaning towards scripts/basic-workflow-tests.sh simply because the GitHub runners are free. Thanks!

Copy link
Contributor

mergify bot commented Jul 19, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @jaideepr97 please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase This Pull Request needs to be rebased label Jul 19, 2024
@mergify mergify bot added the ci-failure PR has at least one CI failure label Jul 24, 2024
@mergify mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Jul 24, 2024
@mergify mergify bot removed the ci-failure PR has at least one CI failure label Jul 24, 2024
@mergify mergify bot added the one-approval PR has one approval from a maintainer label Jul 24, 2024
leseb added 2 commits July 24, 2024 08:45
The version on the Ubuntu repository is old (1.4.0), at the time of
writting skopeo latest stable version is 1.15.2.

The CI was failing with the following error:

```
FATA[0126] creating an updated image manifest: preparing updated
manifest, layer
"sha256:49ec71d70c57bd3256964b9f4f622da8a2457094a8bdfbb9f8e54781a54d139d":
unsupported MIME type for compression: text/plain
```

Upgrading to a newer version made the issue go away.

Signed-off-by: Sébastien Han <seb@redhat.com>
We current suppor the 1.15 version, so the code will now check the
version and report if not 1.15. We don't fail, but warn the user that
downloading the model might fail.

Signed-off-by: Sébastien Han <seb@redhat.com>
@leseb leseb added the hold In-progress PR. Tag should be removed before merge. label Jul 24, 2024
@leseb leseb force-pushed the oci-download branch 2 times, most recently from 2690244 to 01e66d6 Compare July 24, 2024 08:38
@mergify mergify bot added the ci-failure PR has at least one CI failure label Jul 24, 2024
* use logger.isEnabledFor(logging.DEBUG) to determine if DEBUG level is
  set
* correctly catch FileNotFoundError and print it
* a few useful debug commands
* fix functional test file matcher

Signed-off-by: Sébastien Han <seb@redhat.com>
@mergify mergify bot removed the ci-failure PR has at least one CI failure label Jul 24, 2024
Copy link
Contributor

@booxter booxter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm leaving some more comments. I think most can be ignored or taken as you see fit. There are two comments that I'd like to be addressed for sure, but these may go into a separate PR as a follow-up if you don't mind.

  1. Move the click Exception raise for skopeo version from the general download function into OCIDownloader.
  2. Raise an error on an empty blob map.

except json.JSONDecodeError as e:
raise ValueError(f"could not read JSON file: {file_path}") from e
except Exception as e:
raise ValueError("unexpected error occurred:") from e
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove :?

return re.match(pattern, repo_name) is not None


def _load_json(file_path: Path):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(No action required) These functions seem closely coupled to download module. (The sha one also to OCI implementation.) I'd put them back into download.py.

raise ValueError("unexpected error occurred:") from e


def _extract_SHA(SHAstr: str):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: pep8 requires lower case names for variables: https://peps.python.org/pep-0008/#function-and-variable-names

Download specified model from Hugging Face
"""
click.echo(
f"Downloading model from Hugging Face : {self.repository}@{self.release} to {self.download_dest}..."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove space character before :.


except Exception as exc:
click.secho(
f"Downloading GGUF model failed with the following Hugging Face Hub error: {exc}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: double-space after Face.

raise ValueError(
f"could not find hash for index file at: {oci_model_path}"
)
except Exception as exc:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This except block is redundant since it just re-raises the exception back. You can remove it and try and this will also un-indent the code.


That said, I wonder if instead of that, you'd want to reraise ValueError?


Same comments below (line 173).

manifest["mediaType"]
== "application/vnd.oci.image.manifest.v1+json"
):
match = _extract_SHA(manifest["digest"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

break immediately, otherwise it continues to the next manifest. (I don't think that's what you indend?)

f"failed to build OCI model file mapping from: {blob_dir}/{index_hash}"
) from exc

return oci_model_file_map
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Sorry I haven't noticed it before). So if e.g. no layer matched in line 180, then the returned dictionary will be empty. And then download will do nothing, silently returning (instead of e.g. logging an error.) Should we raise an error here, or in download, when the dictionary is empty?

)
@click.option(
"--release",
default="main", # TODO: add to config.yaml
show_default=True,
help="The git revision of the model to download - e.g. a branch, tag, or commit hash.",
help="The revision of the model to download - e.g. a branch, tag, or commit hash for Hugging Face repositories and tag or commit hash for OCI repositories.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: ditto - double space character

downloader.download()
except FileNotFoundError as exc:
click.secho(
"skopeo is not installed, please install recommended version 1.15",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a general function for any repo type (not just OCI). I think it's making an assumption that 1) OCI downloader will never raise the exception in any other scenario; 2) HF downloader will never raise it at all.

I think this except block belongs to where the skopeo presence is actually checked (inside check_skopeo_version or maybe inside the OCIDownloader.download).

@booxter
Copy link
Contributor

booxter commented Jul 24, 2024

To be clear: this can merge, my comments don't block the PR. @leseb also asked me to send the follow-up PR to fix the comments, I will do it after this PR merges.

@leseb
Copy link
Contributor

leseb commented Jul 24, 2024

I cannot approve a PR that I participated in. So waiting for other maintainers to chime in :)

Copy link
Member

@nathan-weinberg nathan-weinberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with many of @booxter's nits but non-blocking



@click.command()
@click.option(
"--repository",
default=DEFAULTS.MERLINITE_GGUF_REPO, # TODO: add to config.yaml
show_default=True,
help="Hugging Face repository of the model to download.",
help="Hugging Face or OCI repository of the model to download.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: double space here as well

@mergify mergify bot removed the one-approval PR has one approval from a maintainer label Jul 24, 2024
@nathan-weinberg nathan-weinberg removed the hold In-progress PR. Tag should be removed before merge. label Jul 24, 2024
@mergify mergify bot merged commit 2da4ccd into instructlab:main Jul 24, 2024
24 checks passed
@ktam3 ktam3 modified the milestones: 0.18.0, 0.18.0a7 Jul 24, 2024
booxter added a commit to booxter/instructlab that referenced this pull request Jul 24, 2024
The original PR:
- instructlab#1756

Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com>
booxter added a commit to booxter/instructlab that referenced this pull request Jul 24, 2024
The original PR:
- instructlab#1756

Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
testing Relates to testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ilab model download needs to be able to pull from different types of registires
9 participants