Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sparse checkout for git pulls #15824

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

tetracionist
Copy link

@tetracionist tetracionist commented Oct 28, 2024

Added support for sparse-checkout when cloning a GitHub repo (#15185).
This is ideal for teams that have larger repos who don't want to clone all the folders in their repo.

When directories are specified in the Prefect.yaml or GitHubRepository storage class, we will use sparse checkout to get the directories and any file in the root directory (cone-mode).

Example usage in a Prefect.yaml

pull:
- prefect.deployments.steps.git_clone:
    repository: https://github.com/tetracionist/prefect.git
    branch: sparse-checkout-for-git-pulls
    access_token:
    directories: [src/integrations/prefect-azure, src/integrations/prefect-dask]

Checklist

  • This pull request references any related issue by including "closes <link to issue>"
    • If no issue exists and your change is not a small fix, please create an issue first.
  • If this pull request adds new functionality, it includes unit tests that cover the changes
  • If this pull request removes docs files, it includes redirect settings in mint.json.
  • If this pull request adds functions or classes, it includes helpful docstrings.

Copy link

codspeed-hq bot commented Oct 28, 2024

CodSpeed Performance Report

Merging #15824 will not alter performance

Comparing tetracionist:sparse-checkout-for-git-pulls (2d64115) with main (40590a5)

Summary

✅ 3 untouched benchmarks

@desertaxle desertaxle linked an issue Oct 28, 2024 that may be closed by this pull request
Copy link
Member

@desertaxle desertaxle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @tetracionist! This is looking pretty good, but I have a couple of questions about this implementation after a first pass.

Uses sparse-checkout on repository
"""

cmd = ["git", "sparse-checkout", "init"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the init subcommand has been deprecated, and the set subcommand is the preferred approach based on the git docs. It seems like we would update this method to only call git sparse-checkout set with the provided directories, but let me know if that would cause issues.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be okay, I will try using git sparse-checkout set

# Limit git history and set path to clone to
cmd += ["--depth", "1", str(self.destination)]
# For sparse-checkout it is recommended to use --filter=blob:none to reduce disk-space
if self._sparse_checkout_mode:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like if sparse checkout mode is used then you won't be able to include submodules in the checkout. Is is possible to have both?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put this in the else block as I was unsure what submodules did, but are they something to do with adding code from another repo into the cloned repo?

I think this would be easy with cone mode but might be trickier without.
Would it be useful to also sparse-checkout the submodule?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, submodules allow you to reference another git repository from a parent repository. I'm not sure how sparse checkout would interact with submodules, but I think it'd be useful to apply sparse checkout to submodules, also.

Copy link
Contributor

This pull request is stale because it has been open 14 days with no activity. To keep this pull request open remove stale label or comment.

Copy link
Contributor

github-actions bot commented Dec 5, 2024

This pull request is stale because it has been open 14 days with no activity. To keep this pull request open remove stale label or comment.

@github-actions github-actions bot added the enhancement An improvement of an existing feature label Dec 13, 2024
@tetracionist
Copy link
Author

tetracionist commented Dec 13, 2024

@desertaxle Thanks for reviewing back in October, I had a chance to have another look at this PR again.

The main changes I have made are as follows:

  • Now using sparse-checkout set instead of init
  • Added an async function to check if repository is in sparse-checkout mode
  • Removed cone-mode
  • Two new tests for sparse-checkout
  • Can still use submodules if using sparse-checkout
  • I ditched the filter blob none and ended up sticking with shallow clones (after reading these docs it seemed to be the most sensible option)

I also used the configuration and flow below to test what folders get cloned and that worked as expected (brings back all items in orchestration and ingestion folder, and anything in the root directory).
Please would you be able to to review this again?

definitions:
  actions:
    pull_staging: &pull_staging
      - prefect.deployments.steps.git_clone:
          repository: https://github.com/tetracionist/repo.git
          branch: test-sparse
          access_token: '{{ prefect.blocks.secret.my-gh-token }}'
          directories: ["orchestration", "ingestion"]
from pathlib import Path

from prefect import flow
from prefect.logging import get_run_logger


@flow(name="dir check flow")
def dir_check_flow():

    logger = get_run_logger()
   
    directory = Path()
    logger.info(f"{list(directory.rglob("*"))}")


if __name__ == "__main__":
    dir_check_flow()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement An improvement of an existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sparse checkout for git pulls
2 participants