Sparse checkout for git pulls #15824
CodSpeed Performance Report: merging #15824 will not alter performance.
Thanks for the PR @tetracionist! This is looking pretty good, but I have a couple of questions about this implementation after a first pass.
src/prefect/runner/storage.py
Outdated
Uses sparse-checkout on repository
"""

cmd = ["git", "sparse-checkout", "init"]
It looks like the `init` subcommand has been deprecated, and the `set` subcommand is the preferred approach based on the git docs. It seems like we could update this method to only call `git sparse-checkout set` with the provided directories, but let me know if that would cause issues.
I think this should be okay. I will try using `git sparse-checkout set`.
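For reference, a minimal sketch of what the `set`-only approach could look like. The helper name `sparse_checkout_set_cmd` is hypothetical and is not the method in this PR; it only builds the argument list that would be run inside the cloned repository.

```python
from collections.abc import Sequence


def sparse_checkout_set_cmd(directories: Sequence[str]) -> list[str]:
    """Build a `git sparse-checkout set` command for the given directories.

    Hypothetical helper for illustration only; the PR's real method may differ.
    """
    if not directories:
        raise ValueError("at least one directory is required")
    # No separate `init` call: `set` is passed the directories directly.
    return ["git", "sparse-checkout", "set", *directories]


if __name__ == "__main__":
    print(sparse_checkout_set_cmd(["orchestration", "ingestion"]))
```

The resulting list can be passed to whatever process runner the storage class already uses, with the working directory set to the clone destination.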
src/prefect/runner/storage.py
Outdated
# Limit git history and set path to clone to
cmd += ["--depth", "1", str(self.destination)]
# For sparse-checkout it is recommended to use --filter=blob:none to reduce disk-space
if self._sparse_checkout_mode:
It looks like if sparse checkout mode is used then you won't be able to include submodules in the checkout. Is it possible to have both?
I put this in the else block as I was unsure what submodules did, but are they something to do with adding code from another repo into the cloned repo?
I think this would be easy with cone mode but might be trickier without.
Would it be useful to also sparse-checkout the submodule?
Yeah, submodules allow you to reference another git repository from a parent repository. I'm not sure how sparse checkout would interact with submodules, but I think it'd be useful to apply sparse checkout to submodules, also.
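One way the two options could coexist is to treat them as independent flags when building the clone command, rather than as branches of an if/else. The sketch below assumes a hypothetical helper `clone_cmd` and is not the code in this PR. As far as I understand, the sparse patterns would still apply only to the parent repository, so sparse checkout inside each submodule would need separate handling.

```python
def clone_cmd(
    url: str,
    destination: str,
    sparse: bool = False,
    include_submodules: bool = False,
) -> list[str]:
    """Build a `git clone` command; hypothetical helper for illustration."""
    cmd = ["git", "clone", url]
    if include_submodules:
        # Initialize and clone submodules along with the parent repository.
        cmd.append("--recurse-submodules")
    if sparse:
        # Start with a sparse working tree and skip blobs until they are needed.
        cmd += ["--sparse", "--filter=blob:none"]
    # Limit git history and set the path to clone to.
    cmd += ["--depth", "1", destination]
    return cmd


if __name__ == "__main__":
    print(clone_cmd("https://example.com/repo.git", "repo", sparse=True, include_submodules=True))
```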
This pull request is stale because it has been open 14 days with no activity. To keep this pull request open remove stale label or comment.
@desertaxle Thanks for reviewing back in October, I had a chance to have another look at this PR again. The main changes I have made are as follows:
I also used the configuration and flow below to test which folders get cloned, and that worked as expected (it brings back all items in the orchestration and ingestion folders, and anything in the root directory).

    definitions:
      actions:
        pull_staging: &pull_staging
          - prefect.deployments.steps.git_clone:
              repository: https://github.com/tetracionist/repo.git
              branch: test-sparse
              access_token: '{{ prefect.blocks.secret.my-gh-token }}'
              directories: ["orchestration", "ingestion"]

    from pathlib import Path

    from prefect import flow
    from prefect.logging import get_run_logger


    @flow(name="dir check flow")
    def dir_check_flow():
        logger = get_run_logger()
        # Path() is the current working directory, i.e. the cloned repo at run time.
        directory = Path()
        # Single quotes inside the f-string keep this valid before Python 3.12.
        logger.info(f"{list(directory.rglob('*'))}")


    if __name__ == "__main__":
        dir_check_flow()
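To make the expected result of that test concrete, here is a rough sketch of the cone-mode inclusion rule as I understand it: everything under a listed directory is kept, plus files that sit directly in the root or in an ancestor of a listed directory. The function `in_cone` is a hypothetical model for reasoning about the output, not something git or Prefect provides.

```python
from pathlib import PurePosixPath


def in_cone(path: str, directories: list[str]) -> bool:
    """Approximate whether a repo-relative file path is kept in cone mode."""
    p = PurePosixPath(path)
    root = PurePosixPath(".")
    # Files directly in the repository root are always kept.
    if p.parent == root:
        return True
    for d in directories:
        cone = PurePosixPath(d)
        # Recursive match: the file lives anywhere under a listed directory.
        if cone in p.parents:
            return True
        # Parent match: the file sits directly in an ancestor of a listed directory.
        if p.parent in cone.parents:
            return True
    return False


if __name__ == "__main__":
    dirs = ["orchestration", "ingestion"]
    for candidate in ["README.md", "orchestration/flows/etl.py", "docs/index.md"]:
        print(candidate, in_cone(candidate, dirs))
```

The file names in the usage example are made up; with the configuration above, only paths under `orchestration`, `ingestion`, or the root would be expected in the flow's log output.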
Added support for sparse-checkout when cloning a GitHub repo (#15185).
This is ideal for teams with large repos that don't want to clone every folder.
When directories are specified in prefect.yaml or on the GitRepository storage class, sparse checkout is used to fetch those directories plus any files in the root directory (cone mode).
Example usage in a prefect.yaml