Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

base: Honor .gitignore for rsync #3037

Merged
merged 3 commits into from
Aug 2, 2024
Merged

Conversation

cgwalters
Copy link
Contributor

@cgwalters cgwalters commented Jun 21, 2024

I am guessing I may be one of the first people to try using tmt
with a Rust project. The default for the cargo toolchain
is to keep a lot of cached incremental data in target/.
In my case with bootc, it's currently 20G.

A plain rsync() of this is incredibly inefficient. rsync doesn't
even use reflinks if available, though that's a distinct bug.

Use git ls-files to honor .gitignore.

Signed-off-by: Colin Walters walters@verbum.org

@cgwalters
Copy link
Contributor Author

I don't quite understand why we need even a logical copy of the context directory at all by default. I would expect tests to just not mutate it. But, if we wanted to be more sure, we could of course try to design a flow where we run tmt in a container, with the context mounted readonly.

@martinpitt
Copy link

Big +1. In Cockpit we are actually in the same boat -- a lot of our developer trees have a ~ 1 GB node_modules/ (most of which isn't needed for running tests) and often also a multi-GB test/images/ (ephemeral VM overlays for our own tests, not at all relevant for tmt), and heaven knows what other unclean stuff in the build tree. So only copying the git-tracked files would be much appreciated. I suppose it's not that obvious though -- for tmt try and iterating you actually want to test the checkout including local and uncommitted modifications. A simple git clone wouldn't do that.

tmt/base.py Outdated Show resolved Hide resolved
@cgwalters cgwalters changed the title base: Clone git repo if present instead of rsync base: Honor .gitignore for rsync Jun 21, 2024
Copy link

@martinpitt martinpitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lovely, thanks @cgwalters ! Neither my previous 👎 nor my 👍 here are authoritative of course, but this is really nice!

@LecrisUT
Copy link
Contributor

Love the idea. Could it check for the presence of .gitignore first, and then can it be made configurable? Technically tmt should work in a non-git environment as well.

@martinpitt
Copy link

This PR doesn't change anything if the directory isn't a git tree. The question is just if anyone should rely on built (i.e. not git-tracked) files to go into the tmt target. IMHO this is just confusing, inefficient, and unpredictable, so I'm all for making this not configurable. (But as always, https://m.xkcd.com/1172/ 😁 )

@LecrisUT
Copy link
Contributor

LecrisUT commented Jun 21, 2024

This PR doesn't change anything if the directory isn't a git tree

I'm not sure about that, I don't see relevant checks for if the worktree is a git near where _initialize_worktree is called. The previous workflow would always work because it was simply --exclude .git, which does nothing if it's not a git repo. An easy way to check is to actually run a simple plan+test in a non-git environment

@cgwalters
Copy link
Contributor Author

tmt/base.py Outdated Show resolved Hide resolved
@happz happz added this to the 1.35 milestone Jun 21, 2024
@happz
Copy link
Collaborator

happz commented Jun 21, 2024

Seems like a good idea, although my experience with various directories in tmt is still limited. I would just propose different $SUBJ, rsync is called a lot and this case seems to be about pushing worktree to a guest, so how about something like "Honor .gitignore when pushing worktree to guest"?

@cgwalters
Copy link
Contributor Author

@happz ping - since you know the tmt code can I please ask you to drive this PR to completion?

@cgwalters
Copy link
Contributor Author

Hmm wait, are there two rsyncs that tmt does? Oh wow...there is...we do an rsync of the worktree once into /var/tmp/tmt (which would be dramatically faster/cheaper if it used reflinks) but what I'm now hitting is that tmt apparently rsyncs the worktree into the guest again if I'm using

discover:
  how: tmt

?

Ouch...so we'd clearly need to factor out shared rsync exclusions for these two things.

(This said again, the best fix would be to avoid full copies; for the local virt one we could use virtiofs, although that doesn't work for remote VMs)

@happz
Copy link
Collaborator

happz commented Jul 26, 2024

@happz ping - since you know the tmt code can I please ask you to drive this PR to completion?

Yeah, it's still on the roadmap for 1.35, I just need to find some time to wrap it up & ask folks for review :/

@happz happz added the full test Pull request is ready for the full test execution label Jul 31, 2024
@happz
Copy link
Collaborator

happz commented Jul 31, 2024

My additions are in b7e8759.

@happz
Copy link
Collaborator

happz commented Jul 31, 2024

/packit build

Copy link
Collaborator

@martinhoyer martinhoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't rsync --filter=':- .gitignore' work in this context? I see it mentioned also in the stackoverflow question linked in comment.

EDIT: Seems to work for me when copying tmt repo. --filter=':- .gitignore' --exclude=.git

tmt/utils/__init__.py Outdated Show resolved Hide resolved
@LecrisUT
Copy link
Contributor

LecrisUT commented Jul 31, 2024

Wouldn't rsync --filter=':- .gitignore' work in this context? I see it mentioned also in the stackoverflow question linked in comment.

My main concern was if it picked up recursive .gitignore. it actually did:

$ tree -a ./
./
├── copy_this
├── .gitignore
├── other_value
├── some_folder
│   ├── copy_this_too
│   ├── .gitignore
│   └── ignore_me_too
└── some_valueA

2 directories, 7 files
$ tail .gitignore some_folder/.gitignore 
==> .gitignore <==
some_value*
/other_value

==> some_folder/.gitignore <==
ignore_me_too
$ rsync -a --filter=':- .gitignore' ./ ../test_rsync_out/
$ tree -a ../test_rsync_out/
../test_rsync_out/
├── copy_this
├── .gitignore
└── some_folder
    ├── copy_this_too
    └── .gitignore

2 directories, 4 files

👍 for such an approach IMO

@happz
Copy link
Collaborator

happz commented Jul 31, 2024

Wouldn't rsync --filter=':- .gitignore' work in this context? I see it mentioned also in the stackoverflow question linked in comment.

My main concern was if it picked up recursive .gitignore. it actually did:

$ tree -a ./
./
├── copy_this
├── .gitignore
├── other_value
├── some_folder
│   ├── copy_this_too
│   ├── .gitignore
│   └── ignore_me_too
└── some_valueA

2 directories, 7 files
$ tail .gitignore some_folder/.gitignore 
==> .gitignore <==
some_value*
/other_value

==> some_folder/.gitignore <==
ignore_me_too
$ rsync -a --filter=':- .gitignore' ./ ../test_rsync_out/
$ tree -a ../test_rsync_out/
../test_rsync_out/
├── copy_this
├── .gitignore
└── some_folder
    ├── copy_this_too
    └── .gitignore

2 directories, 4 files

Using ls-files filters out also .gitignore files, but that's minor point. TBH, I'm glad it works as it is, and unless someone shows me a larger benefit, I for one would trust git as the source of truth here.

@martinhoyer
Copy link
Collaborator

Using ls-files filters out also .gitignore files, but that's minor point. TBH, I'm glad it works as it is, and unless someone shows me a larger benefit, I for one would trust git as the source of truth here.

Well I guess one benefit would be that it works without git and perhaps speed?(haven't tested). But I don't have a strong opinion on this either way, just throwing it in for smarter people to evaluate ;)

@LecrisUT
Copy link
Contributor

This approach should also pick up .git/info/exclude right? Maybe that's an argument for such approach. An argument for the other would be that it could be expanded for non-git environments, and .tmtignore (jk).

@happz
Copy link
Collaborator

happz commented Jul 31, 2024

Using ls-files filters out also .gitignore files, but that's minor point. TBH, I'm glad it works as it is, and unless someone shows me a larger benefit, I for one would trust git as the source of truth here.

Well I guess one benefit would be that it works without git and perhaps speed?(haven't tested).

The current code also works without git, outside of a git repository, git ls-files is not called at all in such a situation, and adds nothing to the ignore list.

@happz
Copy link
Collaborator

happz commented Aug 1, 2024

/packit build

cgwalters and others added 3 commits August 2, 2024 14:28
I am guessing I may be one of the first people to try using tmt
with a Rust project. The default for the `cargo` toolchain
is to keep a *lot* of cached incremental data in `target/`.
In my case with bootc, it's currently 20G.

A plain rsync() of this is *incredibly* inefficient. rsync doesn't
even use reflinks if available, though that's a distinct bug.

Use `git ls-files` to honor `.gitignore`.

Signed-off-by: Colin Walters <walters@verbum.org>
@happz
Copy link
Collaborator

happz commented Aug 2, 2024

/packit build

@happz happz merged commit 96f6e23 into teemtee:main Aug 2, 2024
19 of 20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
full test Pull request is ready for the full test execution ready for merge The only missing piece is to do the rebase the current 'main' and let the CI finish.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants