Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Josh generates lots of redundant merge commits #952

Closed
RalfJung opened this issue Sep 29, 2022 · 23 comments
Closed

Josh generates lots of redundant merge commits #952

RalfJung opened this issue Sep 29, 2022 · 23 comments

Comments

@RalfJung
Copy link
Contributor

To reproduce

docker run -p 8000:8000 -e JOSH_REMOTE=https://github.com -v josh-vol:/data/git joshproject/josh-proxy:latest

Then in another terminal

git clone https://github.com/rust-lang/miri
cd miri
git remote add josh http://localhost:8000/rust-lang/rust.git:/src/tools/miri.git
git fetch josh
git merge josh/master

The resulting history duplicates all merge commits from the rustc repo, which is not pretty and makes me wonder why -- most of them have nothing to do with this subrepo:

image

Cc @oli-obk

@christian-schilling
Copy link
Member

So I tried it myself:
Screenshot 2022-09-29 at 15 52 07

This is the entire history of /src/tools/miri so I don't think that those are all the merge commits from the main rust repo.

The reason that those commits are kept even though they look redundant:
When adding an external repo as a subdirectory into a monorepo (what you want to do) Josh guarantees that splitting that subdirectory back out will yield the exact same sha1's like the original repo. (Most, if not all, other filtering tools do not make that guarantee)
It took quite a lot of tweaks to the commit selection rules to handle all the corner cases of wired original histories (people do strange things...) and one of those rules that proved important is:
"Never change the number of parents a commit has"
In other words: If an original commit is a merge commit, the filtered one has to be a merge commit as well.
A second (more obvious) rule is: "Keep all commits that differ from at least one of their parents"
Git usually only shows the diff to the first parent when looking at history, so some of those commits may look redundant, but they have a non empty diff with their second parent.

@RalfJung
Copy link
Contributor Author

RalfJung commented Sep 29, 2022

Thanks for taking a look!

those are all the merge commits from the main rust repo.

All the ones since we did the first josh sync.... or so I thought, but yeah it seems to be only the ones on the path from some Miri-changing commit to the tip of master. That's still quite a few, and it will be one merge commit for each PR that lands in rustc (the vast majority of which does not change Miri).

The reason that those commits are kept even though they look redundant:
When adding an external repo as a subdirectory into a monorepo (what you want to do) Josh guarantees that splitting that subdirectory back out will yield the exact same sha1's like the original repo.

This is a monorepo-to-subrepo sync though, not a sync in the other direction -- it's about which history shows up in the subrepo (Miri), which seems to now include many merge commits from the monorepo (rustc). So I feel I am misunderstanding what you are saying. Though it is also entirely possible that we are using josh the wrong way here.^^ (We originally tried git subtree which has the right behavior but entirely fails due to the size of our history.)

@christian-schilling
Copy link
Member

christian-schilling commented Sep 29, 2022

That's still quite a few, and it will be one merge commit for each PR that lands in rustc (the vast majority of which does not change Miri).

Merges where both paths since the branch of point don't affect miri will not appear in the mono-to-subrepo sync

Your use case is something that should actually be well supported. I know someone else doing basically the same for years already. So my guess is things got somehow messed up at the beginning and now it's hard to untangle 🙈

Apart from that though: I'm wondering why you even want to do the whole thing? Why not just develop miri in the rust repo? (potentially on a branch...)

@RalfJung
Copy link
Contributor Author

RalfJung commented Sep 29, 2022

Merging things into rustc does huge amounts of CI and takes time away from the big rustc bors queue. Miri having its own separate CI and issue tracker, and only occasionally merging things back upstream, works much better for everyone involved.

Merges where both paths since the branch of point don't affect miri will not appear in the mono-to-subrepo sync

But rustc is a linear history of PRs. So if there is any Miri change in rustc, all PRs since then, on the path to master, will have their merge commit mirrored into Miri -- that's exactly what we see in the history above.

I looked at using :linear but I can't imagine that would allow us to sync things both ways properly so it's probably not a solution...

So my guess is things got somehow messed up at the beginning and now it's hard to untangle

Yeah, we got started with the wrong tool and now we are sitting here with a git history that doesn't look the way josh needs it to look, or so.^^

@christian-schilling
Copy link
Member

Merging things into rustc does huge amounts of CI and takes time away from the big rustc bors queue.

That sounds oh too familiar...And that's one of the main drivers of Josh development: Speeding up CI in repos like this by allowing CI on subsets while still keeping things in one repo. It does require some work on the CI infrastructure though to really get the benefits.

Still, the "import & export"/"periodic sync" case with a separate repo is also very much in the scope of what we want to support. There is even a test case for this: https://github.com/josh-project/josh/blob/master/tests/proxy/import_export.t

Also :linear should work with two way syncs as well, though I don't know of anyone having done that in production yet.

So how to get out of the mess...
I assume the history in both repos is "immutable", as in all it's sha's should be kept?
My guess is you need to somehow start over and importing it in a clean way, so that rust-lang.git/src/tools/miri.git and rust-lang/miri.git have the same sha's. This is guaranteed if you do the initial import in one of the ways described here: https://josh-project.github.io/josh/guide/importing.html
The easiest solution would be to import in a different path in the main repo. (sucks, I know)
Another option could be implementing a new :cutoff=<sha> filter (I will need to do that for a different use case anyway) and use that to hide the messed up history.

@RalfJung
Copy link
Contributor Author

RalfJung commented Sep 29, 2022

I assume the history in both repos is "immutable", as in all it's sha's should be kept?

The main rustc repo is definitely immutable.

The Miri repo could live with a force-push, but it should preserve the history so git blame etc still makes sense.

So e.g. if we could take the history josh generates from rust-lang/rust.git:/src/tools/miri.git, and then just add a parent to the root commit of that history (i.e. 62f33e71128d395ad086bc7c1ad65d19e1075e12 should be changed to have a parent, probably that parent should be 75dd959a3a40eb5b4574f8d2e23aa6efbeb33573), that might be enough.

Also :linear should work with two way syncs as well, though I don't know of anyone having done that in production yet.

I can't wrap my head around what josh would do when pushing (a non-linear history) to that.

We don't want the Miri repo to have a linear history, we just want the history that is imported from rustc to Miri to be linear. The local history of things happening on the Miri side should have all the Miri PRs etc. Not sure if that even makes any sense in the josh world.

Another option could be implementing a new :cutoff= filter (I will need to do that for a different use case anyway) and use that to hide the messed up history.

I can't quite imagine what exactly that would do but it might help? Hard to say.

@christian-schilling
Copy link
Member

christian-schilling commented Sep 29, 2022

We don't want the Miri repo to have a linear history, we just want the history that is imported from rustc to Miri to be linear.

Not sure that's possible.

If the miri repo can live with a force push things are reasonably easy:

  1. In the main rust repo, git fetch https://localhost:8000/rust-lang/miri.git:prefix=src/tools/miri.git
  2. git merge --allow-unrelated FETCH_HEAD
  3. Get the result into master (via PR propably)
  4. In the standalone miri repo: git fetch https://localhost:8000/rust-lang/rust.git:/src/tools/miri.git
  5. Force push FETCH_HEAD as the new master of the standalone repo

Now the initial state is clean: Both remotes of the miri repo have the same sha's. Although history might looks ugly in places.
Fixing that could be possible by somehow editing FETCH_HEAD to have nicer history before doing the merge --allow-unrelated but I don't know how to do that. (apart from coding up something super custom with libgit ;))

So now you can always do a git push to .../rust-lang/rust.git:/src/tools/miri.git to get the latest miri changes into and a pull for the other way. With merges of course.

Would be good to try that out locally first to make sure everything works as expected.

@RalfJung
Copy link
Contributor Author

(still experimenting with that) the first step will re-import the entire Miri history into the rustc repo, right? We already have that entire history there, so this duplication is annoying. But if there is no alternative...

@christian-schilling
Copy link
Member

Yes. But if you already have the entire Miri history in rustc, you can just skip the first step. No need to re import.

@RalfJung
Copy link
Contributor Author

Well, we have it in the git history, but not in the src/tools/miri subfolder... the way we imported it, we have the original commits in that git repo. That's how git subtree works.

@christian-schilling
Copy link
Member

I See. In that case yes, step 1 is needed and the commits will be "duplicated".

@RalfJung
Copy link
Contributor Author

Is there a josh filter for "in all commits that lead to 75dd959a3a40eb5b4574f8d2e23aa6efbeb33573, move all files into src/tools/miri"? That, followed by :/src/tools/miri for all commits, would probably reconstruct the desired history.

@christian-schilling
Copy link
Member

christian-schilling commented Sep 30, 2022

It's funny that you suggest that today, since I also need something like this for a very different use case. So I will look into implementing that, but it can take a while.

However, even if it was already available, I'm not sure that would be such a good idea for your use case 🤔

It would mean that that special sha has to be remembered somewhere (where?) forever. Also it means that while you get nice history in the standalone miri repo, git blame etc. in the main rust repo for the miri code will always have to rely on git's implicit rename detection.

@oli-obk
Copy link

oli-obk commented Sep 30, 2022

It would mean that that special sha has to be remembered somewhere (where?) forever. Also it means that while you get nice history in the standalone miri repo, git blame etc. in the main rust repo for the miri code will always have to rely on git's implicit rename detection.

Both of these are actually not an issue 😆 We will likely provide a simple script for syncs anyway an we rarely have a need for using blame in the rustc repo on the miri subrepo

@christian-schilling
Copy link
Member

If you can wait a week or so that feature might become available.
In the meantime I would at least try (locally, without pushing to GH) if the re-import way will work how you expect.

And still. I'd recommend re-import. The little bit of extra history should not be an issue, certainly not storage wise. What problem do you expect as a result of having that part of the history duplicated?

@RalfJung
Copy link
Contributor Author

RalfJung commented Sep 30, 2022

Well, it's the entire history of Miri -- not huge but also not "little". It will certainly skew the Github commit stats (how many commits by which author) badly, and just generally spam the history with tons of redundant noise that I would rather avoid.

Also if we want to use this for other rustc subtools that already use git subtree, we will have the exact same problem.

@christian-schilling
Copy link
Member

Ah I see, makes sense.
I will see about implementing that filter...

@RalfJung
Copy link
Contributor Author

I might have some time this weekend to take a look. :D But I have zero experience with this codebase, obviously.

@RalfJung
Copy link
Contributor Author

(Also btw the discussion here now has nothing to do with the multiple merge requests any more and is more about the problem in #950.)

@RalfJung
Copy link
Contributor Author

RalfJung commented Oct 2, 2022

Looking at more recent history, indeed josh does not include every merge commit on the path to current master -- but it includes a lot of them, and I have a hard time figuring out why some are included and some are not.

image

So rust-lang/rust@b30c886 gets mirrored and rust-lang/rust@8a497b7, but the other two PRs that were merged in between do not. I guess it has something to do with which commit those PR feature branches were based on. It might be something like, if there is a Miri change in between the base commit of that PR and when the PR landed, then that PR's merge commit is included in the subfolder history?

@christian-schilling
Copy link
Member

christian-schilling commented Oct 2, 2022

It might be something like, if there is a Miri change in between the base commit of that PR and when the PR landed, then that PR's merge commit is included in the subfolder history?

Yes, exactly. So there are miri changes being made directly in the full repo?

@RalfJung
Copy link
Contributor Author

RalfJung commented Oct 2, 2022

Ah, okay. That makes sense I guess. For us it will mean a whole bunch of noise in the Miri repo, since rustc has a lot of parallel feature branch development going on, but I can see why it would be important to preserve those merges.

So there are miri changes being made directly in the full repo?

Yes. Any time someone does a rustc change that would break the Miri test suite, the fix for that has to be done directly in the monorepo. That is in fact the main point of using subtree/josh rather than submodules: it lets us handle that kind of change much better.

@RalfJung
Copy link
Contributor Author

RalfJung commented Oct 2, 2022

This issue can be closed then, since the merge commits are expected. We can keep discussing the subtree filtering situation in #961.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants