Moving binaries out of the repo #1805
Comments
can you update the link, I think it's not pointing to anything ? |
Done, thanks. Would've thought the comment parser would pick up wiki-style links. |
The push to use Apothecary should help with this. I think the idea was for this to be phased: [1] get everything building with Apothecary, [2] remove binaries and build dependencies as part of the release process. |
Yes, indeed, @pizthewiz. |
just used this tool successfully on another repo: https://rtyley.github.io/bfg-repo-cleaner/ really easy, and offers a lot of control. was going to add it to the wiki page but no one gets notifications for that :) |
yeah, I've seen this tool before, it looks promising. I thought I had already added it to the wiki article. Great to hear it worked well for you! |
I did a quick test with BFG Repo-Cleaner after seeing a checkout of the current openFrameworks branch is
Test results after running BFG Repo-Cleaner: https://rtyley.github.io/bfg-repo-cleaner/ After a full binary removal (.a, .exe, .lib, .so, .dll, .app) from all history the repo size is reduced to:
So the concern I heard a while ago was that a repo clean like this destroys the history and breaks forks. Well, I tested both and my results were surprisingly positive, with 1950+ binaries removed. Mirrored repo from current openFrameworks/openFrameworks (today) and force updated after running BFG Repo-Cleaner.
The bad:
I'm yet to re-upload the 0.9.0 binaries to test the size after "latest" binaries are back in the structure. Full command line run for reference:
Complete BFG Binary Removal Log: (Lists all the binaries individually removed, each is a unique binary)
|
Summon @openframeworks/core |
Awesome work on test-driving this!
This is the part that breaks the history, since a user's fork and a post-BFG repo show two different hashes for the "same" commit (where "same" means same title and branching ancestry) - if it has a different hash, it's a different commit. Btw, I don't think we will have to re-run the BFG (or any cleaning solution) multiple times - we should run it only when we've cleaned up our act, and our workflow doesn't add new binaries to the repo. The primary (big big) disadvantage of using the BFG to clean the binaries out is that this will lose all the information about the binaries - they are just deleted after all! This means the repo's history basically becomes useless since we won't be able to use a previous state in any meaningful way (since all the binaries are missing!) (e.g. try to compile 0.8.4 or 0.9.0 tagged commit after cleaning) This problem is avoided by the multiple methods (git-media, git-annex, git LFS,...) analyzed in the wiki page I've linked in the OP - they preserve the metadata about the binaries, just move the big files themselves out of the repo "somewhere" else, and this is probably the way to go if we want to keep a usable OF repo history. I'm not saying one or the other approach is better, I'm only saying if we won't use the repo history, we don't need the BFG, and if we use the BFG, we won't (be able to) use the history. |
that looks really interesting. the main concern was that when you do this the commit hashes change, so PRs and issues will lose the reference to the original commit. looking through the first commits in the repo this seems to preserve the same commit hashes but probably changes the ones in which there were binaries? i agree with @bilderbuchi, we shouldn't bring back the latest binaries after doing this and just keep them somewhere else |
After releasing 0.9.0, I wanted to find the difference in size between 0.8.4 and 0.9.0, both in working directory and in repo and/or .git directory size - this would give some kind of feel for how much "binary churn overhead" we are incurring (as we recompiled lots of libraries often, on the way to 0.9.0). |
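One rough way to get the working-directory versus `.git` size comparison described above is to sum file sizes on disk for a checkout at each tag. This is a minimal sketch, not a tool used in this thread: the function names `dir_size` and `checkout_sizes` are hypothetical, and it assumes a normal (non-bare) checkout with a `.git` directory at the top level.

```python
import os

def dir_size(path, skip_dirs=()):
    """Sum file sizes in bytes under path, skipping directory names in skip_dirs."""
    total = 0
    for root, dirs, files in os.walk(path):
        # Prune in place so os.walk does not descend into skipped directories.
        dirs[:] = [d for d in dirs if d not in skip_dirs]
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total

def checkout_sizes(repo_path):
    """Return (working_tree_bytes, git_dir_bytes) for a non-bare checkout."""
    working = dir_size(repo_path, skip_dirs=(".git",))
    history = dir_size(os.path.join(repo_path, ".git"))
    return working, history
```

Running `checkout_sizes` on a fresh clone checked out at 0.8.4 and again at 0.9.0 would give the two pairs of numbers to compare. Note that the `.git` figure depends on how well the repo is packed, so running `git gc` first makes the comparison fairer.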
The first commit near the tree root where a binary is ripped out changes all subsequent/downstream hashes, so effectively all commits except for ancient history get changed. |
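The cascade described above can be illustrated with a toy model. This is a simplified sketch, not git's actual object format: a real commit hash also covers author, committer, and message, while here the hypothetical `commit_id` hashes only the parent id and a string standing in for the tree. The file name `lib.a` and the snapshot strings are made up for illustration.

```python
import hashlib

def commit_id(parent_id, tree):
    """Toy commit hash: SHA-1 over the parent id and the tree contents."""
    return hashlib.sha1((parent_id + "\n" + tree).encode()).hexdigest()

def chain_ids(trees):
    """Return the commit id for each tree snapshot, oldest first."""
    ids, parent = [], ""
    for tree in trees:
        parent = commit_id(parent, tree)
        ids.append(parent)
    return ids

# A binary is added in the second commit and kept afterwards.
original = ["src v1", "src v1 + lib.a", "src v2 + lib.a", "src v3 + lib.a"]
# The cleaned history is the same, minus the binary.
cleaned = [t.replace(" + lib.a", "") for t in original]

before, after = chain_ids(original), chain_ids(cleaned)
changed = [old != new for old, new in zip(before, after)]
print(changed)  # -> [False, True, True, True]
```

Only the commit before the binary first appears keeps its id. Every later commit changes, even where its own content is untouched, because its parent id changed.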
oh, yeah i was checking the wrong repo :S yes losing the commit hashes is kind of problematic. if we could create some tool that recreated the GitHub issues and PRs pointing to the new hashes it would be relatively ok, but losing the PRs, mostly those not merged yet but even the old ones, is a problem |
I'll try out this with a fresh fork and add some random PRs to it. See what happens ;) |
Yeah the release tags will become useless as a method of downloading anything but the latest. For example: https://github.com/danoli3/ofxiOSBoost/releases/tag/1.56.0-libc%2B%2B We really shouldn't hold onto the old binaries just to keep them in tags. They should be releases and separate from the git history regardless, that is why tagging has the upload releases section.
All of the proposed solutions don't remove the files from repo history and will still require a binary nuke to remove them from the git checkout (being stored in everyone's .git for the repo).
The major issue currently seems to be the PR / commit issue. Let's see how she handles it. |
Just to be clear. git-media, git-annex and even git-LFS ALL use git-filter-branch to achieve their own results.
And that is precisely what it does... it re-writes all hashes of commits after a change in an offending commit. So exactly the same situation as above... Basically even GitHub say to use BFG here: (They used to say how to use git-filter-branch on this page, however now replaced it with "Only USE BFG" by omission lol) |
It's looking like old PRs cannot be merged unless updated after something like this... else it will basically infect the new hash tree with potentially old hashes... Based on this comment:
I'm still going to run the test though |
Okay test complete.
|
this is incorrect, most remove them and store them somewhere else, and only a metadata placeholder with a pointer to the binary location will be in the repo
sure, but in contrast to BFG, they don't disappear, but are still available (from a rackspace or AWS or github server, for example)
yes. even to only move them out of the way with the other methods
we could try and host our own LFS server, as soon as it's stabilized a bit.
of course, "here" in your case means the use case of totally nuking files from the repo (e.g. wrongly committed passwords), so this is slightly misleading. we, on the other hand, would optimally still retain a working repo with history afterwards.
yes, that's what I wrote above, no? "This also means that all those users will not be able to PR against the cleaned repo (at least, without pulling binaries in)" |
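To make the "metadata placeholder" idea above concrete, here is a sketch of what Git LFS commits in place of a large file: a short text pointer carrying the content's SHA-256 and byte size, while the content itself lives in the LFS store. The function name `lfs_pointer` is hypothetical; the three-line layout follows the published Git LFS pointer format, but this is an illustration rather than a replacement for the real `git lfs` client.

```python
import hashlib

def lfs_pointer(data: bytes) -> str:
    """Build the text of a Git LFS style pointer for the given file content."""
    digest = hashlib.sha256(data).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{digest}\n"
        f"size {len(data)}\n"
    )

# A multi-megabyte library is represented in the repo by roughly 130 bytes.
pointer = lfs_pointer(b"\x00" * 5_000_000)
print(len(pointer))
```

Because the pointer records the hash and size, an old tag can still name exactly which binary it needs, which is the difference from simply deleting the files.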
in any case, the discussion is kinda premature anyway, since we first have to completely move to a workflow where we add no new binaries to the repo. apothecary is a first great step in that direction. when that has been established and works, we can consider how we solve the history issue. this is probably quite a long time away, so we can hope for some technical progress on that front (e.g. LFS appeared years after I initially started looking into this issue). At that point (we're maybe even near OF 1.0 at the time), we can still decide if we just deprecate/archive the old repo, and move everyone over to a new one, without binaries. Or, maybe some great solution is available which solves the present problems better (probably not). |
Just to offer a small clarification, since version 1.12.5, the BFG has supported converting Git repos to git-lfs format - moving big files from Git history into the LFS store: $ java -jar ~/bfg-1.12.5.jar --convert-to-git-lfs '*.exe' --no-blob-protection |
@rtyley nice, that would be pretty useful! any particular reason why this feature is not mentioned on the BFG homepage? |
I've been busy 😄 Also, I wanted to gather a bit of feedback from early users before pointing everyone at it - GitHub's LFS support isn't currently suitable for some users. |
I wonder if it would be possible to use github's api to recreate the issues and PRs to point to the new hashes. i guess if one could get a correspondence index between old -> new hashes creating the messages pointing to the new commits would be doable not sure about recreating the actual commits in PRs. i guess it would be pretty complex but not sure if it's doable at all. |
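The text-rewriting half of this idea can be sketched as follows, assuming an old-to-new hash mapping is already available as a dictionary of full 40-character hashes. The function name `remap_hashes` is hypothetical, and fetching or updating comments through GitHub's API is left out entirely; this only shows how references inside a comment body might be translated, including abbreviated hashes.

```python
import re

# Runs of 7 to 40 lowercase hex digits that stand alone as a word.
HEX_RUN = re.compile(r"\b[0-9a-f]{7,40}\b")

def remap_hashes(text, old_to_new):
    """Replace full or abbreviated old commit hashes in text with new ones."""
    def swap(match):
        ref = match.group(0)
        hits = [new for old, new in old_to_new.items() if old.startswith(ref)]
        # Only rewrite unambiguous references; leave everything else alone.
        if len(hits) == 1:
            return hits[0][:len(ref)]
        return ref
    return HEX_RUN.sub(swap, text)

mapping = {"a" * 40: "b" * 40}  # made-up hashes for illustration
print(remap_hashes("fixed in aaaaaaa", mapping))  # -> fixed in bbbbbbb
```

The linear scan over the mapping is fine for a sketch but would want a prefix index for a repo with tens of thousands of commits. As noted in the comment above, this does nothing for the commits inside open PRs, which would still need rebasing.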
It may be possible to just pull / rebase the PR to get it working again (however this requires the original PR author to make this change...). I'll try this out though.
|
yeah, PRs have to be rebased (and any binary changes in them cleaned). yes, with the API you can edit comments, so this should be possible or at least feasible. |
Definitely need to get the PR's down haha! In the meantime, we should upload the releases to the tags on github, as uploaded on openframeworks.cc for the different platforms. Anyone got a good upload speed? ;D! So just need to add "Release Notes" and then can add zips to each tag. |
why? we already have the uploads on of.cc, why duplicate them here, what's the benefit outweighing the drawbacks (more work, several sources for releases->confusion)? |
Well for someone who is just looking at this repo, it's very confusing to not have them here. No link or anything on here to them aside from the main website link at the top. |
Arguable, the main README could probably use a restructuring - the text "To grab a copy of openFrameworks for your platform, check the download page on the main site." is currently in a "Developers" section, and could maybe be made more prominent. Release notes with links to the relevant download section on of.cc sounds good, but I'm against putting the same binaries in two places. |
You can use: https://github.com/bozaro/git-lfs-migrate.git to migrate files to git-lfs without breaking release tags. |
At this time, the repo has grown to 2.31 GiB. |
Ping |
Binaries bloat the repo because git can't diff them effectively, so every time you update a binary, the repo grows by roughly that file's size, whereas for text files only the diff is effectively stored. The repo gets larger and slower, and on a checkout users also have to pull all those old binary files down, too (primary problem according to TAZ).
This is a pretty big, difficult and long-term issue, so I collected my findings so far in a Wiki page: https://github.com/openframeworks/openFrameworks/wiki/Moving-binaries-out-of-the-repo (feel free to add your wisdom!), but I thought an issue would be more efficient for discussion.