Moving binaries out of the repo #1805

Open
bilderbuchi opened this issue Jan 9, 2013 · 33 comments

Comments

@bilderbuchi
Member

Binaries bloat the repo because git can't diff them or store them as compact deltas, so every time you update a binary the repo grows by roughly that file's size, whereas for text files effectively only the diff is stored. The repo gets larger and slower, and when cloning users also have to pull all those old binary files down (the primary problem according to TAZ).

This is a pretty big, difficult and long-term issue, so I collected my findings so far in a Wiki page: https://github.com/openframeworks/openFrameworks/wiki/Moving-binaries-out-of-the-repo (feel free to add your wisdom!), but I thought an issue would be more efficient for discussion.

@ofZach
Contributor

ofZach commented Jan 9, 2013

can you update the link? I think it's not pointing to anything.

@bilderbuchi
Member Author

Done, thanks. Would've thought the comment parser would pick up wiki-style links.

@pizthewiz
Member

The push to use Apothecary should help with this. I think the idea was for this to be phased: [1] get everything building with Apothecary, then [2] remove binaries and build dependencies as part of the release process.

@bilderbuchi
Member Author

Yes, indeed, @pizthewiz.

@kylemcdonald
Contributor

just used this tool successfully on another repo: https://rtyley.github.io/bfg-repo-cleaner/

really easy, and offers a lot of control. was going to add it to the wiki page but no one gets notifications for that :)

@bilderbuchi
Member Author

yeah, I've seen this tool before, it looks promising. I thought I had already added it to the wiki article. Great to hear it worked well for you!

@danoli3
Member

danoli3 commented Dec 18, 2015

I did a quick test with BFG Repo-Cleaner after seeing that a checkout of the current openFrameworks branch is

  • 1.43 GB

Test results after running BFG Repo-Cleaner:

https://rtyley.github.io/bfg-repo-cleaner/

After a full binary removal (.a, .exe, .lib, .so, .dll, .app) from all history the repo size is reduced to:

  • 111 MB
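To put the two figures above in perspective, here is a small arithmetic sketch. It takes both numbers at face value and assumes 1 GB = 1000 MB; the variable names are made up for the example.

```shell
#!/bin/sh
# Sizes reported in the comment above, in MB (assumes 1 GB = 1000 MB).
before_mb=1430
after_mb=111

# Integer percentage of the repository size removed by the cleanup.
saved_pct=$(( (before_mb - after_mb) * 100 / before_mb ))
# How many times smaller the cleaned repository is (integer part only).
shrink_factor=$(( before_mb / after_mb ))

echo "removed ${saved_pct}% of the repo; roughly ${shrink_factor}x smaller"
```

Shell arithmetic is integer-only, so both results are rounded down.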

The concern I heard a while ago was that a repo clean like this destroys the history and breaks forks.

Well, I tested both, and the results were surprisingly positive:

1950+ Binaries Removed

Mirrored repo from the current openFrameworks/openFrameworks (today), force-updated after running BFG Repo-Cleaner:

  • https://github.com/danoli3/openFrameworksBFG
  • History retained perfectly with original authors.
  • Hashes are recalculated for each commit
  • Date/Time is all preserved
  • Forks are retained at their current state (not forced into the new commit hash tree), until pulled / updated
    • I tested this by forking the openFrameworksBFG repo before running BFG repo cleaner.
    • Forked repo: https://github.com/danoli4/openFrameworksBFG. It remains at the original commit hashes / history from when it was forked, with no loss of binaries after the origin was force-updated. (So it would be possible to make an openFrameworksLegacy repo as it currently stands, with a full clone of the current hash structure plus the old binaries.)
  • Note: I had to remove the binaries from master / HEAD first to get them removed from all of history (the BFG leaves the contents of the latest commit untouched by default, and those binaries have definitely been updated multiple times under the same name).
  • Other good news: once we run this, future BFG runs will only rewrite the hash history for the affected commits. So once we do this, we won't ever be recalculating the commit hashes for commits older than the first "offending" commit. Once a commit is modified to remove the binaries, the hashes of all commits after it need to be recalculated.

The bad:

  • It will break "release" tags in the sense those tags will not include the binaries with them.
  • This being said, we can add notes to each tag on github and add the link to the physical download from openFrameworks website for older versions, even add the .zip with the binaries in them directly in the release notes: Example: https://github.com/openframeworks/openFrameworks/releases/new?tag=0.9.0

I'm yet to re-upload the 0.9.0 binaries to test the size after "latest" binaries are back in the structure.
Once I do that we will have a clearer picture of how big the repo "truly" is at a stable master.

Full command line run for reference:

# Mirror the repo (bare, all refs), then make a working clone of that local mirror
git clone --mirror https://github.com/danoli3/openFrameworksBFG.git
git clone openFrameworksBFG.git oFBFG
cd oFBFG
# Remove binaries from master / HEAD for complete history removal of binaries
# (-prune stops find from descending into .app bundles it is about to delete)
find . -name "*.a" -exec rm -rf {} +
find . -name "*.so" -exec rm -rf {} +
find . -name "*.dll" -exec rm -rf {} +
find . -name "*.app" -prune -exec rm -rf {} +
find . -name "*.exe" -exec rm -rf {} +
find . -name "*.lib" -exec rm -rf {} +
# Stage the deletions, commit, and push the commit back to the local mirror
git add -A
git commit -m "Removal of Binaries"
git push
cd ../

# Run BFG to remove binaries from all history
java -jar bfg.jar --delete-files '*.{a,exe,lib,so,dll,app}' openFrameworksBFG.git

# Expire the reflog and garbage-collect so the stripped blobs are actually dropped
cd openFrameworksBFG.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive
cd ../

# done


Complete BFG Binary Removal Log: (Lists all the binaries individually removed, each is a unique binary)
https://gist.github.com/danoli3/0055e81b9c0830b6090c
E.g.:

e87ea78bbc90bbc24597b0d49317d605882db4e5 788480 3DModelLoaderExample.exe
c679e6dae8d7a329baf0017fade09c3576eb6d73 4627456 Assimp32.dll
3f66cfe797598f37d2647fe3538141b6a5908006 240560 CppUnit.a
09dadd785951b62e6f6ee522bef2ad37807bcae7 113920 CppUnit.a
ae29db3f0223280b8a36050a29381bba81894a16 182960 CppUnit.a
bf5e7768b0caf932d41f4e4e81b211d31c74215a 49376 CppUnit.lib
2323189c7c4bd847aaa176f7940a9120d0516363 996366 CppUnitmt.lib
e5b1afe441c6d1b4cabad6da73fbeb38186753a6 2404352 FreeImage.dll
0562f4a0ea221ccfda6f28adf93f826f62e338c3 6402560 FreeImage.dll
075f1a75cbf07d16a216e4321398893ad92039f0 1089536 FreeImage.dll
cb4cf6ef990e7ef1324b06bf2a166c005cae6a8f 5647872 FreeImage.dll
2155170ecc019b4f582e1d61ca5ab4dbb11abd39 2713600 FreeImage.dll
...
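Each log line above appears to have the shape `<blob id> <size in bytes> <file name>`. Assuming that layout holds for the whole log, a minimal sketch for totalling how much each file name contributed could look like this. The `summarize` helper is hypothetical, and the three sample lines are copied from the excerpt.

```shell
#!/bin/sh
# Sum the size column (bytes) per file name for lines shaped like
# "<blob id> <size> <file name>". File names containing spaces are not handled.
summarize() {
  awk '{ bytes[$3] += $2; versions[$3]++ }
       END { for (f in bytes) printf "%s %d versions %d bytes\n", f, versions[f], bytes[f] }'
}

summary=$(summarize <<'EOF'
3f66cfe797598f37d2647fe3538141b6a5908006 240560 CppUnit.a
09dadd785951b62e6f6ee522bef2ad37807bcae7 113920 CppUnit.a
ae29db3f0223280b8a36050a29381bba81894a16 182960 CppUnit.a
EOF
)
echo "$summary"
```

Feeding the full log into `summarize` instead of the here-document would show which libraries account for most of the churn.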

@danoli3
Member

danoli3 commented Dec 18, 2015

Summon @openframeworks/core

@bilderbuchi
Member Author

Awesome work on test-driving this!

Hashes are recalculated for each commit

This is the part that breaks the history, since a user's fork and a post-BFG repo show two different hashes for the "same" commit (where "same" means same title and branching ancestry) - if it has a different hash, it's a different commit.
This also means that all those users will not be able to PR against the cleaned repo (at least, without pulling binaries in). As found previously, and as you also guess, a Legacy repo is probably still the cleanest way to do all this.

Btw, I don't think we will have to re-run the BFG (or any cleaning solution) multiple times - we should run it only when we've cleaned up our act, and our workflow doesn't add new binaries to the repo.

The primary (big big) disadvantage of using the BFG to clean the binaries out is that this will lose all the information about the binaries - they are just deleted after all! This means the repo's history basically becomes useless since we won't be able to use a previous state in any meaningful way (since all the binaries are missing!) (e.g. try to compile 0.8.4 or 0.9.0 tagged commit after cleaning)
At that point, why bother with cleaning the history if you can't use it? Might as well throw it away, and keep it for historic purposes in a legacy repo...

This problem is avoided by the multiple methods (git-media, git-annex, git LFS,...) analyzed in the wiki page I've linked in the OP - they preserve the metadata about the binaries, just move the big files themselves out of the repo "somewhere" else, and this is probably the way to go if we want to keep a usable OF repo history.

I'm not saying one or the other approach is better, I'm only saying if we won't use the repo history, we don't need the BFG, and if we use the BFG, we won't (be able to) use the history.

@arturoc
Member

arturoc commented Dec 18, 2015

that looks really interesting. the main concern was that when you do this the commit hashes change, so PRs and issues will lose their references to the original commits. looking through the first commits in the repo this seems to preserve the same commit hashes but probably changes the ones in which there were binaries?

i agree with @bilderbuchi, we shouldn't bring back the latest binaries after doing this and just keep them somewhere else

@bilderbuchi
Member Author

After releasing 0.9.0, I wanted to find the difference in size between 0.8.4 and 0.9.0, both in working directory and in repo and/or .git directory size - this would give some kind of feel for how much "binary churn overhead" we are incurring (as we recompiled lots of libraries often, on the way to 0.9.0).
Sadly, I never got around to doing it (I think you'd have to clone the tags directly to get a fair comparison, so that a fresh 0.8.4 repo doesn't already have the history, and thus the size, up to 0.9.0 in it).

@bilderbuchi
Member Author

looking through the first commits in the repo this seems to preserve the same commit hashes but probably changes the ones in which there were binaries?

The first commit near the tree root where a binary is ripped out changes all subsequent/downstream hashes, so effectively all commits except for ancient history get changed.
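A toy model may make this concrete. In git, a commit's id is computed over its content and its parent's id, so changing one commit changes every id after it. The sketch below imitates that with `cksum` standing in for git's real hash; the `chain_ids` helper and the "commit" contents are invented for illustration and are not how git is implemented.

```shell
#!/bin/sh
# Toy commit chain: each id is a checksum over the parent's id plus the
# commit's own content. cksum is only a stand-in for git's real hash.
chain_ids() {
  parent=""
  for content in "$@"; do
    parent=$(printf '%s|%s' "$parent" "$content" | cksum | cut -d' ' -f1)
    printf '%s\n' "$parent"
  done
}

# The same four "commits", except that the second one loses its binary
# in the cleaned history.
original=$(chain_ids "readme" "src+lib.a" "fix typo" "new example")
cleaned=$(chain_ids "readme" "src" "fix typo" "new example")

echo "original ids:"; echo "$original"
echo "cleaned ids:";  echo "$cleaned"
```

The first id is identical in both lists, while the second and every later id differ, even though the last two "commits" have the same content in both histories.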

@arturoc
Member

arturoc commented Dec 18, 2015

oh, yeah i was checking the wrong repo :S

yes, losing the commit hashes is kind of problematic. if we could create some tool that recreated the github issues and PRs pointing to the new hashes it would be relatively ok, but losing the PRs (mostly those not merged yet, but even the old ones) is a problem

@danoli3
Member

danoli3 commented Dec 18, 2015

I'll try this out with a fresh fork and add some random PRs to it. See what happens ;)

@danoli3
Member

danoli3 commented Dec 18, 2015

The primary (big big) disadvantage of using the BFG to clean the binaries out is that this will lose all the information about the binaries - they are just deleted after all! This means the repo's history basically becomes useless since we won't be able to use a previous state in any meaningful way (since all the binaries are missing!) (e.g. try to compile 0.8.4 or 0.9.0 tagged commit after cleaning)
At that point, why bother with cleaning the history if you can't use it? Might as well throw it away, and keep it for historic purposes in a legacy repo...

Yeah the release tags will become useless as a method of downloading anything but the latest.
However if we maintained those tags, added the release downloads to github, it will solve this directly, or forward them to the oF site download for such tags.

For example: https://github.com/danoli3/ofxiOSBoost/releases/tag/1.56.0-libc%2B%2B
(Screenshot of that release page: a source-code-only download, as well as the "release" archive with the binary library included.)

We really shouldn't hold onto the old binaries just to keep them in tags. They should be releases, separate from the git history regardless; that is why github has the upload section for releases when tagging.

This problem is avoided by the multiple methods (git-media, git-annex, git LFS,...) analyzed in the wiki page I've linked in the OP - they preserve the metadata about the binaries, just move the big files themselves out of the repo "somewhere" else, and this is probably the way to go if we want to keep a usable OF repo history.

All of the proposed solutions don't remove the files from repo history and will still require a binary nuke to remove them from the git checkout (being stored in everyone's .git for the repo).

  • I investigated Git LFS, seems to be only for private repos for the foreseeable future (you can't fork an LFS repo... LOL, useless).
  • Git Annex is a 3rd party external command that requires installation on the system... (probably not the best idea).

The major issue currently seems to be the PR / commit issue. Let's see how she handles it.

@danoli3
Member

danoli3 commented Dec 18, 2015

Just to be clear.
The git hash history MUST be re-written in order to remove the binaries.

git-media, git-annex and even git-LFS ALL use git-filter-branch to achieve their own results.
git-filter-branch by definition

Lets you rewrite Git revision history
https://www.kernel.org/pub/software/scm/git/docs/git-filter-branch.html

And that is precisely what it does... it re-writes all hashes of commits after a change in an offending commit. So exactly the same situation as above...

Basically even github says to use BFG here: (They used to say how to use git-filter-branch on this page, however they have now replaced it with "Only USE BFG" by omission lol)
https://help.github.com/articles/removing-files-from-a-repository-s-history/

@danoli3
Member

danoli3 commented Dec 18, 2015

It's looking like old PRs cannot be merged unless updated after something like this... else it will basically infect the new hash tree with potentially old hashes...

Based on this comment:
http://stackoverflow.com/a/17592246/1676524

You can prevent people from merging (more precisely pushing) the old history by writing (5 lines) an appropriate update hook on the server. Just check whether the history of the pushed head contains a specific old commit.
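A sketch of what such an update hook could look like, assuming a self-hosted git server where custom hooks can be installed (as far as I know, repos hosted on github.com cannot run custom server-side hooks). `OLD_HISTORY_COMMIT` and the function names are placeholders, not something from the linked answer.

```shell
#!/bin/sh
# Sketch of a server-side "update" hook body. git calls the hook with
# <ref name> <old rev> <new rev>; a non-zero exit rejects the push of that ref.
# OLD_HISTORY_COMMIT is a placeholder: any commit that exists only in the
# pre-cleanup history.
: "${OLD_HISTORY_COMMIT:=replace-with-a-pre-cleanup-commit-id}"

contains_old_history() {
  # $1 = pushed revision. True when the old commit is an ancestor of it.
  git merge-base --is-ancestor "$OLD_HISTORY_COMMIT" "$1" 2>/dev/null
}

check_push() {
  refname=$1
  newrev=$3
  if contains_old_history "$newrev"; then
    echo "rejected $refname: based on pre-cleanup history, please rebase" >&2
    return 1
  fi
  return 0
}

# A real hooks/update file would end with:  check_push "$@"
```

Branch deletions pass through, because the all-zero new revision is not a commit and the ancestry check simply fails.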

I'm still going to run the test though

@danoli3
Member

danoli3 commented Dec 19, 2015

Okay test complete.

@bilderbuchi
Member Author

All of the proposed solutions don't remove the files from repo history

this is incorrect, most remove them and store them somewhere else, and only a metadata placeholder with a pointer to the binary location will be in the repo

and will still require a binary nuke to remove them from the git checkout (being stored in everyone's .git for the repo).

sure, but in contrast to BFG, they don't disappear, but are still available (from a rackspace or AWS or github server, for example)

The git hash history MUST be re-written in order to remove the binaries.

yes. even to only move them out of the way with the other methods

I investigated Git LFS, seems to be only for private repos for the foreseeable future (you can't fork an LFS repo... LOL, useless).

we could try and host our own LFS server, as soon as it's stabilized a bit.

Basically even github says to use BFG here:

of course, "here" in your case means the use case of totally nuking files from the repo (e.g. wrongly committed passwords), so this is slightly misleading. we, on the other hand, would optimally still retain a working repo with history afterwards.

It's looking like old PRs cannot be merged unless updated after something like this... else it will basically infect the new hash tree with potentially old hashes...

yes, that's what I wrote above, no? "This also means that all those users will not be able to PR against the cleaned repo (at least, without pulling binaries in)"

@bilderbuchi
Member Author

in any case, the discussion is kinda premature anyway, since we first have to completely move to a workflow where we add no new binaries to the repo. apothecary is a first great step in that direction.

when that has been established and works, we can consider how we solve the history issue. this is probably quite a long time away, so we can hope for some technical progress on that front (e.g. LFS appeared years after I initially started looking into this issue). At that point (we're maybe even near OF 1.0 at the time), we can still decide if we just deprecate/archive the old repo and move everyone over to a new one without binaries. Or maybe some great solution will be available that solves the present problems better (probably not).

@rtyley

rtyley commented Dec 19, 2015

sure, but in contrast to BFG, they don't disappear, but are still available (from a rackspace or AWS or github server, for example)

Just to offer a small clarification, since version 1.12.5, the BFG has supported converting Git repos to git-lfs format - moving big files from Git history into the LFS store:

$ java -jar ~/bfg-1.12.5.jar --convert-to-git-lfs '*.exe' --no-blob-protection

@bilderbuchi
Member Author

@rtyley nice, that would be pretty useful! any particular reason why this feature is not mentioned on the BFG homepage?

@rtyley

rtyley commented Dec 19, 2015

any particular reason why this feature is not mentioned on the BFG homepage?

I've been busy 😄 Also, I wanted to gather a bit of feedback from early users before pointing everyone at it - GitHub's LFS support isn't currently suitable for some users.

@arturoc
Member

arturoc commented Dec 21, 2015

I wonder if it would be possible to use github's api to recreate the issues and PRs to point to the new hashes. i guess if one could get a correspondence index between old -> new hashes, creating the messages pointing to the new commits would be doable. not sure about recreating the actual commits in PRs. i guess it would be pretty complex, and i'm not sure it's doable at all.
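The text-rewriting half of that idea can be sketched as follows, assuming a map file with one `old new` id pair per line (I believe the BFG writes such a file, `object-id-map.old-new.txt`, into its report folder, but treat that as an assumption). The `rewrite_ids` helper and all ids are made up; fetching and updating comments through the API is left out.

```shell
#!/bin/sh
# Rewrite commit ids in a piece of comment text using an "old new" map,
# one pair per line. All ids below are invented for the example.
rewrite_ids() {
  # $1 = map file (must not be empty); the comment text is read from stdin.
  awk 'NR == FNR { map[$1] = $2; next }
       { for (old in map) gsub(old, map[old]); print }' "$1" -
}

map_file=$(mktemp)
printf '%s\n' 'aaaa111 bbbb222' 'cccc333 dddd444' > "$map_file"

result=$(printf '%s\n' 'fixed in aaaa111, reverted in cccc333' | rewrite_ids "$map_file")
rm -f "$map_file"
echo "$result"
```

Comments usually cite abbreviated ids while a map of full ids would not match them directly, so a real tool would also need to match on prefixes.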

@danoli3
Member

danoli3 commented Dec 21, 2015 via email

@bilderbuchi
Member Author

yeah, PRs have to be rebased (and any binary changes in them cleaned).

yes, with the API you can edit comments, so this should be possible or at least feasible.
Commits in PRs are taken from the git repo structure, so rebasing PRs would be necessary. this is why it makes sense to do this at a point where the minimum amount of PRs are open or, alternatively, improve our PR merging workflow to not have 70+ PRs open at a time ^.^

@danoli3
Member

danoli3 commented Jan 13, 2016

Definitely need to get the PRs down haha!

In the meantime, we should upload the releases to the tags on github, as uploaded on openframeworks.cc for the different platforms. Anyone got a good upload speed? ;D! So just need to add "Release Notes" and then can add zips to each tag.

https://github.com/blog/1547-release-your-software

@bilderbuchi
Member Author

In the meantime, we should upload the releases to the tags on github, as uploaded on openframeworks.cc for the different platforms.

why? we already have the uploads on of.cc, why duplicate them here, what's the benefit outweighing the drawbacks (more work, several sources for releases->confusion)?

@danoli3
Member

danoli3 commented Jan 13, 2016

Well for someone who is just looking at this repo, it's very confusing to not have them here. No link or anything on here to them aside from the main website link at the top.
Should at least add release notes with the links. Don't you think?
That's the standard thing to do for github yeah?

@bilderbuchi
Member Author

Arguably, the main README could use a restructuring - the text "To grab a copy of openFrameworks for your platform, check the download page on the main site." is currently in a "Developers" section, and could maybe be made more prominent.

Release notes with links to the relevant download section on of.cc sounds good, but I'm against putting the same binaries in two places.

@jamesblackburn

You can use: https://github.com/bozaro/git-lfs-migrate.git

to migrate files to git-lfs without breaking release tags.

@lynnboy

lynnboy commented Feb 2, 2018

As of now, the repo has grown to 2.31 GiB.

@danoli3
Member

danoli3 commented Mar 19, 2020

Ping
