
Best practice for migrating an existing Git repo to support LFS? #326

Closed
strich opened this issue May 22, 2015 · 33 comments

@strich
Contributor

strich commented May 22, 2015

I'd like to give Git LFS a good crack with our existing ~25GB Git repo but I'm at a bit of a loss as to how we should go about migrating it to support LFS.
I'm happy to lose all history of the existing large files I'd like to track with LFS.

Do I need to write a script to find and purge all the filetypes I want to track with LFS from the existing Git history, untrack them, then retrack them with LFS?

@andyneff
Contributor

I'm HOPING there is a better way to do this... BUT after redacted units of time, I found a way where you will NOT lose any history of existing large files. It is probably SLOW but it works.

DISCLAIMER I'm just a stubborn user, can't promise this will work out for you, BACKUP!

I have run into a similar situation myself, but in my case, all my large files were sitting in submodules. So I was able to remove the submodules, track the files, and add them back (without the submodules, HOORAY!). This way I can say I did maintain all my history; only some of the history is in the submodules. I'm guessing this is not similar to your situation, and that everything is in one big repo.

  1. Filter-branch to convert all tracking over to lfs
git filter-branch --prune-empty --tree-filter '
git config -f .gitconfig lfs.url "http://127.0.0.1:8080/user/repo"
git lfs track "*.npy"
git add .gitattributes .gitconfig

for file in $(git ls-files | xargs git check-attr filter | grep "filter: lfs" | sed -r "s/(.*): filter: lfs/\1/"); do
  echo "Processing ${file}"

  git rm -f --cached ${file}
  echo "Adding $file lfs style"
  git add ${file}
done' --tag-name-filter cat -- --all
  • You have to use tree-filter here, and not index-filter, because we are adding objects back.
  • I just added track *.npy as an example, but add all of your track commands in there. Hopefully you can use the SAME set of rules for EVERY commit.
  • The .gitconfig file and git config lines are there so that if you have an lfs server at a separate location from the git repo, everything works. If you have them at the same URL, you can skip that step.
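
The heavy lifting inside the loop above is the pipeline that turns `git check-attr` output into a list of LFS-tracked paths. A minimal sketch of just that text processing, with sample `check-attr` output standing in for a real repo (the paths are made up for illustration):

```shell
# Sample `git check-attr filter` output; the grep/sed pair keeps only
# paths whose filter attribute is lfs and strips the attribute suffix.
printf '%s\n' \
  'assets/model.npy: filter: lfs' \
  'src/main.c: filter: unspecified' \
  'data/big.npy: filter: lfs' |
  grep "filter: lfs" | sed -r "s/(.*): filter: lfs/\1/"
# → assets/model.npy
#   data/big.npy
```

In the real script, `git ls-files | xargs git check-attr filter` produces the input lines.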
  2. Push changes to whatever remote or remotes you have. Of course you have to use the -f option, which can (and WILL!) have many implications for all the other users of the repo. Make sure no one else pushes or references the commits you just rewrote, or else you will have a mess. This is when all the large files are sent to the lfs server.
git push -f origin master

(Optional) Collect garbage to shrink currently checked out repo

rm .git/refs/original -rf
git -c gc.reflogExpireUnreachable=0 -c gc.pruneExpire=now gc
  • The rm command removes a "backup copy" of the original refs from before the filter-branch command, just to CYA. But when you are done with them, WIPE EM!
  • If this doesn't do enough gc, check out http://stackoverflow.com/a/14728706

(Optional) Collect garbage on a bare repo (on the remotes)

git -c gc.reflogExpireUnreachable=0 -c gc.pruneExpire=now gc

I hope this helps!

Untested ideas
Instead of step 1, just add

git lfs track "*.foo"
git config -f .gitconfig lfs.url "http://127.0.0.1:8080/user/repo"
git add .gitattributes .gitconfig

inside the for loop in step two. It should save that wasted initial rebase, and prevent the need for the git add .

Tested using git-lfs 0.5.1 and git 1.9.4.msysgit.1 (Yes, on windows 64 bit)

@tlbtlbtlb

bfg is much more efficient than git-filter-branch, especially for long histories.
https://rtyley.github.io/bfg-repo-cleaner/

Here's roughly what I did to convert all my .mov files to lfs:

  • cp *.mov (and a few other large blob types) ~/tmp
  • git rm *.mov
  • git commit
  • git lfs track "*.mov" (quoted, so the shell doesn't expand the glob)
  • git add .gitattributes
  • git commit; git push

In a fresh directory:

  • git clone --mirror $remote; cd repo
  • bfg --delete-files '*.mov'
  • git reflog expire --expire=now --all && git gc --prune=now --aggressive
  • git push

Back to my src directory:

  • mv repo repo.bloated
  • git clone $remote; cd repo
  • cp ~/tmp/*.mov .
  • git add *.mov (it now puts them in lfs)
  • git commit; git push

Kind of a chore to figure out, but now my repo is small and zippy.
Tip: do this on a cloud machine instead of your laptop, since most of the time is pulling/pushing data to github.

@andyneff
Contributor

@tlbtlbtlb in your solution, when you check out all the previous versions of your history, are all the .mov files correctly there, or are they missing everywhere except in the newest commits?

@tlbtlbtlb

They're gone from previous versions. Which is what's necessary to cut the size of the repository.

@andyneff
Contributor

I guess I was being stubborn and trying to give anyone an option where they DON'T lose the history of all large files, even though it was stated he'd be happy without them.

I guess the slower built-in equivalent to bfg --delete-files '*.mov' would be

git filter-branch --prune-empty --index-filter 'git rm --ignore-unmatch --cached "*.mov"'

@strich
Contributor Author

strich commented May 24, 2015

Although I'm happy to lose the history of the large files, I would need them to continue to exist in previous commits - there's no point having any history at all if every previous commit is broken with missing files.

@andyneff
Contributor

@strich I agree with the second statement, but I'm a little confused about the first. Are you saying that you are OK with losing the contents AND history of any large files NOT at the tip of your current branch, but for the versions of the large files that are there, you want them to exist in previous commits up to the point where that version of the file first appeared, before which they will just not exist (be missing files)?

Example:

If I understand correctly, are you saying that if I have something like this,

| commit | what happened | big contents |
| ------ | ------------- | ------------ |
| master | nothing new with big files | big1.bin (version 2) |
| master~1 | update big1.bin to version 2, delete big2.bin | big1.bin (version 2) |
| master~2 | add big2.bin | big1.bin (version 1) and big2.bin |
| master~3 | add big1.bin (version 1) | big1.bin (version 1) |

You want to make sure that master and master~1 both point to big1.bin(version 2), but you are ok with big2.bin and big1.bin(version 1) just disappearing?

@strich
Contributor Author

strich commented May 24, 2015

Thanks for continuing to reply @andyneff - Yes that is exactly what I'd like to achieve if it is possible.

@andyneff
Contributor

Oh, it's possible... My current solution actually preserves ALL the files (so big1.bin (version 1), big1.bin (version 2) AND big2.bin). You are actually asking for a more limited version. It is possible to keep just the latest version if you add a little more:

git lfs track "*.npy"
export KEEP_FILE="$(git rev-parse --show-toplevel)/.git/.keep"
git ls-files | xargs git check-attr filter | grep "filter: lfs" | sed -r "s/(.*): filter: lfs/\1/" > ${KEEP_FILE}
rm .gitattributes
git checkout .gitattributes  || :

git filter-branch --prune-empty --tree-filter '
git config -f .gitconfig lfs.url "http://127.0.0.1:8080/user/repo"
git lfs track "*.npy"
git add .gitattributes .gitconfig

for file in $(git ls-files | xargs git check-attr filter | grep "filter: lfs" | sed -r "s/(.*): filter: lfs/\1/"); do
  keep_file=0
  while read keep; do
    if [ "${keep}" == "${file}" ]; then
      keep_file=1
      break
    fi
  done < ${KEEP_FILE}
  if [[ ${keep_file} == 1 ]]; then
    git rm --cached ${file}
    git add ${file}
  else
    git rm -f ${file}
  fi
done' --tag-name-filter cat -- --all

rm ${KEEP_FILE}
rm .git/refs/original -rv
git -c gc.reflogExpireUnreachable=0 -c gc.pruneExpire=now gc

This will remove all files matching the lfs track patterns (in this case, *.npy) except for those listed in the KEEP_FILE file. The files listed in KEEP_FILE and their history will be maintained.

Of course, you can replace the first 5 lines with anything you want to get the list of files you want to keep.

Side effect: it is possible that two separate commits will be merged into one, if the only difference between them was a file that is now gone. This can also change the topology of your branches. It will still be "as correct as possible"; only some commit messages would disappear. Of course, it is unlikely to happen, but just an FYI.
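
As an aside, the inner while loop that tests membership in the keep list can be collapsed into a single `grep -Fxq` call (fixed string, whole line, quiet). A small standalone sketch, with a hypothetical keep file and path:

```shell
# Hypothetical keep list with one path per line, as in KEEP_FILE above.
KEEP_FILE=$(mktemp)
printf '%s\n' 'assets/big1.bin' > "$KEEP_FILE"

file='assets/big1.bin'
# -F fixed string, -x whole line, -q quiet: the exit status answers
# "is this exact path listed?"
if grep -Fxq "$file" "$KEEP_FILE"; then
  echo "keep: $file"
else
  echo "drop: $file"
fi
rm -f "$KEEP_FILE"
# → keep: assets/big1.bin
```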

@andyneff
Contributor

Other things I've tried that did NOT work out

  • I had this idea of using git-replace. This has the advantage of efficiently creating all the lfs pointers once and only once. This WORKS (after a filter-branch to update .gitconfig/.gitattributes), but I have a hard time making the changes permanent. git-filter-branch does not appear to support blob replacement (only commit replacement).

    while read line; do
      LINE=($line)
      OBJECT_HASH=${LINE[0]}
      FILENAME="${LINE[@]:1}"
      MATCH=$(git check-attr filter ${FILENAME} | grep 'filter: lfs' | sed -r 's/(.*): filter: lfs/\1/')
      if [ ! "${MATCH}" == "" ]; then
        POINTER_HASH=$(git show ${OBJECT_HASH} | git lfs clean | git hash-object -w --stdin)
        echo New file ${FILENAME} ${OBJECT_HASH} ${POINTER_HASH}
        git replace ${OBJECT_HASH} ${POINTER_HASH}
      fi
    done < <(git rev-list --objects --all | \grep "[0-9a-f]* .")
  • Solutions involving git rebase --interactive. This is possible, but only works for one branch. Plus it is prone to merge conflicts in cases where your repo uses .gitattributes.

    env GIT_SEQUENCE_EDITOR="sed -i s:^pick:edit:" git rebase --interactive --root
    while : ; do
      git lfs ls-files > .gitlfs_converted
      for file in $(git ls-files | xargs git check-attr filter | grep "filter: lfs" | sed -r "s/(.*): filter: lfs/\1/"); do
        new_file=1
        while read converted; do
          if [ "${converted}" == "${file}" ]; then
            new_file=0
          fi
        done < .gitlfs_converted
        if [[ ${new_file} == 1 ]]; then
          echo "Processing ${file}"
          git rm --cached ${file}
          echo "Adding $file"
          git add ${file}
        fi
      done
      git commit --no-edit --amend
      rm .gitlfs_converted
      git rebase --continue || break
    done
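
For intuition about the `git lfs clean` step in the git-replace idea above: an LFS pointer is just a three-line text file carrying the payload's sha256 and size. A hand-rolled sketch for illustration only (real pointers should come from `git lfs clean`; the payload here is made up):

```shell
content='hello large file'   # stand-in payload, 16 bytes
oid=$(printf '%s' "$content" | sha256sum | cut -d' ' -f1)
size=${#content}
# This small text blob is what replaces the large file in the repo:
printf 'version https://git-lfs.github.com/spec/v1\noid sha256:%s\nsize %s\n' "$oid" "$size"
# → version https://git-lfs.github.com/spec/v1
#   oid sha256:<64 hex chars>
#   size 16
```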

@bozaro
Contributor

bozaro commented Jun 10, 2015

I wrote a simple Java tool which can convert a repository for LFS usage: https://github.com/bozaro/git-lfs-migrate
Maybe it will be useful for somebody.

@vmrob

vmrob commented Aug 12, 2015

Version of the first script that works on OS X with files that contain spaces (but not newline characters):

git filter-branch --prune-empty --tree-filter '
git lfs track "*.zip"
git lfs track "*.exe"
git add .gitattributes

git ls-files -z | xargs -0 git check-attr filter | grep "filter: lfs" | sed -E "s/(.*): filter: lfs/\1/" | tr "\n" "\0" | while read -r -d $'"'\0'"' file; do
    echo "Processing ${file}"

    git rm -f --cached "${file}"
    echo "Adding $file lfs style"
    git add "${file}"
done

' --tag-name-filter cat -- --all

The most unusual part is while read -r -d $'"'\0'"'

The parameter to read, -d, is $'\0', but to escape the single quote inside the block that is already single quoted, we end the quote, open a double quote, use single quotes inside it, close the double quote, and then open a single quote back up. Shell escaping isn't always the easiest...
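
Outside a tree-filter script, where no extra quoting layer is needed, the NUL-delimited read can be written plainly as `read -r -d $'\0'`. A tiny bash demo with made-up NUL-separated filenames, showing that spaces survive intact:

```shell
# Two NUL-terminated "filenames", one containing a space. Requires bash
# (for read -d and the $'\0' form); only NUL ends each record.
printf 'file one.zip\0file two.exe\0' |
while read -r -d $'\0' f; do
  echo "got: $f"
done
# → got: file one.zip
#   got: file two.exe
```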

@rtyley
Contributor

rtyley commented Oct 1, 2015

As @tlbtlbtlb mentioned, the BFG is much faster than git filter-branch for rewriting history - and I've added explicit Git LFS support with BFG v1.12.5:

$ bfg --convert-to-git-lfs '*.{exe,dll}' --no-blob-protection

Incidentally, the git-lfs-migrate code by @bozaro is quite interesting - it looks like it does an equivalent job, and maybe at equivalent speed - will check it out when I get a chance.

@bozaro
Contributor

bozaro commented Oct 2, 2015

@rtyley

I think both of these projects should run at roughly the same speed.
By the way, I'll split the common part of my projects (git-as-svn and git-lfs-migrate) out into a separate git-lfs-java project.

@ltrzesniewski

@andyneff did you keep your original migration script around? I'd be very interested :)

I'm asking as I'm in the same situation you had: I have a whole lot of binary files in a submodule, and I'd like to merge the submodule to the main repository while converting the files inside to LFS (and keep the whole history of course). The script you posted doesn't seem to deal with this, as this wasn't requested in this issue (I admit I didn't test it yet).

@andyneff
Contributor

@ltrzesniewski When I said "I maintained my submodule history", what I meant was that I kept the submodules hosted, but abandoned using them for future commits. This means if I went back in history to before the conversion, it would check out the submodule and use it. This is very clunky and not really a great idea...

I believe the new preferred way as @rtyley pointed out, is using bfg to convert a repo to lfs, now that it has lfs support.

So I believe what you could do is

  1. Convert each of your submodule repos to lfs using bfg (sorry, I haven't tried bfg yet)
  2. You can actually merge the submodule's history into your main repo history using git subtree, I think. I've only ever done this once, and if I remember correctly it simply connects the two histories at the current commit:
    M
  /   \
M~1    S
 |     |
M~2   S~1
 |     |
M~3   S~2
...

Where the M commits are your main repo commits, and the S commits are submodule commits. This means that the versions of the submodule that the M~1... commits point to won't be lined up with the new S commits. In fact M~1... will still point to the submodules. Neither solution is ideal.

To summarize, these are the only tricks I know of

  1. Just remove the submodule in your current commit, and then add all the files back, tracked as git lfs files. This way when you check out the M~1... commits, it'll re-checkout the submodule version. You'll have to re-init the submodules and run submodule update each time, and it'll probably break, requiring you to manually help the submodule along each time, but it is doable. This also requires you to continue hosting the submodule repos.
  2. Use subtree to merge the history in, and just remember that the versions at M~1... will be non-functional.
  3. The PERFECT solution would probably be something that could line up each sha from the submodules and basically shuffle the commits together. Perhaps a smart filter script could do this, but I don't have one. What I'd imagine is
    M
  /   
M~1
 |   \
M~2    S
 |   \
M~3   S~1
 |     |
 |    S~2
 |
M~4
 |  \
 |    S~3  
...
  • The commit from M~1 pointed to S,
  • The commit from M~2 pointed to S~1, so it gets both S~1 and S~2 because M~3 pointed to S~2
  • The commits from M~3 and M~4 pointed to S~3
  • etc...

I see references to another merge method I'm unfamiliar with, maybe it can help you.

http://stackoverflow.com/a/8901691/4166604
http://x3ro.de/2013/09/01/Integrating-a-submodule-into-the-parent-repository.html

@ltrzesniewski

@andyneff thanks for your help, I appreciate it very much!

Now I understand what you meant in your first post. I guess I was just prone to wishful thinking, but you cleared up the confusion - I basically thought you already had a solution for that PERFECT method you describe 😉

In my case I don't really need to keep the commit history of the submodule, but I need to keep track of the relevant file versions referenced by the main repo, so I was thinking about performing a submodule checkout for each tree in the --tree-filter script but that would be sloooooow. I'll see what I can do from there.

@bozaro
Contributor

bozaro commented Oct 14, 2015

@andyneff
If you fetch both repositories into one .git directory, or integrate the submodule before converting, then bozaro/git-lfs-migrate will convert the whole history of both repositories, including submodule links.

@kilianc

kilianc commented Oct 22, 2015

I am in a slightly different situation where I have an orphan branch called design in a repo where we store sketch files and other assets. I am ok with losing the history since the project is less than a month old, and I already tried to do so but it doesn't seem to work.

Couple of questions:

  • Can I git lfs track * ?
  • Do I have to git lfs init? If yes, when? Does everyone on the team need to do that as well?
  • How do I check that it's working? It appears it is not - it says compressing objects, writing objects as usual.

@bozaro
Contributor

bozaro commented Oct 23, 2015

@kilianc

git lfs init

This command needs to be run only once per user computer. Usually it's run by the git-lfs installer.

It adds lines like these to $HOME/.gitconfig:

[filter "lfs"]
    clean = git-lfs clean %f
    smudge = git-lfs smudge %f
    required = true
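
Those lines can also be written or inspected non-interactively with `git config -f`. A sketch against a scratch file rather than the real $HOME/.gitconfig:

```shell
# Write the LFS filter settings into a scratch config file, then read
# one back. Using a temp file keeps the real ~/.gitconfig untouched.
cfg=$(mktemp)
git config -f "$cfg" filter.lfs.clean  'git-lfs clean %f'
git config -f "$cfg" filter.lfs.smudge 'git-lfs smudge %f'
git config -f "$cfg" filter.lfs.required true
git config -f "$cfg" --get filter.lfs.required
# → true
rm -f "$cfg"
```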

@kalibyrn

I hope I'm not beating a dead horse, but..

This little script is still probably slower than bfg, but I couldn't figure out how to get bfg to honor my lfs remote location. So, I wanted to build on the work from @andyneff and @vmrob and make the filter-branch commands they provided faster.

git filter-branch --prune-empty --tree-filter '
git config -f .gitconfig lfs.url "http://artifactory.local:8081/artifactory/api/lfs/git-lfs"
git lfs track "*.exe" "*.gz" "*.msi" "*.pdf" "*.ppt" "*.pptx" "*.rar" "*.vdx" "*.vsd" "*.war" "*.xls" "*.xlsm" "*.xlsx" "*.zip" > /dev/null
git add .gitattributes .gitconfig
git ls-files | xargs -d "\n" git check-attr filter | grep "filter: lfs" | sed -r "s/(.*): filter: lfs/\1/" | xargs -d "\n" -r -n 50 bash -c "git rm -f --cached \"\$@\"; git add \"\$@\"" bash \
' --tag-name-filter cat -- --all

By combining the "git lfs track" lines into one and by using "xargs -n 50" I was able to cut down on git invocations by more than 50 per revision, in my case. (Way too many binaries in our repository!) That made things FAR faster... It handles spaces in the filenames too.

It seems to be working on Linux, but I can't comment on whether it would work for Mac OS X.
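
The batching effect of `xargs -n` is easy to see in isolation: here five stand-in arguments are processed in groups of two, so the wrapped command runs three times instead of five (GNU xargs assumed, for `-d`):

```shell
# Five arguments, batches of two: the sh -c body runs once per batch,
# receiving the whole batch as "$@".
printf '%s\n' a b c d e |
  xargs -d "\n" -n 2 sh -c 'echo "batch: $@"' sh
# → batch: a b
#   batch: c d
#   batch: e
```

In the script above the same trick batches `git rm`/`git add` calls, which is where the speedup comes from.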

@bozaro
Contributor

bozaro commented Nov 27, 2015

@kalibyrn
I'm sure that git-filter-branch is a really bad idea: it converts revision by revision, with a full checkout of every revision.
Converting a bare repo is much faster. I would recommend this tool: https://github.com/bozaro/git-lfs-migrate

@jamesblackburn

git-lfs-migrate is amazing! Have converted a few repos, and done some superficial verification that the converted tags and HEADs are good.

Super easy to use and worked a treat. Thanks @bozaro !

@dashesy

dashesy commented Jul 21, 2016

Does git-lfs-migrate change commit hash numbers? I would like to migrate a repo with large files, but am afraid of using filter-branch. If all blobs are substituted by a pointer (text file) in the history, without changing the actual graph that would be perfect.

@jamesblackburn

Because the file object blobs do change (from a large file to a text pointer), by the design of git the blob SHA changes, and therefore the commit SHA. The result is that there isn't a way, in git, to change the content of blobs without changing the SHAs.
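
The cascade is easy to see with `git hash-object`, which computes a blob's object ID purely from its bytes (no repository needed). Different payloads must hash differently, so every tree and commit containing the blob changes too (payloads below are illustrative):

```shell
# Two different payloads for the "same" file produce different blob IDs.
v1=$(printf 'large file v1' | git hash-object --stdin)
v2=$(printf 'version https://git-lfs.github.com/spec/v1' | git hash-object --stdin)
echo "$v1"
echo "$v2"
# The IDs differ, so every tree and commit referencing the blob must be
# rewritten, which is why all downstream commit SHAs change.
```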

@dashesy

dashesy commented Jul 21, 2016

@jamesblackburn But git could add a feature (or plugin) to fake the SHA for some special blobs (blobs that have their SHA hard-coded). The problem with changing the commit SHAs is that you suddenly lose all the references to the old commits (issue trackers, wikis, URLs... all become invalid).

@andyneff
Contributor

@dashesy

Faking the SHA would be VERY bad (if it were even possible). The SHAs are different; they NEED to be pulled down. If they looked the same, other people fetching the latest version wouldn't know they need the new SHAs.

As for your issue tracking, etc. problem... yes, those references would be broken. There are the git replace and git grafts features... I'm not sure if those could help, and I don't think git-lfs-migrate uses them. It might be possible to keep a list of all the old SHAs replaced by new SHAs with those features. However, I'm not sure what you would do with that list when you are done...

You are justified to be worried about all the SHAs changing, but this is necessary.

The entire graph (at least for the branch you convert) still retains its original topology; only ALL the SHAs in the graph will change (from the first lfs file on, at least). I don't remember if git-lfs-migrate converts all your branches or not.

So a few points

  • Yes, you'll be pushing a whole new set of SHAs
  • Yes, you will be LOSING the original SHAs, but the topology and commit history will remain. The only difference is that large files will be lfs pointers in the git repository instead of the large files themselves. This means your repo will be smaller.
  • Yes, all the other team members will need to pull down the new changes, and make sure they are branched off of the new SHAs, and not the old one. This means they can't use git merge (unless you use --squash) to get to the new SHAs, it either has to be git rebase, git reset, git cherry-pick, or hopefully git checkout, all depending on the situation ;)
  • Depending on who all is using your repo, complexity, etc... you may want to go the route of having everyone just git fetch and then check out the new SHAs, or just tell everyone to make a clean clone (and possibly change the repo name just to try to prevent confusion... of course that sometimes adds confusion too)
  • git-lfs-migrate is the preferred way now. You shouldn't be needing filter-branch anymore... I hope

@Permafacture

git-lfs-migrate has an option now for mapping old commit hashes to new ones. I'm linking to the pull request for adding --annotate-id-prefix since I installed from a non-master branch, but it may be in master at some point soon: bozaro/git-lfs-migrate#24

@revolter

@Permafacture Any idea if this option exists in this project's migrate tool?

@revolter

Isn't git lfs migrate the best option for this?

@ltrzesniewski

@revolter it didn't exist when this thread was created.

FWIW I tried to use it to convert a large repo just after it was released, but with no luck. @bozaro's tool worked just fine, although I had to revert the "Dramatically reduce memory usage" commit (so it used lots of memory but it actually finished before the heat death of the universe 😉). Maybe now the issues are ironed out, I don't know.

@technoweenie
Contributor

LFS v2.3.0 improves migrate perf and fixes some crucial bugs. So, hopefully it works better now :)

@Permafacture

Permafacture commented Sep 19, 2017 via email
