Saturday, June 27, 2015

Splitting a git repository

I keep all my hobby code in a big git repository. The problem is that it is too big. I have something like 28GB in it, of which the vast majority is data, by which I mean results of data-collection processes, such as photos, rocketometer data, etc. It also includes data acquired from other sources, such as spice kernels, shape models, image maps, etc. It is making everything slow.

So, the solution is to split the repository. We split it into code, data, and docs.

Now, how do we split a repository? One option is to use the data we have to build new repositories. We go through the existing one commit-by-commit, generate a diff, filter that diff so that only the right files go into the new repository, then make a new commit in the new repository, making sure to keep the commit message, user information, and timestamp.

That seems hard. Also, it seems like someone should have already done that. So, I did some research to see if anyone else has done this. Mostly I am looking for people who are permanently removing files from a repository. My idea is to make three copies of the existing repository, then remove the files that don't belong using the methods described on the Internet.

In the course of my research, I found a program called BFG. This program does a full remove of a large set of files, in a way that is much quicker than git filter-branch. In order to do this, we need to remove the files from the head of the repository, so we do things like this:

  1. Make three copies of the original workspace with git clone --mirror . These are going to be the three new repositories, so call them code.git, data.git, and docs.git
  2. For each repository, check it out with git clone (no --mirror). This working copy is temporary. We will use the code repository as typical.
  3. Because of the way BFG works, we have to rename any folders which are called Data or Docs. BFG doesn't remove just /Data, but any folder called Data. As it turned out, there were some such files. They needed to be either renamed (to data and docs, note lower case), deleted, or moved.
  4. Remove the files from the working copy with git rm -r Data Docs .
  5. Move all the files and folders from code down one level, since the old code folder will be the new root: git mv code/* .; git rm code
  6. Commit and push the removal and move
  7. Now go back to the new mirrored repository code.git and run java -jar ~/bfg-1.12.3.jar --delete-folders Data and java -jar ~/bfg-1.12.3.jar --delete-folders Docs . This process runs very quickly.
  8. At this point, the repository is cleaned, you can't see any evidence of the files, but there are garbage objects which need to be removed to actually make the repository take less space. To do this, run git reflog expire --expire=now --all --verbose and git gc --prune=now --aggressive . Git garbage collection takes a long time and an immense amount of memory, more than the 8GiB my file server actually had. I had to run the gc over the network on my game rig, which has 32GiB.
  9. Now the repository is shrunk. We went from 28GB to 3GB for the code repository.

No comments:

Post a Comment