Migrating SVN to Git

February 15, 2014

Executive Summary

I migrated an 11k-commit, 3M+ line repository from Subversion to Git. A naïve git svn clone consumed 1.2 GiB, and several blobs exceeded 100 MiB. With some cleanup, git filter-branch and pixie dust I got it down to a ~100 MiB repository, about the same size as a single Subversion checkout had been.

This describes how.

Background

A company I used to work for has an enormous (50k+ commit) Subversion repository (which actually has several major “trunk/branches/tags” sub-projects inside it). They wanted to migrate the main sub-project to Git (and GitHub). This was “only” 11000 of the commits.

This presented several major challenges:

  • lots of big files + historical weirdness (breaking GitHub limits)
  • non-standard branching strategies
  • two svn:externals with tight coupling
  • many binaries included over the years (PDFs, JARs, etcetera)
  • a “naïve” git svn clone weighs 1.2 GiB – way too big
  • several “repository re-organizations” over the years
  • non-trivial server-side hooks

Since we wanted to use GitHub, there are additional hard and soft limits on file-sizes: no single blob can exceed 100MiB and the repository can’t exceed 1GiB. Completely reasonable limits!

The repository being migrated is monstrous: 3 million+ lines of code (unfortunately including some imported 3rd-party dependencies), 11000+ commits, multiple languages, binaries, etcetera. The build system has to deal with Java, JavaScript, C++, Python, Flex/ActionScript and dozens of translations, and it produces several binaries, JARs, RPMs, etcetera. (ohcount said the build system alone was around 16000 lines.)

One obvious thing would be to move many of these disparate pieces to their own repositories/projects. However that was out-of-scope for this migration.

So, given these limitations…

Standing on the Giants

Several tools and blog posts have been valuable in making this happen. Of course, Git itself and git-svn (which keeps me sane when working with Subversion repositories) along with the fantastic online Git documentation are top of the list. There were also several blog posts and scripts detailing other people’s experiences.

Overview of the Process

Before starting, we found several outstanding technical-debt style problems that were dealt with directly in Subversion (removing binaries like PDFs and JARs, deleting 1M+ lines of dead code, removing two svn:externals definitions, …)

In our case, the Subversion repository would be remaining in a read-only state “forever”, so a pedantic full-history clone wasn’t strictly necessary. This allowed, for example, extensive git filter-branch operations to save size. Although not ideal (the history isn’t precisely the same), it’s better than a repository measured in Gigabytes.

After the above clean-ups, the actual conversion process proceeded as follows:

  • run git svn clone on entire repository
    • took a couple days, running via http
    • in hindsight, using a dump would be much faster (i.e. a file:/// URL on a machine with an import of the repo)
  • the above ends up with several “@”-branches for each “real” branch:
    • like “trunk”, “trunk@32123”, “trunk@18124” etc
    • (see below for full explanation)
    • all these @-branches were deleted
  • make “git” branches for each subversion branch wanted
    • we didn’t migrate everything (especially personal branches)
    • still about 50+ branches migrated
  • run several git filter-branch operations
    • pretend binaries (PDFs, JARs, etc) never existed
    • pretend obsolete large data files never existed
    • pretend a generic user did the “migrate stuff around” commits
  • remove historical svn:externals references
    • git filter-branch
    • svn export the right version of the external
    • …and commit back into the tree
    • that is, pretend it was never an external
  • re-write past sins
    • change author on re-organization to “nobody”
    • re-write first commit to “nobody”
    • these really help when trying to figure out where code came from (e.g. using git blame)
  • delete all the backup refs and branches created by filter-branch
  • expire the reflog
  • delete all dead objects
  • clone the (git) repository
    • ensures we leave unreachable objects behind
  • this is now “the Git repository”
  • set up git-based commit hooks
    • can’t just use pre-receive hooks (not available on GitHub/Bitbucket)
    • make the build system bootstrap the client-side .git/hooks/* scripts if missing
    • used a post-merge script to also update the hooks (i.e. after doing a git pull, the hooks are checked for updates, too)
  • push to GitHub
  • Congratulations, it’s the Git magic moment!
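
The hook-bootstrapping step in the list above could look roughly like the following. This is a minimal sketch, not the script I actually used: the function name sync_hooks and the idea of keeping the hook scripts in a tracked directory (such as a hypothetical tools/hooks) are assumptions for illustration. The build system and a post-merge hook could both call it.

```python
import filecmp
import shutil
import stat
from pathlib import Path


def sync_hooks(tracked_dir, hooks_dir):
    """Copy hook scripts from a version-controlled directory into .git/hooks.

    A hook is (re)installed when it is missing or differs from the tracked
    copy. Returns the names of the hooks that were installed or updated.
    """
    tracked_dir, hooks_dir = Path(tracked_dir), Path(hooks_dir)
    hooks_dir.mkdir(parents=True, exist_ok=True)
    updated = []
    for source in sorted(tracked_dir.iterdir()):
        if not source.is_file():
            continue
        target = hooks_dir / source.name
        if target.exists() and filecmp.cmp(source, target, shallow=False):
            continue  # already up to date
        shutil.copyfile(source, target)
        # Git only runs hooks that are executable.
        mode = target.stat().st_mode
        target.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
        updated.append(source.name)
    return updated
```

Calling sync_hooks("tools/hooks", ".git/hooks") from the top of the working tree would install anything missing and refresh anything stale; a second call right afterwards should find nothing to do.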

Now, that sounds like an awful lot to do, and it was. It took several months just to completely accomplish many of the “cleanup” tasks (in Subversion) before Git-conversion proper could even start – there were several deep, thorny issues with the externals and build system.

On-going development didn’t particularly help in this regard (i.e. trying to clean up an extensive, crufty build system while new things were being added).

Doing The Cleanup

This isn’t really part of the “migrate to Git” project proper, however I suspect most companies with a long-lived Subversion repository will have at least some cleanup-type things to do before such a migration.

Pat yourselves on the back if you really have none!

Remember that you only have one shot at the Git migration (no pressure!) so now is your chance to pretend that historical mistakes never happened; it’s much harder to remove them once you’re actively using Git. (You still could, but it changes history…)

Don’t go too crazy, though! As far as Git migration is concerned, you’ll want to keep out things which increase the repository size:

  • lots of big binaries
  • “weird” histories

You probably never really wanted the big binaries in your repository in the first place, so those should be easy to get rid of (i.e. move elsewhere). In the repository I was fixing, we had three types of large objects:

  • Dependencies (JARs, SWCs). These were simply incorporated into our other third-party library infrastructure
  • Assets (PDFs mostly). These were moved to more appropriate infrastructure.
  • Data (old test-data mostly). There was already a separate Subversion repository with these things, so the historic data were simply deleted.

On the “weird history” front, Git is remarkably compact at storing branches and history which are largely the same (e.g. 20 branches of trunk aren’t going to be that much bigger than 1, since their histories are probably 99% the same).

In the repository I was dealing with, the “trunk” I was migrating actually contained three different platforms (server, iOS, Android) during its infancy. These were later split apart (around commit 18,000). So, there were certain personal branches that ended up including either the iOS or Android history. Since these were the only branches including this history, deleting them was beneficial.

I also used git filter-branch to delete any of the old iOS and Android things that were now in their own repository.

Lessons learned: it might be less disruptive to have one (or two, or whatever it takes) sprint(s) where everyone does any pre-Git cleanups you want instead of having one team or even one person do this stuff while “normal” development carries on. Probably depends a lot, but if there are a lot of cleanups or they touch a lot of the code, I think doing it “all at once” would be preferable.

Accomplishing The Actual Migration

Since this was a large, complicated and time-consuming migration process with many, many (many) opportunities to get things wrong, I “showed my work” with a small collection of mostly-Python scripts to “do” the migration.

From a very high level, this means I could run do-migration.py in a shell and walk away – it would churn away for several days just doing the from-scratch git svn clone but I could be assured everything was being done from scratch.

This had several overall steps, some of which could be skipped – for example, after spending a bunch of time git svn clone-ing the repository you do a backup and always skip this step after that ;) In other words the first step becomes “copy the One True Git Clone and update it to current trunk”.

I also made several support scripts which were passed to the various Git commands:

  • SVN author to proper email conversion (for --authors-prog)
  • A script that calls git filter-branch to:
    • delete all PDFs
    • delete all JARs
    • delete historical accidents/mistakes/cruft
  • A filter-branch script to rename the authors on some commits
    • repository re-organization
    • some merges
  • A thing to edit/create the “git-svn” setup in .git/config
  • A “branch cleanup” script
    • delete branches with @s
    • move trunk to master
    • make git branches for svn branches
    • make real git tags for svn “tags” (become branches during git-svn clone)
    • clean out the reflog
    • remove filter-branch backup references
  • Overall orchestration script
  • Tests
    • basically acceptance-style things
    • ensure master’s history matches trunk’s
    • check that the various codelines:
      • appear
      • come from the right place (i.e. do the branch-points match)
      • contain the correct tags
    • are the tips of all branches identical between Git, SVN
    • all this testing is eased by keeping the git-svn-id tags

All together, this is a few hundred lines of Python, bash and configuration files.
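
As an illustration of the first support script in the list, here is a minimal sketch of an --authors-prog helper. git svn runs the program with the SVN committer name as its first argument and expects a single "Name <email>" line on standard output. The mapping, the fallback domain and the function names are all hypothetical placeholders, not the real script.

```python
import sys

# Hypothetical mapping; a real one would list every SVN committer.
KNOWN_AUTHORS = {
    "jsmith": ("Jane Smith", "jsmith@example.com"),
}
FALLBACK_DOMAIN = "example.com"  # assumed company mail domain


def git_author(svn_username):
    """Return the 'Name <email>' line for one SVN username."""
    name, email = KNOWN_AUTHORS.get(
        svn_username, (svn_username, f"{svn_username}@{FALLBACK_DOMAIN}")
    )
    return f"{name} <{email}>"


def main(argv=None):
    """Print the Git author for the SVN username given as first argument."""
    argv = sys.argv if argv is None else argv
    print(git_author(argv[1]))
    return 0
```

Saved as an executable file that ends by calling main(), something like this could be handed to git svn fetch --authors-prog. Falling back to a generated address (rather than failing) is a design choice; failing loudly on unknown committers may be preferable if you want the mapping to be complete.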

One thing that many people do as part of a Git migration that I decided NOT to do is remove the git-svn-id tags that git svn adds to all the commit messages. Since we were keeping Subversion around “for historical purposes” I thought it made sense to keep these so people could immediately tell which commits came from Subversion (and precisely which revision + path). It also helps with testing the resulting Git repository.
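
To show why keeping these lines helps with testing, here is a small sketch that pulls the Subversion URL, revision and repository UUID back out of a commit message. It assumes the usual "git-svn-id: URL@REV UUID" layout that git svn appends; the function name is made up for this example.

```python
import re

# git svn appends a line of the form:
#   git-svn-id: <url>@<revision> <repository-uuid>
GIT_SVN_ID = re.compile(
    r"^git-svn-id: (?P<url>\S+)@(?P<rev>\d+) (?P<uuid>[0-9a-f-]+)$",
    re.MULTILINE,
)


def parse_git_svn_id(commit_message):
    """Return (url, revision, uuid) from a commit message, or None."""
    matches = list(GIT_SVN_ID.finditer(commit_message))
    if not matches:
        return None
    last = matches[-1]  # the metadata line is appended at the end
    return last["url"], int(last["rev"]), last["uuid"]
```

An acceptance test can then walk a Git branch, parse each message, and check that the revisions and paths line up with what svn log reports for the matching Subversion branch.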

For a smaller repository, or a wholesale switch to Git, you might like to remove these (see the --no-metadata option to git-svn).

git-svn Setup

The repository under conversion has several types of branches:

  • branches/RELEASES/CodeName: major codelines for releases
  • branches/initials/anything-at-all: personal branches, by dev’s initials (or sometimes name)
  • branches/feature/a-feature-name: features several devs are working on

Of course, not every branch was migrated – some old or obsolete branches that should have been deleted in SVN already were simply never cloned. So, we first create the repository:

$ git init
$ git svn init --prefix svn/ --stdlayout http://svn/repo/
$ # edit .git/config

The .git/config file in the end looks something like this (I’m showing just the appropriate section here):

[svn-remote "svn"]
   url = http://svn/repo
   fetch = server/trunk:refs/remotes/trunk
   branches = server/branches/auser/*:refs/remotes/people/auser/*
   branches = server/branches/jsmith/*:refs/remotes/people/jsmith/*
   branches = server/branches/meejah/*:refs/remotes/people/meejah/*
   branches = server/branches/RELEASES/{release1,somerelease,amazingrelease}:refs/remotes/release/*
   ## as many more branches = lines as you need
   tags = server/tags/*:refs/remotes/tags/*

So, for example, an SVN URI for a user branch above would be http://svn/repo/server/branches/meejah/foo and would become “refs/remotes/people/meejah/foo” in the git-svn clone. (Don’t worry, later on we make real Git branches from these like people/meejah/foo)

Now, we’re ready to actually do the clone.

$ git svn fetch --authors-prog svn-author-to-git-author.py
## wait a week

The above can take an incredible amount of time, especially if your branch patterns match lots of branches: git-svn fetches every revision on every branch. If you have direct access to the server, importing from a Subversion dump (loading the dump into a local repository and cloning via a file:/// URL) is probably the way to go for bigger repositories. Or just find something else to work on for a while, while your IT department wonders what’s up with the SVN server…

The @-Branches

After the clone was all done, I found a whole bunch of branches with ‘@’s in the name (e.g. trunk@18456, trunk@12345 etcetera, alongside the real trunk). There is an explanation in the git-svn man pages for 1.8.1 or later, but the short version is that any of these things may cause this to happen (as I found in our repository):

  • re-organization (e.g. moving where in the repo trunk/branches/tags lives)
  • accomplishing a “merge” by deleting trunk and moving your branch over top (seriously, people do this!)
  • if git-svn gets confused about anything else

For the cases I examined, all the @-containing branches could simply be deleted. So, we do this using update-ref to remove them all:

$ git update-ref -d refs/remotes/trunk@12345
$ # etc
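
Doing this by hand gets old quickly, so here is a rough sketch of how the branch-cleanup script could automate it. It assumes every ref under refs/remotes/ whose name contains an ‘@’ is safe to delete, which was true for the cases I examined but may not be for yours; the function names are hypothetical.

```python
import subprocess


def at_refs(refnames):
    """Return the refs whose last path component contains an '@'."""
    return [ref for ref in refnames if "@" in ref.rsplit("/", 1)[-1]]


def delete_at_refs(repo="."):
    """Delete every '@' ref under refs/remotes/ and return what was deleted."""
    listing = subprocess.run(
        ["git", "for-each-ref", "--format=%(refname)", "refs/remotes/"],
        cwd=repo, check=True, capture_output=True, text=True,
    ).stdout.splitlines()
    doomed = at_refs(listing)
    for ref in doomed:
        subprocess.run(["git", "update-ref", "-d", ref], cwd=repo, check=True)
    return doomed
```

Printing the result of at_refs() first, before calling delete_at_refs(), is a cheap way to review the list on a copy of the clone.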

Git Branches For The Good Ones

After deleting all the @-containing branches (and anything else you don’t want), we can make proper Git branches for all the svn branches. Note that git-svn creates branches for all svn tags (since Subversion itself doesn’t distinguish between branches and tags). For each branch you’d like as a Git branch:

$ git branch people/meejah/foo --track refs/remotes/people/meejah/foo
$ git update-ref -d refs/remotes/people/meejah/foo
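
With 50+ branches plus tags, generating these commands is less error-prone than typing them. The following is only a sketch: it assumes imported SVN tags live under refs/remotes/tags/, creates lightweight tags, leaves out --track for simplicity, and uses a made-up function name.

```python
REMOTE_PREFIX = "refs/remotes/"
TAG_PREFIX = REMOTE_PREFIX + "tags/"  # assumed location of imported SVN tags


def promotion_commands(remote_ref):
    """Return the git commands turning one git-svn ref into a branch or tag."""
    if not remote_ref.startswith(REMOTE_PREFIX):
        raise ValueError(f"not a remote ref: {remote_ref}")
    if remote_ref.startswith(TAG_PREFIX):
        # SVN tags arrive as remote branches; make a real Git tag instead.
        create = ["git", "tag", remote_ref[len(TAG_PREFIX):], remote_ref]
    else:
        create = ["git", "branch", remote_ref[len(REMOTE_PREFIX):], remote_ref]
    # Once the local ref exists, the git-svn ref is no longer needed.
    return [create, ["git", "update-ref", "-d", remote_ref]]
```

Each returned command is an argument list that can be passed straight to subprocess.run(..., check=True), or simply printed for review first.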

Remove All The Things

Now we can go back in time and pretend all the bad stuff never happened. Most of these sins in our repository were binary dependencies (.jar files, .swc files) or build-generated artifacts (e.g. PDF files from the documentation repositories).

I’m glossing over a few details here, but basically if you can make an --index-filter work for this (i.e. use only index-modifying Git commands, rather than a --tree-filter that checks out every revision) and use a ramdisk as the temporary directory, this will be fairly quick.

$ git filter-branch -d /dev/shm/gittmp --index-filter "git rm --cached -f --ignore-unmatch 'test/data/*.bin' ; git rm --cached -f --ignore-unmatch '*.jar'" --tag-name-filter cat --prune-empty -- --all

The above deletes all .jar files, and anything below test/data with a .bin extension. The patterns are quoted so that Git, not the shell, does the matching. As you can imagine, there are lots more possibilities. The --ignore-unmatch is so you don’t get errors if nothing got deleted.
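
Since the list of unwanted patterns tends to grow, it can help to build the filter command from a list. This is a sketch under assumptions: the function names and the default temporary directory are illustrative, and shlex.quote is used so each pattern reaches git rm unexpanded by the shell.

```python
import shlex


def index_filter(patterns):
    """Build an --index-filter command removing paths that match patterns."""
    quoted = " ".join(shlex.quote(pattern) for pattern in patterns)
    return f"git rm --cached -f -q --ignore-unmatch -- {quoted}"


def filter_branch_command(patterns, tmpdir="/dev/shm/gittmp"):
    """Return the full git filter-branch argument list for all refs."""
    return [
        "git", "filter-branch", "-d", tmpdir,
        "--index-filter", index_filter(patterns),
        "--tag-name-filter", "cat", "--prune-empty", "--", "--all",
    ]
```

For example, filter_branch_command(["test/data/*.bin", "*.jar"]) yields an argument list equivalent to the command shown above, with both patterns handled by a single git rm.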

Git’s filter-branch command wisely creates backups of the un-filtered branches, which we’ll now blow away. Note that I usually do all the filtering stuff on a copy of the cloned repository, in case something goes awry…

$ git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
$ rm -rf .git/refs/original  # scorch the earth, too
$ git reflog expire --expire=now --all
$ git gc --prune-now

Note that you can accomplish basically the same thing by git clone-ing the repository. (Remember to use a file:// URI instead of a plain path; otherwise Git hard-links the existing object files, unreachable ones included.)

Removing svn:externals

Part of the “clean-up trunk” activities included removing two svn:externals definitions. However, these externals still existed in the history. For one of these, eliminating the external was accomplished by copying all the content into trunk.

So, for our historic revisions, we use git filter-branch to go back in time and pretend this was never an external by svn export-ing the external at the right revision.

For each of your branches, you’ll want something like this

$ git filter-branch -f -d /dev/shm/gittmp --prune-empty --tree-filter git-svn-filterbranch.py people/meejah/foo

In pseudo-code to highlight the Git commands, the git-svn-filterbranch.py will do something like this:

$ EXTERNAL_NAME, EXTERNAL_URI = `git svn propget svn:externals`
$ OUR_REV = `git svn info`
$ svn export --force ${EXTERNAL_URI}@${OUR_REV} ${EXTERNAL_NAME}

The net result is that Git goes back through time on the people/meejah/foo branch and, for every commit, svn exports the external at the Subversion revision that commit came from. (Every revision on the branch is re-written.) Repeat for all branches.
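
To make the pseudo-code a little more concrete, here is a sketch of the two pure pieces: parsing the externals property and building the svn export command. It only handles the simplest two-field definitions (a directory and an absolute URL, in either order), with no -r or peg revisions; anything fancier raises an error. The function names are hypothetical and this is not the actual git-svn-filterbranch.py.

```python
def parse_externals(prop_text):
    """Parse simple two-field svn:externals lines into (local_dir, url) pairs."""
    externals = []
    for line in prop_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(f"unsupported svn:externals line: {line!r}")
        first, second = fields
        if "://" in first:
            url, local_dir = first, second  # "URL dir" ordering
        else:
            local_dir, url = first, second  # "dir URL" ordering
        externals.append((local_dir, url))
    return externals


def export_command(local_dir, url, revision):
    """Build the svn export call pinned to the given Subversion revision."""
    return ["svn", "export", "--force", f"{url}@{revision}", local_dir]
```

The tree filter would feed the svn:externals property text into parse_externals(), take the revision of the commit being rewritten, and run each export_command() result inside the checked-out tree.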

Turn Trunk into Master

You really want the trunk branch to go away and “be” master instead:

$ git checkout trunk
$ git branch -D master
$ git checkout -f -b master
$ git branch -r -D trunk

Are We There Yet?

Yay! We’ve achieved a history-complete clone of our Subversion repository, filtered out anything we didn’t want and now have a nice pristine Git repository. So, we could push this to GitHub, for example:

$ git remote add origin https://github.com/username/repo
$ git push origin --all
$ git push origin --tags

In our case, we went from a 1.2 GiB repository with many objects over 50 MiB and a ton of cruft, to a complete repository of ~100 MiB with all binary objects deleted, all svn:externals resolved to the correct code and nearly all of the history.

This is still a pretty unwieldy Git repository by any standard, but it’s actually smaller than a single Subversion checkout of the previously existing (i.e. pre-cleanups) Subversion trunk…