I've found myself in a (potentially unique) situation where we have a
gi-normous Subversion repository at work and
we've been exploring ways on how to trim off some of the fat but still keep all
the logical history so that we could still use things like
svn blame to
drill-down into code-history. Our central SVN repository is some 4-5 years old
and a whopping 300GB+ on-disk. (Yowza!) What we'd really like to do is dump
/trunk history out to a new repo and roll forward with that, ditching
any historical baggage from old topic branches (
/branches). The trouble is,
I haven't been able to find any tools to do this.
So, I ended-up writing my own tool to do this. But, first, some back-story…
In doing some searches for variations around “svn repo filter”, I found a lot
of people pointing to the “svndumpfilter”
utility as the tool-of-choice. Sadly, it doesn't seem to be quite “smart” enough
to do what I'd like it to do. It seems to be aimed at (namely) filtering an
svnadmin dump stream, taking only certain paths in the SVN “filesystem”. That
works fine if you're trying to take an isolated folder/project from the repository,
but if that folder/project has ever been merged into from any paths outside of
the target filter path, then things fall apart. The “trouble” are the copy-from's…
For example, say that you have a repo with a typical trunk/tags/branches setup.
Say you create a new topic branch (e.g.
svn copy /trunk /branches/my-fix && svn co /branches/my-fix)
and happily work on your sandboxed branch. Say that you decide to rename some of
the pre-existing files/folders, so you run the appropriate
svn move commands and
happily commit those changes to your branch. Once everything is working happily,
you go to merge these changes into
/trunk and that all works great. After the
commit, if you run a
svn log -v -l1 /trunk to look at the details of the most
recent commit to trunk, you'll see something like this:
A /trunk/Project/RenamedFolder (from /branches/my-fix/Project/RenamedFolder@12345)
…which only describes the (top-level) folder rename, not any add/modifications/renames/etc
that might have happened inside that folder on the branch. At the full repo-level,
SVN can get away with just doing a “
svn copy” from the branch to trunk, so that
when you do an “
svn log” on the new “RenamedFolder” path it will walk from trunk
back into that originating topic branch and follow the rename (copy or move) that
svnadmin dump will show the same “copy-from” action as
svn log -v did,
since the dump stream will need to be able to recreate that same ancestry. When
svndumpfilter sees this, it throws up its hands and returns an error because
it doesn't know how to recreate the logical history that happened between
when that branch originally forked from trunk versus the final state of that branch.
That is, assuming the copy-from path even has any ancestry back to trunk…
If You Want Something Done Right…
In my searches, I had found a few folks that had hacked together there own solutions. None of those quite fit my problem at hand, so naturally I decided to start working on my own solution.
The closest fit I found was a project named “svn2svn”
hosted on Google Code. It had copied parts of the “hgsvn”
project (sychronizing between Mercurial and Subversion repositories) and
slapped them together in a way that worked for the original author's needs.
It used “
svn log” to walk the entire history of a given path in a source
repository and then manually replayed those changes to a working-copy of
some path in a target repository. Revision by revision, it would replay
the delta, recreating the history of just the source path in the target repo.
This was oh-so-close to what I wanted, except that it also didn't correctly
handle the copy-from case. And the repository I really wanted to replay was
littered with copy-from's, since I'm trying to replay just the history from
/trunk and we create topic-branches for everything.
So, I started with the svn2svn project, got familiar with the code, and started hacking to extend the code to solve my problem at hand.
The net-result is my own (nearly completely rewritten) take on the problem, a project which I'm also calling svn2svn.
I've made several enhancements to the original script:
Full ancestry (copy-from) support. This was the tricky part. It took several different iterations/rewrites to get something which worked for all the different edge-cases. The general idea here is that we can use
svn logto walk backwards through the ancestry on a copy-from case: if we can trace back our ancestry to same source path we're replaying, then we can do an
svn copyfrom that original parent and then do
svn exportto update the contents of all the files to match the final copy-from version. There's also some extra recursion that needs to happen here, to handle cases where child files got renamed inside of a parent-renamed folder. There are other edge-cases like files getting replaced inside a parent-renamed folder.
Use revprops for source-tracking and resume support. Subversion has revision properties, key+value pairs that are associated with a particular revision/commit. I'm setting some
svn2svn:*rev-props to track the source URL, source repository UUID (i.e. in case source URL now points at a physically different repo), and source revision #. This is all needed for proper resume support, since for the ancestry support we have to maintain a mapping-table of source_rev -> target_rev, so that when we find a copy-from from some revision # in the source repo, we can map this to the equivalent revision # in the target repo so we can do an equivalent
Better verbosity output, and optional debug output. As I was playing with the rewrite, I quickly found I needed better debug output, to see which shell commands were being run and to display just general debug/status messages as we do all this new complicate ancestry-walking logic. Bonus: the debugging messages have colored output, using ANSI escape codes which all self-respecting terminal emulators should respect.
All commits (including initial import) go through the same central code-path, which means we could run a (client-side) pre-commit script to scrub the contents of the target working-copy before each commit. This is where the power of doing the manual replay of changes really starts to shine. We have full control over the pending changes, which means that if your original trunk history had some files which you didn't want to transition into the new replayed repo then you could easily
svn rmthose files from the working-copy before the commit happens. That could be as simple as excluding certain fixed paths, but it could be a lot more flexible like searching the entire working-copy tree and removing any files which match a certain file-name. Heck, you could even modify the working-copy file-contents at this point…or add brand-new files if you want. We're replaying the SVN history here will full control of the target content, so this opens a lot of interesting options.
Check-out the svn2svn project page for more details.
Also, I have the project mirrored on Github so please feel free to fork the project, send me issues/enhancements, or send me pull-requests for any tweaks you've made.