
Danny's Lab

Engineering the World

Version Control for Big Repositories (git, svn, mercurial)

Published on: Feb 3, 2015
Reading time: 4 minutes

Overview

These days git appears to have won the SCM "wars". While it's fantastic for what it was intended for (source code), it has some shortcomings when you try to use it for large (i.e. >10 GB) repositories.

The Problem

I have a repository for keeping track of images/photos/scans. I do this because a repository affords me several benefits I don't get easily with a bare filesystem:

  1. Change monitoring. I want to be absolutely sure when a file has been modified.
  2. History tracking. I'd like to maintain a record of when photos were added/taken (independent of any metadata in the image).
  3. Backup. Because of the change tracking and the remote storage, an SCM "automatically" gives me a backup of my library with easy synchronization.
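To make the change-monitoring benefit concrete, here is a minimal Python sketch of the kind of check an SCM performs for you: hash every file, then compare two snapshots. The function names (build_manifest, diff_manifests) and the choice of SHA-256 are my own assumptions for illustration, not how any particular SCM implements it.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256()
            with path.open("rb") as handle:
                # Read in 1 MiB chunks so a large image never sits fully in memory.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            manifest[path.relative_to(root).as_posix()] = digest.hexdigest()
    return manifest


def diff_manifests(old, new):
    """Return (added, removed, modified) lists of relative paths."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    modified = sorted(p for p in set(old) & set(new) if old[p] != new[p])
    return added, removed, modified
```

Building a manifest before and after a sync and passing both to diff_manifests would show exactly which files changed. What it does not give you is history or remote storage, which is why I want a real repository.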

My library consists of many images of varying sizes (some small, some not), and the entire library is nearly 100 GB.

The Requirements

  • Easy repository rebuild - I must be able to easily rebuild the repository history from a working copy (even if the checkout does not contain the complete history).
  • Network fault tolerance - A 50+ GB repository takes a while to transfer. Initial checkouts, as well as later checkouts and commits, must be resumable if the network connection is disrupted.

The Candidates

I evaluated several SCM tools to solve this problem:

  • git
  • svn
  • git-svn
  • mercurial
  • boar (https://code.google.com/p/boar/)
  • bup (https://github.com/apenwarr/bup)
  • camlistore (https://camlistore.org/docs/overview)
  • a replicated filesystem

Git

I like git. Git works really well... for most problems. But when it comes to big files and/or big repositories, it's severely lacking. I found numerous extensions that attempt to address the issue (git-annex, git-bigfiles, git-fat, git-media, etc.), but none of them really worked for me. The way git packs files and computes deltas is extremely inefficient for large repositories: it wants to keep large portions of the data in memory, which inevitably results in "Out of memory" errors. Most of the extensions work around this by storing only metadata about your files in git while keeping the content outside of the repository. However, that doesn't satisfy my need for the repository to double as a backup tool. Git also requires an initial clone to complete fully before the checkout is usable. In other words, if you have any network hiccups during your 100 GB transfer, you have to start over.

Subversion

Subversion handles large files much better than git. The only real problem I had with it is that a checkout can't automatically serve as a repository backup, i.e. there is no tool that takes the log data and files from a working checkout and recreates the server repository from them. In my use case, I have a repository that I only ever add to (no deletes or modifications), so this would have been perfect, had it existed.

Git-svn

This seemed like the best of both worlds. I could use a Subversion server, check out to a local git repository, then perhaps rebuild the Subversion server from the working git repository if any problems occurred. Alas, I couldn't get this working.

Mercurial

Mercurial handles things quite similarly to git; however, it doesn't suffer from interrupted network connections quite as badly as git does. It allows you to do partial checkouts, so you can specify smaller revision ranges and work your way up until you have checked out the entire repository. I did end up getting this to work on Ubuntu. However, I didn't have any luck on CentOS, which makes me question the longevity of the program.
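As a rough sketch of what "working your way up in smaller ranges" can look like, the Python function below builds a list of hg commands that clone up to an early revision and then pull further revisions in batches. The function name, the example URL, and the batch size are hypothetical, and it assumes a linear history so that the source repository's numeric revisions are meaningful cut points. It relies only on the standard -r/--rev option of hg clone and hg pull, -U/--noupdate, and the global -R/--repository option.

```python
def incremental_pull_commands(source_url, dest_dir, tip_rev, step):
    """Build hg commands that fetch revisions 0..tip_rev in batches of `step`.

    Assumes a linear history, so pulling revision N also brings in every
    revision below N.
    """
    if step < 1 or tip_rev < 0:
        raise ValueError("step must be >= 1 and tip_rev must be >= 0")
    first = min(step - 1, tip_rev)
    # Clone only up to the first batch boundary, without updating the working copy.
    commands = [["hg", "clone", "-U", "-r", str(first), source_url, dest_dir]]
    rev = first
    while rev < tip_rev:
        rev = min(rev + step, tip_rev)
        # Each pull only transfers the revisions missing from dest_dir.
        commands.append(["hg", "pull", "-R", dest_dir, "-r", str(rev), source_url])
    # Populate the working copy once all history has arrived.
    commands.append(["hg", "update", "-R", dest_dir])
    return commands
```

Each command could be handed to subprocess.run(cmd, check=True) and simply retried if the connection drops, since a failed batch only costs you that batch rather than the whole transfer.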

Boar

Boar appears to be designed exactly for my purpose. However, it doesn't seem to have much of an active community around it, and there doesn't appear to be a way to convert between boar and svn/git.

Camlistore

Camlistore also appears to be designed for a similar purpose, but it doesn't appear to have much community support either.

Bup

Bup is a backup utility. It doesn't seem to be designed exactly for my purpose, but it may work for you. I decided it wasn't quite mature enough for my needs.

Replicated Filesystem

This is certainly one of the easiest options, but it doesn't afford me any of the integrity checking that I want.

Conclusion

I see little alternative but to stay with Subversion. It's still well supported and handles its job well. However, I'll solve my backup problem by simply syncing the server repository with an Amazon S3 bucket (see my article on [Simple Backup to Amazon S3](/articles/computers-and-software/servers/simple-backup-to-amazon-s3/ "Simple Backup to Amazon S3")). I'd prefer a more elegant solution, but this appears to be the state of the art for the time being.