wishlist: git-annex replicate

I'd like to be able to do something like the following:

Create encrypted git-annex remotes on a couple of semi-trusted machines (ones that have good connectivity but non-redundant hardware).
Set numcopies=3.
Run git-annex replicate and have git-annex run the appropriate copy commands to make sure every file is on at least 3 machines.

There would also likely be a git annex rebalance command which could be used if remotes were added or removed. If possible, it should copy files between servers directly, rather than proxy through a potentially slow client.

There might be a need for a 'replication_priority' option for each remote that configures which machines are preferred. That way you could set your local server to a high priority to ensure that it is always 1 of the 3 machines used, with files distributed across 2 of the remaining remotes. Other than priority, other options that might help:

maxspace - A self-imposed quota per remote machine. git-annex replicate should try to replicate files first to machines with more free space. maxspace would change the free space calculation to be `min(actual_free_space, maxspace - space_used_by_git_annex)`.
bandwidth - When replicating files, copies should be done between machines with the highest available bandwidth. (I think this option could be useful for git-annex get in general.)
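
To make the maxspace and replication_priority ideas above a bit more concrete, here is a rough Python sketch of how targets might be chosen. None of this is existing git-annex code: the `Remote` class, `effective_free_space`, and `pick_targets` are hypothetical names, and clamping the result at zero and skipping remotes that cannot fit the file are my own assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Remote:
    """Hypothetical description of a remote; not a git-annex data structure."""
    name: str
    actual_free_space: int            # bytes free on the remote's disk
    space_used_by_git_annex: int      # bytes already taken by annexed content
    maxspace: Optional[int] = None    # self-imposed quota in bytes, None = no quota
    replication_priority: int = 0     # higher value = preferred


def effective_free_space(remote):
    """min(actual_free_space, maxspace - space_used_by_git_annex), clamped at 0."""
    if remote.maxspace is None:
        return remote.actual_free_space
    quota_left = remote.maxspace - remote.space_used_by_git_annex
    # Assumption: a remote that is over its quota counts as having no space.
    return max(0, min(remote.actual_free_space, quota_left))


def pick_targets(remotes, already_on, numcopies, file_size):
    """Return names of remotes that should receive a copy of one file."""
    needed = numcopies - len(already_on)
    if needed <= 0:
        return []  # numcopies is already satisfied
    candidates = [
        r for r in remotes
        if r.name not in already_on and effective_free_space(r) >= file_size
    ]
    # Highest priority first; among equal priorities, most free space first.
    candidates.sort(
        key=lambda r: (-r.replication_priority, -effective_free_space(r))
    )
    return [r.name for r in candidates[:needed]]
```

With the local server given a high replication_priority, it is always picked first as long as it has room, and the remaining copies go to whichever other remotes have the most effective free space.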

comment 1

While having remotes redistribute introduces some obvious security concerns, I might use it.

As remotes support a cost factor already, you can basically implement bandwidth through that.

Comment by Richard — Fri Apr 22 14:27:00 2011

comment 2

Besides the cost values, annex.diskreserve was recently added. (But is not available for special remotes.)

I have held off on adding high-level management stuff like this to git-annex, as it's hard to make it generic enough to cover use cases.

A low-level way to accomplish this would be to have a way for git annex get and/or copy to skip files when numcopies is already satisfied. Then cron jobs could be used.
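
As a rough illustration of the "skip files when numcopies is already satisfied" check, something like the following Python sketch would do. It is not git-annex code: `files_needing_copies` and the `locations` mapping (file name to the set of remotes known to hold it) are hypothetical, and how that mapping is obtained is left open.

```python
def files_needing_copies(locations, numcopies):
    """Return {file: number of missing copies} for files below numcopies.

    `locations` is assumed to map each file name to the set of remote
    names known to hold its content. Files that already have at least
    `numcopies` copies are skipped entirely.
    """
    return {
        name: numcopies - len(holders)
        for name, holders in locations.items()
        if len(holders) < numcopies
    }
```

A cron job would then only need to run get or copy for the files this returns, and would do nothing once every file has enough copies.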

Comment by joey — Sat Apr 23 12:22:07 2011

comment 3

Hmm, so it seems there is almost a way to do this already.

I think the one thing that isn't currently possible is to have 'plain' ssh remotes: basically something just like the directory remote, but able to take an ssh user@host/path url. Something like sshfs could be used to fake this, but for things like fsck you would want to do the sha1 calculations on the remote host.

Comment by Justin — Sat Apr 23 13:54:42 2011
