A supplement to the walkthrough.

using Amazon S3

git-annex extends git's usual remotes with special remotes that are not git repositories. This way you can set up a remote using, say, Amazon S3, and use git-annex to transfer files into the cloud.

First, export your S3 credentials:

# export AWS_ACCESS_KEY_ID="08TJMT99S3511WOZEP91"
# export AWS_SECRET_ACCESS_KEY="s3kr1t"

Now, create a gpg key, if you don't already have one. It will be used to encrypt everything stored in S3, for your privacy. Once you have a gpg key, run gpg --list-secret-keys to look up its key ID, something like "2512E3C7".
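
If you need to create a key first, gpg will walk you through it interactively; afterwards the key ID appears in the secret key listing. A session might look something like this (output abbreviated; the key ID and details are illustrative):

# gpg --gen-key
  (answer the prompts; the defaults are fine)
# gpg --list-secret-keys
sec   2048R/2512E3C7 2011-04-01
uid                  Your Name <you@example.com>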

Next, create the S3 remote, and describe it.

# git annex initremote cloud type=S3 encryption=2512E3C7
initremote cloud (encryption setup with gpg key C910D9222512E3C7) (checking bucket) (creating bucket in US) (gpg) ok
# git annex describe cloud "at Amazon's US datacenter"
describe cloud ok

The configuration for the S3 remote is stored in git, so it's easy to make another repository use the same S3 remote:

# cd /media/usb/annex
# git pull laptop
# git annex initremote cloud
initremote cloud (gpg) (checking bucket) ok

Now the remote can be used like any other remote.

# git annex copy my_cool_big_file --to cloud
copy my_cool_big_file (gpg) (checking cloud...) (to cloud...) ok
# git annex move video/hackity_hack_and_kaxxt.mov --to cloud
move video/hackity_hack_and_kaxxt.mov (checking cloud...) (to cloud...) ok
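
Getting content back out works the same way; for example, to retrieve the file that was just moved to S3 (output abbreviated):

# git annex get video/hackity_hack_and_kaxxt.mov --from cloud
get video/hackity_hack_and_kaxxt.mov (from cloud...) ok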

See S3 for details.

using bup

Another special remote that git-annex can use is a bup repository. Bup stores large file contents in a git repository of its own, with deduplication. Combined with git-annex, you can have git on both the frontend and the backend.

Here's how to create a bup remote, and describe it.

# git annex initremote mybup type=bup encryption=none buprepo=example.com:/big/mybup
initremote mybup (bup init)
Initialized empty Git repository in /big/mybup/
ok
# git annex describe mybup "my bup repository at example.com"
describe mybup ok

Now the remote can be used like any other remote.

# git annex move my_cool_big_file --to mybup
move my_cool_big_file (to mybup...)
Receiving index from server: 1100/1100, done.
ok

Note that, unlike other remotes, bup does not really support removing content from its git repositories. This is a feature. :)

# git annex move my_cool_big_file --from mybup
move my_cool_big_file (from mybup...)
  content cannot be removed from bup remote
failed
git-annex: 1 failed

See bup for details.

using the web

The web can be used as a special remote too.

# git annex addurl http://example.com/video.mpeg
addurl video.mpeg (downloading http://example.com/video.mpeg)
########################################################## 100.0%
ok

Now the file is downloaded, and has been added to the annex like any other file. So it can be copied to other repositories, and so on.
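
You can check where git-annex thinks the file is located with the whereis subcommand; the web shows up as one of its locations (the second UUID here is illustrative):

# git annex whereis video.mpeg
whereis video.mpeg (2 copies)
  	00000000-0000-0000-0000-000000000001 -- web
  	27755352-e9b2-11e0-b944-ef110e2ce421 -- here
ok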

Note that git-annex assumes that, if the web site does not 404, the file is still present on the web, and this counts as one copy of the file. So it will let you remove your last copy, trusting it can be downloaded again:

# git annex drop video.mpeg
drop video.mpeg (checking http://example.com/video.mpeg) ok

If you don't trust the web to this degree, just let git-annex know:

# git annex untrust web
untrust web ok

With the result that it will hang onto files:

# git annex drop video.mpeg
drop video.mpeg (unsafe) 
  Could only verify the existence of 0 out of 1 necessary copies
  Also these untrusted repositories may contain the file:
    00000000-0000-0000-0000-000000000001  -- web
  (Use --force to override this check, or adjust annex.numcopies.)
failed
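
To drop the file now, first get a copy into a repository you do trust; for example, using the cloud remote set up earlier, the drop then succeeds (output abbreviated):

# git annex copy video.mpeg --to cloud
copy video.mpeg (to cloud...) ok
# git annex drop video.mpeg
drop video.mpeg (checking cloud...) ok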

using the SHA1 backend

A handy alternative to the default backend is the SHA1 backend. This backend provides more git-style assurance that your data has not been damaged. And the checksum means that when you add the same content to the annex twice, only one copy need be stored in the backend.

The only reason it's not the default is that it needs to checksum files when they're added to the annex, and this can slow things down significantly for really big files. To make SHA1 the default, just add something like this to .gitattributes:

* annex.backend=SHA1
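
Like any .gitattributes setting, it can also be scoped to particular files. For example (a hypothetical configuration), to default to SHA1 while keeping the faster WORM backend for big video files:

* annex.backend=SHA1
*.mov annex.backend=WORM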

migrating data to a new backend

Maybe you started out using the WORM backend, and have now configured git-annex to use SHA1. But files you added to the annex before still use the WORM backend. There is a simple command that can migrate that data:

# git annex migrate my_cool_big_file
migrate my_cool_big_file (checksum...) ok

You can only migrate files whose content is currently available. Other files will be skipped.
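
To migrate everything in one pass, point the command at the top of the working tree; files whose content isn't present are simply skipped:

# git annex migrate .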

After migrating a file to a new backend, the old content in the old backend will still be present. That is necessary because multiple files can point to the same content. The git annex unused subcommand can be used to clear up that detritus later. Note that hard links are used, to avoid wasting disk space.
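
A cleanup session after a migration might look like this (the key name and numbers are illustrative):

# git annex unused
unused . (checking for unused data...)
  Some annexed data is no longer used by any files in the repository.
    NUMBER  KEY
    1       WORM-s3-m1289672605--my_cool_big_file
  (To see where data was previously used, try: git log --stat -S'KEY')
  (To remove unwanted data: git-annex dropunused NUMBER)
ok
# git annex dropunused 1
dropunused 1 ok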

untrusted repositories

Suppose you have a USB thumb drive and are using it as a git annex repository. You don't trust the drive, because you could lose it, or accidentally run it through the laundry. Or, maybe you have a drive that you know is dying, and you'd like to be warned if there are any files on it not backed up somewhere else. Maybe the drive has already died or been lost.

You can let git-annex know that you don't trust a repository, and it will adjust its behavior to avoid relying on that repository's continued availability.

# git annex untrust usbdrive
untrust usbdrive ok

Now when you do a fsck, you'll be warned appropriately:

# git annex fsck .
fsck my_big_file
  Only these untrusted locations may have copies of this file!
    05e296c4-2989-11e0-bf40-bad1535567fe  -- portable USB drive
  Back it up to trusted locations with git-annex copy.
failed

Also, git-annex will refuse to drop a file from elsewhere just because it can see a copy on the untrusted repository.

It's also possible to tell git-annex that you have an unusually high level of trust for a repository. See trust for details.
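
Marking a repository as trusted works the same way; for example, to trust the laptop repository from earlier (output follows the same pattern as untrust):

# git annex trust laptop
trust laptop ok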

what to do when you lose a repository

So you lost a thumb drive containing a git-annex repository. Or a hard drive died or some other misfortune has befallen your data.

Unless you configured backups, git-annex can't get your data back. But it can help you deal with the loss.

First, go somewhere that knows about the lost repository, and mark it as untrusted.

# git annex untrust usbdrive

To remind yourself later what happened, you can change its description, too:

# git annex describe usbdrive "USB drive lost in Timbuktu. Probably gone forever."

This retains the location tracking information for the repository. Maybe you'll find the drive later. Maybe that's impossible. Either way, this lets git-annex tell you why a file is no longer accessible, and it avoids relying on that drive to hold any content.
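
For example, whereis will now report the updated description, making the situation clear (UUID as in the fsck example above):

# git annex whereis my_cool_big_file
whereis my_cool_big_file (1 copy)
  	05e296c4-2989-11e0-bf40-bad1535567fe -- USB drive lost in Timbuktu. Probably gone forever.
ok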

recover data from lost+found

Suppose something goes wrong, and fsck puts all the files in lost+found. It's actually very easy to recover from this disaster.

First, check out the git repository again. Then, in the new checkout:

$ mkdir recovered-content
$ sudo mv ../lost+found/* recovered-content
$ sudo chown -R you:you recovered-content
$ chmod -R u+w recovered-content
$ git annex add recovered-content
$ git rm -r recovered-content
$ git commit -m "recovered some content"
$ git annex fsck

The way that works is that when git-annex adds the same content that was in the repository before, all the old links to that content start working again. This works particularly well if the SHA* backends are used, but even with the default backend it will work pretty well, as long as fsck preserved the modification time of the files.

Internet Archive via S3

The Internet Archive allows members to upload collections using an Amazon S3 compatible API, and this can be used with git-annex's S3 support.

So, you can locally archive things with git-annex, define remotes that correspond to "items" at the Internet Archive, and use git-annex to upload your files to there. Of course, your use of the Internet Archive must comply with their terms of service.

Sign up for an account, and get your access keys here: http://www.archive.org/account/s3.php

# export AWS_ACCESS_KEY_ID=blahblah
# export AWS_SECRET_ACCESS_KEY=xxxxxxx

Specify host=s3.us.archive.org when doing initremote to set up a remote at the Archive. This enables a special Internet Archive mode: encryption is not allowed; you must specify a bucket name, rather than having git-annex pick a random one; and you can optionally specify x-archive-meta* headers to add metadata, as explained in their documentation.

# git annex initremote archive-panama type=S3 \
    host=s3.us.archive.org bucket=panama-canal-lock-blueprints \
    x-archive-meta-mediatype=texts x-archive-meta-language=eng \
    x-archive-meta-title="original Panama Canal lock design blueprints"
initremote archive-panama (Internet Archive mode) ok
# git annex describe archive-panama "a man, a plan, a canal: panama"
describe archive-panama ok

Then you can annex files and copy them to the remote as usual:

# git annex add photo1.jpeg --backend=SHA1E
add photo1.jpeg (checksum...) ok
# git annex copy photo1.jpeg --fast --to archive-panama
copy photo1.jpeg (to archive-panama...) ok

Note the use of the SHA1E backend. It makes most sense to use the WORM or SHA1E backend for files that will be stored in the Internet Archive, since the key name will be exposed as the filename there, and since the Archive does special processing of files based on their extension.
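
If most files in a repository are destined for the Archive, you can make SHA1E the default via .gitattributes, just as with SHA1 earlier:

* annex.backend=SHA1E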
