I checked my music collection (about 25000 files) into git-annex and I'm really impressed by its performance (after running git repack). Now I'm also moving my movies into the same annex, but my disk drives are laid out as follows:

  • small raid-1 for important stuff (music, documents), which is also backed up (aka: raid)
  • big bulk data store (aka: media)

Inside the annex, the files are laid out like this:

  • documents/ <- on raid
  • music/ <- on raid
  • videos/ <- on media

Now I didn't simply clone the raid annex to media, but did a sparse checkout (possible since git version 1.7.0):

  • raid: .git-annex/, documents/ and music/
  • media: .git-annex/, videos/
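For anyone who wants to reproduce this setup, here is a minimal sketch of how such a sparse clone could be created. `sparse_clone` is a hypothetical helper name and the paths in the usage comment are assumptions. The git pieces it relies on (`core.sparseCheckout`, `.git/info/sparse-checkout`, `git read-tree -mu HEAD`) are the standard sparse checkout mechanism.

```shell
# Hypothetical helper: clone SRC into DEST and check out only the
# paths matching the given sparse-checkout patterns.
sparse_clone() {
    src=$1
    dest=$2
    shift 2
    # Clone without populating the working tree yet.
    git clone -q --no-checkout "$src" "$dest" || return 1
    (
        cd "$dest" || exit 1
        git config core.sparseCheckout true
        mkdir -p .git/info
        # One pattern per line, same syntax as .gitignore.
        printf '%s\n' "$@" > .git/info/sparse-checkout
        # Populate index and working tree, honouring the patterns.
        git read-tree -mu HEAD
    )
}

# Usage (paths are assumptions):
#   sparse_clone ~/annex ~/media/annex .git-annex/ videos/
```

Files outside the listed patterns stay in the index but are not written to disk, which is why the rest of the tree does not take up space on the media drive.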

As you can see, I have to check out the .git-annex directory with the log files twice, which slows down git operations. Everything else has worked fine so far. git-annex has no problem with only part of the symlinks being present, which is really great. Is it possible to sparse-checkout the .git-annex directory as well? Perhaps by splitting the log files in .git-annex/ into N subfolders corresponding to the top-level folders, like this?

Before:

 $ ls .git-annex
 00 01 02....

After:

 $ ls .git-annex
 documents/ music/ videos/
 $ ls .git-annex/documents
 00 01 02....

This would make it possible to check out only the part of the log files that I'm interested in.

That's awesome, I had not heard of git sparse checkouts before.

It does not make sense to tie the log files to the directory of the corresponding files, as then the logs would have to move when the files are moved, which would be a PITA and likely make merging log file changes very complex. Also, of course, multiple files in different locations can point at the same content, which has a single log file. And, to cap it off, git-annex may need to access the log file for a given key without having the slightest idea what file in the repository might point to it, and it would be very expensive to scan the whole repository to find that file just to look up the filename of the log file.
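To illustrate the point about keys, here is a rough sketch. It assumes the key is the last component of a file's symlink target and uses a simplified, hypothetical log path (`.git-annex/KEY.log`, ignoring the hashed subdirectories shown above). The function names are made up for this example.

```shell
# Assumption: the key is the basename of the symlink target.
key_of() {
    basename "$(readlink "$1")"
}

# Hypothetical, simplified layout: the log path depends on the key only,
# never on where the file lives in the tree.
log_for_key() {
    printf '.git-annex/%s.log\n' "$1"
}

log_for_file() {
    log_for_key "$(key_of "$1")"
}
```

Two symlinks in different directories that point at the same content resolve to the same log file, and `log_for_key` works even when no filename is known at all. A per-directory layout would break both properties.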

The most likely change in git-annex that will make this better is in this todo item -- but it's unknown how to do it yet.

Comment by joey Thu Apr 7 12:32:04 2011
BTW, git-annex unused will have a problem when not all the symlinks are present: it will suggest dropping content belonging to the excluded symlinks.
Comment by joey Thu Apr 7 12:33:30 2011
So perhaps checking if git-status (or similar) complains about missing files is a possible solution for this?
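As a rough sketch of what such a check could look at: git marks paths excluded by a sparse checkout as skip-worktree, and `git ls-files -t` tags those entries with `S`. The helper name `sparse_excluded` is hypothetical, and whether git-annex could use this is only a guess.

```shell
# Hypothetical helper: list tracked paths that the sparse checkout
# has excluded from the working tree (tagged "S" by git ls-files -t).
sparse_excluded() {
    git ls-files -t | sed -n 's/^S //p'
}
```

If this prints anything, the working tree is incomplete, so symlinks missing from disk do not mean their content is unused.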
Comment by Christian Fri Apr 8 03:31:03 2011

Something else I've done is symlink the video/ directory from the media annex into the normal raid annex:

ln -s ~/media/annex/video ~/annex

And it's working out great.

~annex $ git annex whereis video/series/episode1.avi
whereis video/series/episode1.avi(1 copy)
        f210b45a-60d3-11e0-b593-3318d96f2520  -- Trantor - Media
ok

I really like this. Perhaps it is a good idea to store all log files in every repo, but maybe it is possible to pack multiple log files into one single file, which stores not only the time, the present bit and the annex repository, but also the file key. I don't know whether this format would also be merged correctly by the union merge driver.
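To make the idea concrete, here is a sketch of what I mean. The line format (`timestamp present uuid key`, with plain integer timestamps) and both function names are only assumptions for illustration, not the format git-annex uses. A union merge is approximated by keeping every line from both sides.

```shell
# Approximate a union merge: keep every distinct line from both sides.
union_merge() {
    cat "$1" "$2" | sort -u
}

# Print the repositories (UUIDs) that currently hold KEY according to
# the combined log. For each uuid the line with the newest timestamp wins.
# Assumed line format: "timestamp present uuid key".
present_in() {
    awk -v key="$2" '
        $4 == key && ($1 + 0) >= (ts[$3] + 0) {
            ts[$3] = $1 + 0
            state[$3] = $2 + 0
        }
        END {
            for (u in state)
                if (state[u] == 1)
                    print u
        }
    ' "$1" | sort
}
```

Since every line carries its own key, the order of lines after merging does not matter, so a line-wise union should still give a usable result. The cost is that a lookup for one key has to scan the whole file.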

Comment by Christian Fri Apr 8 03:54:37 2011