This is git-annex's todo list. Link items to done when done.

smudge
Posted Sat Sep 17 09:24:24 2011

git-annex unused eats memory
Posted Sat Sep 17 09:24:24 2011

parallel possibilities
Posted Sat Sep 17 09:10:11 2011

wishlist: swift backend
Posted Sat Sep 17 09:10:11 2011

tahoe lfs for reals
Posted Sat Sep 17 09:10:11 2011

hidden files
Posted Sat Sep 17 09:10:11 2011

cache key info
Posted Sat Sep 17 09:10:11 2011

wishlist: Prevent repeated password prompts for one command
Posted Sat Sep 17 09:10:11 2011

wishlist: Provide a "git annex" command that will skip duplicates
Posted Sat Sep 17 09:10:11 2011

support S3 multipart uploads
Posted Sat Sep 17 09:10:11 2011

auto remotes
Posted Sat Sep 17 09:10:11 2011

wishlist: "git annex add" multiple processes
Posted Sat Sep 17 09:10:11 2011

union mounting
Posted Wed Apr 13 16:18:01 2011

Well, I spent a few hours playing this evening in the 'reorg' branch in git. It seems to be shaping up pretty well; type-based refactoring in Haskell makes these kinds of big systematic changes a matter of editing until it compiles. And it compiles and the test suite passes. But so far I've only covered items 1, 3, and 4 on the list, and have yet to deal with upgrades.

I'd recommend you not wait to start using git-annex. I am committed to providing upgradability between annexes created with all versions of git-annex, going forward. This is important because we can have offline archival drives that sit unused for years. Git-annex will upgrade a repository to the current standard the first time it sees it, and I hope the upgrade will be pretty smooth; it was not bad for the annex.version 0 to 1 upgrade earlier. The only annoyance with upgrades is that they result in some big commits to git, as every symlink in the repo gets changed and log files get moved to new names.

(The metadata being stored with keys is data that a particular backend can use, and is static to a given key, so there are no merge issues (and it won't be used to preserve mtimes, etc).)

Comment by joey Tue Mar 15 23:22:45 2011

What is the potential time-frame for this change? As I am not using git-annex for production yet, I can see myself waiting to avoid any potential hassle.

Supporting generic metadata seems like a great idea. Though if you are going this path, wouldn't it make sense to avoid metastore for mtime etc and support this natively without outside dependencies?

-- RichiH

Comment by Richard Tue Mar 15 10:08:41 2011

The mtime cannot be stored for all keys. Consider a SHA1 key. The mtime is irrelevant; 2 files with different mtimes, when added to the SHA1 backend, should get the same key.
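For example, a quick illustration at the shell (the checksum, and hence the key, depends only on the content):

echo 'same content' > a
cp a b
touch -d '2001-01-01' b    # give b a different mtime
sha1sum a b                # prints the same checksum for both files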

Probably our spam filter doesn't like your work IP.

Comment by joey Wed Mar 16 12:32:52 2011

For what it's worth, yes, I want to actually forget I ever had the same file in the filesystem with a duplicated name. I'm not just aiming to clean up the disk's space usage; I'm also aiming to clean things up so that navigating the filesystem is easier.

I can write my own script to do that based on the symlinks' target (and I wrote something along those lines), but I still think it'd be nicer if git-annex supported this use case.

Perhaps:

git annex drop --by-contents

could let me remove a file from git-annex if the contents are available through a different name. (Right now, "git annex drop" requires the name and contents match.)

-- Asheesh.

Comment by Asheesh Fri Apr 29 07:48:22 2011

Ah, OK. I assumed the metadata would be attached to a key, not part of the key. This seems to make upgrades/extensions down the line harder than they need to be, but you are right that this way, merges are not, and never will be, an issue.

Though with the SHA1 backend, changing files can be tracked. This means that tracking changes in mtime or other is possible. It also means that there are potential merge issues. But I won't argue the point endlessly. I can accept design decisions :)

The prefix at work is from a university netblock, so yes, it might be on a few hundred proxy lists etc.

Comment by Richard Wed Mar 16 17:05:38 2011

I agree with Christian.

One should first make better use of connections to remotes before exploring parallel possibilities. One should pipeline the requests and answers.

Of course, this could be implemented using Haskell's parallelism and concurrency features.

Comment by npouillard Fri May 20 16:14:15 2011
Thank you for your answer and the link!
Comment by jbd Sat Feb 26 06:26:12 2011
It does offer an S3 compatibility layer, but that is de facto non-functional right now.
Comment by Richard Sat May 14 11:00:51 2011

I really do want just one filename per file, at least for some cases.

For my photos, there's no benefit to having a few filenames point to the same file. As I'm putting them all into the git-annex, that is a good time to remove the pure duplicates so that I don't e.g. see them twice when browsing the directory as a gallery. Also, I am uploading my photos to the web, and I want to avoid uploading the same photo (by content) twice.

I hope that makes things clearer!

For now I'm just doing this:

  • paulproteus@renaissance:/mnt/backups-terabyte/paulproteus/sd-card-from-2011-01-06/sd-cards/DCIM/100CANON $ for file in *; do hash=$(sha1sum "$file" | cut -d' ' -f1); if ls /home/paulproteus/Photos/in-flickr/.git-annex | grep -q "$hash"; then echo already annexed ; else flickr_upload "$file" && mv "$file" "/home/paulproteus/Photos/in-flickr/2011-01-28/from-some-nested-sd-card-bk" && (cd /home/paulproteus/Photos/in-flickr/2011-01-28/from-some-nested-sd-card-bk && git annex add . && git commit -m ...) ; fi; done

(Yeah, Flickr for my photos for now. I feel sad about betraying the principle of autonomo.us-ness.)

Comment by Asheesh Fri Jan 28 03:30:05 2011

Sounds like a good idea.

  • git annex fsck (or similar) should check/rebuild the caches
  • I would simply require a clean tree with a verbose error. 80/20 rule and defaulting to safe actions.
Comment by Richard Tue May 17 03:27:02 2011
Comment by joey Fri Feb 25 15:54:28 2011
I don't suppose this Swift API is compatible with the Eucalyptus Walrus API?
Comment by Jimmy Sat May 14 06:04:36 2011

I'd expect the checksumming to be disk bound, not CPU bound, on most systems.

I suggest you start off on the WORM backend, and then you can run a job later to migrate to the SHA1 backend.
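For illustration, the later migration might look something like this (a sketch; the exact option spelling can vary between git-annex versions):

cd /path/to/annex
git annex migrate --backend=SHA1 .   # recompute keys from content; files are re-hashed locally, not re-copied
git commit -m 'migrate from WORM to SHA1'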

Comment by joey Fri Feb 25 15:12:42 2011
Hmm, I added quite a few comments at work, but they are stuck in moderation. Maybe I forgot to log in before adding them. I am surprised this one appeared immediately. -- RichiH
Comment by Richard Tue Mar 15 21:19:25 2011

(Sadly, it cannot create a symlink, as git still wants to write the file afterwards. So the nice current behavior of unavailable files being clearly missing due to dangling symlinks would be lost when using smudge/clean filters. Contact git developers to get an interface to do this?)

Have you checked what the smudge filter sees when the input is a symlink? Git supports tracking symlinks, so it should also support pushing symlinks through a smudge filter, right? Either way: yes, contact the git devs, one can only ask and hope. And if you can demonstrate the awesomeness of git-annex they might get more interested :)
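For reference, wiring a clean/smudge filter pair into git is just configuration along these lines (git-annex-clean and git-annex-smudge are hypothetical placeholder commands, not existing programs):

# register a filter driver named "annex"; %f is the path of the file being filtered
git config filter.annex.clean  'git-annex-clean %f'
git config filter.annex.smudge 'git-annex-smudge %f'
# route files through it via .gitattributes
echo '*.bin filter=annex' >> .gitattributes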

Comment by dieter Sun Apr 3 16:30:21 2011

Hashing & segmenting seems to be around the corner, which is nice :)

Is there a chance that you will optionally add mtime to your native metadata store? If yes, I'd rather wait for v2 to start with the native system from the start. If not, I will probably set it up tonight.

PS: While posting from work, my comments are held for moderation once again. I am somewhat confused as to why this happens when I can just submit directly from home. And yes, I am using the same auth provider and user in both cases.

Comment by Richard Wed Mar 16 11:51:30 2011
If you support generic metadata, keep in mind that you will need to do conflict resolution. Timestamps may not be synced across all systems, so a log of old metadata could be kept, sorted by history, with the latest entry used. That still leaves the situation of two incompatible changes, which would probably mean manual conflict resolution. You will probably have thought of this already, but I still wanted to make sure this is recorded. -- RichiH
Comment by Richard Tue Mar 15 21:16:48 2011

Hm... O(N^2)? I think it just takes O(N). To read an entry out of a directory you have to download the entire directory (and store it in RAM and parse it). The constants are basically "too big to be good but not big enough to be prohibitive", I think. jctang has reported that his special remote hook performs well enough to use, but it would be nice if it were faster.

The Tahoe-LAFS folks are working on speeding up mutable files, by the way, after which we would be able to speed up directories.

Comment by zooko Tue May 17 15:20:39 2011

Whoops! You'd only told me O(N) twice before..

So this is not too high priority. I think I would like to get the per-remote storage sorted out anyway, since it will probably be the thing needed to convert the URL backend into a special remote, which would then allow ripping out the otherwise unused pluggable backend infrastructure.

Update: Per-remote storage is now sorted out, so this could be implemented if it actually made sense to do so.

Comment by joey Tue May 17 15:57:33 2011

Hey Asheesh, I'm happy you're finding git-annex useful.

So, there are two forms of duplication going on here. There's duplication of the content, and duplication of the filenames pointing at that content.

Duplication of the filenames is probably not a concern, although it's what I thought you were talking about at first. It's probably info worth recording that backup-2010/some_dir/foo and backup-2009/other_dir/foo are two names you've used for the same content in the past. If you really wanted to remove backup-2009/foo, you could do it by writing a script that looks at the basenames of the symlink targets and removes files that point to the same content as other files.
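Such a script could be as simple as the following sketch (untested; it assumes SHA-backed files, where the symlink target encodes the content):

# keep the first name seen for each key; git rm any later name pointing at the same content
seen=$(mktemp)
find . -type l | sort | while read -r f; do
    key=$(basename "$(readlink "$f")")
    if grep -qxF "$key" "$seen"; then
        git rm "$f"    # drops the duplicate name, not the content
    else
        echo "$key" >> "$seen"
    fi
done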

Using SHA1 ensures that the same key is used for identical files, so generally avoids duplication of content. But if you have 2 disks with an identical file on each, and make them both into annexes, then git-annex will happily retain both copies of the content, one per disk. It generally considers keeping copies of content a good thing. :)

So, what if you want to remove the unnecessary copies? Well, there's a really simple way:

cd /media/usb-1
git remote add other-disk /media/usb-0
git annex add
git annex drop

This asks git-annex to add everything to the annex, but then remove any file contents that it can safely remove. What can it safely remove? Well, anything that it can verify is on another repository such as "other-disk"! So, this will happily drop any duplicated file contents, while leaving all the rest alone.

In practice, you might not want to have all your old backup disks mounted at the same time and configured as remotes. Look into configuring trust to avoid needing to do that. If usb-0 is already a trusted disk, all you need is a simple "git annex drop" on usb-1.
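For illustration, the trust-based variant might look something like this (a sketch; "other-disk" is the remote name from the example above):

cd /media/usb-1
git annex trust other-disk   # its copy now counts even when the disk is unplugged
git annex drop .             # drops content that the trusted other-disk is known to have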

Comment by joey Thu Jan 27 14:29:44 2011
Would, with these modifications in place, there still be a way to really git-add a file? (My main repository contains both normal git and git-annex files.)
Comment by chrysn Sat Feb 26 17:43:21 2011
I also think that fetching keys via rsync could be done by one rsync process when the keys are fetched from one host. This would avoid establishing a new TCP connection for every file.
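As an illustration of that idea (assuming the object paths to fetch have been listed, one per line, in wanted-keys.txt), a single rsync invocation can reuse one connection for the whole batch:

# one connection and one rsync process for all of the wanted keys
rsync -av --files-from=wanted-keys.txt otherhost:/path/to/repo/.git/annex/objects/ .git/annex/objects/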
Comment by Christian Fri Apr 8 08:41:43 2011

Unless you are forced to use a password, you should really be using an ssh key.

ssh-keygen
# put local .ssh/id_?sa.pub into the remote .ssh/authorized_keys (which needs to be chmod 600)
ssh-add
git annex whatever
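Alternatively (or in addition), OpenSSH connection sharing avoids repeated prompts by reusing a single connection for all of git-annex's ssh invocations; something along these lines, where myannexhost is a placeholder for the remote's hostname:

# reuse one ssh connection for every transfer in a command
cat >> ~/.ssh/config <<'EOF'
Host myannexhost
    ControlMaster auto
    ControlPath ~/.ssh/master-%r@%h:%p
EOF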
Comment by Richard Fri May 6 14:30:02 2011