This is git-annex's todo list. Link items to done when done.
smudge
Posted Sat Sep 17 09:24:24 2011
git-annex unused eats memory
Posted Sat Sep 17 09:24:24 2011
parallel possibilities
Posted Sat Sep 17 09:10:11 2011
wishlist: swift backend
Posted Sat Sep 17 09:10:11 2011
tahoe lfs for reals
Posted Sat Sep 17 09:10:11 2011
hidden files
Posted Sat Sep 17 09:10:11 2011
cache key info
Posted Sat Sep 17 09:10:11 2011
wishlist: Prevent repeated password prompts for one command
Posted Sat Sep 17 09:10:11 2011
wishlist: Provide a "git annex" command that will skip duplicates
Posted Sat Sep 17 09:10:11 2011
support S3 multipart uploads
Posted Sat Sep 17 09:10:11 2011
auto remotes
Posted Sat Sep 17 09:10:11 2011
wishlist: "git annex add" multiple processes
Posted Sat Sep 17 09:10:11 2011
union mounting
Posted Wed Apr 13 16:18:01 2011
Well, I spent a few hours playing this evening in the 'reorg' branch in git. It seems to be shaping up pretty well; type-based refactoring in Haskell makes this kind of big, systematic change a matter of editing until it compiles. And it compiles, and the test suite passes. But so far I've only covered items 1, 3, and 4 on the list, and have yet to deal with upgrades.
I'd recommend you not wait to start using git-annex. I am committed to providing upgradability between annexes created with all versions of git-annex, going forward. This is important because we can have offline archival drives that sit unused for years. Git-annex will upgrade a repository to the current standard the first time it sees it, and I hope the upgrade will be pretty smooth. It was not bad for the annex.version 0 to 1 upgrade earlier. The only annoyance with upgrades is that they result in some big commits to git, as every symlink in the repo gets changed and log files get moved to new names.
(The metadata being stored with keys is data that a particular backend can use; it is static to a given key, so there are no merge issues. It won't be used to preserve mtimes, etc.)
What is the potential time-frame for this change? As I am not using git-annex for production yet, I can see myself waiting to avoid any potential hassle.
Supporting generic metadata seems like a great idea. Though if you are going down this path, wouldn't it make sense to avoid metastore for mtime etc. and support this natively, without outside dependencies?
-- RichiH
The mtime cannot be stored for all keys. Consider a SHA1 key: the mtime is irrelevant; two files with different mtimes, when added to the SHA1 backend, should get the same key.
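For illustration only (this uses plain `sha1sum` rather than git-annex itself), content-addressed keys depend only on the bytes of the file, so the mtime cannot influence them:

```shell
# Two files with identical content but different mtimes.
echo "same content" > a.txt
echo "same content" > b.txt
touch -d "2001-01-01" a.txt   # backdate one of them

# Both hashes are identical, so a SHA1-based key would be identical too;
# there is nowhere in the key to record the differing mtimes.
sha1sum a.txt b.txt
```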
Probably our spam filter doesn't like your work IP.
For what it's worth, yes, I want to actually forget I ever had the same file in the filesystem with a duplicated name. I'm not just aiming to clean up the disk's space usage; I'm also aiming to clean things up so that navigating the filesystem is easier.
I can write my own script to do that based on the symlinks' target (and I wrote something along those lines), but I still think it'd be nicer if git-annex supported this use case.
Perhaps:
could let me remove a file from git-annex if the contents are available through a different name. (Right now, "git annex drop" requires the name and contents match.)
-- Asheesh.
Ah, OK. I assumed the metadata would be attached to a key, not part of the key. This seems to make upgrades/extensions down the line harder than they need to be, but you are right that this way, merges are not, and never will be, an issue.
Though with the SHA1 backend, changes to files can be tracked, which means that tracking changes in mtime or other metadata is possible. It also means that there are potential merge issues. But I won't argue the point endlessly. I can accept design decisions :)
The prefix at work is from a university netblock, so yes, it might be on a few hundred proxy lists etc.
I agree with Christian.
One should first make better use of connections to remotes before exploring parallel possibilities, by pipelining the requests and answers. Of course, this could be implemented using Haskell's parallelism and concurrency features.
I really do want just one filename per file, at least for some cases.
For my photos, there's no benefit to having a few filenames point to the same file. As I'm putting them all into the git-annex, that is a good time to remove the pure duplicates so that I don't e.g. see them twice when browsing the directory as a gallery. Also, I am uploading my photos to the web, and I want to avoid uploading the same photo (by content) twice.
I hope that makes things clearer!
For now I'm just doing this:
(Yeah, Flickr for my photos for now. I feel sad about betraying the principle of autonomo.us-ness.)
Sounds like a good idea.
I'd expect the checksumming to be disk bound, not CPU bound, on most systems.
I suggest you start off on the WORM backend, and then you can run a job later to migrate to the SHA1 backend.
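That two-step flow might look like the following sketch (assuming a current git-annex with the `--backend` option; adjust paths and backend names to taste):

```shell
# Add files quickly with the cheap WORM backend (no checksumming up front).
git annex add --backend=WORM .
git commit -m "add files"

# Later, when there's time to let the disk churn, checksum everything
# by migrating the same files to the SHA1 backend.
git annex migrate --backend=SHA1 .
git commit -m "migrate to SHA1"
```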
Have you checked what the smudge filter sees when the input is a symlink? Git supports tracking symlinks, so it should also support pushing symlinks through a smudge filter, right? Either way: yes, contact the git devs; one can only ask and hope. And if you can demonstrate the awesomeness of git-annex, they might get more interested :)
Hashing & segmenting seems to be around the corner, which is nice :)
Is there a chance that you will optionally add mtime to your native metadata store? If yes, I'd rather wait for v2 to start with the native system from the start. If not, I will probably set it up tonight.
PS: While posting from work, my comments are held for moderation once again. I am somewhat confused as to why this happens when I can just submit directly from home. And yes, I am using the same auth provider and user in both cases.
Hm... O(N^2)? I think it just takes O(N). To read an entry out of a directory you have to download the entire directory (and store it in RAM and parse it). The constants are basically "too big to be good but not big enough to be prohibitive", I think. jctang has reported that his special remote hook performs well enough to use, but it would be nice if it were faster.
The Tahoe-LAFS folks are working on speeding up mutable files, by the way, after which we would be able to speed up directories.
Whoops! You'd only told me O(N) twice before...
So this is not too high priority. I think I would like to get the per-remote storage sorted out anyway, since it will probably be the thing needed to convert the URL backend into a special remote, which would then allow ripping out the otherwise unused pluggable backend infrastructure.
Update: Per-remote storage is now sorted out, so this could be implemented if it actually made sense to do so.
Hey Asheesh, I'm happy you're finding git-annex useful.
So, there are two forms of duplication going on here. There's duplication of the content, and duplication of the filenames pointing at that content.
Duplication of the filenames is probably not a concern, although it's what I thought you were talking about at first. It's probably info worth recording that backup-2010/some_dir/foo and backup-2009/other_dir/foo are two names you've used for the same content in the past. If you really wanted to remove backup-2009/foo, you could do it by writing a script that looks at the basenames of the symlink targets and removes files that point to the same content as other files.
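A sketch of such a script (hypothetical, not something git-annex ships): it groups annexed symlinks by the basename of their target, which encodes the key, keeps the first filename seen for each key, and prints the rest as candidates for `git rm`:

```shell
#!/bin/sh
# Print annexed files whose symlink targets point at the same key.
# The first name found for each key is kept silently; later ones are printed.
find . -type l | while read -r f; do
    # basename of the symlink target is the key
    printf '%s %s\n' "$(basename "$(readlink "$f")")" "$f"
done | sort | awk 'seen[$1]++ { print $2 }'
```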
Using SHA1 ensures that the same key is used for identical files, so generally avoids duplication of content. But if you have 2 disks with an identical file on each, and make them both into annexes, then git-annex will happily retain both copies of the content, one per disk. It generally considers keeping copies of content a good thing. :)
So, what if you want to remove the unnecessary copies? Well, there's a really simple way:
This asks git-annex to add everything to the annex, but then remove any file contents that it can safely remove. What can it safely remove? Well, anything that it can verify is on another repository such as "other-disk"! So, this will happily drop any duplicated file contents, while leaving all the rest alone.
In practice, you might not want to have all your old backup disks mounted at the same time and configured as remotes. Look into configuring trust to avoid needing to do that. If usb-0 is already a trusted disk, all you need is a simple "git annex drop" on usb-1.
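The whole flow might be sketched like this (hypothetical mount points and repository names; the subcommands are standard git-annex):

```shell
cd /media/usb-1
git annex add .        # annex everything on this disk
git commit -m "add"

# Mark the other backup disk as trusted, so its copies count
# even when it isn't currently mounted.
git annex trust usb-0

# Drop any content here that git-annex can verify exists on a
# trusted repository; unique content is left alone.
git annex drop .
```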
Unless you are forced to use a password, you should really be using a ssh key.