This is a place to discuss using git-annex. If you need help, advice, or anything, post about it here.

version 3 upgrade
Posted Sat Sep 17 09:24:24 2011

advantages of SHA* over WORM
Posted Sat Sep 17 09:24:24 2011

migration to git-annex and rsync
Posted Sat Sep 17 09:10:11 2011

wishlist: do round robin downloading of data
Posted Sat Sep 17 09:10:11 2011

migrate existing git repository to git-annex
Posted Sat Sep 17 09:10:11 2011

Error while adding a file "createSymbolicLink: already exists"
Posted Sat Sep 17 09:10:11 2011

Wishlist: Is it possible to "unlock" files without copying the file data?
Posted Sat Sep 17 09:10:11 2011

wishlist: define remotes that must have all files
Posted Sat Sep 17 09:10:11 2011

seems to build fine on haskell platform 2011
Posted Sat Sep 17 09:10:11 2011

Can I store normal files in the git-annex git repository?
Posted Sat Sep 17 09:10:11 2011

working without git-annex commits
Posted Sat Sep 17 09:10:11 2011

wishlist: traffic accounting for git-annex
Posted Sat Sep 17 09:10:11 2011

tips: special_remotes/hook with tahoe-lafs
Posted Sat Sep 17 09:10:11 2011

wishlist: push to cia.vc from the website's repo, not your personal one
Posted Sat Sep 17 09:10:11 2011

wishlist: git-annex replicate
Posted Sat Sep 17 09:10:11 2011

OSX's haskell-platform statically links things
Posted Sat Sep 17 09:10:11 2011

example of massively disconnected operation
Posted Sat Sep 17 09:10:11 2011

wishlist: git annex status
Posted Sat Sep 17 09:10:11 2011

Is an automagic upgrade of the object directory safe?
Posted Sat Sep 17 09:10:11 2011

new microfeatures
Posted Sat Sep 17 09:10:11 2011

Wishlist: Ways of selecting files based on meta-information
Posted Sat Sep 17 09:10:11 2011

Need new build instructions for Debian stable
Posted Sat Sep 17 09:10:11 2011

unannex alternatives
Posted Sat Sep 17 09:10:11 2011

hashing objects directories
Posted Sat Sep 17 09:10:11 2011

git-annex communication channels
Posted Sat Sep 17 09:10:11 2011

Behaviour of fsck
Posted Sat Sep 17 09:10:11 2011

performance improvement: git on ssd, annex on spindle disk
Posted Sat Sep 17 09:10:11 2011

wishlist: git backend for git-annex
Posted Sat Sep 17 09:10:11 2011

Will git annex work on a FAT32 formatted key?
Posted Sat Sep 17 09:10:11 2011

wishlist: support for more ssh urls
Posted Sat Sep 17 09:10:11 2011

wishlist: command options changes
Posted Sat Sep 17 09:10:11 2011

getting git annex to do a force copy to a remote
Posted Sat Sep 17 09:10:11 2011

incompatible versions?
Posted Sat Sep 17 09:10:11 2011

brainstorming: git annex push & pull
Posted Sat Sep 17 09:10:11 2011

can git-annex replace ddm?
Posted Sat Sep 17 09:10:11 2011

wishlist: git annex put -- same as get, but for defaults
Posted Sat Sep 17 09:10:11 2011

git-annex on OSX
Posted Sat Sep 17 09:10:11 2011

batch check on remote when using copy
Posted Sat Sep 17 09:10:11 2011

sparse git checkouts with annex
Posted Sat Sep 17 09:10:11 2011

relying on git for numcopies
Posted Sat Sep 17 09:10:11 2011

wishlist: alias system
Posted Sat Sep 17 09:10:11 2011

OSX's default sshd behaviour has limited paths set
Posted Sat Sep 17 09:10:11 2011

rsync over ssh?
Posted Sat Sep 17 09:10:11 2011

wishlist: special remote for sftp or rsync
Posted Sat Sep 17 09:10:11 2011

"git annex lock" very slow for big repo
Posted Sat Sep 17 09:10:11 2011

Problems with large numbers of files
Posted Sat Sep 17 09:10:11 2011

Running git checkout by hand is fine, of course.

Underlying problem is that git has some O(N) scalability of operations on the index with regards to the number of files in the repo. So a repo with a whole lot of files will have a big index, and any operation that changes the index, like the git reset this needs to do, has to read in the entire index, and write out a new, modified version. It seems that git could be much smarter about its index data structures here, but I confess I don't understand the index's data structures at all. I hope someone takes it on, as git's scalability to number of files in the repo is becoming a new pain point, now that scalability to large files is "solved". ;)

Still, it is possible to speed this up at git-annex's level. Rather than doing a git reset followed by a git checkout, it can just git checkout HEAD -- file, and since that's one command, it can be fed into the queueing machinery in git-annex (which exists mostly to work around this git malfeasance), so only a single git command needs to be run to lock multiple files.
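
For illustration, a rough before/after sketch (file names are placeholders):

    # before: two index-rewriting git commands per file
    git reset -- somefile
    git checkout -- somefile

    # after: one command, so many files can be batched into a single git run
    git checkout HEAD -- somefile otherfile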

I've just implemented the above. In my music repo, this changed a lock of a CD's worth of files from taking ctrl-c long to 1.75 seconds. Enjoy!

(Hey, this even speeds up the one file case greatly, since git reset -- file is slooooow -- it seems to scan the entire repository tree. Yipes.)

Comment by joey Tue May 31 14:51:13 2011

@joey

OK, I'll try increasing the stack size and see if that helps.

For reference, I was running:

git annex add .

on a directory containing about 100k files spread over many nested subdirectories. I actually have more than a dozen projects like this that I plan to keep in git annex, possibly in separate repositories if necessary. I could probably tar the data and then archive that, but I like the idea of being able to see the structure of my data even though the contents of the files are on a different machine.

After the crash, running:

git annex unannex

does nothing and returns instantly. What exactly is 'git annex add' doing? I know that it's moving files into the key-value store and adding symlinks, but I don't know what else it does.

--Justin


Comment by Justin Tue Apr 5 17:14:12 2011

Right, I have thought before about untrusting all but a few remotes to achieve something similar, and I'm sure it would kind of work. It would be more of an ugly workaround, however, because I would have to untrust remotes that are, in reality, at least semi-trusted. That's why an extra option/attribute for that kind of purpose/remote would be nice.

Obviously I didn't see the scalability problem though. Good point. Maybe I can achieve the same thing by writing a log parsing script for myself?

Comment by gernot Sun Apr 24 07:20:05 2011

Can't you just use an underscore instead of a colon?

Would it be feasible to split directories dynamically? I.e. start with SHA1_123456789abcdef0123456789abcdef012345678/SHA1_123456789abcdef0123456789abcdef012345678 and, at a certain cut-off point, switch to shorter directory names? This could even be done per subdirectory and based purely on a locally-configured number. Different annexes on different file systems or with different file subsets might even have different thresholds. This would ensure scale while not forcing you to segment from the start. Also, while segmenting with longer directory names means a flatter tree, segments longer than four characters might not make too much sense. Segmenting too often could lead to some directories becoming too populated, bringing us back to the dynamic segmentation.

All of the above would make merging annexes by hand a lot harder, but I don't know if this is a valid use case. And if all else fails, one could merge everything with the unsegmented directory names and start again from there.

-- RichiH

Comment by Richard Tue Mar 15 09:52:16 2011
I tend to agree that the default output of fsck is not quite right. I often use git annex fsck -q. A progress spinner display is a good idea.
Comment by joey Thu Mar 24 13:45:08 2011

+1 for a generic user-configurable backend that a user can put shell commands in, with a disclaimer such that if users hang themselves with misconfiguration, it's their own fault :P

I would love to be able to quickly plug in an irods/sector set of put/get/delete/stat (get info) commands into git-annex to access my private clouds, which aren't S3-compatible.

Comment by Jimmy Thu Apr 28 03:47:38 2011

This was already asked here, but I have a use case where I need to unlock with the files being hardlinked instead of copied (my fs does not support CoW), even though 'git annex lock' is now much faster ;-) . The idea is that 1) I want the external world to see my repo "as if" it weren't annexed (because of its own limitations in dealing with soft links), and 2) I know what I'm doing, and am sure that the files will only be read, not written to.

My case is: the repo contains a snapshot A1 of a certain remote directory. Later I want to rsync this dir into a new snapshot A2. Of course, I want to transfer only new or changed files, with rsync's --copy-dest=A1 (or --compare-dest) option. Unfortunately, rsync won't recognize the soft links from git-annex, and will re-transfer everything.

Maybe I'm overusing git-annex ;-) but still, I find it a legitimate use case, and even though there are workarounds (I don't even remember what I had to do), it would be much more straightforward to have 'git annex unlock --readonly' (or '--readonly-unsafe'?), ... or to have rsync take soft links into account, but I did not see that author ask for microfeature ideas :) (it was discussed, and only some convoluted workarounds were proposed). Thanks.

Comment by Rafaël Thu Jun 2 07:34:42 2011
http://xfs.org/index.php/XFS_FAQ#Q:Performance:mkfs.xfs_-n_size.3D64k_option
Comment by Richard Fri Apr 8 17:55:36 2011
.1 cents: Having IRC would be really nice for seeking quick help, e.g. like I was trying to do just now; Google led me to this page.
Comment by Yaroslav Wed Apr 13 13:53:26 2011

@justin, I discovered that "git annex describe" did what I wanted

@joey, yep that is the behaviour of "tahoe ls", thanks for the tip on removing the file from the remote.

It seems to be working okay for now; the only concern is that on the remote everything is dumped into the same directory, but I can live with that, since I want to track biggish blobs and not lots of little files.

Comment by Jimmy Fri Apr 29 11:33:24 2011
the non-bare repository issue would go away if this was combined with the "alternate" approach to branching. (with the "fleshed out proposal" of branching, this would not work at all for lack of shared commits.)
Comment by chrysn Wed Feb 23 17:48:14 2011

Hey Jimmy: how's this working for you now? I would expect it to go slower and slower since Tahoe-LAFS has an O(N) algorithm for reading or updating directories.

Of course, if it is still fast enough for your uses then that's okay. :-)

(We're working on optimizations of this for future releases of Tahoe-LAFS.)

I'd like to understand the desired behavior of store-hook and retrieve-hook better, in order to see if there is a more efficient way to use Tahoe-LAFS for this.

Off to look for docs.

Regards,

Zooko

Comment by zooko Sat May 14 01:07:17 2011

If tahoe ls outputs only the key, on its own line, and exits nonzero if it's not present, then I think you did the right thing.

To remove a file, use git annex move file --from tahoe and then you can drop it locally.

Comment by joey Fri Apr 29 11:24:56 2011

My last comment is a bit confused. The "git fetch" command allows getting all the information from a remote, and it is then possible to merge while being offline (without access to the remote). I would like a "git annex fetch remote" command to get all annexed files from remote, so that if I later merge with remote, all the annexed files are already here. And "git annex fetch" could (optionally) call "git fetch" before getting the files.

It seems also that in my last post, I should have written "git annex get --from=remote" instead of "git annex copy --from=remote", because "annex copy --from" copies all files, even if the local repo already has them (is this the case? if yes, when is it useful?)

Comment by Rafaël Sun Jul 3 13:57:00 2011
I've been longing for an automated way of removing references to a remote, assuming I know the exact uuid that I want to remove. I.e., I have lost a portable HDD due to a destructive process, and I now want to delete all references to copies of data that were on that disk. Unless this feature exists, I would love to see it implemented.
Comment by Jimmy Wed Jun 1 13:36:50 2011

This begs the question: What is the default remote? It's probably not the same repository that git's master branch is tracking (ie, origin/master). It seems there would have to be an annex.defaultremote setting.

BTW, mr can easily be configured on a per-repo basis so that "mr push" copies to somewhere: push = git push; git annex push wherever

Comment by joey Mon Apr 4 14:13:46 2011

Going one step further, a --min-copy could put all files so that numcopies is satisfied. --all could push to all available ones.

To take everything another step further, if it was possible to group remotes, one could act on the groups. "all" would be an obvious choice for a group that always exists, everything else would be set up by the user.

Comment by Richard Mon Apr 4 06:28:01 2011

Git-annex's commit hook does not prevent unannex being used. The file you unannex will not be checked into git anymore and will be a regular file again, not a git-annex symlink.

For example, here's a transcript:

joey@gnu:~/tmp>mkdir demo
joey@gnu:~/tmp>cd demo
joey@gnu:~/tmp/demo>git init
Initialized empty Git repository in /home/joey/tmp/demo/.git/
joey@gnu:~/tmp/demo>git annex init demo
init demo ok
joey@gnu:~/tmp/demo>echo hi > file
joey@gnu:~/tmp/demo>git annex add file 
add file ok
(Recording state in git...)
joey@gnu:~/tmp/demo>git commit -m add
[master 64cf267] add
 2 files changed, 2 insertions(+), 0 deletions(-)
 create mode 100644 .git-annex/WORM:1296607093:3:file.log
 create mode 120000 file
joey@gnu:~/tmp/demo>git annex unannex file
unannex file ok
(Recording state in git...)
joey@gnu:~/tmp/demo>ls -l file
-rw-r--r-- 1 joey joey 3 Feb  1 20:38 file
joey@gnu:~/tmp/demo>git commit
[master 78a09cc] unannex
 2 files changed, 1 insertions(+), 2 deletions(-)
 delete mode 120000 file
joey@gnu:~/tmp/demo>ls -l file
-rw-r--r-- 1 joey joey 3 Feb  1 20:38 file
joey@gnu:~/tmp/demo>git status
# On branch master
# Untracked files:
#   (use "git add ..." to include in what will be committed)
#
#   file
nothing added to commit but untracked files present (use "git add" to track)
Comment by joey Tue Feb 1 20:39:10 2011

Yes, there is value in layering something over git-annex to use a policy to choose what goes where.

I use mr to update and manage all my repositories, and since mr can be made to run arbitrary commands when doing eg, an update, I use its config file as such a policy layer. For example, my podcasts are pulled into my sound repository in a subdirectory; boxes that consume podcasts run "git pull; git annex get podcasts --exclude="/out/"; git annex drop podcasts/*/out". I move podcasts to "out" directories once done with them (I have yet to teach mpd to do that for me..), and the next time I run "mr update" to update everything, it pulls down new ones and removes old ones.
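
As a rough sketch, such an mr policy stanza in ~/.mrconfig might look like this (the repo path and exclude pattern are illustrative, not my literal config):

    [media/sound]
    update = git pull; git annex get podcasts --exclude='*/out/*'; git annex drop podcasts/*/out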

I don't see any obstacle to doing what you want. May be that you'd need better querying facilities in git-annex (so the policy layer can know what is available where), or finer control (--exclude is a good enough hammer for me, but maybe not for you).

Comment by joey Mon Feb 14 18:08:54 2011

That's awesome, I had not heard of git sparse checkouts before.

It does not make sense to tie the log files to the directory of the corresponding files, as then the logs would have to move when the files are moved, which would be a PITA and likely make merging log file changes very complex. Also, of course, multiple files in different locations can point at the same content, which has the same log file. And, to cap it off, git-annex can need to access the log file for a given key without having the slightest idea what file in the repository might point to it, and it would be very expensive to scan the whole repository to find out what that file is in order to lookup the filename of the log file.

The most likely change in git-annex that will make this better is in this todo item -- but it's unknown how to do it yet.

Comment by joey Thu Apr 7 12:32:04 2011
My estimates were pretty close -- the new bup special remote type took 133 lines of code, and 2 hours to write. A testament to the flexibility of the special remote infrastructure. :)
Comment by joey Fri Apr 8 16:59:37 2011
I would also like a git-annex channel. Would #git-annex@OFTC be ok?
Comment by Christian Thu Apr 14 07:24:59 2011

I've committed the queue flush improvements, so it will buffer up to 10240 git actions, and then flush the queue.

There may be other memory leaks at scale (besides the two I mentioned earlier), but this seems promising. I'm well into running git annex add on a half million files and it's using 18 mb ram and has flushed the queue several times. This run will fail due to running out of inodes for the log files, not due to memory. :)

Comment by joey Thu Apr 7 14:09:13 2011
BTW, git-annex unused will have a problem when not all the symlinks are present: it will suggest dropping content belonging to the excluded symlinks.
Comment by joey Thu Apr 7 12:33:30 2011

While having remotes redistribute introduces some obvious security concerns, I might use it.

As remotes support a cost factor already, you can basically implement bandwidth through that.

Comment by Richard Fri Apr 22 14:27:00 2011
I'm not sure it is worth adding a command for such a small feature, but I would certainly use it: having something like "git annex fetch remote" do "git fetch remote && git annex copy --from=remote", and "git annex push remote" do "git push remote && git annex copy --to=remote". And maybe the same for a pull operation?
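
Until something like that exists, git aliases can approximate it; a sketch (the alias names are made up):

    git config alias.annex-fetch '!sh -c "git fetch $1 && git annex copy --from=$1" -'
    git config alias.annex-push '!sh -c "git push $1 && git annex copy --to=$1" -'
    # usage: git annex-fetch origin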
Comment by Rafaël Sun Jul 3 10:39:41 2011
@jimmy what to do when you lose a repository.. I have not seen a convincing argument that removing the location tracking data entirely serves any purpose
Comment by joey Wed Jun 1 16:24:33 2011
using the location tracking information, it should be possible to show the status of other remotes as well. what about supporting --from=... or --all? (thus, among other things, one could determine if a remote has a complete checkout.)
Comment by chrysn Wed Jun 15 04:39:24 2011
See fat support. A bare git repo will have to be used to avoid symlink problems, at least for now. The other problem is that git-annex key files have colons in their filenames.
Comment by joey Mon Mar 7 15:13:14 2011
Either option should work fine, but git gc --aggressive will probably avoid most of git's seeking.
Comment by joey Sat Apr 2 13:48:29 2011
Personally, I would not mind a requirement to keep a local bup repo. I wouldn't want to trust my data to unnecessarily complex setups, anyway. -- RichiH
Comment by Richard Mon Mar 28 16:45:35 2011

It is unfortunately not possible to do system-dependent hashing, so long as git-annex stores symlinks to the content in git.

It might be possible to start without hashing, and add hashing for new files after a cutoff point. It would add complexity.

I'm currently looking at a 2 character hash directory segment, based on an md5sum of the key, which splits it into 1024 buckets. git uses just 256 buckets for its object directory, but then its objects tend to get packed away. I sorta hope that one level is enough, but guess I could go to 2 levels (objects/ab/cd/key), which would provide 1048576 buckets, probably plenty, as if you are storing more than a million files, you are probably using a modern enough system to have a filesystem that doesn't need hashing.

Comment by joey Tue Mar 15 23:13:39 2011

No matter what you end up doing, I would appreciate a git-annex-announce@ list.

I really like the persistence of ikiwiki, but it's not ideal for quick communication. I would be fine with IRC and/or ML. The advantage of a ML over ikiwiki is that it doesn't seem to be as "wasteful" to mix normal chat with actual problem-solving. But maybe that's merely my own perception.

Speaking of RSS: I thought I had added a wishlist item to ikiwiki about providing per-subsite RSS feeds. For example there is no (obvious) way to subscribe to changes in http://git-annex.branchable.com/forum/git-annex_communication_channels/ .

FWIW, I resorted to tagging my local clone of git-annex to keep track of what I've read, already.

-- RichiH

Comment by Richard Mon Mar 28 11:48:08 2011

Seems to have a scalability problem: what happens when such a repository becomes full?

Another way to accomplish, I think, the same thing is to pick the repositories that you would include in such a set, and make all other repositories untrusted. And set numcopies as desired. Then git-annex will never remove files from the set of non-untrusted repositories, and fsck will warn if a file is present only on an untrusted repository.

Comment by joey Sat Apr 23 12:27:13 2011
I dunno about parallel downloads -- eek! -- but there is at least room for improvement in what "git annex get" does when there are multiple remotes that have a file, and the one it decides to use is not available, or very slow, or whatever.
Comment by joey Sun Apr 3 12:39:35 2011

Another nice thing would be a summary of what is wrong. I.e.

% git fsck
[...]
git-annex: 100 total failed
  50 checksum failed
  50 not enough copies

And the same/similar for all other failure modes.

-- RichiH

Comment by Richard Sun Mar 27 21:16:21 2011

Thanks for the update, Joey. I think you forgot to change libghc-missingh-dev to libghc6-missingh-dev for the copy & paste instructions though.

Also, after having checked that I have everything installed I'm still getting this error:

...
[15 of 77] Compiling Annex            ( Annex.hs, Annex.o )

Annex.hs:19:35:
    Module `Control.Monad.State' does not export `state'
make[1]: *** [git-annex] Error 1
make[1]: Leaving directory `/home/gernot/dev/git-annex'
dh_auto_build: make -j1 returned exit code 2
make: *** [binary] Error 2
Comment by gernot Tue Apr 26 14:56:44 2011
Oh, you'll need profiling builds of various haskell libraries to build with profiling support. If that's not easily accomplished, then showing me the form of the command you're running, and also how git annex unannex fails, would be helpful for investigating.
Comment by joey Tue Apr 5 14:02:05 2011

For future reference, git can recover from a corrupted index file with rm .git/index; git reset --mixed.

Of course, you lose any staged changes that were in the old index file, and may need to re-stage some files.
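
Spelled out as a block (again, staged changes are lost):

    rm .git/index      # throw away the corrupted index
    git reset --mixed  # rebuild it from HEAD; the work tree is untouched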

Comment by joey Sat Apr 2 21:48:57 2011
I've corrected the missing ANNEX_HASH_* oversight. (It also affected removal, btw.)
Comment by joey Fri Apr 29 14:01:04 2011

The .git-annex/ directory is what really needs hashing.

Consider that when git looks for changes in there, it has to scan every file in the directory. With hashing, it should be able to more quickly identify just the subdirectories that contained changed files, by the directory mtimes.

And the real kicker is that when committing there, git has to create a tree object containing every single file, even if only 1 file changed. That will be a lot of extra work; with hashed subdirs it will instead create just 2 or 3 small tree objects leading down to the changed file. (Probably these trees both pack down to similar size pack files, not sure.)

Comment by joey Wed Mar 16 00:06:19 2011
Thanks a lot. I tried various howtos around the net, but none of them worked; yours did. (I tried it in one of the copies of the broken repo which I keep around for obvious reasons).
Comment by Richard Sun Apr 3 05:03:22 2011
On second thought, maybe the current behaviour is better than what I am suggesting the force command should do. I guess it's better to be safe than sorry.
Comment by Jimmy Sun Apr 3 13:12:35 2011
+1 for this feature. I've been longing for something like this, rather than rolling my own perl/shell scripts to parse the output of "git annex whereis ." to see which files are on my machine or not.
Comment by Jimmy Fri Apr 8 03:23:08 2011
Cool! I just tried adding tahoe-lafs as a remote, and it wasn't too hard.
Comment by Jimmy Fri Apr 29 06:43:31 2011

These are good examples; I think you've convinced me at least for upgrades going forward after v2. I'm not sure we have enough users and outdated git-annex installations to worry about it for v1.

(Hoping such upgrades are rare anyway.. Part of the point of changes made in v2 was to allow lots of changes to be made later w/o needing a v3.)

Update: Upgrades from v1 to v2 are no longer handled automatically.

Comment by joey Thu Mar 17 20:38:51 2011

What a good idea!

150 lines of haskell later, I have this:

# git annex status
supported backends: WORM SHA1 SHA256 SHA512 SHA224 SHA384 SHA1E SHA256E SHA512E SHA224E SHA384E URL
supported remote types: git S3 bup directory rsync hook
local annex keys: 32
local annex size: 58 megabytes
total annex keys: 38158
total annex size: 6 terabytes (but 1632 keys have unknown size)
backend usage: 
    SHA1: 1789
    WORM: 36369
Comment by joey Mon May 16 21:15:10 2011

Cool, that seems to make things work as expected; here's an updated recipe:

git config annex.tahoe-store-hook 'tahoe mkdir tahoe:$ANNEX_HASH_1/$ANNEX_HASH_2 && tahoe put $ANNEX_FILE tahoe:$ANNEX_HASH_1/$ANNEX_HASH_2/$ANNEX_KEY'
git config annex.tahoe-retrieve-hook 'tahoe get tahoe:$ANNEX_HASH_1/$ANNEX_HASH_2/$ANNEX_KEY $ANNEX_FILE'
git config annex.tahoe-remove-hook 'tahoe rm tahoe:$ANNEX_HASH_1/$ANNEX_HASH_2/$ANNEX_KEY'
git config annex.tahoe-checkpresent-hook 'tahoe ls tahoe:$ANNEX_HASH_1/$ANNEX_HASH_2/$ANNEX_KEY 2>&1 || echo FAIL'
git annex initremote library type=hook hooktype=tahoe encryption=none
git annex describe 1d1bc312-7243-11e0-a9ce-5f10c0ce9b0a library

It just needs some of the output redirected to /dev/null.

(I updated this comment to fix a bug. --Joey)

Comment by Jimmy Fri Apr 29 16:11:08 2011

thanks Joey,

is it possible to run some git annex command that tells me, for a specific directory, which files are available in another remote? (and which remote, and which filenames?) I guess I could run that, do my own policy thingie, and run git annex get for the files I want.

For your podcast use case (and some of my use cases), don't you think git [annex] might actually be overkill? For example, in your podcasts use case, what value does git annex give over a simple rsync/rm script? Such a script wouldn't even need a data store for its state, unlike git. It seems simpler and cleaner to me.

for the mpd thing, check http://alip.github.com/mpdcron/ (bad project name; it's a plugin-based "event handler"). you should be able to write a simple mpdcron plugin that does what you want (or even interface with mpd yourself from perl/python/... to use its idle mode to get events)

Dieter

Comment by dieter Wed Feb 16 17:32:04 2011

Before dropping unused items, sometimes I want to check the content of the files manually. But currently, from e.g. a sha1 key, I don't know how to find the corresponding file, except with 'find .git/annex/objects -type f -name 'SHA1-s1678--70....', which is too slow (I'm in the case where "git log --stat -S'KEY'" won't work, either because it is too slow or because the file was never committed). By the way, is it documented somewhere how to determine the 2 (nested) sub-directories in which a given (by name) object is located?

So I would like 'git-annex unused' to be able to give me the list of paths to the unused items. Also, I would really appreciate a command like 'git annex unused --log NUMBER [NUMBER2...]' which would run the suggested "git log --stat -S'KEY'" for me, where NUMBER is from the 'git annex unused' output. Thanks.
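
In the meantime, a rough shell loop can approximate the --log idea (assuming 'git annex unused' lists NUMBER/KEY pairs, as its current output suggests):

    git annex unused | awk '$1 ~ /^[0-9]+$/ {print $2}' | while read key; do
        git log --stat -S"$key"
    done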

Comment by Rafaël Thu Jun 2 07:55:58 2011
FWIW, I wanted to suggest exactly the same thing.
Comment by Richard Fri Mar 25 07:23:04 2011

This message comes from ghc's runtime memory manager. Apparently your ghc defaults to limiting the stack to 80 mb. Mine seems to limit it slightly higher -- I have seen haskell programs successfully grow as large as 350 mb, although generally not intentionally. :)

Here's how to adjust the limit at runtime, obviously you'd want a larger number:

# git-annex +RTS -K100 -RTS find
Stack space overflow: current size 100 bytes.
Use `+RTS -Ksize -RTS' to increase it.

I've tried to avoid git-annex using quantities of memory that scale with the number of files in the repo, and I think in general successfully -- I run it on 32 mb and 128 mb machines, FWIW. There are some tricky cases, and haskell makes it easy to accidentally write code that uses much more memory than would be expected.

One well known case is git annex unused, which has to build a structure of every annexed file. I have been considering using a bloom filter or something to avoid that.

Another possible case is when running a command like git annex add, and passing it a lot of files/directories. Some code tries to preserve the order of your input after passing it through git ls-files (which destroys ordering), and to do so it needs to buffer both the input and the result in ram.

It's possible to build git-annex with memory profiling and generate some quite helpful profiling data. Edit the Makefile and add this to GHCFLAGS: -prof -auto-all -caf-all -fforce-recomp; then when running git-annex, add the parameters +RTS -p -RTS, and look for the git-annex.prof file.
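
Concretely, the cycle might look like this (a sketch; it assumes profiling builds of the required haskell libraries are installed):

    # in the Makefile: GHCFLAGS += -prof -auto-all -caf-all -fforce-recomp
    make
    git-annex +RTS -p -RTS add .
    less git-annex.prof    # time and allocation broken down by cost centre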

Comment by joey Tue Apr 5 13:46:03 2011

Nice! So if I understand correctly, 'git reset -- file' was there to discard staged (but not committed) changes made to 'file' before checking out, so that it is equivalent to directly running 'git checkout HEAD -- file'? I'm curious about the "queueing machinery in git-annex": does it end up calling the one git command with multiple files as arguments? Does it correspond to the message "(Recording state in git...)"? Thanks!

Comment by Rafaël Tue May 31 17:43:22 2011
They are not. See upgrades
Comment by joey Tue Jun 7 20:40:54 2011
Everything is done over ssh unless both repos are on the same system (or unless you NFS mount a repo)
Comment by joey Sun Mar 6 11:59:37 2011
I have updated the instructions.
Comment by joey Tue Apr 26 11:27:49 2011

Not-so-subtle sarcasm taken and acknowledged :)

Arguably, git-annex should know about any local limits and not have them implemented via mr from the outside. I guess my concern boils down to having git-annex do the right thing all by itself with minimal user interaction. And while I really do appreciate the flexibility of chaining commands, I am a firm believer in exposing the common use cases as easily as possible.

And yes, I am fully aware that not all annexes are created equal. Case in point: I would never use git annex pull on my laptop, but I would git annex push extensively.

Comment by Richard Tue Apr 5 16:52:52 2011

I've just tried to use the ANNEX_HASH_ variables; an example of my configuration:

    git config annex.tahoe-store-hook 'tahoe mkdir $ANNEX_HASH_1 && tahoe put $ANNEX_FILE tahoe:$ANNEX_HASH_1/$ANNEX_KEY'
    git config annex.tahoe-retrieve-hook 'tahoe get tahoe:$ANNEX_HASH_1/$ANNEX_KEY $ANNEX_FILE'
    git config annex.tahoe-remove-hook 'tahoe rm tahoe:$ANNEX_HASH_1/$ANNEX_KEY'
    git config annex.tahoe-checkpresent-hook 'tahoe ls tahoe:$ANNEX_HASH_1/$ANNEX_KEY 2>&1 || echo FAIL'
    git annex initremote library type=hook hooktype=tahoe encryption=none
    git annex describe 1d1bc312-7243-11e0-a9ce-5f10c0ce9b0a library

It seems to work quite well for me now. I did run across this when I tried to drop a file locally, leaving the file on my remote:

jtang@x00:/tmp/annex3 $ git annex drop .
drop frink.sh (checking library...) (unsafe) 
  Could only verify the existence of 0 out of 1 necessary copies
  Try making some of these repositories available:
    1d1bc312-7243-11e0-a9ce-5f10c0ce9b0a  -- library
  (Use --force to override this check, or adjust annex.numcopies.)
failed
drop t/frink.jar (checking library...) (unsafe) 
  Could only verify the existence of 0 out of 1 necessary copies
  Try making some of these repositories available:
    1d1bc312-7243-11e0-a9ce-5f10c0ce9b0a  -- library
  (Use --force to override this check, or adjust annex.numcopies.)
failed
git-annex: 2 failed
1|jtang@x00:/tmp/annex3 $ 

I do know that the files exist in my library, as I have just inserted them. It seemed to work when I didn't have the hashing; it appears that checkpresent doesn't pass the ANNEX_HASH_* variables (from the limited debugging I did).

Comment by Jimmy Fri Apr 29 12:17:11 2011

Whups, the comment above got stuck in the moderation queue for 27 days. I will try to check that more frequently.

In the meantime, I've implemented "git annex whereis" -- enjoy!

I find keeping my podcasts in the annex useful because it allows me to download individual episodes or podcasts easily when low bandwidth is available (ie, dialup), or over sneakernet. And it generally keeps everything organised.

Comment by joey Tue Mar 15 23:01:17 2011

After some thought, perhaps the default fsck output should be at least machine readable and copy-and-pasteable, i.e.

$ git annex fsck
Files with errors

    file1
    file2

so I can then copy the list of borked files and paste it into a for loop in my shell to recover them. It's just an idea.

Comment by Jimmy Sat Mar 26 06:57:41 2011

Probably more like 150 lines of haskell. Maybe just 50 lines if the bup repository is required to be on the same computer as the git-annex repository.

Since I do have some repositories where I'd appreciate this level of assurance that data not be lost, it's mostly a matter of me finding a free day.

Comment by joey Mon Mar 28 16:05:13 2011
Push access to the non-code bits of git-annex' ikiwiki would be very welcome indeed. Given the choice, I would rather edit everything in Vim than in a browser. -- RichiH
Comment by Richard Mon Mar 28 16:47:23 2011

Hmm, so it seems there is almost a way to do this already.

I think the one thing that isn't currently possible is to have 'plain' ssh remotes: basically something just like the directory remote, but able to take an ssh user@host/path url. Something like sshfs could be used to fake this, but for things like fsck you would want to do the sha1 calculations on the remote host.

Comment by Justin Sat Apr 23 13:54:42 2011

i'll comment on each of the points separately, well aware that even a single little leftover issue can show that my plan is faulty:

  • force removal: well, yes -- but the file that is currently force-removed on the laptop could just as well be the last of its kind itself. i see the problem, but am not sure if it's fatal (after all, if we rely on out-of-band knowledge when forcing something, we could just as well ask a little more)
  • non-bare repos: pushing is tricky with non-bare repos now just as well; a post-commit hook could auto-accept counter changes. (but pushing causes problems with counters anyway, doesn't it?)
  • merging: i'd have them auto-merge. git-annex will have to check the validity of the current state anyway, and a situation in which a counter-decrementing commit is not a fast-forward one would be reverted in the next step (or upon discovery, in case the next step never took place).
  • reverting: my wording was bad as "revert" is already taken in git-lingo. the correct term for what i was thinking of is "reset". (as the commit could not be pushed, it would be rolled back completely).
    • we might have to resort to reverting, though, if the commit has already been pushed to a first server of many.
  • hidden files: yes, this solves pre-removal dropping :-)
  • round trips: it's not the number of servers, it's the number of files (up to 30k in my case). it seems to me that an individual request was made for every single file i wanted to drop (that would be N*M round trips for N affected servers and M files, versus N round trips with git-managed numcopies)

all together, it seems to be a bit more complicated than i imagined, although not completely impossible. a combination of hidden files and maybe a simpler reduction of the number of requests might achieve the important goals as well.

Comment by chrysn Wed Feb 23 12:43:59 2011

Besides the cost values, annex.diskreserve was recently added. (But it is not available for special remotes.)

I have held off on adding high-level management stuff like this to git-annex, as it's hard to make it generic enough to cover all use cases.

A low-level way to accomplish this would be to have a way for git annex get and/or copy to skip files when numcopies is already satisfied. Then cron jobs could be used.
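
For instance, a nightly cron entry could then keep a backup topped up (sketch only; --skip-satisfied is a hypothetical flag, not an existing option):

    # m h dom mon dow  command
    0 3 * * * cd /srv/annex && git annex copy --to=backup --skip-satisfied .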

Comment by joey Sat Apr 23 12:22:07 2011

I'll give it a try as soon as I get rid of this:

% git annex fsck
fatal: index file smaller than expected
% git status
fatal: index file smaller than expected

And no, I am not sure where that is coming from all of a sudden... (it might have to do with a hard lockup of the whole system due to a faulty hdd I tested, but I hadn't done anything to it for ages before that lock-up. So meh. Also, this is prolly off-topic here)

Richard

Comment by Richard Sat Apr 2 17:34:24 2011
ps: concerning the command 'find .git/annex/objects -type f -name 'SHA1-s1678--70....' from my previous comment, it is "significantly" faster to search for the containing directory, which has the same name: 'find .git/annex/objects -maxdepth 2 -mindepth 2 -type d -name 'SHA1-s1678--70....''. I am just curious: why does each file object need to be in its own directory, itself nested under two more sub-directories?
Comment by Rafaël Thu Jun 2 15:51:49 2011

Indeed, see add a git backend, where you and I have already discussed this idea. :)

With the new support for special remotes, which will be used by S3, it would be possible to make such a git repo, using bup, be a special remote. I think it would be pretty easy to implement now. Not a priority for me though.

Comment by joey Mon Mar 28 12:01:30 2011
I got my answer on #vcs-home: Yes, git-annex and git get along fine.
Comment by peter-simons [myopenid.com] Wed Jul 13 12:21:25 2011

I think what is happening with "git annex unannex" is that "git annex add" crashes before it can "git add" the symlinks. unannex only looks at files that "git ls-files" shows, so files that were not added to git are not seen. So, this can be recovered from by looking at git status, manually adding the symlinks to git, and then unannexing.

That also suggests that "git annex add ." has done something before crashing. That's consistent with you passing it fewer than 2 parameters; it's not just running out of memory trying to expand and preserve the order of its parameters (as it might if you ran "git annex add experiment-1/ experiment-2/")

I'm pretty sure I know where the space leak is now. git-annex builds up a queue of git commands, so that it can run git a minimum number of times. Currently, this queue is only flushed at the end. I had been meaning to work on having it flush the queue periodically to avoid it growing without bounds, and I will prioritize doing that.

(The only other thing that "git annex add" does is record location log information.)

Comment by joey Thu Apr 7 12:41:00 2011

In my case, the remotes are the same, but adding a new option could make sense.

And while I can tell mr what to do explicitly, I would prefer if it did the right thing all by itself. Having to change configs in two separate places is less than ideal.

I am not sure what you mean by git annex push as that does not exist. Did you mean copy?

Comment by Richard Mon Apr 4 16:45:30 2011

How remote is REMOTE? If it's a directory on the same computer, then git-annex copy --to is actually quickly checking that each file is present on the remote, and when it is, skipping copying it again.

If the remote is ssh, git-annex copy talks to the remote to see if it has the file. This makes copy --to slow, as Rich complained before. :)

So, copy --to does not trust location tracking information (unless --fast is specified), which means that it should be doing exactly what you want it to do in your situation -- transferring every file that is really not present in the destination repository already.

Neither does copy --from, by the way. It always checks if each file is present in the current repository's annex before trying to download it.

Comment by joey Sun Apr 3 12:49:01 2011

The rsync or directory special remotes would work if the media player uses metadata in the files, rather than directory locations.

Beyond that there is the smudge idea, which is hoped to be supported sometime.

Comment by joey Thu Jul 7 11:27:28 2011

Heh, cool, I was thinking of throwing about 28 million files at git-annex. Let me know how it goes; I suspect you have just run into a default OSX limits problem.

You probably just need to raise some system limits (you will need to read the error messages that first appear), then do something like

# this is really for the run time, you can set these settings in /etc/sysctl.conf
sudo sysctl -w kern.maxproc=2048
sudo sysctl -w kern.maxprocperuid=1024

# tell launchd about having higher limits
sudo echo "limit maxfiles 1024 unlimited" >> /etc/launchd.conf
sudo echo "limit maxproc 1024 2048" >> /etc/launchd.conf

There are other system limits, which you can check with "ulimit -a". Once you make the above changes, you will need to reboot for them to take effect. I am unsure if the above will help, as it is an example of what I did on 10.6.6 a few months ago to fix some forking issues. From the error you got, you will probably need to increase the stack size to something bigger, or even make it unlimited if you feel lucky; the default stack size on OSX is 8192, so try making it, say, 10 times that size first and see what happens.

Comment by Jimmy Tue Apr 5 03:27:46 2011

My experience is that modern filesystems are not going to have many issues with tens to hundreds of thousands of items in a directory. However, if a transition does happen for FAT support, I will consider adding hashing. Although getting a well-balanced hash in general without, say, checksumming the filename and taking part of the checksum, is difficult.

I prefer to keep all the metadata in the filename, as this eases recovery if the files end up in lost+found. So while "SHA/" is a nice workaround for the FAT colon problem, I'll be doing something else. (What, I'm not sure yet.)

There is no point in creating unused hash directories on initialization. If anything, with a bad filesystem, that just guarantees the worst performance from the beginning.

Comment by joey Mon Mar 14 12:12:49 2011

I don't know how to approach this yet, but I support the idea -- it would be great if there was a tool that could punch files out of git history and put them in the annex. (Of course with typical git history rewriting caveats.)

Sounds like it might be enough to add a switch to git-annex that overrides where it considers the top of the git repository to be?

Comment by joey Fri Feb 25 01:16:48 2011

--to and --from seem to have different semantics than --source and --destination. Subtle, but still different.

That being said, I am not sure --from and --to are needed at all. Calling the local repo "." and all remotes by their names, they are arguably redundant, and removing them would make the syntax a lot prettier; mv and cp don't need them, either.

I am not sure changing syntax at this point is considered good style, though; personally, I wouldn't mind adapting and would actually prefer it over using --to and --from.

-v and -q would be nice.

Richard

Comment by Richard Sun Apr 17 19:46:37 2011
Great! This was the only thing about git-annex which could have kept me from using it. --Michael
Comment by m-f-k [myopenid.com] Sun Mar 6 12:33:19 2011
So perhaps checking if git-status (or similar) complains about missing files is a possible solution for this?
Comment by Christian Fri Apr 8 03:31:03 2011

Something else I've done is symlink the video/ directory from the media annex into the normal raid annex:

ln -s ~/media/annex/video ~/annex

And it's working out great.

~annex $ git annex whereis video/series/episode1.avi
whereis video/series/episode1.avi(1 copy)
        f210b45a-60d3-11e0-b593-3318d96f2520  -- Trantor - Media
ok

I really like this. Perhaps it is a good idea to store all log files in every repo, but maybe there is a possibility to pack multiple log files into one single file, where not only the time, the present bit, and the annex repository are stored, but also the file key. I don't know if this format would also be merged correctly by the union merge driver.

Comment by Christian Fri Apr 8 03:54:37 2011
Ask and ye shall receive, with an Abbot on top: hook
Comment by joey Thu Apr 28 17:22:03 2011

Maybe, otoh, part of the point of git-annex is that the data may be too large to pull down all of it.

I find mr useful as a policy layer over top of git-annex, so "mr update" can pull down appropriate quantities of data from appropriate locations.

Comment by joey Tue Apr 5 14:05:00 2011
Good point. scp fixes this by using a colon, but as colons aren't needed in git-annex remotes' names... -- RichiH
Comment by Richard Wed Apr 20 17:28:06 2011

On the plus side, the past me wanted exactly what I had in mind.

On the meh side, I really forgot about this conversation :/

When you say this todo is not a priority, does that mean there's no ETA at all and that it will most likely sleep for a long time? Or the almost usual "what the heck, I will just wizard it up in two lines of haskell"?

-- RichiH

Comment by Richard Mon Mar 28 13:47:38 2011

I see the following problems with this scheme:

  • Disallows removal of files when disconnected. It's currently safe to force that, as long as git-annex tells you enough other repos are believed to have the file -- just as long as you only force it on one machine (say, your laptop). With your scheme, if you drop a file while disconnected, any other host could see that the counter is still at N, because your laptop had the file last time it was online, and decide to drop the file, losing the last version.

  • pushing a changed counter commit to other repos is tricky, because they're not bare, and the network topology to get the commit pulled into the other repo could vary.

  • Merging counter files has issues. If the counter file doesn't automerge, two repos dropping the same file will conflict. But if it does automerge, it breaks the counter conflict detection.

  • Needing to revert commits is going to be annoying. An actual git revert could probably not be done reliably. It'd need to construct a revert and commit it as a new commit. And then try to push that to remotes -- and what if that push conflicts?

  • I do like the pre-removal dropping somewhat as an alternative to trust checking. I think that can be done with current git-annex though, just remove the files from the location log, but keep them in-annex. Dropping a file only looks at repos that the location log says have a file; so other repos can have retained a copy of a file secretly like this, and can safely remove it at any time. I'd need to look into this a bit more to be 100% sure it's safe, but have started hidden files.

  • I don't see any reduced round trips. It still has to contact N other repos on drop. Now, rather than checking that they have a file, it needs to push a change to them.

Comment by joey Tue Feb 22 14:44:28 2011
I think the forum/website currently is sufficient; I do at times wish there was a mailing list or anonymous git push to the wiki, as I find editing posts through the web browser sometimes tedious (the lack of !fmt or alt-q bugs me at times ;) ). The main advantage of keeping stuff on the site/forum is that everything gets saved and passed on to anyone who checks out the git repo of the code base.
Comment by Jimmy Mon Mar 28 14:35:50 2011

Sorry for all the followups, but I see now that if you unannex, then add the file to git normally, and commit, the hook does misbehave.

This seems to be a bug. git-annex's hook thinks that you have used git annex unlock (or "git annex edit") on the file and are now committing a changed version, and the right thing to do there is to add the new content to the annex and update the symlink accordingly. I'll track this bug over at unannex vs unlock hook confusion.

So, committing after unannex, and before checking the file into git in the usual way, is a workaround. But only if you do a "git commit" to commit staged changes.

Anyway, this confusing point is fixed in git now!

Comment by joey Tue Feb 1 20:46:00 2011

You should be able to fix the missing label by editing .git-annex/uuid.log and adding

1d1bc312-7243-11e0-a9ce-5f10c0ce9b0a tahoe
Comment by Justin Fri Apr 29 09:08:35 2011

My current workflow looks like this (I'm still experimenting):

Create backup clone for migration

git clone original migrate
cd migrate
for branch in $(git branch -a | grep remotes/origin | grep -v HEAD); do git checkout --track $branch; done

Inject git annex initialization at repository base

git symbolic-ref HEAD refs/heads/newroot
git rm --cached *.rpm
git clean -f -d
git annex init master
git cherry-pick $(git rev-list --reverse master | head -1)
git rebase --onto newroot newroot master
git rebase master mybranch # how to automate this for all branches?
git branch -d newroot

Start migration with tree filter

echo \*.rpm annex.backend=SHA1 > .git/info/attributes
MYWORKDIR=$(pwd) git filter-branch --tree-filter ' \
    if [ ! -d .git-annex ]; then \
        mkdir .git-annex; \
        cp ${MYWORKDIR}/.git-annex/uuid.log .git-annex/; \
        cp ${MYWORKDIR}/.gitattributes .; \
    fi
    for rpm in $(git ls-files | grep "\.rpm$"); do \
        echo; \
        git annex add $rpm; \
        annexdest=$(readlink $rpm); \
        if [ -e .git-annex/$(basename $annexdest).log ]; then \
            echo "FOUND $(basename $annexdest).log"; \
        else \
            echo "COPY $(basename $annexdest).log"; \
            cp ${MYWORKDIR}/.git-annex/$(basename $annexdest).log .git-annex/; \
        fi; \
        ln -sf ${annexdest#../../} $rpm; \
    done; \
    git reset HEAD .git-rewrite; \
    : \
    ' -- $(git branch | cut -c 3-)
rm -rf .temp
git reset --hard

There are still some drawbacks:

  • git history shows that git annex log files are modified with each checkin
  • branches have to be rebased manually before starting migration
Comment by tyger Tue Mar 1 10:07:50 2011

Let's see..

  • -v is already an alias for --verbose

  • I don't find --source and --destination as easy to type or as clear as --from or --to.

  • -F is fast, so it cannot be used for --force. And I have no desire to make it easy to mistype a short option and enable --force; it can lose data.

@richard while it would be possible to support some syntax like "git annex copy . remote", what is it supposed to do if there are local files named foo and bar, and remotes named foo and bar? Does "git annex copy foo bar" copy file foo to remote bar, or file bar from remote foo? I chose to use --from/--to to specify remotes independently of files to avoid such ambiguity, which plain old cp doesn't have, since it operates entirely on filesystem objects, not both filesystem objects and abstract remotes.

Seems like nothing to do here. done --Joey

Comment by joey Tue Apr 19 16:13:10 2011

@joey thanks for the update in the previous comment, I had forgotten about updating it.

@zooko it's working okay for me right now, since I'm only putting fairly big blobs of stuff onto it, and only things that I really care about. On the performance side, if it ran faster it would be nicer :)

Comment by Jimmy Sat May 14 06:02:26 2011

Remote as in "another physical machine". I assumed that

git annex copy --force --to REMOTE .

would not have trusted the contents in the current directory (or the remote being copied to) and would then just go off and re-download/upload all the files, overwriting what is already there. I expected that the combination of --force and copy --to would not bother to check whether the files are there or not and would just copy them regardless of the outcome.

Comment by Jimmy Sun Apr 3 12:59:47 2011

we could include the information about the current directory as well, if the command is not issued in the local git root directory. to avoid large numbers of similar lines, that could look like this:

Estimated annex size: B MiB (of C MiB; [B/C]%)
Estimated annex size in $PWD: B' MiB (of C' MiB; [B'/C']%)

with the percentages being replaced with "complete" if really all files are present (and not just enough of them for the value to be rounded to 100%).

Comment by chrysn Tue Apr 26 08:31:02 2011
@Rafaël, you're correct on all counts.
Comment by joey Tue May 31 17:54:23 2011

additional filter criteria could come from the git history:

  • git annex get --touched-in HEAD~5.. to fetch what has recently been worked on
  • git annex get --touched-by chrysn --touched-in version-1.0..HEAD to fetch what i've been working on recently (based on regexp or substring match on author; git experts could probably craft much more meaningful expressions)

these options could also apply to git annex find -- actually, looking at the normal file system tools for such tasks, that might even be sufficient (think git annex find --numcopies-gt 3 --present-on lanserver1 --drop, like find -iname '*foo*' -delete).

(i was about to open a new forum discussion for commit-based getting, but this is close enough to be usefully joint in a discussion)

Comment by chrysn Thu Jun 23 09:56:35 2011

If you can't segment the names retroactively, it's better to start with segmenting, imo.

As subdirectories are cheap, going with ab/cd/rest or even ab/cd/ef/rest by default wouldn't hurt.

Your point about git not needing to create as many tree objects is a kicker indeed. If I were you, I would default to segmentation.

Comment by Richard Wed Mar 16 11:47:17 2011

@Jimmy mentioned anonymous git push -- that is now enabled for this wiki. Enjoy!

I may try to spend more time on #vcs-home -- or I can be summoned there from my other lurking places on irc, I guess.

Comment by joey Thu May 19 15:21:51 2011

And following on to my transcript, you can then add the file to git in the regular git way, and it works fine:

joey@gnu:~/tmp/demo>git add file
joey@gnu:~/tmp/demo>git commit
[master 225ffc0] added as regular git file, not in annex
 1 files changed, 1 insertions(+), 0 deletions(-)
 create mode 100644 file
joey@gnu:~/tmp/demo>ls -l file
-rw-r--r-- 1 joey joey 3 Feb  1 20:38 file
joey@gnu:~/tmp/demo>git log file
commit 225ffc048f5af7c0466b3b1fe549a6d5e9a9e9fe
Author: Joey Hess 
Date:   Tue Feb 1 20:43:13 2011 -0400

    added as regular git file, not in annex

commit 78a09cc791b875c3b859ca9401e5b6472bf19d08
Author: Joey Hess 
Date:   Tue Feb 1 20:38:30 2011 -0400

    unannex

commit 64cf267734adae05c020d9fd4d5a7ff7c64390db
Author: Joey Hess 
Date:   Tue Feb 1 20:38:18 2011 -0400

    add
Comment by joey Tue Feb 1 20:41:24 2011
Now it's fully supported, so long as you put a bare git repo on your key.
Comment by joey Sat Mar 19 11:37:22 2011

Sounds like it might be enough to add a switch to git-annex that overrides where it considers the top of the git repository to be?

It should be sufficient to honor the GIT_DIR/GIT_WORK_TREE/GIT_INDEX_FILE environment variables. git filter-branch sets GIT_WORK_TREE to ., but this can be mitigated by starting the filter script with 'GIT_WORK_TREE=$(cd "$GIT_WORK_TREE" && pwd)'. E.g. with GIT_DIR=/home/tyger/repo/.git and GIT_WORK_TREE=/home/tyger/repo/.git-rewrite/t, git annex should then be able to compute the correct relative path, or maybe use absolute paths in symlinks.
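
So a filter script could begin with something like this sketch:

    # pin the work tree to an absolute path before git-annex runs
    GIT_WORK_TREE=$(cd "$GIT_WORK_TREE" && pwd)
    export GIT_WORK_TREE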

Another problem I observed is that git annex add automatically commits the symlink; this behaviour doesn't work well with --tree-filter. git annex commits the wrong path (.git-rewrite/t/LINK instead of LINK). Also, filter-branch doesn't expect the filter script to commit anything; new files in the temporary work tree will be committed by filter-branch on each iteration of the filter script (and missing files will be removed).

Comment by tyger Wed Mar 2 04:15:37 2011
We seem to be using #vcs-home @ OFTC for now. madduck is fine with it and joeyh pokes his head in there, as well. I just added a CIA bot to #vcs-home and this comment is a test if pushing works. -- RichiH
Comment by Richard Fri Apr 15 15:32:08 2011
Both problems fixed.
Comment by joey Tue Apr 26 19:40:33 2011
You're right -- as long as nothing changes a file without letting the modification time update, editing WORM files is safe.
Comment by joey Mon Aug 29 12:10:38 2011

It's ok that git pull does not merge the git-annex branch. You can merge it with git annex merge, or it will be done automatically when you use other git-annex commands.

If you use git pull and git push without any options, the defaults will make git pull and push the git-annex branch automatically.

But if you're in the habit of doing git push origin master, that won't cause the git-annex branch to be pushed (use git push origin git-annex to manually push it then). Similarly, git pull origin master won't pull it. And also, the remote.origin.fetch setting in .git/config can be modified in ways that make git pull not automatically pull the git-annex branch. So those are the things to avoid after upgrade to v3, basically.
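
Summed up as commands (these just restate the advice above):

    git push origin master git-annex   # push both branches when using explicit refspecs
    git pull origin master
    git annex merge                    # merge the fetched git-annex branch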

Comment by joey Tue Aug 16 21:33:08 2011

Yes, it can read id3 tags and guess titles from movie filenames, but it sometimes gets confused by the filename metadata provided by the WORM backend.

I think I have a good enough solution to this problem. It's not efficient when it comes to renames, but it handles additions and deletions just fine:

rsync -vaL --delete source dest

The -L flag dereferences symbolic links and copies the actual data they point to. Of course, "source" must have all the data locally for this to work.

Comment by Kristian Sun Jul 31 11:24:25 2011