This is a place to discuss using git-annex. If you need help, advice, or anything, post about it here.
version 3 upgrade
Posted Sat Sep 17 09:24:24 2011
advantages of SHA* over WORM
Posted Sat Sep 17 09:24:24 2011
migration to git-annex and rsync
Posted Sat Sep 17 09:10:11 2011
wishlist: do round robin downloading of data
Posted Sat Sep 17 09:10:11 2011
migrate existing git repository to git-annex
Posted Sat Sep 17 09:10:11 2011
Error while adding a file "createSymbolicLink: already exists"
Posted Sat Sep 17 09:10:11 2011
Wishlist: Is it possible to "unlock" files without copying the file data?
Posted Sat Sep 17 09:10:11 2011
wishlist: define remotes that must have all files
Posted Sat Sep 17 09:10:11 2011
seems to build fine on haskell platform 2011
Posted Sat Sep 17 09:10:11 2011
Can I store normal files in the git-annex git repository?
Posted Sat Sep 17 09:10:11 2011
working without git-annex commits
Posted Sat Sep 17 09:10:11 2011
wishlist: traffic accounting for git-annex
Posted Sat Sep 17 09:10:11 2011
tips: special_remotes/hook with tahoe-lafs
Posted Sat Sep 17 09:10:11 2011
wishlist: push to cia.vc from the website's repo, not your personal one
Posted Sat Sep 17 09:10:11 2011
wishlist: git-annex replicate
Posted Sat Sep 17 09:10:11 2011
OSX's haskell-platform statically links things
Posted Sat Sep 17 09:10:11 2011
example of massively disconnected operation
Posted Sat Sep 17 09:10:11 2011
wishlist: git annex status
Posted Sat Sep 17 09:10:11 2011
Is an automagic upgrade of the object directory safe?
Posted Sat Sep 17 09:10:11 2011
new microfeatures
Posted Sat Sep 17 09:10:11 2011
Wishlist: Ways of selecting files based on meta-information
Posted Sat Sep 17 09:10:11 2011
Need new build instructions for Debian stable
Posted Sat Sep 17 09:10:11 2011
unannex alternatives
Posted Sat Sep 17 09:10:11 2011
hashing objects directories
Posted Sat Sep 17 09:10:11 2011
git-annex communication channels
Posted Sat Sep 17 09:10:11 2011
Behaviour of fsck
Posted Sat Sep 17 09:10:11 2011
performance improvement: git on ssd, annex on spindle disk
Posted Sat Sep 17 09:10:11 2011
wishlist: git backend for git-annex
Posted Sat Sep 17 09:10:11 2011
Will git annex work on a FAT32 formatted key?
Posted Sat Sep 17 09:10:11 2011
wishlist: support for more ssh urls
Posted Sat Sep 17 09:10:11 2011
wishlist: command options changes
Posted Sat Sep 17 09:10:11 2011
getting git annex to do a force copy to a remote
Posted Sat Sep 17 09:10:11 2011
incompatible versions?
Posted Sat Sep 17 09:10:11 2011
brainstorming: git annex push & pull
Posted Sat Sep 17 09:10:11 2011
can git-annex replace ddm?
Posted Sat Sep 17 09:10:11 2011
wishlist: git annex put -- same as get, but for defaults
Posted Sat Sep 17 09:10:11 2011
git-annex on OSX
Posted Sat Sep 17 09:10:11 2011
batch check on remote when using copy
Posted Sat Sep 17 09:10:11 2011
sparse git checkouts with annex
Posted Sat Sep 17 09:10:11 2011
relying on git for numcopies
Posted Sat Sep 17 09:10:11 2011
wishlist: alias system
Posted Sat Sep 17 09:10:11 2011
OSX's default sshd behaviour has limited paths set
Posted Sat Sep 17 09:10:11 2011
rsync over ssh?
Posted Sat Sep 17 09:10:11 2011
wishlist: special remote for sftp or rsync
Posted Sat Sep 17 09:10:11 2011
"git annex lock" very slow for big repo
Posted Sat Sep 17 09:10:11 2011
Problems with large numbers of files
Posted Sat Sep 17 09:10:11 2011
Running `git checkout` by hand is fine, of course.

The underlying problem is that git has some O(N) scalability of operations on the index with regard to the number of files in the repo. So a repo with a whole lot of files will have a big index, and any operation that changes the index, like the `git reset` this needs to do, has to read in the entire index and write out a new, modified version. It seems that git could be much smarter about its index data structures here, but I confess I don't understand the index's data structures at all. I hope someone takes it on, as git's scalability to the number of files in the repo is becoming a new pain point, now that scalability to large files is "solved". ;)

Still, it is possible to speed this up at git-annex's level. Rather than doing a `git reset` followed by a `git checkout`, it can just `git checkout HEAD -- file`, and since that's one command, it can be fed into the queueing machinery in git-annex (which exists mostly to work around this git malfeasance), and so only a single git command will need to be run to lock multiple files.

I've just implemented the above. In my music repo, this changed a lock of a CD's worth of files from taking ctrl-c long to 1.75 seconds. Enjoy!

(Hey, this even speeds up the one-file case greatly, since `git reset -- file` is slooooow -- it seems to scan the entire repository tree. Yipes.)
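To illustrate the change at the git level (the file names here are made up), locking went from two index-rewriting commands per file to one command that can cover a whole queued batch:

    # before: two index rewrites per locked file
    git reset -- track01.flac
    git checkout -- track01.flac

    # after: one queued command handles many files at once
    git checkout HEAD -- track01.flac track02.flac track03.flac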
@joey
OK, I'll try increasing the stack size and see if that helps.
For reference, I was running:
git annex add .
on a directory containing about 100k files spread over many nested subdirectories. I actually have more than a dozen projects like this that I plan to keep in git annex, possibly in separate repositories if necessary. I could probably tar the data and then archive that, but I like the idea of being able to see the structure of my data even though the contents of the files are on a different machine.
After the crash, running:
git annex unannex
does nothing and returns instantly. What exactly is 'git annex add' doing? I know that it's moving files into the key-value store and adding symlinks, but I don't know what else it does.
--Justin
Right, I have thought about untrusting all but a few remotes to achieve something similar before and I'm sure it would kind of work. It would be more of an ugly workaround, however, because I would have to untrust remotes that are, in reality, at least semi-trusted. That's why an extra option/attribute for that kind of purpose/remote would be nice.
Obviously I didn't see the scalability problem though. Good Point. Maybe I can achieve the same thing by writing a log parsing script for myself?
Can't you just use an underscore instead of a colon?
Would it be feasible to split directories dynamically? I.e. start with SHA1_123456789abcdef0123456789abcdef012345678/SHA1_123456789abcdef0123456789abcdef012345678 and, at a certain cut-off point, switch to shorter directory names? This could even be done per subdirectory and based purely on a locally-configured number. Different annexes on different file systems or with different file subsets might even have different thresholds. This would ensure scale while not forcing you to segment from the start. Also, while segmenting with longer directory names means a flatter tree, segments longer than four characters might not make too much sense. Segmenting too often could lead to some directories becoming too populated, bringing us back to the dynamic segmentation.
All of the above would make merging annexes by hand a lot harder, but I don't know if this is a valid use case. And if all else fails, one could merge everything with the unsegmented directory names and start again from there.
-- RichiH
+1 for a generic user-configurable backend that a user can put shell commands in, with a disclaimer such that if a user hangs themselves with misconfiguration then it's their own fault :P
I would love to be able to quickly plug an irods/sector set of put/get/delete/stat (get info) commands into git-annex to access my private clouds, which aren't S3 compatible.
This was already asked here, but I have a use case where I need to unlock with the files being hardlinked instead of copied (my fs does not support CoW), even though 'git annex lock' is now much faster ;-). The idea is that 1) I want the external world to see my repo "as if" it wasn't annexed (because of its own limitations in dealing with soft links), and 2) I know what I'm doing, and am sure that files will only be read, never written to.
My case is: the repo contains a snapshot A1 of a certain remote directory. Later I want to rsync this dir into a new snapshot A2. Of course, I want to transfer only new or changed files, using rsync's --copy-dest=A1 (or --compare-dest) option. Unfortunately, rsync won't recognize the soft links from git-annex, and will re-transfer everything.
Maybe I'm overusing git-annex ;-) but still, I find it is a legitimate use case, and even though there are workarounds (I don't even remember what I had to do), it would be much more straightforward to have 'git annex unlock --readonly' (or '--readonly-unsafe'?), ... or have rsync take soft-links into account, but I did not see the author ask for microfeatures ideas :) (it was discussed, and only some convoluted workarounds were proposed). Thanks.
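For the record, the rsync invocation meant above is something like the following (the paths are hypothetical); it only helps if the A1 snapshot contains real files or hardlinks rather than annex symlinks, which is exactly why a hardlinking unlock is wanted:

    # transfer only what changed since snapshot A1, using A1 as the local basis
    rsync -a --copy-dest=/annex/snapshots/A1 server:/data/ /annex/snapshots/A2/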
@justin, I discovered that "git annex describe" did what I wanted
@joey, yep that is the behaviour of "tahoe ls", thanks for the tip on removing the file from the remote.
It seems to be working okay for now; the only concern is that on the remote everything is dumped into the same directory, but I can live with that, since I want to track biggish blobs and not lots of little files.
Hey Jimmy: how's this working for you now? I would expect it to go slower and slower since Tahoe-LAFS has an O(N) algorithm for reading or updating directories.
Of course, if it is still fast enough for your uses then that's okay. :-)
(We're working on optimizations of this for future releases of Tahoe-LAFS.)
I'd like to understand the desired behavior of store-hook and retrieve-hook better, in order to see if there is a more efficient way to use Tahoe-LAFS for this.
Off to look for docs.
Regards,
Zooko
If `tahoe ls` outputs only the key, on its own line, and exits nonzero if it's not present, then I think you did the right thing.

To remove a file, use `git annex move file --from tahoe` and then you can drop it locally.

My last comment is a bit confused. The "git fetch" command allows getting all the information from a remote, and it is then possible to merge while being offline (without access to the remote). I would like a "git annex fetch remote" command to be able to get all annexed files from the remote, so that if I later merge with the remote, all annexed files are already here. And "git annex fetch" could (optionally) call "git fetch" before getting the files.

It seems also that in my last post, I should have written "git annex get --from=remote" instead of "git annex copy --from=remote", because "annex copy --from" copies all files, even if the local repo already has them (is this the case? if yes, when is it useful?)
This raises the question: what is the default remote? It's probably not the same repository that git's master branch is tracking (i.e., origin/master). It seems there would have to be an annex.defaultremote setting.
BTW, mr can easily be configured on a per-repo basis so that "mr push" copies to somewhere:
push = git push; git annex push wherever
Going one step further, a --min-copy could put all files so that numcopies is satisfied. --all could push to all available ones.
To take everything another step further, if it was possible to group remotes, one could act on the groups. "all" would be an obvious choice for a group that always exists, everything else would be set up by the user.
Git-annex's commit hook does not prevent unannex being used. The file you unannex will not be checked into git anymore and will be a regular file again, not a git-annex symlink.
For example, here's a transcript:
Yes, there is value in layering something over git-annex to use a policy to choose what goes where.
I use mr to update and manage all my repositories, and since mr can be made to run arbitrary commands when doing, e.g., an update, I use its config file as such a policy layer. For example, my podcasts are pulled into my sound repository in a subdirectory; boxes that consume podcasts run `git pull; git annex get podcasts --exclude="/out/"; git annex drop podcasts/*/out`. I move podcasts to "out" directories once done with them (I have yet to teach mpd to do that for me..), and the next time I run "mr update" to update everything, it pulls down new ones and removes old ones.
I don't see any obstacle to doing what you want. May be that you'd need better querying facilities in git-annex (so the policy layer can know what is available where), or finer control (--exclude is a good enough hammer for me, but maybe not for you).
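As a rough sketch, the mr setup described above corresponds to an .mrconfig stanza along these lines (the repository path and clone URL are made up):

    [sound]
    checkout = git clone ssh://server/~/sound sound
    update = git pull; git annex get podcasts --exclude="/out/"; git annex drop podcasts/*/out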
That's awesome, I had not heard of git sparse checkouts before.
It does not make sense to tie the log files to the directory of the corresponding files, as then the logs would have to move when the files are moved, which would be a PITA and likely make merging log file changes very complex. Also, of course, multiple files in different locations can point at the same content, which has the same log file. And, to cap it off, git-annex can need to access the log file for a given key without having the slightest idea what file in the repository might point to it, and it would be very expensive to scan the whole repository to find out what that file is in order to lookup the filename of the log file.
The most likely change in git-annex that will make this better is in this todo item -- but it's unknown how to do it yet.
I've committed the queue flush improvements, so it will buffer up to 10240 git actions, and then flush the queue.

There may be other memory leaks at scale (besides the two I mentioned earlier), but this seems promising. I'm well into running `git annex add` on a half million files and it's using 18 mb ram and has flushed the queue several times. This run will fail due to running out of inodes for the log files, not due to memory. :)

While having remotes redistribute introduces some obvious security concerns, I might use it.
As remotes support a cost factor already, you can basically implement bandwidth through that.
`--from=...` or `--all`? (Thus, among other things, one could determine if a remote has a complete checkout.)

It is unfortunately not possible to do system-dependent hashing, so long as git-annex stores symlinks to the content in git.
It might be possible to start without hashing, and add hashing for new files after a cutoff point. It would add complexity.
I'm currently looking at a 2 character hash directory segment, based on an md5sum of the key, which splits it into 1024 buckets. git uses just 256 buckets for its object directory, but then its objects tend to get packed away. I sorta hope that one level is enough, but guess I could go to 2 levels (objects/ab/cd/key), which would provide 1048576 buckets, probably plenty, as if you are storing more than a million files, you are probably using a modern enough system to have a filesystem that doesn't need hashing.
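To make the layouts under discussion concrete (the key is abbreviated, and the bucket names are illustrative only):

    .git/annex/objects/ab/SHA1-s1234--0fe3...      # one 2-character level, ~1024 buckets
    .git/annex/objects/ab/cd/SHA1-s1234--0fe3...   # two levels (objects/ab/cd/key)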
No matter what you end up doing, I would appreciate a git-annex-announce@ list.
I really like the persistence of ikiwiki, but it's not ideal for quick communication. I would be fine with IRC and/or ML. The advantage of a ML over ikiwiki is that it doesn't seem to be as "wasteful" to mix normal chat with actual problem-solving. But maybe that's merely my own perception.
Speaking of RSS: I thought I had added a wishlist item to ikiwiki about providing per-subsite RSS feeds. For example there is no (obvious) way to subscribe to changes in http://git-annex.branchable.com/forum/git-annex_communication_channels/ .
FWIW, I resorted to tagging my local clone of git-annex to keep track of what I've read, already.
-- RichiH
Seems to have a scalability problem, what happens when such a repository becomes full?
Another way to accomplish I think the same thing is to pick the repositories that you would include in such a set, and make all other repositories untrusted. And set numcopies as desired. Then git-annex will never remove files from the set of non-untrusted repositories, and fsck will warn if a file is present on only an untrusted repository.
Another nice thing would be a summary of what is wrong. I.e.
And the same/similar for all other failure modes.
-- RichiH
Thanks for the update, Joey. I think you forgot to change libghc-missingh-dev to libghc6-missingh-dev for the copy & paste instructions though.
Also, after having checked that I have everything installed I'm still getting this error:
For future reference, git can recover from a corrupted index file with `rm .git/index; git reset --mixed`.

Of course, you lose any staged changes that were in the old index file, and may need to re-stage some files.
The `ANNEX_HASH_*` oversight is fixed. (It also affected removal, btw.)

The .git-annex/ directory is what really needs hashing.
Consider that when git looks for changes in there, it has to scan every file in the directory. With hashing, it should be able to more quickly identify just the subdirectories that contained changed files, by the directory mtimes.
And the real kicker is that when committing there, git has to create a tree object containing every single file, even if only 1 file changed. That will be a lot of extra work; with hashed subdirs it will instead create just 2 or 3 small tree objects leading down to the changed file. (Probably these trees both pack down to similar size pack files, not sure.)
These are good examples; I think you've convinced me at least for upgrades going forward after v2. I'm not sure we have enough users and outdated git-annex installations to worry about it for v1.
(Hoping such upgrades are rare anyway.. Part of the point of changes made in v2 was to allow lots of changes to be made later w/o needing a v3.)
Update: Upgrades from v1 to v2 will no longer be handled automatically now.
What a good idea!
150 lines of haskell later, I have this:
Cool, that seems to make things work as expected, here's an updated recipe
It just needs some of the output redirected to /dev/null.
(I updated this comment to fix a bug. --Joey)
Thanks Joey,

is it possible to run some git annex command that tells me, for a specific directory, which files are available in another remote? (and which remote, and which filenames?) I guess I could run that, do my own policy thingie, and run `git annex get` for the files I want.

For your podcast use case (and some of my use cases), don't you think git [annex] might actually be overkill? For example, in your podcasts use case, what value does git annex give over a simple rsync/rm script? Such a script wouldn't even need a data store to keep its state, unlike git. It seems simpler and cleaner to me.
for the mpd thing, check http://alip.github.com/mpdcron/ (bad project name, it's a plugin based "event handler") you should be able to write a simple plugin for mpdcron that does what you want (or even interface with mpd yourself from perl/python/.. to use its idle mode to get events)
Dieter
Before dropping unused items, sometimes I want to check the content of the files manually. But currently, from e.g. a SHA1 key, I don't know how to find the corresponding file, except with `find .git/annex/objects -type f -name 'SHA1-s1678--70....'`, which is too slow (I'm in the case where "git log --stat -S'KEY'" won't work, either because it is too slow or the file was never committed). By the way, is it documented somewhere how to determine the 2 (nested) sub-directories in which a given (by name) object is located?

So I would like 'git annex unused' to be able to give me the list of paths to the unused items. Also, I would really appreciate a command like 'git annex unused --log NUMBER [NUMBER2...]' which would run the suggested "git log --stat -S'KEY'" command for me, where NUMBER is from the 'git annex unused' output. Thanks.
This message comes from ghc's runtime memory manager. Apparently your ghc defaults to limiting the stack to 80 mb. Mine seems to limit it slightly higher -- I have seen haskell programs successfully grow as large as 350 mb, although generally not intentionally. :)
Here's how to adjust the limit at runtime, obviously you'd want a larger number:
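The example itself appears to have been lost here; as a guess at its shape, GHC's runtime takes a -K option for the maximum stack size, so it would look something like this (the size is arbitrary, and depending on how the binary was built it may need to have been compiled with RTS options enabled):

    # raise the GHC stack limit for a single run
    git annex add . +RTS -K200m -RTS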
I've tried to avoid git-annex using quantities of memory that scale with the number of files in the repo, and I think in general successfully -- I run it on 32 mb and 128 mb machines, FWIW. There are some tricky cases, and haskell makes it easy to accidentally write code that uses much more memory than would be expected.
One well-known case is `git annex unused`, which has to build a structure of every annexed file. I have been considering using a bloom filter or something to avoid that.

Another possible case is when running a command like `git annex add` and passing it a lot of files/directories. Some code tries to preserve the order of your input after passing it through `git ls-files` (which destroys ordering), and to do so it needs to buffer both the input and the result in ram.

It's possible to build git-annex with memory profiling and generate some quite helpful profiling data. Edit the Makefile and add `-prof -auto-all -caf-all -fforce-recomp` to GHCFLAGS; then when running git-annex, add the parameters `+RTS -p -RTS`, and look for the git-annex.prof file.

Nice! So if I understand correctly, 'git reset -- file' was there to discard staged (but not committed) changes made to 'file' before checking out, so that it is equivalent to directly running 'git checkout HEAD -- file'? I'm curious about the "queueing machinery in git-annex": does it end up calling the one git command with multiple files as arguments? Does it correspond to the message "(Recording state in git...)"? Thanks!
Not-so-subtle sarcasm taken and acknowledged :)
Arguably, git-annex should know about any local limits and not have them implemented via mr from the outside. I guess my concern boils down to having git-annex do the right thing all by itself with minimal user interaction. And while I really do appreciate the flexibility of chaining commands, I am a firm believer in exposing the common use cases as easily as possible.
And yes, I am fully aware that not all annexes are created equal. Case in point: I would never use git annex pull on my laptop, but I would git annex push extensively.
I've just tried to use the ANNEX_HASH_ variables; here's an example of my configuration:

It seems to work quite well for me now. I did run across this when I tried to drop a file locally, leaving the file on my remote:

I do know that the files exist in my library, as I have just inserted them, and it seemed to work when I didn't have the hashing. It appears that checkpresent doesn't pass the ANNEX_HASH_* variables (from the limited debugging I did).
Whups, the comment above got stuck in moderation queue for 27 days. I will try to check that more frequently.
In the meantime, I've implemented "git annex whereis" -- enjoy!
I find keeping my podcasts in the annex useful because it allows me to download individual episodes or podcasts easily when low bandwidth is available (i.e., dialup), or over sneakernet. And it generally keeps everything organised.
After some thought, perhaps the default fsck output should be at least machine readable and copy-and-pasteable, i.e.

so I can then copy the list of borked files and just paste it into a for loop in my shell to recover the files. It's just an idea.
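The sort of shell loop meant here might look like this (the file list and the remote name are hypothetical):

    # re-fetch every file that fsck flagged, one path per line in fsck-failed.txt
    while read -r f; do
        git annex get "$f" --from otherremote
    done < fsck-failed.txt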
Probably more like 150 lines of haskell. Maybe just 50 lines if the bup repository is required to be on the same computer as the git-annex repository.
Since I do have some repositories where I'd appreciate this level of assurance that data not be lost, it's mostly a matter of me finding a free day.
Hmm, so it seems there is almost a way to do this already.
I think the one thing that isn't currently possible is to have 'plain' ssh remotes.. basically something just like the directory remote, but able to take a ssh user@host/path url. something like sshfs could be used to fake this, but for things like fsck you would want to do the sha1 calculations on the remote host.
i'll comment on each of the points separately, well aware that even a single little leftover issue can show that my plan is faulty:
all together, it seems to be a bit more complicated than i imagined, although not completely impossible. a combination of hidden files and maybe a simpler reduction of the number of requests might achieve the important goals as well, though.
Besides the cost values, annex.diskreserve was recently added. (But is not available for special remotes.)
I have held off on adding high-level management stuff like this to git-annex, as it's hard to make it generic enough to cover use cases.
A low-level way to accomplish this would be to have a way for `git annex get` and/or `copy` to skip files when `numcopies` is already satisfied. Then cron jobs could be used.

I'll give it a try as soon as I get rid of this:
    fatal: index file smaller than expected
    fatal: index file smaller than expected
    % git status
    fatal: index file smaller than expected
    %
And no, I am not sure where that is coming from all of a sudden... (it might have to do with a hard lockup of the whole system due to a faulty hdd I tested, but I didn't do anything to it for ages before that lock-up. So meh. Also, this is prolly off topic in here)
Richard
Indeed, see add a git backend, where you and I have already discussed this idea. :)
With the new support for special remotes, which will be used by S3, it would be possible to make such a git repo, using bup, be a special remote. I think it would be pretty easy to implement now. Not a priority for me though.
I think what is happening with "git annex unannex" is that "git annex add" crashes before it can "git add" the symlinks. unannex only looks at files that "git ls-files" shows, and so files that are not added to git are not seen. So, this can be recovered from by looking at git status and manually adding the symlinks to git, and then unannex.
That also suggests that "git annex add ." has done something before crashing. That's consistent with you passing it < 2 parameters; it's not just running out of memory trying to expand and preserve order of its parameters (like it might if you ran "git annex add experiment-1/ experiment-2/")
I'm pretty sure I know where the space leak is now. git-annex builds up a queue of git commands, so that it can run git a minimum number of times. Currently, this queue is only flushed at the end. I had been meaning to work on having it flush the queue periodically to avoid it growing without bounds, and I will prioritize doing that.
(The only other thing that "git annex add" does is record location log information.)
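Spelled out, the recovery described a couple of paragraphs up is roughly the following (the path is a placeholder):

    git status                      # shows the symlinks that were created but never "git add"ed
    git add path/to/file            # stage the symlink manually
    git annex unannex path/to/file  # now unannex can see the file and undo the add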
In my case, the remotes are the same, but adding a new option could make sense.
And while I can tell mr what to do explicitly, I would prefer if it did the right thing all by itself. Having to change configs in two separate places is less than ideal.
I am not sure what you mean by `git annex push`, as that does not exist. Did you mean copy?

How remote is REMOTE? If it's a directory on the same computer, then git-annex copy --to is actually quickly checking that each file is present on the remote, and when it is, skipping copying it again.
If the remote is ssh, git-annex copy talks to the remote to see if it has the file. This makes copy --to slow, as Rich complained before. :)
So, copy --to does not trust location tracking information (unless --fast is specified), which means that it should be doing exactly what you want it to do in your situation -- transferring every file that is really not present in the destination repository already.
Neither does copy --from, by the way. It always checks if each file is present in the current repository's annex before trying to download it.
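In other words (the remote name is made up):

    git annex copy . --to usbdrive          # checks, per file, whether the remote already has it
    git annex copy . --to usbdrive --fast   # trusts location tracking and skips those checks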
The rsync or directory special remotes would work if the media player uses metadata in the files, rather than directory locations.
Beyond that there is the smudge idea, which is hoped to be supported sometime.
Heh, cool, I was thinking of throwing about 28 million files at git-annex. Let me know how it goes; I suspect you have just run into a default OS X limits problem.
You probably just need to raise some system limits (you will need to read the error messages that first appear) and then do something like

There are other system limits which you can check by doing a "ulimit -a". Once you make the above changes, you will need to reboot for them to take effect. I am unsure if the above will help, as it is an example of what I did on 10.6.6 a few months ago to fix some forking issues. From the error you got, you will probably need to increase the stack size to something bigger, or even make it unlimited if you feel lucky; the default stack size on OS X is 8192. Try making it, say, 10 times that size first and see what happens.
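The commands themselves got lost here; as a guess at their shape, the per-shell version of the stack-size change looks like this (the value simply follows the "ten times the default" suggestion above, and the hard limit may cap it):

    ulimit -a           # show the current limits; the default stack size is 8192 KB
    ulimit -s 81920     # try roughly ten times the default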
My experience is that modern filesystems are not going to have many issues with tens to hundreds of thousands of items in the directory. However, if a transition does happen for FAT support I will consider adding hashing. Although getting a good balanced hash in general without, say, checksumming the filename and taking part of the checksum, is difficult.
I prefer to keep all the metadata in the filename, as this eases recovery if the files end up in lost+found. So while "SHA/" is a nice workaround for the FAT colon problem, I'll be doing something else. (What I'm not sure yet.)
There is no point in creating unused hash directories on initialization. If anything, with a bad filesystem that just guarantees worst performance from the beginning..
I don't know how to approach this yet, but I support the idea -- it would be great if there was a tool that could punch files out of git history and put them in the annex. (Of course with typical git history rewriting caveats.)
Sounds like it might be enough to add a switch to git-annex that overrides where it considers the top of the git repository to be?
--to and --from seem to have different semantics than --source and --destination. Subtle, but still different.
That being said, I am not sure --from and --to are needed at all. Calling the local repo . and all remotes by their name, they are arguably redundant and removing them would make the syntax a lot prettier; mv and cp don't need them, either.
I am not sure changing syntax at this point is considered good style, though; personally, I wouldn't mind adapting and would actually prefer it over using --to and --from.
-v and -q would be nice.
Richard
And something else I've done is that I symlinked the video/ directory from the media annex to the normal raid annex
And it's working out great.
I really like this. Perhaps it is a good idea to store all log files in every repo, but maybe there is a possibility to pack multiple log files into one single file, where not only the time, the present bit, and the annex repository are stored, but also the file key. I don't know if this format would also be merged correctly by the union merge driver.
Maybe, otoh, part of the point of git-annex is that the data may be too large to pull down all of it.
I find mr useful as a policy layer over top of git-annex, so "mr update" can pull down appropriate quantities of data from appropriate locations.
On the plus side, the past me wanted exactly what I had in mind.
On the meh side, I really forgot about this conversation :/
When you say this todo is not a priority, does that mean there's no ETA at all and that it will most likely sleep for a long time? Or the almost usual "what the heck, I will just wizard it up in two lines of haskell"?
-- RichiH
I see the following problems with this scheme:
Disallows removal of files when disconnected. It's currently safe to force that, as long as git-annex tells you enough other repos are believed to have the file. Just as long as you only force on one machine (say your laptop). With your scheme, if you drop a file while disconnected, any other host could see that the counter is still at N, because your laptop had the file last time it was online, and can decide to drop the file, and lose the last version.
pushing a changed counter commit to other repos is tricky, because they're not bare, and the network topology to get the commit pulled into the other repo could vary.
Merging counter files issues. If the counter file doesn't automerge, two repos dropping the same file will conflict. But, if it does automerge, it breaks the counter conflict detection.
Needing to revert commits is going to be annoying. An actual git revert could probably not reliably be done. It'd need to construct a revert and commit it as a new commit. And then try to push that to remotes, and what if that push conflicts?
I do like the pre-removal dropping somewhat as an alternative to trust checking. I think that can be done with current git-annex though, just remove the files from the location log, but keep them in-annex. Dropping a file only looks at repos that the location log says have a file; so other repos can have retained a copy of a file secretly like this, and can safely remove it at any time. I'd need to look into this a bit more to be 100% sure it's safe, but have started hidden files.
I don't see any reduced round trips. It still has to contact N other repos on drop. Now, rather than checking that they have a file, it needs to push a change to them.
Sorry for all the followups, but I see now that if you unannex, then add the file to git normally, and commit, the hook does misbehave.
This seems to be a bug. git-annex's hook thinks that you have used git annex unlock (or "git annex edit") on the file and are now committing a changed version, and the right thing to do there is to add the new content to the annex and update the symlink accordingly. I'll track this bug over at unannex vs unlock hook confusion.
So, committing after unannex, and before checking the file into git in the usual way, is a workaround. But only if you do a "git commit" to commit staged changes.
Anyway, this confusing point is fixed in git now!
You should be able to fix the missing label by editing .git-annex/uuid.log and adding
My current workflow looks like this (I'm still experimenting); a rough sketch of the commands follows the list:
Create backup clone for migration
Inject git annex initialization at repository base
Start migration with tree filter
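A very rough sketch of those three steps might look like the following (the repository names, annex description, and file pattern are all made up, and this inherits the drawbacks listed below):

    git clone --no-hardlinks repo repo-migrated    # 1. work on a backup clone
    cd repo-migrated
    git annex init "migrated repo"                 # 2. git-annex initialization at the repository base
    git filter-branch --tree-filter 'git annex add *.bin' -- --all   # 3. migrate with a tree filter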
There are still some drawbacks:
Let's see..
-v is already an alias for --verbose
I don't find --source and --destination as easy to type or as clear as --from or --to.
-F is fast, so it cannot be used for --force. And I have no desire to make it easy to mistype a short option and enable --force; it can lose data.
@richard: while it would be possible to support some syntax like "git annex copy . remote", what is it supposed to do if there are local files named foo and bar, and remotes named foo and bar? Does "git annex copy foo bar" copy file foo to remote bar, or file bar from remote foo? I chose to use --from/--to to specify remotes independently of files to avoid such ambiguity, which plain old `cp` doesn't have, since it operates entirely on filesystem objects, not both filesystem objects and abstract remotes.

Seems like nothing to do here. done --Joey
@joey thanks for the update in the previous comment, I had forgotten about updating it.
@zooko it's working okay for me right now, since I'm only putting fairly big blobs of stuff onto it, and only things that I really care about. On the performance side, if it ran faster then it would be nicer :)
Remote as in "another physical machine". I assumed that
would have not trusted the contents in the current directory (or the remote that is being copied to) and then just gone off and re-downloaded/uploaded all the files, overwriting what is already there. I expected that the combination of --force and copy --to would not bother to check whether the files are there or not, and just copy regardless.
we could include the information about the current directory as well, if the command is not issued in the local git root directory. to avoid large numbers of similar lines, that could look like this:
with the percentages being replaced with "complete" if really all files are present (and not just many enough for the value to be rounded to 100%).
additional filter criteria could come from the git history:

* `git annex get --touched-in HEAD~5..` to fetch what has recently been worked on
* `git annex get --touched-by chrysn --touched-in version-1.0..HEAD` to fetch what i've been working on recently (based on regexp or substring match on the author; git experts could probably craft much more meaningful expressions)

these options could also apply to `git annex find` -- actually, looking at the normal file system tools for such tasks, that might even be sufficient (think `git annex find --numcopies-gt 3 --present-on lanserver1 --drop`, like `find -iname '*foo*' -delete`).

(i was about to open a new forum discussion for commit-based getting, but this is close enough to be usefully joined into one discussion)
If you can't segment the names retroactively, it's better to start with segmenting, imo.
As subdirectories are cheap, going with ab/cd/rest or even ab/cd/ef/rest by default wouldn't hurt.
Your point about git not needing to create as many tree objects is a kicker indeed. If I were you, I would default to segmentation.
@Jimmy mentioned anonymous git push -- that is now enabled for this wiki. Enjoy!
I may try to spend more time on #vcs-home -- or I can be summoned there from my other lurking places on irc, I guess.
And following on to my transcript, you can then add the file to git in the regular git way, and it works fine:
It should be sufficient to honor the GIT_DIR/GIT_WORK_TREE/GIT_INDEX_FILE environment variables. git filter-branch sets GIT_WORK_TREE to ., but this can be mitigated by starting the filter script with 'GIT_WORK_TREE=$(pwd $GIT_WORK_TREE)'. E.g. with GIT_DIR=/home/tyger/repo/.git and GIT_WORK_TREE=/home/tyger/repo/.git-rewrite/t, git annex should then be able to compute the correct relative path, or maybe use absolute paths in the symlinks.

Another problem I observed is that git annex add automatically commits the symlink; this behaviour doesn't work well with the tree filter. git annex commits the wrong path (.git-rewrite/t/LINK instead of LINK). Also, filter-branch doesn't expect the filter script to commit anything; new files in the temporary work tree will be committed by filter-branch on each iteration of the filter script (missing files will be removed).
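A minimal version of the mitigation suggested above, as the first lines of the filter script (this just spells out the idea; it is not the poster's exact script):

    # make the work tree path absolute before git-annex computes symlink targets
    GIT_WORK_TREE=$(cd "$GIT_WORK_TREE" && pwd)
    export GIT_WORK_TREE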
It's ok that `git pull` does not merge the git-annex branch. You can merge it with `git annex merge`, or it will be done automatically when you use other git-annex commands.

If you use `git pull` and `git push` without any options, the defaults will make git pull and push the git-annex branch automatically.

But if you're in the habit of doing `git push origin master`, that won't cause the git-annex branch to be pushed (use `git push origin git-annex` to manually push it then). Similarly, `git pull origin master` won't pull it. And also, the `remote.origin.fetch` setting in `.git/config` can be modified in ways that make `git pull` not automatically pull the git-annex branch. So those are the things to avoid after upgrade to v3, basically.

Yes, it can read id3-tags and guess titles from movie filenames, but it sometimes gets confused by the filename metadata provided by the WORM backend.
I think I have a good enough solution to this problem. It's not efficient when it comes to renames, but it handles adding and deletion just fine.
The -L flag looks at symbolic links and copies the actual data they are pointing to. Of course "source" must have all data locally for this to work.
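The command itself seems to have been dropped here; assuming the -L in question is rsync's --copy-links flag, the idea is roughly (paths are placeholders):

    # copy the annexed content the symlinks point to, not the symlinks themselves
    rsync -avL source/ destination/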