<out-of-date-warning>The main problems this is supposed to solve are addressed in a different way with hidden files and the --fast option introduced in batch check on remote when using copy, so while this is not technically obsolete, the main reasons for it are gone. --chrysn</out-of-date-warning>

This is a rough sketch of a modification of git-annex to rely more on git commit semantics. It might be flawed due to my lack of understanding of git-annex internals. --chrysn

Summary

Currently, location tracking is only used for informational purposes unless a repository is trusted, in which case there is no checking at all. It is proposed to use the location tracking information as a commitment to keep track of a file until another repository takes over responsibility.

The proposal uses git's atomic commit semantics to make sure that, before files are actually deleted, another repository has accepted the deletion.

Modified git-annex-drop behavior

The most important (if not only) git-annex command that is affected by this is git annex drop. Currently, when dropping a large number of files, git-annex checks each file individually with another host (or several, if so configured) to see whether it is safe to delete.

The new behavior would be to

  • decrement the location tracking counter for all files to be dropped,
  • commit that change,
  • try to push it to enough repositories that the numcopies constraints are met,
  • revert if that fails,
  • otherwise really drop the files from the backend.
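
The steps above can be sketched as a toy model in Python. This is only an illustration under stated assumptions, not git-annex code: Repo, accept_push, drop and needed_acks are hypothetical names, a "push" is modelled as a compare-and-update of a counter table, and real git transport, merging and crash recovery are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Repo:
    """Hypothetical stand-in for a remote repository and its view of the counters."""
    name: str
    reachable: bool = True
    counters: dict = field(default_factory=dict)  # file key -> tracked copy count

    def accept_push(self, expected, new):
        """Accept only a 'fast-forward': our counters must equal what the pusher started from."""
        if not self.reachable:
            return False
        if any(self.counters.get(k) != v for k, v in expected.items()):
            return False
        self.counters.update(new)
        return True


def drop(local_counters, local_store, keys, remotes, needed_acks):
    """Decrement, 'commit', push, then either roll back or really delete."""
    before = {k: local_counters[k] for k in keys}
    after = {k: n - 1 for k, n in before.items()}
    local_counters.update(after)  # decrement and commit locally

    accepted = [r for r in remotes if r.accept_push(before, after)]
    if len(accepted) < needed_acks:
        # Not enough repositories took the change: undo it everywhere it landed.
        for r in accepted:
            r.accept_push(after, before)
        local_counters.update(before)
        return False

    for k in keys:
        local_store.discard(k)  # only now remove the content itself
    return True
```

Note that all keys travel in one push per remote, and that the rollback branch also has to undo the change on remotes that had already accepted it.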

Unlike explicit checking, this never asks the remote backend whether the file is really present. On the other hand, git-annex already relies on the files in the backend not being touched by anyone but git-annex itself, and git-annex would only drop them if they were dereferenced and committed, in which case git would not accept the push. (git by itself would accept a merged push, but even if the reverting step failed due to a power outage or similar, git-annex would, before really deleting files from the backend, check again whether the numcopies constraint is still met, and revert its own delete commit, as the files are still present anyway.)

Implications for trust

The proposed change also changes the semantics of trust. Trust can now be controlled in a finer-grained way between untrusted and semi-trusted, as best illustrated by a use case:

Alice takes her netbook with her on a trip through Spain, and will fill most of its disk up with pictures she takes. As she expects to meet some old friends during the first days, she wants to take older pictures with her, which are safely backed up at home, so they can be deleted on demand.

She tells her netbook's repository to dereference the old images (but not other parts of the repository she has not copied anywhere yet) and pushes to the server before leaving. When she adds pictures from her camera to the repository, git-annex can now free up space as needed.

Dereferencing could be implemented as git annex drop --no-rm (or move --no-rm); freeing space would work similarly to dropunused.
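
The "free up space as needed" part of the use case could look roughly like the following Python sketch. It is an assumption-laden illustration: free_space_for is a hypothetical helper, sizes are plain numbers, and deleting the largest dereferenced files first is an arbitrary choice the proposal does not specify.

```python
def free_space_for(needed, free, dereferenced):
    """Delete already-dereferenced files until `needed` units of space are free.

    `dereferenced` maps file keys to sizes for content that is still on disk
    but no longer counted in location tracking. Returns the deleted keys and
    the resulting free space; the mapping is updated in place.
    """
    if free + sum(dereferenced.values()) < needed:
        raise ValueError("not enough space even after deleting all dereferenced files")

    deleted = []
    # Largest first is an assumed policy, chosen only to keep deletions few.
    for key in sorted(dereferenced, key=dereferenced.get, reverse=True):
        if free >= needed:
            break
        free += dereferenced.pop(key)
        deleted.append(key)
    return deleted, free
```

Because the files were dereferenced and pushed beforehand, no remote needs to be contacted at deletion time in this model.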

Under the new semantics, a trusted repository would not accept dropping anything, just as before.

Advantages / Disadvantages

The advantage of this proposal is that the round trips required for dropping something could be greatly reduced.

There should also be simplifications in the git annex drop command as it doesn't need to take care of locking any more (git should already do that between checking if HEAD is a parent of the pushed commit and replacing HEAD).

Besides being a major change in git-annex (with the requirement to track hosts' git-annex versions for migration, as the new trust system is incompatible with the old one), no disadvantages of that strategy are known to the author (hoping for discussion below).

I see the following problems with this scheme:

  • Disallows removal of files when disconnected. It's currently safe to force that, as long as git-annex tells you enough other repos are believed to have the file, and as long as you only force on one machine (say your laptop). With your scheme, if you drop a file while disconnected, any other host could see that the counter is still at N, because your laptop had the file last time it was online, and can decide to drop the file, and lose the last version.

  • pushing a changed counter commit to other repos is tricky, because they're not bare, and the network topology to get the commit pulled into the other repo could vary.

  • Merging counter files issues. If the counter file doesn't automerge, two repos dropping the same file will conflict. But, if it does automerge, it breaks the counter conflict detection.

  • Needing to revert commits is going to be annoying. An actual git revert could probably not reliably be done. It would need to construct a revert and commit it as a new commit. And then try to push that to remotes, and what if that push conflicts?

  • I do like the pre-removal dropping somewhat as an alternative to trust checking. I think that can be done with current git-annex though, just remove the files from the location log, but keep them in-annex. Dropping a file only looks at repos that the location log says have a file; so other repos can have retained a copy of a file secretly like this, and can safely remove it at any time. I'd need to look into this a bit more to be 100% sure it's safe, but have started hidden files.

  • I don't see any reduced round trips. It still has to contact N other repos on drop. Now, rather than checking that they have a file, it needs to push a change to them.
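
The pre-removal dropping point in the list above (a copy that is kept in-annex but removed from the location log) can be modelled in a few lines of Python. This is a sketch of the reasoning only, not of git-annex internals: safe_to_drop, location_log and contents are hypothetical names, and the check is reduced to set operations.

```python
def safe_to_drop(key, repo, location_log, contents, numcopies=1):
    """Model of a drop check that only consults repos listed in the location log.

    location_log: key -> set of repo names believed to hold the key
    contents:     repo name -> set of keys actually stored there
    """
    candidates = location_log.get(key, set()) - {repo}
    verified = {r for r in candidates if key in contents.get(r, set())}
    return len(verified) >= numcopies


# The netbook keeps a copy but is not listed in the log ("hidden" copy).
location_log = {"photo": {"server", "usb"}}
contents = {"server": {"photo"}, "usb": {"photo"}, "netbook": {"photo"}}

before = safe_to_drop("photo", "server", location_log, contents)
contents["netbook"].discard("photo")  # netbook deletes its hidden copy
after = safe_to_drop("photo", "server", location_log, contents)
```

In this model the hidden copy is never counted, so deleting it cannot change the outcome of anyone else's drop check.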

Comment by joey Tue Feb 22 14:44:28 2011

i'll comment on each of the points separately, well aware that even a single little leftover issue can show that my plan is faulty:

  • force removal: well, yes -- but the file that is currently force-removed on the laptop could just as well be the last of its kind itself. i see the problem, but am not sure if it's fatal (after all, if we rely on out-of-band knowledge when forcing something, we could just as well ask a little more)
  • non-bare repos: pushing is tricky with non-bare repos now just as well; a post-commit hook could auto-accept counter changes. (but pushing causes problems with counters anyway, doesn't it?)
  • merging: i'd have them auto-merge. git-annex will have to check the validity of the current state anyway, and a situation in which a counter-decrementing commit is not a fast-forward one would be reverted in the next step (or upon discovery, in case the next step never took place).
  • reverting: my wording was bad as "revert" is already taken in git-lingo. the correct term for what i was thinking of is "reset". (as the commit could not be pushed, it would be rolled back completely).
    • we might have to resort to reverting, though, if the commit has already been pushed to a first server of many.
  • hidden files: yes, this solves pre-removal dropping :-)
  • round trips: it's not the number of servers, it's the number of files (up to 30k in my case). it seems to me that an individual request was made for every single file i wanted to drop (that would be N*M roundtrips for N affected servers and M files, and N roundtrips with git managed numcopies)
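
To make the round-trip arithmetic in the last point concrete, here is a small Python sketch. The function names are hypothetical, the server count of 2 is an assumed example, and the 30k file count is the figure mentioned above.

```python
def round_trips_per_file_check(servers, files):
    """One request per file per server: N * M."""
    return servers * files


def round_trips_per_commit_push(servers, files):
    """One push per server, however many files the commit touches: N."""
    return servers


per_file = round_trips_per_file_check(servers=2, files=30_000)
per_commit = round_trips_per_commit_push(servers=2, files=30_000)
```

With these assumed numbers, per-file checking needs 60000 round trips and the commit-based scheme needs 2.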

altogether, it seems to be a bit more complicated than i imagined, although not completely impossible. a combination of hidden files and maybe a simpler reduction of the number of requests might achieve the important goals as well, though.

Comment by chrysn Wed Feb 23 12:43:59 2011

the non-bare repository issue would go away if this was combined with the "alternate" approach to branching. (with the "fleshed out proposal" of branching, this would not work at all for lack of shared commits.)

Comment by chrysn Wed Feb 23 17:48:14 2011