I'm wondering how easy it would be to add hashing to the object directories.

Currently, a tree of directories becomes a flat, two-level tree under the .git/annex/objects directory (internals). The 555 mode on each directory prevents accidental destruction of content, which is good. However, the number of files and directories in there adds up quickly, so any filesystem with a limit on the number of subdirectories per directory will hit that limit sooner than one might expect.

Suggestion is therefore to change from

.git/annex/objects/SHA1:123456789abcdef0123456789abcdef012345678/SHA1:123456789abcdef0123456789abcdef012345678

to

.git/annex/objects/SHA1:1/2/3456789abcdef0123456789abcdef012345678/SHA1:123456789abcdef0123456789abcdef012345678

or anything in between, up to a paranoid

.git/annex/objects/SHA1:123/456/789/abc/def/012/345/678/9ab/cde/f01/234/5678/SHA1:123456789abcdef0123456789abcdef012345678
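
Purely as an illustration of the segmentation idea (a sketch, not git-annex code), the path could be built by keeping the backend prefix attached to the first segment, peeling off a configurable number of fixed-width segments, leaving the remainder as the object directory, and keeping the full key as the filename inside it:

    import Data.List (intercalate)

    -- Hypothetical sketch: `levels` directory segments of `width` characters
    -- are peeled off the key; the full key stays as the final filename.
    segmentedPath :: Int -> Int -> String -> FilePath
    segmentedPath levels width key = intercalate "/" (dirs ++ [key])
      where
        (prefix, rest) = case break (== ':') key of
                           (p, ':' : r) -> (p ++ ":", r)
                           _            -> ("", key)
        dirs = case peel levels rest of
                 []       -> [key]
                 (s : ss) -> (prefix ++ s) : ss
        peel 0 r = [r | not (null r)]
        peel l r
          | length r <= width = [r]
          | otherwise         = take width r : peel (l - 1) (drop width r)

    -- e.g. segmentedPath 2 1 on the SHA1 key above yields the first
    -- suggested layout: SHA1:1/2/3456789.../SHA1:123456789...

(The exact grouping of the tail in the deeper layouts differs slightly from the paranoid example, but the idea is the same.)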

Also, the use of a colon specifically breaks FAT32 (fat support). Must it be a colon, or could an extra directory be used instead? i.e. .git/annex/objects/SHA1/*/...

git annex init could also create all but the last level of directories on initialization. I'm thinking SHA1/1/1, SHA1/1/2, ..., SHA256/f/f, ..., URL/f/f, ..., WORM/f/f

This is done now with a 2-level hash. It also hashes the .git-annex/ log files, which were really the worse problem. It scales to hundreds of millions of files, with each directory holding 1024 or fewer entries. Example:

me -> .git/annex/objects/71/9t/WORM-s3-m1300247299--me/WORM-s3-m1300247299--me

--Joey

My experience is that modern filesystems are not going to have many issues with tens to hundreds of thousands of items in a directory. However, if a transition does happen for FAT support, I will consider adding hashing, although getting a well-balanced hash in general, without, say, checksumming the filename and taking part of the checksum, is difficult.

I prefer to keep all the metadata in the filename, as this eases recovery if the files end up in lost+found. So while "SHA/" is a nice workaround for the FAT colon problem, I'll be doing something else. (What I'm not sure yet.)

There is no point in creating unused hash directories on initialization. If anything, on a bad filesystem that just guarantees the worst performance from the beginning.

Comment by joey Mon Mar 14 12:12:49 2011

Can't you just use an underscore instead of a colon?

Would it be feasible to split directories dynamically? I.e. start with SHA1_123456789abcdef0123456789abcdef012345678/SHA1_123456789abcdef0123456789abcdef012345678 and, at a certain cut-off point, switch to shorter directory names? This could even be done per subdirectory, based purely on a locally configured number. Different annexes on different filesystems or with different file subsets might even have different thresholds. This would ensure scale while not forcing you to segment from the start. Also, while segmenting with longer directory names means a flatter tree, segments longer than four characters might not make too much sense. Segmenting too often could lead to some directories becoming too populated, bringing us back to dynamic segmentation.
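
Purely as an illustration (hypothetical, not a concrete implementation proposal), such a per-directory cut-off could look roughly like this:

    import System.Directory (getDirectoryContents)

    -- Hypothetical sketch: once a directory holds more entries than a
    -- locally configured threshold, new keys get an extra two-character
    -- segment split off the front of their name.
    keyPath :: Int -> FilePath -> String -> IO FilePath
    keyPath threshold objectsDir key = do
      entries <- getDirectoryContents objectsDir
      return $ if length entries > threshold
        then objectsDir ++ "/" ++ take 2 key ++ "/" ++ key ++ "/" ++ key
        else objectsDir ++ "/" ++ key ++ "/" ++ key

Look-ups would then have to try both layouts, which adds complexity.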

All of the above would make merging annexes by hand a lot harder, but I don't know if that is a valid use case. And if all else fails, one could merge everything with the unsegmented directory names and start again from there.

-- RichiH

Comment by Richard Tue Mar 15 09:52:16 2011

It is unfortunately not possible to do system-dependent hashing, so long as git-annex stores symlinks to the content in git.

It might be possible to start without hashing, and add hashing for new files after a cutoff point. It would add complexity.

I'm currently looking at a 2-character hash directory segment, based on an md5sum of the key, which splits it into 1024 buckets. git uses just 256 buckets for its object directory, but then its objects tend to get packed away. I sorta hope that one level is enough, but I guess I could go to 2 levels (objects/ab/cd/key), which would provide 1048576 buckets. That is probably plenty; if you are storing more than a million files, you are probably on a modern enough system to have a filesystem that doesn't need hashing.
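
To illustrate the bucket math only (this is not the actual git-annex code; the real alphabet and the way the md5sum is sliced differ), two characters drawn from a 32-symbol alphabet give 32 × 32 = 1024 buckets per level, and two such levels give 1024 × 1024 = 1048576:

    import Data.Bits (shiftR, (.&.))
    import Data.Word (Word32)

    -- Illustrative 32-symbol alphabet; not the one git-annex uses.
    alphabet :: String
    alphabet = ['0'..'9'] ++ take 22 ['a'..'z']

    -- Map 20 bits of a hash of the key (a stand-in Word32 here, rather
    -- than a real md5sum) onto two 2-character directory levels.
    hashDirs :: Word32 -> FilePath
    hashDirs h = [c 0, c 1] ++ "/" ++ [c 2, c 3] ++ "/"
      where c i = alphabet !! fromIntegral ((h `shiftR` (5 * i)) .&. 31)

    -- hashDirs 0x12345 == "5q/82/"

Any reasonably uniform hash of the key keeps the buckets balanced, which is the point of checksumming rather than bucketing on part of the key itself.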

Comment by joey Tue Mar 15 23:13:39 2011

The .git-annex/ directory is what really needs hashing.

Consider that when git looks for changes in there, it has to scan every file in the directory. With hashing, it should be able to use directory mtimes to more quickly identify just the subdirectories that contain changed files.

And the real kicker is that when committing there, git has to create a tree object containing every single file, even if only one file changed. That is a lot of extra work; with hashed subdirectories it will instead create just 2 or 3 small tree objects leading down to the changed file. (Probably both layouts pack down to similar-sized pack files, not sure.)
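
As a back-of-the-envelope comparison (made-up numbers, just to show the shape of the saving), with a single 1024-way hash level:

    -- Made-up numbers: tree entries git has to rewrite when committing a
    -- change to one log file.
    flatEntries, hashedEntries :: Int
    flatEntries   = 100000                      -- one tree listing every log file
    hashedEntries = 1024 + 100000 `div` 1024    -- top-level tree + one small bucket tree

So instead of one huge tree object with 100000 entries, git rewrites a 1024-entry tree plus a roughly 100-entry one.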

Comment by joey Wed Mar 16 00:06:19 2011

If you can't segment the names retroactively, it's better to start with segmenting, imo.

As subdirectories are cheap, going with ab/cd/rest or even ab/cd/ef/rest by default wouldn't hurt.

Your point about git not needing to create as many tree objects is a kicker indeed. If I were you, I would default to segmentation.

Comment by Richard Wed Mar 16 11:47:17 2011