I'm trying to use git-annex to archive scientific data. I'm often dealing with large numbers of files, sometimes 10k or more. When I try to git-annex add these files I get this error:

Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize' to increase it.

This is with the latest version of git-annex and a current version of git on OS X 10.6.7. After this error occurs, I am unable to un-annex the files and am forced to recover from a backup.

Heh, cool, I was thinking of throwing about 28 million files at git-annex. Let me know how it goes; I suspect you have just run into a default-limits problem on OS X.

You probably just need to raise some system limits (read the error messages that appear first), then do something like:

# these sysctl -w settings only apply to the running system; to
# persist them across reboots, also set them in /etc/sysctl.conf
sudo sysctl -w kern.maxproc=2048
sudo sysctl -w kern.maxprocperuid=1024

# tell launchd about the higher limits (note: "sudo echo ... >> file"
# would fail, because the redirection runs as the unprivileged user)
echo "limit maxfiles 1024 unlimited" | sudo tee -a /etc/launchd.conf
echo "limit maxproc 1024 2048" | sudo tee -a /etc/launchd.conf

There are other system limits, which you can check with "ulimit -a". Once you make the above changes, you will need to reboot for them to take effect. I am unsure whether the above will help; it is what I did on 10.6.6 a few months ago to fix some forking issues. From the error you got, you will probably need to increase the stack size, or even make it unlimited if you feel lucky. The default stack size on OS X is 8192 KB; try making it, say, 10 times that size first and see what happens.
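For the shell-level limits, a quick sketch of checking and raising the stack size (the 81920 KB figure is just the "10 times the default" suggestion above; the change only lasts for the current shell session and its children):

```shell
# Show all current per-process limits for this shell:
ulimit -a
# Show just the stack size, in kilobytes:
ulimit -s
# Try raising the soft stack limit to ten times the common 8192 KB
# default; this fails if it exceeds the hard limit.
ulimit -s 81920 || echo "could not raise stack limit past the hard limit"
ulimit -s
```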

Comment by Jimmy Tue Apr 5 03:27:46 2011

This message comes from ghc's runtime memory manager. Apparently your ghc defaults to limiting the stack to 8 mb (8388608 bytes). Mine seems to allow somewhat more -- I have seen haskell programs successfully grow as large as 350 mb, although generally not intentionally. :)

Here's how to adjust the limit at runtime, obviously you'd want a larger number:

# git-annex +RTS -K100 -RTS find
Stack space overflow: current size 100 bytes.
Use `+RTS -Ksize -RTS' to increase it.

I've tried to keep git-annex from using quantities of memory that scale with the number of files in the repo, and I think I've generally succeeded -- I run it on 32 mb and 128 mb machines, FWIW. There are some tricky cases, though, and haskell makes it easy to accidentally write code that uses much more memory than you'd expect.

One well-known case is "git annex unused", which has to build a data structure covering every annexed file. I have been considering using a bloom filter or something similar to avoid that.

Another possible case is when running a command like git annex add, and passing it a lot of files/directories. Some code tries to preserve the order of your input after passing it through git ls-files (which destroys ordering), and to do so it needs to buffer both the input and the result in ram.

It's possible to build git-annex with memory profiling and generate some quite helpful profiling data. Edit the Makefile and add -prof -auto-all -caf-all -fforce-recomp to GHCFLAGS; then, when running git-annex, add the parameters +RTS -p -RTS and look for the git-annex.prof file.
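The Makefile edit can be scripted; a sketch, demonstrated on a stand-in Makefile rather than git-annex's real one (the GHCFLAGS=-O2 line here is made up for illustration):

```shell
set -e
cd "$(mktemp -d)"
# Stand-in Makefile with a GHCFLAGS line to edit:
printf 'GHCFLAGS=-O2\n' > Makefile.example
# Append the profiling flags to the existing GHCFLAGS line:
sed -i.bak 's/^GHCFLAGS=.*/& -prof -auto-all -caf-all -fforce-recomp/' Makefile.example
cat Makefile.example
# After rebuilding, run with profiling enabled:
#   git-annex +RTS -p -RTS add .
# and look for git-annex.prof in the current directory.
```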

Comment by joey Tue Apr 5 13:46:03 2011
Oh, you'll also need profiling builds of various haskell libraries in order to build with profiling support. If that's not easily accomplished, showing me the form of the command you're running, and how git annex unannex fails, would be helpful for investigating.
Comment by joey Tue Apr 5 14:02:05 2011

@joey

OK, I'll try increasing the stack size and see if that helps.

For reference, I was running:

git annex add .

on a directory containing about 100k files spread over many nested subdirectories. I actually have more than a dozen projects like this that I plan to keep in git annex, possibly in separate repositories if necessary. I could probably tar the data and then archive that, but I like the idea of being able to see the structure of my data even though the contents of the files are on a different machine.

After the crash, running:

git annex unannex

does nothing and returns instantly. What exactly is 'git annex add' doing? I know that it's moving files into the key-value store and adding symlinks, but I don't know what else it does.

--Justin

Comment by Justin Tue Apr 5 17:14:12 2011

I think what is happening with "git annex unannex" is that "git annex add" crashed before it could "git add" the symlinks. unannex only looks at files that "git ls-files" shows, so files that were never added to git are not seen. You can recover by looking at git status, manually adding the symlinks to git, and then running unannex.
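Why unannex sees nothing can be shown with plain git: "git ls-files" only reports tracked files, so symlinks the crash left unstaged are invisible until staged. A minimal illustration in a throwaway repo (the symlink target is a made-up stand-in for a real annex key):

```shell
set -e
cd "$(mktemp -d)"
git init -q .
# Stand-in for a symlink that "git annex add" created but never staged:
ln -s .git/annex/objects/SOMEKEY file.dat
git ls-files          # prints nothing: the symlink is untracked
git add file.dat      # stage it, as you would after the crash
git ls-files          # now prints file.dat
```

Once the leftover symlinks are staged this way, "git annex unannex" should be able to see and undo them.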

That also suggests that "git annex add ." did do something before crashing. It's consistent with you passing it fewer than 2 parameters; it's not just running out of memory trying to expand its parameters and preserve their order (as it might if you ran "git annex add experiment-1/ experiment-2/").

I'm pretty sure I know where the space leak is now. git-annex builds up a queue of git commands, so that it can run git a minimum number of times. Currently, this queue is only flushed at the end. I had been meaning to work on having it flush the queue periodically to avoid it growing without bounds, and I will prioritize doing that.
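The batching idea has a rough shell-level analogue: rather than one git invocation per file, xargs groups many paths into a bounded number of invocations (the batch size of 2 and the file names here are arbitrary, for illustration only):

```shell
set -e
cd "$(mktemp -d)"
git init -q .
touch file1 file2 file3 file4 file5
# Run "git add" at most once per batch of 2 paths, instead of 5 times:
printf '%s\0' file1 file2 file3 file4 file5 | xargs -0 -n 2 git add
git ls-files
```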

(The only other thing that "git annex add" does is record location log information.)

Comment by joey Thu Apr 7 12:41:00 2011

I've committed the queue flush improvements, so it will buffer up to 10240 git actions, and then flush the queue.

There may be other memory leaks at scale (besides the two I mentioned earlier), but this seems promising. I'm well into running git annex add on a half million files and it's using 18 mb ram and has flushed the queue several times. This run will fail due to running out of inodes for the log files, not due to memory. :)

Comment by joey Thu Apr 7 14:09:13 2011
http://xfs.org/index.php/XFS_FAQ#Q:Performance:mkfs.xfs_-n_size.3D64k_option
Comment by Richard Fri Apr 8 17:55:36 2011