Functions to represent a Vector on disk in efficient, if unportable, ways.
This module uses memory-mapping, a feature of all modern operating systems, to mirror the disk contents in memory. Memory-mapping files has several advantages over reading them traditionally:
- Speed: memory-mapping is often much faster than traditional reading.
- Memory efficiency: Memory-mapped files are loaded into RAM on-demand, and easily swapped out. The upside is that the program can work with data-sets larger than the available RAM, as long as they are accessed carefully.
The caveat to using memory-mapping is that it makes the files specific to the current architecture, because the data is stored in the machine's native endianness. For more information, see the description in System.IO.MMap.
If you wish to write the contents in a portable fashion, either use the ASCII load and save functions in Numeric.Container, or use the binary serialization in Data.Binary.
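The endianness caveat can be made concrete with a small experiment that uses only base. This is a minimal sketch, not part of this module: `nativeBytes` is a hypothetical helper that stores a 32-bit word in memory and reads it back one byte at a time, which is what a raw memory image on disk would contain.

```haskell
import Data.Word (Word8, Word32)
import Foreign.Marshal.Alloc (alloca)
import Foreign.Ptr (Ptr, castPtr)
import Foreign.Storable (peekByteOff, poke)

-- | The four bytes of a Word32 in the order this machine stores them.
-- (Hypothetical helper for illustration only.)
nativeBytes :: Word32 -> IO [Word8]
nativeBytes w =
  alloca $ \p -> do
    -- Store the word using the machine's native layout.
    poke p w
    -- Read the same memory back byte by byte, lowest address first.
    mapM (peekByteOff (castPtr p :: Ptr Word8)) [0 .. 3]
```

On a little-endian machine such as x86-64, `nativeBytes 0x01020304` yields `[4,3,2,1]`; on a big-endian machine it yields `[1,2,3,4]`. A file written as a raw memory image on one kind of machine is therefore misread on the other.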
- unsafeMMapVector :: forall a. Storable a => FilePath -> Maybe (Int64, Int) -> IO (Vector a)
- unsafeLazyMMapVectors :: forall a. Storable a => FilePath -> Maybe (Int64, Int64) -> Int -> IO (Int64, [Vector a])
- hPutVector :: forall a. Storable a => Handle -> Vector a -> IO ()
- writeVector :: forall a. Storable a => FilePath -> Vector a -> IO ()
Memory-mapping Vector from disk
unsafeMMapVector
:: forall a . Storable a
=> FilePath | Path of the file to map
-> Maybe (Int64, Int)
-> IO (Vector a)
Map a file into memory (read-only) as a Vector.
It is considered unsafe because changes to the underlying file may (or may not) be reflected in the Vector, which breaks referential transparency.
unsafeLazyMMapVectors
:: forall a . Storable a
=> FilePath | Path of the file to map
-> Maybe (Int64, Int64)
-> Int | The number of elements in each Vector
-> IO (Int64, [Vector a]) | Returns the number of vectors and the lazy list of vectors
Map a file into memory as a lazy list of equal-sized Vectors, even if they can't all fit in the address space at the same time.
(numVectors,vectors) <- unsafeLazyMMapVectors filename Nothing vectorSize
Commonly, a data file will contain multiple vectors of equal length (a matrix). This function is convenient for those uses, but it plays a more important role: supporting data-sets that cannot fit in the address space of the current machine.
On 32-bit machines the address space is only 4 GB, and it is quite easy to find data-sets that are too large to be represented, even in virtual memory.
This function loads the data in chunks, and as long as you drop your reference to the vectors as you consume the data, the old chunks will be unmapped before mapping the next chunk.
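The consumption pattern described above can be sketched with a strict left fold. This is only an illustration, not part of the module: `consumeChunks` and `totalOf` are hypothetical helpers, and `consumeChunks` is polymorphic in the chunk type so that plain lists can stand in for the mapped Vectors.

```haskell
import Data.List (foldl')

-- | Fold over a lazy list of chunks from left to right, forcing the
-- accumulator after each chunk. Provided the caller keeps no other
-- reference to the list, chunks that have already been folded in become
-- unreachable as the fold proceeds. (Hypothetical helper.)
consumeChunks :: (acc -> chunk -> acc) -> acc -> [chunk] -> acc
consumeChunks = foldl'

-- | Example step: add up every element of every chunk, with ordinary
-- lists standing in for Vectors.
totalOf :: [[Double]] -> Double
totalOf = consumeChunks (\acc chunk -> acc + sum chunk) 0
```

Binding the list to a name that is used again later, for example to call length on it, would keep every chunk reachable and defeat the purpose.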
The number of vectors in the list is returned because it's often needed, yet calculating it using length would demand the whole list.
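A count like this can in principle be derived from sizes alone, without touching the list. The arithmetic below is a sketch of that idea and an assumption about how such a count could be obtained, not a description of the module's internals; `vectorCount` and its parameters are hypothetical.

```haskell
import Data.Int (Int64)
import Foreign.Storable (Storable, sizeOf)

-- | Number of complete vectors that fit in a region of the given size.
-- The first argument is used only for its type, to select the element
-- size. Integer division discards any trailing partial vector.
-- (Hypothetical helper for illustration only.)
vectorCount :: Storable a => a -> Int64 -> Int -> Int64
vectorCount element regionBytes elemsPerVector =
  regionBytes `div` bytesPerVector
  where
    bytesPerVector =
      fromIntegral elemsPerVector * fromIntegral (sizeOf element)
```

For instance, an 8000-byte region of Doubles (8 bytes each) split into vectors of 10 elements holds `vectorCount (0 :: Double) 8000 10`, that is, 100 vectors.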
Writing Vector to disk
These functions write the Vector in a way suitable for reading back with unsafeMMapVector.
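A format that can be mapped straight back into memory is presumably a raw image of the elements in native layout. The sketch below illustrates that idea using only base. It is not the module's implementation: `writeRaw` and `readRaw` are hypothetical helpers that work on lists rather than Vectors, and `readRaw` copies the file into memory instead of mapping it.

```haskell
{-# LANGUAGE ScopedTypeVariables #-}

import Foreign.Marshal.Array (allocaArray, peekArray, withArrayLen)
import Foreign.Storable (Storable, sizeOf)
import System.IO
  ( IOMode (ReadMode, WriteMode), hFileSize, hGetBuf, hPutBuf, withBinaryFile )

-- | Write the elements back to back, exactly as they are laid out in
-- memory on this machine. (Hypothetical helper.)
writeRaw :: forall a. Storable a => FilePath -> [a] -> IO ()
writeRaw path xs =
  withBinaryFile path WriteMode $ \h ->
    -- Marshal the list into a temporary contiguous array, then dump it.
    withArrayLen xs $ \n ptr ->
      hPutBuf h ptr (n * sizeOf (undefined :: a))

-- | Read a file produced by 'writeRaw' on a machine with the same layout.
-- (Hypothetical helper; copies the data rather than mapping it.)
readRaw :: forall a. Storable a => FilePath -> IO [a]
readRaw path =
  withBinaryFile path ReadMode $ \h -> do
    bytes <- hFileSize h
    let elemSize = sizeOf (undefined :: a)
        n = fromIntegral bytes `div` elemSize
    allocaArray n $ \ptr -> do
      _ <- hGetBuf h ptr (n * elemSize)
      peekArray n ptr
```

Because no header or byte-order marker is written, the reader must already know the element type, and the file is only meaningful on machines with the same representation.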