Marching Cubes
The initial marching cubes implementation was rather slow. Its performance did not scale linearly with the size of the input dataset as expected. There was a particular bottleneck when moving up to a dataset of 128x128x128, when the process appears never to terminate (at least on my 1.3GHz Mac G4 laptop, and on a 2GHz Linux machine). On my 2.5GHz Mac G5 with 2Gb memory, I was able to successfully compute up to two and a half times that size, but it took a very long time.
Performance measurements
| dataset | size | bytes | x | time (s) | x |
| marschnerlobb | 41x41x41 | 68,921 | 1.06 | ||
| silicium | 98x34x34 | 113,288 | 1.64 | 1.71 | 1.605 |
| neghip | 64x64x64 | 262,144 | 2.31 | 4.88 | 2.857 |
| hydrogenAtom | 128x128x128 | 2,097,152 | 8 | 129.2 | 26.48 |
| lobster | 301x324x56 | 5,461,344 | 2.60 | 859.6 | 6.654 |
| engine | 256x256x128 | 8,388,608 | 1.54 | - | - |
| statueLeg | 341x341x93 | 10,814,133 | 1.29 | - | - |
| BostonTeapot | 256x256x178 | 11,665,408 | 1.08 | - | - |
| skull | 256x256x256 | 16,777,216 | 1.44 | - | - |
The 'x' column is the ratio of size/time between this dataset and the previous one in the table. It can be clearly seen that the ratio of times is supra-linear with respect to the ratio of dataset sizes.
Speculation on the cause of the problem
Heap profiling shows that there is a spike of memory usage early in the computation, which collapses to a constant size that clearly represents the array structure plus its contents. It therefore seems that the basic I/O library operation that reads the data off disc and into memory potentially uses far more space than necessary. For larger datasets, this could be what is killing us.
Lazy Marching Cubes
To overcome the apparent bottleneck of simply reading the dataset into memory, we thought it would be good to try to exploit the characteristics of the algorithm. The key observation is that, in iterating over the regular grid, we do not need the entire dataset in memory at once. At any one moment, we need only a single point and 7 of its neighbours, making up a unit cube of sample values.
If we compare this with the storage format on disc, a linear sequence of samples, then the cube we need is entirely contained within a "window" of the file, corresponding to exactly one plane (plus one cell) of the volume. So, the ideal lazy solution is to slide this window over the file, constructing one unit cube on each iteration, and dropping the previous unit cube.
This is rather easy to arrange. First, read the file as a lazy sequence of bytes, then use the following code to turn the byte stream into a stream of unit cubes:
type Cube a = (a,a,a,a,a,a,a,a)
mkStream :: Int -> Int -> Int -> [a] -> [Cube a]
mkStream isz jsz ksz origin =
let line = isz
plane = isz*jsz
planeline = plane + line
in zip8 origin (drop 1 origin)
(drop (line+1) origin) (drop line origin)
(drop plane origin) (drop (plane+1) origin)
(drop (planeline+1) origin) (drop planeline origin)
Given this abstraction of a cube, we can now do some other nice lazy things. Apart from needing the sample values in each cube, we are also going to need the same cube thresholded, i.e. a Boolean at each vertex of the cube rather than a full sample value. This boolean cube is essentially the index into the marching cubes case table.
We can arrange that the thresholding function is only ever applied to each sample value once, rather than recomputing it up to 8 times, once for every cube it belongs to. The way to do it is by lazily mapping the threshold comparison over the input sample stream, before building the cube representation.
cases :: Ord a => [a] -> [CaseTableResult] cases s = (map (mcCaseClassic!) . mkStream isz jsz ksz . map (>threshold)) s
Now, we just need to ensure that the stream of CaseTableResults is consumed at exactly the same rate as the stream of sample cubes themselves, so that they share the same lazy "window" over the file. The 'zip' family of functions is the appropriate expression of this design pattern.
traverse :: [a] -> [Vertex]
traverse samples = zipWith3 march indexes (mkStream samples) (cases samples)
where indexes = [ (i,j,k) | k <- [1 .. ksz]
, j <- [1 .. jsz]
, i <- [1 .. isz] ]
march (x,y,z) cube edges = map (interpolate (x,y,z) cube) edges
This simplified presentation of the algorithm is only slightly different from the full code, which is stored in the repository at source:LazyMarchingCubes/
Performance measurements
The lazy marching cubes algorithm is now both much faster in the simple cases, and has the expected linear complexity as the dataset gets larger.
| dataset | size | bytes | x | time (s) | x |
| marschnerlobb | 41x41x41 | 68,921 | 0.562 | ||
| silicium | 98x34x34 | 113,288 | 1.64 | 0.897 | 1.59 |
| neghip | 64x64x64 | 262,144 | 2.31 | 2.077 | 2.32 |
| hydrogenAtom | 128x128x128 | 2,097,152 | 8 | 17.103 | 8.23 |
| lobster | 301x324x56 | 5,461,344 | 2.60 | 44.094 | 2.58 |
| engine | 256x256x128 | 8,388,608 | 1.54 | 76.645 | 1.74 |
| statueLeg | 341x341x93 | 10,814,133 | 1.29 | 87.758 | 1.14 |
| BostonTeapot | 256x256x178 | 11,665,408 | 1.08 | 95.393 | 1.09 |
| skull | 256x256x256 | 16,777,216 | 1.44 | 141.085 | 1.48 |
Caveats
There is still a lot of computation being repeated, even in this version of the code. For instance, a vertex of the computed surface will be interpolated along an edge of the grid, once for every time it is encountered. In the worst case it might contribute to three triangles in each cube, where there are four cubes adjacent to that edge, making a total of 12 times. In the best case, the vertex is interpolated once per cube, but must still be repeated for all four incident cubes, so 4 times.
The challenge is to try to share this work, so each interpolation is performed only once. This may require circular programming, where the result of the marching cubes algorithm is fed back as one of its inputs. In this way, already-computed vertexes are recognised and re-used. However, the details of how this might be programmed have not been worked out yet.
