Copyright	[2009..2014] Trevor L. McDonell
License	BSD
Safe Haskell	None
Language	Haskell98

Foreign.CUDA.Driver.Marshal

Contents

Host Allocation
Device Allocation
Unified Memory Allocation
Marshalling
Combined Allocation and Marshalling
Utility

Description

Memory management for the low-level driver interface.

Host Allocation

data AllocFlag

Options for host allocation

Constructors

Portable
DeviceMapped
WriteCombined

Instances

Bounded AllocFlag
Enum AllocFlag
Eq AllocFlag
Show AllocFlag

mallocHostArray :: Storable a => [AllocFlag] -> Int -> IO (HostPtr a)

Allocate a section of linear memory on the host which is page-locked and directly accessible from the device. The storage is sufficient to hold the given number of elements of a storable type.

Note that since this reduces the amount of pageable memory, overall system performance may suffer. It is best used sparingly, to allocate staging areas for data exchange.

Host memory allocated in this way is automatically and immediately accessible to all contexts on all devices which support unified addressing.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gdd8311286d2c2691605362c689bc64e0

mallocHostForeignPtr :: Storable a => [AllocFlag] -> Int -> IO (ForeignPtr a)

As mallocHostArray, but return a ForeignPtr instead. The array will be deallocated automatically once the last reference to the ForeignPtr is dropped.

freeHost :: HostPtr a -> IO ()

Free a section of page-locked host memory.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g62e0fdbe181dab6b1c90fa1a51c7b92c
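As an illustration of pairing mallocHostArray with freeHost, the sketch below wraps the two in bracket so that the pinned buffer is released even if the action throws. The helper name withPinnedArray is hypothetical (it is not part of this module), and the code assumes a CUDA context is already current on the calling thread.

```haskell
import Control.Exception (bracket)
import Foreign.Storable (Storable)
import Foreign.CUDA.Ptr (HostPtr)
import Foreign.CUDA.Driver.Marshal (mallocHostArray, freeHost)

-- Hypothetical helper: run an action with a temporary page-locked buffer
-- of n elements, and free the buffer afterwards (normally or via an
-- exception). No allocation flags are requested here.
withPinnedArray :: Storable a => Int -> (HostPtr a -> IO b) -> IO b
withPinnedArray n = bracket (mallocHostArray [] n) freeHost
```

Keeping the allocation scoped in this way is one means of following the advice above to use page-locked memory sparingly.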

registerArray :: Storable a => [AllocFlag] -> Int -> Ptr a -> IO (HostPtr a)

Page-locks the specified array (on the host) and maps it for the device(s) as specified by the given allocation flags. Subsequently, the memory is accessed directly by the device, so it can be read and written with much higher bandwidth than pageable memory that has not been registered. The memory range is added to the same tracking mechanism as mallocHostArray to automatically accelerate calls to functions such as pokeArray.

Note that page-locking excessive amounts of memory may degrade system performance, since it reduces the amount of pageable memory available. This is best used sparingly to allocate staging areas for data exchange.

This function has limited support on Mac OS X. OS 10.7 or later is required.

Requires CUDA-4.0.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gf0a9fe11544326dabd743b7aa6b54223

unregisterArray :: HostPtr a -> IO (Ptr a)

Unmaps the memory from the given pointer, and makes it pageable again.

Requires CUDA-4.0.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g63f450c8125359be87b7623b1c0b2a14
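A minimal sketch of how registerArray and unregisterArray might be paired, so that an existing host array is only page-locked for the duration of an action. The helper name withRegistered is hypothetical, the flag list is left empty as an assumption, and a current CUDA context is assumed.

```haskell
import Control.Exception (bracket)
import Foreign.Ptr (Ptr)
import Foreign.Storable (Storable)
import Foreign.CUDA.Ptr (HostPtr)
import Foreign.CUDA.Driver.Marshal (registerArray, unregisterArray)

-- Hypothetical helper: page-lock an existing host array of n elements,
-- run the action on the registered pointer, then make the memory
-- pageable again, even if the action throws.
withRegistered :: Storable a => Int -> Ptr a -> (HostPtr a -> IO b) -> IO b
withRegistered n ptr = bracket (registerArray [] n ptr) unregisterArray
```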

Device Allocation

mallocArray :: Storable a => Int -> IO (DevicePtr a)

Allocate a section of linear memory on the device, and return a reference to it. The memory is sufficient to hold the given number of elements of storable type. It is suitably aligned for any type, and is not cleared.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gb82d2a09844a58dd9e744dc31e8aa467

allocaArray :: Storable a => Int -> (DevicePtr a -> IO b) -> IO b

Execute a computation on the device, passing a pointer to a temporarily allocated block of memory sufficient to hold the given number of elements of storable type. The memory is freed when the computation terminates (normally or via an exception), so the pointer must not be used after this.

Note that kernel launches can be asynchronous, so you may want to add a synchronisation point using sync as part of the continuation.
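The note on asynchronous launches can be made concrete with a small wrapper. This is a sketch only: withScratch is a hypothetical name, and sync is assumed to be the one exported by Foreign.CUDA.Driver.Context, which blocks until the device has finished all preceding work.

```haskell
import Foreign.Storable (Storable)
import Foreign.CUDA.Ptr (DevicePtr)
import Foreign.CUDA.Driver.Context (sync)
import Foreign.CUDA.Driver.Marshal (allocaArray)

-- Hypothetical helper: like allocaArray, but wait for the device to
-- finish before returning, so that no kernel launched by the action
-- is still using the scratch memory when it is freed.
withScratch :: Storable a => Int -> (DevicePtr a -> IO b) -> IO b
withScratch n action =
  allocaArray n $ \dptr -> do
    result <- action dptr
    sync
    return result
```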

free :: DevicePtr a -> IO ()

Release a section of device memory.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g89b3f154e17cc89b6eea277dbdf5c93a

Unified Memory Allocation

data AttachFlag

Options for unified memory allocations

Constructors

CuMemAttachGlobal
CuMemAttachHost
CuMemAttachSingle

Instances

Bounded AttachFlag
Enum AttachFlag
Eq AttachFlag
Show AttachFlag

mallocManagedArray :: Storable a => [AttachFlag] -> Int -> IO (DevicePtr a)

Allocates memory that will be automatically managed by the Unified Memory system. The returned pointer is valid on the CPU and on all GPUs which support managed memory. All accesses to this pointer must obey the Unified Memory programming model.

On a multi-GPU system with peer-to-peer support, where multiple GPUs support managed memory, the physical storage is created on the GPU which is active at the time mallocManagedArray is called. All other GPUs will access the array at reduced bandwidth via peer mapping over the PCIe bus. The Unified Memory system does not migrate memory between GPUs.

On a multi-GPU system where multiple GPUs support managed memory, but not all pairs of such GPUs have peer-to-peer support between them, the physical storage is allocated in system memory (zero-copy memory) and all GPUs will access the data at reduced bandwidth over the PCIe bus.

Requires CUDA-6.0.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gb347ded34dc326af404aa02af5388a32
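The following sketch shows one way a managed allocation might be used: allocate with CuMemAttachGlobal, release with free, and write to the array directly from the host through the underlying pointer. The names withManaged and fillFromHost are hypothetical. It assumes that useDevicePtr from Foreign.CUDA.Ptr exposes the raw pointer, and that no kernel is accessing the array while the host writes to it, as the Unified Memory programming model requires.

```haskell
import Control.Exception (bracket)
import Foreign.Storable (Storable, pokeElemOff)
import Foreign.CUDA.Ptr (DevicePtr, useDevicePtr)
import Foreign.CUDA.Driver.Marshal
  (AttachFlag (..), mallocManagedArray, free)

-- Hypothetical helper: run an action with a managed array of n elements,
-- releasing the array afterwards.
withManaged :: Storable a => Int -> (DevicePtr a -> IO b) -> IO b
withManaged n = bracket (mallocManagedArray [CuMemAttachGlobal] n) free

-- Hypothetical helper: initialise a managed array from the host, writing
-- the value i at index i. Only safe while the device is not using it.
fillFromHost :: Int -> DevicePtr Float -> IO ()
fillFromHost n dptr =
  mapM_ (\i -> pokeElemOff (useDevicePtr dptr) i (fromIntegral i)) [0 .. n - 1]
```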

Marshalling

peekArray :: Storable a => Int -> DevicePtr a -> Ptr a -> IO ()

Copy a number of elements from the device to host memory. This is a synchronous operation.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g3480368ee0208a98f75019c9a8450893

peekArrayAsync :: Storable a => Int -> DevicePtr a -> HostPtr a -> Maybe Stream -> IO ()

Copy memory from the device asynchronously, possibly associated with a particular stream. The destination host memory must be page-locked.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g56f30236c7c5247f8e061b59d3268362

peekArray2D

Arguments

:: Storable a
=> Int	width to copy (elements)
-> Int	height to copy (elements)
-> DevicePtr a	source array
-> Int	source array width
-> Int	source x-coordinate
-> Int	source y-coordinate
-> Ptr a	destination array
-> Int	destination array width
-> Int	destination x-coordinate
-> Int	destination y-coordinate
-> IO ()

Copy a 2D array from the device to the host.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g27f885b30c34cc20a663a671dbf6fc27
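Because peekArray2D takes ten positional arguments, a worked call may help. The sketch below copies a 4x3 window starting at (x = 2, y = 1) from a device array that is 8 elements wide into a tightly packed host buffer. The name peekWindow and the concrete sizes are illustrative only, and the array widths are assumed to be measured in elements, like the copy extent.

```haskell
import Foreign.Ptr (Ptr)
import Foreign.CUDA.Ptr (DevicePtr)
import Foreign.CUDA.Driver.Marshal (peekArray2D)

-- Hypothetical example: copy a 4x3 window out of a device array whose
-- rows are 8 elements wide, into a host buffer whose rows are 4 wide.
peekWindow :: DevicePtr Float -> Ptr Float -> IO ()
peekWindow src dst =
  peekArray2D
    4 3        -- width and height of the region to copy
    src 8 2 1  -- source array, its row width, and the (x, y) start
    dst 4 0 0  -- destination array, its row width, and the (x, y) start
```

The host buffer must have room for at least 12 elements, and the device array for at least 4 rows of 8.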

peekArray2DAsync

Arguments

:: Storable a
=> Int	width to copy (elements)
-> Int	height to copy (elements)
-> DevicePtr a	source array
-> Int	source array width
-> Int	source x-coordinate
-> Int	source y-coordinate
-> HostPtr a	destination array
-> Int	destination array width
-> Int	destination x-coordinate
-> Int	destination y-coordinate
-> Maybe Stream	stream to associate to
-> IO ()

Copy a 2D array from the device to the host asynchronously, possibly associated with a particular execution stream. The destination host memory must be page-locked.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g4acf155faeb969d9d21f5433d3d0f274

peekListArray :: Storable a => Int -> DevicePtr a -> IO [a]

Copy a number of elements from the device into a new Haskell list. Note that this requires two memory copies: first from the device into a heap-allocated array, and then from that array into a list.

pokeArray :: Storable a => Int -> Ptr a -> DevicePtr a -> IO ()

Copy a number of elements onto the device. This is a synchronous operation.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g4d32266788c440b0220b1a9ba5795169
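A round trip through pokeArray and peekArray might look like the sketch below. It assumes that the top-level Foreign.CUDA.Driver module re-exports this module along with initialise, device, create and destroy, and that device 0 exists. The standard Foreign.Marshal.Array functions are imported qualified because several of their names clash with the ones documented here.

```haskell
import Control.Exception (bracket)
import qualified Foreign.Marshal.Array as F
import qualified Foreign.CUDA.Driver as CUDA

-- Copy a list to the device and back again through plain host arrays.
roundTrip :: [Float] -> IO [Float]
roundTrip xs =
  F.withArrayLen xs $ \n src ->        -- host source array
  F.allocaArray n $ \dst ->            -- host destination array
  bracket (CUDA.mallocArray n) CUDA.free $ \dptr -> do
    CUDA.pokeArray n src dptr          -- host to device (synchronous)
    CUDA.peekArray n dptr dst          -- device to host (synchronous)
    F.peekArray n dst

main :: IO ()
main = do
  CUDA.initialise []
  dev <- CUDA.device 0                 -- assumes at least one device
  bracket (CUDA.create dev []) CUDA.destroy $ \_ ->
    roundTrip [1, 2, 3] >>= print
```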

pokeArrayAsync :: Storable a => Int -> HostPtr a -> DevicePtr a -> Maybe Stream -> IO ()

Copy memory onto the device asynchronously, possibly associated with a particular stream. The source host memory must be page-locked.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g1572263fe2597d7ba4f6964597a354a3
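To show how the stream argument might be supplied, the sketch below issues the copy on a freshly created stream and then waits for that stream to drain. It assumes that Foreign.CUDA.Driver.Stream provides create (taking a list of stream flags), destroy and block; the helper name uploadAsync is hypothetical. In real code the host would usually do other work between issuing the copy and blocking.

```haskell
import Control.Exception (bracket)
import Foreign.Storable (Storable)
import Foreign.CUDA.Ptr (DevicePtr, HostPtr)
import Foreign.CUDA.Driver.Marshal (pokeArrayAsync)
import qualified Foreign.CUDA.Driver.Stream as Stream

-- Hypothetical helper: copy n elements from page-locked host memory to
-- the device on a private stream, then wait for the copy to complete.
uploadAsync :: Storable a => Int -> HostPtr a -> DevicePtr a -> IO ()
uploadAsync n src dst =
  bracket (Stream.create []) Stream.destroy $ \stream -> do
    pokeArrayAsync n src dst (Just stream)
    Stream.block stream  -- the source buffer may be reused after this
```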

pokeArray2D

Arguments

:: Storable a
=> Int	width to copy (elements)
-> Int	height to copy (elements)
-> Ptr a	source array
-> Int	source array width
-> Int	source x-coordinate
-> Int	source y-coordinate
-> DevicePtr a	destination array
-> Int	destination array width
-> Int	destination x-coordinate
-> Int	destination y-coordinate
-> IO ()

Copy a 2D array from the host to the device.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g27f885b30c34cc20a663a671dbf6fc27

pokeArray2DAsync

Arguments

:: Storable a
=> Int	width to copy (elements)
-> Int	height to copy (elements)
-> HostPtr a	source array
-> Int	source array width
-> Int	source x-coordinate
-> Int	source y-coordinate
-> DevicePtr a	destination array
-> Int	destination array width
-> Int	destination x-coordinate
-> Int	destination y-coordinate
-> Maybe Stream	stream to associate to
-> IO ()

Copy a 2D array from the host to the device asynchronously, possibly associated with a particular execution stream. The source host memory must be page-locked.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g4acf155faeb969d9d21f5433d3d0f274

pokeListArray :: Storable a => [a] -> DevicePtr a -> IO ()

Write a list of storable elements into a device array. The device array must be sufficiently large to hold the entire list. This requires two marshalling operations.

copyArray :: Storable a => Int -> DevicePtr a -> DevicePtr a -> IO ()

Copy the given number of elements from the first device array (source) to the second device array (destination). The copied areas must not overlap. This operation is asynchronous with respect to the host, but will never overlap with kernel execution.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g1725774abf8b51b91945f3336b778c8b
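As a small usage sketch, copyArray can be combined with mallocArray to duplicate a device array. The helper name cloneArray is hypothetical; the caller is responsible for releasing the result with free.

```haskell
import Control.Exception (onException)
import Foreign.Storable (Storable)
import Foreign.CUDA.Ptr (DevicePtr)
import Foreign.CUDA.Driver.Marshal (copyArray, free, mallocArray)

-- Hypothetical helper: allocate a fresh device array and copy n elements
-- of the source into it. The new allocation cannot overlap the source.
cloneArray :: Storable a => Int -> DevicePtr a -> IO (DevicePtr a)
cloneArray n src = do
  dst <- mallocArray n
  -- Do not leak the new array if the copy fails.
  copyArray n src dst `onException` free dst
  return dst
```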

copyArrayAsync :: Storable a => Int -> DevicePtr a -> DevicePtr a -> Maybe Stream -> IO ()

Copy the given number of elements from the first device array (source) to the second device array (destination). The copied areas must not overlap. The operation is asynchronous with respect to the host, and can be asynchronous to other device operations by associating it with a particular stream.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g39ea09ba682b8eccc9c3e0c04319b5c8

copyArray2D

Arguments

:: Storable a
=> Int	width to copy (elements)
-> Int	height to copy (elements)
-> DevicePtr a	source array
-> Int	source array width
-> Int	source x-coordinate
-> Int	source y-coordinate
-> DevicePtr a	destination array
-> Int	destination array width
-> Int	destination x-coordinate
-> Int	destination y-coordinate
-> IO ()

Copy a 2D array from the first device array (source) to the second device array (destination). The copied areas must not overlap. This operation is asynchronous with respect to the host, but will never overlap with kernel execution.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g27f885b30c34cc20a663a671dbf6fc27

copyArray2DAsync

Arguments

:: Storable a
=> Int	width to copy (elements)
-> Int	height to copy (elements)
-> DevicePtr a	source array
-> Int	source array width
-> Int	source x-coordinate
-> Int	source y-coordinate
-> DevicePtr a	destination array
-> Int	destination array width
-> Int	destination x-coordinate
-> Int	destination y-coordinate
-> Maybe Stream	stream to associate to
-> IO ()

Copy a 2D array from the first device array (source) to the second device array (destination). The copied areas must not overlap. The operation is asynchronous with respect to the host, and can be asynchronous to other device operations by associating it with a particular execution stream.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g4acf155faeb969d9d21f5433d3d0f274

copyArrayPeer

Arguments

:: Storable a
=> Int	number of array elements
-> DevicePtr a
-> Context	source array and context
-> DevicePtr a
-> Context	destination array and context
-> IO ()

Copies an array from device memory in one context to device memory in another context. Note that this function is asynchronous with respect to the host, but serialised with respect to all pending and future asynchronous work in the source and destination contexts. To avoid this synchronisation, use copyArrayPeerAsync instead.

Requires CUDA-4.0.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1ge1f5c7771544fee150ada8853c7cbf4a

copyArrayPeerAsync

Arguments

:: Storable a
=> Int	number of array elements
-> DevicePtr a
-> Context	source array and context
-> DevicePtr a
-> Context	destination array and device context
-> Maybe Stream	stream to associate with
-> IO ()

Copies from device memory in one context to device memory in another context. Note that this function is asynchronous with respect to the host and all work in other streams and devices.

Requires CUDA-4.0.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g82fcecb38018e64b98616a8ac30112f2

Combined Allocation and Marshalling

newListArray :: Storable a => [a] -> IO (DevicePtr a)

Write a list of storable elements into a newly allocated device array. This is newListArrayLen composed with fst.

newListArrayLen :: Storable a => [a] -> IO (DevicePtr a, Int)

Write a list of storable elements into a newly allocated device array, returning the device pointer together with the number of elements that were written. Note that this requires two memory copies: first from a Haskell list to a heap-allocated array, and then from that array onto the graphics device. The memory should be freed when no longer required.

withListArray :: Storable a => [a] -> (DevicePtr a -> IO b) -> IO b

Temporarily store a list of elements into a newly allocated device array. An IO action is applied to the array, the result of which is returned. As with newListArray, this requires copying the data twice.

As with allocaArray, the memory is freed once the action completes, so you should not return the pointer from the action, and be wary of asynchronous kernel execution.

withListArrayLen :: Storable a => [a] -> (Int -> DevicePtr a -> IO b) -> IO b

A variant of withListArray which also supplies the number of elements in the array to the applied function.
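A short sketch of the combined functions in use: the list is uploaded into a temporary device array, read straight back with peekListArray, and the array is freed when the action returns. The name listRoundTrip is hypothetical and a current CUDA context is assumed. Since peekListArray returns only after the data has been copied to the host, the result does not refer to the freed device memory.

```haskell
import Data.Int (Int32)
import Foreign.CUDA.Driver.Marshal (peekListArray, withListArrayLen)

-- Hypothetical example: send a list to the device and read it back.
-- Assumes a non-empty list, since an empty one would request a
-- zero-sized allocation, which the driver may reject.
listRoundTrip :: [Int32] -> IO [Int32]
listRoundTrip xs =
  withListArrayLen xs $ \n dptr ->
    peekListArray n dptr
```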

Utility

memset :: Storable a => DevicePtr a -> Int -> a -> IO ()

Set a number of data elements to the specified value, which may be 8, 16, or 32 bits wide.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g6e582bf866e9e2fb014297bfaf354d7b

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g7d805e610054392a4d11e8a8bf5eb35c

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g983e8d8759acd1b64326317481fbf132
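For illustration, the sketch below fills a temporary device array with a 32-bit value and reads it back as a list. The name filledWith is hypothetical. Int32 is chosen because it matches one of the supported element widths; element types of other sizes fall outside what the description above allows.

```haskell
import Data.Int (Int32)
import Foreign.CUDA.Driver.Marshal (allocaArray, memset, peekListArray)

-- Hypothetical example: build an n-element device array in which every
-- element is the given value, then copy it back to the host as a list.
filledWith :: Int32 -> Int -> IO [Int32]
filledWith value n =
  allocaArray n $ \dptr -> do
    memset dptr n value
    peekListArray n dptr
```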

memsetAsync :: Storable a => DevicePtr a -> Int -> a -> Maybe Stream -> IO ()

Set a number of data elements to the specified value, which may be 8, 16, or 32 bits wide. The operation is asynchronous and may optionally be associated with a stream.

Requires CUDA-3.2.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gaef08a7ccd61112f94e82f2b30d43627

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gf731438877dd8ec875e4c43d848c878c

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g58229da5d30f1c0cdf667b320ec2c0f5

getDevicePtr :: [AllocFlag] -> HostPtr a -> IO (DevicePtr a)

Return the device pointer associated with a mapped, pinned host buffer, which was allocated with the DeviceMapped option by mallocHostArray.

Currently no options are supported, so the list of flags must be empty.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g57a39e5cba26af4d06be67fc77cc62f0
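The relationship between DeviceMapped allocations and getDevicePtr can be sketched as follows. The helper name withMappedArray is hypothetical. Depending on the device and CUDA version, the context may also need to have been created with host-memory mapping enabled; that is an assumption about the environment and is not shown here.

```haskell
import Control.Exception (bracket)
import Foreign.Storable (Storable)
import Foreign.CUDA.Ptr (DevicePtr, HostPtr)
import Foreign.CUDA.Driver.Marshal
  (AllocFlag (..), freeHost, getDevicePtr, mallocHostArray)

-- Hypothetical helper: allocate n elements of mapped, page-locked host
-- memory and pass both views of it (host and device) to the action.
withMappedArray
  :: Storable a => Int -> (HostPtr a -> DevicePtr a -> IO b) -> IO b
withMappedArray n action =
  bracket (mallocHostArray [DeviceMapped] n) freeHost $ \hptr -> do
    dptr <- getDevicePtr [] hptr  -- the flag list must be empty
    action hptr dptr
```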

getBasePtr :: DevicePtr a -> IO (DevicePtr a, Int64)

Return the base address and allocation size of the given device pointer.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g64fee5711274a2a0573a789c94d8299b

getMemInfo :: IO (Int64, Int64)

Return the amount of free and total memory respectively available to the current context (bytes).

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g808f555540d0143a331cc42aa98835c0
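A small sketch of reading the result, to make the order of the pair explicit (free first, then total). The names toMiB and reportMemory are hypothetical, and a current CUDA context is assumed.

```haskell
import Data.Int (Int64)
import Text.Printf (printf)
import Foreign.CUDA.Driver.Marshal (getMemInfo)

-- Convert a byte count to mebibytes for display.
toMiB :: Int64 -> Double
toMiB bytes = fromIntegral bytes / (1024 * 1024)

-- Hypothetical helper: print the free and total memory of the current
-- context.
reportMemory :: IO ()
reportMemory = do
  (freeBytes, totalBytes) <- getMemInfo
  printf "free: %.1f MiB of %.1f MiB\n" (toMiB freeBytes) (toMiB totalBytes)
```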