cuda-0.10.0.0: FFI binding to the CUDA interface for programming NVIDIA GPUs

Copyright: [2009..2018] Trevor L. McDonell
License: BSD
Safe Haskell: None
Language: Haskell98

Foreign.CUDA.Driver.Marshal

Description

Memory management for low-level driver interface

Host Allocation

mallocHostArray :: Storable a => [AllocFlag] -> Int -> IO (HostPtr a)

Allocate a section of linear memory on the host which is page-locked and directly accessible from the device. The storage is sufficient to hold the given number of elements of a storable type.

Note that page-locking reduces the amount of pageable memory, so overall system performance may suffer. This is best used sparingly to allocate staging areas for data exchange.

Host memory allocated in this way is automatically and immediately accessible to all contexts on all devices which support unified addressing.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gdd8311286d2c2691605362c689bc64e0

mallocHostForeignPtr :: Storable a => [AllocFlag] -> Int -> IO (ForeignPtr a)

As mallocHostArray, but return a ForeignPtr instead. The array will be deallocated automatically once the last reference to the ForeignPtr is dropped.
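A minimal sketch of using mallocHostArray for a small staging buffer. It assumes a CUDA-capable device with index 0; withContext and stagingDemo are hypothetical helper names, and freeHost and useHostPtr come from the same package but are not described in this section.

```haskell
import Control.Exception (bracket)
import qualified Foreign.CUDA.Driver as CUDA
import qualified Foreign.Marshal.Array as F

-- Hypothetical helper: run an action inside a fresh context on device 0
-- and destroy the context afterwards.
withContext :: IO a -> IO a
withContext action = do
  CUDA.initialise []
  dev <- CUDA.device 0
  bracket (CUDA.create dev []) CUDA.destroy (const action)

-- Allocate a page-locked buffer of n floats (n > 0), fill it from the host
-- side, and read it back. The buffer is released even if an exception occurs.
stagingDemo :: Int -> IO [Float]
stagingDemo n =
  bracket (CUDA.mallocHostArray [] n) CUDA.freeHost $ \hptr -> do
    let p = CUDA.useHostPtr hptr          -- ordinary Ptr view of the buffer
    F.pokeArray p (map fromIntegral [0 .. n - 1])
    F.peekArray n p

main :: IO ()
main = withContext (stagingDemo 8) >>= print
```

Wrapping the allocation in bracket keeps the amount of pinned memory bounded to the lifetime of the staging work, in line with the advice to use it sparingly.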

registerArray :: Storable a => [AllocFlag] -> Int -> Ptr a -> IO (HostPtr a)

Page-locks the specified array (on the host) and maps it for the device(s) as specified by the given allocation flags. The device then accesses the memory directly, so it can be read and written with much higher bandwidth than pageable memory that has not been registered. The memory range is added to the same tracking mechanism as mallocHostArray to automatically accelerate calls to functions such as pokeArray.

Note that page-locking excessive amounts of memory may degrade system performance, since it reduces the amount of pageable memory available. This is best used sparingly to allocate staging areas for data exchange.

This function has limited support on Mac OS X. OS 10.7 or later is required.

Requires CUDA-4.0.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gf0a9fe11544326dabd743b7aa6b54223

unregisterArray :: HostPtr a -> IO (Ptr a)

Unmaps the memory from the given pointer, and makes it pageable again.

Requires CUDA-4.0.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g63f450c8125359be87b7623b1c0b2a14
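The following sketch pairs registerArray with unregisterArray around an asynchronous upload. It assumes device 0 exists and that the driver accepts a small buffer from the C heap for registration (alignment requirements may differ between CUDA versions and platforms); withContext and viaRegistered are hypothetical names.

```haskell
import Control.Exception (bracket)
import Foreign.Marshal.Alloc (free)
import qualified Foreign.CUDA.Driver as CUDA
import qualified Foreign.Marshal.Array as F

-- Hypothetical helper: run an action inside a fresh context on device 0.
withContext :: IO a -> IO a
withContext action = do
  CUDA.initialise []
  dev <- CUDA.device 0
  bracket (CUDA.create dev []) CUDA.destroy (const action)

-- Copy a non-empty list to the device through a registered (page-locked)
-- host array, then read it back from the device.
viaRegistered :: [Float] -> IO [Float]
viaRegistered xs =
  let n = length xs in
  bracket (F.mallocArray n) free $ \ptr -> do      -- pageable host memory
    F.pokeArray ptr xs
    -- Page-lock the array for the duration of the transfer only.
    bracket (CUDA.registerArray [] n ptr) CUDA.unregisterArray $ \hptr ->
      CUDA.allocaArray n $ \dptr -> do
        CUDA.pokeArrayAsync n hptr dptr Nothing    -- default stream
        CUDA.sync                                  -- wait for the upload
        CUDA.peekListArray n dptr

main :: IO ()
main = withContext (viaRegistered [1, 2, 3, 4]) >>= print
```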

Device Allocation

mallocArray :: Storable a => Int -> IO (DevicePtr a)

Allocate a section of linear memory on the device, and return a reference to it. The memory is sufficient to hold the given number of elements of storable type. It is suitably aligned for any type, and is not cleared.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gb82d2a09844a58dd9e744dc31e8aa467

allocaArray :: Storable a => Int -> (DevicePtr a -> IO b) -> IO b

Execute a computation on the device, passing a pointer to a temporarily allocated block of memory sufficient to hold the given number of elements of storable type. The memory is freed when the computation terminates (normally or via an exception), so the pointer must not be used after this.

Note that kernel launches can be asynchronous, so you may want to add a synchronisation point using sync as part of the continuation.
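As an illustration of the scoping rule, the sketch below keeps every use of the device pointer inside the continuation and synchronises before returning. It assumes device 0 is available; withContext and roundTrip are hypothetical names, and no kernel is actually launched here.

```haskell
import Control.Exception (bracket)
import qualified Foreign.CUDA.Driver as CUDA

-- Hypothetical helper: run an action inside a fresh context on device 0.
withContext :: IO a -> IO a
withContext action = do
  CUDA.initialise []
  dev <- CUDA.device 0
  bracket (CUDA.create dev []) CUDA.destroy (const action)

-- Upload a non-empty list into temporary device memory and read it back.
-- Only the resulting list leaves the continuation, never the pointer.
roundTrip :: [Float] -> IO [Float]
roundTrip xs =
  let n = length xs in
  CUDA.allocaArray n $ \dptr -> do
    CUDA.pokeListArray xs dptr
    -- Asynchronous work on dptr (for example a kernel launch) would go here.
    CUDA.sync                      -- finish device work before the memory is freed
    CUDA.peekListArray n dptr

main :: IO ()
main = withContext (roundTrip [1, 2, 3]) >>= print
```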

Unified Memory Allocation

mallocManagedArray :: Storable a => [AttachFlag] -> Int -> IO (DevicePtr a)

Allocates memory that will be automatically managed by the Unified Memory system. The returned pointer is valid on the CPU and on all GPUs which support managed memory. All accesses to this pointer must obey the Unified Memory programming model.

On a multi-GPU system with peer-to-peer support, where multiple GPUs support managed memory, the physical storage is created on the GPU which is active at the time mallocManagedArray is called. All other GPUs will access the array at reduced bandwidth via peer mapping over the PCIe bus. The Unified Memory system does not migrate memory between GPUs.

On a multi-GPU system where multiple GPUs support managed memory, but not all pairs of such GPUs have peer-to-peer support between them, the physical storage is allocated in system memory (zero-copy memory) and all GPUs will access the data at reduced bandwidth over the PCIe bus.

Requires CUDA-6.0.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gb347ded34dc326af404aa02af5388a32

prefetchArrayAsync :: Storable a => DevicePtr a -> Int -> Maybe Device -> Maybe Stream -> IO ()

Pre-fetches the given number of elements to the specified destination device. If the specified device is Nothing, the data is pre-fetched to host memory. The pointer must refer to a memory range allocated with mallocManagedArray.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__UNIFIED.html#group__CUDA__UNIFIED_1gfe94f8b7fb56291ebcea44261aa4cb84

Requires CUDA-8.0.

attachArrayAsync :: forall a. Storable a => [AttachFlag] -> Stream -> DevicePtr a -> Int -> IO ()

Attach an array of the given number of elements to a stream asynchronously.

https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g6e468d680e263e7eba02a56643c50533

Since: 0.10.0.0

Marshalling

peekArray :: Storable a => Int -> DevicePtr a -> Ptr a -> IO ()

Copy a number of elements from the device to host memory. This is a synchronous operation.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g3480368ee0208a98f75019c9a8450893

peekArrayAsync :: Storable a => Int -> DevicePtr a -> HostPtr a -> Maybe Stream -> IO ()

Copy memory from the device asynchronously, possibly associated with a particular stream. The destination host memory must be page-locked.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g56f30236c7c5247f8e061b59d3268362

peekArray2D
  :: Storable a
  => Int          -- ^ width to copy (elements)
  -> Int          -- ^ height to copy (elements)
  -> DevicePtr a  -- ^ source array
  -> Int          -- ^ source array width
  -> Int          -- ^ source x-coordinate
  -> Int          -- ^ source y-coordinate
  -> Ptr a        -- ^ destination array
  -> Int          -- ^ destination array width
  -> Int          -- ^ destination x-coordinate
  -> Int          -- ^ destination y-coordinate
  -> IO ()

Copy a 2D array from the device to the host. This is a synchronous operation.
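To make the argument order of peekArray2D concrete, here is a small sketch that extracts a 2x2 window from a 4x4 matrix held on the device. It assumes device 0 is available and that arrays are laid out row-major with the row pitch equal to the array width; withContext and subMatrix are hypothetical names. Under those assumptions the window starting at column 1, row 1 should contain the elements 5, 6, 9 and 10.

```haskell
import Control.Exception (bracket)
import qualified Foreign.CUDA.Driver as CUDA
import qualified Foreign.Marshal.Array as F

-- Hypothetical helper: run an action inside a fresh context on device 0.
withContext :: IO a -> IO a
withContext action = do
  CUDA.initialise []
  dev <- CUDA.device 0
  bracket (CUDA.create dev []) CUDA.destroy (const action)

-- Copy the 2x2 block at (x, y) = (1, 1) of a 4x4 device matrix into a
-- tightly packed 2x2 host array.
subMatrix :: IO [Float]
subMatrix =
  CUDA.withListArray [0 .. 15] $ \dptr ->      -- 4x4 source, row-major
    F.allocaArray 4 $ \hptr -> do              -- 2x2 destination
      CUDA.peekArray2D
        2 2          -- width and height to copy (elements)
        dptr 4 1 1   -- source array, its width, x and y offset
        hptr 2 0 0   -- destination array, its width, x and y offset
      F.peekArray 4 hptr

main :: IO ()
main = withContext subMatrix >>= print
```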

peekArray2DAsync
  :: Storable a
  => Int           -- ^ width to copy (elements)
  -> Int           -- ^ height to copy (elements)
  -> DevicePtr a   -- ^ source array
  -> Int           -- ^ source array width
  -> Int           -- ^ source x-coordinate
  -> Int           -- ^ source y-coordinate
  -> HostPtr a     -- ^ destination array
  -> Int           -- ^ destination array width
  -> Int           -- ^ destination x-coordinate
  -> Int           -- ^ destination y-coordinate
  -> Maybe Stream  -- ^ stream to associate with
  -> IO ()

Copy a 2D array from the device to the host asynchronously, possibly associated with a particular execution stream. The destination host memory must be page-locked.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g4acf155faeb969d9d21f5433d3d0f274

peekListArray :: Storable a => Int -> DevicePtr a -> IO [a]

Copy a number of elements from the device into a new Haskell list. Note that this requires two memory copies: firstly from the device into a heap allocated array, and from there marshalled into a list.

pokeArray :: Storable a => Int -> Ptr a -> DevicePtr a -> IO ()

Copy a number of elements onto the device. This is a synchronous operation.

pokeArrayAsync :: Storable a => Int -> HostPtr a -> DevicePtr a -> Maybe Stream -> IO ()

Copy memory onto the device asynchronously, possibly associated with a particular stream. The source host memory must be page-locked.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g1572263fe2597d7ba4f6964597a354a3
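A sketch of how pokeArrayAsync and peekArrayAsync might be combined on one stream, using page-locked buffers on both ends. It assumes device 0 is available and uses Foreign.CUDA.Driver.Stream from the same package, which is not part of this module; withContext and asyncRoundTrip are hypothetical names.

```haskell
import Control.Exception (bracket)
import qualified Foreign.CUDA.Driver as CUDA
import qualified Foreign.CUDA.Driver.Stream as Stream
import qualified Foreign.Marshal.Array as F

-- Hypothetical helper: run an action inside a fresh context on device 0.
withContext :: IO a -> IO a
withContext action = do
  CUDA.initialise []
  dev <- CUDA.device 0
  bracket (CUDA.create dev []) CUDA.destroy (const action)

-- Upload and download a non-empty list asynchronously on a single stream.
asyncRoundTrip :: [Float] -> IO [Float]
asyncRoundTrip xs =
  let n = length xs in
  bracket (CUDA.mallocHostArray [] n) CUDA.freeHost $ \hsrc ->
  bracket (CUDA.mallocHostArray [] n) CUDA.freeHost $ \hdst ->
  bracket (Stream.create []) Stream.destroy $ \st ->
  CUDA.allocaArray n $ \dptr -> do
    F.pokeArray (CUDA.useHostPtr hsrc) xs
    -- Both copies are queued on the same stream, so they run in order.
    CUDA.pokeArrayAsync n hsrc dptr (Just st)
    CUDA.peekArrayAsync n dptr hdst (Just st)
    Stream.block st                        -- wait until both copies are done
    F.peekArray n (CUDA.useHostPtr hdst)

main :: IO ()
main = withContext (asyncRoundTrip [1, 2, 3, 4]) >>= print
```

The host buffers must not be read or reused until the stream has been waited on, which is why the final peek comes after Stream.block.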

pokeArray2D
  :: Storable a
  => Int          -- ^ width to copy (elements)
  -> Int          -- ^ height to copy (elements)
  -> Ptr a        -- ^ source array
  -> Int          -- ^ source array width
  -> Int          -- ^ source x-coordinate
  -> Int          -- ^ source y-coordinate
  -> DevicePtr a  -- ^ destination array
  -> Int          -- ^ destination array width
  -> Int          -- ^ destination x-coordinate
  -> Int          -- ^ destination y-coordinate
  -> IO ()

Copy a 2D array from the host to the device. This is a synchronous operation.

pokeArray2DAsync
  :: Storable a
  => Int           -- ^ width to copy (elements)
  -> Int           -- ^ height to copy (elements)
  -> HostPtr a     -- ^ source array
  -> Int           -- ^ source array width
  -> Int           -- ^ source x-coordinate
  -> Int           -- ^ source y-coordinate
  -> DevicePtr a   -- ^ destination array
  -> Int           -- ^ destination array width
  -> Int           -- ^ destination x-coordinate
  -> Int           -- ^ destination y-coordinate
  -> Maybe Stream  -- ^ stream to associate with
  -> IO ()

Copy a 2D array from the host to the device asynchronously, possibly associated with a particular execution stream. The source host memory must be page-locked.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g4acf155faeb969d9d21f5433d3d0f274

pokeListArray :: Storable a => [a] -> DevicePtr a -> IO ()

Write a list of storable elements into a device array. The device array must be sufficiently large to hold the entire list. This requires two marshalling operations.

copyArray :: Storable a => Int -> DevicePtr a -> DevicePtr a -> IO ()

Copy the given number of elements from the first device array (source) to the second device array (destination). The copied areas must not overlap. This operation is asynchronous with respect to the host, but will never overlap with kernel execution.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g1725774abf8b51b91945f3336b778c8b
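A short sketch of a device-to-device copy between two separately allocated (and therefore non-overlapping) arrays. It assumes device 0 is available; withContext and duplicate are hypothetical names, and free is exported by the same package although it is not described in this section.

```haskell
import Control.Exception (bracket)
import qualified Foreign.CUDA.Driver as CUDA

-- Hypothetical helper: run an action inside a fresh context on device 0.
withContext :: IO a -> IO a
withContext action = do
  CUDA.initialise []
  dev <- CUDA.device 0
  bracket (CUDA.create dev []) CUDA.destroy (const action)

-- Upload a non-empty list, copy it into a second device array, and read
-- the copy back. Both device arrays are freed on exit.
duplicate :: [Float] -> IO [Float]
duplicate xs =
  bracket (CUDA.newListArrayLen xs) (CUDA.free . fst) $ \(src, n) ->
    bracket (CUDA.mallocArray n) CUDA.free $ \dst -> do
      CUDA.copyArray n src dst     -- source first, destination second
      CUDA.sync                    -- the copy is asynchronous w.r.t. the host
      CUDA.peekListArray n dst

main :: IO ()
main = withContext (duplicate [10, 20, 30]) >>= print
```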

copyArrayAsync :: Storable a => Int -> DevicePtr a -> DevicePtr a -> Maybe Stream -> IO ()

Copy the given number of elements from the first device array (source) to the second device array (destination). The copied areas must not overlap. The operation is asynchronous with respect to the host, and can be asynchronous to other device operations by associating it with a particular stream.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g39ea09ba682b8eccc9c3e0c04319b5c8

copyArray2D
  :: Storable a
  => Int          -- ^ width to copy (elements)
  -> Int          -- ^ height to copy (elements)
  -> DevicePtr a  -- ^ source array
  -> Int          -- ^ source array width
  -> Int          -- ^ source x-coordinate
  -> Int          -- ^ source y-coordinate
  -> DevicePtr a  -- ^ destination array
  -> Int          -- ^ destination array width
  -> Int          -- ^ destination x-coordinate
  -> Int          -- ^ destination y-coordinate
  -> IO ()

Copy a 2D array from the first device array (source) to the second device array (destination). The copied areas must not overlap. This operation is asynchronous with respect to the host, but will never overlap with kernel execution.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g27f885b30c34cc20a663a671dbf6fc27

copyArray2DAsync
  :: Storable a
  => Int           -- ^ width to copy (elements)
  -> Int           -- ^ height to copy (elements)
  -> DevicePtr a   -- ^ source array
  -> Int           -- ^ source array width
  -> Int           -- ^ source x-coordinate
  -> Int           -- ^ source y-coordinate
  -> DevicePtr a   -- ^ destination array
  -> Int           -- ^ destination array width
  -> Int           -- ^ destination x-coordinate
  -> Int           -- ^ destination y-coordinate
  -> Maybe Stream  -- ^ stream to associate with
  -> IO ()

Copy a 2D array from the first device array (source) to the second device array (destination). The copied areas must not overlap. The operation is asynchronous with respect to the host, and can be asynchronous to other device operations by associating it with a particular execution stream.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g4acf155faeb969d9d21f5433d3d0f274

copyArrayPeer
  :: Storable a
  => Int          -- ^ number of array elements
  -> DevicePtr a
  -> Context      -- ^ source array and context
  -> DevicePtr a
  -> Context      -- ^ destination array and context
  -> IO ()

Copies an array from device memory in one context to device memory in another context. Note that this function is asynchronous with respect to the host, but serialised with respect to all pending and future asynchronous work in the source and destination contexts. To avoid this synchronisation, use copyArrayPeerAsync instead.

Requires CUDA-4.0.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1ge1f5c7771544fee150ada8853c7cbf4a

copyArrayPeerAsync
  :: Storable a
  => Int           -- ^ number of array elements
  -> DevicePtr a
  -> Context       -- ^ source array and context
  -> DevicePtr a
  -> Context       -- ^ destination array and device context
  -> Maybe Stream  -- ^ stream to associate with
  -> IO ()

Copies from device memory in one context to device memory in another context. Note that this function is asynchronous with respect to the host and all work in other streams and devices.

Requires CUDA-4.0.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g82fcecb38018e64b98616a8ac30112f2

Combined Allocation and Marshalling

newListArray :: Storable a => [a] -> IO (DevicePtr a)

Write a list of storable elements into a newly allocated device array. This is newListArrayLen composed with fst.

newListArrayLen :: Storable a => [a] -> IO (DevicePtr a, Int)

Write a list of storable elements into a newly allocated device array, returning the device pointer together with the number of elements that were written. Note that this requires two memory copies: firstly from a Haskell list to a heap allocated array, and from there onto the graphics device. The memory should be freed when no longer required.
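Because the array returned by newListArrayLen is not freed automatically, one possible pattern is to pair the allocation with free in a bracket. The sketch below assumes device 0 is available; withContext and uploadThenFetch are hypothetical names, and free is exported by the same package although it is not described in this section.

```haskell
import Control.Exception (bracket)
import qualified Foreign.CUDA.Driver as CUDA

-- Hypothetical helper: run an action inside a fresh context on device 0.
withContext :: IO a -> IO a
withContext action = do
  CUDA.initialise []
  dev <- CUDA.device 0
  bracket (CUDA.create dev []) CUDA.destroy (const action)

-- Upload a non-empty list, then use the returned element count to read
-- the data back. The device array is freed when the action finishes.
uploadThenFetch :: [Double] -> IO [Double]
uploadThenFetch xs =
  bracket (CUDA.newListArrayLen xs) (CUDA.free . fst) $ \(dptr, n) ->
    CUDA.peekListArray n dptr

main :: IO ()
main = withContext (uploadThenFetch [0.5, 1.5, 2.5]) >>= print
```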

withListArray :: Storable a => [a] -> (DevicePtr a -> IO b) -> IO b

Temporarily store a list of elements into a newly allocated device array. An IO action is applied to the array, the result of which is returned. Similar to newListArray, this requires copying the data twice.

As with allocaArray, the memory is freed once the action completes, so you should not return the pointer from the action, and be wary of asynchronous kernel execution.

withListArrayLen :: Storable a => [a] -> (Int -> DevicePtr a -> IO b) -> IO b

A variant of withListArray which also supplies the number of elements in the array to the applied function.

Utility

getDevicePtr :: [AllocFlag] -> HostPtr a -> IO (DevicePtr a)

Return the device pointer associated with a mapped, pinned host buffer, which was allocated with the DeviceMapped option by mallocHostArray.

Currently no options are supported, so the list of flags must be empty.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g57a39e5cba26af4d06be67fc77cc62f0

getMemInfo :: IO (Int64, Int64)

Return the amount of free and total memory, in bytes, available to the current context, as a (free, total) pair.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g808f555540d0143a331cc42aa98835c0
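A small sketch of querying getMemInfo, for instance to check headroom before a large allocation. It assumes device 0 is available; withContext and freeFraction are hypothetical names.

```haskell
import Control.Exception (bracket)
import qualified Foreign.CUDA.Driver as CUDA

-- Hypothetical helper: run an action inside a fresh context on device 0.
withContext :: IO a -> IO a
withContext action = do
  CUDA.initialise []
  dev <- CUDA.device 0
  bracket (CUDA.create dev []) CUDA.destroy (const action)

-- Fraction of the memory visible to the current context that is still free.
freeFraction :: IO Double
freeFraction = do
  (freeBytes, totalBytes) <- CUDA.getMemInfo   -- free first, total second
  return (fromIntegral freeBytes / fromIntegral totalBytes)

main :: IO ()
main = do
  f <- withContext freeFraction
  putStrLn ("free device memory: " ++ show (100 * f) ++ "%")
```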