Copyright | [2009..2018] Trevor L. McDonell
---|---
License | BSD
Safe Haskell | None
Language | Haskell98
Memory management for the low-level driver interface
Synopsis
- data AllocFlag
- mallocHostArray :: Storable a => [AllocFlag] -> Int -> IO (HostPtr a)
- mallocHostForeignPtr :: Storable a => [AllocFlag] -> Int -> IO (ForeignPtr a)
- freeHost :: HostPtr a -> IO ()
- registerArray :: Storable a => [AllocFlag] -> Int -> Ptr a -> IO (HostPtr a)
- unregisterArray :: HostPtr a -> IO (Ptr a)
- mallocArray :: Storable a => Int -> IO (DevicePtr a)
- allocaArray :: Storable a => Int -> (DevicePtr a -> IO b) -> IO b
- free :: DevicePtr a -> IO ()
- data AttachFlag
- mallocManagedArray :: Storable a => [AttachFlag] -> Int -> IO (DevicePtr a)
- prefetchArrayAsync :: Storable a => DevicePtr a -> Int -> Maybe Device -> Maybe Stream -> IO ()
- attachArrayAsync :: forall a. Storable a => [AttachFlag] -> Stream -> DevicePtr a -> Int -> IO ()
- peekArray :: Storable a => Int -> DevicePtr a -> Ptr a -> IO ()
- peekArrayAsync :: Storable a => Int -> DevicePtr a -> HostPtr a -> Maybe Stream -> IO ()
- peekArray2D :: Storable a => Int -> Int -> DevicePtr a -> Int -> Int -> Int -> Ptr a -> Int -> Int -> Int -> IO ()
- peekArray2DAsync :: Storable a => Int -> Int -> DevicePtr a -> Int -> Int -> Int -> HostPtr a -> Int -> Int -> Int -> Maybe Stream -> IO ()
- peekListArray :: Storable a => Int -> DevicePtr a -> IO [a]
- pokeArray :: Storable a => Int -> Ptr a -> DevicePtr a -> IO ()
- pokeArrayAsync :: Storable a => Int -> HostPtr a -> DevicePtr a -> Maybe Stream -> IO ()
- pokeArray2D :: Storable a => Int -> Int -> Ptr a -> Int -> Int -> Int -> DevicePtr a -> Int -> Int -> Int -> IO ()
- pokeArray2DAsync :: Storable a => Int -> Int -> HostPtr a -> Int -> Int -> Int -> DevicePtr a -> Int -> Int -> Int -> Maybe Stream -> IO ()
- pokeListArray :: Storable a => [a] -> DevicePtr a -> IO ()
- copyArray :: Storable a => Int -> DevicePtr a -> DevicePtr a -> IO ()
- copyArrayAsync :: Storable a => Int -> DevicePtr a -> DevicePtr a -> Maybe Stream -> IO ()
- copyArray2D :: Storable a => Int -> Int -> DevicePtr a -> Int -> Int -> Int -> DevicePtr a -> Int -> Int -> Int -> IO ()
- copyArray2DAsync :: Storable a => Int -> Int -> DevicePtr a -> Int -> Int -> Int -> DevicePtr a -> Int -> Int -> Int -> Maybe Stream -> IO ()
- copyArrayPeer :: Storable a => Int -> DevicePtr a -> Context -> DevicePtr a -> Context -> IO ()
- copyArrayPeerAsync :: Storable a => Int -> DevicePtr a -> Context -> DevicePtr a -> Context -> Maybe Stream -> IO ()
- newListArray :: Storable a => [a] -> IO (DevicePtr a)
- newListArrayLen :: Storable a => [a] -> IO (DevicePtr a, Int)
- withListArray :: Storable a => [a] -> (DevicePtr a -> IO b) -> IO b
- withListArrayLen :: Storable a => [a] -> (Int -> DevicePtr a -> IO b) -> IO b
- memset :: Storable a => DevicePtr a -> Int -> a -> IO ()
- memsetAsync :: Storable a => DevicePtr a -> Int -> a -> Maybe Stream -> IO ()
- getDevicePtr :: [AllocFlag] -> HostPtr a -> IO (DevicePtr a)
- getBasePtr :: DevicePtr a -> IO (DevicePtr a, Int64)
- getMemInfo :: IO (Int64, Int64)
Host Allocation
data AllocFlag Source #
Options for host allocation
Instances
Bounded AllocFlag Source # | |
Enum AllocFlag Source # | |
Defined in Foreign.CUDA.Driver.Marshal succ :: AllocFlag -> AllocFlag # pred :: AllocFlag -> AllocFlag # toEnum :: Int -> AllocFlag # fromEnum :: AllocFlag -> Int # enumFrom :: AllocFlag -> [AllocFlag] # enumFromThen :: AllocFlag -> AllocFlag -> [AllocFlag] # enumFromTo :: AllocFlag -> AllocFlag -> [AllocFlag] # enumFromThenTo :: AllocFlag -> AllocFlag -> AllocFlag -> [AllocFlag] # | |
Eq AllocFlag Source # | |
Show AllocFlag Source # | |
mallocHostArray :: Storable a => [AllocFlag] -> Int -> IO (HostPtr a) Source #
Allocate a section of linear memory on the host which is page-locked and directly accessible from the device. The storage is sufficient to hold the given number of elements of a storable type.
Note that since the amount of pageable memory is thereby reduced, overall system performance may suffer. This is best used sparingly to allocate staging areas for data exchange.
Host memory allocated in this way is automatically and immediately accessible to all contexts on all devices which support unified addressing.
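As a minimal sketch of the staging-area pattern described above (assuming the `cuda` package's `Foreign.CUDA.Driver.Marshal` and `Foreign.CUDA.Ptr` modules, and an already-active CUDA context; the buffer size is arbitrary):

```haskell
import Foreign.CUDA.Driver.Marshal
import Foreign.CUDA.Ptr

-- Allocate a page-locked host staging buffer and a matching device array,
-- then perform an asynchronous host-to-device transfer from it.
stagingExample :: IO ()
stagingExample = do
  host <- mallocHostArray [] 1024 :: IO (HostPtr Float)
  dev  <- mallocArray 1024        :: IO (DevicePtr Float)
  -- ... fill the host buffer via useHostPtr, then:
  pokeArrayAsync 1024 host dev Nothing  -- page-locked source permits async copy
  free dev
  freeHost host
```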
mallocHostForeignPtr :: Storable a => [AllocFlag] -> Int -> IO (ForeignPtr a) Source #
As mallocHostArray, but return a ForeignPtr instead. The array will be deallocated automatically once the last reference to the ForeignPtr is dropped.
registerArray :: Storable a => [AllocFlag] -> Int -> Ptr a -> IO (HostPtr a) Source #
Page-locks the specified array (on the host) and maps it for the device(s) as specified by the given allocation flags. Subsequently, the memory is accessed directly by the device and can therefore be read and written with much higher bandwidth than pageable memory that has not been registered. The memory range is added to the same tracking mechanism as mallocHostArray, to automatically accelerate calls to functions such as pokeArray.
Note that page-locking excessive amounts of memory may degrade system performance, since it reduces the amount of pageable memory available. This is best used sparingly to allocate staging areas for data exchange.
This function has limited support on Mac OS X. OS 10.7 or later is required.
Requires CUDA-4.0.
unregisterArray :: HostPtr a -> IO (Ptr a) Source #
Unmaps the memory from the given pointer, and makes it pageable again.
Requires CUDA-4.0.
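A small bracketing helper, sketched here under the assumption of CUDA 4.0 or later and an active context, keeps registration and unregistration paired even when the enclosed action throws:

```haskell
import Control.Exception (bracket)
import Foreign.CUDA.Driver.Marshal
import Foreign.CUDA.Ptr
import Foreign.Ptr (Ptr)
import Foreign.Storable (Storable)

-- Temporarily page-lock an existing host allocation of n elements so that
-- transfers from it run at pinned-memory bandwidth.
withRegistered :: Storable a => Int -> Ptr a -> (HostPtr a -> IO b) -> IO b
withRegistered n p = bracket (registerArray [] n p) unregisterArray
```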
Device Allocation
mallocArray :: Storable a => Int -> IO (DevicePtr a) Source #
Allocate a section of linear memory on the device, and return a reference to it. The memory is sufficient to hold the given number of elements of storable type. It is suitably aligned for any type, and is not cleared.
allocaArray :: Storable a => Int -> (DevicePtr a -> IO b) -> IO b Source #
Execute a computation on the device, passing a pointer to a temporarily allocated block of memory sufficient to hold the given number of elements of storable type. The memory is freed when the computation terminates (normally or via an exception), so the pointer must not be used after this.
Note that kernel launches can be asynchronous, so you may want to add a synchronisation point using sync as part of the continuation.
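The pattern might look like the following sketch (assuming an active context; `sync` is taken from Foreign.CUDA.Driver.Context, and the kernel launch is elided):

```haskell
import Foreign.CUDA.Driver.Marshal
import Foreign.CUDA.Driver.Context (sync)
import Foreign.CUDA.Ptr

-- Use temporary device scratch space, synchronising before it is freed so
-- that no kernel is still using the array when allocaArray releases it.
scratchExample :: IO [Float]
scratchExample =
  allocaArray 256 $ \tmp -> do
    memset tmp 256 (0 :: Float)
    -- ... launch kernels writing into tmp ...
    sync                      -- ensure the device has finished with tmp
    peekListArray 256 tmp
```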
Unified Memory Allocation
data AttachFlag Source #
Options for unified memory allocations
Instances
Bounded AttachFlag Source # | |
Defined in Foreign.CUDA.Driver.Marshal minBound :: AttachFlag # maxBound :: AttachFlag # | |
Enum AttachFlag Source # | |
Defined in Foreign.CUDA.Driver.Marshal succ :: AttachFlag -> AttachFlag # pred :: AttachFlag -> AttachFlag # toEnum :: Int -> AttachFlag # fromEnum :: AttachFlag -> Int # enumFrom :: AttachFlag -> [AttachFlag] # enumFromThen :: AttachFlag -> AttachFlag -> [AttachFlag] # enumFromTo :: AttachFlag -> AttachFlag -> [AttachFlag] # enumFromThenTo :: AttachFlag -> AttachFlag -> AttachFlag -> [AttachFlag] # | |
Eq AttachFlag Source # | |
Defined in Foreign.CUDA.Driver.Marshal (==) :: AttachFlag -> AttachFlag -> Bool # (/=) :: AttachFlag -> AttachFlag -> Bool # | |
Show AttachFlag Source # | |
Defined in Foreign.CUDA.Driver.Marshal showsPrec :: Int -> AttachFlag -> ShowS # show :: AttachFlag -> String # showList :: [AttachFlag] -> ShowS # |
mallocManagedArray :: Storable a => [AttachFlag] -> Int -> IO (DevicePtr a) Source #
Allocates memory that will be automatically managed by the Unified Memory system. The returned pointer is valid on the CPU and on all GPUs which support managed memory. All accesses to this pointer must obey the Unified Memory programming model.
On a multi-GPU system with peer-to-peer support, where multiple GPUs support managed memory, the physical storage is created on the GPU which is active at the time mallocManagedArray is called. All other GPUs will access the array at reduced bandwidth via peer mapping over the PCIe bus. The Unified Memory system does not migrate memory between GPUs.
On a multi-GPU system where multiple GPUs support managed memory, but not all pairs of such GPUs have peer-to-peer support between them, the physical storage is allocated in system memory (zero-copy memory) and all GPUs will access the data at reduced bandwidth over the PCIe bus.
Requires CUDA-6.0
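A brief sketch, assuming CUDA 6.0 or later, a device that supports managed memory, an active context, and that the attach-flag constructor is named `Global` (mirroring CU_MEM_ATTACH_GLOBAL):

```haskell
import Foreign.CUDA.Driver.Marshal
import Foreign.CUDA.Ptr

-- Allocate managed memory visible to both host and device. The same
-- pointer may be accessed from the CPU, subject to the Unified Memory
-- programming model.
managedExample :: IO ()
managedExample = do
  arr <- mallocManagedArray [Global] 512 :: IO (DevicePtr Float)
  -- ... use from host and device ...
  free arr
```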
prefetchArrayAsync :: Storable a => DevicePtr a -> Int -> Maybe Device -> Maybe Stream -> IO () Source #
Pre-fetches the given number of elements to the specified destination device. If the specified device is Nothing, the data is pre-fetched to host memory. The pointer must refer to a memory range allocated with mallocManagedArray.
Requires CUDA-8.0.
attachArrayAsync :: forall a. Storable a => [AttachFlag] -> Stream -> DevicePtr a -> Int -> IO () Source #
Attach an array of the given number of elements to a stream asynchronously.
Since: 0.10.0.0
Marshalling
peekArray :: Storable a => Int -> DevicePtr a -> Ptr a -> IO () Source #
Copy a number of elements from the device to host memory. This is a synchronous operation.
peekArrayAsync :: Storable a => Int -> DevicePtr a -> HostPtr a -> Maybe Stream -> IO () Source #
Copy memory from the device asynchronously, possibly associated with a particular stream. The destination host memory must be page-locked.
peekArray2D Source #
:: Storable a | |
=> Int | width to copy (elements) |
-> Int | height to copy (elements) |
-> DevicePtr a | source array |
-> Int | source array width |
-> Int | source x-coordinate |
-> Int | source y-coordinate |
-> Ptr a | destination array |
-> Int | destination array width |
-> Int | destination x-coordinate |
-> Int | destination y-coordinate |
-> IO () |
Copy a 2D array from the device to the host.
peekArray2DAsync Source #
:: Storable a | |
=> Int | width to copy (elements) |
-> Int | height to copy (elements) |
-> DevicePtr a | source array |
-> Int | source array width |
-> Int | source x-coordinate |
-> Int | source y-coordinate |
-> HostPtr a | destination array |
-> Int | destination array width |
-> Int | destination x-coordinate |
-> Int | destination y-coordinate |
-> Maybe Stream | stream to associate to |
-> IO () |
Copy a 2D array from the device to the host asynchronously, possibly associated with a particular execution stream. The destination host memory must be page-locked.
peekListArray :: Storable a => Int -> DevicePtr a -> IO [a] Source #
Copy a number of elements from the device into a new Haskell list. Note that this requires two memory copies: first from the device into a heap-allocated array, and from there marshalled into a list.
pokeArray :: Storable a => Int -> Ptr a -> DevicePtr a -> IO () Source #
Copy a number of elements onto the device. This is a synchronous operation.
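The synchronous pair pokeArray/peekArray can be combined into a simple roundtrip, sketched here with ordinary (pageable) host memory and an assumed active context; the standard Foreign.Marshal.Array module is imported qualified to avoid name clashes:

```haskell
import qualified Foreign.Marshal.Array as F
import Foreign.CUDA.Driver.Marshal

-- Copy a list to the device and back again via raw host pointers.
roundtrip :: [Float] -> IO [Float]
roundtrip xs = do
  let n = length xs
  dev <- mallocArray n
  F.withArray xs (\src -> pokeArray n src dev)        -- host -> device
  ys  <- F.allocaArray n $ \dst -> do
           peekArray n dev dst                        -- device -> host
           F.peekArray n dst
  free dev
  return ys
```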
pokeArrayAsync :: Storable a => Int -> HostPtr a -> DevicePtr a -> Maybe Stream -> IO () Source #
Copy memory onto the device asynchronously, possibly associated with a particular stream. The source host memory must be page-locked.
pokeArray2D Source #
:: Storable a | |
=> Int | width to copy (elements) |
-> Int | height to copy (elements) |
-> Ptr a | source array |
-> Int | source array width |
-> Int | source x-coordinate |
-> Int | source y-coordinate |
-> DevicePtr a | destination array |
-> Int | destination array width |
-> Int | destination x-coordinate |
-> Int | destination y-coordinate |
-> IO () |
Copy a 2D array from the host to the device.
pokeArray2DAsync Source #
:: Storable a | |
=> Int | width to copy (elements) |
-> Int | height to copy (elements) |
-> HostPtr a | source array |
-> Int | source array width |
-> Int | source x-coordinate |
-> Int | source y-coordinate |
-> DevicePtr a | destination array |
-> Int | destination array width |
-> Int | destination x-coordinate |
-> Int | destination y-coordinate |
-> Maybe Stream | stream to associate to |
-> IO () |
Copy a 2D array from the host to the device asynchronously, possibly associated with a particular execution stream. The source host memory must be page-locked.
pokeListArray :: Storable a => [a] -> DevicePtr a -> IO () Source #
Write a list of storable elements into a device array. The device array must be sufficiently large to hold the entire list. This requires two marshalling operations.
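The list-based variants make the same roundtrip shorter, at the cost of the extra marshalling steps noted above (a sketch, assuming an active context):

```haskell
import Foreign.CUDA.Driver.Marshal

-- List -> device array -> list, letting the library do the host-side
-- marshalling in both directions.
roundtripList :: [Float] -> IO [Float]
roundtripList xs = do
  let n = length xs
  dev <- mallocArray n
  pokeListArray xs dev
  ys <- peekListArray n dev
  free dev
  return ys
```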
copyArray :: Storable a => Int -> DevicePtr a -> DevicePtr a -> IO () Source #
Copy the given number of elements from the first device array (source) to the second device array (destination). The copied areas may not overlap. This operation is asynchronous with respect to the host, but will never overlap with kernel execution.
copyArrayAsync :: Storable a => Int -> DevicePtr a -> DevicePtr a -> Maybe Stream -> IO () Source #
Copy the given number of elements from the first device array (source) to the second device array (destination). The copied areas may not overlap. The operation is asynchronous with respect to the host, and can be asynchronous to other device operations by associating it with a particular stream.
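A device-to-device copy on a private stream might be sketched as follows (assuming an active context; Foreign.CUDA.Driver.Stream provides create, block, and destroy):

```haskell
import Foreign.CUDA.Driver.Marshal
import qualified Foreign.CUDA.Driver.Stream as Stream
import Foreign.CUDA.Ptr

-- Issue a device-to-device copy asynchronously, then wait on the stream.
copyOnStream :: IO ()
copyOnStream = do
  src <- mallocArray 4096 :: IO (DevicePtr Float)
  dst <- mallocArray 4096
  st  <- Stream.create []
  copyArrayAsync 4096 src dst (Just st)
  Stream.block st           -- wait for the copy to complete
  Stream.destroy st
  free src
  free dst
```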
copyArray2D Source #
:: Storable a | |
=> Int | width to copy (elements) |
-> Int | height to copy (elements) |
-> DevicePtr a | source array |
-> Int | source array width |
-> Int | source x-coordinate |
-> Int | source y-coordinate |
-> DevicePtr a | destination array |
-> Int | destination array width |
-> Int | destination x-coordinate |
-> Int | destination y-coordinate |
-> IO () |
Copy a 2D array from the first device array (source) to the second device array (destination). The copied areas must not overlap. This operation is asynchronous with respect to the host, but will never overlap with kernel execution.
copyArray2DAsync Source #
:: Storable a | |
=> Int | width to copy (elements) |
-> Int | height to copy (elements) |
-> DevicePtr a | source array |
-> Int | source array width |
-> Int | source x-coordinate |
-> Int | source y-coordinate |
-> DevicePtr a | destination array |
-> Int | destination array width |
-> Int | destination x-coordinate |
-> Int | destination y-coordinate |
-> Maybe Stream | stream to associate to |
-> IO () |
Copy a 2D array from the first device array (source) to the second device array (destination). The copied areas may not overlap. The operation is asynchronous with respect to the host, and can be asynchronous to other device operations by associating it with a particular execution stream.
copyArrayPeer Source #
:: Storable a | |
=> Int | number of array elements |
-> DevicePtr a | |
-> Context | source array and context |
-> DevicePtr a | |
-> Context | destination array and context |
-> IO () |
Copies an array from device memory in one context to device memory in another context. Note that this function is asynchronous with respect to the host, but serialised with respect to all pending and future asynchronous work in the source and destination contexts. To avoid this synchronisation, use copyArrayPeerAsync instead.
Requires CUDA-4.0.
copyArrayPeerAsync Source #
:: Storable a | |
=> Int | number of array elements |
-> DevicePtr a | |
-> Context | source array and context |
-> DevicePtr a | |
-> Context | destination array and device context |
-> Maybe Stream | stream to associate with |
-> IO () |
Copies from device memory in one context to device memory in another context. Note that this function is asynchronous with respect to the host and all work in other streams and devices.
Requires CUDA-4.0.
Combined Allocation and Marshalling
newListArray :: Storable a => [a] -> IO (DevicePtr a) Source #
Write a list of storable elements into a newly allocated device array. This
is newListArrayLen
composed with fst
.
newListArrayLen :: Storable a => [a] -> IO (DevicePtr a, Int) Source #
Write a list of storable elements into a newly allocated device array, returning the device pointer together with the number of elements that were written. Note that this requires two memory copies: first from a Haskell list to a heap-allocated array, and from there onto the graphics device. The memory should be freed when no longer required.
withListArray :: Storable a => [a] -> (DevicePtr a -> IO b) -> IO b Source #
Temporarily store a list of elements in a newly allocated device array. An IO action is applied to the array, the result of which is returned. Similar to newListArray, this requires copying the data twice.
As with allocaArray, the memory is freed once the action completes, so you should not return the pointer from the action, and be wary of asynchronous kernel execution.
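The cautions above suggest synchronising inside the action before the array is released; a sketch, assuming an active context and a hypothetical kernel launch (elided as a comment):

```haskell
import Foreign.CUDA.Driver.Marshal
import Foreign.CUDA.Driver.Context (sync)

-- Marshal a list to the device only for the duration of the action, and
-- synchronise before withListArrayLen frees the device array.
withInput :: [Float] -> IO [Float]
withInput xs =
  withListArrayLen xs $ \n dev -> do
    -- ... launch a kernel reading and writing dev (n elements) ...
    sync                       -- device must be done before the array is freed
    peekListArray n dev
```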
withListArrayLen :: Storable a => [a] -> (Int -> DevicePtr a -> IO b) -> IO b Source #
A variant of withListArray which also supplies the number of elements in the array to the applied function.
Utility
memset :: Storable a => DevicePtr a -> Int -> a -> IO () Source #
Set a number of data elements to the specified value, which may be either 8-, 16-, or 32-bits wide.
memsetAsync :: Storable a => DevicePtr a -> Int -> a -> Maybe Stream -> IO () Source #
Set a number of data elements to the specified value, which may be either 8-, 16-, or 32-bits wide. The operation is asynchronous and may optionally be associated with a stream.
Requires CUDA-3.2.
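For example, a zero-fill on a private stream might look like this sketch (assuming CUDA 3.2 or later and an active context; Float is 32 bits wide, satisfying the element-width restriction):

```haskell
import Foreign.CUDA.Driver.Marshal
import qualified Foreign.CUDA.Driver.Stream as Stream
import Foreign.CUDA.Ptr

-- Asynchronously zero a device array, then wait on the stream.
zeroFill :: IO ()
zeroFill = do
  dev <- mallocArray 1024 :: IO (DevicePtr Float)
  st  <- Stream.create []
  memsetAsync dev 1024 0 (Just st)
  Stream.block st
  Stream.destroy st
  free dev
```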
getDevicePtr :: [AllocFlag] -> HostPtr a -> IO (DevicePtr a) Source #
Return the device pointer associated with a mapped, pinned host buffer, which was allocated with the DeviceMapped option by mallocHostArray.
Currently, no options are supported and this must be empty.
getBasePtr :: DevicePtr a -> IO (DevicePtr a, Int64) Source #
Return the base address and allocation size of the given device pointer.