cuda-0.10.0.0: FFI binding to the CUDA interface for programming NVIDIA GPUs

Copyright: [2009..2018] Trevor L. McDonell
License: BSD
Safe Haskell: None
Language: Haskell98

Foreign.CUDA.Driver

Description

This module defines an interface to the CUDA driver API. The Driver API is a lower-level interface to CUDA devices than that provided by the Runtime API. Using the Driver API, the programmer must deal explicitly with operations such as initialisation, context management, and loading (kernel) modules. Although more difficult to use initially, the Driver API provides more control over how CUDA is used. Furthermore, since it does not require compiling and linking the program with nvcc, the Driver API provides better inter-language compatibility.

The following is a short tutorial on using the Driver API. The steps can be copied into a file, or run directly in ghci, in which case ghci should be launched with the option -fno-ghci-sandbox. This is because CUDA maintains CPU-local state, so operations should always be run from a bound thread.

Using the Driver API

Before any operation can be performed, the Driver API must be initialised:

>>> import Foreign.CUDA.Driver
>>> initialise []

Next, we must select a GPU that we will execute operations on. Each GPU is assigned a unique identifier (beginning at zero). We can get a handle to a compute device at a given ordinal using the device operation. Given a device handle, we can query the properties of that device using props. The number of available CUDA-capable devices is given via count. For example:

>>> count
1
>>> dev0 <- device 0
>>> props dev0
DeviceProperties {deviceName = "GeForce GT 650M", computeCapability = 3.0, ...}

This package also includes the executable 'nvidia-device-query', which, when executed, displays the key properties of all available devices. See Foreign.CUDA.Driver.Device for additional operations to query the capabilities or status of a device.

Once you have chosen a device to use, the next step is to create a CUDA context. A context is associated with a particular device, and all operations, such as memory allocation and kernel execution, take place with respect to that context. For example, to create a new execution context on CUDA device 0:

>>> ctx <- create dev0 []

The second argument is a set of ContextFlags which control how the context behaves in various situations, for example, whether or not the CPU should actively spin when waiting for results from the GPU (SchedSpin), or to yield control to other threads instead (SchedYield).

The newly created context is now the active context, and all subsequent operations take place within that context. More than one context can be created per device, but resources, such as memory allocated in the GPU, are unique to each context. The module Foreign.CUDA.Driver.Context contains operations for managing multiple contexts. Some devices allow data to be shared between contexts without copying, see Foreign.CUDA.Driver.Context.Peer for more information.

Once the context is no longer needed, it should be destroyed in order to free up any resources that were allocated to it.

>>> destroy ctx

Each device also has a unique context which is used by the Runtime API. This context can be accessed with the module Foreign.CUDA.Driver.Context.Primary.

Executing kernels onto the GPU

Once the Driver API is initialised and an execution context is created on the GPU, we can begin to interact with it.

As an example, we'll step through executing the CUDA equivalent of the following Haskell function, which element-wise adds the elements of two arrays:

>>> vecAdd xs ys = zipWith (+) xs ys

The following CUDA kernel can be used to implement this on the GPU:

extern "C" __global__ void vecAdd(float *xs, float *ys, float *zs, int N)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;

    if ( ix < N ) {
        zs[ix] = xs[ix] + ys[ix];
    }
}

Here, the __global__ keyword marks the function as a kernel that should be executed on the GPU in data-parallel fashion. When we execute this function on the GPU, (at least) N threads will execute N individual instances of the kernel function vecAdd. Each thread will operate on a single element of each input array to create a single value in the result. See the CUDA programming guide for more details.

We can save this to a file vector_add.cu, and compile it using nvcc into a form that we can then load onto the GPU and execute:

$ nvcc --ptx vector_add.cu

The module Foreign.CUDA.Driver.Module contains functions for loading the resulting .ptx file (or .cubin files) into the running program.

>>> mdl <- loadFile "vector_add.ptx"

Once finished with the module, it is also a good idea to unload it.
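
For example, at the very end of the program, once no more kernels from the module are required (using unload from Foreign.CUDA.Driver.Module):

>>> unload mdl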

Modules may export kernel functions, global variables, and texture references. Before we can execute our function, we need to look it up in the module by name.

>>> vecAdd <- getFun mdl "vecAdd"

Given this reference to our kernel function, we are almost ready to execute it on the device using launchKernel, but first, we must create some data that we can execute the function on.

Transferring data to and from the GPU

GPUs typically have their own memory which is separate from the CPU's memory, and we need to explicitly copy data back and forth between these two regions. The module Foreign.CUDA.Driver.Marshal provides functions for allocating memory on the GPU, and copying data between the CPU and GPU, as well as directly between multiple GPUs.

For simplicity, we'll use standard Haskell lists for our input and output data structures. Note however that this will have significantly lower effective bandwidth than reading a single contiguous region of memory, so for most practical purposes you will want to use some kind of unboxed array.

>>> let xs = [1..1024]   :: [Float]
>>> let ys = [2,4..2048] :: [Float]

In CUDA, like C, all memory management is explicit, and arrays on the device must be explicitly allocated and freed. As mentioned previously, data transfer is also explicit. However, we do provide convenience functions for combined allocation and marshalling, as well as bracketed operations.

>>> xs_dev <- newListArray xs
>>> ys_dev <- newListArray ys
>>> zs_dev <- mallocArray 1024 :: IO (DevicePtr Float)

After executing the kernel (see next section), we transfer the result back to the host, and free the memory that was allocated on the GPU.

>>> zs <- peekListArray 1024 zs_dev
>>> free xs_dev
>>> free ys_dev
>>> free zs_dev

Piecing it all together

Finally, we have everything in place to execute our operation on the GPU. Launching a kernel on the GPU consists of creating many threads on the GPU which all execute the same function, and each thread has a unique identifier in the grid/block hierarchy which can be used to identify exactly which element this thread should process (the blockIdx and threadIdx parameters that we saw earlier, respectively).

To execute our function, we will use a grid of 4 blocks, each containing 256 threads. Thus, a total of 1024 threads will be launched, which will each compute a single element of the output array (recall that our input arrays each have 1024 elements). The module Foreign.CUDA.Analysis.Occupancy contains functions to help determine the ideal thread block size for a given kernel and GPU combination.

>>> launchKernel vecAdd (4,1,1) (256,1,1) 0 Nothing [VArg xs_dev, VArg ys_dev, VArg zs_dev, IArg 1024]

Note that kernel execution is asynchronous, so we should also wait for the operation to complete before attempting to read the results back.

>>> sync

And that's it!

Next steps

As mentioned at the end of the previous section, kernels on the GPU are executed asynchronously with respect to the host, and other operations such as data transfers can also be executed asynchronously. This allows the CPU to continue doing other work while the GPU is busy. Events can be used to check whether an operation has completed yet.

It is also possible to execute multiple kernels or data transfers concurrently with each other, by assigning those operations to different execution Streams. When used in conjunction with Events, operations are scheduled to execute only once all of their dependencies (in the form of Events) have been cleared.

See Foreign.CUDA.Driver.Event and Foreign.CUDA.Driver.Stream for more information on this topic.
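
As a rough sketch of how these pieces fit together, reusing the vecAdd kernel and device arrays from the tutorial (the Stream and Event operations shown are assumed to be provided by those modules; consult them for the exact names):

>>> import qualified Foreign.CUDA.Driver.Stream as Stream
>>> import qualified Foreign.CUDA.Driver.Event  as Event
>>> stream <- Stream.create []
>>> launchKernel vecAdd (4,1,1) (256,1,1) 0 (Just stream) [VArg xs_dev, VArg ys_dev, VArg zs_dev, IArg 1024]
>>> done <- Event.create []
>>> Event.record done (Just stream)
>>> Event.block done      -- wait until the kernel, and hence the event, has completed
>>> Event.destroy done
>>> Stream.destroy stream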

Documentation

newtype Context Source #

A device context

Constructors

Context (Ptr ()) 
Instances
Eq Context Source # 
Instance details

Defined in Foreign.CUDA.Driver.Context.Base

Methods

(==) :: Context -> Context -> Bool #

(/=) :: Context -> Context -> Bool #

Show Context Source # 
Instance details

Defined in Foreign.CUDA.Driver.Context.Base

create :: Device -> [ContextFlag] -> IO Context Source #

Create a new CUDA context and associate it with the calling thread. The context is created with a usage count of one, and the caller of create must call destroy when done using the context. If a context is already current to the thread, it is supplanted by the newly created context and must be restored by a subsequent call to pop.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1g65dc0012348bc84810e2103a40d8e2cf

attach :: Context -> [ContextFlag] -> IO () Source #

Deprecated: as of CUDA-4.0

Increments the usage count of the context. Note that no context flags are currently supported, so this parameter must be empty.

detach :: Context -> IO () Source #

Deprecated: as of CUDA-4.0

Detach the context, and destroy if no longer used

destroy :: Context -> IO () Source #

Destroy the specified context, regardless of how many threads it is current to. The context will be popped from the current thread's context stack, but if it is current on any other threads it will remain current to those threads, and attempts to access it will result in an error.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1g27a365aebb0eb548166309f58a1e8b8e

pop :: IO Context Source #

Pop the current CUDA context from the CPU thread. The context may then be attached to a different CPU thread by calling push.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1g2fac188026a062d92e91a8687d0a7902

push :: Context -> IO () Source #

Push the given context onto the CPU's thread stack of current contexts. The specified context becomes the CPU thread's current context, so all operations that operate on the current context are affected.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1gb02d4c850eb16f861fe5a29682cc90ba
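
For example, a context can be handed from one bound host thread to another by popping it on the first thread and pushing it on the second:

-- on the thread that currently owns the context
ctx <- pop

-- ... later, on a different bound thread ...
push ctx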

sync :: IO () Source #

Block until the device has completed all preceding requests. If the context was created with the SchedBlockingSync flag, the CPU thread will block until the GPU has finished its work.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1g7a54725f28d34b8c6299f0c6ca579616

type StreamPriority = Int Source #

Priority of an execution stream. Work submitted to a higher priority stream may preempt execution of work already executing in a lower priority stream. Lower numbers represent higher priorities.

data PeerFlag Source #

Possible option values for direct peer memory access

accessible :: Device -> Device -> IO Bool Source #

Queries if the first device can directly access the memory of the second. If direct access is possible, it can then be enabled with add.

Requires CUDA-4.0.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__PEER__ACCESS.html#group__CUDA__PEER__ACCESS_1g496bdaae1f632ebfb695b99d2c40f19e

add :: Context -> [PeerFlag] -> IO () Source #

If the devices of both the current and supplied contexts support unified addressing, then enable allocations in the supplied context to be accessible by the current context.

Note that access is unidirectional, and in order to access memory in the current context from the peer context, a separate symmetric call to add is required.

Requires CUDA-4.0.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__PEER__ACCESS.html#group__CUDA__PEER__ACCESS_1g0889ec6728e61c05ed359551d67b3f5a
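
For example, a minimal sketch that lets the current context (created on dev0, as in the tutorial) access memory allocated in a second context ctx1 on device dev1 (ctx1 and dev1 are assumed to have been created in the same way):

>>> import Control.Monad (when)
>>> canAccess <- accessible dev0 dev1
>>> when canAccess $ add ctx1 []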

remove :: Context -> IO () Source #

Disable direct memory access from the current context to the supplied peer context, and unregisters any registered allocations.

Requires CUDA-4.0.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__PEER__ACCESS.html#group__CUDA__PEER__ACCESS_1g5b4b6936ea868d4954ce4d841a3b4810

data Cache Source #

Device cache configuration preference

Instances
Enum Cache Source # 
Instance details

Defined in Foreign.CUDA.Driver.Context.Config

Eq Cache Source # 
Instance details

Defined in Foreign.CUDA.Driver.Context.Config

Methods

(==) :: Cache -> Cache -> Bool #

(/=) :: Cache -> Cache -> Bool #

Show Cache Source # 
Instance details

Defined in Foreign.CUDA.Driver.Context.Config

Methods

showsPrec :: Int -> Cache -> ShowS #

show :: Cache -> String #

showList :: [Cache] -> ShowS #

setLimit :: Limit -> Int -> IO () Source #

Specify the size of the call stack, for compute 2.0 devices.

Requires CUDA-3.1.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1g0651954dfb9788173e60a9af7201e65a
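
For example, to request a 4kB per-thread call stack (assuming StackSize is the corresponding constructor of Limit):

>>> setLimit StackSize 4096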

getCache :: IO Cache Source #

On devices where the L1 cache and shared memory use the same hardware resources, this function returns the preferred cache configuration for the current context.

Requires CUDA-3.2.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1g40b6b141698f76744dea6e39b9a25360

setCache :: Cache -> IO () Source #

On devices where the L1 cache and shared memory use the same hardware resources, this sets the preferred cache configuration for the current context. This is only a preference.

Any function configuration set via setCacheConfigFun will be preferred over this context-wide setting.

Requires CUDA-3.2.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1g54699acf7e2ef27279d013ca2095f4a3
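
For example, to ask that the larger share of the on-chip memory be devoted to shared memory rather than L1 cache (assuming PreferShared is the corresponding constructor of Cache):

>>> setCache PreferShared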

getSharedMem :: IO SharedMem Source #

Return the current size of the shared memory banks in the current context. On devices with configurable shared memory banks, setSharedMem can be used to change the configuration, so that subsequent kernel launches will by default use the new bank size. On devices without configurable shared memory, this function returns the fixed bank size of the hardware.

Requires CUDA-4.2

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1g17153a1b8b8c756f7ab8505686a4ad74

setSharedMem :: SharedMem -> IO () Source #

On devices with configurable shared memory banks, this function will set the context's shared memory bank size that will be used by default for subsequent kernel launches.

Changing the shared memory configuration between launches may insert a device synchronisation.

Shared memory bank size does not affect shared memory usage or kernel occupancy, but may have major effects on performance. Larger bank sizes allow for greater potential bandwidth to shared memory, but change the kinds of accesses which result in bank conflicts.

Requires CUDA-4.2

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1g2574235fa643f8f251bf7bc28fac3692

getStreamPriorityRange :: IO (StreamPriority, StreamPriority) Source #

Returns the numerical values that correspond to the greatest and least priority execution streams in the current context respectively. Stream priorities follow the convention that lower numerical values correspond to higher priorities. The range of meaningful stream priorities is given by the inclusive range [greatestPriority,leastPriority].

Requires CUDA-5.5.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1g137920ab61a71be6ce67605b9f294091
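
For example, a sketch that creates a stream at the highest available priority. The function createWithPriority is assumed to be provided by Foreign.CUDA.Driver.Stream; consult that module for the exact API.

>>> import qualified Foreign.CUDA.Driver.Stream as Stream
>>> (greatest, least) <- getStreamPriorityRange
>>> stream <- Stream.createWithPriority greatest []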

data DeviceProperties Source #

The properties of a compute device

Constructors

DeviceProperties 

Fields

data Compute Source #

GPU compute capability, major and minor revision number respectively.

Constructors

Compute !Int !Int 
Instances
Eq Compute Source # 
Instance details

Defined in Foreign.CUDA.Analysis.Device

Methods

(==) :: Compute -> Compute -> Bool #

(/=) :: Compute -> Compute -> Bool #

Ord Compute Source # 
Instance details

Defined in Foreign.CUDA.Analysis.Device

Show Compute Source # 
Instance details

Defined in Foreign.CUDA.Analysis.Device

data InitFlag Source #

Possible option flags for CUDA initialisation. Dummy instance until the API exports actual option values.

data DeviceAttribute Source #

Device attributes

Constructors

MaxThreadsPerBlock 
MaxBlockDimX 
MaxBlockDimY 
MaxBlockDimZ 
MaxGridDimX 
MaxGridDimY 
MaxGridDimZ 
MaxSharedMemoryPerBlock 
SharedMemoryPerBlock 
TotalConstantMemory 
WarpSize 
MaxPitch 
MaxRegistersPerBlock 
RegistersPerBlock 
ClockRate 
TextureAlignment 
GpuOverlap 
MultiprocessorCount 
KernelExecTimeout 
Integrated 
CanMapHostMemory 
ComputeMode 
MaximumTexture1dWidth 
MaximumTexture2dWidth 
MaximumTexture2dHeight 
MaximumTexture3dWidth 
MaximumTexture3dHeight 
MaximumTexture3dDepth 
MaximumTexture2dLayeredWidth 
MaximumTexture2dArrayWidth 
MaximumTexture2dLayeredHeight 
MaximumTexture2dArrayHeight 
MaximumTexture2dLayeredLayers 
MaximumTexture2dArrayNumslices 
SurfaceAlignment 
ConcurrentKernels 
EccEnabled 
PciBusId 
PciDeviceId 
TccDriver 
MemoryClockRate 
GlobalMemoryBusWidth 
L2CacheSize 
MaxThreadsPerMultiprocessor 
AsyncEngineCount 
UnifiedAddressing 
MaximumTexture1dLayeredWidth 
MaximumTexture1dLayeredLayers 
CanTex2dGather 
MaximumTexture2dGatherWidth 
MaximumTexture2dGatherHeight 
MaximumTexture3dWidthAlternate 
MaximumTexture3dHeightAlternate 
MaximumTexture3dDepthAlternate 
PciDomainId 
TexturePitchAlignment 
MaximumTexturecubemapWidth 
MaximumTexturecubemapLayeredWidth 
MaximumTexturecubemapLayeredLayers 
MaximumSurface1dWidth 
MaximumSurface2dWidth 
MaximumSurface2dHeight 
MaximumSurface3dWidth 
MaximumSurface3dHeight 
MaximumSurface3dDepth 
MaximumSurface1dLayeredWidth 
MaximumSurface1dLayeredLayers 
MaximumSurface2dLayeredWidth 
MaximumSurface2dLayeredHeight 
MaximumSurface2dLayeredLayers 
MaximumSurfacecubemapWidth 
MaximumSurfacecubemapLayeredWidth 
MaximumSurfacecubemapLayeredLayers 
MaximumTexture1dLinearWidth 
MaximumTexture2dLinearWidth 
MaximumTexture2dLinearHeight 
MaximumTexture2dLinearPitch 
MaximumTexture2dMipmappedWidth 
MaximumTexture2dMipmappedHeight 
ComputeCapabilityMajor 
ComputeCapabilityMinor 
MaximumTexture1dMipmappedWidth 
StreamPrioritiesSupported 
GlobalL1CacheSupported 
LocalL1CacheSupported 
MaxSharedMemoryPerMultiprocessor 
MaxRegistersPerMultiprocessor 
ManagedMemory 
MultiGpuBoard 
MultiGpuBoardGroupId 
HostNativeAtomicSupported 
SingleToDoublePrecisionPerfRatio 
PageableMemoryAccess 
ConcurrentManagedAccess 
ComputePreemptionSupported 
CanUseHostPointerForRegisteredMem 
CanUseStreamMemOps 
CanUse64BitStreamMemOps 
CanUseStreamWaitValueNor 
CooperativeLaunch 
CooperativeMultiDeviceLaunch 
MaxSharedMemoryPerBlockOptin 
CanFlushRemoteWrites 
HostRegisterSupported 
PageableMemoryAccessUsesHostPageTables 
DirectManagedMemAccessFromHost 
CU_DEVICE_ATTRIBUTE_MAX 

newtype Device Source #

A CUDA device

Constructors

Device CInt 
Instances
Eq Device Source # 
Instance details

Defined in Foreign.CUDA.Driver.Device

Methods

(==) :: Device -> Device -> Bool #

(/=) :: Device -> Device -> Bool #

Show Device Source # 
Instance details

Defined in Foreign.CUDA.Driver.Device

capability :: Device -> IO Compute Source #

Return the compute capability revision supported by the device

props :: Device -> IO DeviceProperties Source #

Return the properties of the selected device

data FunParam where Source #

Kernel function parameters

Constructors

IArg :: !Int32 -> FunParam 
FArg :: !Float -> FunParam 
VArg :: Storable a => !a -> FunParam 

newtype Fun Source #

A __global__ device function

Constructors

Fun (Ptr ()) 

setCacheConfigFun :: Fun -> Cache -> IO () Source #

On devices where the L1 cache and shared memory use the same hardware resources, this sets the preferred cache configuration for the given device function. This is only a preference; the driver is free to choose a different configuration as required to execute the function.

Switching between configuration modes may insert a device-side synchronisation point for streamed kernel launches.

Requires CUDA-3.0.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__EXEC.html#group__CUDA__EXEC_1g40f8c11e81def95dc0072a375f965681

setSharedMemConfigFun :: Fun -> SharedMem -> IO () Source #

Set the shared memory configuration of a device function.

On devices with configurable shared memory banks, this will force all subsequent launches of the given device function to use the specified shared memory bank size configuration. On launch of the function, the shared memory configuration of the device will be temporarily changed if needed to suit the function configuration. Changes in shared memory configuration may introduce a device-side synchronisation between kernel launches.

Any per-function configuration specified by setSharedMemConfigFun will override the context-wide configuration set with setSharedMem.

Changing the shared memory bank size will not increase shared memory usage or affect occupancy of kernels, but may have major effects on performance. Larger bank sizes will allow for greater potential bandwidth to shared memory, but will change what kinds of accesses to shared memory will result in bank conflicts.

This function will do nothing on devices with fixed shared memory bank size.

Requires CUDA-5.0.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__EXEC.html#group__CUDA__EXEC_1g430b913f24970e63869635395df6d9f5

launchKernel Source #

Arguments

:: Fun

function to execute

-> (Int, Int, Int)

block grid dimension

-> (Int, Int, Int)

thread block shape

-> Int

shared memory (bytes)

-> Maybe Stream

(optional) stream to execute in

-> [FunParam]

list of function parameters

-> IO () 

Invoke a kernel on a (gx * gy * gz) grid of blocks, where each block contains (tx * ty * tz) threads and has access to a given number of bytes of shared memory. The launch may also be associated with a specific Stream.

In launchKernel, the number of kernel parameters and their offsets and sizes do not need to be specified, as this information is retrieved directly from the kernel's image. This requires the kernel to have been compiled with toolchain version 3.2 or later.

The alternative launchKernel' will pass the arguments in directly, requiring the application to know the size and alignment/padding of each kernel parameter.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__EXEC.html#group__CUDA__EXEC_1gb8f3dc3031b40da29d5f9a7139e52e15

launchKernel' Source #

Arguments

:: Fun

function to execute

-> (Int, Int, Int)

block grid dimension

-> (Int, Int, Int)

thread block shape

-> Int

shared memory (bytes)

-> Maybe Stream

(optional) stream to execute in

-> [FunParam]

list of function parameters

-> IO () 

Invoke a kernel on a (gx * gy * gz) grid of blocks, where each block contains (tx * ty * tz) threads and has access to a given number of bytes of shared memory. The launch may also be associated with a specific Stream.

In launchKernel, the number of kernel parameters and their offsets and sizes do not need to be specified, as this information is retrieved directly from the kernel's image. This requires the kernel to have been compiled with toolchain version 3.2 or later.

The alternative launchKernel' will pass the arguments in directly, requiring the application to know the size and alignment/padding of each kernel parameter.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__EXEC.html#group__CUDA__EXEC_1gb8f3dc3031b40da29d5f9a7139e52e15

launchKernelCooperative Source #

Arguments

:: Fun

function to execute

-> (Int, Int, Int)

block grid dimension

-> (Int, Int, Int)

thread block shape

-> Int

shared memory (bytes)

-> Maybe Stream

(optional) stream to execute in

-> [FunParam]

list of function parameters

-> IO () 

Invoke a kernel on a (gx * gy * gz) grid of blocks, where each block contains (tx * ty * tz) threads and has access to a given number of bytes of shared memory. The launch may also be associated with a specific stream.

The thread blocks can cooperate and synchronise as they execute.

The device on which this kernel is invoked must have attribute CooperativeLaunch.

The total number of blocks launched can not exceed the maximum number of active thread blocks per multiprocessor (threadBlocksPerMP), multiplied by the number of multiprocessors (multiProcessorCount).

The kernel can not make use of dynamic parallelism.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__EXEC.html#group__CUDA__EXEC_1g06d753134145c4584c0c62525c1894cb

Requires CUDA-9.0

since 0.9.0.0

launch :: Fun -> (Int, Int) -> Maybe Stream -> IO () Source #

Deprecated: use launchKernel instead

Invoke the kernel on a size (w,h) grid of blocks. Each block contains the number of threads specified by a previous call to setBlockShape. The launch may also be associated with a specific Stream.

setBlockShape :: Fun -> (Int, Int, Int) -> IO () Source #

Deprecated: use launchKernel instead

Specify the (x,y,z) dimensions of the thread blocks that are created when the given kernel function is launched.

setSharedSize :: Fun -> Integer -> IO () Source #

Deprecated: use launchKernel instead

Set the number of bytes of dynamic shared memory to be available to each thread block when the function is launched

setParams :: Fun -> [FunParam] -> IO () Source #

Deprecated: use launchKernel instead

Set the parameters that will be specified the next time the kernel is invoked

mallocHostArray :: Storable a => [AllocFlag] -> Int -> IO (HostPtr a) Source #

Allocate a section of linear memory on the host which is page-locked and directly accessible from the device. The storage is sufficient to hold the given number of elements of a storable type.

Note that since the amount of pageable memory is thereby reduced, overall system performance may suffer. This is best used sparingly to allocate staging areas for data exchange.

Host memory allocated in this way is automatically and immediately accessible to all contexts on all devices which support unified addressing.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gdd8311286d2c2691605362c689bc64e0
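
For example, a sketch that stages results from the device through a page-locked host buffer, assuming zs_dev is the 1024-element device array from the tutorial and freeHost is the matching deallocation function:

zs_host <- mallocHostArray [] 1024 :: IO (HostPtr Float)
peekArrayAsync 1024 zs_dev zs_host Nothing    -- asynchronous copy into pinned memory
sync                                          -- wait for the copy to complete
freeHost zs_host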

mallocHostForeignPtr :: Storable a => [AllocFlag] -> Int -> IO (ForeignPtr a) Source #

As mallocHostArray, but return a ForeignPtr instead. The array will be deallocated automatically once the last reference to the ForeignPtr is dropped.

registerArray :: Storable a => [AllocFlag] -> Int -> Ptr a -> IO (HostPtr a) Source #

Page-locks the specified array (on the host) and maps it for the device(s) as specified by the given allocation flags. Subsequently, the memory is accessed directly by the device, so it can be read and written with much higher bandwidth than pageable memory that has not been registered. The memory range is added to the same tracking mechanism as mallocHostArray to automatically accelerate calls to functions such as pokeArray.

Note that page-locking excessive amounts of memory may degrade system performance, since it reduces the amount of pageable memory available. This is best used sparingly to allocate staging areas for data exchange.

This function has limited support on Mac OS X. OS 10.7 or later is required.

Requires CUDA-4.0.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gf0a9fe11544326dabd743b7aa6b54223

unregisterArray :: HostPtr a -> IO (Ptr a) Source #

Unmaps the memory from the given pointer, and makes it pageable again.

Requires CUDA-4.0.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g63f450c8125359be87b7623b1c0b2a14

mallocArray :: Storable a => Int -> IO (DevicePtr a) Source #

Allocate a section of linear memory on the device, and return a reference to it. The memory is sufficient to hold the given number of elements of storable type. It is suitably aligned for any type, and is not cleared.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gb82d2a09844a58dd9e744dc31e8aa467

allocaArray :: Storable a => Int -> (DevicePtr a -> IO b) -> IO b Source #

Execute a computation on the device, passing a pointer to a temporarily allocated block of memory sufficient to hold the given number of elements of storable type. The memory is freed when the computation terminates (normally or via an exception), so the pointer must not be used after this.

Note that kernel launches can be asynchronous, so you may want to add a synchronisation point using sync as part of the continuation.
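
For example, a sketch that computes the result of the tutorial's vecAdd kernel into a temporary device array (vecAdd, xs_dev, and ys_dev assumed in scope as above):

zs <- allocaArray 1024 $ \zs_dev -> do
        launchKernel vecAdd (4,1,1) (256,1,1) 0 Nothing [VArg xs_dev, VArg ys_dev, VArg zs_dev, IArg 1024]
        sync                                   -- kernel launches are asynchronous
        peekListArray 1024 zs_dev :: IO [Float]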

mallocManagedArray :: Storable a => [AttachFlag] -> Int -> IO (DevicePtr a) Source #

Allocates memory that will be automatically managed by the Unified Memory system. The returned pointer is valid on the CPU and on all GPUs which support managed memory. All accesses to this pointer must obey the Unified Memory programming model.

On a multi-GPU system with peer-to-peer support, where multiple GPUs support managed memory, the physical storage is created on the GPU which is active at the time mallocManagedArray is called. All other GPUs will access the array at reduced bandwidth via peer mapping over the PCIe bus. The Unified Memory system does not migrate memory between GPUs.

On a multi-GPU system where multiple GPUs support managed memory, but not all pairs of such GPUs have peer-to-peer support between them, the physical storage is allocated in system memory (zero-copy memory) and all GPUs will access the data at reduced bandwidth over the PCIe bus.

Requires CUDA-6.0

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gb347ded34dc326af404aa02af5388a32

prefetchArrayAsync :: Storable a => DevicePtr a -> Int -> Maybe Device -> Maybe Stream -> IO () Source #

Pre-fetches the given number of elements to the specified destination device. If the specified device is Nothing, the data is pre-fetched to host memory. The pointer must refer to a memory range allocated with mallocManagedArray.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__UNIFIED.html#group__CUDA__UNIFIED_1gfe94f8b7fb56291ebcea44261aa4cb84

Requires CUDA-8.0.
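
For example, a sketch that allocates a managed array and pre-fetches it to the device dev0 from the tutorial. The flag name AttachGlobal is an assumed spelling for the global attach flag; check the AttachFlag type for the exact constructor.

zs_managed <- mallocManagedArray [AttachGlobal] 1024 :: IO (DevicePtr Float)   -- AttachGlobal: assumed constructor name
prefetchArrayAsync zs_managed 1024 (Just dev0) Nothing                         -- migrate to dev0 ahead of use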

attachArrayAsync :: forall a. Storable a => [AttachFlag] -> Stream -> DevicePtr a -> Int -> IO () Source #

Attach an array of the given number of elements to a stream asynchronously

https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g6e468d680e263e7eba02a56643c50533

Since: 0.10.0.0

peekArray :: Storable a => Int -> DevicePtr a -> Ptr a -> IO () Source #

Copy a number of elements from the device to host memory. This is a synchronous operation.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g3480368ee0208a98f75019c9a8450893

peekArrayAsync :: Storable a => Int -> DevicePtr a -> HostPtr a -> Maybe Stream -> IO () Source #

Copy memory from the device asynchronously, possibly associated with a particular stream. The destination host memory must be page-locked.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g56f30236c7c5247f8e061b59d3268362

peekArray2D Source #

Arguments

:: Storable a 
=> Int

width to copy (elements)

-> Int

height to copy (elements)

-> DevicePtr a

source array

-> Int

source array width

-> Int

source x-coordinate

-> Int

source y-coordinate

-> Ptr a

destination array

-> Int

destination array width

-> Int

destination x-coordinate

-> Int

destination y-coordinate

-> IO () 

peekArray2DAsync Source #

Arguments

:: Storable a 
=> Int

width to copy (elements)

-> Int

height to copy (elements)

-> DevicePtr a

source array

-> Int

source array width

-> Int

source x-coordinate

-> Int

source y-coordinate

-> HostPtr a

destination array

-> Int

destination array width

-> Int

destination x-coordinate

-> Int

destination y-coordinate

-> Maybe Stream

stream to associate to

-> IO () 

Copy a 2D array from the device to the host asynchronously, possibly associated with a particular execution stream. The destination host memory must be page-locked.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g4acf155faeb969d9d21f5433d3d0f274

peekListArray :: Storable a => Int -> DevicePtr a -> IO [a] Source #

Copy a number of elements from the device into a new Haskell list. Note that this requires two memory copies: firstly from the device into a heap allocated array, and from there marshalled into a list.

pokeArray :: Storable a => Int -> Ptr a -> DevicePtr a -> IO () Source #

Copy a number of elements from host memory onto the device. This is a synchronous operation.

pokeArrayAsync :: Storable a => Int -> HostPtr a -> DevicePtr a -> Maybe Stream -> IO () Source #

Copy memory onto the device asynchronously, possibly associated with a particular stream. The source host memory must be page-locked.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g1572263fe2597d7ba4f6964597a354a3

pokeArray2D Source #

Arguments

:: Storable a 
=> Int

width to copy (elements)

-> Int

height to copy (elements)

-> Ptr a

source array

-> Int

source array width

-> Int

source x-coordinate

-> Int

source y-coordinate

-> DevicePtr a

destination array

-> Int

destination array width

-> Int

destination x-coordinate

-> Int

destination y-coordinate

-> IO () 

pokeArray2DAsync Source #

Arguments

:: Storable a 
=> Int

width to copy (elements)

-> Int

height to copy (elements)

-> HostPtr a

source array

-> Int

source array width

-> Int

source x-coordinate

-> Int

source y-coordinate

-> DevicePtr a

destination array

-> Int

destination array width

-> Int

destination x-coordinate

-> Int

destination y-coordinate

-> Maybe Stream

stream to associate to

-> IO () 

Copy a 2D array from the host to the device asynchronously, possibly associated with a particular execution stream. The source host memory must be page-locked.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g4acf155faeb969d9d21f5433d3d0f274

pokeListArray :: Storable a => [a] -> DevicePtr a -> IO () Source #

Write a list of storable elements into a device array. The device array must be sufficiently large to hold the entire list. This requires two marshalling operations.

copyArray :: Storable a => Int -> DevicePtr a -> DevicePtr a -> IO () Source #

Copy the given number of elements from the first device array (source) to the second device array (destination). The copied areas may not overlap. This operation is asynchronous with respect to the host, but will never overlap with kernel execution.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g1725774abf8b51b91945f3336b778c8b

copyArrayAsync :: Storable a => Int -> DevicePtr a -> DevicePtr a -> Maybe Stream -> IO () Source #

Copy the given number of elements from the first device array (source) to the second device array (destination). The copied areas may not overlap. The operation is asynchronous with respect to the host, and can be asynchronous to other device operations by associating it with a particular stream.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g39ea09ba682b8eccc9c3e0c04319b5c8

copyArray2D Source #

Arguments

:: Storable a 
=> Int

width to copy (elements)

-> Int

height to copy (elements)

-> DevicePtr a

source array

-> Int

source array width

-> Int

source x-coordinate

-> Int

source y-coordinate

-> DevicePtr a

destination array

-> Int

destination array width

-> Int

destination x-coordinate

-> Int

destination y-coordinate

-> IO () 

Copy a 2D array from the first device array (source) to the second device array (destination). The copied areas must not overlap. This operation is asynchronous with respect to the host, but will never overlap with kernel execution.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g27f885b30c34cc20a663a671dbf6fc27

copyArray2DAsync Source #

Arguments

:: Storable a 
=> Int

width to copy (elements)

-> Int

height to copy (elements)

-> DevicePtr a

source array

-> Int

source array width

-> Int

source x-coordinate

-> Int

source y-coordinate

-> DevicePtr a

destination array

-> Int

destination array width

-> Int

destination x-coordinate

-> Int

destination y-coordinate

-> Maybe Stream

stream to associate to

-> IO () 

Copy a 2D array from the first device array (source) to the second device array (destination). The copied areas may not overlap. The operation is asynchronous with respect to the host, and can be asynchronous to other device operations by associating it with a particular execution stream.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g4acf155faeb969d9d21f5433d3d0f274

copyArrayPeer Source #

Arguments

:: Storable a 
=> Int

number of array elements

-> DevicePtr a 
-> Context

source array and context

-> DevicePtr a 
-> Context

destination array and context

-> IO () 

Copies an array from device memory in one context to device memory in another context. Note that this function is asynchronous with respect to the host, but serialised with respect to all pending and future asynchronous work in the source and destination contexts. To avoid this synchronisation, use copyArrayPeerAsync instead.

Requires CUDA-4.0.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1ge1f5c7771544fee150ada8853c7cbf4a

copyArrayPeerAsync Source #

Arguments

:: Storable a 
=> Int

number of array elements

-> DevicePtr a 
-> Context

source array and context

-> DevicePtr a 
-> Context

destination array and device context

-> Maybe Stream

stream to associate with

-> IO () 

Copies from device memory in one context to device memory in another context. Note that this function is asynchronous with respect to the host and all work in other streams and devices.

Requires CUDA-4.0.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g82fcecb38018e64b98616a8ac30112f2

newListArrayLen :: Storable a => [a] -> IO (DevicePtr a, Int) Source #

Write a list of storable elements into a newly allocated device array, returning the device pointer together with the number of elements that were written. Note that this requires two memory copies: firstly from a Haskell list to a heap allocated array, and from there onto the graphics device. The memory should be freed when no longer required.

newListArray :: Storable a => [a] -> IO (DevicePtr a) Source #

Write a list of storable elements into a newly allocated device array. This is newListArrayLen composed with fst.

withListArray :: Storable a => [a] -> (DevicePtr a -> IO b) -> IO b Source #

Temporarily store a list of elements into a newly allocated device array. An IO action is applied to the array, the result of which is returned. Similar to newListArray, this requires copying the data twice.

As with allocaArray, the memory is freed once the action completes, so you should not return the pointer from the action, and be wary of asynchronous kernel execution.
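
For example, a bracketed version of the tutorial's vecAdd computation, in which all of the device memory is allocated and freed automatically (vecAdd, xs, and ys assumed in scope as above):

zs <- withListArray xs $ \xs_dev ->
      withListArray ys $ \ys_dev ->
      allocaArray 1024 $ \zs_dev -> do
        launchKernel vecAdd (4,1,1) (256,1,1) 0 Nothing [VArg xs_dev, VArg ys_dev, VArg zs_dev, IArg 1024]
        sync
        peekListArray 1024 zs_dev :: IO [Float]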

withListArrayLen :: Storable a => [a] -> (Int -> DevicePtr a -> IO b) -> IO b Source #

A variant of withListArray which also supplies the number of elements in the array to the applied function

getDevicePtr :: [AllocFlag] -> HostPtr a -> IO (DevicePtr a) Source #

Return the device pointer associated with a mapped, pinned host buffer, which was allocated with the DeviceMapped option by mallocHostArray.

Currently, no options are supported and this must be empty.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g57a39e5cba26af4d06be67fc77cc62f0

getMemInfo :: IO (Int64, Int64) Source #

Return the amount of free and total memory respectively available to the current context (bytes).

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g808f555540d0143a331cc42aa98835c0
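
For example:

>>> (freeMem, totalMem) <- getMemInfo
>>> totalMem - freeMem    -- bytes currently in use on the device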