cuda-0.10.0.0: FFI binding to the CUDA interface for programming NVIDIA GPUs

Copyright: [2009..2018] Trevor L. McDonell
License: BSD
Safe Haskell: None
Language: Haskell98

Foreign.CUDA.Driver

Description

This module defines an interface to the CUDA driver API. The Driver API is a lower-level interface to CUDA devices than that provided by the Runtime API. Using the Driver API, the programmer must deal explicitly with operations such as initialisation, context management, and loading (kernel) modules. Although more difficult to use initially, the Driver API provides more control over how CUDA is used. Furthermore, since it does not require compiling and linking the program with nvcc, the Driver API provides better inter-language compatibility.

The following is a short tutorial on using the Driver API. The steps can be copied into a file, or run directly in ghci, in which case ghci should be launched with the option -fno-ghci-sandbox. This is because CUDA maintains CPU-local state, so operations should always be run from a bound thread.

Using the Driver API

Before any operation can be performed, the Driver API must be initialised:

>>> import Foreign.CUDA.Driver
>>> initialise []

Next, we must select a GPU that we will execute operations on. Each GPU is assigned a unique identifier (beginning at zero). We can get a handle to a compute device at a given ordinal using the device operation. Given a device handle, we can query the properties of that device using props. The number of available CUDA-capable devices is given via count. For example:

>>> count
1
>>> dev0 <- device 0
>>> props dev0
DeviceProperties {deviceName = "GeForce GT 650M", computeCapability = 3.0, ...}

This package also includes the executable 'nvidia-device-query', which, when executed, displays the key properties of all available devices. See Foreign.CUDA.Driver.Device for additional operations to query the capabilities or status of a device.

Once you have chosen a device to use, the next step is to create a CUDA context. A context is associated with a particular device, and all operations, such as memory allocation and kernel execution, take place with respect to that context. For example, to create a new execution context on CUDA device 0:

>>> ctx <- create dev0 []

The second argument is a set of ContextFlags which control how the context behaves in various situations, for example, whether or not the CPU should actively spin when waiting for results from the GPU (SchedSpin), or to yield control to other threads instead (SchedYield).

The newly created context is now the active context, and all subsequent operations take place within that context. More than one context can be created per device, but resources, such as memory allocated in the GPU, are unique to each context. The module Foreign.CUDA.Driver.Context contains operations for managing multiple contexts. Some devices allow data to be shared between contexts without copying, see Foreign.CUDA.Driver.Context.Peer for more information.

Once the context is no longer needed, it should be destroyed in order to free up any resources that were allocated to it.

>>> destroy ctx

Each device also has a unique context which is used by the Runtime API. This context can be accessed with the module Foreign.CUDA.Driver.Context.Primary.

Executing kernels onto the GPU

Once the Driver API is initialised and an execution context is created on the GPU, we can begin to interact with it.

As an example, we'll step through executing the CUDA equivalent of the following Haskell function, which element-wise adds the elements of two arrays:

>>> vecAdd xs ys = zipWith (+) xs ys

The following CUDA kernel can be used to implement this on the GPU:

extern "C" __global__ void vecAdd(float *xs, float *ys, float *zs, int N)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;

    if ( ix < N ) {
        zs[ix] = xs[ix] + ys[ix];
    }
}

Here, the __global__ keyword marks the function as a kernel that should be executed on the GPU in data-parallel fashion. When we execute this function on the GPU, (at least) N threads will execute N individual instances of the kernel function vecAdd. Each thread will operate on a single element of each input array to create a single value in the result. See the CUDA programming guide for more details.

We can save this to a file vector_add.cu, and compile it using nvcc into a form that we can then load onto the GPU and execute:

$ nvcc --ptx vector_add.cu

The module Foreign.CUDA.Driver.Module contains functions for loading the resulting .ptx file (or .cubin files) into the running program.

>>> mdl <- loadFile "vector_add.ptx"

Once finished with the module, it is also a good idea to unload it.
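
For example, at the very end of the program, once no more kernels from the module are required (using unload from Foreign.CUDA.Driver.Module):

>>> unload mdl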

Modules may export kernel functions, global variables, and texture references. Before we can execute our function, we need to look it up in the module by name.

>>> vecAdd <- getFun mdl "vecAdd"

Given this reference to our kernel function, we are almost ready to execute it on the device using launchKernel, but first, we must create some data that we can execute the function on.

Transferring data to and from the GPU

GPUs typically have their own memory which is separate from the CPU's memory, and we need to explicitly copy data back and forth between these two regions. The module Foreign.CUDA.Driver.Marshal provides functions for allocating memory on the GPU, and copying data between the CPU and GPU, as well as directly between multiple GPUs.

For simplicity, we'll use standard Haskell lists for our input and output data structures. Note however that this will have significantly lower effective bandwidth than reading a single contiguous region of memory, so for most practical purposes you will want to use some kind of unboxed array.

>>> let xs = [1..1024]   :: [Float]
>>> let ys = [2,4..2048] :: [Float]

In CUDA, like C, all memory management is explicit, and arrays on the device must be explicitly allocated and freed. As mentioned previously, data transfer is also explicit. However, we do provide convenience functions for combined allocation and marshalling, as well as bracketed operations.

>>> xs_dev <- newListArray xs
>>> ys_dev <- newListArray ys
>>> zs_dev <- mallocArray 1024 :: IO (DevicePtr Float)

After executing the kernel (see next section), we transfer the result back to the host, and free the memory that was allocated on the GPU.

>>> zs <- peekListArray 1024 zs_dev
>>> free xs_dev
>>> free ys_dev
>>> free zs_dev

Piecing it all together

Finally, we have everything in place to execute our operation on the GPU. Launching a kernel on the GPU consists of creating many threads on the GPU which all execute the same function, and each thread has a unique identifier in the grid/block hierarchy which can be used to identify exactly which element this thread should process (the blockIdx and threadIdx parameters that we saw earlier, respectively).

To execute our function, we will use a grid of 4 blocks, each containing 256 threads. Thus, a total of 1024 threads will be launched, which will each compute a single element of the output array (recall that our input arrays each have 1024 elements). The module Foreign.CUDA.Analysis.Occupancy contains functions to help determine the ideal thread block size for a given kernel and GPU combination.

>>> launchKernel vecAdd (4,1,1) (256,1,1) 0 Nothing [VArg xs_dev, VArg ys_dev, VArg zs_dev, IArg 1024]

Note that kernel execution is asynchronous, so we should also wait for the operation to complete before attempting to read the results back.

>>> sync

And that's it!

Next steps

As mentioned at the end of the previous section, kernels on the GPU are executed asynchronously with respect to the host, and other operations such as data transfers can also be executed asynchronously. This allows the CPU to continue doing other work while the GPU is busy. Events can be used to check whether an operation has completed yet.

It is also possible to execute multiple kernels or data transfers concurrently with each other, by assigning those operations to different execution Streams. When used in conjunction with Events, operations are scheduled to execute only once all of their dependencies (in the form of Events) have been cleared.

See Foreign.CUDA.Driver.Event and Foreign.CUDA.Driver.Stream for more information on this topic.
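
As a rough sketch of how these pieces fit together, reusing the vecAdd kernel and device arrays from the tutorial (the Stream and Event operations shown are assumed to be provided by those modules; consult them for the exact names):

>>> import qualified Foreign.CUDA.Driver.Stream as Stream
>>> import qualified Foreign.CUDA.Driver.Event  as Event
>>> stream <- Stream.create []
>>> launchKernel vecAdd (4,1,1) (256,1,1) 0 (Just stream) [VArg xs_dev, VArg ys_dev, VArg zs_dev, IArg 1024]
>>> done <- Event.create []
>>> Event.record done (Just stream)
>>> Event.block done      -- wait until the kernel, and hence the event, has completed
>>> Event.destroy done
>>> Stream.destroy stream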

Documentation

newtype Context Source #

A device context

Constructors

Context (Ptr ()) 
Instances
Eq Context Source # 
Instance details

Defined in Foreign.CUDA.Driver.Context.Base

Methods

(==) :: Context -> Context -> Bool #

(/=) :: Context -> Context -> Bool #

Show Context Source # 
Instance details

Defined in Foreign.CUDA.Driver.Context.Base

create :: Device -> [ContextFlag] -> IO Context Source #

Create a new CUDA context and associate it with the calling thread. The context is created with a usage count of one, and the caller of create must call destroy when done using the context. If a context is already current to the thread, it is supplanted by the newly created context and must be restored by a subsequent call to pop.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1g65dc0012348bc84810e2103a40d8e2cf

attach :: Context -> [ContextFlag] -> IO () Source #

Deprecated: as of CUDA-4.0

Increments the usage count of the context. Note that no context flags are currently supported, so this parameter must be empty.

detach :: Context -> IO () Source #

Deprecated: as of CUDA-4.0

Detach the context, and destroy if no longer used

destroy :: Context -> IO () Source #

Destroy the specified context, regardless of how many threads it is current to. The context will be popped from the current thread's context stack, but if it is current on any other threads it will remain current to those threads, and attempts to access it will result in an error.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1g27a365aebb0eb548166309f58a1e8b8e

pop :: IO Context Source #

Pop the current CUDA context from the CPU thread. The context may then be attached to a different CPU thread by calling push.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1g2fac188026a062d92e91a8687d0a7902

push :: Context -> IO () Source #

Push the given context onto the CPU's thread stack of current contexts. The specified context becomes the CPU thread's current context, so all operations that operate on the current context are affected.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1gb02d4c850eb16f861fe5a29682cc90ba
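
For example, a context can be handed from one bound host thread to another by popping it on the first thread and pushing it on the second:

-- on the thread that currently owns the context
ctx <- pop

-- ... later, on a different bound thread ...
push ctx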

sync :: IO () Source #

Block until the device has completed all preceding requests. If the context was created with the SchedBlockingSync flag, the CPU thread will block until the GPU has finished its work.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1g7a54725f28d34b8c6299f0c6ca579616

type StreamPriority = Int Source #

Priority of an execution stream. Work submitted to a higher priority stream may preempt execution of work already executing in a lower priority stream. Lower numbers represent higher priorities.

data PeerFlag Source #

Possible option values for direct peer memory access

accessible :: Device -> Device -> IO Bool Source #

Queries if the first device can directly access the memory of the second. If direct access is possible, it can then be enabled with add.

Requires CUDA-4.0.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__PEER__ACCESS.html#group__CUDA__PEER__ACCESS_1g496bdaae1f632ebfb695b99d2c40f19e

add :: Context -> [PeerFlag] -> IO () Source #

If the devices of both the current and supplied contexts support unified addressing, then enable allocations in the supplied context to be accessible by the current context.

Note that access is unidirectional, and in order to access memory in the current context from the peer context, a separate symmetric call to add is required.

Requires CUDA-4.0.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__PEER__ACCESS.html#group__CUDA__PEER__ACCESS_1g0889ec6728e61c05ed359551d67b3f5a
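
For example, a minimal sketch that lets the current context (created on dev0, as in the tutorial) access memory allocated in a second context ctx1 on device dev1 (ctx1 and dev1 are assumed to have been created in the same way):

>>> import Control.Monad (when)
>>> canAccess <- accessible dev0 dev1
>>> when canAccess $ add ctx1 []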

remove :: Context -> IO () Source #

Disable direct memory access from the current context to the supplied peer context, and unregisters any registered allocations.

Requires CUDA-4.0.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__PEER__ACCESS.html#group__CUDA__PEER__ACCESS_1g5b4b6936ea868d4954ce4d841a3b4810

data Cache Source #

Device cache configuration preference

Instances
Enum Cache Source # 
Instance details

Defined in Foreign.CUDA.Driver.Context.Config

Eq Cache Source # 
Instance details

Defined in Foreign.CUDA.Driver.Context.Config

Methods

(==) :: Cache -> Cache -> Bool #

(/=) :: Cache -> Cache -> Bool #

Show Cache Source # 
Instance details

Defined in Foreign.CUDA.Driver.Context.Config

Methods

showsPrec :: Int -> Cache -> ShowS #

show :: Cache -> String #

showList :: [Cache] -> ShowS #

setLimit :: Limit -> Int -> IO () Source #

Specify the size of the call stack, for compute 2.0 devices.

Requires CUDA-3.1.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1g0651954dfb9788173e60a9af7201e65a
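
For example, to request a 4kB per-thread call stack (assuming StackSize is the corresponding constructor of Limit):

>>> setLimit StackSize 4096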

getCache :: IO Cache Source #

On devices where the L1 cache and shared memory use the same hardware resources, this function returns the preferred cache configuration for the current context.

Requires CUDA-3.2.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1g40b6b141698f76744dea6e39b9a25360

setCache :: Cache -> IO () Source #

On devices where the L1 cache and shared memory use the same hardware resources, this sets the preferred cache configuration for the current context. This is only a preference.

Any function configuration set via setCacheConfigFun will be preferred over this context-wide setting.

Requires CUDA-3.2.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1g54699acf7e2ef27279d013ca2095f4a3
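
For example, to ask that the larger share of the on-chip memory be devoted to shared memory rather than L1 cache (assuming PreferShared is the corresponding constructor of Cache):

>>> setCache PreferShared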

getSharedMem :: IO SharedMem Source #

Return the current size of the shared memory banks in the current context. On devices with configurable shared memory banks, setSharedMem can be used to change the configuration, so that subsequent kernel launches will by default use the new bank size. On devices without configurable shared memory, this function returns the fixed bank size of the hardware.

Requires CUDA-4.2

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1g17153a1b8b8c756f7ab8505686a4ad74

setSharedMem :: SharedMem -> IO () Source #

On devices with configurable shared memory banks, this function will set the context's shared memory bank size that will be used by default for subsequent kernel launches.

Changing the shared memory configuration between launches may insert a device synchronisation.

Shared memory bank size does not affect shared memory usage or kernel occupancy, but may have major effects on performance. Larger bank sizes allow for greater potential bandwidth to shared memory, but change the kinds of accesses which result in bank conflicts.

Requires CUDA-4.2

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1g2574235fa643f8f251bf7bc28fac3692

getStreamPriorityRange :: IO (StreamPriority, StreamPriority) Source #

Returns the numerical values that correspond to the greatest and least priority execution streams in the current context respectively. Stream priorities follow the convention that lower numerical values correspond to higher priorities. The range of meaningful stream priorities is given by the inclusive range [greatestPriority,leastPriority].

Requires CUDA-5.5.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1g137920ab61a71be6ce67605b9f294091
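
For example, a sketch that creates a stream at the highest available priority. The function createWithPriority is assumed to be provided by Foreign.CUDA.Driver.Stream; consult that module for the exact API.

>>> import qualified Foreign.CUDA.Driver.Stream as Stream
>>> (greatest, least) <- getStreamPriorityRange
>>> stream <- Stream.createWithPriority greatest []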

data DeviceProperties Source #

The properties of a compute device

Constructors

DeviceProperties 

Fields

data Compute Source #

GPU compute capability, major and minor revision number respectively.

Constructors

Compute !Int !Int 
Instances
Eq Compute Source # 
Instance details

Defined in Foreign.CUDA.Analysis.Device

Methods

(==) :: Compute -> Compute -> Bool #

(/=) :: Compute -> Compute -> Bool #

Ord Compute Source # 
Instance details

Defined in Foreign.CUDA.Analysis.Device

Show Compute Source # 
Instance details

Defined in Foreign.CUDA.Analysis.Device

data InitFlag Source #

Possible option flags for CUDA initialisation. Dummy instance until the API exports actual option values.

data DeviceAttribute Source #

Device attributes

Constructors

MaxThreadsPerBlock 
MaxBlockDimX 
MaxBlockDimY 
MaxBlockDimZ 
MaxGridDimX 
MaxGridDimY 
MaxGridDimZ 
MaxSharedMemoryPerBlock 
SharedMemoryPerBlock 
TotalConstantMemory 
WarpSize 
MaxPitch 
MaxRegistersPerBlock 
RegistersPerBlock 
ClockRate 
TextureAlignment 
GpuOverlap 
MultiprocessorCount 
KernelExecTimeout 
Integrated 
CanMapHostMemory 
ComputeMode 
MaximumTexture1dWidth 
MaximumTexture2dWidth 
MaximumTexture2dHeight 
MaximumTexture3dWidth 
MaximumTexture3dHeight 
MaximumTexture3dDepth 
MaximumTexture2dLayeredWidth 
MaximumTexture2dArrayWidth 
MaximumTexture2dLayeredHeight 
MaximumTexture2dArrayHeight 
MaximumTexture2dLayeredLayers 
MaximumTexture2dArrayNumslices 
SurfaceAlignment 
ConcurrentKernels 
EccEnabled 
PciBusId 
PciDeviceId 
TccDriver 
MemoryClockRate 
GlobalMemoryBusWidth 
L2CacheSize 
MaxThreadsPerMultiprocessor 
AsyncEngineCount 
UnifiedAddressing 
MaximumTexture1dLayeredWidth 
MaximumTexture1dLayeredLayers 
CanTex2dGather 
MaximumTexture2dGatherWidth 
MaximumTexture2dGatherHeight 
MaximumTexture3dWidthAlternate 
MaximumTexture3dHeightAlternate 
MaximumTexture3dDepthAlternate 
PciDomainId 
TexturePitchAlignment 
MaximumTexturecubemapWidth 
MaximumTexturecubemapLayeredWidth 
MaximumTexturecubemapLayeredLayers 
MaximumSurface1dWidth 
MaximumSurface2dWidth 
MaximumSurface2dHeight 
MaximumSurface3dWidth 
MaximumSurface3dHeight 
MaximumSurface3dDepth 
MaximumSurface1dLayeredWidth 
MaximumSurface1dLayeredLayers 
MaximumSurface2dLayeredWidth 
MaximumSurface2dLayeredHeight 
MaximumSurface2dLayeredLayers 
MaximumSurfacecubemapWidth 
MaximumSurfacecubemapLayeredWidth 
MaximumSurfacecubemapLayeredLayers 
MaximumTexture1dLinearWidth 
MaximumTexture2dLinearWidth 
MaximumTexture2dLinearHeight 
MaximumTexture2dLinearPitch 
MaximumTexture2dMipmappedWidth 
MaximumTexture2dMipmappedHeight 
ComputeCapabilityMajor 
ComputeCapabilityMinor 
MaximumTexture1dMipmappedWidth 
StreamPrioritiesSupported 
GlobalL1CacheSupported 
LocalL1CacheSupported 
MaxSharedMemoryPerMultiprocessor 
MaxRegistersPerMultiprocessor 
ManagedMemory 
MultiGpuBoard 
MultiGpuBoardGroupId 
HostNativeAtomicSupported 
SingleToDoublePrecisionPerfRatio 
PageableMemoryAccess 
ConcurrentManagedAccess 
ComputePreemptionSupported 
CanUseHostPointerForRegisteredMem 
CanUseStreamMemOps 
CanUse64BitStreamMemOps 
CanUseStreamWaitValueNor 
CooperativeLaunch 
CooperativeMultiDeviceLaunch 
MaxSharedMemoryPerBlockOptin 
CanFlushRemoteWrites 
HostRegisterSupported 
PageableMemoryAccessUsesHostPageTables 
DirectManagedMemAccessFromHost 
CU_DEVICE_ATTRIBUTE_MAX 

newtype Device Source #

A CUDA device

Constructors

Device CInt 
Instances
Eq Device Source # 
Instance details

Defined in Foreign.CUDA.Driver.Device

Methods

(==) :: Device -> Device -> Bool #

(/=) :: Device -> Device -> Bool #

Show Device Source # 
Instance details

Defined in Foreign.CUDA.Driver.Device

capability :: Device -> IO Compute Source #

Return the compute capability revision supported by the device

props :: Device -> IO DeviceProperties Source #

Return the properties of the selected device

data FunParam where Source #

Kernel function parameters

Constructors

IArg :: !Int32 -> FunParam 
FArg :: !Float -> FunParam 
VArg :: Storable a => !a -> FunParam 

newtype Fun Source #

A __global__ device function

Constructors

Fun (Ptr ()) 

setCacheConfigFun :: Fun -> Cache -> IO () Source #

On devices where the L1 cache and shared memory use the same hardware resources, this sets the preferred cache configuration for the given device function. This is only a preference; the driver is free to choose a different configuration as required to execute the function.

Switching between configuration modes may insert a device-side synchronisation point for streamed kernel launches.

Requires CUDA-3.0.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__EXEC.html#group__CUDA__EXEC_1g40f8c11e81def95dc0072a375f965681

setSharedMemConfigFun :: Fun -> SharedMem -> IO () Source #

Set the shared memory configuration of a device function.

On devices with configurable shared memory banks, this will force all subsequent launches of the given device function to use the specified shared memory bank size configuration. On launch of the function, the shared memory configuration of the device will be temporarily changed if needed to suit the function configuration. Changes in shared memory configuration may introduce a device-side synchronisation between kernel launches.

Any per-function configuration specified by setSharedMemConfigFun will override the context-wide configuration set with setSharedMem.

Changing the shared memory bank size will not increase shared memory usage or affect occupancy of kernels, but may have major effects on performance. Larger bank sizes will allow for greater potential bandwidth to shared memory, but will change what kinds of accesses to shared memory will result in bank conflicts.

This function will do nothing on devices with fixed shared memory bank size.

Requires CUDA-5.0.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__EXEC.html#group__CUDA__EXEC_1g430b913f24970e63869635395df6d9f5

launchKernel Source #

Arguments

:: Fun

function to execute

-> (Int, Int, Int)

block grid dimension

-> (Int, Int, Int)

thread block shape

-> Int

shared memory (bytes)

-> Maybe Stream

(optional) stream to execute in

-> [FunParam]

list of function parameters

-> IO () 

Invoke a kernel on a (gx * gy * gz) grid of blocks, where each block contains (tx * ty * tz) threads and has access to a given number of bytes of shared memory. The launch may also be associated with a specific Stream.

In launchKernel, the number of kernel parameters and their offsets and sizes do not need to be specified, as this information is retrieved directly from the kernel's image. This requires the kernel to have been compiled with toolchain version 3.2 or later.

The alternative launchKernel' will pass the arguments in directly, requiring the application to know the size and alignment/padding of each kernel parameter.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__EXEC.html#group__CUDA__EXEC_1gb8f3dc3031b40da29d5f9a7139e52e15

launchKernel' Source #

Arguments

:: Fun

function to execute

-> (Int, Int, Int)

block grid dimension

-> (Int, Int, Int)

thread block shape

-> Int

shared memory (bytes)

-> Maybe Stream

(optional) stream to execute in

-> [FunParam]

list of function parameters

-> IO () 

Invoke a kernel on a (gx * gy * gz) grid of blocks, where each block contains (tx * ty * tz) threads and has access to a given number of bytes of shared memory. The launch may also be associated with a specific Stream.

In launchKernel, the number of kernel parameters and their offsets and sizes do not need to be specified, as this information is retrieved directly from the kernel's image. This requires the kernel to have been compiled with toolchain version 3.2 or later.

The alternative launchKernel' will pass the arguments in directly, requiring the application to know the size and alignment/padding of each kernel parameter.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__EXEC.html#group__CUDA__EXEC_1gb8f3dc3031b40da29d5f9a7139e52e15

launchKernelCooperative Source #

Arguments

:: Fun

function to execute

-> (Int, Int, Int)

block grid dimension

-> (Int, Int, Int)

thread block shape

-> Int

shared memory (bytes)

-> Maybe Stream

(optional) stream to execute in

-> [FunParam]

list of function parameters

-> IO () 

Invoke a kernel on a (gx * gy * gz) grid of blocks, where each block contains (tx * ty * tz) threads and has access to a given number of bytes of shared memory. The launch may also be associated with a specific stream.

The thread blocks can cooperate and synchronise as they execute.

The device on which this kernel is invoked must have attribute CooperativeLaunch.

The total number of blocks launched can not exceed the maximum number of active thread blocks per multiprocessor (threadBlocksPerMP), multiplied by the number of multiprocessors (multiProcessorCount).

The kernel can not make use of dynamic parallelism.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__EXEC.html#group__CUDA__EXEC_1g06d753134145c4584c0c62525c1894cb

Requires CUDA-9.0

since 0.9.0.0

launch :: Fun -> (Int, Int) -> Maybe Stream -> IO () Source #

Deprecated: use launchKernel instead

Invoke the kernel on a size (w,h) grid of blocks. Each block contains the number of threads specified by a previous call to setBlockShape. The launch may also be associated with a specific Stream.

setBlockShape :: Fun -> (Int, Int, Int) -> IO () Source #

Deprecated: use launchKernel instead

Specify the (x,y,z) dimensions of the thread blocks that are created when the given kernel function is launched.

setSharedSize :: Fun -> Integer -> IO () Source #

Deprecated: use launchKernel instead

Set the number of bytes of dynamic shared memory to be available to each thread block when the function is launched

setParams :: Fun -> [FunParam] -> IO () Source #

Deprecated: use launchKernel instead

Set the parameters that will be specified the next time the kernel is invoked

mallocHostArray :: Storable a => [AllocFlag] -> Int -> IO (HostPtr a) Source #

Allocate a section of linear memory on the host which is page-locked and directly accessible from the device. The storage is sufficient to hold the given number of elements of a storable type.

Note that since the amount of pageable memory is thereby reduced, overall system performance may suffer. This is best used sparingly to allocate staging areas for data exchange.

Host memory allocated in this way is automatically and immediately accessible to all contexts on all devices which support unified addressing.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gdd8311286d2c2691605362c689bc64e0
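
For example, a sketch that stages results from the device through a page-locked host buffer, assuming zs_dev is the 1024-element device array from the tutorial and freeHost is the matching deallocation function:

zs_host <- mallocHostArray [] 1024 :: IO (HostPtr Float)
peekArrayAsync 1024 zs_dev zs_host Nothing    -- asynchronous copy into pinned memory
sync                                          -- wait for the copy to complete
freeHost zs_host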

mallocHostForeignPtr :: Storable a => [AllocFlag] -> Int -> IO (ForeignPtr a) Source #

As mallocHostArray, but return a ForeignPtr instead. The array will be deallocated automatically once the last reference to the ForeignPtr is dropped.

registerArray :: Storable a => [AllocFlag] -> Int -> Ptr a -> IO (HostPtr a) Source #

Page-locks the specified array (on the host) and maps it for the device(s) as specified by the given allocation flags. Subsequently, the memory is accessed directly by the device, so it can be read and written with much higher bandwidth than pageable memory that has not been registered. The memory range is added to the same tracking mechanism as mallocHostArray to automatically accelerate calls to functions such as pokeArray.

Note that page-locking excessive amounts of memory may degrade system performance, since it reduces the amount of pageable memory available. This is best used sparingly to allocate staging areas for data exchange.

This function has limited support on Mac OS X. OS 10.7 or later is required.

Requires CUDA-4.0.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gf0a9fe11544326dabd743b7aa6b54223

unregisterArray :: HostPtr a -> IO (Ptr a) Source #

Unmaps the memory from the given pointer, and makes it pageable again.

Requires CUDA-4.0.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g63f450c8125359be87b7623b1c0b2a14

mallocArray :: Storable a => Int -> IO (DevicePtr a) Source #

Allocate a section of linear memory on the device, and return a reference to it. The memory is sufficient to hold the given number of elements of storable type. It is suitably aligned for any type, and is not cleared.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gb82d2a09844a58dd9e744dc31e8aa467

allocaArray :: Storable a => Int -> (DevicePtr a -> IO b) -> IO b Source #

Execute a computation on the device, passing a pointer to a temporarily allocated block of memory sufficient to hold the given number of elements of storable type. The memory is freed when the computation terminates (normally or via an exception), so the pointer must not be used after this.

Note that kernel launches can be asynchronous, so you may want to add a synchronisation point using sync as part of the continuation.
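
For example, a sketch that computes the result of the tutorial's vecAdd kernel into a temporary device array (vecAdd, xs_dev, and ys_dev assumed in scope as above):

zs <- allocaArray 1024 $ \zs_dev -> do
        launchKernel vecAdd (4,1,1) (256,1,1) 0 Nothing [VArg xs_dev, VArg ys_dev, VArg zs_dev, IArg 1024]
        sync                                   -- kernel launches are asynchronous
        peekListArray 1024 zs_dev :: IO [Float]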

mallocManagedArray :: Storable a => [AttachFlag] -> Int -> IO (DevicePtr a) Source #

Allocates memory that will be automatically managed by the Unified Memory system. The returned pointer is valid on the CPU and on all GPUs which support managed memory. All accesses to this pointer must obey the Unified Memory programming model.

On a multi-GPU system with peer-to-peer support, where multiple GPUs support managed memory, the physical storage is created on the GPU which is active at the time mallocManagedArray is called. All other GPUs will access the array at reduced bandwidth via peer mapping over the PCIe bus. The Unified Memory system does not migrate memory between GPUs.

On a multi-GPU system where multiple GPUs support managed memory, but not all pairs of such GPUs have peer-to-peer support between them, the physical storage is allocated in system memory (zero-copy memory) and all GPUs will access the data at reduced bandwidth over the PCIe bus.

Requires CUDA-6.0

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gb347ded34dc326af404aa02af5388a32

prefetchArrayAsync :: Storable a => DevicePtr a -> Int -> Maybe Device -> Maybe Stream -> IO () Source #

Pre-fetches the given number of elements to the specified destination device. If the specified device is Nothing, the data is pre-fetched to host memory. The pointer must refer to a memory range allocated with mallocManagedArray.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__UNIFIED.html#group__CUDA__UNIFIED_1gfe94f8b7fb56291ebcea44261aa4cb84

Requires CUDA-8.0.
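
For example, a sketch that allocates a managed array and pre-fetches it to the device dev0 from the tutorial. The flag name AttachGlobal is an assumed spelling for the global attach flag; check the AttachFlag type for the exact constructor.

zs_managed <- mallocManagedArray [AttachGlobal] 1024 :: IO (DevicePtr Float)   -- AttachGlobal: assumed constructor name
prefetchArrayAsync zs_managed 1024 (Just dev0) Nothing                         -- migrate to dev0 ahead of use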

attachArrayAsync :: forall a. Storable a => [AttachFlag] -> Stream -> DevicePtr a -> Int -> IO () Source #

Attach an array of the given number of elements to a stream asynchronously

https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g6e468d680e263e7eba02a56643c50533

Since: 0.10.0.0

peekArray :: Storable a => Int -> DevicePtr a -> Ptr a -> IO () Source #

Copy a number of elements from the device to host memory. This is a synchronous operation.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g3480368ee0208a98f75019c9a8450893

peekArrayAsync :: Storable a => Int -> DevicePtr a -> HostPtr a -> Maybe Stream -> IO () Source #

Copy memory from the device asynchronously, possibly associated with a particular stream. The destination host memory must be page-locked.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g56f30236c7c5247f8e061b59d3268362

peekArray2D Source #

Arguments

:: Storable a 
=> Int

width to copy (elements)

-> Int

height to copy (elements)

-> DevicePtr a

source array

-> Int

source array width

-> Int

source x-coordinate

-> Int

source y-coordinate

-> Ptr a

destination array

-> Int

destination array width

-> Int

destination x-coordinate

-> Int

destination y-coordinate

-> IO () 

peekArray2DAsync Source #

Arguments

:: Storable a 
=> Int

width to copy (elements)

-> Int

height to copy (elements)

-> DevicePtr a

source array

-> Int

source array width

-> Int

source x-coordinate

-> Int

source y-coordinate

-> HostPtr a

destination array

-> Int

destination array width

-> Int

destination x-coordinate

-> Int

destination y-coordinate

-> Maybe Stream

stream to associate to

-> IO () 

Copy a 2D array from the device to the host asynchronously, possibly associated with a particular execution stream. The destination host memory must be page-locked.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g4acf155faeb969d9d21f5433d3d0f274

peekListArray :: Storable a => Int -> DevicePtr a -> IO [a] Source #

Copy a number of elements from the device into a new Haskell list. Note that this requires two memory copies: firstly from the device into a heap allocated array, and from there marshalled into a list.

pokeArray :: Storable a => Int -> Ptr a -> DevicePtr a -> IO () Source #

Copy a number of elements from host memory onto the device. This is a synchronous operation.

pokeArrayAsync :: Storable a => Int -> HostPtr a -> DevicePtr a -> Maybe Stream -> IO () Source #

Copy memory onto the device asynchronously, possibly associated with a particular stream. The source host memory must be page-locked.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g1572263fe2597d7ba4f6964597a354a3

pokeArray2D Source #

Arguments

:: Storable a 
=> Int

width to copy (elements)

-> Int

height to copy (elements)

-> Ptr a

source array

-> Int

source array width

-> Int

source x-coordinate

-> Int

source y-coordinate

-> DevicePtr a

destination array

-> Int

destination array width

-> Int

destination x-coordinate

-> Int

destination y-coordinate

-> IO () 

pokeArray2DAsync Source #

Arguments

:: Storable a 
=> Int

width to copy (elements)

-> Int

height to copy (elements)

-> HostPtr a

source array

-> Int

source array width

-> Int

source x-coordinate

-> Int

source y-coordinate

-> DevicePtr a

destination array

-> Int

destination array width

-> Int

destination x-coordinate

-> Int

destination y-coordinate

-> Maybe Stream

stream to associate to

-> IO () 

Copy a 2D array from the host to the device asynchronously, possibly associated with a particular execution stream. The source host memory must be page-locked.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g4acf155faeb969d9d21f5433d3d0f274

pokeListArray :: Storable a => [a] -> DevicePtr a -> IO () Source #

Write a list of storable elements into a device array. The device array must be sufficiently large to hold the entire list. This requires two marshalling operations.

copyArray :: Storable a => Int -> DevicePtr a -> DevicePtr a -> IO () Source #

Copy the given number of elements from the first device array (source) to the second device array (destination). The copied areas may not overlap. This operation is asynchronous with respect to the host, but will never overlap with kernel execution.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g1725774abf8b51b91945f3336b778c8b

copyArrayAsync :: Storable a => Int -> DevicePtr a -> DevicePtr a -> Maybe Stream -> IO () Source #

Copy the given number of elements from the first device array (source) to the second device array (destination). The copied areas may not overlap. The operation is asynchronous with respect to the host, and can be asynchronous to other device operations by associating it with a particular stream.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g39ea09ba682b8eccc9c3e0c04319b5c8

copyArray2D Source #

Arguments

:: Storable a 
=> Int

width to copy (elements)

-> Int

height to copy (elements)

-> DevicePtr a

source array

-> Int

source array width

-> Int

source x-coordinate

-> Int

source y-coordinate

-> DevicePtr a

destination array

-> Int

destination array width

-> Int

destination x-coordinate

-> Int

destination y-coordinate

-> IO () 

Copy a 2D array from the first device array (source) to the second device array (destination). The copied areas must not overlap. This operation is asynchronous with respect to the host, but will never overlap with kernel execution.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g27f885b30c34cc20a663a671dbf6fc27

copyArray2DAsync Source #

Arguments

:: Storable a 
=> Int

width to copy (elements)

-> Int

height to copy (elements)

-> DevicePtr a

source array

-> Int

source array width

-> Int

source x-coordinate

-> Int

source y-coordinate

-> DevicePtr a

destination array

-> Int

destination array width

-> Int

destination x-coordinate

-> Int

destination y-coordinate

-> Maybe Stream

stream to associate to

-> IO () 

Copy a 2D array from the first device array (source) to the second device array (destination). The copied areas may not overlap. The operation is asynchronous with respect to the host, and can be asynchronous to other device operations by associating it with a particular execution stream.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g4acf155faeb969d9d21f5433d3d0f274

copyArrayPeer Source #

Arguments

:: Storable a 
=> Int

number of array elements

-> DevicePtr a 
-> Context

source array and context

-> DevicePtr a 
-> Context

destination array and context

-> IO () 

Copies an array from device memory in one context to device memory in another context. Note that this function is asynchronous with respect to the host, but serialised with respect to all pending and future asynchronous work in the source and destination contexts. To avoid this synchronisation, use copyArrayPeerAsync instead.

Requires CUDA-4.0.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1ge1f5c7771544fee150ada8853c7cbf4a

copyArrayPeerAsync Source #

Arguments

:: Storable a 
=> Int

number of array elements

-> DevicePtr a 
-> Context

source array and context

-> DevicePtr a 
-> Context

destination array and device context

-> Maybe Stream

stream to associate with

-> IO () 

Copies from device memory in one context to device memory in another context. Note that this function is asynchronous with respect to the host and all work in other streams and devices.

Requires CUDA-4.0.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g82fcecb38018e64b98616a8ac30112f2

newListArrayLen :: Storable a => [a] -> IO (DevicePtr a, Int) Source #

Write a list of storable elements into a newly allocated device array, returning the device pointer together with the number of elements that were written. Note that this requires two memory copies: firstly from a Haskell list to a heap allocated array, and from there onto the graphics device. The memory should be freed when no longer required.

newListArray :: Storable a => [a] -> IO (DevicePtr a) Source #

Write a list of storable elements into a newly allocated device array. This is newListArrayLen composed with fst.

withListArray :: Storable a => [a] -> (DevicePtr a -> IO b) -> IO b Source #

Temporarily store a list of elements into a newly allocated device array. An IO action is applied to the array, the result of which is returned. Similar to newListArray, this requires copying the data twice.

As with allocaArray, the memory is freed once the action completes, so you should not return the pointer from the action, and be wary of asynchronous kernel execution.
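
For example, a bracketed version of the tutorial's vecAdd computation, in which all of the device memory is allocated and freed automatically (vecAdd, xs, and ys assumed in scope as above):

zs <- withListArray xs $ \xs_dev ->
      withListArray ys $ \ys_dev ->
      allocaArray 1024 $ \zs_dev -> do
        launchKernel vecAdd (4,1,1) (256,1,1) 0 Nothing [VArg xs_dev, VArg ys_dev, VArg zs_dev, IArg 1024]
        sync
        peekListArray 1024 zs_dev :: IO [Float]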

withListArrayLen :: Storable a => [a] -> (Int -> DevicePtr a -> IO b) -> IO b Source #

A variant of withListArray which also supplies the number of elements in the array to the applied function

getDevicePtr :: [AllocFlag] -> HostPtr a -> IO (DevicePtr a) Source #

Return the device pointer associated with a mapped, pinned host buffer, which was allocated with the DeviceMapped option by mallocHostArray.

Currently, no options are supported and this must be empty.

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g57a39e5cba26af4d06be67fc77cc62f0

getMemInfo :: IO (Int64, Int64) Source #

Return the amount of free and total memory respectively available to the current context (bytes).

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g808f555540d0143a331cc42aa98835c0
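
For example:

>>> (freeMem, totalMem) <- getMemInfo
>>> totalMem - freeMem    -- bytes currently in use on the device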