Copyright | [2009..2018] Trevor L. McDonell |
---|---|
License | BSD |
Safe Haskell | None |
Language | Haskell98 |
This module defines an interface to the CUDA driver API. The Driver API
is a lower-level interface to CUDA devices than that provided by the
Runtime API. Using the Driver API, the programmer must deal explicitly
with operations such as initialisation, context management, and loading
(kernel) modules. Although more difficult to use initially, the Driver
API provides more control over how CUDA is used. Furthermore, since it
does not require compiling and linking the program with nvcc, the
Driver API provides better inter-language compatibility.
The following is a short tutorial on using the Driver API. The steps can
be copied into a file, or run directly in ghci, in which case ghci
should be launched with the option -fno-ghci-sandbox. This is because
CUDA maintains CPU-local state, so operations should always be run from
a bound thread.
- Using the Driver API
Before any operation can be performed, the Driver API must be initialised:
>>> import Foreign.CUDA.Driver
>>> initialise []
Next, we must select a GPU that we will execute operations on. Each GPU
is assigned a unique identifier (beginning at zero). We can get a handle
to a compute device at a given ordinal using the device operation.
Given a device handle, we can query the properties of that device using
props. The number of available CUDA-capable devices is given via
count. For example:
>>> count
1
>>> dev0 <- device 0
>>> props dev0
DeviceProperties {deviceName = "GeForce GT 650M", computeCapability = 3.0, ...}
This package also includes the executable 'nvidia-device-query', which when executed displays the key properties of all available devices. See Foreign.CUDA.Driver.Device for additional operations to query the capabilities or status of a device.
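The device-query steps above can be combined into a short standalone sketch which lists every CUDA-capable device in the system (assuming only the operations shown in this section):

```haskell
import Foreign.CUDA.Driver
import Control.Monad (forM_)

main :: IO ()
main = do
  initialise []
  n <- count                       -- number of CUDA-capable devices
  forM_ [0 .. n-1] $ \i -> do
    dev <- device i                -- handle to the device at ordinal i
    p   <- props dev
    putStrLn $ show i ++ ": " ++ deviceName p
            ++ " (compute " ++ show (computeCapability p) ++ ")"
```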
Once you have chosen a device to use, the next step is to create a CUDA
context. A context is associated with a particular device, and all
operations, such as memory allocation and kernel execution, take place
with respect to that context. For example, to create a new execution
context on CUDA device 0:

>>> ctx <- create dev0 []

The second argument is a set of ContextFlags which control how the
context behaves in various situations, for example, whether or not the
CPU should actively spin when waiting for results from the GPU
(SchedSpin), or to yield control to other threads instead (SchedYield).
The newly created context is now the active context, and all subsequent operations take place within that context. More than one context can be created per device, but resources, such as memory allocated in the GPU, are unique to each context. The module Foreign.CUDA.Driver.Context contains operations for managing multiple contexts. Some devices allow data to be shared between contexts without copying, see Foreign.CUDA.Driver.Context.Peer for more information.
Once the context is no longer needed, it should be destroyed in order
to free up any resources that were allocated to it.

>>> destroy ctx
Each device also has a unique context which is used by the Runtime API. This context can be accessed with the module Foreign.CUDA.Driver.Context.Primary.
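Since every create must be paired with a destroy, context handling is a natural fit for bracketing. The following helper is not part of this package; it is a sketch of how one might pair the two operations so the context is destroyed even if the body throws an exception:

```haskell
import Foreign.CUDA.Driver
import Control.Exception (bracket)

-- Hypothetical helper: run an action within a freshly created context,
-- guaranteeing the context is destroyed afterwards.
withContext :: Device -> [ContextFlag] -> (Context -> IO a) -> IO a
withContext dev flags = bracket (create dev flags) destroy
```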
- Executing kernels onto the GPU
Once the Driver API is initialised and an execution context is created on the GPU, we can begin to interact with it.
As an example, we'll step through executing the CUDA equivalent of the following Haskell function, which element-wise adds the elements of two arrays:

>>> vecAdd xs ys = zipWith (+) xs ys
The following CUDA kernel can be used to implement this on the GPU:
extern "C" __global__ void vecAdd(float *xs, float *ys, float *zs, int N)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    if ( ix < N ) {
        zs[ix] = xs[ix] + ys[ix];
    }
}
Here, the __global__ keyword marks the function as a kernel that
should be computed on the GPU in data parallel. When we execute this
function on the GPU, (at least) N threads will execute N individual
instances of the kernel function vecAdd. Each thread will operate on
a single element of each input array to create a single value in the
result. See the CUDA programming guide for more details.
We can save this to a file vector_add.cu, and compile it using nvcc
into a form that we can then load onto the GPU and execute:
$ nvcc --ptx vector_add.cu
The module Foreign.CUDA.Driver.Module contains functions for loading the resulting .ptx file (or .cubin files) into the running program.
>>> mdl <- loadFile "vector_add.ptx"

Once finished with the module, it is also a good idea to unload it.
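As with contexts, the load/unload pair can be bracketed. The helper below is our own sketch, not part of the package:

```haskell
import Foreign.CUDA.Driver
import Control.Exception (bracket)

-- Hypothetical helper: load a compiled module from disk, run an action
-- over it, and unload it afterwards, even if the action throws.
withModule :: FilePath -> (Module -> IO a) -> IO a
withModule path = bracket (loadFile path) unload
```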
Modules may export kernel functions, global variables, and texture references. Before we can execute our function, we need to look it up in the module by name.
>>> vecAdd <- getFun mdl "vecAdd"

Given this reference to our kernel function, we are almost ready to
execute it on the device using launchKernel, but first, we must create
some data that we can execute the function on.
- Transferring data to and from the GPU
GPUs typically have their own memory which is separate from the CPU's memory, and we need to explicitly copy data back and forth between these two regions. The module Foreign.CUDA.Driver.Marshal provides functions for allocating memory on the GPU, and copying data between the CPU and GPU, as well as directly between multiple GPUs.
For simplicity, we'll use standard Haskell lists for our input and output data structure. Note however that this will have significantly lower effective bandwidth than reading a single contiguous region of memory, so for most practical purposes you will want to use some kind of unboxed array.
>>> let xs = [1..1024] :: [Float]
>>> let ys = [2,4..2048] :: [Float]
In CUDA, like C, all memory management is explicit, and arrays on the device must be explicitly allocated and freed. As mentioned previously, data transfer is also explicit. However, we do provide convenience functions for combined allocation and marshalling, as well as bracketed operations.
>>> xs_dev <- newListArray xs
>>> ys_dev <- newListArray ys
>>> zs_dev <- mallocArray 1024 :: IO (DevicePtr Float)
After executing the kernel (see next section), we transfer the result back to the host, and free the memory that was allocated on the GPU.
>>> zs <- peekListArray 1024 zs_dev
>>> free xs_dev
>>> free ys_dev
>>> free zs_dev
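The explicit allocate/free pattern above can also be written with the package's bracketed operations, so that device memory is released even if an exception is thrown. The combined helper below is hypothetical, built only from functions shown in this module:

```haskell
import Foreign.CUDA.Driver

-- Hypothetical helper: marshal both input lists to the device, allocate
-- an output array of matching length, and free all three when the
-- action completes (or throws).
withInputs :: [Float] -> [Float]
           -> (DevicePtr Float -> DevicePtr Float -> DevicePtr Float -> IO a)
           -> IO a
withInputs xs ys action =
  withListArray xs        $ \xs_dev ->
  withListArray ys        $ \ys_dev ->
  allocaArray (length xs) $ \zs_dev ->
    action xs_dev ys_dev zs_dev
```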
- Piecing it all together
Finally, we have everything in place to execute our operation on the
GPU. Launching a kernel on the GPU consists of creating many threads on
the GPU which all execute the same function, and each thread has
a unique identifier in the grid/block hierarchy which can be used to
identify exactly which element this thread should process (the
blockIdx and threadIdx parameters that we saw earlier, respectively).
To execute our function, we will use a grid of 4 blocks, each containing 256 threads. Thus, a total of 1024 threads will be launched, which will each compute a single element of the output array (recall that our input arrays each have 1024 elements). The module Foreign.CUDA.Analysis.Occupancy contains functions to help determine the ideal thread block size for a given kernel and GPU combination.
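For problem sizes that are not an exact multiple of the block size, the grid size is obtained by rounding up (which is why the kernel guards with `ix < N`). A one-line sketch:

```haskell
-- Round-up division: the number of thread blocks needed to cover n
-- elements with t threads per block.
gridSize :: Int -> Int -> Int
gridSize n t = (n + t - 1) `div` t

-- gridSize 1024 256 == 4, matching the launch configuration below
```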
>>> launchKernel vecAdd (4,1,1) (256,1,1) 0 Nothing [VArg xs_dev, VArg ys_dev, VArg zs_dev, IArg 1024]
Note that kernel execution is asynchronous, so we should also wait for the operation to complete before attempting to read the results back.
>>> sync
And that's it!
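For reference, the whole tutorial can be collected into one standalone program. This sketch assumes "vector_add.ptx" (compiled as shown earlier) is in the current directory, and uses only the operations introduced above:

```haskell
module Main where

import Foreign.CUDA.Driver

main :: IO ()
main = do
  initialise []
  dev0   <- device 0
  ctx    <- create dev0 []
  mdl    <- loadFile "vector_add.ptx"
  vecAdd <- getFun mdl "vecAdd"

  let xs = [1..1024]   :: [Float]
      ys = [2,4..2048] :: [Float]
  xs_dev <- newListArray xs
  ys_dev <- newListArray ys
  zs_dev <- mallocArray 1024 :: IO (DevicePtr Float)

  -- 4 blocks of 256 threads each = 1024 threads, one per element
  launchKernel vecAdd (4,1,1) (256,1,1) 0 Nothing
               [VArg xs_dev, VArg ys_dev, VArg zs_dev, IArg 1024]
  sync                              -- kernel launch is asynchronous

  zs <- peekListArray 1024 zs_dev
  print (take 5 zs)                 -- should match zipWith (+) xs ys

  mapM_ free [xs_dev, ys_dev]
  free zs_dev
  unload mdl
  destroy ctx
```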
- Next steps
As mentioned at the end of the previous section, kernels on the GPU are
executed asynchronously with respect to the host, and other operations
such as data transfers can also be executed asynchronously. This allows
the CPU to continue doing other work while the GPU is busy. Events can
be used to check whether an operation has completed yet.

It is also possible to execute multiple kernels or data transfers
concurrently with each other, by assigning those operations to different
execution Streams. Used in conjunction with Events, operations will be
scheduled efficiently only once all dependencies (in the form of Events)
have been cleared.
See Foreign.CUDA.Driver.Event and Foreign.CUDA.Driver.Stream for more information on this topic.
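As a sketch of this pattern, the following combines an asynchronous copy with an event that marks its completion. The `Stream.create`, `Event.create`, `Event.record`, and `Event.block` names are our reading of the Foreign.CUDA.Driver.Stream and Foreign.CUDA.Driver.Event modules; consult those modules for the authoritative signatures:

```haskell
import Foreign.CUDA.Driver
import qualified Foreign.CUDA.Driver.Event  as Event
import qualified Foreign.CUDA.Driver.Stream as Stream

-- Copy n elements from the device into page-locked host memory
-- asynchronously, then wait only for that copy to finish.
copyThenWait :: DevicePtr Float -> HostPtr Float -> Int -> IO ()
copyThenWait d_arr h_arr n = do
  st <- Stream.create []
  ev <- Event.create []
  peekArrayAsync n d_arr h_arr (Just st)  -- asynchronous copy in stream 'st'
  Event.record ev (Just st)               -- event fires when the copy is done
  Event.block ev                          -- CPU blocks on this event only
  Event.destroy ev
  Stream.destroy st
```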
Synopsis
- module Foreign.CUDA.Ptr
- data ContextFlag
- newtype Context = Context (Ptr ())
- create :: Device -> [ContextFlag] -> IO Context
- attach :: Context -> [ContextFlag] -> IO ()
- detach :: Context -> IO ()
- destroy :: Context -> IO ()
- get :: IO (Maybe Context)
- set :: Context -> IO ()
- pop :: IO Context
- push :: Context -> IO ()
- sync :: IO ()
- type StreamPriority = Int
- data PeerAttribute
- data PeerFlag
- accessible :: Device -> Device -> IO Bool
- add :: Context -> [PeerFlag] -> IO ()
- remove :: Context -> IO ()
- getAttribute :: PeerAttribute -> Device -> Device -> IO Int
- data SharedMem
- data Cache
- data Limit
- getFlags :: IO [ContextFlag]
- getLimit :: Limit -> IO Int
- setLimit :: Limit -> Int -> IO ()
- getCache :: IO Cache
- setCache :: Cache -> IO ()
- getSharedMem :: IO SharedMem
- setSharedMem :: SharedMem -> IO ()
- getStreamPriorityRange :: IO (StreamPriority, StreamPriority)
- data DeviceProperties = DeviceProperties {
- deviceName :: !String
- computeCapability :: !Compute
- totalGlobalMem :: !Int64
- totalConstMem :: !Int64
- sharedMemPerBlock :: !Int64
- regsPerBlock :: !Int
- warpSize :: !Int
- maxThreadsPerBlock :: !Int
- maxThreadsPerMultiProcessor :: !Int
- maxBlockSize :: !(Int, Int, Int)
- maxGridSize :: !(Int, Int, Int)
- maxTextureDim1D :: !Int
- maxTextureDim2D :: !(Int, Int)
- maxTextureDim3D :: !(Int, Int, Int)
- clockRate :: !Int
- multiProcessorCount :: !Int
- memPitch :: !Int64
- memBusWidth :: !Int
- memClockRate :: !Int
- textureAlignment :: !Int64
- computeMode :: !ComputeMode
- deviceOverlap :: !Bool
- concurrentKernels :: !Bool
- eccEnabled :: !Bool
- asyncEngineCount :: !Int
- cacheMemL2 :: !Int
- pciInfo :: !PCI
- tccDriverEnabled :: !Bool
- kernelExecTimeoutEnabled :: !Bool
- integrated :: !Bool
- canMapHostMemory :: !Bool
- unifiedAddressing :: !Bool
- streamPriorities :: !Bool
- globalL1Cache :: !Bool
- localL1Cache :: !Bool
- managedMemory :: !Bool
- multiGPUBoard :: !Bool
- multiGPUBoardGroupID :: !Int
- preemption :: !Bool
- singleToDoublePerfRatio :: !Int
- cooperativeLaunch :: !Bool
- cooperativeLaunchMultiDevice :: !Bool
- data Compute = Compute !Int !Int
- data ComputeMode
- data InitFlag
- data DeviceAttribute
- = MaxThreadsPerBlock
- | MaxBlockDimX
- | MaxBlockDimY
- | MaxBlockDimZ
- | MaxGridDimX
- | MaxGridDimY
- | MaxGridDimZ
- | MaxSharedMemoryPerBlock
- | SharedMemoryPerBlock
- | TotalConstantMemory
- | WarpSize
- | MaxPitch
- | MaxRegistersPerBlock
- | RegistersPerBlock
- | ClockRate
- | TextureAlignment
- | GpuOverlap
- | MultiprocessorCount
- | KernelExecTimeout
- | Integrated
- | CanMapHostMemory
- | ComputeMode
- | MaximumTexture1dWidth
- | MaximumTexture2dWidth
- | MaximumTexture2dHeight
- | MaximumTexture3dWidth
- | MaximumTexture3dHeight
- | MaximumTexture3dDepth
- | MaximumTexture2dLayeredWidth
- | MaximumTexture2dArrayWidth
- | MaximumTexture2dLayeredHeight
- | MaximumTexture2dArrayHeight
- | MaximumTexture2dLayeredLayers
- | MaximumTexture2dArrayNumslices
- | SurfaceAlignment
- | ConcurrentKernels
- | EccEnabled
- | PciBusId
- | PciDeviceId
- | TccDriver
- | MemoryClockRate
- | GlobalMemoryBusWidth
- | L2CacheSize
- | MaxThreadsPerMultiprocessor
- | AsyncEngineCount
- | UnifiedAddressing
- | MaximumTexture1dLayeredWidth
- | MaximumTexture1dLayeredLayers
- | CanTex2dGather
- | MaximumTexture2dGatherWidth
- | MaximumTexture2dGatherHeight
- | MaximumTexture3dWidthAlternate
- | MaximumTexture3dHeightAlternate
- | MaximumTexture3dDepthAlternate
- | PciDomainId
- | TexturePitchAlignment
- | MaximumTexturecubemapWidth
- | MaximumTexturecubemapLayeredWidth
- | MaximumTexturecubemapLayeredLayers
- | MaximumSurface1dWidth
- | MaximumSurface2dWidth
- | MaximumSurface2dHeight
- | MaximumSurface3dWidth
- | MaximumSurface3dHeight
- | MaximumSurface3dDepth
- | MaximumSurface1dLayeredWidth
- | MaximumSurface1dLayeredLayers
- | MaximumSurface2dLayeredWidth
- | MaximumSurface2dLayeredHeight
- | MaximumSurface2dLayeredLayers
- | MaximumSurfacecubemapWidth
- | MaximumSurfacecubemapLayeredWidth
- | MaximumSurfacecubemapLayeredLayers
- | MaximumTexture1dLinearWidth
- | MaximumTexture2dLinearWidth
- | MaximumTexture2dLinearHeight
- | MaximumTexture2dLinearPitch
- | MaximumTexture2dMipmappedWidth
- | MaximumTexture2dMipmappedHeight
- | ComputeCapabilityMajor
- | ComputeCapabilityMinor
- | MaximumTexture1dMipmappedWidth
- | StreamPrioritiesSupported
- | GlobalL1CacheSupported
- | LocalL1CacheSupported
- | MaxSharedMemoryPerMultiprocessor
- | MaxRegistersPerMultiprocessor
- | ManagedMemory
- | MultiGpuBoard
- | MultiGpuBoardGroupId
- | HostNativeAtomicSupported
- | SingleToDoublePrecisionPerfRatio
- | PageableMemoryAccess
- | ConcurrentManagedAccess
- | ComputePreemptionSupported
- | CanUseHostPointerForRegisteredMem
- | CanUseStreamMemOps
- | CanUse64BitStreamMemOps
- | CanUseStreamWaitValueNor
- | CooperativeLaunch
- | CooperativeMultiDeviceLaunch
- | MaxSharedMemoryPerBlockOptin
- | CanFlushRemoteWrites
- | HostRegisterSupported
- | PageableMemoryAccessUsesHostPageTables
- | DirectManagedMemAccessFromHost
- | CU_DEVICE_ATTRIBUTE_MAX
- newtype Device = Device CInt
- initialise :: [InitFlag] -> IO ()
- capability :: Device -> IO Compute
- device :: Int -> IO Device
- attribute :: Device -> DeviceAttribute -> IO Int
- count :: IO Int
- name :: Device -> IO String
- uuid :: Device -> IO UUID
- props :: Device -> IO DeviceProperties
- totalMem :: Device -> IO Int64
- module Foreign.CUDA.Driver.Error
- data SharedMem
- data FunParam where
- data FunAttribute
- newtype Fun = Fun (Ptr ())
- requires :: Fun -> FunAttribute -> IO Int
- setCacheConfigFun :: Fun -> Cache -> IO ()
- setSharedMemConfigFun :: Fun -> SharedMem -> IO ()
- launchKernel :: Fun -> (Int, Int, Int) -> (Int, Int, Int) -> Int -> Maybe Stream -> [FunParam] -> IO ()
- launchKernel' :: Fun -> (Int, Int, Int) -> (Int, Int, Int) -> Int -> Maybe Stream -> [FunParam] -> IO ()
- launchKernelCooperative :: Fun -> (Int, Int, Int) -> (Int, Int, Int) -> Int -> Maybe Stream -> [FunParam] -> IO ()
- launch :: Fun -> (Int, Int) -> Maybe Stream -> IO ()
- setBlockShape :: Fun -> (Int, Int, Int) -> IO ()
- setSharedSize :: Fun -> Integer -> IO ()
- setParams :: Fun -> [FunParam] -> IO ()
- data AttachFlag
- data AllocFlag
- mallocHostArray :: Storable a => [AllocFlag] -> Int -> IO (HostPtr a)
- mallocHostForeignPtr :: Storable a => [AllocFlag] -> Int -> IO (ForeignPtr a)
- freeHost :: HostPtr a -> IO ()
- registerArray :: Storable a => [AllocFlag] -> Int -> Ptr a -> IO (HostPtr a)
- unregisterArray :: HostPtr a -> IO (Ptr a)
- mallocArray :: Storable a => Int -> IO (DevicePtr a)
- allocaArray :: Storable a => Int -> (DevicePtr a -> IO b) -> IO b
- free :: DevicePtr a -> IO ()
- mallocManagedArray :: Storable a => [AttachFlag] -> Int -> IO (DevicePtr a)
- prefetchArrayAsync :: Storable a => DevicePtr a -> Int -> Maybe Device -> Maybe Stream -> IO ()
- attachArrayAsync :: forall a. Storable a => [AttachFlag] -> Stream -> DevicePtr a -> Int -> IO ()
- peekArray :: Storable a => Int -> DevicePtr a -> Ptr a -> IO ()
- peekArrayAsync :: Storable a => Int -> DevicePtr a -> HostPtr a -> Maybe Stream -> IO ()
- peekArray2D :: Storable a => Int -> Int -> DevicePtr a -> Int -> Int -> Int -> Ptr a -> Int -> Int -> Int -> IO ()
- peekArray2DAsync :: Storable a => Int -> Int -> DevicePtr a -> Int -> Int -> Int -> HostPtr a -> Int -> Int -> Int -> Maybe Stream -> IO ()
- peekListArray :: Storable a => Int -> DevicePtr a -> IO [a]
- pokeArray :: Storable a => Int -> Ptr a -> DevicePtr a -> IO ()
- pokeArrayAsync :: Storable a => Int -> HostPtr a -> DevicePtr a -> Maybe Stream -> IO ()
- pokeArray2D :: Storable a => Int -> Int -> Ptr a -> Int -> Int -> Int -> DevicePtr a -> Int -> Int -> Int -> IO ()
- pokeArray2DAsync :: Storable a => Int -> Int -> HostPtr a -> Int -> Int -> Int -> DevicePtr a -> Int -> Int -> Int -> Maybe Stream -> IO ()
- pokeListArray :: Storable a => [a] -> DevicePtr a -> IO ()
- copyArray :: Storable a => Int -> DevicePtr a -> DevicePtr a -> IO ()
- copyArrayAsync :: Storable a => Int -> DevicePtr a -> DevicePtr a -> Maybe Stream -> IO ()
- copyArray2D :: Storable a => Int -> Int -> DevicePtr a -> Int -> Int -> Int -> DevicePtr a -> Int -> Int -> Int -> IO ()
- copyArray2DAsync :: Storable a => Int -> Int -> DevicePtr a -> Int -> Int -> Int -> DevicePtr a -> Int -> Int -> Int -> Maybe Stream -> IO ()
- copyArrayPeer :: Storable a => Int -> DevicePtr a -> Context -> DevicePtr a -> Context -> IO ()
- copyArrayPeerAsync :: Storable a => Int -> DevicePtr a -> Context -> DevicePtr a -> Context -> Maybe Stream -> IO ()
- newListArrayLen :: Storable a => [a] -> IO (DevicePtr a, Int)
- newListArray :: Storable a => [a] -> IO (DevicePtr a)
- withListArray :: Storable a => [a] -> (DevicePtr a -> IO b) -> IO b
- withListArrayLen :: Storable a => [a] -> (Int -> DevicePtr a -> IO b) -> IO b
- memset :: Storable a => DevicePtr a -> Int -> a -> IO ()
- memsetAsync :: Storable a => DevicePtr a -> Int -> a -> Maybe Stream -> IO ()
- getDevicePtr :: [AllocFlag] -> HostPtr a -> IO (DevicePtr a)
- getBasePtr :: DevicePtr a -> IO (DevicePtr a, Int64)
- getMemInfo :: IO (Int64, Int64)
- module Foreign.CUDA.Driver.Module
- module Foreign.CUDA.Driver.Unified
- module Foreign.CUDA.Driver.Utils
Documentation
module Foreign.CUDA.Ptr
data ContextFlag Source #
Context creation flags
SchedAuto | |
SchedSpin | |
SchedYield | |
SchedBlockingSync | |
BlockingSync | Deprecated: use SchedBlockingSync instead |
SchedMask | |
MapHost | |
LmemResizeToMax | |
FlagsMask |
Instances
Bounded ContextFlag Source # | |
Defined in Foreign.CUDA.Driver.Context.Base minBound :: ContextFlag # maxBound :: ContextFlag # | |
Enum ContextFlag Source # | |
Defined in Foreign.CUDA.Driver.Context.Base succ :: ContextFlag -> ContextFlag # pred :: ContextFlag -> ContextFlag # toEnum :: Int -> ContextFlag # fromEnum :: ContextFlag -> Int # enumFrom :: ContextFlag -> [ContextFlag] # enumFromThen :: ContextFlag -> ContextFlag -> [ContextFlag] # enumFromTo :: ContextFlag -> ContextFlag -> [ContextFlag] # enumFromThenTo :: ContextFlag -> ContextFlag -> ContextFlag -> [ContextFlag] # | |
Eq ContextFlag Source # | |
Defined in Foreign.CUDA.Driver.Context.Base (==) :: ContextFlag -> ContextFlag -> Bool # (/=) :: ContextFlag -> ContextFlag -> Bool # | |
Show ContextFlag Source # | |
Defined in Foreign.CUDA.Driver.Context.Base showsPrec :: Int -> ContextFlag -> ShowS # show :: ContextFlag -> String # showList :: [ContextFlag] -> ShowS # |
create :: Device -> [ContextFlag] -> IO Context Source #
Create a new CUDA context and associate it with the calling thread. The
context is created with a usage count of one, and the caller of create
must call destroy when done using the context. If a context is already
current to the thread, it is supplanted by the newly created context and
must be restored by a subsequent call to pop.
attach :: Context -> [ContextFlag] -> IO () Source #
Deprecated: as of CUDA-4.0
Increments the usage count of the context. Note: no context flags are currently supported, so this parameter must be empty.
detach :: Context -> IO () Source #
Deprecated: as of CUDA-4.0
Detach the context, and destroy it if it is no longer used.
destroy :: Context -> IO () Source #
Destroy the specified context, regardless of how many threads it is
current to. The context will be popped from the current thread's
context stack, but if it is current on any other threads it will remain
current to those threads, and attempts to access it will result in an
error.
get :: IO (Maybe Context) Source #
Return the context bound to the calling CPU thread.
Requires CUDA-4.0.
Pop the current CUDA context from the CPU thread. The context may then
be attached to a different CPU thread by calling push.
push :: Context -> IO () Source #
Push the given context onto the CPU's thread stack of current contexts. The specified context becomes the CPU thread's current context, so all operations that operate on the current context are affected.
Block until the device has completed all preceding requests. If the
context was created with the SchedBlockingSync flag, the CPU thread
will block until the GPU has finished its work.
type StreamPriority = Int Source #
Priority of an execution stream. Work submitted to a higher priority stream may preempt execution of work already executing in a lower priority stream. Lower numbers represent higher priorities.
data PeerAttribute Source #
Peer-to-peer attributes
PerformanceRank | |
AccessSupported | |
NativeAtomicSupported | |
ArrayAccessAccessSupported | |
CudaArrayAccessSupported |
Instances
Enum PeerAttribute Source # | |
Defined in Foreign.CUDA.Driver.Context.Peer succ :: PeerAttribute -> PeerAttribute # pred :: PeerAttribute -> PeerAttribute # toEnum :: Int -> PeerAttribute # fromEnum :: PeerAttribute -> Int # enumFrom :: PeerAttribute -> [PeerAttribute] # enumFromThen :: PeerAttribute -> PeerAttribute -> [PeerAttribute] # enumFromTo :: PeerAttribute -> PeerAttribute -> [PeerAttribute] # enumFromThenTo :: PeerAttribute -> PeerAttribute -> PeerAttribute -> [PeerAttribute] # | |
Eq PeerAttribute Source # | |
Defined in Foreign.CUDA.Driver.Context.Peer (==) :: PeerAttribute -> PeerAttribute -> Bool # (/=) :: PeerAttribute -> PeerAttribute -> Bool # | |
Show PeerAttribute Source # | |
Defined in Foreign.CUDA.Driver.Context.Peer showsPrec :: Int -> PeerAttribute -> ShowS # show :: PeerAttribute -> String # showList :: [PeerAttribute] -> ShowS # |
Possible option values for direct peer memory access
Instances
Enum PeerFlag Source # | |
Defined in Foreign.CUDA.Driver.Context.Peer |
accessible :: Device -> Device -> IO Bool Source #
Queries if the first device can directly access the memory of the second.
If direct access is possible, it can then be enabled with add.
Requires CUDA-4.0.
add :: Context -> [PeerFlag] -> IO () Source #
If the devices of both the current and supplied contexts support unified addressing, then enable allocations in the supplied context to be accessible by the current context.
Note that access is unidirectional, and in order to access memory in the
current context from the peer context, a separate symmetric call to add
is required.
Requires CUDA-4.0.
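Since each add call only grants the current context access to the supplied peer, enabling symmetric access requires one call per direction. The helper below is hypothetical; in particular, the use of push/pop to make each context current in turn is our assumption about typical usage:

```haskell
import Foreign.CUDA.Driver
import qualified Foreign.CUDA.Driver.Context.Peer as Peer

-- Hypothetical helper: enable peer access in both directions between
-- two contexts, if the underlying devices support it.
enablePeer :: Device -> Context -> Device -> Context -> IO ()
enablePeer dev1 ctx1 dev2 ctx2 = do
  ok <- Peer.accessible dev1 dev2
  if ok
    then do
      push ctx1
      Peer.add ctx2 []   -- ctx1 may now access allocations in ctx2
      _ <- pop
      push ctx2
      Peer.add ctx1 []   -- and symmetrically for the other direction
      _ <- pop
      return ()
    else putStrLn "direct peer access is not supported between these devices"
```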
remove :: Context -> IO () Source #
Disable direct memory access from the current context to the supplied peer context, and unregisters any registered allocations.
Requires CUDA-4.0.
getAttribute :: PeerAttribute -> Device -> Device -> IO Int Source #
Queries attributes of the link between two devices
Requires CUDA-8.0
since 0.9.0.0
Device shared memory configuration preference
Device cache configuration preference
Device limits flags
StackSize | |
PrintfFifoSize | |
MallocHeapSize | |
DevRuntimeSyncDepth | |
DevRuntimePendingLaunchCount | |
MaxL2FetchGranularity | |
Max |
getFlags :: IO [ContextFlag] Source #
Return the flags that were used to create the current context.
Requires CUDA-7.0
setLimit :: Limit -> Int -> IO () Source #
Specify the size of the call stack, for compute 2.0 devices.
Requires CUDA-3.1.
On devices where the L1 cache and shared memory use the same hardware resources, this function returns the preferred cache configuration for the current context.
Requires CUDA-3.2.
setCache :: Cache -> IO () Source #
On devices where the L1 cache and shared memory use the same hardware resources, this sets the preferred cache configuration for the current context. This is only a preference.
Any function configuration set via setCacheConfigFun will be preferred
over this context-wide setting.
Requires CUDA-3.2.
getSharedMem :: IO SharedMem Source #
Return the current size of the shared memory banks in the current
context. On devices with configurable shared memory banks, setSharedMem
can be used to change the configuration, so that subsequent kernel
launches will by default use the new bank size. On devices without
configurable shared memory, this function returns the fixed bank size
of the hardware.
Requires CUDA-4.2
setSharedMem :: SharedMem -> IO () Source #
On devices with configurable shared memory banks, this function will set the context's shared memory bank size that will be used by default for subsequent kernel launches.
Changing the shared memory configuration between launches may insert a device synchronisation.
Shared memory bank size does not affect shared memory usage or kernel occupancy, but may have major effects on performance. Larger bank sizes allow for greater potential bandwidth to shared memory, but change the kinds of accesses which result in bank conflicts.
Requires CUDA-4.2
getStreamPriorityRange :: IO (StreamPriority, StreamPriority) Source #
Returns the numerical values that correspond to the greatest and least priority execution streams in the current context respectively. Stream priorities follow the convention that lower numerical numbers correspond to higher priorities. The range of meaningful stream priorities is given by the inclusive range [greatestPriority,leastPriority].
Requires CUDA-5.5.
data DeviceProperties Source #
The properties of a compute device
DeviceProperties | |
|
Instances
Show DeviceProperties Source # | |
Defined in Foreign.CUDA.Analysis.Device showsPrec :: Int -> DeviceProperties -> ShowS # show :: DeviceProperties -> String # showList :: [DeviceProperties] -> ShowS # | |
Storable DeviceProperties Source # | |
Defined in Foreign.CUDA.Runtime.Device sizeOf :: DeviceProperties -> Int # alignment :: DeviceProperties -> Int # peekElemOff :: Ptr DeviceProperties -> Int -> IO DeviceProperties # pokeElemOff :: Ptr DeviceProperties -> Int -> DeviceProperties -> IO () # peekByteOff :: Ptr b -> Int -> IO DeviceProperties # pokeByteOff :: Ptr b -> Int -> DeviceProperties -> IO () # peek :: Ptr DeviceProperties -> IO DeviceProperties # poke :: Ptr DeviceProperties -> DeviceProperties -> IO () # |
GPU compute capability, major and minor revision number respectively.
data ComputeMode Source #
The compute mode the device is currently in
Instances
Enum ComputeMode Source # | |
Defined in Foreign.CUDA.Analysis.Device succ :: ComputeMode -> ComputeMode # pred :: ComputeMode -> ComputeMode # toEnum :: Int -> ComputeMode # fromEnum :: ComputeMode -> Int # enumFrom :: ComputeMode -> [ComputeMode] # enumFromThen :: ComputeMode -> ComputeMode -> [ComputeMode] # enumFromTo :: ComputeMode -> ComputeMode -> [ComputeMode] # enumFromThenTo :: ComputeMode -> ComputeMode -> ComputeMode -> [ComputeMode] # | |
Eq ComputeMode Source # | |
Defined in Foreign.CUDA.Analysis.Device (==) :: ComputeMode -> ComputeMode -> Bool # (/=) :: ComputeMode -> ComputeMode -> Bool # | |
Show ComputeMode Source # | |
Defined in Foreign.CUDA.Analysis.Device showsPrec :: Int -> ComputeMode -> ShowS # show :: ComputeMode -> String # showList :: [ComputeMode] -> ShowS # |
Possible option flags for CUDA initialisation. Dummy instance until the API exports actual option values.
Instances
Enum InitFlag Source # | |
Defined in Foreign.CUDA.Driver.Device |
data DeviceAttribute Source #
Device attributes
Instances
Enum DeviceAttribute Source # | |
Defined in Foreign.CUDA.Driver.Device succ :: DeviceAttribute -> DeviceAttribute # pred :: DeviceAttribute -> DeviceAttribute # toEnum :: Int -> DeviceAttribute # fromEnum :: DeviceAttribute -> Int # enumFrom :: DeviceAttribute -> [DeviceAttribute] # enumFromThen :: DeviceAttribute -> DeviceAttribute -> [DeviceAttribute] # enumFromTo :: DeviceAttribute -> DeviceAttribute -> [DeviceAttribute] # enumFromThenTo :: DeviceAttribute -> DeviceAttribute -> DeviceAttribute -> [DeviceAttribute] # | |
Eq DeviceAttribute Source # | |
Defined in Foreign.CUDA.Driver.Device (==) :: DeviceAttribute -> DeviceAttribute -> Bool # (/=) :: DeviceAttribute -> DeviceAttribute -> Bool # | |
Show DeviceAttribute Source # | |
Defined in Foreign.CUDA.Driver.Device showsPrec :: Int -> DeviceAttribute -> ShowS # show :: DeviceAttribute -> String # showList :: [DeviceAttribute] -> ShowS # |
A CUDA device
initialise :: [InitFlag] -> IO () Source #
Initialise the CUDA driver API. This must be called before any other driver function.
capability :: Device -> IO Compute Source #
Return the compute capability revision supported by the device
attribute :: Device -> DeviceAttribute -> IO Int Source #
Return the selected attribute for the given device.
Return the number of devices with compute capability > 1.0.
module Foreign.CUDA.Driver.Error
Device shared memory configuration preference
Kernel function parameters
Instances
Storable FunParam Source # | |
Defined in Foreign.CUDA.Driver.Exec |
data FunAttribute Source #
Function attributes
MaxKernelThreadsPerBlock | |
SharedSizeBytes | |
ConstSizeBytes | |
LocalSizeBytes | |
NumRegs | |
PtxVersion | |
BinaryVersion | |
CacheModeCa | |
MaxDynamicSharedSizeBytes | |
PreferredSharedMemoryCarveout | |
CU_FUNC_ATTRIBUTE_MAX |
Instances
Enum FunAttribute Source # | |
Defined in Foreign.CUDA.Driver.Exec succ :: FunAttribute -> FunAttribute # pred :: FunAttribute -> FunAttribute # toEnum :: Int -> FunAttribute # fromEnum :: FunAttribute -> Int # enumFrom :: FunAttribute -> [FunAttribute] # enumFromThen :: FunAttribute -> FunAttribute -> [FunAttribute] # enumFromTo :: FunAttribute -> FunAttribute -> [FunAttribute] # enumFromThenTo :: FunAttribute -> FunAttribute -> FunAttribute -> [FunAttribute] # | |
Eq FunAttribute Source # | |
Defined in Foreign.CUDA.Driver.Exec (==) :: FunAttribute -> FunAttribute -> Bool # (/=) :: FunAttribute -> FunAttribute -> Bool # | |
Show FunAttribute Source # | |
Defined in Foreign.CUDA.Driver.Exec showsPrec :: Int -> FunAttribute -> ShowS # show :: FunAttribute -> String # showList :: [FunAttribute] -> ShowS # |
requires :: Fun -> FunAttribute -> IO Int Source #
Returns the value of the selected attribute requirement for the given kernel.
setCacheConfigFun :: Fun -> Cache -> IO () Source #
On devices where the L1 cache and shared memory use the same hardware resources, this sets the preferred cache configuration for the given device function. This is only a preference; the driver is free to choose a different configuration as required to execute the function.
Switching between configuration modes may insert a device-side synchronisation point for streamed kernel launches.
Requires CUDA-3.0.
setSharedMemConfigFun :: Fun -> SharedMem -> IO () Source #
Set the shared memory configuration of a device function.
On devices with configurable shared memory banks, this will force all subsequent launches of the given device function to use the specified shared memory bank size configuration. On launch of the function, the shared memory configuration of the device will be temporarily changed if needed to suit the function configuration. Changes in shared memory configuration may introduce a device-side synchronisation between kernel launches.
Any per-function configuration specified by setSharedMemConfigFun will
override the context-wide configuration set with setSharedMem.
Changing the shared memory bank size will not increase shared memory usage or affect occupancy of kernels, but may have major effects on performance. Larger bank sizes will allow for greater potential bandwidth to shared memory, but will change what kinds of accesses to shared memory will result in bank conflicts.
This function will do nothing on devices with fixed shared memory bank size.
Requires CUDA-5.0.
:: Fun | function to execute |
-> (Int, Int, Int) | block grid dimension |
-> (Int, Int, Int) | thread block shape |
-> Int | shared memory (bytes) |
-> Maybe Stream | (optional) stream to execute in |
-> [FunParam] | list of function parameters |
-> IO () |
Invoke a kernel on a (gx * gy * gz)
grid of blocks, where each block
contains (tx * ty * tz)
threads and has access to a given number of bytes
of shared memory. The launch may also be associated with a specific Stream
.
In launchKernel, the number of kernel parameters and their offsets and sizes do not need to be specified, as this information is retrieved directly from the kernel's image. This requires the kernel to have been compiled with toolchain version 3.2 or later.
The alternative launchKernel' will pass the arguments in directly, requiring the application to know the size and alignment/padding of each kernel parameter.
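A common chore when choosing the block grid dimension is covering the whole problem with enough blocks; the usual ceiling division can be written as a small pure helper (`gridFor` is a hypothetical name, not part of this package):

```haskell
-- Number of thread blocks needed to cover 'n' work items when each
-- block processes 'blockSize' items: the ceiling of n / blockSize.
gridFor :: Int -> Int -> Int
gridFor n blockSize = (n + blockSize - 1) `div` blockSize
```

For example, launching over 1000 elements with 256-thread blocks gives a (4, 1, 1) grid; the kernel itself must then guard against the out-of-range threads in the final, partially filled block.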
launchKernel' Source #
:: Fun | function to execute |
-> (Int, Int, Int) | block grid dimension |
-> (Int, Int, Int) | thread block shape |
-> Int | shared memory (bytes) |
-> Maybe Stream | (optional) stream to execute in |
-> [FunParam] | list of function parameters |
-> IO () |
Invoke a kernel on a (gx * gy * gz) grid of blocks, where each block contains (tx * ty * tz) threads and has access to the given number of bytes of shared memory. The launch may also be associated with a specific Stream.
In launchKernel, the number of kernel parameters and their offsets and sizes do not need to be specified, as this information is retrieved directly from the kernel's image. This requires the kernel to have been compiled with toolchain version 3.2 or later.
The alternative launchKernel' will pass the arguments in directly, requiring the application to know the size and alignment/padding of each kernel parameter.
launchKernelCooperative Source #
:: Fun | function to execute |
-> (Int, Int, Int) | block grid dimension |
-> (Int, Int, Int) | thread block shape |
-> Int | shared memory (bytes) |
-> Maybe Stream | (optional) stream to execute in |
-> [FunParam] | list of function parameters |
-> IO () |
Invoke a kernel on a (gx * gy * gz) grid of blocks, where each block contains (tx * ty * tz) threads and has access to the given number of bytes of shared memory. The launch may also be associated with a specific stream.
The thread blocks can cooperate and synchronise as they execute.
The device on which this kernel is invoked must have the attribute CooperativeLaunch.
The total number of blocks launched cannot exceed the maximum number of active thread blocks per multiprocessor (threadBlocksPerMP), multiplied by the number of multiprocessors (multiProcessorCount).
The kernel can not make use of dynamic parallelism.
Requires CUDA-9.0.
Since: 0.9.0.0
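The launch-size restriction above amounts to a simple product; assuming the relevant device properties have already been queried, the upper bound on a cooperative grid can be sketched as (`maxCooperativeBlocks` is a hypothetical helper, not part of this API):

```haskell
-- Upper bound on the number of blocks in a cooperative launch: the
-- maximum number of active blocks per multiprocessor multiplied by
-- the number of multiprocessors on the device.
maxCooperativeBlocks :: Int -> Int -> Int
maxCooperativeBlocks blocksPerMP numMPs = blocksPerMP * numMPs
```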
launch :: Fun -> (Int, Int) -> Maybe Stream -> IO () Source #
Deprecated: use launchKernel instead
Invoke the kernel on a size (w,h) grid of blocks. Each block contains the number of threads specified by a previous call to setBlockShape. The launch may also be associated with a specific Stream.
setBlockShape :: Fun -> (Int, Int, Int) -> IO () Source #
Deprecated: use launchKernel instead
Specify the (x,y,z) dimensions of the thread blocks that are created when the given kernel function is launched.
setSharedSize :: Fun -> Integer -> IO () Source #
Deprecated: use launchKernel instead
Set the number of bytes of dynamic shared memory to be available to each thread block when the function is launched.
setParams :: Fun -> [FunParam] -> IO () Source #
Deprecated: use launchKernel instead
Set the parameters that will be specified the next time the kernel is invoked.
data AttachFlag Source #
Options for unified memory allocations
Instances
Bounded AttachFlag Source # | |
Defined in Foreign.CUDA.Driver.Marshal minBound :: AttachFlag # maxBound :: AttachFlag # | |
Enum AttachFlag Source # | |
Defined in Foreign.CUDA.Driver.Marshal succ :: AttachFlag -> AttachFlag # pred :: AttachFlag -> AttachFlag # toEnum :: Int -> AttachFlag # fromEnum :: AttachFlag -> Int # enumFrom :: AttachFlag -> [AttachFlag] # enumFromThen :: AttachFlag -> AttachFlag -> [AttachFlag] # enumFromTo :: AttachFlag -> AttachFlag -> [AttachFlag] # enumFromThenTo :: AttachFlag -> AttachFlag -> AttachFlag -> [AttachFlag] # | |
Eq AttachFlag Source # | |
Defined in Foreign.CUDA.Driver.Marshal (==) :: AttachFlag -> AttachFlag -> Bool # (/=) :: AttachFlag -> AttachFlag -> Bool # | |
Show AttachFlag Source # | |
Defined in Foreign.CUDA.Driver.Marshal showsPrec :: Int -> AttachFlag -> ShowS # show :: AttachFlag -> String # showList :: [AttachFlag] -> ShowS # |
data AllocFlag Source #
Options for host allocation
Instances
Bounded AllocFlag Source # | |
Enum AllocFlag Source # | |
Defined in Foreign.CUDA.Driver.Marshal succ :: AllocFlag -> AllocFlag # pred :: AllocFlag -> AllocFlag # fromEnum :: AllocFlag -> Int # enumFrom :: AllocFlag -> [AllocFlag] # enumFromThen :: AllocFlag -> AllocFlag -> [AllocFlag] # enumFromTo :: AllocFlag -> AllocFlag -> [AllocFlag] # enumFromThenTo :: AllocFlag -> AllocFlag -> AllocFlag -> [AllocFlag] # | |
Eq AllocFlag Source # | |
Show AllocFlag Source # | |
mallocHostArray :: Storable a => [AllocFlag] -> Int -> IO (HostPtr a) Source #
Allocate a section of linear memory on the host which is page-locked and directly accessible from the device. The storage is sufficient to hold the given number of elements of a storable type.
Note that, since this reduces the amount of pageable memory available, overall system performance may suffer. This is best used sparingly to allocate staging areas for data exchange.
Host memory allocated in this way is automatically and immediately accessible to all contexts on all devices which support unified addressing.
mallocHostForeignPtr :: Storable a => [AllocFlag] -> Int -> IO (ForeignPtr a) Source #
As mallocHostArray, but return a ForeignPtr instead. The array will be deallocated automatically once the last reference to the ForeignPtr is dropped.
registerArray :: Storable a => [AllocFlag] -> Int -> Ptr a -> IO (HostPtr a) Source #
Page-locks the specified array (on the host) and maps it for the device(s) as specified by the given allocation flags. Subsequently, the memory is accessed directly by the device and so can be read and written with much higher bandwidth than pageable memory that has not been registered. The memory range is added to the same tracking mechanism as mallocHostArray to automatically accelerate calls to functions such as pokeArray.
Note that page-locking excessive amounts of memory may degrade system performance, since it reduces the amount of pageable memory available. This is best used sparingly to allocate staging areas for data exchange.
This function has limited support on Mac OS X. OS 10.7 or later is required.
Requires CUDA-4.0.
unregisterArray :: HostPtr a -> IO (Ptr a) Source #
Unmaps the memory from the given pointer, and makes it pageable again.
Requires CUDA-4.0.
mallocArray :: Storable a => Int -> IO (DevicePtr a) Source #
Allocate a section of linear memory on the device, and return a reference to it. The memory is sufficient to hold the given number of elements of storable type. It is suitably aligned for any type, and is not cleared.
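The allocation functions count elements rather than bytes; the number of bytes actually reserved for n elements of a storable type follows from its sizeOf, as in this sketch (`arrayBytes` is a hypothetical helper, not part of this API):

```haskell
import Foreign.Storable (Storable, sizeOf)

-- Bytes required to store 'n' elements of the given storable type.
-- The first argument serves only as a type proxy and is not evaluated.
arrayBytes :: Storable a => a -> Int -> Int
arrayBytes proxy n = n * sizeOf proxy
```

For instance, 100 elements of Double occupy 800 bytes of device memory, before any alignment padding the allocator may add.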
allocaArray :: Storable a => Int -> (DevicePtr a -> IO b) -> IO b Source #
Execute a computation on the device, passing a pointer to a temporarily allocated block of memory sufficient to hold the given number of elements of storable type. The memory is freed when the computation terminates (normally or via an exception), so the pointer must not be used after this.
Note that kernel launches can be asynchronous, so you may want to add a synchronisation point using sync as part of the continuation.
mallocManagedArray :: Storable a => [AttachFlag] -> Int -> IO (DevicePtr a) Source #
Allocates memory that will be automatically managed by the Unified Memory system. The returned pointer is valid on the CPU and on all GPUs which supported managed memory. All accesses to this pointer must obey the Unified Memory programming model.
On a multi-GPU system with peer-to-peer support, where multiple GPUs support managed memory, the physical storage is created on the GPU which is active at the time mallocManagedArray is called. All other GPUs will access the array at reduced bandwidth via peer mapping over the PCIe bus. The Unified Memory system does not migrate memory between GPUs.
On a multi-GPU system where multiple GPUs support managed memory, but not all pairs of such GPUs have peer-to-peer support between them, the physical storage is allocated in system memory (zero-copy memory) and all GPUs will access the data at reduced bandwidth over the PCIe bus.
Requires CUDA-6.0
prefetchArrayAsync :: Storable a => DevicePtr a -> Int -> Maybe Device -> Maybe Stream -> IO () Source #
Pre-fetches the given number of elements to the specified destination device. If the specified device is Nothing, the data is pre-fetched to host memory. The pointer must refer to a memory range allocated with mallocManagedArray.
Requires CUDA-8.0.
attachArrayAsync :: forall a. Storable a => [AttachFlag] -> Stream -> DevicePtr a -> Int -> IO () Source #
Attach an array of the given number of elements to a stream asynchronously.
Since: 0.10.0.0
peekArray :: Storable a => Int -> DevicePtr a -> Ptr a -> IO () Source #
Copy a number of elements from the device to host memory. This is a synchronous operation.
peekArrayAsync :: Storable a => Int -> DevicePtr a -> HostPtr a -> Maybe Stream -> IO () Source #
Copy memory from the device asynchronously, possibly associated with a particular stream. The destination host memory must be page-locked.
peekArray2D Source #
:: Storable a | |
=> Int | width to copy (elements) |
-> Int | height to copy (elements) |
-> DevicePtr a | source array |
-> Int | source array width |
-> Int | source x-coordinate |
-> Int | source y-coordinate |
-> Ptr a | destination array |
-> Int | destination array width |
-> Int | destination x-coordinate |
-> Int | destination y-coordinate |
-> IO () |
Copy a 2D array from the device to the host.
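The 2D copy operations address their arrays in row-major order; the flat element index selected by a coordinate (x, y) in an array of a given row width can be sketched as (`rowMajorIndex` is a hypothetical helper, not part of this API):

```haskell
-- Flat element index of the point (x, y) in a row-major array whose
-- rows are 'width' elements long.
rowMajorIndex :: Int -> Int -> Int -> Int
rowMajorIndex width x y = y * width + x
```

The source and destination coordinates passed to the 2D copies select such an offset independently in each array, which is how a sub-rectangle can be transferred between arrays of different widths.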
peekArray2DAsync Source #
:: Storable a | |
=> Int | width to copy (elements) |
-> Int | height to copy (elements) |
-> DevicePtr a | source array |
-> Int | source array width |
-> Int | source x-coordinate |
-> Int | source y-coordinate |
-> HostPtr a | destination array |
-> Int | destination array width |
-> Int | destination x-coordinate |
-> Int | destination y-coordinate |
-> Maybe Stream | stream to associate to |
-> IO () |
Copy a 2D array from the device to the host asynchronously, possibly associated with a particular execution stream. The destination host memory must be page-locked.
peekListArray :: Storable a => Int -> DevicePtr a -> IO [a] Source #
Copy a number of elements from the device into a new Haskell list. Note that this requires two memory copies: first from the device into a heap-allocated array, and from there marshalled into a list.
pokeArray :: Storable a => Int -> Ptr a -> DevicePtr a -> IO () Source #
Copy a number of elements onto the device. This is a synchronous operation.
pokeArrayAsync :: Storable a => Int -> HostPtr a -> DevicePtr a -> Maybe Stream -> IO () Source #
Copy memory onto the device asynchronously, possibly associated with a particular stream. The source host memory must be page-locked.
pokeArray2D Source #
:: Storable a | |
=> Int | width to copy (elements) |
-> Int | height to copy (elements) |
-> Ptr a | source array |
-> Int | source array width |
-> Int | source x-coordinate |
-> Int | source y-coordinate |
-> DevicePtr a | destination array |
-> Int | destination array width |
-> Int | destination x-coordinate |
-> Int | destination y-coordinate |
-> IO () |
Copy a 2D array from the host to the device.
pokeArray2DAsync Source #
:: Storable a | |
=> Int | width to copy (elements) |
-> Int | height to copy (elements) |
-> HostPtr a | source array |
-> Int | source array width |
-> Int | source x-coordinate |
-> Int | source y-coordinate |
-> DevicePtr a | destination array |
-> Int | destination array width |
-> Int | destination x-coordinate |
-> Int | destination y-coordinate |
-> Maybe Stream | stream to associate to |
-> IO () |
Copy a 2D array from the host to the device asynchronously, possibly associated with a particular execution stream. The source host memory must be page-locked.
pokeListArray :: Storable a => [a] -> DevicePtr a -> IO () Source #
Write a list of storable elements into a device array. The device array must be sufficiently large to hold the entire list. This requires two marshalling operations.
copyArray :: Storable a => Int -> DevicePtr a -> DevicePtr a -> IO () Source #
Copy the given number of elements from the first device array (source) to the second device array (destination). The copied areas may not overlap. This operation is asynchronous with respect to the host, but will never overlap with kernel execution.
copyArrayAsync :: Storable a => Int -> DevicePtr a -> DevicePtr a -> Maybe Stream -> IO () Source #
Copy the given number of elements from the first device array (source) to the second device array (destination). The copied areas may not overlap. The operation is asynchronous with respect to the host, and can be asynchronous to other device operations by associating it with a particular stream.
copyArray2D Source #
:: Storable a | |
=> Int | width to copy (elements) |
-> Int | height to copy (elements) |
-> DevicePtr a | source array |
-> Int | source array width |
-> Int | source x-coordinate |
-> Int | source y-coordinate |
-> DevicePtr a | destination array |
-> Int | destination array width |
-> Int | destination x-coordinate |
-> Int | destination y-coordinate |
-> IO () |
Copy a 2D array from the first device array (source) to the second device array (destination). The copied areas must not overlap. This operation is asynchronous with respect to the host, but will never overlap with kernel execution.
copyArray2DAsync Source #
:: Storable a | |
=> Int | width to copy (elements) |
-> Int | height to copy (elements) |
-> DevicePtr a | source array |
-> Int | source array width |
-> Int | source x-coordinate |
-> Int | source y-coordinate |
-> DevicePtr a | destination array |
-> Int | destination array width |
-> Int | destination x-coordinate |
-> Int | destination y-coordinate |
-> Maybe Stream | stream to associate to |
-> IO () |
Copy a 2D array from the first device array (source) to the second device array (destination). The copied areas may not overlap. The operation is asynchronous with respect to the host, and can be asynchronous to other device operations by associating it with a particular execution stream.
copyArrayPeer Source #
:: Storable a | |
=> Int | number of array elements |
-> DevicePtr a | |
-> Context | source array and context |
-> DevicePtr a | |
-> Context | destination array and context |
-> IO () |
Copies an array from device memory in one context to device memory in another context. Note that this function is asynchronous with respect to the host, but serialised with respect to all pending and future asynchronous work in the source and destination contexts. To avoid this synchronisation, use copyArrayPeerAsync instead.
Requires CUDA-4.0.
copyArrayPeerAsync Source #
:: Storable a | |
=> Int | number of array elements |
-> DevicePtr a | |
-> Context | source array and context |
-> DevicePtr a | |
-> Context | destination array and device context |
-> Maybe Stream | stream to associate with |
-> IO () |
Copies from device memory in one context to device memory in another context. Note that this function is asynchronous with respect to the host and all work in other streams and devices.
Requires CUDA-4.0.
newListArrayLen :: Storable a => [a] -> IO (DevicePtr a, Int) Source #
Write a list of storable elements into a newly allocated device array, returning the device pointer together with the number of elements that were written. Note that this requires two memory copies: first from the Haskell list to a heap-allocated array, and from there onto the graphics device. The memory should be freed when no longer required.
newListArray :: Storable a => [a] -> IO (DevicePtr a) Source #
Write a list of storable elements into a newly allocated device array. This is newListArrayLen composed with fst.
withListArray :: Storable a => [a] -> (DevicePtr a -> IO b) -> IO b Source #
Temporarily store a list of elements into a newly allocated device array. An IO action is applied to the array, the result of which is returned. Similar to newListArray, this requires copying the data twice.
As with allocaArray, the memory is freed once the action completes, so you should not return the pointer from the action, and be wary of asynchronous kernel execution.
withListArrayLen :: Storable a => [a] -> (Int -> DevicePtr a -> IO b) -> IO b Source #
A variant of withListArray which also supplies the number of elements in the array to the applied function.
memset :: Storable a => DevicePtr a -> Int -> a -> IO () Source #
Set a number of data elements to the specified value, which may be either 8-, 16-, or 32-bits wide.
memsetAsync :: Storable a => DevicePtr a -> Int -> a -> Maybe Stream -> IO () Source #
Set the number of data elements to the specified value, which may be either 8-, 16-, or 32-bits wide. The operation is asynchronous and may optionally be associated with a stream.
Requires CUDA-3.2.
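Both memset variants accept only element widths of 8, 16, or 32 bits; whether a given storable type qualifies can be checked with a small helper (`memsettable` is a hypothetical name, not part of this API):

```haskell
import Foreign.Storable (Storable, sizeOf)

-- True if elements of this type are 1, 2, or 4 bytes wide: the only
-- widths the memset operations accept. The argument is a type proxy
-- and is not evaluated.
memsettable :: Storable a => a -> Bool
memsettable proxy = sizeOf proxy `elem` [1, 2, 4]
```

For example, an array of Float (4 bytes) can be filled this way, but an array of Double (8 bytes) cannot.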
getDevicePtr :: [AllocFlag] -> HostPtr a -> IO (DevicePtr a) Source #
Return the device pointer associated with a mapped, pinned host buffer, which was allocated with the DeviceMapped option by mallocHostArray.
Currently, no options are supported and this must be empty.
getBasePtr :: DevicePtr a -> IO (DevicePtr a, Int64) Source #
Return the base address and allocation size of the given device pointer.
getMemInfo :: IO (Int64, Int64) Source #
Return the amount of free and total memory respectively available to the current context (bytes).
module Foreign.CUDA.Driver.Module
module Foreign.CUDA.Driver.Unified
module Foreign.CUDA.Driver.Utils