(c) [2009..2014] Trevor L. McDonell, BSD

Instance for special casing null pointers.

Given a bit pattern, yield all bit masks that it contains. This does *not* attempt to compute a minimal set of bit masks that, when combined, yield the bit pattern; instead, all contained bit masks are produced.

Integral conversion.
Floating conversion.
Obtain the C value from a Haskell value.
Obtain the Haskell value from a C value.
Convert a C enumeration to Haskell.
Convert a Haskell enumeration to C.

Return a descriptive error string associated with a particular error code.
Raise a CUDAException in the IO monad.
A specially formatted error message.
Return the results of a function on successful execution, otherwise throw an exception with an error string associated with the return code.
Throw an exception with an error string associated with an unsuccessful return code, otherwise return unit.

Return the version number of the installed CUDA driver.

Return codes from API functions.
Raise a CUDAException in the IO monad.
A specially formatted error message.
Return the descriptive string associated with a particular error code.
Return the results of a function on successful execution, otherwise return the error string associated with the return code.
Return the error string associated with an unsuccessful return code, otherwise Nothing.

Return the version number of the installed CUDA driver.
Return the version number of the installed CUDA runtime.

Hardware resource limits of a multiprocessor:
- Warp size
- Maximum number of in-flight threads on a multiprocessor
- Maximum number of thread blocks resident on a multiprocessor
- Maximum number of in-flight warps per multiprocessor
- Number of SIMD arithmetic units per multiprocessor
- Total amount of shared memory per multiprocessor (bytes)
- Shared memory allocation unit size (bytes)
- Total number of registers in a multiprocessor
- Register allocation unit size
- Register allocation granularity for warps
- Maximum number of registers per thread
- How multiprocessor resources are divided
- PCI bus ID of the device
- PCI device ID
- PCI domain ID

The properties of a compute device:
- Identifier
- Supported compute capability
- Available global memory on the device in bytes
- Available constant memory on the device in bytes
- Available shared memory per block in bytes
- 32-bit registers per block
- Warp size in threads (SIMD width)
- Max number of threads per block
- Max number of threads per multiprocessor
- Max size of each dimension of a block
- Max size of each dimension of a grid
- Maximum texture dimensions
- Clock frequency in kilohertz
- Number of multiprocessors on the device
- Max pitch in bytes allowed by memory copies
- Global memory bus width in bits
- Peak memory clock frequency in kilohertz
- Alignment requirement for textures
- Device can concurrently copy memory and execute a kernel
- Device can possibly execute multiple kernels concurrently
- Device supports and has enabled error correction
- Number of asynchronous engines
- Size of the L2 cache in bytes
- Whether this is a Tesla device using the TCC driver
- PCI device information for the device
- Whether there is a runtime limit on kernels
- Whether the device is integrated (as opposed to discrete)
- Device can use pinned memory
- Device shares a unified address space with the host
- The compute mode the device is currently in

Extract some additional hardware resource limitations for a given device.

GPU compute capability: major and minor revision number, respectively.

Occupancy calculations:
- Active threads per multiprocessor
- Active thread blocks per multiprocessor
- Active warps per multiprocessor
- Occupancy of each multiprocessor (percent)

Calculate occupancy data for a given GPU and kernel resource usage.

Optimise multiprocessor occupancy as a function of thread block size and resource usage. This returns the smallest satisfying block size in increments of a single warp.

A variant of the above with a generator that produces the specific thread block sizes that should be tested. The generated list can produce values in any order, but the last satisfying block size will be returned.
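A usage sketch of the occupancy optimiser described above. The module and function names (`Foreign.CUDA.Analysis`, `optimalBlockSize`) and the exact signature are assumptions based on the descriptions here, not a definitive API:

```haskell
import Foreign.CUDA.Analysis

-- Pick a thread block size for a kernel that uses 32 registers per
-- thread and no dynamic shared memory.  'optimalBlockSize' is assumed
-- to take the per-thread register count and per-block shared memory
-- usage as functions of the candidate block size.
pickBlockSize :: DeviceProperties -> (Int, Occupancy)
pickBlockSize dev = optimalBlockSize dev (const 32) (const 0)
```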
Generated values should therefore be monotonically decreasing to return the smallest block size yielding maximum occupancy, and vice-versa.

Increments in powers-of-two, over the range of supported thread block sizes for the given device.
Decrements in powers-of-two, over the range of supported thread block sizes for the given device.
Decrements in the warp size of the device, over the range of supported thread block sizes.
Increments in the warp size of the device, over the range of supported thread block sizes.

Determine the maximum number of CTAs that can be run simultaneously for a given kernel / device combination.

Parameters:
- Properties of the card in question
- Threads per block
- Registers per thread
- Shared memory per block (bytes)
- Architecture to optimise for
- Register count as a function of thread block size
- Shared memory usage (bytes) as a function of thread block size

Parameters:
- Properties of the card in question
- Threads per block
- Registers per thread
- Shared memory per block (bytes)
- Maximum number of resident blocks

Device limit flags.
Possible option values for direct peer memory access.
Device execution flags.
A device identifier.

Select the compute device which best matches the given criteria.
Returns which device is currently being used.
Returns the number of devices available for execution, with compute capability >= 1.0.
Return information about the selected compute device.
Set device to be used for GPU execution.
Set flags to be used for device execution.
Set list of devices for CUDA execution in priority order.

Block until the device has completed all preceding requested tasks. Returns an error if one of the tasks fails.

Explicitly destroys and cleans up all runtime resources associated with the current device in the current process. Any subsequent API call will reinitialise the device. Note that this function will reset the device immediately. It is the caller's responsibility to ensure that the device is not being accessed by any other host threads from the process when this function is called.

Queries if the first device can directly access the memory of the second. If direct access is possible, it can then be enabled. Requires cuda-4.0.

If the devices of both the current and supplied contexts support unified addressing, then enable allocations in the supplied context to be accessible by the current context. Requires cuda-4.0.

Disable direct memory access from the current context to the supplied context. Requires cuda-4.0.

Query compute 2.0 call stack limits. Requires cuda-3.1.
Set compute 2.0 call stack limits. Requires cuda-3.1.

Possible option flags for CUDA initialisation. Dummy instance until the API exports actual option values.
Device attributes.

Initialise the CUDA driver API. Must be called before any other driver function.
Return the compute compatibility revision supported by the device.
Return a device handle.
Return the selected attribute for the given device.
Return the number of devices with compute capability >= 1.0.
Name of the device.
Return the properties of the selected device.
Total memory available on the device (bytes).

Possible option values for direct peer memory access.
Device cache configuration preference.
Device limit flags.
Context creation flags.
A device context.

Create a new CUDA context and associate it with the calling thread.
Increments the usage count of the context.
Note that no context flags are currently supported by the API, so this parameter must be empty.

Detach the context, and destroy it if no longer used.

Destroy the specified context. This fails if the context is more than a single attachment (including that from initial creation).

Return the context bound to the calling CPU thread. Requires cuda-4.0.
Bind the specified context to the calling thread. Requires cuda-4.0.
Return the device of the currently active context.

Pop the current CUDA context from the CPU thread. The context must have a single usage count (matching attach and detach calls). If successful, the new context is returned, and the old may be attached to a different CPU.

Push the given context onto the CPU's thread stack of current contexts. The context must be floating, i.e. not attached to any thread.

Block until the device has completed all preceding requests.

Queries if the first device can directly access the memory of the second. If direct access is possible, it can then be enabled. Requires cuda-4.0.

If the devices of both the current and supplied contexts support unified addressing, then enable allocations in the supplied context to be accessible by the current context. Requires cuda-4.0.

Disable direct memory access from the current context to the supplied context. Requires cuda-4.0.

Query compute 2.0 call stack limits. Requires cuda-3.1.
Specify the size of the call stack, for compute 2.0 devices. Requires cuda-3.1.

On devices where the L1 cache and shared memory use the same hardware resources, this sets the preferred cache configuration for the current context. This is only a preference. Requires cuda-3.2.

Possible option flags for stream initialisation. Dummy instance until the API exports actual option values.

A processing stream. All operations in a stream are synchronous and executed in sequence, but operations in different non-default streams may happen out-of-order or concurrently with one another. Use events to synchronise operations between streams.

Event creation flags.

Events are markers that can be inserted into the CUDA execution stream and later queried.

A reference to page-locked host memory. A host pointer is just a plain pointer, but the memory has been allocated by CUDA into page-locked memory. This means that the data can be copied to the GPU via DMA (direct memory access). Note that the use of the system function mlock is not sufficient here --- the CUDA version ensures that the *physical* address stays the same, not just the virtual address. To copy data into such an array, you may use for example withHostPtr together with the memory-copy operations.

A reference to data stored on the device.

The main execution stream. No operations overlap with operations in the default stream.

Possible option flags for waiting for events.

Create a new event.
Destroy an event.
Determine the elapsed time (in milliseconds) between two events.
Determine whether an event has actually been recorded.
Record an event once all operations in the current context (or optionally specified stream) have completed. This operation is asynchronous.
Make all future work submitted to the (optional) stream wait until the given event reports completion before beginning execution. Synchronisation is performed on the device, including when the event and stream are from different device contexts. Requires cuda-3.2.
Wait until the event has been recorded.

Create a new asynchronous stream.
Destroy and clean up an asynchronous stream.
Determine if all operations in a stream have completed.
Block until all operations in a Stream have been completed.

The main execution stream (0):

    {-# INLINE defaultStream #-}
    defaultStream :: Stream
    #if CUDART_VERSION < 3010
    defaultStream = Stream 0
    #else
    defaultStream = Stream nullPtr
    #endif

Kernel function parameters. Doubles will be converted to an internal float representation on devices that do not support doubles natively.

Cache configuration preference.

Maximum block size that can be successfully launched (based on register usage); number of registers required for each thread.

A global device function. Note that the use of a string naming a function was deprecated in CUDA 4.1 and removed in CUDA 5.0.

Obtain the attributes of the named global device function. This itemises the requirements to successfully launch the given kernel.

Specify the grid and block dimensions for a device call. Used in conjunction with the parameter-setting function, this pushes data onto the execution stack that will be popped when a function is launched.

Set the argument parameters that will be passed to the next kernel invocation. This is used in conjunction with the function above to control kernel execution.

On devices where the L1 cache and shared memory use the same hardware resources, this sets the preferred cache configuration for the given device function. This is only a preference; the driver is free to choose a different configuration as required to execute the function. Switching between configuration modes may insert a device-side synchronisation point for streamed kernel launches.

Invoke the global kernel function on the device. This must be preceded by a call to set the launch configuration and (if appropriate) the kernel arguments.

Invoke a kernel on a (gx * gy) grid of blocks, where each block contains (tx * ty * tz) threads and has access to a given number of bytes of shared memory.
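A hedged sketch of such a launch, rounding the grid size up to cover n elements. The `launchKernel` name, its argument order, and the `FunParam` type are assumptions drawn from the descriptions here; check the package's actual signature before use:

```haskell
import Foreign.CUDA.Driver

-- Launch a kernel over n elements with a fixed 128-thread block size,
-- no dynamic shared memory, in the default stream.
launch1D :: Fun -> Int -> [FunParam] -> IO ()
launch1D fn n args =
  let threads = 128
      blocks  = (n + threads - 1) `div` threads   -- round up
  in  launchKernel fn (blocks,1,1) (threads,1,1) 0 Nothing args
```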
The launch may also be associated with a specific processing stream.

Parameters:
- grid dimensions
- block dimensions
- shared memory per block (bytes)
- associated processing stream

Parameters:
- Device function symbol
- grid dimensions
- thread block shape
- shared memory per block (bytes)
- (optional) execution stream

Create a new event.
Destroy an event.
Determine the elapsed time (in milliseconds) between two events.
Determine whether an event has actually been recorded.
Record an event once all operations in the current context (or optionally specified stream) have completed. This operation is asynchronous.
Make all future work submitted to the (optional) stream wait until the given event reports completion before beginning execution. Synchronisation is performed on the device, including when the event and stream are from different device contexts. Requires cuda-3.2.
Wait until the event has been recorded.

Create a new stream.
Destroy a stream.
Check if all operations in the stream have completed.
Wait until the device has completed all operations in the Stream.

Function attributes.
A global device function.

Returns the value of the selected attribute requirement for the given kernel.

Specify the (x,y,z) dimensions of the thread blocks that are created when the given kernel function is launched.

Set the number of bytes of dynamic shared memory to be available to each thread block when the function is launched.

On devices where the L1 cache and shared memory use the same hardware resources, this sets the preferred cache configuration for the given device function. This is only a preference; the driver is free to choose a different configuration as required to execute the function. Switching between configuration modes may insert a device-side synchronisation point for streamed kernel launches.

Invoke the kernel on a size (w,h) grid of blocks. Each block contains the number of threads specified by a previous call to the block-shape function above. The launch may also be associated with a specific stream.

Invoke a kernel on a (gx * gy * gz) grid of blocks, where each block contains (tx * ty * tz) threads and has access to a given number of bytes of shared memory. The launch may also be associated with a specific stream. Here, the number of kernel parameters and their offsets and sizes do not need to be specified, as this information is retrieved directly from the kernel's image. This requires the kernel to have been compiled with toolchain version 3.2 or later.

The alternative form will pass the arguments in directly, requiring the application to know the size and alignment/padding of each kernel parameter. It likewise invokes a kernel on a (gx * gy * gz) grid of blocks, where each block contains (tx * ty * tz) threads and has access to a given number of bytes of shared memory, and may be associated with a specific stream.

Set the parameters that will be specified next time the kernel is invoked.

Parameters:
- Kernel function parameters
- function to execute
- block grid dimension
- thread block shape
- shared memory (bytes)
- (optional) stream to execute in
- list of function parameters

Look at the contents of device memory. This takes an IO action that will be applied to that pointer, the result of which is returned.
It would be silly to return the pointer from the action.

Return a unique handle associated with the given device pointer.
Return a device pointer from the given handle.
The null device pointer: the distinguished memory location that is not associated with a valid memory location.
Cast a device pointer from one type to another.
Advance the pointer address by the given offset in bytes.
Given an alignment constraint, align the device pointer to the next highest address satisfying the constraint.
Compute the difference between the second and first argument. This fulfils the relation p2 == p1 `plusDevPtr` (p2 `minusDevPtr` p1).
Advance a pointer into a device array by the given number of elements.

Apply an IO action to the memory reference living inside the host pointer object. All uses of the pointer should be inside the bracket.
The null host pointer: the distinguished memory location that is not associated with a valid memory location.
Cast a host pointer from one type to another.
Advance the pointer address by the given offset in bytes.
Given an alignment constraint, align the host pointer to the next highest address satisfying the constraint.
Compute the difference between the second and first argument.
Advance a pointer into a host array by a given number of elements.

Options for unified memory allocations.
Options for host allocation.

Allocate a section of linear memory on the host which is page-locked and directly accessible from the device. The storage is sufficient to hold the given number of elements of a storable type. The runtime system automatically accelerates calls to functions such as the memory-copy operations that refer to page-locked memory. Note that since the amount of pageable memory is thus reduced, overall system performance may suffer. This is best used sparingly to allocate staging areas for data exchange.

Free page-locked host memory previously allocated with the host allocation function.

Allocate a section of linear memory on the device, and return a reference to it. The memory is sufficient to hold the given number of elements of storable type. It is suitably aligned, and not cleared.

Execute a computation, passing a pointer to a temporarily allocated block of memory sufficient to hold the given number of elements of storable type. The memory is freed when the computation terminates (normally or via an exception), so the pointer must not be used after this. Note that kernel launches can be asynchronous, so you may need to add a synchronisation point at the end of the computation.

Free previously allocated memory on the device.

Allocates memory that will be automatically managed by the Unified Memory system.

Copy a number of elements from the device to host memory. This is a synchronous operation.
Copy memory from the device asynchronously, possibly associated with a particular stream. The destination memory must be page-locked.
Copy a 2D memory area from the device to the host. This is a synchronous operation.
Copy a 2D memory area from the device to the host asynchronously, possibly associated with a particular stream. The destination array must be page-locked.
Copy a number of elements from the device into a new Haskell list. Note that this requires two memory copies: firstly from the device into a heap-allocated array, and from there marshalled into a list.
Copy a number of elements onto the device. This is a synchronous operation.
Copy memory onto the device asynchronously, possibly associated with a particular stream. The source memory must be page-locked.
Copy a 2D memory area onto the device. This is a synchronous operation.
Copy a 2D memory area onto the device asynchronously, possibly associated with a particular stream. The source array must be page-locked.

Write a list of storable elements into a device array. The array must be sufficiently large to hold the entire list. This requires two marshalling operations.

Copy the given number of elements from the first device array (source) to the second (destination). The copied areas may not overlap. This operation is asynchronous with respect to the host, but will not overlap other device operations.

Copy the given number of elements from the first device array (source) to the second (destination). The copied areas may not overlap. This operation is asynchronous with respect to the host, and may be associated with a particular stream.

Copy a 2D memory area from the first device array (source) to the second (destination). The copied areas may not overlap. This operation is asynchronous with respect to the host, but will not overlap other device operations.

Copy a 2D memory area from the first device array (source) to the second device array (destination). The copied areas may not overlap. This operation is asynchronous with respect to the host, and may be associated with a particular stream.

Copy data between the host and device asynchronously, possibly associated with a particular stream. The host-side memory must be page-locked.
Copy a 2D memory area between the host and device. This is a synchronous operation.
Copy a 2D memory area between the host and device asynchronously, possibly associated with a particular stream. The host-side memory must be page-locked.

Write a list of storable elements into a newly allocated device array, returning the device pointer together with the number of elements that were written. Note that this requires two copy operations: firstly from a Haskell list into a heap-allocated array, and from there into device memory. The array should be freed when no longer required.

Write a list of storable elements into a newly allocated device array. This is the previous function composed with the allocation step.

Temporarily store a list of elements into a newly allocated device array. An IO action is applied to the array, the result of which is returned. Similar to the above, this requires two marshalling operations of the data. As with the bracketed allocation function, the memory is freed once the action completes, so you should not return the pointer from the action, and be sure that any asynchronous operations (such as kernel execution) have completed.

A variant of the above which also supplies the number of elements in the array to the applied function.

Initialise device memory to a given 8-bit value.

Copy data between host and device. This is a synchronous operation.

Parameters (2D copies):
- width to copy (elements)
- height to copy (elements)
- source array
- source array width
- destination array
- destination array width

Parameters (1D copies):
- destination
- source
- number of elements

Parameters (memset):
- The device memory
- Number of bytes
- Value to set for each byte

A description of how memory read through the texture cache should be interpreted, including the kind of data and the number of bits of each component (x, y, z and w, respectively).

Texture channel format kind.
Access texture using normalised coordinates [0.0,1.0).
A texture reference.

Bind the memory area associated with the device pointer to a texture reference given by the named symbol. Any previously bound references are unbound.

Bind the two-dimensional memory area to the texture reference associated with the given symbol. The size of the area is constrained by (width,height) in texel units, and the row pitch in bytes. Any previously bound references are unbound.

Returns the texture reference associated with the given symbol.

Texture filtering mode.
Texture addressing mode.

Options for unified memory allocations.
Options for host allocation.

Allocate a section of linear memory on the host which is page-locked and directly accessible from the device. The storage is sufficient to hold the given number of elements of a storable type. Note that since the amount of pageable memory is thus reduced, overall system performance may suffer. This is best used sparingly to allocate staging areas for data exchange.

Free a section of page-locked host memory.

Page-locks the specified array (on the host) and maps it for the device(s) as specified by the given allocation flags. Subsequently, the memory is accessed directly by the device and so can be read and written with much higher bandwidth than pageable memory that has not been registered. The memory range is added to the same tracking mechanism as the host allocation function, to automatically accelerate calls to functions such as the memory-copy operations. Note that page-locking excessive amounts of memory may degrade system performance, since it reduces the amount of pageable memory available.
This is best used sparingly to allocate staging areas for data exchange. This function is not yet implemented on Mac OS X. Requires cuda-4.0.

Unmaps the memory from the given pointer, and makes it pageable again. This function is not yet implemented on Mac OS X. Requires cuda-4.0.

Allocate a section of linear memory on the device, and return a reference to it. The memory is sufficient to hold the given number of elements of storable type. It is suitably aligned for any type, and is not cleared.

Execute a computation on the device, passing a pointer to a temporarily allocated block of memory sufficient to hold the given number of elements of storable type. The memory is freed when the computation terminates (normally or via an exception), so the pointer must not be used after this. Note that kernel launches can be asynchronous, so you may want to add a synchronisation point using sync as part of the computation.

Release a section of device memory.

Allocates memory that will be automatically managed by the Unified Memory system.

Copy a number of elements from the device to host memory. This is a synchronous operation.
Copy memory from the device asynchronously, possibly associated with a particular stream. The destination host memory must be page-locked.
Copy a 2D array from the device to the host.
Copy a 2D array from the device to the host asynchronously, possibly associated with a particular execution stream. The destination host memory must be page-locked.
Copy a number of elements from the device into a new Haskell list. Note that this requires two memory copies: firstly from the device into a heap-allocated array, and from there marshalled into a list.
Copy a number of elements onto the device. This is a synchronous operation.
Copy memory onto the device asynchronously, possibly associated with a particular stream. The source host memory must be page-locked.
Copy a 2D array from the host to the device.
Copy a 2D array from the host to the device asynchronously, possibly associated with a particular execution stream. The source host memory must be page-locked.

Write a list of storable elements into a device array. The device array must be sufficiently large to hold the entire list. This requires two marshalling operations.

Copy the given number of elements from the first device array (source) to the second device array (destination). The copied areas may not overlap. This operation is asynchronous with respect to the host, but will never overlap with kernel execution.

Copy the given number of elements from the first device array (source) to the second device array (destination). The copied areas may not overlap. The operation is asynchronous with respect to the host, and can be asynchronous to other device operations by associating it with a particular stream.

Copy a 2D array from the first device array (source) to the second device array (destination). The copied areas must not overlap. This operation is asynchronous with respect to the host, but will never overlap with kernel execution.

Copy a 2D array from the first device array (source) to the second device array (destination). The copied areas may not overlap. The operation is asynchronous with respect to the host, and can be asynchronous to other device operations by associating it with a particular execution stream.

Copies an array from device memory in one context to device memory in another context. Note that this function is asynchronous with respect to the host, but serialised with respect to all pending and future asynchronous work in the source and destination contexts. To avoid this synchronisation, use the asynchronous variant instead.

Copies from device memory in one context to device memory in another context. Note that this function is asynchronous with respect to the host and all work in other streams and devices.

Write a list of storable elements into a newly allocated device array, returning the device pointer together with the number of elements that were written. Note that this requires two memory copies: firstly from a Haskell list to a heap-allocated array, and from there onto the graphics device. The memory should be freed when no longer required.

Write a list of storable elements into a newly allocated device array. This is the previous function composed with the allocation step.

Temporarily store a list of elements into a newly allocated device array. An IO action is applied to the array, the result of which is returned. Similar to the above, this requires copying the data twice. As with the bracketed allocation function, the memory is freed once the action completes, so you should not return the pointer from the action, and be wary of asynchronous kernel execution.

A variant of the above which also supplies the number of elements in the array to the applied function.

Set a number of data elements to the specified value, which may be either 8-, 16-, or 32-bits wide.

Set the number of data elements to the specified value, which may be either 8-, 16-, or 32-bits wide. The operation is asynchronous and may optionally be associated with a stream.
Requires cuda-3.2.

getDevicePtr: Return the device pointer associated with a mapped, pinned host buffer, which was allocated with the DeviceMapped option by mallocHostArray. Currently, no options are supported and this list must be empty.

getBasePtr: Return the base address and allocation size of the given device pointer.

getMemInfo: Return the amount of free and total memory respectively available to the current context (bytes).

The 2D copy operations take the following arguments: width to copy (elements), height to copy (elements), source array, source array width, source x-coordinate, source y-coordinate, destination array, destination array width, destination x-coordinate, destination y-coordinate, and, for the asynchronous variants, the stream to associate with. The peer copy operations take the number of array elements, the source array and context, the destination array and context, and, for the asynchronous variant, the stream to associate with.
[2009..2014] Trevor L. McDonell, BSD

Foreign.CUDA.Driver.Texture: texture data formats, texture read mode options, texture reference filtering modes, texture reference addressing modes, and texture references.

create: Create a new texture reference. Once created, the application must call setPtr to associate the reference with allocated memory. Other texture reference functions are used to specify the format and interpretation to be used when the memory is read through this reference.

destroy: Destroy a texture reference.

bind: Bind a linear array address of the given size (bytes) as a texture reference. Any previously bound references are unbound.

bind2D: Bind a linear address range to the given texture reference as a two-dimensional arena. Any previously bound reference is unbound. Note that calls to bind can not follow a call to bind2D for the same texture reference.

getAddressMode: Get the addressing mode used by a texture reference, corresponding to the given dimension (currently the only supported dimension values are 0 or 1).

getFilterMode: Get the filtering mode used by a texture reference.

getFormat: Get the data format and number of channel components of the bound texture.

setAddressMode: Specify the addressing mode for the given dimension of a texture reference.

setFilterMode: Specify the filtering mode to be used when reading memory through a texture reference.

setReadMode: Specify additional characteristics for reading and indexing the texture reference.

setFormat: Specify the format of the data and number of packed components per element to be read by the texture reference.
[2009..2014] Trevor L. McDonell, BSD

Foreign.CUDA.Driver.Module

JITResult: Results of online compilation: milliseconds spent compiling PTX, information about PTX assembly, and the compilation error log or compiled module.

JITOption: Just-in-time compilation options: verbose log messages (requires cuda >= 5.5); generate line number information (-lineinfo) (requires cuda >= 5.5); generate debug info (-g) (requires cuda >= 5.5); fallback strategy if matching cubin not found; compilation target, otherwise determined from context; level of optimisation to apply (1-4, default 4); number of threads per block to target for; maximum number of registers per thread.

Module: A reference to a Module object, containing collections of device functions.

getFun: Returns a function handle.

getPtr: Return a global pointer, and size of the global (in bytes).

getTex: Return a handle to a texture reference.

loadFile: Load the contents of the specified file (either a ptx or cubin file) to create a new module, and load that module into the current context.

loadData: Load the contents of the given image into a new module, and load that module into the current context. The image is (typically) the contents of a cubin or PTX file. Note that the ByteString will be copied into a temporary staging area so that it can be passed to C.

loadDataFromPtr: As loadData, but read the image data from the given pointer. The image is a NULL-terminated sequence of bytes.

loadDataEx: Load the contents of the given image into a module with online compiler options, and load the module into the current context. The image is (typically) the contents of a cubin or PTX file. The actual attributes of the compiled kernel can be probed using requires. Note that the ByteString will be copied into a temporary staging area so that it can be passed to C.

loadDataFromPtrEx: As loadDataEx, but read the image data from the given pointer. The image is a NULL-terminated sequence of bytes.

unload: Unload a module from the current context.
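A typical use of the module functions above is sketched below. The file name "kernel.ptx", the kernel name "add", and the launch configuration are placeholders, and the sketch assumes an active driver context has already been created:

```
import Foreign.CUDA.Driver.Module (loadFile, getFun, unload)
import Foreign.CUDA.Driver.Exec   (launchKernel, FunParam(..))

main :: IO ()
main = do
  mdl <- loadFile "kernel.ptx"      -- load a PTX image into the context
  fun <- getFun mdl "add"           -- look up a kernel by name
  -- launch a 128-block grid of 256-thread blocks, no dynamic shared
  -- memory, on the default stream, passing one integer argument
  launchKernel fun (128,1,1) (256,1,1) 0 Nothing [IArg 1024]
  unload mdl
```

loadDataEx follows the same pattern but takes the image as a ByteString together with a list of JITOption values, returning a JITResult from which the compiled module and info log can be extracted.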
Package: cuda-0.6.6.2

Modules: Foreign.CUDA.Driver.Error, Foreign.CUDA.Driver.Utils, Foreign.CUDA.Runtime.Error, Foreign.CUDA.Runtime.Utils, Foreign.CUDA.Analysis.Device, Foreign.CUDA.Analysis.Occupancy, Foreign.CUDA.Runtime.Device, Foreign.CUDA.Driver.Device, Foreign.CUDA.Driver.Context, Foreign.CUDA.Types, Foreign.CUDA.Runtime.Event, Foreign.CUDA.Runtime.Stream, Foreign.CUDA.Runtime.Exec, Foreign.CUDA.Driver.Event, Foreign.CUDA.Driver.Stream, Foreign.CUDA.Driver.Exec, Foreign.CUDA.Ptr, Foreign.CUDA.Runtime.Marshal, Foreign.CUDA.Runtime.Texture, Foreign.CUDA.Driver.Marshal, Foreign.CUDA.Driver.Texture, Foreign.CUDA.Driver.Module, Foreign.CUDA.Internal.Offsets, Foreign.CUDA.Internal.C2HS, Foreign.CUDA.Analysis, Foreign.CUDA.Driver, Foreign.CUDA.Runtime, Foreign.CUDA