Haddock documentation recovered from the cuda Haskell bindings.
(c) [2009..2012] Trevor L. McDonell, BSD licence.

Internal utilities
  * Instance for special-casing null pointers.
  * Given a bit pattern, yield all bit masks that it contains. This does not
    attempt to compute a minimal set of bit masks that, when combined, yield
    the bit pattern; instead, all contained bit masks are produced.
  * Integral conversion; floating conversion.
  * Obtain a C value from Haskell, and a Haskell value from a C value.
  * Convert a C enumeration to Haskell, and a Haskell enumeration to C.

Error handling (driver API)
  * Return a descriptive error string associated with a particular error code.
  * Raise a CUDAException in the IO monad.
  * A specially formatted error message.
  * Return the result of a function on successful execution, otherwise throw
    an exception with the error string associated with the return code.
  * Throw an exception with the error string associated with an unsuccessful
    return code, otherwise return unit.

Versions (driver API)
  * Return the version number of the installed CUDA driver.

Streams (driver API)
  * Possible option flags for stream initialisation. Dummy instance until the
    API exports actual option values.
  * A processing stream.
  * Create a new stream; destroy a stream.
  * Check if all operations in the stream have completed.
  * Wait until the device has completed all operations in the stream.

Events (driver API)
  * Event creation flags; possible option flags for waiting for events.
  * Create a new event; destroy an event.
  * Determine the elapsed time (in milliseconds) between two events.
  * Determine if an event has actually been recorded.
  * Record an event once all operations in the current context (or optionally
    specified stream) have completed. This operation is asynchronous.
  * Make all future work submitted to the (optional) stream wait until the
    given event reports completion before beginning execution. Requires
    cuda-3.2.
  * Wait until the event has been recorded.

Error handling (runtime API)
  * Return codes from API functions; raise an exception in the IO monad; a
    specially formatted error message.
  * Return the descriptive string associated with a particular error code.
  * Return the result of a function on successful execution, otherwise the
    error string associated with the return code.
  * Return the error string associated with an unsuccessful return code,
    otherwise Nothing.

Versions (runtime API)
  * Return the version number of the installed CUDA driver.
  * Return the version number of the installed CUDA runtime.

Streams (runtime API)
  * A processing stream; the main execution stream (0).
  * Create a new asynchronous stream; destroy and clean up an asynchronous
    stream.
  * Determine if all operations in a stream have completed.
  * Block until all operations in a stream have been completed.
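The bit-mask extraction described above (all contained masks, not a minimal covering set) can be sketched as a small pure function. The helper name `extractBitMasks` is illustrative and not part of the library's exported API:

```haskell
import Data.Bits (bit, testBit, finiteBitSize)
import Data.Word (Word32)

-- Yield every single-bit mask contained in the given bit pattern.
-- No attempt is made to compute a minimal covering set: each set bit
-- contributes its own mask, in order of increasing significance.
extractBitMasks :: Word32 -> [Word32]
extractBitMasks x = [ bit i | i <- [0 .. finiteBitSize x - 1], testBit x i ]

main :: IO ()
main = print (extractBitMasks 0x16)   -- bits 1, 2 and 4 set: [2,4,16]
```

Combining the produced masks with a bitwise-or recovers the original pattern, which is all the enumeration-flag marshalling above requires.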
Kernel execution (runtime API)
  * Kernel function parameters. Doubles will be converted to an internal
    float representation on devices that do not support doubles natively.
  * Cache configuration preference.
  * Maximum block size that can be successfully launched (based on register
    usage), and the number of registers required for each thread.
  * A global device function. Note that the use of a string naming a function
    was deprecated in CUDA 4.1 and removed in CUDA 5.0.
  * Obtain the attributes of the named global device function. This itemises
    the requirements to successfully launch the given kernel.
  * Specify the grid and block dimensions for a device call. Used in
    conjunction with the argument-setting function, this pushes data onto the
    execution stack that will be popped when a function is launched.
  * Set the argument parameters that will be passed to the next kernel
    invocation. This is used in conjunction with the configuration-setting
    function to control kernel execution.
  * On devices where the L1 cache and shared memory use the same hardware
    resources, set the preferred cache configuration for the given device
    function. This is only a preference; the driver is free to choose a
    different configuration as required to execute the function. Switching
    between configuration modes may insert a device-side synchronisation
    point for streamed kernel launches.
  * Invoke the global kernel function on the device. This must be preceded by
    a call to set the launch configuration and (if appropriate) the kernel
    arguments.
  * Invoke a kernel on a (gx * gy) grid of blocks, where each block contains
    (tx * ty * tz) threads and has access to a given number of bytes of
    shared memory. The launch may also be associated with a specific stream.
    Arguments: device function symbol; grid dimensions; thread block shape;
    shared memory per block (bytes); (optional) execution stream.

Events (runtime API)
  * Option flags for waiting for events; event creation flags.
  * Create a new event; destroy an event.
  * Determine the elapsed time (in milliseconds) between two events.
  * Determine if an event has actually been recorded.
  * Record an event once all operations in the current context (or optionally
    specified stream) have completed. This operation is asynchronous.
  * Make all future work submitted to the (optional) stream wait until the
    given event reports completion before beginning execution. Requires
    cuda-3.2.
  * Wait until the event has been recorded.

Device properties
  * Hardware resource limits, per multiprocessor: warp size; maximum number
    of in-flight threads; maximum number of resident thread blocks; maximum
    number of in-flight warps; number of SIMD arithmetic units; total amount
    of shared memory (bytes); shared memory allocation unit size (bytes);
    total number of registers; register allocation unit size; register
    allocation granularity for warps; maximum number of registers per thread;
    how multiprocessor resources are divided.
  * PCI bus ID, PCI device ID, and PCI domain ID of the device.
  * The properties of a compute device: identifier; supported compute
    capability; available global memory on the device in bytes; available
    constant memory in bytes; available shared memory per block in bytes;
    32-bit registers per block; warp size in threads (SIMD width); maximum
    number of threads per block; maximum number of threads per
    multiprocessor; maximum size of each dimension of a block; maximum size
    of each dimension of a grid; maximum texture dimensions; clock frequency
    in kilohertz; number of multiprocessors on the device; maximum pitch in
    bytes allowed by memory copies; global memory bus width in bits; peak
    memory clock frequency in kilohertz; alignment requirement for textures;
    whether the device can concurrently copy memory and execute a kernel;
    whether the device can possibly execute multiple kernels concurrently;
    whether the device supports and has enabled error correction; number of
    asynchronous engines; size of the L2 cache in bytes; whether this is a
    Tesla device using the TCC driver; PCI device information; whether there
    is a runtime limit on kernels; whether the device is integrated (as
    opposed to discrete); whether the device can use pinned memory; whether
    the device shares a unified address space with the host; the compute
    mode the device is currently in.
  * Extract some additional hardware resource limitations for a given device.
  * GPU compute capability: major and minor revision number respectively.

Occupancy
  * Active threads per multiprocessor; active thread blocks per
    multiprocessor; active warps per multiprocessor; occupancy of each
    multiprocessor (percent).
  * Calculate occupancy data for a given GPU and kernel resource usage.
  * Optimise multiprocessor occupancy as a function of thread block size and
    resource usage. This returns the smallest satisfying block size in
    increments of a single warp.
  * A variant takes a generator that produces the specific thread block sizes
    that should be tested. The generated list can produce values in any
    order, but the last satisfying block size will be returned.
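The occupancy calculation above can be approximated with a pure model: the number of resident blocks per multiprocessor is the minimum of the limits imposed by threads, registers, shared memory, and the hard block cap. This is an illustrative simplification (the type and field names here are invented, and the library's real calculation also honours allocation unit sizes and warp granularity):

```haskell
-- Hypothetical per-multiprocessor limits; not the library's own types.
data SMLimits = SMLimits
  { maxThreadsPerSM :: Int   -- maximum in-flight threads
  , maxBlocksPerSM  :: Int   -- maximum resident thread blocks
  , regsPerSM       :: Int   -- total 32-bit registers
  , smemPerSM       :: Int   -- total shared memory (bytes)
  }

-- Resident blocks, limited by whichever resource is exhausted first.
activeBlocks :: SMLimits -> Int -> Int -> Int -> Int
activeBlocks sm threadsPerBlock regsPerThread smemPerBlock =
  minimum
    [ maxBlocksPerSM sm
    , maxThreadsPerSM sm `div` threadsPerBlock
    , if regsPerThread > 0
        then regsPerSM sm `div` (regsPerThread * threadsPerBlock)
        else maxBlocksPerSM sm
    , if smemPerBlock > 0
        then smemPerSM sm `div` smemPerBlock
        else maxBlocksPerSM sm
    ]

-- Occupancy as a fraction of the maximum in-flight thread count.
occupancy :: SMLimits -> Int -> Int -> Int -> Double
occupancy sm tpb rpt spb =
  fromIntegral (activeBlocks sm tpb rpt spb * tpb)
    / fromIntegral (maxThreadsPerSM sm)

main :: IO ()
main = do
  let sm = SMLimits 1536 8 32768 49152   -- made-up but plausible limits
  print (activeBlocks sm 256 20 4096)    -- register- and thread-limited: 6
  print (occupancy sm 256 20 4096)       -- 6 * 256 / 1536 = 1.0
```

Searching this function over candidate block sizes is exactly what the generator-driven optimiser described above does.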
  * Because the last satisfying block size is returned, generated values
    should be monotonically decreasing to return the smallest block size
    yielding maximum occupancy, and vice versa.
  * Generators: increments and decrements in powers of two, over the range of
    supported thread block sizes for the given device; increments and
    decrements in the warp size of the device, over the same range.
  * Determine the maximum number of CTAs that can be run simultaneously for a
    given kernel / device combination. Arguments: properties of the card in
    question; threads per block; registers per thread; shared memory per
    block (bytes). The result is the maximum number of resident blocks.
  * The generator-based optimiser additionally takes: the architecture to
    optimise for; register count as a function of thread block size; shared
    memory usage (bytes) as a function of thread block size.

Device management (runtime API)
  * Device limit flags; possible option values for direct peer memory access;
    device execution flags; a device identifier.
  * Select the compute device which best matches the given criteria.
  * Return which device is currently being used.
  * Return the number of devices available for execution, with compute
    capability >= 1.0.
  * Return information about the selected compute device.
  * Set the device to be used for GPU execution.
  * Set flags to be used for device executions.
  * Set the list of devices for CUDA execution in priority order.
  * Block until the device has completed all preceding requested tasks.
    Returns an error if one of the tasks fails.
  * Explicitly destroy and clean up all runtime resources associated with the
    current device in the current process. Any subsequent API call will
    reinitialise the device. Note that this function will reset the device
    immediately. It is the caller's responsibility to ensure that the device
    is not being accessed by any other host threads from the process when
    this function is called.
  * Query whether the first device can directly access the memory of the
    second. If direct access is possible, it can then be enabled. Requires
    cuda-4.0.
  * If the devices of both the current and supplied contexts support unified
    addressing, enable allocations in the supplied context to be accessible
    by the current context. Requires cuda-4.0.
  * Disable direct memory access from the current context to the supplied
    context. Requires cuda-4.0.
  * Query and set compute 2.0 call stack limits. Requires cuda-3.1.

Initialisation and device management (driver API)
  * Possible option flags for CUDA initialisation. Dummy instance until the
    API exports actual option values.
  * Device attributes.
  * Initialise the CUDA driver API. Must be called before any other driver
    function.
  * Return a device handle; return the selected attribute for the given
    device; return the number of devices with compute capability >= 1.0.
  * Name of the device; return the properties of the selected device; total
    memory available on the device (bytes).
  * Return the compute capability revision supported by the device.

Contexts (driver API)
  * Possible option values for direct peer memory access; device cache
    configuration preference; device limit flags; context creation flags; a
    device context.
  * Create a new CUDA context and associate it with the calling thread.
  * Increment the usage count of the context. API: no context flags are
    currently supported, so this parameter must be empty.
  * Detach the context, and destroy it if no longer used.
  * Destroy the specified context. This fails if the context has more than a
    single attachment (including that from initial creation).
  * Return the context bound to the calling CPU thread. Requires cuda-4.0.
  * Bind the specified context to the calling thread. Requires cuda-4.0.
  * Return the device of the currently active context.
  * Pop the current CUDA context from the CPU thread. The context must have a
    single usage count (matching attach and detach calls). If successful, the
    new context is returned, and the old may be attached to a different CPU.
  * Push the given context onto the CPU's thread stack of current contexts.
    The context must be floating (via a pop), i.e. not attached to any
    thread.
  * Block until the device has completed all preceding requests.
  * Query whether the first device can directly access the memory of the
    second; enable allocations in a supplied context to be accessible by the
    current context when both support unified addressing; disable direct
    memory access from the current context to the supplied context. Requires
    cuda-4.0.
  * Query compute 2.0 call stack limits; specify the size of the call stack,
    for compute 2.0 devices. Requires cuda-3.1.
  * On devices where the L1 cache and shared memory use the same hardware
    resources, set the preferred cache configuration for the current
    context. This is only a preference. Requires cuda-3.2.

Kernel execution (driver API)
  * Function attributes; a global device function.
  * Return the value of the selected attribute requirement for the given
    kernel.
  * Specify the (x,y,z) dimensions of the thread blocks that are created when
    the given kernel function is launched.
  * Set the number of bytes of dynamic shared memory to be available to each
    thread block when the function is launched.
  * On devices where the L1 cache and shared memory use the same hardware
    resources, set the preferred cache configuration for the given device
    function. This is only a preference; the driver is free to choose a
    different configuration as required to execute the function. Switching
    between configuration modes may insert a device-side synchronisation
    point for streamed kernel launches.
  * Invoke the kernel on a size (w,h) grid of blocks. Each block contains the
    number of threads specified by a previous call to set the block shape.
    The launch may also be associated with a specific stream.
  * Invoke a kernel on a (gx * gy * gz) grid of blocks, where each block
    contains (tx * ty * tz) threads and has access to a given number of
    bytes of shared memory. The launch may also be associated with a
    specific stream. The number of kernel parameters and their offsets and
    sizes do not need to be specified, as this information is retrieved
    directly from the kernel's image.
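A common preliminary to the grid-of-blocks launches described above is choosing the grid dimension so that gridDim * blockDim covers the problem size, i.e. ceiling division. A tiny pure sketch (the helper name is illustrative, not part of the bindings):

```haskell
-- Number of thread blocks needed so that gridDimFor n b * b >= n,
-- i.e. ceiling division of the element count by the block size.
gridDimFor :: Int -> Int -> Int
gridDimFor n blockDim = (n + blockDim - 1) `div` blockDim

main :: IO ()
main = do
  print (gridDimFor 1000 256)   -- 4 blocks of 256 threads cover 1000 elements
  print (gridDimFor 1024 256)   -- exactly 4, no remainder
```

Kernels launched this way typically guard against the over-allocated tail with an in-kernel bounds check on the global thread index.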
    This requires the kernel to have been compiled with toolchain version 3.2
    or later. The alternative launch function passes the arguments in
    directly, requiring the application to know the size and
    alignment/padding of each kernel parameter.
  * Set the parameters that will be specified next time the kernel is
    invoked.
  * Kernel function parameters. Launch arguments: function to execute; block
    grid dimension; thread block shape; shared memory (bytes); (optional)
    stream to execute in; list of function parameters.

Device and host pointers
  * A reference to page-locked host memory. A HostPtr is just a plain Ptr,
    but the memory has been allocated by the CUDA runtime into page-locked
    memory. This means that the data can be copied to the GPU via DMA
    (direct memory access). Note that use of the system function mlock is
    not sufficient here --- the CUDA version ensures that the physical
    addresses stay the same, not just the virtual addresses. To copy data
    into a HostPtr array, you may use the with-style bracket together with
    the standard array peek and poke operations.
  * A reference to data stored on the device.
  * Look at the contents of device memory. This takes an IO action that will
    be applied to that pointer, the result of which is returned. It would be
    silly to return the pointer from the action.
  * Return a unique handle associated with the given device pointer, and
    return a device pointer from the given handle.
  * A distinguished constant contains the memory location that is not
    associated with a valid memory location.
  * Cast a device pointer from one type to another.
  * Advance the pointer address by the given offset in bytes.
  * Given an alignment constraint, align the device pointer to the next
    highest address satisfying the constraint.
  * Compute the difference between the second and first argument. This
    fulfils the relation p2 == p1 `plusDevPtr` (p2 `minusDevPtr` p1).
  * Advance a pointer into a device array by the given number of elements.
  * Apply an IO action to the memory reference living inside the host
    pointer object. All uses of the pointer should be inside this bracket.
  * Corresponding operations are provided for host pointers: a distinguished
    null host pointer; casting from one type to another; advancing by a byte
    offset; aligning to the next highest address satisfying a constraint;
    computing the difference between the second and first argument; and
    advancing into a host array by a given number of elements.

Memory management (runtime API)
  * Options for unified memory allocations; options for host allocation.
  * Allocate a section of linear memory on the host which is page-locked and
    directly accessible from the device. The storage is sufficient to hold
    the given number of elements of a storable type. The runtime system
    automatically accelerates memory-copy calls referencing page-locked
    memory. Note that since the amount of pageable memory is thereby
    reduced, overall system performance may suffer. This is best used
    sparingly to allocate staging areas for data exchange.
  * Free page-locked host memory previously allocated with mallocHost.
  * Allocate a section of linear memory on the device, and return a
    reference to it. The memory is sufficient to hold the given number of
    elements of storable type. It is suitably aligned, and not cleared.
  * Execute a computation, passing a pointer to a temporarily allocated
    block of memory sufficient to hold the given number of elements of
    storable type.
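The alignment operation above rounds an address up to the next multiple of a power-of-two constraint. Over raw integer addresses (rather than the bindings' pointer types) the arithmetic looks like this; `alignUp` is an illustrative name:

```haskell
import Data.Bits ((.&.), complement)

-- Round an address up to the next multiple of a power-of-two alignment,
-- in the spirit of the device/host pointer alignment functions above.
alignUp :: Int -> Int -> Int
alignUp addr align = (addr + align - 1) .&. complement (align - 1)

main :: IO ()
main = do
  print (alignUp 1000 256)   -- 1024
  print (alignUp 1024 256)   -- already aligned: 1024
```

Note the stated relation p2 == p1 `plusDevPtr` (p2 `minusDevPtr` p1) is the pointer-typed analogue of the trivial identity p2 == p1 + (p2 - p1) over these raw addresses.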
    The memory is freed when the computation terminates (normally or via an
    exception), so the pointer must not be used after this. Note that kernel
    launches can be asynchronous, so you may need to add a synchronisation
    point at the end of the computation.
  * Free previously allocated memory on the device.
  * Allocate memory that will be automatically managed by the Unified Memory
    system.
  * Copy a number of elements from the device to host memory. This is a
    synchronous operation; an asynchronous variant may be associated with a
    particular stream, and its destination memory must be page-locked.
  * Copy a 2D memory area from the device to the host, synchronously or
    asynchronously; the asynchronous variant's destination array must be
    page-locked.
  * Copy a number of elements from the device into a new Haskell list. Note
    that this requires two memory copies: firstly from the device into a
    heap-allocated array, and from there marshalled into a list.
  * Copy a number of elements onto the device. This is a synchronous
    operation; an asynchronous variant may be associated with a particular
    stream, and its source memory must be page-locked.
  * Copy a 2D memory area onto the device, synchronously or asynchronously;
    the asynchronous variant's source array must be page-locked.
  * Write a list of storable elements into a device array. The array must be
    sufficiently large to hold the entire list. This requires two
    marshalling operations.
  * Copy the given number of elements from the first device array (source)
    to the second (destination). The copied areas may not overlap. This
    operation is asynchronous with respect to the host, but will not overlap
    other device operations; an explicitly asynchronous variant may be
    associated with a particular stream.
  * Copy a 2D memory area from the first device array (source) to the second
    (destination), with synchronous and stream-associated asynchronous
    variants. The copied areas may not overlap.
  * Copy data between the host and device, synchronously or asynchronously;
    for asynchronous transfers the host-side memory must be page-locked. 2D
    variants are also provided, with the same requirement.
  * Write a list of storable elements into a newly allocated device array,
    returning the device pointer together with the number of elements that
    were written. Note that this requires two copy operations: firstly from
    a Haskell list into a heap-allocated array, and from there into device
    memory. The array should be freed when no longer required. A variant
    discards the element count.
  * Temporarily store a list of elements into a newly allocated device
    array. An IO action is applied to the array, the result of which is
    returned; this requires two marshalling operations of the data. The
    memory is freed once the action completes, so you should not return the
    pointer from the action, and be sure that any asynchronous operations
    (such as kernel execution) have completed. A variant also supplies the
    number of elements in the array to the applied function.
  * Initialise device memory to a given 8-bit value. Arguments: the device
    memory; number of bytes; value to set for each byte.
  * The 2D copy operations take: width to copy (elements); height to copy
    (elements); source array; source array width; destination array;
    destination array width; and, for asynchronous variants, the stream to
    associate with. The 1D copy operations take: destination; source; number
    of elements.

Textures (runtime API)
  * A description of how memory read through the texture cache should be
    interpreted, including the kind of data and the number of bits of each
    component (x, y, z and w, respectively).
  * Texture channel format kind; texture filtering mode; texture addressing
    mode; a flag to access the texture using normalised coordinates
    [0.0,1.0); a texture reference.
  * Bind the memory area associated with the device pointer to a texture
    reference given by the named symbol. Any previously bound references are
    unbound.
  * Bind a two-dimensional memory area to the texture reference associated
    with the given symbol. The size of the area is constrained by
    (width,height) in texel units, and the row pitch in bytes.
    Any previously bound references are unbound.
  * Return the texture reference associated with the given symbol.
  * Texture filtering mode; texture addressing mode.

Memory management (driver API)
  * Options for unified memory allocations; options for host allocation.
  * Allocate a section of linear memory on the host which is page-locked and
    directly accessible from the device. The storage is sufficient to hold
    the given number of elements of a storable type. Note that since the
    amount of pageable memory is thereby reduced, overall system performance
    may suffer. This is best used sparingly to allocate staging areas for
    data exchange.
  * Free a section of page-locked host memory.
  * Page-lock the specified array (on the host) and map it for the device(s)
    as specified by the given allocation flags. Subsequently, the memory is
    accessed directly by the device, so it can be read and written with much
    higher bandwidth than pageable memory that has not been registered. The
    memory range is added to the same tracking mechanism as the page-locked
    allocator, to automatically accelerate memory-copy calls. Note that
    page-locking excessive amounts of memory may degrade system performance,
    since it reduces the amount of pageable memory available. This is best
    used sparingly to allocate staging areas for data exchange. This
    function is not yet implemented on Mac OS X. Requires cuda-4.0.
  * Unmap the memory from the given pointer, and make it pageable again.
    This function is not yet implemented on Mac OS X. Requires cuda-4.0.
  * Allocate a section of linear memory on the device, and return a
    reference to it. The memory is sufficient to hold the given number of
    elements of storable type. It is suitably aligned for any type, and is
    not cleared.
  * Execute a computation on the device, passing a pointer to a temporarily
    allocated block of memory sufficient to hold the given number of
    elements of storable type. The memory is freed when the computation
    terminates (normally or via an exception), so the pointer must not be
    used after this. Note that kernel launches can be asynchronous, so you
    may want to add a synchronisation point using sync as part of the
    computation.
  * Release a section of device memory.
  * Allocate memory that will be automatically managed by the Unified Memory
    system.
  * Copy a number of elements from the device to host memory, synchronously
    or asynchronously; the asynchronous variant may be associated with a
    particular stream, and its destination host memory must be page-locked.
    2D variants are also provided, with the same requirement.
  * Copy a number of elements from the device into a new Haskell list. Note
    that this requires two memory copies: firstly from the device into a
    heap-allocated array, and from there marshalled into a list.
  * Copy a number of elements onto the device, synchronously or
    asynchronously; the asynchronous variant may be associated with a
    particular stream, and its source host memory must be page-locked. 2D
    variants are also provided.
  * Write a list of storable elements into a device array. The device array
    must be sufficiently large to hold the entire list. This requires two
    marshalling operations.
  * Copy the given number of elements from the first device array (source)
    to the second device array (destination). The copied areas may not
    overlap. This operation is asynchronous with respect to the host, but
    will never overlap with kernel execution.
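The list-marshalling operations above always copy twice: Haskell list to a heap-allocated array, then between host and device. With no GPU to hand, the first step can be demonstrated with the standard Foreign marshalling functions from base, which is exactly where a device transfer would slot in (a sketch, not the bindings' own code):

```haskell
import Foreign.Marshal.Array (withArray, peekArray)
import Data.Int (Int32)

-- List -> temporary C array -> list, mirroring the poke/peek pattern
-- described above. In the real bindings, a host<->device copy sits
-- between the two steps.
roundTrip :: [Int32] -> IO [Int32]
roundTrip xs = withArray xs $ \p -> peekArray (length xs) p

main :: IO ()
main = roundTrip [1,2,3] >>= print   -- [1,2,3]
```

As with the bindings' bracketed variants, `withArray` frees the temporary array when the action returns, so the pointer must not escape the bracket.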
The operation is asynchronous with respect to the host, and can be asynchronous to other device operations by associating it with a particular stream.lCopy a 2D array from the first device array (source) to the second device array (destination). The copied areas must not overlap. This operation is asynchronous with respect to the host, but will never overlap with kernel execution.m$Copy a 2D array from the first device array (source) to the second device array (destination). The copied areas may not overlap. The operation is asynchronous with respect to the host, and can be asynchronous to other device operations by associating it with a particular execution stream.n0Copies an array from device memory in one context to device memory in another context. Note that this function is asynchronous with respect to the host, but serialised with respect to all pending and future asynchronous work in the source and destination contexts. To avoid this synchronisation, use o instead.oCopies from device memory in one context to device memory in another context. Note that this function is asynchronous with respect to the host and all work in other streams and devices.p@Write a list of storable elements into a newly allocated device array, returning the device pointer together with the number of elements that were written. Note that this requires two memory copies: firstly from a Haskell list to a heap allocated array, and from there onto the graphics device. The memory should be ^d when no longer required.qPWrite a list of storable elements into a newly allocated device array. This is p composed with .rTemporarily store a list of elements into a newly allocated device array. An IO action is applied to to the array, the result of which is returned. 
Similar to newListArray, this requires copying the data twice. As with allocaArray, the memory is freed once the action completes, so you should not return the pointer from the action, and be wary of asynchronous kernel execution.

withListArrayLen: A variant of withListArray which also supplies the number of elements in the array to the applied function.

memset: Set a number of data elements to the specified value, which may be either 8, 16, or 32 bits wide.

memsetAsync: Set a number of data elements to the specified value, which may be either 8, 16, or 32 bits wide. The operation is asynchronous and may optionally be associated with a stream. Requires cuda-3.2.

getDevicePtr: Return the device pointer associated with a mapped, pinned host buffer, which was allocated with the DeviceMapped option by mallocHostArray. Currently, no options are supported and the flags must be empty.

getBasePtr: Return the base address and allocation size of the given device pointer.

getMemInfo: Return the amount of free and total memory respectively available to the current context (bytes).

The 2D copy operations take, in order: width to copy (elements), height to copy (elements), source array, source array width, source x-coordinate, source y-coordinate, destination array, destination array width, destination x-coordinate, destination y-coordinate, and, for the asynchronous variants, the stream to associate with. The peer copy operations take the number of array elements, the source array and context, the destination array and context, and, for the asynchronous variant, the stream to associate with.

Foreign.CUDA.Driver.Texture
(c) [2009..2012] Trevor L. McDonell, BSD

This module provides texture data formats, texture read mode options, texture reference filtering modes, and texture reference addressing modes.

create: Create a new texture reference. Once created, the application must call setPtr to associate the reference with allocated memory. Other texture reference functions are used to specify the format and interpretation to be used when the memory is read through this reference.

destroy: Destroy a texture reference.

setPtr: Bind a linear array address of the given size (bytes) as a texture reference. Any previously bound references are unbound.

The two-dimensional variant binds a linear address range to the given texture reference as a two-dimensional arena. Any previously bound reference is unbound. Note that calls to setPtr can not follow a call to the two-dimensional variant for the same texture reference.

getAddressMode: Get the addressing mode used by a texture reference, corresponding to the given dimension (currently the only supported dimension values are 0 or 1).

getFilterMode: Get the filtering mode used by a texture reference.

getFormat: Get the data format and number of channel components of the bound texture.

setAddressMode: Specify the addressing mode for the given dimension of a texture reference.

setFilterMode: Specify the filtering mode to be used when reading memory through a texture reference.

setReadMode: Specify additional characteristics for reading and indexing the texture reference.

setFormat: Specify the format of the data and number of packed components per element to be read by the texture reference.

Foreign.CUDA.Driver.Module
(c) [2009..2012] Trevor L. McDonell, BSD

JITResult: Results of online compilation: milliseconds spent compiling PTX, information about the PTX assembly, and the compilation error log or compiled module.

JITOption: Just-in-time compilation options: verbose log messages (requires cuda >= 5.5); generate line number information (-lineinfo) (requires cuda >= 5.5); generate debug info (-g) (requires cuda >= 5.5); fallback strategy if a matching cubin is not found; compilation target, otherwise determined from the context; level of optimisation to apply (1-4, default 4); number of threads per block to target for; maximum number of registers per thread.

Module: A reference to a Module object, containing collections of device functions.

getFun: Returns a function handle.

getPtr: Return a global pointer, and size of the global (in bytes).

getTex: Return a handle to a texture reference.

loadFile: Load the contents of the specified file (either a ptx or cubin file) to create a new module, and load that module into the current context.

loadData: Load the contents of the given image into a new module, and load that module into the current context. The image is (typically) the contents of a cubin or PTX file. Note that the ByteString will be copied into a temporary staging area so that it can be passed to C.

loadDataFromPtr: As loadData, but read the image data from the given pointer.
The image is a NULL-terminated sequence of bytes.

loadDataEx: Load the contents of the given image into a module with online compiler options, and load the module into the current context. The image is (typically) the contents of a cubin or PTX file. The actual attributes of the compiled kernel can be probed using requires. Note that the ByteString will be copied into a temporary staging area so that it can be passed to C.

loadDataFromPtrEx: As loadDataEx, but read the image data from the given pointer. The image is a NULL-terminated sequence of bytes.

unload: Unload a module from the current context.
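Putting the module interface together, a driver-API session typically initialises the API, creates a context, loads a module, and looks up a kernel handle. A minimal sketch follows (assuming the driver-level Context and Module interfaces of this package; the file name "kernels.ptx" and kernel name "saxpy" are placeholders, and error handling is elided):

```haskell
import Foreign.CUDA.Driver         (initialise, device)
import Foreign.CUDA.Driver.Context (create, destroy)
import Foreign.CUDA.Driver.Module  (loadFile, getFun, unload)

main :: IO ()
main = do
  initialise []                    -- must precede any other driver call
  dev <- device 0                  -- select the first CUDA device
  ctx <- create dev []             -- create (and make current) a context
  mdl <- loadFile "kernels.ptx"    -- load a pre-compiled PTX module
  _fn <- getFun mdl "saxpy"        -- look up a kernel handle by name
  -- ... configure and launch the kernel here ...
  unload mdl                       -- unload the module,
  destroy ctx                      -- then tear down the context
```

loadData or loadDataEx can be substituted for loadFile when the PTX or cubin image is already in memory, for example when it is embedded in the program or generated at runtime.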