Analogous to  . The type parameter r and its functional dependency are necessary since g must be a function of the form a -> ... -> c -> CodeGenFunction r d and we must ensure that the explicit r and the implicit r in g match.

This is an Applicative functor that registers which extensions are needed in order to run the contained instructions. You can escape from the functor by calling  and providing a generic implementation.

We use an applicative functor since with a monadic interface we would have to create the specialised code in every case, in order to see which extensions were used in the course of creating the instructions.

We use only one (unparameterized) type for all extensions, since this is the simplest solution. Alternatively we could use a type parameter where class constraints show what extensions are needed. This would be just like exceptions that are explicit in the type signature as in the control-monad-exception package. However, we would still need to lift all basic LLVM instructions to the new monad.

Declare that a certain plain LLVM instruction depends on a particular extension. This can be useful if you rely on the data layout of a certain architecture when doing a bitcast, or if you know that LLVM translates a certain generic operation to something especially optimal for the declared extension.

Create an intrinsic and register the needed extension. We cannot immediately check whether the signature matches or whether the right extension is given. However, when resolving intrinsics LLVM will not find the intrinsic if the extension is wrong, and it also checks the signature.

run generic specific generates the specific code if the required extensions are available on the host processor and generic otherwise.
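The argument for an applicative rather than a monadic interface can be illustrated with a toy model. The sketch below is not the llvm-extra implementation: the type `Ext` and the functions `requiring` and `runModel` are hypothetical names, and extensions are modelled as plain strings. It only shows that the list of needed extensions can be read off without evaluating the specialised value.

```haskell
-- Toy model; none of these definitions come from llvm-extra.
-- A value paired with the names of the extensions it needs.
data Ext a = Ext [String] a

instance Functor Ext where
  fmap f (Ext needed a) = Ext needed (f a)

instance Applicative Ext where
  pure = Ext []
  -- Combining two computations simply concatenates their requirements.
  Ext n1 f <*> Ext n2 a = Ext (n1 ++ n2) (f a)

-- Mark a plain value as depending on one extension (hypothetical helper).
requiring :: String -> a -> Ext a
requiring name = Ext [name]

-- Pick the specific value only if all needed extensions are available,
-- otherwise fall back to the generic one.
runModel :: [String] -> a -> Ext a -> a
runModel available generic (Ext needed specific)
  | all (`elem` available) needed = specific
  | otherwise = generic

-- Two "instructions" that need different extensions.
example :: Ext String
example = (++) <$> requiring "sse3" "haddps " <*> requiring "sse41" "dpps"
```

Because `<*>` cannot inspect the result of its first argument, the requirement list is fixed by the structure of the expression. A monadic bind could choose later instructions based on earlier results, so the requirements would only be known after building the code.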
Convenient variant of  : only run the code with extended instructions if an additional condition is given.

Only for debugging purposes.

An implementation of both  and  must ensure that haskellValue is compatible with llvmStruct. That is, writing and reading llvmStruct by LLVM must be the same as accessing haskellValue by  methods. We use a functional dependency in order to let type inference work nicely.

Adding the finalizer to a ForeignPtr seems to be the only way that warrants execution of the finalizer (not too early and not never). However, the normal ForeignPtr finalizers must be independent of the Haskell runtime. In contrast to ForeignPtr finalizers, addFinalizer adds finalizers to boxes, which are optimized away. Thus finalizers are run too early or not at all. Concurrent.ForeignPtr and using threaded execution is the only way to get finalizers in Haskell IO.

Returns a 16-byte aligned piece of memory. Otherwise the program crashes when vectors are part of the structure. I think that malloc in LLVM-2.5 and LLVM-2.6 is simply buggy. FIXME: Aligning to 16 bytes might not be appropriate for all vector types on all platforms. Maybe we should use the alignment of the Storable class in order to determine the right alignment.

This would also work for vectors, if LLVM supported select with bool vectors as condition.

An alternative to  where I try to persuade LLVM to use x86's LOOP instruction. Unfortunately it becomes even worse. LLVM developers say that x86 LOOP is actually slower than manual decrement, zero test and conditional branch.

This construct starts new blocks, so be prepared when continuing after an  .
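The remark above about finalizers written in Haskell can be tried outside of LLVM. The following is a minimal sketch using only the base library; `newCell` and `readCell` are hypothetical names chosen for illustration. It assumes that the module meant by "Concurrent.ForeignPtr" is what base exposes as `Foreign.Concurrent`, whose `newForeignPtr` accepts an arbitrary `IO ()` action as finalizer.

```haskell
import qualified Foreign.Concurrent as FC
import Data.Int (Int32)
import Foreign.ForeignPtr (ForeignPtr, withForeignPtr)
import Foreign.Marshal.Alloc (free, mallocBytes)
import Foreign.Storable (peek, poke, sizeOf)

-- Allocate one Int32 on the C heap and attach a finalizer written in Haskell.
-- The finalizer here only frees the memory, but it may be any IO action.
newCell :: Int32 -> IO (ForeignPtr Int32)
newCell x = do
  ptr <- mallocBytes (sizeOf x)
  poke ptr x
  FC.newForeignPtr ptr (free ptr)

-- Read the stored value; withForeignPtr keeps the pointer alive meanwhile.
readCell :: ForeignPtr Int32 -> IO Int32
readCell fp = withForeignPtr fp peek
```

The base documentation gives no guarantee that such finalizers run promptly, so this sketch should not be used where timely release matters.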
Branch-free variant of  that is faster if the enclosed block is very simple, say, if it contains at most two instructions. It can only be used as an alternative to  if the enclosed block is free of side effects.

Isomorphic to ReaderT (CodeGenFunction r z) (ContT z (CodeGenFunction r)) a, where the reader provides the block for  and the continuation part manages the  .

Counterpart to Data.Maybe.HT.toMaybe.

If the returned position is smaller than the array size, then the returned final state is undefined.

The upper two integers are set to zero; there is no instruction that converts to Int64.

MXCSR is not really supported by LLVM-2.6. LLVM does not know about the dependency of all floating point operations on this status register.

Cumulative sum: (a,b,c,d) -> (a,a+b,a+b+c,a+b+c+d). I try to cleverly use horizontal add, but the generic version in the Vector module is better.

Attention: The rounding and fraction functions only work for floating point values with maximum magnitude of maxBound :: Int32. This way we save expensive handling of cases that are possibly rare.

The order of addition is chosen for maximum efficiency. We do not try to prevent cancellations.

The first result value is the sum of all vector elements from 0 to div n 2 + 1 and the second result value is the sum of vector elements from div n 2 to n-1. n must be at least D2.

Treat the vector as a concatenation of pairs and add all these pairs. Useful for stereo signal processing. n must be at least D2.

Allow working on records of vectors as if they were vectors of records. This is a reasonable approach for records of different element types since processor vectors can only be built from elements of the same type. But for a chunked stereo signal, say, this also makes sense.
In this case we would work on Stereo (Value a).

Manually assemble a vector of equal values. Better use ScalarOrVector.replicate.

Construct a vector out of single elements. You must assert that the length of the list matches the vector size.

Manually implement vector shuffling using insertelement and extractelement. In contrast to LLVM's built-in instruction it supports distinct vector sizes, but it allows only one input vector (or a tuple of vectors, but we cannot shuffle between them).

Rotate one element towards the higher elements. I don't want to call it rotateLeft or rotateRight, because there is no preferred layout for the vector elements. In Intel's instruction manual vector elements are indexed like the bits, that is from right to left. However, when working with Haskell list and enumeration syntax, the start index is left.

Like LLVM.Util.Loop.mapVector but the loop is unrolled, which is faster since it can be packed by the code generator.

Ideally on ix86 with SSE41 this would be translated to dpps.

If the target vector type is a native type then the chop operation produces no actual machine instruction (nop). If the vector cannot be evenly divided into chunks the last chunk will be padded with undefined values.

The target size is determined by the type. If the chunk list provides more data, the excess data is dropped. If the chunk list provides too little data, the target vector is filled with undefined elements.

We partition a vector of size n into chunks of size m and add these chunks using vector additions. We do this by repeated halving of the vector, since this way we do not need assumptions about the native vector size. We reduce the vector size only virtually, that is, we maintain the vector size and fill with undefined values.
This is reasonable since LLVM-2.5 and LLVM-2.6 do not allow shuffling between vectors of different size and because LLVM likes to do computations on Vector D2 Float in MMX registers on ix86 CPUs, which interacts badly with FPU usage. Since we fill the vector with undefined values, LLVM actually treats the vectors like vectors of smaller size.

Needs (log n) vector additions.

On LLVM-2.6 and X86 this produces branch-free but even slower code than fractionSelect, since the comparison to booleans and back to a floating point number is translated literally to elementwise comparison, conversion to a 0 or -1 byte and then to a floating point number.

LLVM.select on boolean vectors cannot be translated to X86 code in LLVM-2.6, thus I code my own version that calls select on all elements. This is slow but works. When this issue is fixed, this function will be replaced by LLVM.select.

 implemented using  . This will need jumps.

 implemented using  . This will need jumps.

Another implementation of  , this time in terms of binary logical operations. The selecting integers must be (-1) for selecting an element from the first operand and 0 for selecting an element from the second operand. This leads to optimal code. On SSE41 this could be done with blendvps or blendvpd.

The fraction has the same sign as the argument. This is not particularly useful but fast on IEEE implementations.

increment (first operand) may be negative, phase must always be non-negative.

Both increment and phase must be non-negative.

There are functions that are intended for processing scalars but formally have vector input and output. This function breaks a vector function down to a scalar function by accessing the lowest vector element.
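The selection by binary logical operations described above, with (-1) choosing the first operand and 0 choosing the second, can be sketched on plain Haskell lists. This is an illustration only: `selectByMask` is a hypothetical name, and lists of Int32 stand in for LLVM vectors.

```haskell
import Data.Bits (complement, (.&.), (.|.))
import Data.Int (Int32)

-- Elementwise select without branches.
-- Each mask element must be -1 (all bits set) or 0 (no bits set):
--   -1 keeps the element of the first operand,
--    0 keeps the element of the second operand.
selectByMask :: [Int32] -> [Int32] -> [Int32] -> [Int32]
selectByMask = zipWith3 pick
  where
    pick m a b = (m .&. a) .|. (complement m .&. b)
```

For example, `selectByMask [-1, 0, -1, 0] [1, 2, 3, 4] [10, 20, 30, 40]` evaluates to `[1, 20, 3, 40]`. Mask values other than -1 and 0 would mix bits of both operands, which is why the restriction on the selecting integers matters.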
llvm-extra-0.1

LLVM.Extra.Extension
LLVM.Extra.ExtensionCheck.X86
LLVM.Extra.Class
LLVM.Extra.Representation
LLVM.Extra.Monad
LLVM.Extra.Arithmetic
LLVM.Extra.Control
LLVM.Extra.MaybeContinuation
LLVM.Extra.Extension.X86
LLVM.Extra.Vector
LLVM.Extra.ScalarOrVector

CallArgs T Subtarget wrap intrinsic intrinsicAttr run runWhen runUnsafe with with2 with3 sse1 sse2 sse3 ssse3 sse41 sse42
Zero zeroTuple zeroTuplePointed buildTupleTraversable undefTuplePointed valueTupleOfFunctor tupleDescFoldable phisTraversable addPhisFoldable
MemoryElement MemoryRecord Memory load store decompose compose modify memoryElement loadRecord storeRecord decomposeRecord composeRecord castStorablePtr loadNewtype storeNewtype decomposeNewtype composeNewtype newForeignPtrInit newForeignPtrParam newForeignPtr withForeignPtr malloc free
chain liftR2 liftR3
add sub inc dec mul square fdiv fcmp icmp udiv urem and or umin umax smin smax sabs fmin fmax fabs advanceArrayElementPtr sqrt sin cos exp log pow
Select select arrayLoop arrayLoopWithExit arrayLoop2WithExit whileLoop ifThenElse ifThen selectTraversable ifThenSelect
Cons resolve map withBool fromBool toBool lift guard bind arrayLoop2
maxss minss maxps minps maxsd minsd maxpd minpd cmpss cmpps cmpsd cmppd pcmpgtb pcmpgtw pcmpgtd pcmpgtq pcmpugtb pcmpugtw pcmpugtd pcmpugtq pminsb pminsw pminsd pmaxsb pmaxsw pmaxsd pminub pminuw pminud pmaxub pmaxuw pmaxud pabsb pabsw pabsd pmuludq pmulld cvtps2dq cvtpd2dq ldmxcsr stmxcsr withMXCSR haddps haddpd dpps dppd roundss roundps roundsd roundpd absss abssd absps abspd
Real min max abs truncate fraction floor
Arithmetic sum sumToPair sumInterleavedToPair cumulate dotProduct
Access insert extract ShuffleMatch shuffleMatch size replicate assemble insertChunk iterate shuffle sizeInTuple rotateUp rotateDown reverse shiftUp shiftDown shiftUpMultiZero shiftDownMultiZero shuffleMatchTraversable insertTraversable extractTraversable mapChunks zipChunksWith chop concat cumulate1 signedFraction umul32to64
Replicate replicateConst Fraction addToPhase incPhase replicateOf

llvm-ht-0.7.0.0
LLVM.Core.CodeGen
FunctionArgs buildIntrinsic targetName name check
subtarget derefStartParamPtr derefStartPtr AlignedPtr AlignedImporter loadElement storeElement extractElement insertElement MakeValueTuple

base
Foreign.Storable
Storable

pairMemory tripleMemory cmpSelect valueTypeName callIntrinsic1 callIntrinsic2 _arrayLoopWithExitDecLoop _emitCode

Data.Maybe
Nothing Just

VDouble VFloat switchFPPred pcmpuFromPcmp _cumulate1s replicateCore iterateCore mapAuto zipAutoWith dotProductPartial sumPartial chopCore getLowestPair _reduceAddInterleaved sumGeneric sumToPairGeneric reduceSumInterleaved _cumulateSimple cumulateGeneric cumulateFrom1 floorGeneric fractionGeneric _floorSelect _fractionSelect selectLogical floorLogical fractionLogical orderBy order fractionGen runScalar