"      !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~   Analogous to  The type parameter r- and its functional dependency are necessary since g must be a function of the form $a -> ... -> c -> CodeGenFunction r d %and we must ensure that the explicit r and the implicit r in the g do match. /This is an Applicative functor that registers, Gwhat extensions are needed in order to run the contained instructions. +You can escape from the functor by calling  (and providing a generic implementation. We use an applicative functor since with a monadic interface 5we had to create the specialised code in every case, ,in order to see which extensions where used ,in the course of creating the instructions. ;We use only one (unparameterized) type for all extensions, (since this is the most simple solution. ,Alternatively we could use a type parameter 9where class constraints show what extensions are needed. KThis would be just like exceptions that are explicit in the type signature +as in the control-monad-exception package. RHowever we would still need to lift all basic LLVM instructions to the new monad. .Declare that a certain plain LLVM instruction #depends on a particular extension. 2This can be useful if you rely on the data layout 0of a certain architecture when doing a bitcast, @or if you know that LLVM translates a certain generic operation <to something especially optimal for the declared extension. 7Create an intrinsic and register the needed extension. :We cannot immediately check whether the signature matches )or whether the right extension is given. #However, when resolving intrinsics <LLVM will not find the intrinsic if the extension is wrong, "and it also checks the signature. run generic specific generates the specific code ?if the required extensions are available on the host processor and generic otherwise. 

Convenient variant of [...]: Only run the code with extended instructions if an additional condition is given.

Only for debugging purposes.

An implementation of both [...] and Memory.C must ensure that haskellValue is compatible with llvmStruct. That is, writing and reading llvmStruct by LLVM must be the same as accessing haskellValue by Storable methods. ToDo: In future we may also require a Storable constraint for llvmStruct. We use a functional dependency in order to let type inference work nicely.

ToDo: This is dangerous because LLVM uses one bit for Bool representation, and I think one byte in memory, whereas Storable uses 4 bytes and 4-byte alignment. We should define a sub-class of IsFirstClass for all compatible types, and make this a super-class of this instance.

Adding the finalizer to a ForeignPtr seems to be the only way that guarantees execution of the finalizer (not too early and not never). However, the normal ForeignPtr finalizers must be independent from the Haskell runtime. In contrast to ForeignPtr finalizers, addFinalizer adds finalizers to boxes that are optimized away. Thus finalizers are run too early or not at all. Concurrent.ForeignPtr and using threaded execution is the only way to get finalizers in Haskell IO.

This would also work for vectors, if LLVM supported select with bool vectors as condition.

An alternative to [...] where I try to persuade LLVM to use x86's LOOP instruction. Unfortunately it becomes even worse. LLVM developers say that x86 LOOP is actually slower than manual decrement, zero test and conditional branch.

This is a variant of [...] that may be more convenient, because you only need one lambda expression for both loop condition and loop body.

This construct starts new blocks, so be prepared when continuing after an [...].

Branch-free variant of [...] that is faster if the enclosed block is very simple, say, if it contains at most two instructions. It can only be used as an alternative to [...] if the enclosed block is free of side effects.

Isomorphic to ReaderT (CodeGenFunction r z) (ContT z (CodeGenFunction r)) a, where the reader provides the block for [...] and the continuation part manages the [...].

Counterpart to Data.Maybe.HT.toMaybe.

If the returned position is smaller than the array size, then the returned final state is undefined.

The upper two integers are set to zero; there is no instruction that converts to Int64.

MXCSR is not really supported by LLVM-2.6. LLVM does not know about the dependency of all floating point operations on this status register.

Cumulative sum: (a,b,c,d) -> (a,a+b,a+b+c,a+b+c+d). I try to cleverly use horizontal add, but the generic version in the Vector module is better.

Attention: The rounding and fraction functions only work for floating point values with maximum magnitude of maxBound :: Int32. This way we save expensive handling of possibly rare cases.

The order of addition is chosen for maximum efficiency. We do not try to prevent cancellations.

The first result value is the sum of all vector elements from 0 to div n 2 + 1 and the second result value is the sum of vector elements from div n 2 to n-1. n must be at least D2.

Treat the vector as a concatenation of pairs and add all these pairs. Useful for stereo signal processing. n must be at least D2.

Allow to work on records of vectors as if they are vectors of records.
This is a reasonable approach for records of different element types since processor vectors can only be built from elements of the same type. But it also makes sense, say, for a chunked stereo signal. In this case we would work on Stereo (Value a).

Manually assemble a vector of equal values. Better use ScalarOrVector.replicate.

Construct a vector out of single elements. You must assert that the length of the list matches the vector size. This can be considered the inverse of [...].

Manually implement vector shuffling using insertelement and extractelement. In contrast to LLVM's built-in instruction it supports distinct vector sizes, but it allows only one input vector (or a tuple of vectors, but we cannot shuffle between them). For more complex shuffling we recommend [...] and [...].

Rotate one element towards the higher elements. I don't want to call it rotateLeft or rotateRight, because there is no preferred layout for the vector elements. In Intel's instruction manual vector elements are indexed like the bits, that is from right to left. However, when working with Haskell list and enumeration syntax, the start index is left.

Implement the [...] method using the methods of the [...] class.

Provide the elements of a vector as a list of individual virtual registers. This can be considered the inverse of [...].

Like LLVM.Util.Loop.mapVector but the loop is unrolled, which is faster since it can be packed by the code generator.

Ideally on ix86 with SSE41 this would be translated to dpps.

If the target vector type is a native type then the chop operation produces no actual machine instruction (nop). If the vector cannot be evenly divided into chunks the last chunk will be padded with undefined values.

The target size is determined by the type. If the chunk list provides more data, the exceeding data is dropped. If the chunk list provides too little data, the target vector is filled with undefined elements.
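The rotation direction and the padding rules for chopping and concatenating can be illustrated with a pure list model. This is only a sketch under assumptions: rotateUpModel, chopModel and concatModel are hypothetical names, lists stand in for LLVM vectors with index 0 on the left, and Nothing stands in for an undefined element. It is not how the library implements these operations.

```haskell
import Data.List (unfoldr)

-- Rotate one element towards the higher indices:
-- the element at index i moves to i+1 and the last one wraps to index 0.
rotateUpModel :: [a] -> [a]
rotateUpModel [] = []
rotateUpModel xs = last xs : init xs

-- Split a vector into chunks of size m (assumed positive).
-- A short last chunk is padded with Nothing, modelling undefined values.
chopModel :: Int -> [a] -> [[Maybe a]]
chopModel m = unfoldr step
  where
    step [] = Nothing
    step xs =
      let (chunk, rest) = splitAt m xs
      in Just (take m (map Just chunk ++ repeat Nothing), rest)

-- Concatenate chunks into a vector of target size n.
-- Excess data is dropped, missing data is filled with Nothing.
concatModel :: Int -> [[Maybe a]] -> [Maybe a]
concatModel n chunks = take n (concat chunks ++ repeat Nothing)
```

In the model the target size is an explicit argument, whereas the text above says the library determines it from the vector type.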

We partition a vector of size n into chunks of size m and add these chunks using vector additions. We do this by repeated halving of the vector, since this way we do not need assumptions about the native vector size. We reduce the vector size only virtually, that is we maintain the vector size and fill with undefined values. This is reasonable since LLVM-2.5 and LLVM-2.6 do not allow shuffling between vectors of different size and because LLVM likes to do computations on Vector D2 Float in MMX registers on ix86 CPUs, which interacts badly with FPU usage. Since we fill the vector with undefined values, LLVM actually treats the vectors like vectors of smaller size.

Needs (log n) vector additions.

On LLVM-2.6 and X86 this produces branch-free but even slower code than fractionSelect, since the comparison to booleans and back to a floating point number is translated literally to elementwise comparison, conversion to a 0 or -1 byte and then to a floating point number.

LLVM.select on boolean vectors cannot be translated to X86 code in LLVM-2.6, thus I code my own version that calls select on all elements. This is slow but works. When this issue is fixed, this function will be replaced by LLVM.select.

[...] implemented using [...]. This will need jumps.

[...] implemented using [...]. This will need jumps.

Another implementation of [...], this time in terms of binary logical operations. The selecting integers must be (-1) for selecting an element from the first operand and 0 for selecting an element from the second operand. This leads to optimal code. On SSE41 this could be done with blendvps or blendvpd.

The fraction has the same sign as the argument. This is not particularly useful but fast on IEEE implementations.

increment (first operand) may be negative, phase must always be non-negative.

Both increment and phase must be non-negative.

There are functions that are intended for processing scalars but have formally vector input and output.
This function breaks a vector function down to a scalar function by accessing the lowest vector element.

llvm-extra-0.2.0.2

LLVM.Extra.Extension
LLVM.Extra.ExtensionCheck.X86
LLVM.Extra.Class
LLVM.Extra.Memory
LLVM.Extra.ForeignPtr
LLVM.Extra.Monad
LLVM.Extra.Arithmetic
LLVM.Extra.Control
LLVM.Extra.MaybeContinuation
LLVM.Extra.Extension.X86
LLVM.Extra.Vector
LLVM.Extra.ScalarOrVector

CallArgs T Subtarget wrap intrinsic intrinsicAttr run runWhen runUnsafe with with2 with3
sse1 sse2 sse3 ssse3 sse41 sse42
MakeValueTuple valueTupleOf Zero zeroTuple Undefined undefTuple zeroTuplePointed undefTuplePointed
valueTupleOfFunctor phisTraversable addPhisFoldable
Element Record C load store decompose compose modify element
loadRecord storeRecord decomposeRecord composeRecord castStorablePtr
loadNewtype storeNewtype decomposeNewtype composeNewtype
newInit newParam new
chain liftR2 liftR3
add sub inc dec mul square fdiv fcmp cmp idiv irem and or min max abs
advanceArrayElementPtr sqrt sin cos exp log pow
Select select arrayLoop arrayLoopWithExit arrayLoop2WithExit fixedLengthLoop
whileLoop whileLoopShared ifThenElse ifThen selectTraversable ifThenSelect
Cons resolve map withBool fromBool toBool isJust lift guard bind arrayLoop2
maxss minss maxps minps maxsd minsd maxpd minpd cmpss cmpps cmpsd cmppd
pcmpgtb pcmpgtw pcmpgtd pcmpgtq pcmpugtb pcmpugtw pcmpugtd pcmpugtq
pminsb pminsw pminsd pmaxsb pmaxsw pmaxsd pminub pminuw pminud pmaxub pmaxuw pmaxud
pabsb pabsw pabsd pmuludq pmulld cvtps2dq cvtpd2dq ldmxcsr stmxcsr withMXCSR
haddps haddpd dpps dppd roundss roundps roundsd roundpd absss abssd absps abspd
Real truncate fraction floor
Arithmetic sum sumToPair sumInterleavedToPair cumulate dotProduct
Access insert extract ShuffleMatch shuffleMatch
size replicate assemble insertChunk iterate shuffle sizeInTuple
rotateUp rotateDown reverse shiftUp
shiftDown shiftUpMultiZero shiftDownMultiZero
shuffleMatchTraversable shuffleMatchAccess shuffleMatchPlain1 shuffleMatchPlain2
insertTraversable extractTraversable extractAll mapChunks zipChunksWith chop concat
cumulate1 signedFraction umul32to64
Replicate replicateConst Fraction addToPhase incPhase replicateOf

llvm-0.9.1.2
LLVM.Core.CodeGen
FunctionArgs

buildIntrinsic targetName name check subtarget
loadElement storeElement extractElement insertElement pair triple
$fCValuea derefStartParamPtr derefStartPtr Importer
cmpSelect valueTypeName callIntrinsic1 callIntrinsic2 addReadNone
_arrayLoopWithExitDecLoop _emitCode

base
Data.Maybe
Nothing Just

VDouble VFloat switchFPPred pcmpuFromPcmp valueUnit _cumulate1s
replicateCore iterateCore _mapByFold mapAuto zipAutoWith
dotProductPartial sumPartial chopCore getLowestPair _reduceAddInterleaved
sumGeneric sumToPairGeneric reduceSumInterleaved
_cumulateSimple cumulateGeneric cumulateFrom1
floorGeneric fractionGeneric _floorSelect _fractionSelect
selectLogical floorLogical fractionLogical orderBy order fractionGen
singleton runScalar