portable, experimental, Uwe Schmidt (uwe@fh-wedel.de)

Text.Regex.XMLSchema.String.XML.CharProps

isXmlChar
    Checks for valid XML characters.

isXmlSpaceChar
    Checks for an XML space character: \n, \r, \t and " ".

isXml11SpaceChar
    Checks for an XML 1.1 space character: the additional space characters 0x85 and 0x2028.
    See also: isXmlSpaceChar

isXmlNameChar
    Checks for an XML name character.

isXmlNameStartChar
    Checks for an XML name start character.
    See also: isXmlNameChar

isXmlNCNameChar
    Checks for an XML NCName character: no ":" allowed.
    See also: isXmlNameChar

isXmlNCNameStartChar
    Checks for an XML NCName start character: no ":" allowed.
    See also: isXmlNameChar, isXmlNCNameChar

isXmlPubidChar
    Checks for an XML public id character.

isXmlLetter
    Checks for an XML letter.

isXmlBaseChar
    Checks for an XML base character.

isXmlIdeographicChar
    Checks for an XML ideographic character.

isXmlCombiningChar
    Checks for an XML combining character.

isXmlDigit
    Checks for an XML digit.

isXmlExtender
    Checks for an XML extender.

isXmlControlOrPermanentlyUndefined
    Checks for an XML control or permanently discouraged char.
    See Errata to XML 1.0 (http://www.w3.org/XML/xml-V10-2e-errata), No 46:
    Document authors are encouraged to avoid compatibility characters,
    as defined in section 6.8 of [Unicode] (see also D21 in section 3.6 of [Unicode3]).
    The characters defined in the following ranges are also discouraged.
    They are either control characters or permanently undefined Unicode characters:

Text.Regex.XMLSchema.String.Regex

mkZero
    Constructs the r.e. for the empty set.
    An (error) message may be attached.

mkUnit
    Constructs the r.e. for the set containing the empty word.

mkSym
    Constructs the r.e. for a set of chars.

mkSym1
    Constructs an r.e. for a single char set.

mkSymRng
    Constructs an r.e. for an interval of chars.

mkWord
    mkSym generalized for strings.

mkDot
    Constructs an r.e. for the set of all Unicode chars.

mkAll
    Constructs an r.e. for the set of all Unicode words.

mkStar
    Constructs the r.e. for r*.

mkAlt
    Constructs the r.e. for r1|r2.

mkElse
    Constructs the r.e. for r1{|}r2 (r1 orElse r2).
    This represents the same r.e. as r1|r2, but when collecting the results
    of subexpressions in (...) and r1 succeeds, the subexpressions of r2 are
    discarded, so r1 matches are prioritized.
    example
        splitSubex "({1}x)|({2}.)"   "x" = ([("1","x"),("2","x")], "")
        splitSubex "({1}x){|}({2}.)" "x" = ([("1","x")], "")

mkSeq
    Constructs the sequence r.e. r1.r2.

mkSeqs
    mkSeq extended to lists.

mkRep
    Constructs the repetition r{i,}.

mkRng
    Constructs the range r{i,j}.

mkOpt
    Constructs the option r?.

mkDiff
    Constructs the difference r.e.: r1 {\} r2.
    example
        match "[a-z]+{\\}bush" "obama"   = True
        match "[a-z]+{\\}bush" "clinton" = True
        match "[a-z]+{\\}bush" "bush"    = False    -- not important any more

mkCompl
    Constructs the complement of an r.e.: whole set of words - r.

mkIsect
    Constructs the r.e. for intersection: r1 {&} r2.
    example
        match ".*a.*{&}.*b.*" "-a-b-" = True
        match ".*a.*{&}.*b.*" "-b-a-" = True
        match ".*a.*{&}.*b.*" "-a-a-" = False
        match ".*a.*{&}.*b.*" "---b-" = False

mkExor
    Constructs the r.e. for exclusive or: r1 {^} r2.
    example
        match "[a-c]+{^}[c-d]+" "abc"  = True
        match "[a-c]+{^}[c-d]+" "acdc" = False
        match "[a-c]+{^}[c-d]+" "ccc"  = False
        match "[a-c]+{^}[c-d]+" "cdc"  = True

mkBr
    Constructs a labeled subexpression: ({label}r).

firstChars
    FIRST for regular expressions.
    This is only an approximation: the real set of chars may be smaller
    when the expression contains intersection, set difference or exor operators.

splitWithRegex
    This function wraps the whole regex in a subexpression before starting
    the parse. This is done to get access to the whole parsed string.
    Therefore one special label is needed: this label is the Nothing value,
    all explicit labels are Just labels.

splitWithRegex'
    The main scanner function.

splitWithRegexCS'
    Speedup version of splitWithRegex'.
    This function checks whether the input starts with a char from FIRST re.
    If this is not the case, the split fails.
    The FIRST set can be computed once for a whole tokenizer and reused
    by every call of split.

Text.Regex.XMLSchema.String.RegexParser

parseRegex
    Parses a W3C XML Schema regular expression.
    The syntax of the W3C XML Schema spec is extended by further useful set
    operations, like intersection, difference, exor. Subexpression match
    becomes possible with "named" pairs of parentheses.
    The multi char escape sequence \a represents any Unicode char.
    The multi char escape sequence \A represents any Unicode word (\A = \a*).
    All syntactically wrong inputs are mapped to the Zero expression
    representing the empty set of words. Zero contains as data field a string
    for an error message. So error checking after parsing becomes possible by
    checking against Zero (isZero predicate).

Text.Regex.XMLSchema.String

splitRE
    Splits a string by taking the longest prefix matching a regular expression.
    Nothing is returned in case there is no matching prefix,
    else the pair of prefix and rest is returned.

split
    Convenience function for splitRE.
    examples:
        split "a*b" "abc" = ("ab","c")
        split "a*"  "bc"  = ("", "bc")
        split "a+"  "bc"  = ("", "bc")
        split "["   "abc" = ("", "abc")

splitSubexRE
    Splits a string by removing the longest prefix matching a regular
    expression and then returns the list of subexpressions found in the
    matching part.
    Nothing is returned in case of no matching prefix, else the list of pairs
    of labels and submatches and the rest is returned.

splitSubex
    Convenience function for splitSubexRE.
    examples:
        splitSubex "({1}a*)b"  "abc" = ([("1","a")],"c")
        splitSubex "({2}a*)"   "bc"  = ([("2","")], "bc")
        splitSubex "({1}a|b)+" "abc" = ([("1","a"),("1","b")],"c")    -- subex 1 matches 2 times

        splitSubex ".*({x}a*)" "aa"  = ([("x",""),("x","a"),("x","aa")],"")
                                       -- nondeterminism: 3 matches for a*

        splitSubex "({1}do)|({2}[a-z]+)" "do you know"
                                     = ([("1","do"),("2","do")]," you know")
                                       -- nondeterminism: 2 matches for do

        splitSubex "({1}do){|}({2}[a-z]+)" "do you know"
                                     = ([("1","do")]," you know")
                                       -- no nondeterminism with {|}: 1. match for do

        splitSubex "({1}a+)"   "bcd" = ([], "bcd")    -- no match
        splitSubex "["         "abc" = ([], "abc")    -- syntax error

tokenizeRE
    The function that does the real work for tokenize.

tokenize
    Splits a string into tokens (words) by giving a regular expression
    which all tokens must match. Convenience function for tokenizeRE.
    This can be used for simple tokenizers.
    It is recommended to use regular expressions where the empty word does
    not match. Otherwise a lot of probably useless empty tokens will appear
    in the output.
    All non-matching chars are discarded. If the given regex contains syntax
    errors, Nothing is returned.
    examples:
        tokenize "a"   "aabba"    = ["a","a","a"]
        tokenize "a*"  "aaaba"    = ["aaa","a"]
        tokenize "a*"  "bbb"      = ["","",""]
        tokenize "a+"  "bbb"      = []

        tokenize "a*b" ""         = []
        tokenize "a*b" "abc"      = ["ab"]
        tokenize "a*b" "abaab ab" = ["ab","aab","ab"]

        tokenize "[a-z]{2,}|[0-9]{2,}|[0-9]+[.][0-9]+" "ab123 456.7abc"
                                  = ["ab","123","456.7","abc"]

        tokenize "[a-z]*|[0-9]{2,}|[0-9]+[.][0-9]+" "cab123 456.7abc"
                                  = ["cab","123","456.7","abc"]

        tokenize "[^ \t\n\r]*" "abc def\t\n\rxyz"
                                  = ["abc","def","xyz"]

        tokenize ".*" "\nabc\n123\n\nxyz\n"
                                  = ["","abc","123","","xyz"]

        tokenize ".*"          = lines
        tokenize "[^ \t\n\r]*" = words

tokenizeRE'
    Splits a string into tokens and delimiters by giving a regular expression
    which all tokens must match.
    This is a generalisation of the above tokenizeRE functions.
    The non-matching char sequences are marked with Left, the matching ones
    are marked with Right.
    If the regular expression contains syntax errors, Nothing is returned.
    The following law holds:
        concat . map (either id id) . tokenizeRE' re == id

tokenize'
    Convenience function for tokenizeRE'.
    When the regular expression parses as Zero, [Left input] is returned,
    that means no tokens are found.

tokenizeSubexRE
    Splits a string into tokens (pairs of labels and words) by giving a
    regular expression containing labeled subexpressions.
    This function should not be called with regular expressions without any
    labeled subexpressions. This does not make sense, because the result list
    will always be empty.
    The result is the list of matching subexpressions.
    This can be used for simple tokenizers.
    At least one char is consumed by parsing a token.
    The pairs in the result list contain the matching substrings.
    All non-matching chars are discarded. If the given regex contains syntax
    errors, Nothing is returned.

tokenizeSubex
    Convenience function for tokenizeSubexRE, taking the regular expression
    as a string.
    examples:
        tokenizeSubex "({name}[a-z]+)|({num}[0-9]{2,})|({real}[0-9]+[.][0-9]+)"
                      "cab123 456.7abc"
                          = [("name","cab")
                            ,("num","123")
                            ,("real","456.7")
                            ,("name","abc")]

        tokenizeSubex "({real}({n}[0-9]+)([.]({f}[0-9]+))?)"
                      "12.34"
                          = [("real","12.34")
                            ,("n","12")
                            ,("f","34")]

        tokenizeSubex "({real}({n}[0-9]+)([.]({f}[0-9]+))?)"
                      "12 34"
                          = [("real","12"),("n","12")
                            ,("real","34"),("n","34")]

        tokenizeSubex "({real}({n}[0-9]+)(([.]({f}[0-9]+))|({f})))"
                      "12 34.56"
                          = [("real","12"),("n","12"),("f","")
                            ,("real","34.56"),("n","34"),("f","56")]

sedRE
    sed-like editing function.
    All matching tokens are edited by the first argument, the editing
    function; all other chars remain as they are.

sed
    Convenience function for sedRE.
    examples:
        sed (const "b")      "a" "xaxax" = "xbxbx"
        sed (\ x -> x ++ x)  "a" "xax"   = "xaax"
        sed undefined        "[" "xxx"   = "xxx"

matchRE
    Matches a string with a regular expression.

match
    Convenience function for matchRE.
    Examples:
        match "x*" "xxx" = True
        match "x"  "xxx" = False
        match "["  "xxx" = False

matchSubexRE
    Matches a string with a regular expression
    and extracts subexpression matches.

matchSubex
    Convenience function for matchSubexRE.
    Examples:
        matchSubex "({1}x*)"                 "xxx"     = [("1","xxx")]
        matchSubex "({1}x*)"                 "y"       = []
        matchSubex "({w}[0-9]+)x({h}[0-9]+)" "800x600" = [("w","800"),("h","600")]
        matchSubex "["                       "xxx"     = []

grep
    grep-like filter for lists of strings.
    The regular expression may be prefixed with the usual context spec
    "^" for start of string and \< for start of word,
    and suffixed with "$" for end of text and \> for end of word.
    Word chars are defined by the multi char escape sequence \w.
    Examples
        grep "a"     ["_a_", "_a", "a_", "a", "_"]      => ["_a_", "_a", "a_", "a"]
        grep "^a"    ["_a_", "_a", "a_", "a", "_"]      => ["a_", "a"]
        grep "a$"    ["_a_", "_a", "a_", "a", "_"]      => ["_a", "a"]
        grep "^a$"   ["_a_", "_a", "a_", "a", "_"]      => ["a"]
        grep "\\<a"  ["x a b", " ax ", " xa ", "xab"]   => ["x a b", " ax "]
        grep "a\\>"  ["x a b", " ax ", " xa ", "xab"]   => ["x a b", " xa "]

Package: regex-xmlschema-0.1.3

Names by module

Text.Regex.XMLSchema.String.Unicode.Blocks
    codeBlocks

Text.Regex.XMLSchema.String.CharSet
    CharSet, emptyCS, allCS, singleCS, stringCS, rangeCS, nullCS, fullCS,
    elemCS, unionCS, diffCS, compCS, intersectCS, exorCS

Text.Regex.XMLSchema.String.Unicode.CharProps
    isUnicodeC, isUnicodeCc, isUnicodeCf, isUnicodeCo, isUnicodeCs,
    isUnicodeL, isUnicodeLl, isUnicodeLm, isUnicodeLo, isUnicodeLt, isUnicodeLu,
    isUnicodeM, isUnicodeMc, isUnicodeMe, isUnicodeMn,
    isUnicodeN, isUnicodeNd, isUnicodeNl, isUnicodeNo,
    isUnicodeP, isUnicodePc, isUnicodePd, isUnicodePe, isUnicodePf,
    isUnicodePi, isUnicodePo, isUnicodePs,
    isUnicodeS, isUnicodeSc, isUnicodeSk, isUnicodeSm, isUnicodeSo,
    isUnicodeZ, isUnicodeZl, isUnicodeZp, isUnicodeZs

Text.Regex.XMLSchema.String.XML.CharProps
    isXmlChar, isXmlSpaceChar, isXml11SpaceChar, isXmlNameChar,
    isXmlNameStartChar, isXmlNCNameChar, isXmlNCNameStartChar, isXmlPubidChar,
    isXmlLetter, isXmlBaseChar, isXmlIdeographicChar, isXmlCombiningChar,
    isXmlDigit, isXmlExtender, isXmlControlOrPermanentlyUndefined

Text.Regex.XMLSchema.String.Regex
    Regex, GenRegex, mkZero, mkUnit, mkSym, mkSym1, mkSymRng, mkWord, mkDot,
    mkAll, mkStar, mkAlt, mkElse, mkSeq, mkSeqs, mkRep, mkRng, mkOpt, mkDiff,
    mkCompl, mkIsect, mkExor, mkInterleave, mkBr, isZero, errRegex, nullable,
    nullable', firstChars, delta1, delta, matchWithRegex, matchWithRegex',
    splitWithRegex, splitWithRegexCS, splitWithRegex', splitWithRegexCS'

Text.Regex.XMLSchema.String.RegexParser
    parseRegex

Text.Regex.XMLSchema.String
    splitRE, split, splitSubexRE, splitSubex, tokenizeRE, tokenize,
    tokenizeRE', tokenize', tokenizeSubexRE, tokenizeSubex, sedRE, sed,
    matchRE, match, matchSubexRE, matchSubex, grep
    Also re-exports: Regex, GenRegex, mkZero, mkUnit, mkSym1, mkSymRng, mkWord,
    mkDot, mkAll, mkStar, mkAlt, mkElse, mkSeq, mkSeqs, mkRep, mkRng, mkOpt,
    mkDiff, mkCompl, mkIsect, mkExor, mkBr, isZero, errRegex, parseRegex

Further names recorded in the interface
    Nullable, Label, Cbr, Br, Intl, Exor, Isec, Diff, Rng, Rep, Seq, Else, Alt,
    Star, Dot, Sym, Unit, Zero, rmStar, mkBr0, mkBr', mkCbr, showL, isectN,
    unionN, orElseN, diffN, exorN, splitWithRegex'', regExp, branchList,
    orElseList, interleaveList, exorList, diffList, intersectList, seqList,
    piece, quantifier, quantity, quantityRest, atom, regExp', char1, charClass,
    charClassEsc, singleCharEsc, singleCharEsc', multiCharEsc, catEsc, charProp,
    isBlock, isCategory, categories, isCategory', complEsc, charClassExpr,
    charGroup, posCharGroup, charRange, seRange, charOrEsc', xmlCharIncDash,
    negCharGroup, wildCardEsc
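
The law stated for tokenizeRE' (concat . map (either id id) . tokenizeRE' re == id) can be illustrated without the library. The sketch below is a stand-in under stated assumptions, not the package's implementation: tokenizeBy is a hypothetical helper that treats maximal runs of characters satisfying a predicate as tokens (Right) and every other run as a delimiter (Left), and rebuild is a hypothetical name for the left-hand side of the law.

```haskell
import Data.Char (isDigit)

-- Hypothetical stand-in for tokenizeRE': a token is a maximal non-empty run
-- of characters satisfying the predicate; everything else is a delimiter.
tokenizeBy :: (Char -> Bool) -> String -> [Either String String]
tokenizeBy _ [] = []
tokenizeBy p s@(c:_)
  | p c       = let (tok, rest) = span p s  in Right tok : tokenizeBy p rest
  | otherwise = let (gap, rest) = break p s in Left gap  : tokenizeBy p rest

-- Left-hand side of the law: forget the Left/Right tags and concatenate.
rebuild :: [Either String String] -> String
rebuild = concat . map (either id id)

main :: IO ()
main = do
  let input = "ab123 456.7abc"
      parts = tokenizeBy isDigit input
  print parts                     -- tagged tokens and delimiters, in order
  print (rebuild parts == input)  -- the law: nothing is lost or reordered
```

Because every character of the input ends up in exactly one Left or Right chunk, concatenating the chunks gives back the input, so the second line printed is True. The plain tokenize functions do not satisfy this law, since they discard the non-matching chars.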