hxt-8.5.4: A collection of tools for processing XML with Haskell.

Maintainer: Uwe Schmidt (uwe@fh-wedel.de)



Convenience functions for the W3C XML Schema regular expression matcher. For internals, see Text.XML.HXT.RelaxNG.XmlSchema.Regex

The grammar can be found at http://www.w3.org/TR/xmlschema11-2/#regexs



matchRE :: String -> String -> Maybe Bool

Match a string against a regular expression.

The first argument is the regex, the second the input string. If the regex is not well formed, Nothing is returned; otherwise Just the match result is returned.


 matchRE "x*" "xxx" = Just True
 matchRE "x" "xxx"  = Just False
 matchRE "[" "xxx"  = Nothing
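
The three possible outcomes can be told apart with an ordinary pattern match. The following is a minimal sketch: describeMatch is a hypothetical helper, not part of the library, and it works directly on the Maybe Bool value that matchRE is documented to return, so it runs without HXT installed. The wording "whole input matches" is an inference from the examples above, where "x" does not match "xxx".

```haskell
-- Hypothetical helper (not part of HXT): interpret a matchRE-style result.
describeMatch :: Maybe Bool -> String
describeMatch Nothing      = "regex is not well formed"
describeMatch (Just True)  = "whole input matches"
describeMatch (Just False) = "input does not match"

main :: IO ()
main = mapM_ (putStrLn . describeMatch)
  [ Just True    -- documented result of matchRE "x*" "xxx"
  , Just False   -- documented result of matchRE "x"  "xxx"
  , Nothing      -- documented result of matchRE "["  "xxx"
  ]
```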

splitRE :: String -> String -> Maybe (String, String)

Split a string by taking the longest prefix that matches a regular expression.

Nothing is returned if the regex string is syntactically wrong or if there is no matching prefix; otherwise the pair of prefix and rest is returned.


 splitRE "a*b" "abc" = Just ("ab","c")
 splitRE "a*"  "bc"  = Just ("", "bc")
 splitRE "a+"  "bc"  = Nothing
 splitRE "["   "abc" = Nothing
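
A function of this shape can drive a simple prefix-by-prefix scanner. The sketch below is an illustration under stated assumptions, not library code: Splitter, prefixes and oneDigit are hypothetical names, and oneDigit is a hand-written stand-in for a splitRE call with the regex already applied, so the block runs without HXT. Note the guard against an empty prefix: as the "a*" example above shows, a successful split may consume nothing, and a naive loop would then never terminate.

```haskell
import Data.Char (isDigit)

-- Type of a splitRE-style function once the regex argument is fixed.
type Splitter = String -> Maybe (String, String)

-- Repeatedly take prefixes until the splitter fails or stops consuming input.
-- Returns the collected prefixes and the unconsumed rest.
prefixes :: Splitter -> String -> ([String], String)
prefixes step = go
  where
    go input = case step input of
      Just (tok@(_:_), rest) -> let (toks, left) = go rest
                                in  (tok : toks, left)
      _                      -> ([], input)   -- Nothing, or an empty prefix

-- Stand-in for a splitter that takes exactly one leading digit.
-- An assumption for illustration only; it does not call HXT.
oneDigit :: Splitter
oneDigit (c:cs) | isDigit c = Just ([c], cs)
oneDigit _                  = Nothing

main :: IO ()
main = print (prefixes oneDigit "123abc")
```

With the real library, step would be a partial application such as splitRE applied to a regex string.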

sedRE :: (String -> String) -> String -> String -> Maybe String

A sed-like editing function.

All matching tokens are edited by the first argument, the editing function; all other characters remain as they are.


 sedRE (const "b") "a" "xaxax"       = Just "xbxbx"
 sedRE (\ x -> x ++ x) "a" "xax"     = Just "xaax"
 sedRE undefined       "[" undefined = Nothing

tokenizeRE :: String -> String -> Maybe [String]

Split a string into tokens (words) by giving a regular expression that every token must match.

This can be used for simple tokenizers. Each word in the result list contains at least one character. All non-matching characters are discarded. If the given regex contains syntax errors, Nothing is returned.


 tokenizeRE "a*b" ""         = Just []
 tokenizeRE "a*b" "abc"      = Just ["ab"]
 tokenizeRE "a*b" "abaab ab" = Just ["ab","aab","ab"]

 tokenizeRE "[a-z]{2,}|[0-9]{2,}|[0-9]+[.][0-9]+" "ab123 456.7abc"
                                = Just ["ab","123","456.7","abc"]

 tokenizeRE "[a-z]*|[0-9]{2,}|[0-9]+[.][0-9]+" "cab123 456.7abc"
                                = Just ["cab","123","456.7","abc"]

 tokenizeRE "[^ \t\n\r]*" "abc def\t\n\rxyz"
                                = Just ["abc","def","xyz"]

 tokenizeRE "[^ \t\n\r]*"    = Just . words

tokenizeRE' :: String -> String -> Maybe [Either String String]

Split a string into tokens and delimiters by giving a regular expression that every token must match.

This is a generalisation of the tokenizeRE function above. Non-matching character sequences are marked with Left, matching ones with Right.

If the regular expression contains syntax errors, Nothing is returned.

The following law holds for every well-formed regular expression re:

 concat . map (either id id) . fromJust . tokenizeRE' re == id
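
The law says that no input characters are lost: concatenating all Left and Right pieces gives back the original string. The sketch below checks this on a token list written by hand. The value of sample is an assumption about what such a result could look like for the regex "a*b" and the input "abc xab"; it is not output captured from the library, and rebuild is a hypothetical name.

```haskell
-- Rebuild the original input from a tokenizeRE'-style token list.
-- concatMap f is the same as concat . map f in the law above.
rebuild :: [Either String String] -> String
rebuild = concatMap (either id id)

-- Assumed value, written by hand for illustration:
-- Right marks a matching token, Left a non-matching stretch.
sample :: [Either String String]
sample = [Right "ab", Left "c x", Right "ab"]

main :: IO ()
main = do
  print (rebuild sample == "abc xab")   -- the law, checked on the sample
  print [t | Right t <- sample]         -- only the matching tokens
```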

match :: String -> String -> Bool

Convenience function for matchRE.

Syntax errors in the regular expression are interpreted as no match found.

tokenize :: String -> String -> [String]

Convenience function for tokenizeRE.

Syntax errors in the regular expression result in an empty list.

tokenize' :: String -> String -> [Either String String]

Convenience function for tokenizeRE'.

When the regular expression contains errors, [Left input] is returned, which means that no tokens are found.

sed :: (String -> String) -> String -> String -> String

Convenience function for sedRE.

When the regular expression contains errors, sed is the identity; otherwise the functionality is that of sedRE.

 sed undefined "["  == id

split :: String -> String -> (String, String)

Convenience function for splitRE.

Syntax errors in the regular expression are interpreted as no matching prefix found.
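
The fallbacks documented for match, tokenize, tokenize' and sed can all be read as "replace Nothing by a default". The following sketch expresses that reading with fromMaybe. It is an assumption about how the wrappers relate to the ...RE functions, not the library's actual source; the helper names are hypothetical, and each helper takes an already computed Maybe result so that the block runs without HXT. split is left out because the text does not state which pair it returns when no prefix matches.

```haskell
import Data.Maybe (fromMaybe)

-- Each helper takes the result of the corresponding ...RE function
-- and applies the fallback documented for the wrapper.

matchResult :: Maybe Bool -> Bool
matchResult = fromMaybe False                    -- syntax error: no match

tokenizeResult :: Maybe [String] -> [String]
tokenizeResult = fromMaybe []                    -- syntax error: empty list

tokenizeResult' :: String -> Maybe [Either String String]
                -> [Either String String]
tokenizeResult' input = fromMaybe [Left input]   -- syntax error: no tokens

sedResult :: String -> Maybe String -> String
sedResult input = fromMaybe input                -- syntax error: identity

main :: IO ()
main = do
  print (matchResult Nothing)
  print (tokenizeResult Nothing)
  print (tokenizeResult' "abc" Nothing)
  print (sedResult "abc" Nothing)
```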