The regex Tutorial
==================

This tutorial is a self-testing literate Haskell programme introducing
the vanilla API of the [regex package](http://hs.regex.uk). There
are other tutorials for explaining the more specialist aspects of regex
and you can load them into into you Haskell REPL of choice: see the
[regex Tutorials page](http://tutorial.regex.uk) for details.


Language Pragmas
----------------

The first thing you will have to do is enable `QuasiQuotes` as regex
uses them to check that REs are well-formed at compile time.
\begin{code}
{-# LANGUAGE QuasiQuotes                      #-}
{-# OPTIONS_GHC -fno-warn-missing-signatures  #-}
\end{code}
If you are trying out examples interactively at the ghci prompt then you
will need
```
:seti -XQuasiQuotes
```

Importing the API
-----------------

\begin{code}
module Main(main) where
\end{code}

*********************************************************
*
* WARNING: this is generated from pp-tutorial-master.lhs 
*
*********************************************************


Before importing the `regex` API into your Haskell script you will need
to answer two questions:

  1. Which flavour of REs do I need? If you need Posix REs then the `TDFA`
     is for you, otherwise it is the PCRE back end, which is housed in
     a seperate `regex-with-pcre` package.

  2. Which Haskell type is being used for the text I need to match? This
     can influence as, at the time of writing, the `PCRE` `regex` back end
     [does not support the`Text` types](https://github.com/iconnect/regex/issues/58).

The import statement will in general look like this
```
  import Text.RE.<back-end>.<text-type>
```

As we have no interest in Posix/PCRE distinctions or performance here,
we have chosen to work with the `TDFA` back end with `String` types.

\begin{code}
import TestKit
import Text.RE.TDFA.String
\end{code}

You could also import `Text.RE.TDFA` or `Text.RE.PCRE` to get an API
in which the operators are overloaded over all text types accepted by
each of these back ends: see the [Tools Tutorial](re-tutorial-tools.html)
for details.


Single `Match` with `?=~`
-------------------------

The regex API provides two matching operators: one for looking for the first
match in its search string and the other for finding all of the matches. The
first-match operator, `?=~`, yields the result of attempting to find the first
match.
```
(?=~) :: String -> RE -> Match String
```
The boolean `matched` function,
```
matched :: Match a -> Bool
```
can be used to test whether a match was found:
\begin{code}
evalme_SGL_01 = checkThis "evalme_SGL_01" (True) $ matched $ "2016-01-09 2015-12-5 2015-10-05" ?=~ [re|[0-9]{4}-[0-9]{2}-[0-9]{2}|]
\end{code}

To get the matched text use `matchText`,
```
matchedText :: Match a -> Maybe a
```
which returns `Nothing` if no match was found in the search string:
\begin{code}
evalme_SGL_02 = checkThis "evalme_SGL_02" (Just "2016-01-09") $ matchedText $ "2016-01-09 2015-12-5 2015-10-05" ?=~ [re|[0-9]{4}-[0-9]{2}-[0-9]{2}|]
\end{code}
\begin{code}
evalme_SGL_03 = checkThis "evalme_SGL_03" (Nothing) $ matchedText $ "2015-12-5" ?=~ [re|[0-9]{4}-[0-9]{2}-[0-9]{2}|]
\end{code}


Multiple `Matches` with `*=~`
-----------------------------

Use `*=~` to locate all of the non-overlapping substrings that match a RE,
```
(*=~)      :: String -> RE -> Matches String
anyMatches :: Matches a -> Bool
```
`anyMatches` can be used to determine if any matches were found
\begin{code}
evalme_MLT_01 = checkThis "evalme_MLT_01" (True) $ anyMatches $ "2016-01-09 2015-12-5 2015-10-05" *=~ [re|[0-9]{4}-[0-9]{2}-[0-9]{2}|]
\end{code}
and `countMatches` will tell us how many sub-strings matched:
\begin{code}
evalme_MLT_02 = checkThis "evalme_MLT_02" (2) $ countMatches $ "2016-01-09 2015-12-5 2015-10-05" *=~ [re|[0-9]{4}-[0-9]{2}-[0-9]{2}|]
\end{code}
`matches` will return all of the matches.
```
matches :: Natches a -> [a]
```
\begin{code}
evalme_MLT_03 = checkThis "evalme_MLT_03" (["2016-01-09","2015-10-05"]) $ matches $ "2016-01-09 2015-12-5 2015-10-05" *=~ [re|[0-9]{4}-[0-9]{2}-[0-9]{2}|]
\end{code}


The `regex` Macros and Parsers
------------------------------

regex supports macros in regular expressions. There are a bunch of
standard macros that you can just use, and you can define your own.

RE macros are enclosed in `@{` ... '}'. By convention the macros in
the standard environment start with a '%'. `@{%date}` will match an
ISO 8601 date, this
\begin{code}
evalme_MAC_00 = checkThis "evalme_MAC_00" (2) $ countMatches $ "2016-01-09 2015-12-5 2015-10-05" *=~ [re|@{%date}|]
\end{code}
will pick out the two dates.

There are also parsing functions for analysing the matched text. The
`@{%string}` macro will match quoted strings (in which double quotes can be
escaped with backslashes in the usual way) and its companion `parseString`
function will extract the string that was being quoted, interpreting any
escaped double quotes:
\begin{code}
evalme_MAC_01 = checkThisWith convertMaybeTextList "evalme_MAC_01" ([Just "foo",Just "bar", Just "\""]) $ map parseString $ matches $ "\"foo\", \"bar\" and a quote \"\\\"\"" *=~ [re|@{%string}|]
\end{code}

See the [macro tables page](http://macros.regex.uk) for details of the standard macros and their parsers.

See the [testbench tutorial](re-tutorial-testbench.html) for more on how
you can develop, document and test RE macros with the regex test bench.


Search and Replace
------------------

If you need to edit a string then `SearchReplace` `[ed|` ... `|]`
templates can be used with `?=~/` to replace a single instance or
`*=~/` to replace all matching instances.

\begin{code}
evalme_SRP_00 = checkThis "evalme_SRP_00" ("0x0000: 40AA fab0") $ "0000 40AA fab0" ?=~/ [ed|${adr}([0-9A-Fa-f]{4}):?///0x${adr}:|]
\end{code}
\begin{code}
evalme_SRP_01 = checkThis "evalme_SRP_01" ("0x0000: 0x40AA 0xfab0") $ "0000: 40AA fab0" *=~/ [ed|[0-9A-Fa-f]{4}///0x$0|]
\end{code}


Specifying Options
------------------

By default regular expressions are of the multi-line case-sensitive
variety so this
\begin{code}
evalme_SOP_01 = checkThis "evalme_SOP_01" (2) $ countMatches $ "0a\nbb\nFe\nA5" *=~ [re|[0-9a-f]{2}$|]
\end{code}
will find 2 matches, the '$' anchor matching each of the newlines, but only
the first two lowercase hex numbers matching the RE. The case sensitivity
and multiline-ness can be controled by selecting alternative parsers.

+--------------------------+-------------+-----------+----------------+
| long name                | short forms | multiline | case sensitive |
+==========================+=============+===========+================+
| reMultilineSensitive     | reMS, re    | yes       | yes            |
+--------------------------+-------------+-----------+----------------+
| reMultilineInsensitive   | reMI        | yes       | no             |
+--------------------------+-------------+-----------+----------------+
| reBlockSensitive         | reBS        | no        | yes            |
+--------------------------+-------------+-----------+----------------+
| reBlockInsensitive       | reBI        | no        | no             |
+--------------------------+-------------+-----------+----------------+

So while the default setup
\begin{code}
evalme_SOP_02 = checkThis "evalme_SOP_02" (2) $ countMatches $ "0a\nbb\nFe\nA5" *=~ [reMultilineSensitive|[0-9a-f]{2}$|]
\end{code}
finds 2 matches, a case-insensitive RE
\begin{code}
evalme_SOP_03 = checkThis "evalme_SOP_03" (4) $ countMatches $ "0a\nbb\nFe\nA5" *=~ [reMultilineInsensitive|[0-9a-f]{2}$|]
\end{code}
finds 4 matches, while a non-multiline RE
\begin{code}
evalme_SOP_04 = checkThis "evalme_SOP_04" (0) $ countMatches $ "0a\nbb\nFe\nA5" *=~ [reBlockSensitive|[0-9a-f]{2}$|]
\end{code}
finds no matches but a non-multiline, case-insensitive match
\begin{code}
evalme_SOP_05 = checkThis "evalme_SOP_05" (1) $ countMatches $ "0a\nbb\nFe\nA5" *=~ [reBlockInsensitive|[0-9a-f]{2}$|]
\end{code}
finds the final match.

For the hard of typing the shortforms are available.
\begin{code}
evalme_SOP_06 = checkThis "evalme_SOP_06" (True) $ matched $ "SuperCaliFragilisticExpialidocious" ?=~ [reMI|supercalifragilisticexpialidocious|]
\end{code}


Compiling and Escaping
----------------------

It is possible to compile a dynamically aquired RE string at run-time using
`compileRegex`:
```
compileRegex :: (Functor m, Monad m) => String -> m RE
```
\begin{code}
evalme_CPL_01 = checkThis "evalme_CPL_01" (["2016-01-09","2015-10-05"]) $ matches $ "2016-01-09 2015-12-5 2015-10-05" *=~ (maybe (error "evalme_CPL_01") id $ compileRegex "[0-9]{4}-[0-9]{2}-[0-9]{2}")
\end{code}

These will compile the RE using the default multiline, case-sensitive options,
but you can specify the options dynamically using `compileRegexWith`:
```
compileRegexWith :: (Functor m, Monad m) => SimpleREOptions -> String -> m RE
```
where `SimpleREOptions` is a simple enumerated type.

%include "Text/RE/REOptions.lhs" "^data SimpleREOptions"

\begin{code}
evalme_CPL_02 = checkThis "evalme_CPL_02" (["2016-01-09","2015-10-05"]) $ matches $ "2016-01-09 2015-12-5 2015-10-05" *=~ (maybe (error "evalme_CPL_01") id $ compileRegexWith MultilineSensitive "[0-9]{4}-[0-9]{2}-[0-9]{2}")
\end{code}

If you need to compile `SearchReplace` templates for use with `?=~/` and
`*=~/` then the `compileSearchReplace` and `compileSearchReplaceWith`,
```
compileSearchReplace     :: (Monad m, Functor m, IsRegex RE s) => String -> String -> m (SearchReplace RE s)
compileSearchReplaceWith :: (Monad m, Functor m, IsRegex RE s) => SimpleREOptions -> String -> String -> m (SearchReplace RE s)
```
work analagously to `compileRegex` and `compileRegexWith`, with the RE
and replacement template (either side of the '///' in the `[ed|...///...|]`
quasi quoters) being passed into these functions in two separate strings,
to compile to the `SearchReplace` type expected by the `?=~/` and `*=~/`
operators.

%include "Text/RE/ZeInternals/Types/SearchReplace.lhs" "^data SearchReplace"

The `escape` and `escapeWith` functions are special compilers that compile
a string into a RE that should match itself, which is assumed to be embedded
in a complex RE to be compiled.
```
escape :: (Functor m, Monad m) => (String->String) -> String -> m RE
```
The function pased in the first argument to `escape` takes the RE string
that will match the string passed in the second argument and yields the
RE to be compiled, which is returned from the parsing action.
\begin{code}
evalme_CPL_03 = checkThis "evalme_CPL_03" ("foobar") $ "fooe{0}bar" *=~/ SearchReplace (maybe (error "evalme_CPL_03") id $ escape id "e{0}") ""
\end{code}


The Classic regex-base Match Operators
--------------------------------------

The original `=~` and `=~~` match operators are still available for
those that have mastered them.
\begin{code}
evalme_CLC_01 = checkThis "evalme_CLC_01" (True )    $ ("bar"    =~  [re|(foo|bar)|] :: Bool)
\end{code}
\begin{code}
evalme_CLC_02 = checkThis "evalme_CLC_02" (False)    $ ("quux"   =~  [re|(foo|bar)|] :: Bool)
\end{code}
\begin{code}
evalme_CLC_03 = checkThis "evalme_CLC_03" (2)        $ ("foobar" =~  [re|(foo|bar)|] :: Int)
\end{code}
\begin{code}
evalme_CLC_04 = checkThis "evalme_CLC_04" (Nothing)  $ ("foo"    =~~ [re|bar|]       :: Maybe String)
\end{code}

\begin{code}
main :: IO ()
main = runTheTests
  [ evalme_CLC_04
  , evalme_CLC_03
  , evalme_CLC_02
  , evalme_CLC_01
  , evalme_CPL_03
  , evalme_CPL_02
  , evalme_CPL_01
  , evalme_SOP_06
  , evalme_SOP_05
  , evalme_SOP_04
  , evalme_SOP_03
  , evalme_SOP_02
  , evalme_SOP_01
  , evalme_SRP_01
  , evalme_SRP_00
  , evalme_MAC_01
  , evalme_MAC_00
  , evalme_MLT_03
  , evalme_MLT_02
  , evalme_MLT_01
  , evalme_SGL_03
  , evalme_SGL_02
  , evalme_SGL_01
  ]
\end{code}