| Safe Haskell | Safe-Inferred |
|---|---|
| Language | Haskell2010 |
Brassica.SoundChange.Tokenise
Synopsis
- data Component a
- getWords :: [Component a] -> [a]
- splitMultipleResults :: String -> Component [a] -> [Component a]
- tokeniseWord :: [String] -> String -> Either (ParseErrorBundle String Void) PWord
- tokeniseWords :: [String] -> String -> Either (ParseErrorBundle String Void) [Component PWord]
- detokeniseWords' :: (a -> String) -> [Component a] -> String
- detokeniseWords :: [Component PWord] -> String
- concatWithBoundary :: PWord -> String
- findFirstCategoriesDecl :: SoundChanges c [Grapheme] -> [String]
- withFirstCategoriesDecl :: ([String] -> t) -> SoundChanges c [Grapheme] -> t
- wordParser :: [Char] -> [String] -> ParsecT Void String Identity PWord
- componentsParser :: ParsecT Void String Identity a -> ParsecT Void String Identity [Component a]
- sortByDescendingLength :: [[a]] -> [[a]]
Components
Represents a component of a tokenised input string. Words in
the input are represented as the type parameter a — which for
this reason will usually, though not always, be PWord.
getWords :: [Component a] -> [a]
Given a tokenised input string, return only the Words within
it.
splitMultipleResults :: String -> Component [a] -> [Component a]
Given a Component containing multiple values in a Word,
split it apart into a list of Components in which the given
String is used as a Separator between multiple results.
For instance:
>>> splitMultipleResults " " (Word ["abc", "def", "ghi"])
[Word "abc", Separator " ", Word "def", Separator " ", Word "ghi"]
>>> splitMultipleResults " " (Word ["abc"])
[Word "abc"]
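The behaviour of getWords and splitMultipleResults can be sketched with a local stand-in type. This is not the library's source: the constructor set (Word, Separator, Gloss) is inferred from the descriptions on this page, the names ending in Sketch are hypothetical, and passing non-Word components through unchanged is an assumption.

```haskell
import Data.List (intersperse)

-- Local stand-in for the library's Component type (assumed constructors).
data Component a
  = Word a
  | Separator String
  | Gloss String
  deriving (Eq, Show)

-- Keep only the payloads of Word components, dropping everything else.
getWordsSketch :: [Component a] -> [a]
getWordsSketch cs = [w | Word w <- cs]

-- Turn a Word holding several results into one Word per result,
-- with a Separator placed between consecutive results.
-- Assumption: Separators and Glosses are returned unchanged.
splitMultipleResultsSketch :: String -> Component [a] -> [Component a]
splitMultipleResultsSketch sep (Word ws) =
  intersperse (Separator sep) (map Word ws)
splitMultipleResultsSketch _ (Separator s) = [Separator s]
splitMultipleResultsSketch _ (Gloss g) = [Gloss g]
```

Because intersperse only inserts between elements, a Word with a single result yields a single Word and no Separator, matching the second example above.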
High-level interface
tokeniseWord :: [String] -> String -> Either (ParseErrorBundle String Void) PWord
Tokenise a String input word into a PWord by splitting it up
into Graphemes. A list of available multigraphs is supplied as
the first argument.
Note that this tokeniser is greedy: if one of the given
multigraphs is a prefix of another, the tokeniser will prefer the
longest if possible. If there are no matching multigraphs starting
at a particular character in the String, tokeniseWord will
treat that character as its own Grapheme. For instance:
>>> tokeniseWord [] "cherish"
Right [GMulti "c",GMulti "h",GMulti "e",GMulti "r",GMulti "i",GMulti "s",GMulti "h"]
>>> tokeniseWord ["e","h","i","r","s","sh"] "cherish"
Right [GMulti "c",GMulti "h",GMulti "e",GMulti "r",GMulti "i",GMulti "sh"]
>>> tokeniseWord ["c","ch","e","h","i","r","s","sh"] "cherish"
Right [GMulti "ch",GMulti "e",GMulti "r",GMulti "i",GMulti "sh"]
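The greedy strategy described above can be illustrated with a minimal sketch that uses only base. It is not the library's implementation: graphemes are plain Strings rather than the Grapheme type, no Megaparsec is involved, and tokeniseGreedy is a hypothetical name introduced for illustration.

```haskell
import Data.List (sortBy, stripPrefix)
import Data.Maybe (mapMaybe)
import Data.Ord (Down (..), comparing)

-- Greedy longest-match tokenisation over plain Strings (illustrative only).
tokeniseGreedy :: [String] -> String -> [String]
tokeniseGreedy multigraphs = go
  where
    -- Try longer multigraphs first so that "sh" is preferred over "s".
    -- Empty multigraphs are dropped so that every step consumes input.
    sorted = sortBy (comparing (Down . length)) (filter (not . null) multigraphs)

    go [] = []
    go input@(c : rest) =
      case mapMaybe (\m -> (,) m <$> stripPrefix m input) sorted of
        -- The first hit is the longest multigraph matching at this position.
        (m, remaining) : _ -> m : go remaining
        -- No multigraph matches: the character becomes its own grapheme.
        [] -> [c] : go rest
```

Under these assumptions, tokeniseGreedy ["c","ch","e","h","i","r","s","sh"] "cherish" evaluates to ["ch","e","r","i","sh"], mirroring the third example above.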
tokeniseWords :: [String] -> String -> Either (ParseErrorBundle String Void) [Component PWord]
Given a list of available multigraphs, tokenise an input string
into a list of words and other Components. This uses the same
tokenisation strategy as tokeniseWord, but also recognises
Glosses (in square brackets) and Separators (in the form of
whitespace).
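As a rough illustration of how an input string divides into components, the following sketch splits a String into words, whitespace separators, and bracketed glosses using only base. It is an assumption-laden stand-in, not the library's parser: words are left as plain Strings instead of being tokenised into PWords, the Component type is a local copy with assumed constructors, a Gloss is assumed to store its text without the brackets, and an unterminated gloss is accepted silently where a real parser would presumably report an error. The name componentsSketch is hypothetical.

```haskell
import Data.Char (isSpace)

-- Local stand-in for the library's Component type (assumed constructors).
data Component a
  = Word a
  | Separator String
  | Gloss String
  deriving (Eq, Show)

-- Split an input string into words, separators and glosses.
componentsSketch :: String -> [Component String]
componentsSketch [] = []
componentsSketch input@(c : rest)
  -- A run of whitespace becomes a single Separator.
  | isSpace c =
      let (ws, remaining) = span isSpace input
       in Separator ws : componentsSketch remaining
  -- Text in square brackets becomes a Gloss (brackets not stored).
  | c == '[' =
      let (gloss, remaining) = break (== ']') rest
       in Gloss gloss : componentsSketch (drop 1 remaining)
  -- Anything else is a word, running up to whitespace or a gloss.
  | otherwise =
      let (w, remaining) = break (\x -> isSpace x || x == '[') input
       in Word w : componentsSketch remaining
```

For example, componentsSketch "abc def [gloss]" yields [Word "abc", Separator " ", Word "def", Separator " ", Gloss "gloss"].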
detokeniseWords :: [Component PWord] -> String
Specialisation of detokeniseWords' for PWords, converting
words to strings using concatWithBoundary.
concatWithBoundary :: PWord -> String
findFirstCategoriesDecl :: SoundChanges c [Grapheme] -> [String]
Given a list of sound changes, extract the list of multigraphs
defined in the first categories declaration of the SoundChanges.
withFirstCategoriesDecl :: ([String] -> t) -> SoundChanges c [Grapheme] -> t
CPS'd form of findFirstCategoriesDecl. Nice for doing things
like withFirstCategoriesDecl tokeniseWords changes words (to
tokenise using the graphemes from the first categories declaration)
and so on.
Lower-level functions
wordParser :: [Char] -> [String] -> ParsecT Void String Identity PWord
Megaparsec parser for PWords — see tokeniseWord documentation
for details on the parsing strategy and the meaning of the second
parameter. For most use cases tokeniseWord should suffice;
wordParser itself is only really useful in unusual situations
(e.g. as part of a larger parser). The first parameter gives a list
of characters (aside from whitespace) which should be excluded from
words, i.e. the parser will stop if any of them are found. The second
gives a list of multigraphs which might be expected.
Note: the second parameter must be sortByDescendingLength-ed;
otherwise multigraphs will not be parsed correctly.
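To see why the ordering matters, consider a matcher that returns the first candidate which is a prefix of the input. The sketch below uses only base; sortByDescendingLengthSketch is one plausible way to write such a sort, not necessarily the library's definition, and firstMatch is a hypothetical helper that stands in for trying alternatives in order.

```haskell
import Data.List (find, isPrefixOf, sortBy)
import Data.Ord (Down (..), comparing)

-- Sort lists so that longer ones come first (sortBy is stable,
-- so equal-length entries keep their original relative order).
sortByDescendingLengthSketch :: [[a]] -> [[a]]
sortByDescendingLengthSketch = sortBy (comparing (Down . length))

-- Return the first candidate that is a prefix of the input,
-- the way a sequence of parser alternatives would be tried in order.
firstMatch :: [String] -> String -> Maybe String
firstMatch candidates input = find (`isPrefixOf` input) candidates
```

With the unsorted list ["s","sh"], firstMatch picks "s" for the input "sh", so the digraph is never recognised; after sortByDescendingLengthSketch the same call picks "sh".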
componentsParser :: ParsecT Void String Identity a -> ParsecT Void String Identity [Component a]
Megaparsec parser for Components. Similarly to wordParser,
usually it’s easier to use tokeniseWords instead.
sortByDescendingLength :: [[a]] -> [[a]]