Copyright | See LICENSE file |
---|---|
License | BSD3 |
Maintainer | Brad Neimann |
Safe Haskell | Safe-Inferred |
Language | Haskell2010 |
Brassica.SoundChange.Tokenise
Description
Synopsis
- tokeniseWord :: [String] -> String -> Either (ParseErrorBundle String Void) PWord
- data Component a
- getWords :: [Component a] -> [a]
- splitMultipleResults :: String -> Component [a] -> [Component a]
- joinComponents :: [Component [Component a]] -> [Component a]
- tokeniseWords :: [String] -> String -> Either (ParseErrorBundle String Void) [Component PWord]
- detokeniseWords' :: (a -> String) -> [Component a] -> String
- detokeniseWords :: [Component PWord] -> String
- findFirstCategoriesDecl :: SoundChanges c GraphemeList -> [String]
- withFirstCategoriesDecl :: ([String] -> t) -> SoundChanges c GraphemeList -> t
- wordParser :: [Char] -> [String] -> ParsecT Void String Identity PWord
- componentsParser :: ParsecT Void String Identity a -> ParsecT Void String Identity [Component a]
- sortByDescendingLength :: [[a]] -> [[a]]
High-level interface
tokeniseWord :: [String] -> String -> Either (ParseErrorBundle String Void) PWord
Tokenise a String input word into a PWord by splitting it up into Graphemes. A list of available multigraphs is supplied as the first argument.
Note that this tokeniser is greedy: if one of the given multigraphs is a prefix of another, the tokeniser will prefer the longest if possible. If there are no matching multigraphs starting at a particular character in the String, tokeniseWord will take that character as forming its own Grapheme. For instance:
>>> tokeniseWord [] "cherish"
Right [GMulti "c",GMulti "h",GMulti "e",GMulti "r",GMulti "i",GMulti "s",GMulti "h"]
>>> tokeniseWord ["e","h","i","r","s","sh"] "cherish"
Right [GMulti "c",GMulti "h",GMulti "e",GMulti "r",GMulti "i",GMulti "sh"]
>>> tokeniseWord ["c","ch","e","h","i","r","s","sh"] "cherish"
Right [GMulti "ch",GMulti "e",GMulti "r",GMulti "i",GMulti "sh"]
The resulting PWord can be converted back to a String using concatWithBoundary. (However, it is not strictly speaking a true inverse, as it deletes word boundaries.)
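The greedy strategy described above can be sketched using only base. This is an illustrative stand-in, not Brassica's implementation: greedyTokenise is a hypothetical name, and it returns plain Strings rather than Graphemes, with no error reporting.

```haskell
import Data.List (isPrefixOf, maximumBy)
import Data.Ord (comparing)

-- | Hypothetical stand-in for the greedy tokenisation strategy.
-- At each position, take the longest multigraph that prefixes the
-- remaining input; if none matches, take a single character.
greedyTokenise :: [String] -> String -> [String]
greedyTokenise multigraphs = go
  where
    go [] = []
    go input@(c : rest) =
      -- Empty multigraphs are ignored so that every step consumes input.
      case filter (\m -> not (null m) && m `isPrefixOf` input) multigraphs of
        []      -> [c] : go rest
        matches ->
          let longest = maximumBy (comparing length) matches
          in longest : go (drop (length longest) input)
```

With the multigraph lists from the examples above, this sketch splits "cherish" at the same points as tokeniseWord does.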
data Component a
Represents a component of a Brassica words file. Each word in the input has type a (often PWord or [PWord]).
Constructors
Word a | An input word to which sound changes will be applied |
Separator String | A separator, e.g. whitespace |
Gloss String | A gloss (in Brassica syntax, between square brackets) |
getWords :: [Component a] -> [a]
Given a tokenised input string, return only the Words within it.
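As a rough illustration of this behaviour, the sketch below redefines a local Component type mirroring the three constructors listed above. The name getWordsSketch is hypothetical, and the code is an assumption about the behaviour described here rather than the library source.

```haskell
-- Local stand-in mirroring the documented constructors; the real type
-- is exported by Brassica.SoundChange.Tokenise.
data Component a
  = Word a
  | Separator String
  | Gloss String
  deriving (Eq, Show)

-- | Keep the payload of every Word and drop separators and glosses.
-- The pattern in the comprehension silently skips non-Word values.
getWordsSketch :: [Component a] -> [a]
getWordsSketch cs = [w | Word w <- cs]
```

For example, getWordsSketch applied to a Word, a Separator, a Gloss and another Word yields just the two word payloads, in their original order.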
splitMultipleResults :: String -> Component [a] -> [Component a]
Given a Component containing multiple values in a Word, split it apart into a list of Components in which the given String is used as a Separator between multiple results.
For instance:
>>> splitMultipleResults "/" (Word ["abc", "def", "ghi"])
[Word "abc", Separator "/", Word "def", Separator "/", Word "ghi"]
>>> splitMultipleResults " " (Word ["abc"])
[Word "abc"]
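One way to obtain the results shown above is with intersperse from Data.List. The sketch below uses a local Component stand-in and the hypothetical name splitSketch. Passing Separator and Gloss values through unchanged is an assumption, since the examples above only cover Word.

```haskell
import Data.List (intersperse)

-- Local stand-in mirroring the documented constructors.
data Component a
  = Word a
  | Separator String
  | Gloss String
  deriving (Eq, Show)

-- | Wrap each result in Word and place the separator between results.
splitSketch :: String -> Component [a] -> [Component a]
splitSketch sep (Word ws)     = intersperse (Separator sep) (map Word ws)
-- Assumption: non-word components are returned as singleton lists.
splitSketch _   (Separator s) = [Separator s]
splitSketch _   (Gloss g)     = [Gloss g]
```

Because intersperse adds nothing to a one-element list, a Word with a single result comes back without any Separator, matching the second example above.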
joinComponents :: [Component [Component a]] -> [Component a]
Flatten a nested list of Components.
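The type suggests a flattening in which each Word is replaced by its inner list while other components are kept in place. The following sketch, with the hypothetical name joinSketch and a local Component stand-in, shows that reading. It is an assumption based on the signature, not the library source.

```haskell
-- Local stand-in mirroring the documented constructors.
data Component a
  = Word a
  | Separator String
  | Gloss String
  deriving (Eq, Show)

-- | Splice the contents of every Word into the outer list.
joinSketch :: [Component [Component a]] -> [Component a]
joinSketch = concatMap flatten
  where
    flatten (Word inner)  = inner          -- expand nested components
    flatten (Separator s) = [Separator s]  -- keep as-is
    flatten (Gloss g)     = [Gloss g]      -- keep as-is
```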
tokeniseWords :: [String] -> String -> Either (ParseErrorBundle String Void) [Component PWord]
Given a list of available multigraphs, tokenise an input words file into a list of words and other Components. This uses the same tokenisation strategy as tokeniseWord, but also recognises Glosses (in square brackets) and Separators (as whitespace).
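The division into words, glosses and separators can be sketched by hand using only base. The code below is a simplified illustration under stated assumptions: splitComponents is a hypothetical name, words are left as untokenised Strings, a Gloss stores its text without the brackets, and an unterminated gloss simply runs to the end of the input instead of producing a parse error.

```haskell
import Data.Char (isSpace)

-- Local stand-in mirroring the documented constructors.
data Component a
  = Word a
  | Separator String
  | Gloss String
  deriving (Eq, Show)

-- | Split raw text into whitespace runs, bracketed glosses and words.
splitComponents :: String -> [Component String]
splitComponents [] = []
splitComponents input@(c : rest)
  | isSpace c =
      -- A maximal run of whitespace becomes one Separator.
      let (ws, more) = span isSpace input
      in Separator ws : splitComponents more
  | c == '[' =
      -- Text up to the closing bracket becomes a Gloss; drop the ']'.
      let (gloss, more) = break (== ']') rest
      in Gloss gloss : splitComponents (drop 1 more)
  | otherwise =
      -- A word runs until whitespace or the start of a gloss.
      let (word, more) = break (\x -> isSpace x || x == '[') input
      in Word word : splitComponents more
```

In a fuller version, each Word payload would then be passed through the word tokeniser with the supplied multigraphs.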
detokeniseWords' :: (a -> String) -> [Component a] -> String
Inverse of tokeniseWords: given a function to convert Words to strings, converts a list of Components to a String.
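A minimal sketch of this conversion follows, using a local Component stand-in and the hypothetical name detokeniseSketch. It assumes that a Gloss stores its text without the surrounding square brackets, so the brackets are restored on output.

```haskell
-- Local stand-in mirroring the documented constructors.
data Component a
  = Word a
  | Separator String
  | Gloss String
  deriving (Eq, Show)

-- | Render each component and concatenate the pieces in order.
detokeniseSketch :: (a -> String) -> [Component a] -> String
detokeniseSketch render = concatMap go
  where
    go (Word w)      = render w          -- caller decides how words print
    go (Separator s) = s                 -- separators are emitted verbatim
    go (Gloss g)     = "[" ++ g ++ "]"   -- assumption: brackets re-added
```

For words held as lists of grapheme strings, passing concat as the rendering function plays a role similar to the one concatWithBoundary plays for PWords.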
detokeniseWords :: [Component PWord] -> String
Specialisation of detokeniseWords' for PWords, converting words to strings using concatWithBoundary.
findFirstCategoriesDecl :: SoundChanges c GraphemeList -> [String]
Given a list of sound changes, extract the list of multigraphs defined in the first GraphemeList of the SoundChanges.
withFirstCategoriesDecl :: ([String] -> t) -> SoundChanges c GraphemeList -> t
CPS'd form of findFirstCategoriesDecl. Nice for doing things like withFirstCategoriesDecl tokeniseWords changes words (to tokenise using the graphemes from the first categories declaration) and so on.
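The relationship between the direct and continuation-passing forms can be sketched with simplified types. Everything below is hypothetical: Changes stands in for a parsed sound-change file, firstDecl and withFirstDecl stand in for the two functions above, and returning an empty list when no declaration exists is an assumption.

```haskell
-- Hypothetical, simplified stand-in for a parsed sound-change file:
-- each inner list is one grapheme declaration, in file order.
newtype Changes = Changes [[String]]

-- | Direct form: the first declaration, or [] if there is none
-- (the empty-list fallback is an assumption of this sketch).
firstDecl :: Changes -> [String]
firstDecl (Changes (gs : _)) = gs
firstDecl (Changes [])       = []

-- | CPS form: hand the first declaration to a continuation.
withFirstDecl :: ([String] -> t) -> Changes -> t
withFirstDecl k = k . firstDecl

-- | Toy continuation taking multigraphs first and input second,
-- the same argument order as tokeniseWords.
isMultigraph :: [String] -> String -> Bool
isMultigraph multigraphs input = input `elem` multigraphs
```

Since the continuation receives the multigraph list as its first argument, an expression such as withFirstDecl isMultigraph changes "sh" reads in the same order as the usage quoted above.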
Lower-level functions
wordParser :: [Char] -> [String] -> ParsecT Void String Identity PWord
Megaparsec parser for PWords; see the tokeniseWord documentation for details on the parsing strategy. For most use cases tokeniseWord should suffice; wordParser itself is only really useful in unusual situations (e.g. as part of a larger parser).
The first parameter gives a list of characters aside from whitespace which should be excluded from words, i.e. the parser will stop if any of them are found. The second gives a list of multigraphs which might be expected, as with tokeniseWord.
Note: the second parameter must already be sorted by descending length; otherwise multigraphs will not be parsed correctly (i.e. greedily).
componentsParser
Arguments
:: ParsecT Void String Identity a | Parser for individual words (e.g. |
-> ParsecT Void String Identity [Component a] |
Megaparsec parser for Components. Similarly to wordParser, it is usually easier to use tokeniseWords instead.
sortByDescendingLength :: [[a]] -> [[a]]
Sort a list of lists by the length of the inner lists, in descending order.
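A sort of this kind can be written with sortOn and Down from base. The sketch below is an assumption about the behaviour, not the library source, and uses the hypothetical names sortDescSketch and firstMatch. The second function illustrates why the wordParser note asks for pre-sorted multigraphs: a first-match lookup is only greedy when longer candidates come first.

```haskell
import Data.List (find, isPrefixOf, sortOn)
import Data.Ord (Down (..))

-- | Longest lists first; sortOn is stable, so lists of equal length
-- keep their original relative order.
sortDescSketch :: [[a]] -> [[a]]
sortDescSketch = sortOn (Down . length)

-- | Return the first candidate that is a prefix of the input.
-- This picks the longest match only if candidates are sorted
-- by descending length.
firstMatch :: [String] -> String -> Maybe String
firstMatch candidates input = find (`isPrefixOf` input) candidates
```

With the unsorted list ["s","sh"], firstMatch stops at "s" for the input "she"; after sortDescSketch it finds "sh" instead.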