Copyright	See LICENSE file
License	BSD3
Maintainer	Brad Neimann
Safe Haskell	Safe-Inferred
Language	Haskell2010

Brassica.SoundChange.Tokenise

Contents

High-level interface
Lower-level functions

Description

This module provides functions to parse a Brassica words file into its constituent Components, and to tokenise the words in that file into their constituent graphemes to form PWords. It also provides functions to reverse these processes.

Synopsis

tokeniseWord :: [String] -> String -> Either (ParseErrorBundle String Void) PWord
data Component a
- = Word a
- | Separator String
- | Gloss String
getWords :: [Component a] -> [a]
splitMultipleResults :: String -> Component [a] -> [Component a]
joinComponents :: [Component [Component a]] -> [Component a]
tokeniseWords :: [String] -> String -> Either (ParseErrorBundle String Void) [Component PWord]
detokeniseWords' :: (a -> String) -> [Component a] -> String
detokeniseWords :: [Component PWord] -> String
findFirstCategoriesDecl :: SoundChanges c GraphemeList -> [String]
withFirstCategoriesDecl :: ([String] -> t) -> SoundChanges c GraphemeList -> t
wordParser :: [Char] -> [String] -> ParsecT Void String Identity PWord
componentsParser :: ParsecT Void String Identity a -> ParsecT Void String Identity [Component a]
sortByDescendingLength :: [[a]] -> [[a]]

High-level interface

tokeniseWord :: [String] -> String -> Either (ParseErrorBundle String Void) PWord Source #

Tokenise a String input word into a PWord by splitting it up into Graphemes. A list of available multigraphs is supplied as the first argument.

Note that this tokeniser is greedy: if one of the given multigraphs is a prefix of another, the tokeniser will prefer the longest if possible. If there are no matching multigraphs starting at a particular character in the String, tokeniseWord will take that character as forming its own Grapheme. For instance:

>>> tokeniseWord [] "cherish"
Right [GMulti "c",GMulti "h",GMulti "e",GMulti "r",GMulti "i",GMulti "s",GMulti "h"]

>>> tokeniseWord ["e","h","i","r","s","sh"] "cherish"
Right [GMulti "c",GMulti "h",GMulti "e",GMulti "r",GMulti "i",GMulti "sh"]

>>> tokeniseWord ["c","ch","e","h","i","r","s","sh"] "cherish"
Right [GMulti "ch",GMulti "e",GMulti "r",GMulti "i",GMulti "sh"]

The resulting PWord can be converted back to a String using concatWithBoundary. (However, it is not strictly speaking a true inverse as it deletes word boundaries).

data Component a Source #

Represents a component of a Brassica words file. Each word in the input has type a (often PWord or [PWord]).

Constructors

Word a	An input word to which sound changes will be applied
Separator String	A separator, e.g. whitespace
Gloss String	A gloss (in Brassica syntax, between square brackets)

Instances

Instances details

Foldable Component Source #
Instance details Defined in Brassica.SoundChange.Tokenise Methods fold :: Monoid m => Component m -> m # foldMap :: Monoid m => (a -> m) -> Component a -> m # foldMap' :: Monoid m => (a -> m) -> Component a -> m # foldr :: (a -> b -> b) -> b -> Component a -> b # foldr' :: (a -> b -> b) -> b -> Component a -> b # foldl :: (b -> a -> b) -> b -> Component a -> b # foldl' :: (b -> a -> b) -> b -> Component a -> b # foldr1 :: (a -> a -> a) -> Component a -> a # foldl1 :: (a -> a -> a) -> Component a -> a # toList :: Component a -> [a] # null :: Component a -> Bool # length :: Component a -> Int # elem :: Eq a => a -> Component a -> Bool # maximum :: Ord a => Component a -> a # minimum :: Ord a => Component a -> a # sum :: Num a => Component a -> a # product :: Num a => Component a -> a #
Traversable Component Source #
Instance details Defined in Brassica.SoundChange.Tokenise Methods traverse :: Applicative f => (a -> f b) -> Component a -> f (Component b) # sequenceA :: Applicative f => Component (f a) -> f (Component a) # mapM :: Monad m => (a -> m b) -> Component a -> m (Component b) # sequence :: Monad m => Component (m a) -> m (Component a) #
Functor Component Source #
Instance details Defined in Brassica.SoundChange.Tokenise Methods fmap :: (a -> b) -> Component a -> Component b # (<$) :: a -> Component b -> Component a #
Generic (Component a) Source #
Instance details Defined in Brassica.SoundChange.Tokenise Associated Types type Rep (Component a) :: Type -> Type # Methods from :: Component a -> Rep (Component a) x # to :: Rep (Component a) x -> Component a #
Show a => Show (Component a) Source #
Instance details Defined in Brassica.SoundChange.Tokenise Methods showsPrec :: Int -> Component a -> ShowS # show :: Component a -> String # showList :: [Component a] -> ShowS #
NFData a => NFData (Component a) Source #
Instance details Defined in Brassica.SoundChange.Tokenise Methods rnf :: Component a -> () #
Eq a => Eq (Component a) Source #
Instance details Defined in Brassica.SoundChange.Tokenise Methods (==) :: Component a -> Component a -> Bool # (/=) :: Component a -> Component a -> Bool #
type Rep (Component a) Source #
Instance details Defined in Brassica.SoundChange.Tokenise type Rep (Component a) = D1 ('MetaData "Component" "Brassica.SoundChange.Tokenise" "brassica-1.0.0-inplace" 'False) (C1 ('MetaCons "Word" 'PrefixI 'False) (S1 ('MetaSel ('Nothing :: Maybe Symbol) 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 a)) :+: (C1 ('MetaCons "Separator" 'PrefixI 'False) (S1 ('MetaSel ('Nothing :: Maybe Symbol) 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 String)) :+: C1 ('MetaCons "Gloss" 'PrefixI 'False) (S1 ('MetaSel ('Nothing :: Maybe Symbol) 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 String))))

getWords :: [Component a] -> [a] Source #

Given a tokenised input string, return only the Words within it.

splitMultipleResults :: String -> Component [a] -> [Component a] Source #

Given a Component containing multiple values in a Word, split it apart into a list of Components in which the given String is used as a Separator between multiple results.

For instance:

>>> splitMultipleResults "/" (Word ["abc", "def", "ghi"])
[Word "abc", Separator "/", Word "def", Separator "/", Word "ghi"]

>>> splitMultipleResults " " (Word ["abc"])
[Word "abc"]

joinComponents :: [Component [Component a]] -> [Component a] Source #

Flatten a nested list of Components.

tokeniseWords :: [String] -> String -> Either (ParseErrorBundle String Void) [Component PWord] Source #

Given a list of available multigraphs, tokenise an input words file into a list of words and other Components. This uses the same tokenisation strategy as tokeniseWords, but also recognises Glosses (in square brackets) and Separators (as whitespace).

detokeniseWords' :: (a -> String) -> [Component a] -> String Source #

Inverse of tokeniseWords: given a function to convert Words to strings, converts a list of Components to strings.

detokeniseWords :: [Component PWord] -> String Source #

Specialisation of detokeniseWords' for PWords, converting words to strings using concatWithBoundary.

findFirstCategoriesDecl :: SoundChanges c GraphemeList -> [String] Source #

Given a list of sound changes, extract the list of multigraphs defined in the first GraphemeList of the SoundChanges.

withFirstCategoriesDecl :: ([String] -> t) -> SoundChanges c GraphemeList -> t Source #

CPS'd form of findFirstCategoriesDecl. Nice for doing things like withFirstCategoriesDecl tokeniseWords changes words (to tokenise using the graphemes from the first categories declaration) and so on.

Lower-level functions

wordParser :: [Char] -> [String] -> ParsecT Void String Identity PWord Source #

Megaparsec parser for PWords — see tokeniseWord documentation for details on the parsing strategy. For most usecases tokeniseWord should suffice; wordParser itself is only really useful in unusual situations (e.g. as part of a larger parser).

The first parameter gives a list of characters aside from whitespace which should be excluded from words, i.e. the parser will stop if any of them are found. The second gives a list of multigraphs which might be expected, as with tokeniseWord.

Note: the second parameter must be already be sorted by descending length; otherwise multigraphs will not be parsed correctly (i.e. greedily).

componentsParser Source #

Arguments

:: ParsecT Void String Identity a	Parser for individual words (e.g. `wordParser`)
-> ParsecT Void String Identity [Component a]

Megaparsec parser for Components. Similarly to wordParser, usually it’s easier to use tokeniseWords instead.

sortByDescendingLength :: [[a]] -> [[a]] Source #

Sort a list of lists by the length of the inner lists, in descending order.