brassica-1.0.0: Featureful sound change applier
CopyrightSee LICENSE file
LicenseBSD3
MaintainerBrad Neimann
Safe HaskellSafe-Inferred
LanguageHaskell2010

Brassica.SoundChange.Tokenise

Description

This module provides functions to parse a Brassica words file into its constituent Components, and to tokenise the words in that file into their constituent graphemes to form PWords. It also provides functions to reverse these processes.

Synopsis

High-level interface

tokeniseWord :: [String] -> String -> Either (ParseErrorBundle String Void) PWord Source #

Tokenise a String input word into a PWord by splitting it up into Graphemes. A list of available multigraphs is supplied as the first argument.

Note that this tokeniser is greedy: if one of the given multigraphs is a prefix of another, the tokeniser will prefer the longest if possible. If there are no matching multigraphs starting at a particular character in the String, tokeniseWord will take that character as forming its own Grapheme. For instance:

>>> tokeniseWord [] "cherish"
Right [GMulti "c",GMulti "h",GMulti "e",GMulti "r",GMulti "i",GMulti "s",GMulti "h"]
>>> tokeniseWord ["e","h","i","r","s","sh"] "cherish"
Right [GMulti "c",GMulti "h",GMulti "e",GMulti "r",GMulti "i",GMulti "sh"]
>>> tokeniseWord ["c","ch","e","h","i","r","s","sh"] "cherish"
Right [GMulti "ch",GMulti "e",GMulti "r",GMulti "i",GMulti "sh"]

The resulting PWord can be converted back to a String using concatWithBoundary. (However, it is not strictly speaking a true inverse as it deletes word boundaries).

data Component a Source #

Represents a component of a Brassica words file. Each word in the input has type a (often PWord or [PWord]).

Constructors

Word a

An input word to which sound changes will be applied

Separator String

A separator, e.g. whitespace

Gloss String

A gloss (in Brassica syntax, between square brackets)

Instances

Instances details
Foldable Component Source # 
Instance details

Defined in Brassica.SoundChange.Tokenise

Methods

fold :: Monoid m => Component m -> m #

foldMap :: Monoid m => (a -> m) -> Component a -> m #

foldMap' :: Monoid m => (a -> m) -> Component a -> m #

foldr :: (a -> b -> b) -> b -> Component a -> b #

foldr' :: (a -> b -> b) -> b -> Component a -> b #

foldl :: (b -> a -> b) -> b -> Component a -> b #

foldl' :: (b -> a -> b) -> b -> Component a -> b #

foldr1 :: (a -> a -> a) -> Component a -> a #

foldl1 :: (a -> a -> a) -> Component a -> a #

toList :: Component a -> [a] #

null :: Component a -> Bool #

length :: Component a -> Int #

elem :: Eq a => a -> Component a -> Bool #

maximum :: Ord a => Component a -> a #

minimum :: Ord a => Component a -> a #

sum :: Num a => Component a -> a #

product :: Num a => Component a -> a #

Traversable Component Source # 
Instance details

Defined in Brassica.SoundChange.Tokenise

Methods

traverse :: Applicative f => (a -> f b) -> Component a -> f (Component b) #

sequenceA :: Applicative f => Component (f a) -> f (Component a) #

mapM :: Monad m => (a -> m b) -> Component a -> m (Component b) #

sequence :: Monad m => Component (m a) -> m (Component a) #

Functor Component Source # 
Instance details

Defined in Brassica.SoundChange.Tokenise

Methods

fmap :: (a -> b) -> Component a -> Component b #

(<$) :: a -> Component b -> Component a #

Generic (Component a) Source # 
Instance details

Defined in Brassica.SoundChange.Tokenise

Associated Types

type Rep (Component a) :: Type -> Type #

Methods

from :: Component a -> Rep (Component a) x #

to :: Rep (Component a) x -> Component a #

Show a => Show (Component a) Source # 
Instance details

Defined in Brassica.SoundChange.Tokenise

NFData a => NFData (Component a) Source # 
Instance details

Defined in Brassica.SoundChange.Tokenise

Methods

rnf :: Component a -> () #

Eq a => Eq (Component a) Source # 
Instance details

Defined in Brassica.SoundChange.Tokenise

Methods

(==) :: Component a -> Component a -> Bool #

(/=) :: Component a -> Component a -> Bool #

type Rep (Component a) Source # 
Instance details

Defined in Brassica.SoundChange.Tokenise

getWords :: [Component a] -> [a] Source #

Given a tokenised input string, return only the Words within it.

splitMultipleResults :: String -> Component [a] -> [Component a] Source #

Given a Component containing multiple values in a Word, split it apart into a list of Components in which the given String is used as a Separator between multiple results.

For instance:

>>> splitMultipleResults "/" (Word ["abc", "def", "ghi"])
[Word "abc", Separator "/", Word "def", Separator "/", Word "ghi"]
>>> splitMultipleResults " " (Word ["abc"])
[Word "abc"]

joinComponents :: [Component [Component a]] -> [Component a] Source #

Flatten a nested list of Components.

tokeniseWords :: [String] -> String -> Either (ParseErrorBundle String Void) [Component PWord] Source #

Given a list of available multigraphs, tokenise an input words file into a list of words and other Components. This uses the same tokenisation strategy as tokeniseWords, but also recognises Glosses (in square brackets) and Separators (as whitespace).

detokeniseWords' :: (a -> String) -> [Component a] -> String Source #

Inverse of tokeniseWords: given a function to convert Words to strings, converts a list of Components to strings.

detokeniseWords :: [Component PWord] -> String Source #

Specialisation of detokeniseWords' for PWords, converting words to strings using concatWithBoundary.

findFirstCategoriesDecl :: SoundChanges c GraphemeList -> [String] Source #

Given a list of sound changes, extract the list of multigraphs defined in the first GraphemeList of the SoundChanges.

withFirstCategoriesDecl :: ([String] -> t) -> SoundChanges c GraphemeList -> t Source #

CPS'd form of findFirstCategoriesDecl. Nice for doing things like withFirstCategoriesDecl tokeniseWords changes words (to tokenise using the graphemes from the first categories declaration) and so on.

Lower-level functions

wordParser :: [Char] -> [String] -> ParsecT Void String Identity PWord Source #

Megaparsec parser for PWords — see tokeniseWord documentation for details on the parsing strategy. For most usecases tokeniseWord should suffice; wordParser itself is only really useful in unusual situations (e.g. as part of a larger parser).

The first parameter gives a list of characters aside from whitespace which should be excluded from words, i.e. the parser will stop if any of them are found. The second gives a list of multigraphs which might be expected, as with tokeniseWord.

Note: the second parameter must be already be sorted by descending length; otherwise multigraphs will not be parsed correctly (i.e. greedily).

componentsParser Source #

Arguments

:: ParsecT Void String Identity a

Parser for individual words (e.g. wordParser)

-> ParsecT Void String Identity [Component a] 

Megaparsec parser for Components. Similarly to wordParser, usually it’s easier to use tokeniseWords instead.

sortByDescendingLength :: [[a]] -> [[a]] Source #

Sort a list of lists by the length of the inner lists, in descending order.