chatter-0.5.2.0: A library of simple NLP algorithms.

Safe Haskell: None
Language: Haskell2010

NLP.Types.Annotations

Documentation

prettyShow :: Pretty p => p -> Text

Convert a pretty-printable value into a text string.

newtype Index a

A safe index type that uses a phantom type parameter to prevent us from indexing into the wrong thing.

Constructors

Index Int 

Instances

Eq (Index a) 
Ord (Index a) 
Read (Index a) 
Show (Index a) 
Generic (Index a) 
Hashable (Index a) 
type Rep (Index a) 
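
The phantom-type idea can be illustrated with a small, self-contained sketch. The Index type below is a local stand-in that mirrors the documented constructor; it is not imported from chatter, and the marker types Chars and Toks and the helpers charAt and tokenAt are hypothetical names introduced only for this example.

```haskell
-- Local stand-in mirroring the documented shape; not imported from chatter.
newtype Index a = Index Int
  deriving (Eq, Ord, Show)

-- Hypothetical marker types for two different things one might index into.
data Chars
data Toks

-- Hypothetical helper: look up a character, accepting only a character index.
charAt :: String -> Index Chars -> Maybe Char
charAt s (Index i)
  | i >= 0 && i < length s = Just (s !! i)
  | otherwise              = Nothing

-- Hypothetical helper: look up a token, accepting only a token index.
tokenAt :: [String] -> Index Toks -> Maybe String
tokenAt ts (Index i)
  | i >= 0 && i < length ts = Just (ts !! i)
  | otherwise               = Nothing

-- charAt "hello" (Index 1 :: Index Toks) is rejected by the type checker,
-- because an Index Toks is not an Index Chars.
```

Both indices wrap a plain Int at runtime; the type parameter exists only so the compiler can tell them apart.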

data Annotation dat tag

Annotations are the base of all tags (POS tags, Chunks, marked entities, etc.)

The semantics of the particular annotation depend on the type of the value, and these can be wrapped up in a newtype for easier use.

Constructors

Annotation 

Fields

startIdx :: !(Index dat)

The starting index of the annotation (a character offset into the underlying data).

len :: !Int

The length of the annotation, measured from startIdx (so the annotation ends at startIdx + len).

value :: tag

The value, such as a POS tag.

payload :: dat

The underlying thing that is being annotated (such as a text string, or a list of other annotations)

Instances

(Eq dat, Eq tag) => Eq (Annotation dat tag) 
(Ord dat, Ord tag) => Ord (Annotation dat tag) 
(Read dat, Read tag) => Read (Annotation dat tag) 
(Show dat, Show tag) => Show (Annotation dat tag) 
Generic (Annotation dat tag) 
(Hashable dat, Hashable tag) => Hashable (Annotation dat tag) 
AnnotatedText (Annotation Text Token) 
type Rep (Annotation dat tag) 
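
As a rough illustration of how the fields fit together, the sketch below rebuilds the record locally (with String in place of Text) and recovers the span of text an annotation covers. It assumes, as the field name suggests, that len counts units from startIdx; coveredText is a hypothetical helper, not a chatter function.

```haskell
-- Local stand-ins mirroring the documented record; not imported from chatter.
newtype Index a = Index Int
  deriving (Eq, Show)

data Annotation dat tag = Annotation
  { startIdx :: !(Index dat)  -- where the annotation starts
  , len      :: !Int          -- assumed: how many units it covers
  , value    :: tag           -- the tag itself
  , payload  :: dat           -- the thing being annotated
  } deriving (Eq, Show)

-- Hypothetical helper: the slice of the payload that the annotation covers.
coveredText :: Annotation String tag -> String
coveredText (Annotation (Index s) l _ txt) = take l (drop s txt)

-- An annotation marking "quick" as an adjective.
example :: Annotation String String
example = Annotation (Index 4) 5 "JJ" "the quick fox"

-- coveredText example == "quick"
```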

data TokenizedSentence

Wrapper around both the underlying text and the tokenizer results.

Constructors

TokenizedSentence 

Fields

tokText :: Text
 
tokAnnotations :: [Annotation Text Token]
 

tokens :: TokenizedSentence -> [Token]

Get the raw tokens out of a TokenizedSentence
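
The relationship between the text, the token annotations, and the raw tokens can be sketched as follows. Everything here is a simplified local stand-in: Span replaces Annotation Text Token, String replaces Text, and whitespaceTokenize is a hypothetical tokenizer written for this example, not the one shipped with chatter.

```haskell
import Data.Char (isSpace)

newtype Token = Token String
  deriving (Eq, Show)

-- Simplified stand-in for Annotation Text Token: offset, length, token.
data Span = Span { spanStart :: Int, spanLen :: Int, spanTok :: Token }
  deriving (Eq, Show)

data TokenizedSentence = TokenizedSentence
  { tokText        :: String
  , tokAnnotations :: [Span]
  } deriving (Eq, Show)

-- Hypothetical whitespace tokenizer that records character offsets.
whitespaceTokenize :: String -> TokenizedSentence
whitespaceTokenize txt = TokenizedSentence txt (go 0 txt)
  where
    go _ [] = []
    go i s@(c:cs)
      | isSpace c = go (i + 1) cs
      | otherwise =
          let (w, rest) = break isSpace s
              n         = length w
          in Span i n (Token w) : go (i + n) rest

-- Sketch of tokens: drop the offsets and keep only the token values.
tokens :: TokenizedSentence -> [Token]
tokens = map spanTok . tokAnnotations

-- tokens (whitespaceTokenize "a  bc") == [Token "a", Token "bc"]
```

Keeping the original text alongside the annotations means the offsets stay meaningful even when whitespace is irregular.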

data TaggedSentence pos

Results of the POS tagger, which encompasses a TokenizedSentence.

Instances

Eq pos => Eq (TaggedSentence pos) 
Ord pos => Ord (TaggedSentence pos) 
Read pos => Read (TaggedSentence pos) 
Show pos => Show (TaggedSentence pos) 
Generic (TaggedSentence pos) 
Arbitrary pos => Arbitrary (TaggedSentence pos) 
Hashable pos => Hashable (TaggedSentence pos) 
POS pos => Pretty (TaggedSentence pos) 
AnnotatedText (TaggedSentence pos) 
type Rep (TaggedSentence pos) 

tsLength :: POS pos => TaggedSentence pos -> Int

Count the length of the tokens of a TaggedSentence.

Note that this is probably also the number of annotations, but the two are not necessarily the same.

tsToPairs :: POS pos => TaggedSentence pos -> [(Token, pos)]

Generate a list of Tokens and their corresponding POS tags. Creates a token for each POS tag, just in case any POS tags are annotated over multiple tokens.

applyTags :: POS pos => TokenizedSentence -> [pos] -> TaggedSentence pos

Apply a parallel list of POS tags to a TokenizedSentence

getTags :: POS pos => TaggedSentence pos -> [pos]

Extract the POS tags from a tagged sentence.

unapplyTags :: POS pos => TaggedSentence pos -> (TokenizedSentence, [pos])

Extract the POS tags from a tagged sentence, returning the tokenized sentence that they applied to.
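
The way these functions relate can be sketched with simplified stand-in types. None of the definitions below come from chatter: the primed names, TagSpan, and toPairs are hypothetical, the POS constraint is dropped, and joining a multi-token span with spaces is an assumption made only for this example.

```haskell
newtype Token = Token String
  deriving (Eq, Show)

-- Simplified stand-ins: a sentence is just its tokens, and each tag
-- annotation covers a run of tokens given by a start index and a length.
newtype TokenizedSentence = TokenizedSentence [Token]
  deriving (Eq, Show)

data TagSpan pos = TagSpan { tagStart :: Int, tagLen :: Int, tagValue :: pos }
  deriving (Eq, Show)

data TaggedSentence pos = TaggedSentence TokenizedSentence [TagSpan pos]
  deriving (Eq, Show)

-- Attach a parallel list of tags, one single-token span per tag.
applyTags' :: TokenizedSentence -> [pos] -> TaggedSentence pos
applyTags' ts tags =
  TaggedSentence ts (zipWith (\i t -> TagSpan i 1 t) [0 ..] tags)

-- Read the tags back off the annotations.
getTags' :: TaggedSentence pos -> [pos]
getTags' (TaggedSentence _ spans) = map tagValue spans

-- Recover both the tokenized sentence and the tags.
unapplyTags' :: TaggedSentence pos -> (TokenizedSentence, [pos])
unapplyTags' tagged@(TaggedSentence ts _) = (ts, getTags' tagged)

-- One (token, tag) pair per tag annotation. A span over several tokens
-- yields one merged token (joined with spaces here, as an assumption).
toPairs :: TaggedSentence pos -> [(Token, pos)]
toPairs (TaggedSentence (TokenizedSentence toks) spans) =
  [ (merge (take n (drop s toks)), t) | TagSpan s n t <- spans ]
  where
    merge ws = Token (unwords [w | Token w <- ws])
```

In this sketch, unapplyTags' (applyTags' ts tags) returns (ts, tags) whenever the two lists have the same length. A span covering two tokens shows why the number of tokens and the number of annotations can differ.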

data ChunkedSentence pos chunk

A chunked sentence, with underlying part-of-speech tags and tokens. Note: this is not a deep tree; a separate parse tree is needed.

Instances

(Eq pos, Eq chunk) => Eq (ChunkedSentence pos chunk) 
(Ord pos, Ord chunk) => Ord (ChunkedSentence pos chunk) 
(Read pos, Read chunk) => Read (ChunkedSentence pos chunk) 
(Show pos, Show chunk) => Show (ChunkedSentence pos chunk) 
Generic (ChunkedSentence pos chunk) 
(Hashable pos, Hashable chunk) => Hashable (ChunkedSentence pos chunk) 
AnnotatedText (ChunkedSentence pos chunk) 
type Rep (ChunkedSentence pos chunk) 

data NERedSentence pos chunk ne

A sentence that has been marked with named entities.

Constructors

NERedSentence 

Fields

neChunkSentence :: ChunkedSentence pos chunk
 
neAnnotations :: [Annotation (TaggedSentence pos) ne]

These annotations are annotating the TaggedSentence contained in the ChunkedSentence

Instances

(Eq pos, Eq chunk, Eq ne) => Eq (NERedSentence pos chunk ne) 
(Ord pos, Ord chunk, Ord ne) => Ord (NERedSentence pos chunk ne) 
(Read pos, Read chunk, Read ne) => Read (NERedSentence pos chunk ne) 
(Show pos, Show chunk, Show ne) => Show (NERedSentence pos chunk ne) 
Generic (NERedSentence pos chunk ne) 
(Hashable pos, Hashable chunk, Hashable ne) => Hashable (NERedSentence pos chunk ne) 
AnnotatedText (NERedSentence pos chunk ne) 
type Rep (NERedSentence pos chunk ne) 

class AnnotatedText sentence where

Typeclass of things that have underlying text, so it's easy to get the annotated document out of a tagged, tokenized, or chunked result.

Methods

getText :: sentence -> Text

newtype Token

Sentinel value for tokens.

Tokenization takes in text and produces annotations (type Tokenizer = Text -> TokenizedSentence).

Chunking requires POS tags (and tokenization) and generates annotations on the tokens (type Chunker pos chunk = TaggedSentence pos -> ChunkedSentence pos chunk).

Named entity recognition requires POS tags and tokens, and produces annotations with named entities marked (type NERer pos chunk ne = ChunkedSentence pos chunk -> NERedSentence pos chunk ne).

Constructors

Token Text 

showTok :: Token -> Text

Unwrap the text of a Token

suffix :: Token -> Text

Extract the last three characters of a Token, if the token is long enough, otherwise returns the full token text.
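
The documented behaviour of showTok and suffix can be sketched as below. This is a local reimplementation over String rather than Text, written from the descriptions above; it is not the library's source.

```haskell
-- Local stand-in for the documented newtype, using String instead of Text.
newtype Token = Token String
  deriving (Eq, Show)

-- Unwrap the text of a token.
showTok :: Token -> String
showTok (Token t) = t

-- Last three characters, or the whole text when the token is short.
suffix :: Token -> String
suffix (Token t)
  | length t <= 3 = t
  | otherwise     = drop (length t - 3) t

-- suffix (Token "running") == "ing"
-- suffix (Token "at")      == "at"
```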

class (Ord a, Eq a, Read a, Show a, Generic a, Serialize a, Hashable a) => POS a where

The class of POS Tags.

We use a typeclass here because POS tags need a few things beyond equality (they also need to be serializable and human-readable). Passing all the constraints around everywhere becomes a hassle, and it's handy to have a uniform interface to the different kinds of tag types.

This typeclass also allows corpus-specific tags to be distinguished: they have different semantics, so they should not be merged. That said, if you wish to create a unifying POS tag set, with mappings into that set, you can use the type system to ensure that this is done correctly.

Minimal complete definition

serializePOS, parsePOS, tagUNK, startPOS, endPOS, isDt

Methods

serializePOS :: a -> Text

Serialize a POS to a text representation (e.g. NN, VB). This is the dual of parsePOS.

parsePOS :: Text -> Either Error a

Parse a POS tag into a structured POS value (e.g. NN, VB). This is the dual of serializePOS.

safeParsePOS :: Text -> a

tagUNK :: a

The value used to represent "unknown".

startPOS :: a

Special marker POS for start of a corpus.

endPOS :: a

Special marker POS for the end of a corpus.

isDt :: a -> Bool

Check if a tag is a determiner tag.

Instances

POS RawTag

Tag instance for unknown tagsets.

POS Tag 
POS Tag 
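
To show what an instance might look like, here is a minimal sketch with a toy tag set. The class below is a cut-down local stand-in: it omits the Generic, Serialize, and Hashable superclasses, uses String in place of Text and Error, and ToyTag is a hypothetical tag set. The default for safeParsePOS (falling back to tagUNK) is an assumption, since the page does not document that method.

```haskell
-- Hypothetical toy tag set, including the special marker tags.
data ToyTag = START | END | DT | NN | VB | UNK
  deriving (Eq, Ord, Read, Show)

-- Cut-down local stand-in for the documented class.
class (Ord a, Eq a, Read a, Show a) => POS a where
  serializePOS :: a -> String
  parsePOS     :: String -> Either String a

  -- Assumed default: fall back to tagUNK when parsing fails.
  safeParsePOS :: String -> a
  safeParsePOS s = either (const tagUNK) id (parsePOS s)

  tagUNK   :: a
  startPOS :: a
  endPOS   :: a
  isDt     :: a -> Bool

instance POS ToyTag where
  serializePOS = show
  parsePOS s = case reads s of
    [(t, "")] -> Right t
    _         -> Left ("unknown tag: " ++ s)
  tagUNK   = UNK
  startPOS = START
  endPOS   = END
  isDt     = (== DT)

-- parsePOS (serializePOS NN) == Right NN
```

Because each tag set is its own type, a function written against one instance cannot accidentally be handed tags from another corpus.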

class (Ord a, Eq a, Read a, Show a, Generic a, Serialize a, Hashable a) => Chunk a where

The class of things that can be regarded as chunks. Chunk tags are much like POS tags, but the two should not be confused. Generally, chunks distinguish between different phrasal categories (e.g. noun phrases, verb phrases, prepositional phrases).

Methods

serializeChunk :: a -> Text

Serialize a chunk to a text representation (such as NP or VP). This is the dual of parseChunk.

parseChunk :: Text -> Either Error a

Parse a chunk from a text representation (such as NP or VP). This is the dual of serializeChunk.

notChunk :: a

Special chunk value to indicate something is not in a chunk.

class (Ord a, Eq a, Read a, Show a, Generic a, Serialize a, Hashable a) => NamedEntity a where

The class of named entity sets. This typeclass can be defined entirely in terms of the required class constraints.

Minimal complete definition

Nothing

Methods

serializeNETag :: a -> Text

Serialize a named entity to a textual representation (e.g. MISC, PER, ORG). This is the dual of parseNETag.

parseNETag :: Text -> Either Error a

Parse a named entity from a textual representation (e.g. MISC, PER, ORG). This is the dual of serializeNETag.
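
Since the minimal complete definition is empty, both methods must have defaults. One plausible way to build them from the Show and Read superclasses is sketched below. This is a local stand-in, not the library's source: it drops the Generic, Serialize, and Hashable constraints, uses String in place of Text and Error, and ToyNE is a hypothetical entity set.

```haskell
import Text.Read (readMaybe)

-- Cut-down local stand-in: defaults are written in terms of Show and Read.
class (Ord a, Eq a, Read a, Show a) => NamedEntity a where
  serializeNETag :: a -> String
  serializeNETag = show

  parseNETag :: String -> Either String a
  parseNETag s =
    maybe (Left ("could not parse NE tag: " ++ s)) Right (readMaybe s)

-- Hypothetical entity set.
data ToyNE = PER | ORG | LOC | MISC
  deriving (Eq, Ord, Read, Show)

-- An empty instance body is enough; every method uses its default.
instance NamedEntity ToyNE

-- serializeNETag ORG == "ORG"
-- parseNETag "PER"   == Right PER
```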

Instances