chatter-0.8.0.1: A library of simple NLP algorithms.

Safe Haskell: None

NLP.Similarity.VectorSim

Documentation

newtype TermVector

An efficient (ish) representation for documents in the bag of words sense.

Constructors

TermVector (DefaultMap Text Double) 

data Document

Constructors

Document 

Fields

docTermFrequencies :: HashMap Text Int
 
docTokens :: [Text]
 

mkDocument :: [Text] -> Document

Make a document from a list of tokens.
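As an illustration of what a Document holds, the following is a minimal sketch of building term counts from a token list and looking up a raw frequency. It is not the library's implementation: SimpleDoc, mkSimpleDoc and rawTf are hypothetical names, and it uses Data.Map over String instead of HashMap over Text so that it depends only on base and containers.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical stand-in for Document: raw term counts plus the token list.
data SimpleDoc = SimpleDoc
  { simpleCounts :: Map.Map String Int  -- term -> number of occurrences
  , simpleTokens :: [String]            -- the tokens, in their original order
  } deriving (Show, Eq)

-- Build a SimpleDoc by counting one for every occurrence of each token.
mkSimpleDoc :: [String] -> SimpleDoc
mkSimpleDoc toks =
  SimpleDoc (Map.fromListWith (+) [(t, 1) | t <- toks]) toks

-- Raw frequency of a term; a term that never occurs counts as zero.
rawTf :: String -> SimpleDoc -> Int
rawTf term = Map.findWithDefault 0 term . simpleCounts
```

Counting once at construction time means later frequency lookups do not need to rescan the token list.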

fromTV :: TermVector -> DefaultMap Text Double

Access the underlying DefaultMap used to store term vector details.

mkVector :: Corpus -> Document -> TermVector

Generate a TermVector from a tokenized document.

sim :: Corpus -> Text -> Text -> Double

Invokes similarity on full strings, using words for tokenization and applying no stemming. The return value will be in the range [0, 1].

There *must* be at least one document in the corpus.

similarity :: Corpus -> [Text] -> [Text] -> Double

Determine how similar two documents are.

This function assumes that each document has been tokenized and (if desired) stemmed/case-normalized.

This is a wrapper around tvSim, which is a *much* more efficient implementation. If you need to run similarity against any single document more than once, then you should create TermVectors for each of your documents and use tvSim instead of similarity.

The return value will be in the range [0, 1].

There *must* be at least one document in the corpus.

tvSim :: TermVector -> TermVector -> Double

Determine how similar two documents are.

Calculates the similarity between two documents, represented as TermVectors, returning a double in the range [0, 1] where 1 represents most similar.
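The description above does not state the formula. A common choice for comparing term vectors is cosine similarity, and the sketch below assumes that choice; it is an illustration, not the library's code. Vec, dot, norm and cosine are hypothetical names, and terms absent from a vector are treated as having weight zero.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical sparse vector: term -> weight, absent terms weigh zero.
type Vec = Map.Map String Double

-- Dot product: only terms present in both vectors contribute.
dot :: Vec -> Vec -> Double
dot a b = sum (Map.elems (Map.intersectionWith (*) a b))

-- Euclidean magnitude of a vector.
norm :: Vec -> Double
norm v = sqrt (dot v v)

-- Cosine similarity; defined as 0 when either vector has zero magnitude.
cosine :: Vec -> Vec -> Double
cosine a b
  | denom == 0 = 0
  | otherwise  = dot a b / denom
  where
    denom = norm a * norm b
```

With non-negative weights the result stays within [0, 1]: vectors that share no terms score 0, and a vector compared with itself scores 1 up to floating-point rounding.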

tf :: Text -> Document -> Int

Return the raw frequency of a term in a body of text.

The first argument is the term to find; the second is a tokenized document. This function does not do any stemming or additional text modification.

idf :: Text -> Corpus -> Double

Calculate the inverse document frequency.

The IDF is, roughly speaking, an inverse measure of how popular a term is: the more documents in the corpus that contain the term, the lower its IDF.

tf_idf :: Text -> Document -> Corpus -> Double

Calculate the tf*idf measure for a term given a document and a corpus.
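To make the tf*idf product concrete, here is a small sketch using one common smoothed formulation, idf = log ((1 + N) / (1 + df)), where N is the number of documents and df is the number of documents containing the term. The smoothing is an assumption and the library's exact formula may differ. Tokens, Docs, termFreq, invDocFreq and tfIdf are hypothetical names, and plain lists stand in for Document and Corpus.

```haskell
-- Hypothetical stand-ins: a tokenized document and a corpus of documents.
type Tokens = [String]
type Docs   = [Tokens]

-- Raw term frequency: how many times the term occurs in the document.
termFreq :: String -> Tokens -> Int
termFreq term = length . filter (== term)

-- Smoothed inverse document frequency (an assumed formulation):
-- log ((1 + number of documents) / (1 + documents containing the term)).
invDocFreq :: String -> Docs -> Double
invDocFreq term corpus = log (fromIntegral (1 + n) / fromIntegral (1 + df))
  where
    n  = length corpus
    df = length (filter (term `elem`) corpus)

-- tf*idf: the raw frequency scaled by the inverse document frequency.
tfIdf :: String -> Tokens -> Docs -> Double
tfIdf term doc corpus =
  fromIntegral (termFreq term doc) * invDocFreq term corpus
```

Under this formulation a term that appears in every document gets an IDF of zero, so it contributes nothing to the tf*idf weight however often it occurs.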

addVectors :: TermVector -> TermVector -> TermVector

Add two term vectors. When a term is added, its value in each vector is used (or that vector's default value is used if the term is absent from the vector). The new term vector resulting from the addition always uses a default value of zero.

zeroVector :: TermVector

The zero term vector, which is the identity for addVectors (i.e. addVectors v zeroVector = v).

negate :: TermVector -> TermVector

Negate a term vector.

sum :: [TermVector] -> TermVector

Add a list of term vectors.
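The vector operations above can be sketched over a plain sparse map. This is an illustration under simplifying assumptions, not the library's code: SparseVec, zeroVec, addVec, negateVec and sumVecs are hypothetical names, and every absent term is treated as zero, whereas the real TermVector wraps a DefaultMap that carries its own default value.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical sparse vector: term -> weight, absent terms weigh zero.
type SparseVec = Map.Map String Double

-- The zero vector has no explicit entries.
zeroVec :: SparseVec
zeroVec = Map.empty

-- Add term by term; a term present in only one vector keeps its weight.
addVec :: SparseVec -> SparseVec -> SparseVec
addVec = Map.unionWith (+)

-- Flip the sign of every weight.
negateVec :: SparseVec -> SparseVec
negateVec = Map.map negate

-- Sum a list of vectors by folding addition over the zero vector.
sumVecs :: [SparseVec] -> SparseVec
sumVecs = foldr addVec zeroVec
```

Defining the sum as a fold starting from the zero vector means an empty list yields the zero vector, which matches the identity law stated for the zero vector.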

magnitude :: TermVector -> Double

Calculate the magnitude of a vector.

dotProd :: TermVector -> TermVector -> Double

Find the dot product of two vectors.

keys :: TermVector -> [Text]