chatter-0.5.0.1: A library of simple NLP algorithms.

Safe Haskell: None
Language: Haskell2010

NLP.Similarity.VectorSim

Documentation

type TermVector = DefaultMap Text Double

An efficient (ish) representation for documents in the "bag of words" sense.
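As a rough illustration of the "bag of words" idea, the sketch below counts tokens into a plain Data.Map. It is not the library's TermVector: it assumes String tokens and Data.Map instead of DefaultMap Text Double, and the names BagOfWords, bagOfWords and weight are hypothetical.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical stand-in for a term vector: token -> weight.
-- The real TermVector is a DefaultMap Text Double; this sketch uses
-- Data.Map and String so that it only depends on containers.
type BagOfWords = Map.Map String Double

-- Count each token. Word order is discarded; only counts are kept.
bagOfWords :: [String] -> BagOfWords
bagOfWords = Map.fromListWith (+) . map (\t -> (t, 1))

-- Look up a weight, treating absent terms as 0.
weight :: String -> BagOfWords -> Double
weight = Map.findWithDefault 0
```

Treating missing terms as 0 keeps lookups total, so two documents with different vocabularies can still be compared term by term.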

mkVector :: Corpus -> [Text] -> TermVector

Generate a TermVector from a tokenized document.

sim :: Corpus -> Text -> Text -> Double

Invokes similarity on full strings, tokenizing them with words and applying no stemming.

There *must* be at least one document in the corpus.

similarity :: Corpus -> [Text] -> [Text] -> Double

Determine how similar two documents are.

This function assumes that each document has been tokenized and (if desired) stemmed/case-normalized.

This is a wrapper around tvSim, which is a *much* more efficient implementation. If you need to run similarity against any single document more than once, then you should create TermVectors for each of your documents and use tvSim instead of similarity.

There *must* be at least one document in the corpus.

tvSim :: TermVector -> TermVector -> Double

Determine how similar two documents are, where each document is represented as a TermVector.
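The text above does not say which measure is used. A common choice for comparing term vectors is cosine similarity: the dot product divided by the product of the two magnitudes. The following is a minimal sketch under that assumption, using Data.Map and String instead of the library's types. The names Vec, dot, norm and cosine are hypothetical.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical sparse vector: term -> weight.
type Vec = Map.Map String Double

-- Dot product: sum of products over the terms present in both vectors.
dot :: Vec -> Vec -> Double
dot a b = sum (Map.elems (Map.intersectionWith (*) a b))

-- Euclidean length of a vector.
norm :: Vec -> Double
norm = sqrt . sum . map (^ (2 :: Int)) . Map.elems

-- Cosine similarity. Returning 0 for a zero-length vector is a choice
-- made in this sketch to avoid dividing by zero.
cosine :: Vec -> Vec -> Double
cosine a b
  | denom == 0 = 0
  | otherwise  = dot a b / denom
  where
    denom = norm a * norm b
```

With non-negative weights the result lies between 0 (no shared terms) and 1 (same direction), which makes scores comparable across documents of different lengths.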

tf :: Eq a => a -> [a] -> Int

Return the raw frequency of a term in a body of text.

The first argument is the term to find; the second is a tokenized document. This function does not do any stemming or additional text modification.
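A raw frequency with this signature can be computed by counting exact matches. The sketch below is one straightforward way to do so and is not taken from the library source. The name rawFrequency is hypothetical.

```haskell
-- Count how many tokens in the document equal the term exactly.
-- No stemming or case folding is applied, so "The" and "the" differ.
rawFrequency :: Eq a => a -> [a] -> Int
rawFrequency term = length . filter (== term)
```

Because matching is exact, any normalization (lowercasing, stemming) has to be applied to both the term and the document before the count is taken.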

idf :: Text -> Corpus -> Double

Calculate the inverse document frequency.

The IDF is, roughly speaking, an inverse measure of how popular a term is: terms that appear in fewer documents of the corpus receive a higher value.
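One textbook form of IDF is the logarithm of the number of documents divided by the number of documents containing the term, often with add-one smoothing. The sketch below uses that form. The library's exact formula may differ, the corpus is simplified to a list of tokenized documents, and the name inverseDocFreq is hypothetical.

```haskell
-- Smoothed inverse document frequency over a corpus represented as a
-- list of tokenized documents (a simplification of the Corpus type).
-- Adding 1 to both counts avoids division by zero for unseen terms.
inverseDocFreq :: String -> [[String]] -> Double
inverseDocFreq term docs =
  log (fromIntegral (1 + length docs) / fromIntegral (1 + containing))
  where
    -- Number of documents in which the term occurs at least once.
    containing = length (filter (term `elem`) docs)
```

A term that occurs in every document scores 0 under this form, while a term that occurs in only one document scores highest.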

tf_idf :: Text -> [Text] -> Corpus -> Double

Calculate the tf*idf measure for a term given a document and a corpus.
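To show how the product behaves, the sketch below multiplies a raw term count by a smoothed IDF and uses the result to rank the terms of one document. The smoothing, the list-of-documents corpus, and the names tfIdf and rankTerms are assumptions made for illustration, not the library's implementation.

```haskell
import Data.List (nub, sortOn)
import Data.Ord (Down (..))

-- tf * idf for one term, given a tokenized document and a corpus
-- represented as a list of tokenized documents.
tfIdf :: String -> [String] -> [[String]] -> Double
tfIdf term doc docs = fromIntegral tfCount * idfWeight
  where
    tfCount    = length (filter (== term) doc)
    containing = length (filter (term `elem`) docs)
    idfWeight  = log (fromIntegral (1 + length docs) / fromIntegral (1 + containing))

-- Distinct terms of a document, most distinctive first.
rankTerms :: [String] -> [[String]] -> [String]
rankTerms doc docs = sortOn (\t -> Down (tfIdf t doc docs)) (nub doc)
```

Terms that are frequent in the document but rare in the corpus rise to the top, while terms found in every document sink to the bottom.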

magnitude :: TermVector -> Double

Calculate the magnitude of a vector.

dotProd :: TermVector -> TermVector -> Double

Find the dot product of two vectors.