| Safe Haskell | None |
|---|
NLP.Similarity.VectorSim
- type TermVector = DefaultMap Text Double
- mkVector :: Corpus -> [Text] -> TermVector
- sim :: Corpus -> Text -> Text -> Double
- similarity :: Corpus -> [Text] -> [Text] -> Double
- tvSim :: TermVector -> TermVector -> Double
- tf :: Eq a => a -> [a] -> Int
- idf :: Text -> Corpus -> Double
- tf_idf :: Text -> [Text] -> Corpus -> Double
- cosVec :: TermVector -> TermVector -> Double
- magnitude :: TermVector -> Double
- dotProd :: TermVector -> TermVector -> Double
Documentation
type TermVector = DefaultMap Text Double
An efficient (ish) representation for documents in the bag of words sense.
mkVector :: Corpus -> [Text] -> TermVector
Generate a TermVector from a tokenized document.
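As a rough illustration of the bag-of-words idea, the sketch below counts tokens into a map and reads absent terms as 0. `CountVector`, `countVector`, `weight` and `example` are hypothetical names, `Data.Map` stands in for the library's `DefaultMap`, and raw counts stand in for whatever weighting mkVector actually derives from the corpus; none of this is taken from the library's source.

```haskell
import qualified Data.Map.Strict as Map
import           Data.Text (Text)
import qualified Data.Text as T

-- Stand-in for TermVector (assumption): a plain Map read with a default of 0.
type CountVector = Map.Map Text Double

-- | Count every token of an already tokenized document.
countVector :: [Text] -> CountVector
countVector tokens = Map.fromListWith (+) [(t, 1) | t <- tokens]

-- | Weight of a term; terms missing from the vector weigh 0,
-- which mimics the behaviour of a default map.
weight :: Text -> CountVector -> Double
weight = Map.findWithDefault 0

-- | A small sample vector: "the" occurs twice, every other word once.
example :: CountVector
example = countVector (T.words (T.pack "the cat saw the dog"))
```

Storing only the terms that occur keeps the vector sparse, and the default of 0 means two vectors built from different documents can be compared without first agreeing on a shared vocabulary.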
sim :: Corpus -> Text -> Text -> Double
Invokes similarity on full strings, tokenizing them with words and
applying no stemming.
There *must* be at least one document in the corpus.
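To make the tokenization caveat concrete, here is a minimal sketch that assumes "words" refers to `Data.Text.words`. `tokenize` and `normalize` are hypothetical helpers, and the lower-casing step is something a caller might choose to add, not something sim is documented to do.

```haskell
import           Data.Text (Text)
import qualified Data.Text as T

-- | Split on whitespace only. Case and punctuation are left untouched,
-- so "Dogs", "dogs" and "dogs." are three different tokens.
tokenize :: Text -> [Text]
tokenize = T.words

-- | Optional pre-processing (assumption): fold case before tokenizing.
normalize :: Text -> Text
normalize = T.toLower
```

Callers who need stemming or case folding would pre-process and tokenize the text themselves and then pass the token lists to similarity.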
similarity :: Corpus -> [Text] -> [Text] -> Double
Determine how similar two documents are.
This function assumes that each document has been tokenized and (if desired) stemmed/case-normalized.
This is a wrapper around tvSim, which is a *much* more efficient
implementation. If you need to run similarity against any single
document more than once, then you should create TermVectors for
each of your documents and use tvSim instead of similarity.
There *must* be at least one document in the corpus.
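The reuse advice above can be sketched generically. `scoreAll` is a hypothetical helper, not part of this module; it takes the vector builder and the comparison function as arguments, so with this library one would presumably pass `mkVector corpus` and `tvSim`.

```haskell
-- | Score one query document against many candidates. The query vector
-- is built once and shared by every comparison, instead of being
-- rebuilt for each pair as repeated calls to a wrapper would do.
scoreAll
  :: (doc -> vec)            -- ^ vector builder, e.g. mkVector corpus
  -> (vec -> vec -> Double)  -- ^ vector comparison, e.g. tvSim
  -> doc                     -- ^ query document
  -> [doc]                   -- ^ candidate documents
  -> [Double]
scoreAll mk cmp query candidates =
  let qVec = mk query        -- built exactly once
  in  map (cmp qVec . mk) candidates
```

If the candidates are themselves compared against many queries, their vectors can be cached in the same way, for example by mapping the builder over the collection up front.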
tvSim :: TermVector -> TermVector -> Double
Determine how similar two documents are.
Calculates the similarity between two documents, represented as
TermVectors.
tf :: Eq a => a -> [a] -> Int
Return the raw frequency of a term in a body of text.
The first argument is the term to find; the second is a tokenized document. This function does not do any stemming or additional text modification.
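A raw term frequency with this signature can be written in one line. `termFreq` is a hypothetical name for this sketch, which is not copied from the library.

```haskell
-- | Raw frequency: the number of tokens equal to the term.
-- Comparison is exact, so differences in case or suffix count as misses.
termFreq :: Eq a => a -> [a] -> Int
termFreq term = length . filter (== term)
```

For the token list `words "the cat saw the dog"`, the term `"the"` has frequency 2, while `"Dog"` has frequency 0 because no case normalization takes place.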
idf :: Text -> Corpus -> Double
Calculate the inverse document frequency.
The IDF is, roughly speaking, an inverse measure of how popular a term is: the more documents of the corpus contain the term, the lower its IDF.
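The exact formula used by idf is not stated here. The sketch below uses one common smoothed variant, log((1 + N) / (1 + df)), where N is the number of documents and df is the number of documents that contain the term. Both the smoothing and the list-of-token-lists stand-in for `Corpus` are assumptions, and `invDocFreq` is a hypothetical name.

```haskell
-- | Smoothed inverse document frequency (one common variant; the
-- library's own formula may differ).
invDocFreq :: Eq a => a -> [[a]] -> Double
invDocFreq term docs = log (fromIntegral (1 + n) / fromIntegral (1 + df))
  where
    n  = length docs                          -- documents in the corpus
    df = length (filter (term `elem`) docs)   -- documents containing the term
```

With this variant a term that appears in every document scores 0, and rarer terms score progressively higher.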
tf_idf :: Text -> [Text] -> Corpus -> Double
Calculate the tf*idf measure for a term given a document and a corpus.
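Read literally, the tf*idf measure is the product of the two quantities documented above. Writing t for the term, d for the tokenized document and C for the corpus (symbols chosen here for illustration), that reading is:

```latex
\mathrm{tf\_idf}(t, d, C) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, C)
```

A term therefore scores highly when it is frequent in the document but contained in few documents of the corpus.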
cosVec :: TermVector -> TermVector -> Double
magnitude :: TermVector -> Double
Calculate the magnitude of a vector.
dotProd :: TermVector -> TermVector -> Double
Find the dot product of two vectors.
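The vector operations above can be sketched over sparse maps as follows. `Vec`, `dot`, `norm` and `cosine` are hypothetical names, `Data.Map` stands in for `DefaultMap`, and reading cosVec as cosine similarity is an assumption based on its name, since it carries no description here.

```haskell
import qualified Data.Map.Strict as Map

-- Stand-in for TermVector (assumption): absent terms are treated as 0.
type Vec = Map.Map String Double

-- | Dot product: only terms present in both vectors contribute.
dot :: Vec -> Vec -> Double
dot a b = sum (Map.elems (Map.intersectionWith (*) a b))

-- | Euclidean magnitude of a vector.
norm :: Vec -> Double
norm v = sqrt (sum [x * x | x <- Map.elems v])

-- | Cosine of the angle between two vectors, defined as 0 when either
-- vector has zero magnitude so that no division by zero occurs.
cosine :: Vec -> Vec -> Double
cosine a b
  | denom == 0 = 0
  | otherwise  = dot a b / denom
  where
    denom = norm a * norm b
```

Because term weights of this kind are non-negative, the cosine falls between 0 (no shared terms) and 1 (same direction), which makes it convenient as a similarity score that ignores document length.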