Safe Haskell | None |
---|---|
Language | Haskell2010 |
- type TermVector = DefaultMap Text Double
- mkVector :: Corpus -> [Text] -> TermVector
- sim :: Corpus -> Text -> Text -> Double
- similarity :: Corpus -> [Text] -> [Text] -> Double
- tvSim :: TermVector -> TermVector -> Double
- tf :: Eq a => a -> [a] -> Int
- idf :: Text -> Corpus -> Double
- tf_idf :: Text -> [Text] -> Corpus -> Double
- cosVec :: TermVector -> TermVector -> Double
- magnitude :: TermVector -> Double
- dotProd :: TermVector -> TermVector -> Double
Documentation
type TermVector = DefaultMap Text Double
An efficient (ish) representation for documents in the "bag of words" sense.
mkVector :: Corpus -> [Text] -> TermVector
Generate a TermVector from a tokenized document.
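To illustrate the "bag of words" idea, here is a minimal sketch that turns a tokenized document into a map from term to count. The names CountVector and bagOfWords are hypothetical, and String and Data.Map stand in for Text and DefaultMap. The real mkVector also takes a Corpus, presumably to weight terms, which this sketch does not attempt.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical stand-in for TermVector: raw counts only, no corpus weighting.
type CountVector = Map.Map String Double

-- Count each token; word order is discarded, which is what "bag of words" means.
bagOfWords :: [String] -> CountVector
bagOfWords tokens = Map.fromListWith (+) [(t, 1) | t <- tokens]
```

For example, bagOfWords (words "to be or not to be") maps "to" and "be" to 2, and "or" and "not" to 1.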
sim :: Corpus -> Text -> Text -> Double
Invokes similarity on full strings, using words for tokenization, and no stemming.
There *must* be at least one document in the corpus.
similarity :: Corpus -> [Text] -> [Text] -> Double
Determine how similar two documents are.
This function assumes that each document has been tokenized and (if desired) stemmed/case-normalized.
This is a wrapper around tvSim, which is a *much* more efficient implementation. If you need to run similarity against any single document more than once, then you should create TermVectors for each of your documents and use tvSim instead of similarity.
There *must* be at least one document in the corpus.
tvSim :: TermVector -> TermVector -> Double
Determine how similar two documents are.
Calculates the similarity between two documents, represented as TermVectors.
tf :: Eq a => a -> [a] -> Int
Return the raw frequency of a term in a body of text.
The first argument is the term to find; the second is a tokenized document. This function does not do any stemming or additional text modification.
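A raw term count with this signature can be sketched as below. The name termFreq is hypothetical, and this is not necessarily how the library implements tf.

```haskell
-- Count exact occurrences of a term in a tokenized document.
-- No stemming or case folding: "The" and "the" are different terms.
termFreq :: Eq a => a -> [a] -> Int
termFreq term = length . filter (== term)
```

With this sketch, termFreq "the" (words "the cat saw the dog") is 2, while termFreq "the" (words "The cat saw the dog") is 1, because no case normalization takes place.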
idf :: Text -> Corpus -> Double
Calculate the inverse document frequency.
The IDF is, roughly speaking, a measure of how rare a term is across the corpus: the more documents a term appears in, the lower its IDF.
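As a rough illustration, the sketch below computes one common smoothed variant, log ((1 + N) / (1 + df)), where N is the number of documents and df is the number of documents containing the term. The formula this library actually uses is not stated here, so the choice of variant is an assumption. The names Doc and invDocFreq are hypothetical, and a plain list of token lists stands in for Corpus.

```haskell
-- Hypothetical stand-in for a corpus: a list of tokenized documents.
type Doc = [String]

-- Smoothed inverse document frequency (an assumed variant, not the library's).
-- A term found in every document scores 0; rarer terms score higher.
invDocFreq :: String -> [Doc] -> Double
invDocFreq term docs = log ((1 + n) / (1 + df))
  where
    n  = fromIntegral (length docs)                       -- total documents
    df = fromIntegral (length (filter (term `elem`) docs)) -- documents containing the term
```

Given the three documents ["a","b"], ["a","c"] and ["a","d"], the term "a" scores 0 because it appears everywhere, while "b" gets a positive score.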
tf_idf :: Text -> [Text] -> Corpus -> Double
Calculate the tf*idf measure for a term given a document and a corpus.
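The tf*idf product can be sketched by multiplying a raw term count by an inverse document frequency. Everything below is a self-contained illustration under assumptions: the names are hypothetical, a list of token lists stands in for Corpus, and the smoothed IDF variant is one common choice rather than the library's confirmed formula.

```haskell
-- Hypothetical stand-in for a corpus: a list of tokenized documents.
type Doc = [String]

-- Raw count of a term in one document.
termFreq :: Eq a => a -> [a] -> Int
termFreq term = length . filter (== term)

-- Smoothed inverse document frequency (assumed variant).
invDocFreq :: String -> [Doc] -> Double
invDocFreq term docs = log ((1 + n) / (1 + df))
  where
    n  = fromIntegral (length docs)
    df = fromIntegral (length (filter (term `elem`) docs))

-- tf*idf: high for terms frequent in this document but rare in the corpus.
tfIdf :: String -> Doc -> [Doc] -> Double
tfIdf term doc docs = fromIntegral (termFreq term doc) * invDocFreq term docs
```

The argument order mirrors tf_idf: term, then document, then corpus. A term that occurs in every document scores 0 no matter how often it appears in the document at hand.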
cosVec :: TermVector -> TermVector -> Double
magnitude :: TermVector -> Double
Calculate the magnitude of a vector.
dotProd :: TermVector -> TermVector -> Double
Find the dot product of two vectors.
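The three vector operations fit together in the standard cosine formula: the dot product divided by the product of the magnitudes. The sketch below shows this on sparse vectors. It assumes Data.Map with String keys as a stand-in for TermVector, treats missing terms as 0, and returns 0 when either vector has zero magnitude; the names Vec, dotProduct, norm and cosine are hypothetical, and the library's own handling of edge cases may differ.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical stand-in for TermVector: terms absent from the map count as 0.
type Vec = Map.Map String Double

-- Only terms present in both vectors contribute to the dot product.
dotProduct :: Vec -> Vec -> Double
dotProduct a b = sum (Map.elems (Map.intersectionWith (*) a b))

-- Euclidean length of the vector.
norm :: Vec -> Double
norm v = sqrt (sum [x * x | x <- Map.elems v])

-- Cosine of the angle between two vectors; 0 if either has zero length.
cosine :: Vec -> Vec -> Double
cosine a b
  | denom == 0 = 0
  | otherwise  = dotProduct a b / denom
  where
    denom = norm a * norm b
```

Two vectors with no terms in common have a cosine of 0, and a non-zero vector compared with itself has a cosine of 1.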