Safe Haskell | None |
---|---|
Language | Haskell98 |
Scoring functions commonly used for the evaluation of NLP systems. Most functions in this module work on sequences that are instances of Foldable, but some take a precomputed table of Counts, which gives a speedup if you want to compute multiple scores on the same data. For example, to compute the Mutual Information, the Adjusted Rand Index, and the Variation of Information on the same pair of clusterings:
>>> let cs = counts "abcabc" "abaaba"
>>> mapM_ (print . ($ cs)) [mi, ari, vi]
0.9182958340544894
0.4444444444444445
0.6666666666666663
- errorRate :: (Eq a, Fractional c, Traversable t, Foldable s) => t a -> s a -> c
- accuracy :: (Eq a, Fractional c, Traversable t, Foldable s) => t a -> s a -> c
- recipRank :: (Eq a, Fractional b, Foldable t) => a -> t a -> b
- avgPrecision :: (Fractional n, Ord a, Foldable t) => Set a -> t a -> n
- ari :: (Ord a, Ord b) => Counts a b -> Double
- mi :: (Ord a, Ord b) => Counts a b -> Double
- vi :: (Ord a, Ord b) => Counts a b -> Double
- logLikelihoodRatio :: (Ord a, Ord b) => Counts a b -> a -> b -> Double
- kullbackLeibler :: (Eq a, Floating a, Foldable f, Traversable t) => t a -> f a -> a
- jensenShannon :: (Eq a, Floating a, Traversable t, Traversable u) => t a -> u a -> a
- type Count = Double
- data Counts a b
- counts :: (Ord a, Ord b, Traversable t, Foldable s) => t a -> s b -> Counts a b
- sum :: (Foldable t, Num a) => t a -> a
- mean :: (Foldable t, Fractional n, Real a) => t a -> n
- jaccard :: (Fractional n, Ord a) => Set a -> Set a -> n
- entropy :: (Floating c, Foldable t) => t c -> c
- histogram :: (Num a, Ord k, Foldable t) => t k -> Map k a
- countJoint :: (Ord a, Ord b) => a -> b -> Counts a b -> Count
- countFst :: Ord k => k -> Counts k b -> Count
- countSnd :: Ord k => k -> Counts a k -> Count
- countTotal :: Counts a k -> Count
- fstElems :: Counts k b -> [k]
- sndElems :: Counts a k -> [k]
Scores for classification and ranking
errorRate :: (Eq a, Fractional c, Traversable t, Foldable s) => t a -> s a -> c Source
Error rate: the proportion of elements in the first sequence NOT equal to the elements at corresponding positions in the second sequence. The sequences should be of equal length.
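As a rough illustration, a list-specialized version could look like the following; `errorRate'` is a hypothetical name and a simplification, not the library's more general Traversable/Foldable implementation:

```haskell
-- Hypothetical list-based sketch of errorRate: count positions where
-- the two sequences disagree, divided by the length of the first.
errorRate' :: Eq a => [a] -> [a] -> Double
errorRate' xs ys =
  fromIntegral (length (filter not (zipWith (==) xs ys)))
    / fromIntegral (length xs)

main :: IO ()
main = print (errorRate' "abcabc" "abaaba")  -- 2 mismatches out of 6
```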
accuracy :: (Eq a, Fractional c, Traversable t, Foldable s) => t a -> s a -> c Source
Accuracy: the proportion of elements in the first sequence equal to the elements at corresponding positions in the second sequence. The sequences should be of equal length.
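A matching list-based sketch (again with a hypothetical primed name); accuracy is simply the complement of the error rate:

```haskell
-- Hypothetical list-based sketch of accuracy: the fraction of
-- positions where the two sequences agree.
accuracy' :: Eq a => [a] -> [a] -> Double
accuracy' xs ys =
  fromIntegral (length (filter id (zipWith (==) xs ys)))
    / fromIntegral (length xs)

main :: IO ()
main = print (accuracy' "abcabc" "abaaba")  -- 4 matches out of 6
```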
recipRank :: (Eq a, Fractional b, Foldable t) => a -> t a -> b Source
Reciprocal rank: the reciprocal of the rank at which the first argument occurs in the sequence given as the second argument.
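A possible list-based reading of this definition, assuming (as is conventional for reciprocal rank) that an item which never occurs scores 0:

```haskell
-- Hypothetical sketch of recipRank: 1 / (1-based rank of the first
-- occurrence), conventionally 0 when the item does not occur at all.
recipRank' :: Eq a => a -> [a] -> Double
recipRank' x xs =
  case [r | (r, y) <- zip [1 ..] xs, y == x] of
    []      -> 0
    (r : _) -> 1 / fromIntegral r

main :: IO ()
main = print (recipRank' 'c' "abcd")  -- 'c' first occurs at rank 3
```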
avgPrecision :: (Fractional n, Ord a, Foldable t) => Set a -> t a -> n Source
Average precision. http://en.wikipedia.org/wiki/Information_retrieval#Average_precision
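Following the linked definition, average precision is the mean, over the relevant items, of the precision at each rank where a relevant item appears. A hypothetical list-based sketch:

```haskell
import qualified Data.Set as S

-- Hypothetical sketch of average precision: walk the ranking,
-- accumulating precision-at-k at every rank k holding a relevant item,
-- then divide by the number of relevant items.
avgPrecision' :: Ord a => S.Set a -> [a] -> Double
avgPrecision' rel = go 1 0 0
  where
    go _ _ acc [] = acc / fromIntegral (S.size rel)
    go k hits acc (x : rest)
      | x `S.member` rel =
          go (k + 1) (hits + 1)
             (acc + fromIntegral (hits + 1) / fromIntegral k) rest
      | otherwise = go (k + 1) hits acc rest

main :: IO ()
main = print (avgPrecision' (S.fromList "ab") "acb")
```

With relevant set {a, b} and ranking "acb", the relevant items appear at ranks 1 and 3 with precisions 1/1 and 2/3, giving (1 + 2/3) / 2 = 5/6.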
Scores for clustering
ari :: (Ord a, Ord b) => Counts a b -> Double Source
Adjusted Rand Index: http://en.wikipedia.org/wiki/Rand_index
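To make the formula concrete, here is a hypothetical sketch that computes the ARI directly from two parallel label sequences rather than from a precomputed Counts table; it can be checked against the 0.4444… value in the module example above:

```haskell
import qualified Data.Map as M

-- Hypothetical sketch of the Adjusted Rand Index from two parallel
-- label sequences: ARI = (Index - Expected) / (Max - Expected), where
-- Index sums C(n_ij, 2) over the contingency table and Expected/Max
-- are derived from the row and column marginals.
ari' :: (Ord a, Ord b) => [a] -> [b] -> Double
ari' xs ys = (index - expected) / (maxIndex - expected)
  where
    count ks  = M.elems (M.fromListWith (+) [(k, 1 :: Double) | k <- ks])
    choose2 k = k * (k - 1) / 2
    index     = sum (map choose2 (count (zip xs ys)))
    sumRows   = sum (map choose2 (count xs))
    sumCols   = sum (map choose2 (count ys))
    n         = fromIntegral (length xs)
    expected  = sumRows * sumCols / choose2 n
    maxIndex  = (sumRows + sumCols) / 2

main :: IO ()
main = print (ari' "abcabc" "abaaba")
```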
mi :: (Ord a, Ord b) => Counts a b -> Double Source
Mutual information: MI(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X). Also known as information gain.
vi :: (Ord a, Ord b) => Counts a b -> Double Source
Variation of information: VI(X,Y) = H(X) + H(Y) - 2 MI(X,Y)
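Both quantities can be expressed through entropies of the marginal and joint label distributions. A hypothetical sketch over two parallel label sequences (the primed names are not the library's API), using MI(X,Y) = H(X) + H(Y) - H(X,Y) and the VI formula above; it reproduces the 0.9182… and 0.6666… values from the module example:

```haskell
import qualified Data.Map as M

-- Entropy of the empirical distribution of a sequence of labels.
entropyOf :: Ord k => [k] -> Double
entropyOf ks = negate (sum [p * logBase 2 p | c <- cs, let p = c / total])
  where
    cs    = M.elems (M.fromListWith (+) [(k, 1 :: Double) | k <- ks])
    total = sum cs

-- Hypothetical sketches: MI(X,Y) = H(X) + H(Y) - H(X,Y),
-- VI(X,Y) = H(X) + H(Y) - 2 MI(X,Y).
mi', vi' :: (Ord a, Ord b) => [a] -> [b] -> Double
mi' xs ys = entropyOf xs + entropyOf ys - entropyOf (zip xs ys)
vi' xs ys = entropyOf xs + entropyOf ys - 2 * mi' xs ys

main :: IO ()
main = mapM_ print [mi' "abcabc" "abaaba", vi' "abcabc" "abaaba"]
```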
Strength of association
logLikelihoodRatio :: (Ord a, Ord b) => Counts a b -> a -> b -> Double Source
Log-likelihood ratio for two binomial distributions. H_0: P(x|y) = p = P(x|~y); H_1: P(x|y) = p_1 ≠ p_2 = P(x|~y).
Comparing probability distributions
kullbackLeibler :: (Eq a, Floating a, Foldable f, Traversable t) => t a -> f a -> a Source
Kullback-Leibler divergence: KL(X,Y) = SUM_i P(X=i) log_2(P(X=i)/P(Y=i)). The distributions can be unnormalized.
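A direct transcription of the formula for parallel lists, with the usual convention that terms where P(X=i) = 0 contribute nothing; the normalization step is what lets the inputs be unnormalized:

```haskell
-- Hypothetical sketch of KL divergence over two unnormalized
-- distributions given as parallel lists of weights.
kl :: [Double] -> [Double] -> Double
kl xs ys =
  sum [ px * logBase 2 (px / py)
      | (x, y) <- zip xs ys
      , x > 0
      , let px = x / sum xs
      , let py = y / sum ys ]

main :: IO ()
main = print (kl [1, 0] [1, 1])  -- KL of (1,0) from (1/2,1/2)
```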
jensenShannon :: (Eq a, Floating a, Traversable t, Traversable u) => t a -> u a -> a Source
Jensen-Shannon divergence: JS(X,Y) = 1/2 KL(X,(X+Y)/2) + 1/2 KL(Y,(X+Y)/2). The distributions can be unnormalized.
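A sketch of this formula built on a local KL helper (an assumption, not the library's internals): normalize both inputs, form the midpoint distribution, and average the two KL terms:

```haskell
-- Hypothetical sketch of Jensen-Shannon divergence:
-- JS(X,Y) = 1/2 KL(X,M) + 1/2 KL(Y,M) with M = (X+Y)/2.
js :: [Double] -> [Double] -> Double
js xs ys = 0.5 * kl ps ms + 0.5 * kl qs ms
  where
    normalize vs = map (/ sum vs) vs
    ps = normalize xs
    qs = normalize ys
    ms = zipWith (\p q -> (p + q) / 2) ps qs

-- KL over already-normalized distributions; zero-probability terms
-- in the first argument contribute nothing.
kl :: [Double] -> [Double] -> Double
kl ps qs = sum [p * logBase 2 (p / q) | (p, q) <- zip ps qs, p > 0]

main :: IO ()
main = print (js [1, 0] [0, 1])  -- disjoint support: 1 bit
```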
Auxiliary types and functions
counts :: (Ord a, Ord b, Traversable t, Foldable s) => t a -> s b -> Counts a b Source
Creates a count table Counts from a pair of sequences.
mean :: (Foldable t, Fractional n, Real a) => t a -> n Source
The mean of a sequence of numbers.
jaccard :: (Fractional n, Ord a) => Set a -> Set a -> n Source
Jaccard coefficient: J(A,B) = |A intersection B| / |A union B|
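This one is a direct translation of the set formula; a sketch under the assumption that the sets are Data.Set values, as the signature suggests:

```haskell
import qualified Data.Set as S

-- Hypothetical sketch of the Jaccard coefficient:
-- size of the intersection over size of the union.
jaccard' :: Ord a => S.Set a -> S.Set a -> Double
jaccard' a b =
  fromIntegral (S.size (S.intersection a b))
    / fromIntegral (S.size (S.union a b))

main :: IO ()
main = print (jaccard' (S.fromList "abc") (S.fromList "bcd"))  -- 2 / 4
```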
entropy :: (Floating c, Foldable t) => t c -> c Source
Entropy: H(X) = -SUM_i P(X=i) log_2(P(X=i)). entropy xs is the entropy of the random variable represented by the sequence xs, where each element of xs is the count of one particular value the random variable can take. If you need to compute the entropy from a sequence of outcomes, the following will work: entropy . elems . histogram
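A list-specialized sketch of this definition (hypothetical name; the library version works over any Foldable), normalizing the counts and skipping zeros so that 0 log 0 is treated as 0:

```haskell
-- Hypothetical sketch of entropy over a sequence of counts (or any
-- unnormalized distribution); zero counts are skipped.
entropy' :: [Double] -> Double
entropy' cs = negate (sum [p * logBase 2 p | c <- cs, c > 0, let p = c / total])
  where total = sum cs

main :: IO ()
main = print (entropy' [1, 1, 1, 1])  -- uniform over 4 values: 2 bits
```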
histogram :: (Num a, Ord k, Foldable t) => t k -> Map k a Source
histogram xs returns the map of frequency counts of the elements in the sequence xs.
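A plausible one-line sketch of this behavior using Data.Map (an assumption about the implementation, though fromListWith is the idiomatic way to build frequency maps):

```haskell
import qualified Data.Map as M

-- Hypothetical sketch of histogram: tally each element with
-- fromListWith, summing the 1s for duplicate keys.
histogram' :: (Ord k, Num a) => [k] -> M.Map k a
histogram' xs = M.fromListWith (+) [(x, 1) | x <- xs]

main :: IO ()
main = print (M.toList (histogram' "abaaba" :: M.Map Char Int))
```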
Extracting joint and marginal counts from Counts
countTotal :: Counts a k -> Count Source
Total element count