Safe Haskell | Safe-Inferred |
---|

Scoring functions commonly used for the evaluation of NLP systems. Most functions in this module work on sequences which are instances of `Foldable`, but some take a precomputed table of `Counts`. Precomputing the table gives a speedup if you want to compute multiple scores on the same data. For example, to compute the Mutual Information, the Adjusted Rand Index and the Variation of Information on the same pair of clusterings:

`>>> let cs = counts "abcabc" "abaaba"`

`>>> mapM_ (print . ($ cs)) [mi, ari, vi]`

`0.9182958340544894`

`0.4444444444444445`

`0.6666666666666663`

- accuracy :: (Eq a, Fractional c, Traversable t, Foldable s) => t a -> s a -> c
- recipRank :: (Eq a, Fractional b, Foldable t) => a -> t a -> b
- avgPrecision :: (Fractional n, Ord a, Foldable t) => Set a -> t a -> n
- ari :: (Ord a, Ord b) => Counts a b -> Double
- mi :: (Ord a, Ord b) => Counts a b -> Double
- vi :: (Ord a, Ord b) => Counts a b -> Double
- kullbackLeibler :: (Eq a, Floating a, Foldable f, Traversable t) => t a -> f a -> a
- jensenShannon :: (Eq a, Floating a, Traversable t, Traversable u) => t a -> u a -> a
- type Count = Double
- data Counts a b
- counts :: (Ord a, Ord b, Traversable t, Foldable s) => t a -> s b -> Counts a b
- sum :: (Foldable t, Num a) => t a -> a
- mean :: (Foldable t, Fractional n, Real a) => t a -> n
- jaccard :: (Fractional n, Ord a) => Set a -> Set a -> n
- entropy :: (Floating c, Foldable t) => t c -> c
- histogram :: (Num a, Ord k, Foldable t) => t k -> Map k a
- countJoint :: (Ord a, Ord b) => a -> b -> Counts a b -> Count
- countFst :: Ord k => k -> Counts k b -> Count
- countSnd :: Ord k => k -> Counts a k -> Count
- fstElems :: Counts k b -> [k]
- sndElems :: Counts a k -> [k]

# Scores for classification and ranking

accuracy :: (Eq a, Fractional c, Traversable t, Foldable s) => t a -> s a -> c

Accuracy: the proportion of elements in the first sequence that are equal to the elements at the corresponding positions in the second sequence. The sequences should be of equal length.
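The definition above can be illustrated with a minimal list-based sketch. The name `accuracySketch` is hypothetical and is not the module's implementation; it assumes two non-empty lists of equal length.

```haskell
-- Hypothetical illustration of the accuracy definition; not the library code.
-- Assumes both lists are non-empty and of equal length.
accuracySketch :: Eq a => [a] -> [a] -> Double
accuracySketch xs ys = fromIntegral matches / fromIntegral (length xs)
  where
    -- Number of positions at which the two lists agree.
    matches = length (filter id (zipWith (==) xs ys))
```

For instance, `accuracySketch "abcd" "abcx"` compares four positions, three of which agree.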

recipRank :: (Eq a, Fractional b, Foldable t) => a -> t a -> b

Reciprocal rank: the reciprocal of the rank at which the first argument occurs in the sequence given as the second argument.
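A minimal sketch of this definition on plain lists follows. The name `recipRankSketch` is hypothetical. Two choices here are assumptions rather than documented behaviour: ranks are counted from 1, and an element that does not occur in the sequence scores 0.

```haskell
import Data.List (elemIndex)

-- Hypothetical illustration of reciprocal rank; not the library code.
-- Assumptions: ranks are 1-based, and a missing element scores 0.
recipRankSketch :: Eq a => a -> [a] -> Double
recipRankSketch x xs =
  case elemIndex x xs of
    Just i  -> 1 / fromIntegral (i + 1)  -- elemIndex is 0-based
    Nothing -> 0
```

Under these assumptions, looking up `'c'` in `"abc"` finds it at rank 3, so the score is one third.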

avgPrecision :: (Fractional n, Ord a, Foldable t) => Set a -> t a -> n

Average precision. http://en.wikipedia.org/wiki/Information_retrieval#Average_precision
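As a rough illustration, average precision can be sketched as the sum of the precision values at each rank where a relevant item appears, divided by the number of relevant items. The name `avgPrecisionSketch` is hypothetical, and the choice of denominator (the size of the relevant set) is an assumption; the module's own function may normalize differently.

```haskell
import qualified Data.Set as Set

-- Hypothetical illustration of average precision; not the library code.
-- Assumption: the sum of precisions is divided by the size of the relevant set.
avgPrecisionSketch :: Ord a => Set.Set a -> [a] -> Double
avgPrecisionSketch relevant ranked
  | Set.null relevant = 0
  | otherwise = sum precisionsAtHits / fromIntegral (Set.size relevant)
  where
    -- True at each rank holding a relevant item.
    flags = map (`Set.member` relevant) ranked
    -- Running count of relevant items seen up to each rank.
    hitCounts = scanl1 (+) [if f then 1 else 0 | f <- flags] :: [Int]
    -- Precision at rank k, kept only for ranks that hold a relevant item.
    precisionsAtHits =
      [ fromIntegral h / fromIntegral k
      | (k, f, h) <- zip3 [1 :: Int ..] flags hitCounts, f ]
```

With relevant items `a` and `c` and the ranking `"abcd"`, the hits fall at ranks 1 and 3 with precisions 1 and 2/3.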

# Scores for clustering

ari :: (Ord a, Ord b) => Counts a b -> Double

Adjusted Rand Index: http://en.wikipedia.org/wiki/Rand_index
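The following sketch computes the Adjusted Rand Index directly from two label lists, using the standard contingency-table formulation. The names `tally`, `choose2` and `ariSketch` are hypothetical, and the sketch works on lists rather than on a `Counts` table. It assumes lists of equal length and does not guard against the degenerate case where the denominator is zero.

```haskell
import qualified Data.Map.Strict as Map

-- Occurrence counts of each distinct key (hypothetical helper).
tally :: Ord k => [k] -> [Double]
tally ks = Map.elems (Map.fromListWith (+) [(k, 1) | k <- ks])

-- Number of unordered pairs that can be drawn from n items.
choose2 :: Double -> Double
choose2 n = n * (n - 1) / 2

-- Hypothetical illustration of the Adjusted Rand Index; not the library code.
ariSketch :: (Ord a, Ord b) => [a] -> [b] -> Double
ariSketch xs ys = (index - expected) / (maxIndex - expected)
  where
    pairs    = zip xs ys
    sumC2    = sum . map choose2
    index    = sumC2 (tally pairs)             -- pairs grouped together in both clusterings
    rowTerm  = sumC2 (tally (map fst pairs))   -- pairs grouped together in the first
    colTerm  = sumC2 (tally (map snd pairs))   -- pairs grouped together in the second
    n        = fromIntegral (length pairs)
    expected = rowTerm * colTerm / choose2 n   -- expected index under chance
    maxIndex = (rowTerm + colTerm) / 2
```

On the clusterings `"abcabc"` and `"abaaba"` the terms are 3, 3 and 7 with 15 pairs in total, which gives (3 - 1.4) / (5 - 1.4), approximately 0.444.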

mi :: (Ord a, Ord b) => Counts a b -> Double

Mutual information: MI(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X). Also known as information gain.

vi :: (Ord a, Ord b) => Counts a b -> Double

Variation of information: VI(X,Y) = H(X) + H(Y) - 2 MI(X,Y)
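The two formulas can be sketched from plain label lists by using the identity H(X|Y) = H(X,Y) - H(Y), so that MI(X,Y) = H(X) + H(Y) - H(X,Y). The names `entropyOf`, `miSketch` and `viSketch` are hypothetical; the sketch works on lists of equal length rather than on a `Counts` table and measures entropy in bits.

```haskell
import qualified Data.Map.Strict as Map

-- Entropy in bits of the empirical distribution of a list of outcomes
-- (hypothetical helper).
entropyOf :: Ord k => [k] -> Double
entropyOf ks = negate (sum [p * logBase 2 p | c <- cs, let p = c / total])
  where
    cs    = Map.elems (Map.fromListWith (+) [(k, 1 :: Double) | k <- ks])
    total = sum cs

-- MI(X,Y) = H(X) + H(Y) - H(X,Y); hypothetical illustration, not the library code.
miSketch :: (Ord a, Ord b) => [a] -> [b] -> Double
miSketch xs ys = entropyOf xs + entropyOf ys - entropyOf (zip xs ys)

-- VI(X,Y) = H(X) + H(Y) - 2 MI(X,Y); hypothetical illustration.
viSketch :: (Ord a, Ord b) => [a] -> [b] -> Double
viSketch xs ys = entropyOf xs + entropyOf ys - 2 * miSketch xs ys
```

For the clusterings `"abcabc"` and `"abaaba"`, the second labelling is fully determined by the first, so the mutual information equals H(Y), approximately 0.918 bits, and the variation of information is approximately 0.667 bits.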

# Comparing probability distributions

kullbackLeibler :: (Eq a, Floating a, Foldable f, Traversable t) => t a -> f a -> a

Kullback-Leibler divergence: KL(X,Y) = SUM_i P(X=i) log_2(P(X=i)/P(Y=i)). The distributions can be unnormalized.
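A minimal sketch of this formula on lists of weights is given below. The name `klSketch` is hypothetical. It normalizes both inputs before comparing them, and it treats terms with P(X=i) = 0 as contributing 0, which is the usual convention but an assumption about the module's behaviour.

```haskell
-- Hypothetical illustration of Kullback-Leibler divergence in bits;
-- not the library code. Inputs are weights and need not sum to 1.
klSketch :: [Double] -> [Double] -> Double
klSketch xs ys = sum (zipWith term (normalize xs) (normalize ys))
  where
    -- Rescale weights so that they sum to 1.
    normalize ws = map (/ sum ws) ws
    -- Assumption: a zero-probability outcome of X contributes nothing.
    term p q
      | p == 0    = 0
      | otherwise = p * logBase 2 (p / q)
```

Because of the normalization, `klSketch [1, 1] [2, 2]` compares two identical distributions and yields 0.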

jensenShannon :: (Eq a, Floating a, Traversable t, Traversable u) => t a -> u a -> a

Jensen-Shannon divergence: JS(X,Y) = 1/2 KL(X,(X+Y)/2) + 1/2 KL(Y,(X+Y)/2).
The distributions can be unnormalized.
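The Jensen-Shannon divergence can be sketched as the average of two Kullback-Leibler divergences against the midpoint distribution. The names `normalize`, `klBits` and `jsSketch` are hypothetical. Normalizing both inputs before forming the midpoint is an assumption about how unnormalized inputs are handled.

```haskell
-- Rescale weights so that they sum to 1 (hypothetical helper).
normalize :: [Double] -> [Double]
normalize ws = map (/ sum ws) ws

-- KL divergence in bits between two already-normalized distributions.
-- Zero-probability outcomes of the first argument contribute nothing.
klBits :: [Double] -> [Double] -> Double
klBits ps qs = sum (zipWith term ps qs)
  where
    term p q
      | p == 0    = 0
      | otherwise = p * logBase 2 (p / q)

-- Hypothetical illustration of Jensen-Shannon divergence; not the library code.
jsSketch :: [Double] -> [Double] -> Double
jsSketch xs ys = 0.5 * klBits ps ms + 0.5 * klBits qs ms
  where
    ps = normalize xs
    qs = normalize ys
    ms = zipWith (\p q -> (p + q) / 2) ps qs  -- midpoint distribution
```

Two distributions with disjoint support, such as `[1, 0]` and `[0, 1]`, are maximally different and give a divergence of 1 bit, while identical distributions give 0.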

# Auxiliary types and functions

counts :: (Ord a, Ord b, Traversable t, Foldable s) => t a -> s b -> Counts a b

Creates a `Counts` table from two sequences.

mean :: (Foldable t, Fractional n, Real a) => t a -> n

The mean of a sequence of numbers.

jaccard :: (Fractional n, Ord a) => Set a -> Set a -> n

Jaccard coefficient: J(A,B) = |A intersection B| / |A union B|
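The Jaccard coefficient is the size of the intersection of two sets divided by the size of their union, which can be sketched with `Data.Set` as follows. The name `jaccardSketch` is hypothetical; the sketch does not guard against two empty sets, for which the ratio is undefined.

```haskell
import qualified Data.Set as Set

-- Hypothetical illustration of the Jaccard coefficient; not the library code.
-- Undefined (NaN) when both sets are empty.
jaccardSketch :: Ord a => Set.Set a -> Set.Set a -> Double
jaccardSketch a b =
  fromIntegral (Set.size (Set.intersection a b))
    / fromIntegral (Set.size (Set.union a b))
```

For the sets {1, 2, 3} and {2, 3, 4}, two elements are shared out of four in total, so the coefficient is 0.5.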

entropy :: (Floating c, Foldable t) => t c -> c

Entropy: H(X) = -SUM_i P(X=i) log_2(P(X=i)). `entropy xs` is the entropy of the random variable represented by the sequence `xs`, where each element of `xs` is the count of one particular value the random variable can take. If you need to compute the entropy from a sequence of outcomes, the following will work:

`entropy . elems . histogram`
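The distinction between counts and outcomes can be sketched with stand-alone list versions of both steps. The names `entropySketch`, `histogramSketch` and `entropyOfOutcomes` are hypothetical and are not the module's definitions; skipping zero counts is an assumption made so that the logarithm is always defined.

```haskell
import qualified Data.Map.Strict as Map

-- Entropy in bits from a list of counts, one count per distinct value.
-- Hypothetical illustration; zero counts are skipped (an assumption).
entropySketch :: [Double] -> Double
entropySketch cs =
  negate (sum [p * logBase 2 p | c <- cs, c > 0, let p = c / total])
  where
    total = sum cs

-- Frequency count of each distinct element (hypothetical helper).
histogramSketch :: Ord k => [k] -> Map.Map k Double
histogramSketch ks = Map.fromListWith (+) [(k, 1) | k <- ks]

-- Entropy from raw outcomes: count first, then apply the entropy formula.
entropyOfOutcomes :: Ord k => [k] -> Double
entropyOfOutcomes = entropySketch . Map.elems . histogramSketch
```

The outcomes `"aabb"` yield the counts 2 and 2, so `entropyOfOutcomes "aabb"` and `entropySketch [2, 2]` both describe a fair two-way choice with an entropy of 1 bit.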

histogram :: (Num a, Ord k, Foldable t) => t k -> Map k a

`histogram xs` returns a map of the frequency counts of the elements in the sequence `xs`.