swift-lda- Online sampler for Latent Dirichlet Allocation

Safe HaskellNone




Latent Dirichlet Allocation

Imperative implementation of a collapsed Gibbs sampler for LDA. This library uses the topic modeling terminology (documents, words, topics), even though it is generic. For example if used for word class induction, replace documents with word types, words with features and topics with word classes.

Usage example:

 import qualified Data.Vector as V
 import qualified Data.Vector.Unboxed as U
 import  Control.Monad.ST
 import NLP.SwiftLDA
 main = do
   -- Initialize model.
   let river = 1
       money = 2
       bank  = 3
   m <- stToIO $ initial (U.singleton 42) 2 10 0.1
   let  docs = [ (1, U.fromList [river, river, bank])  
               , (2, U.fromList [money, money, bank])
               , (3, U.fromList [river, money, bank])
   -- Run 10 iterations of sampling on this batch of documents.
   result <- stToIO $ run m 10 docs
   -- Display topic assignments.
   print result
   -- Run one iteration of sampling on another batch and display result
   print =<< (stToIO $ run m 1 [(4, U.fromList [bank, bank])]) 
   -- Retrieve and display word-topic weights.
   fm <- stToIO $ finalize m
   print $ wordTopics fm


Initialization and finalization

initial :: Vector Word32 -> Int -> Double -> Double -> ST s (LDA s)Source

initial s k a b initializes model with k topics, a/k alpha hyperparameter, b beta hyperparameter and random seed s.

finalize :: LDA s -> ST s FinalizedSource

Create transparent immutable object holding model information from opaque internal representation.

Running sampler

run :: Traversable f => LDA s -> Int -> f Doc -> ST s (f (Vector Z))Source

run m i batch runs an outer loop of i passes of Gibbs sampling over documents in batch using the model m and returns the topic assignments for words in the documents of the batch.


data LDA s Source

Abstract type holding the settings and the state of the sampler

type Doc = (D, Vector W)Source

A document consists of a document ID and a sequence of word IDs.

type D = IntSource

Document ID

type W = IntSource

Word ID

type Z = IntSource

Topic ID

Access model information

data Finalized Source




docTopics :: !Table2D

Document topic counts

wordTopics :: !Table2D

Word topic counts

topics :: !Table1D

Topics counts

topicDocs :: !Table2D

Inverse document-topic counts

topicWords :: !Table2D

Inverse word-topic counts

alphasum :: !Double

alpha * K Dirichlet parameter (topic sparseness)

beta :: !Double

beta Dirichlet parameter (word sparseness)

topicNum :: !Int

Number of topics K

wSize :: !Int

Number of unique words

exponent :: !(Maybe Double)

Learning rate exponent


Querying evolving model

Querying finalized model

docTopicWeights :: Finalized -> Doc -> Vector DoubleSource

docTopicWeights m doc returns unnormalized topic probabilities for document doc given LDA model m.

wordTopicWeights :: Finalized -> D -> W -> Vector DoubleSource

topicWeights m d w returns the unnormalized probabilities of topics for word w in document d given LDA model m.

docCounts :: Finalized -> Table1DSource

For each document sum the topic counts.