Safe Haskell | None |
---|---|
Language | Haskell2010 |
Citeseer document classification dataset, from :
Qing Lu, and Lise Getoor. "Link-based classification." ICML, 2003.
Synopsis
- stash :: FilePath -> IO ()
- citeseerGraph :: FilePath -> IO (Graph ContentRow)
- citeseerGraphEdges :: (MonadResource m, MonadThrow m) => FilePath -> Map String (Seq Int16, DocClass) -> ConduitT i (Maybe (Graph ContentRow)) m ()
- restoreContent :: FilePath -> IO (Map String (Seq Int16, DocClass))
- data ContentRow = CRow {}
- data DocClass
1. Download the dataset
Download, parse, serialize and save the dataset to local storage
2. Reconstruct the citation graph
:: FilePath | directory where the data files were saved |
-> IO (Graph ContentRow) |
Reconstruct the citation graph
NB : relies on the user having stash
ed the dataset to local disk first.
:: (MonadResource m, MonadThrow m) | |
=> FilePath | directory of data files |
-> Map String (Seq Int16, DocClass) |
|
-> ConduitT i (Maybe (Graph ContentRow)) m () |
Stream out the edges of the citation graph, in which the nodes are decorated with the document metadata.
The full citation graph can be reconstructed by folding over this stream and overlay
ing the graph edges as they arrive.
This way the graph can be partitioned in training , test and validation subsets at the usage site
Load the graph node data from local storage
Types
data ContentRow Source #
Dataset row of the .content file
The .content file contains descriptions of the papers in the following format:
<paper_id> <word_attributes> <class_label>
The first entry in each line contains the unique string ID of the paper followed by binary values indicating whether each word in the vocabulary is present (indicated by 1) or absent (indicated by 0) in the paper (vocabulary : 3703 unique words). Finally, the last entry in the line contains the class label of the paper.
Instances
document classes of the Citeseer dataset
Instances
Enum DocClass Source # | |
Defined in Algebra.Graph.IO.Datasets.LINQS.Citeseer | |
Eq DocClass Source # | |
Ord DocClass Source # | |
Defined in Algebra.Graph.IO.Datasets.LINQS.Citeseer | |
Show DocClass Source # | |
Generic DocClass Source # | |
Binary DocClass Source # | |
type Rep DocClass Source # | |
Defined in Algebra.Graph.IO.Datasets.LINQS.Citeseer type Rep DocClass = D1 ('MetaData "DocClass" "Algebra.Graph.IO.Datasets.LINQS.Citeseer" "algebraic-graphs-io-0.2-HM3hsaOKtKl5JJVlqGeywp" 'False) ((C1 ('MetaCons "Agents" 'PrefixI 'False) (U1 :: Type -> Type) :+: (C1 ('MetaCons "AI" 'PrefixI 'False) (U1 :: Type -> Type) :+: C1 ('MetaCons "DB" 'PrefixI 'False) (U1 :: Type -> Type))) :+: (C1 ('MetaCons "IR" 'PrefixI 'False) (U1 :: Type -> Type) :+: (C1 ('MetaCons "ML" 'PrefixI 'False) (U1 :: Type -> Type) :+: C1 ('MetaCons "HCI" 'PrefixI 'False) (U1 :: Type -> Type)))) |