rake-0.0.1: Rapid Automatic Keyword Extraction (RAKE)

Copyright: (c) Tobias Schoofs
License: LGPL
Stability: experimental
Portability: portable
Safe Haskell: Safe-Inferred
Language: Haskell98

NLP.RAKE.Text

Description

The RAKE Text interface. (Currently the only one...)

Keywords

type WordScore = (Text, Double)

A keyword candidate: a keyword consisting of one or more words, paired with the score associated with it.

candidates :: StopwordsMap -> NoSplit -> NoList -> [Text] -> [WordScore]

This interface provides the most flexibility. It expects a Map of stop words, a nosplit list used by the word splitter, an additional list of words or symbols to exclude for a specific document, and a text split into phrases. Users may pass in their own stop word list (e.g. by loading it from a file, see loadStopWords) or one of the predefined lists (smartStoplist, foxStoplist).
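A minimal sketch of calling candidates directly, assuming the rake package is installed and NLP.RAKE.Text exports the names documented on this page. The tiny stop word list, the extra nolist symbols and the name extract are hypothetical and serve only as illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import           Data.Text (Text)
import           NLP.RAKE.Text

-- Hypothetical, deliberately tiny stop word list (illustration only).
myStopwords :: StopwordsMap
myStopwords = mkStopwords ["a", "and", "for", "is", "of", "the", "to"]

-- Hypothetical nolist: the default entries plus two formula symbols.
myNolist :: NoList
myNolist = defaultNolist ++ ["=", "+"]

-- Split the text with the default phrase splitter, score the
-- candidates and put the highest-scoring ones first.
extract :: Text -> [WordScore]
extract = sortByScore . candidates myStopwords defaultNosplit myNolist . pSplitter
```

Swapping myStopwords for smartStoplist or foxStoplist gives the same pipeline with one of the predefined lists.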

keywords :: Text -> [WordScore]

The keywords function is a convenience interface that makes several decisions internally: it uses the defaultStoplist, the English-language nosplit list (defaultNosplit) and the default nolist, and it splits the text into phrases using pSplitter.

The function is equivalent to

candidates defaultStoplist defaultNosplit defaultNolist . pSplitter
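As a usage sketch, the convenience interface can be combined with sortByScore to obtain the top-ranked candidates. This assumes the rake package is installed; topKeywords and printKeywords are hypothetical helper names, not part of the library.

```haskell
import           Data.Text (Text)
import qualified Data.Text as T
import           NLP.RAKE.Text (WordScore, keywords, sortByScore)

-- The n highest-scoring keyword candidates of a text.
topKeywords :: Int -> Text -> [WordScore]
topKeywords n = take n . sortByScore . keywords

-- Print one candidate per line as "score<TAB>keyword".
printKeywords :: [WordScore] -> IO ()
printKeywords = mapM_ (\(w, s) -> putStrLn (show s ++ "\t" ++ T.unpack w))
```

For example, printKeywords (topKeywords 5 doc) prints at most five lines for a document doc.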

Utilities

sortByScore :: [WordScore] -> [WordScore]

Sort the WordScore list by scores (descending!)

sortByWord :: [WordScore] -> [WordScore]

Sort the WordScore list by words (ascending!)
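The two orderings can be pictured with the following self-contained sketch. It is not the library's source: the primed names are hypothetical, the sample scores are made up, and how entries with equal keys are ordered is an assumption (here, the stable order of Data.List.sortBy).

```haskell
import Data.List (sortBy)
import Data.Ord  (Down (..), comparing)
import Data.Text (Text, pack)

type WordScore = (Text, Double)

-- Highest score first.
sortByScore' :: [WordScore] -> [WordScore]
sortByScore' = sortBy (comparing (Down . snd))

-- Alphabetical by keyword.
sortByWord' :: [WordScore] -> [WordScore]
sortByWord' = sortBy (comparing fst)

-- Made-up sample data.
sample :: [WordScore]
sample =
  [ (pack "linear constraints", 4.5)
  , (pack "systems", 1.0)
  , (pack "minimal set", 4.0)
  ]
```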

pSplitter :: Text -> [Text]

Default phrase splitter. It splits phrases at characters in the punctuation category (those for which isPunctuation is True) with the exception of '-'.
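The splitting rule described above can be sketched with Data.Char.isPunctuation and Data.Text.split. This is an illustration, not the library's implementation: splitPhrases is a hypothetical name, and trimming whitespace and dropping empty pieces are assumptions of this sketch.

```haskell
import           Data.Char (isPunctuation)
import qualified Data.Text as T

-- Split at every punctuation character except '-'.
-- Assumption: surrounding whitespace is trimmed and empty pieces are dropped.
splitPhrases :: T.Text -> [T.Text]
splitPhrases = filter (not . T.null) . map T.strip . T.split isBoundary
  where
    isBoundary c = isPunctuation c && c /= '-'
```

Because '-' is exempt, a hyphenated term such as "state-of-the-art" stays within a single phrase.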

Resources

type NoSplit = String

List containing characters at which we do not split words. This list is language dependent.
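The role of a nosplit list can be illustrated with a small sketch. It assumes that the word splitter breaks the text at every character that is not in the list; this is an interpretation of the description above, not the library's code. splitWords and asciiLetters are hypothetical names.

```haskell
import qualified Data.Text as T

type NoSplit = String

-- Assumed behaviour: break the text at every character NOT in the
-- nosplit list and drop empty pieces.
splitWords :: NoSplit -> T.Text -> [T.Text]
splitWords nosplit = filter (not . T.null) . T.split (`notElem` nosplit)

-- Hypothetical nosplit list containing only ASCII letters.
asciiLetters :: NoSplit
asciiLetters = ['a' .. 'z'] ++ ['A' .. 'Z']
```

Under this assumption, splitWords asciiLetters breaks "well-known" into two words, while splitWords (asciiLetters ++ "-") keeps it whole. Since NoSplit is a String, lists can be combined with (++).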

defaultNosplit :: NoSplit

The default list is for English and considers only ASCII characters, the digits 0..9 and some other symbols.

There are resources for other languages, but they need review and contribution!

enNosplit :: NoSplit

ASCII characters.

othNosplit :: NoSplit

Some additional symbols ("+-/").

latinExAnosplit :: NoSplit

Latin1 extended-A

latinExBnosplit :: NoSplit

Latin1 extended-B

greekNosplit :: NoSplit

Greek and Coptic (needs revision)

cyrillicNosplit :: NoSplit

Cyrillic (needs revision)

Stopwords

The very heart of the RAKE algorithm is the use of stop words, a concept defined by NLP pioneer Hans Peter Luhn. Stop words are frequent words in a language that are considered to be void of specific semantics. They, of course, have an important role in the language, but they do not help to determine the topic a specific document is about, e.g. "is", "the", "of" and so on. Stop words depend on the specific context of the documents to be analysed; there are, however, frequently used lists with wide applicability.

The library comes with two stop word lists built in: the smartStoplist and the foxStoplist, both for English. The list used by default is smartStoplist.

The user is free to define her own stop word list, which can be loaded from a file using loadStopWords. The file format is simple:

  • Lines starting with '#' are ignored (comments);

  • Each line contains one word.
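The file format can be sketched as a small parser. This is not the source of loadStopWords: parseStopwords and loadStopwordFile are hypothetical names, and trimming whitespace and skipping blank lines are assumptions that the format description does not state.

```haskell
import qualified Data.Map     as Map
import qualified Data.Text    as T
import qualified Data.Text.IO as TIO

type StopwordsMap = Map.Map T.Text ()

-- Keep one word per line; skip '#' comment lines.
-- Assumption: lines are trimmed and blank lines are skipped.
parseStopwords :: T.Text -> StopwordsMap
parseStopwords =
  Map.fromList . map (\w -> (w, ())) . filter keep . map T.strip . T.lines
  where
    keep l = not (T.null l) && not (T.pack "#" `T.isPrefixOf` l)

-- Read a stop word file and parse its contents.
loadStopwordFile :: FilePath -> IO StopwordsMap
loadStopwordFile = fmap parseStopwords . TIO.readFile
```

A file with the lines "# my list", "the" and "of" would yield a map with the two keys "of" and "the".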

type StopwordsMap = Map Text ()

Search tree for stop words

mkStopwords :: [Text] -> StopwordsMap

Make StopwordsMap starting from a list of stop words encoded as Text

mkStopwordsStr :: [String] -> StopwordsMap

Make StopwordsMap starting from a list of stop words encoded as String

loadStopWords :: FilePath -> IO StopwordsMap

Load a stop word list from a file.

stopword :: StopwordsMap -> NoList -> Text -> Bool

Search for a chunk of Text in the StopwordsMap. Note that a word or symbol that does not appear in the stop word list may still be on the nolist, in which case it still counts as a stop word (e.g. "-").
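The lookup rule can be sketched as follows. isStopword is a hypothetical stand-in for illustration, not the library's definition; in particular, the sketch performs no case folding, which is an assumption.

```haskell
import qualified Data.Map  as Map
import           Data.Text (Text, pack)

type StopwordsMap = Map.Map Text ()
type NoList       = [Text]

-- A chunk counts as a stop word if it is a key of the map
-- or an element of the nolist.
isStopword :: StopwordsMap -> NoList -> Text -> Bool
isStopword stops nolist w = Map.member w stops || w `elem` nolist

-- Made-up sample resources.
sampleStops :: StopwordsMap
sampleStops = Map.fromList [(pack "the", ()), (pack "of", ())]

sampleNolist :: NoList
sampleNolist = [pack "-"]
```

With these samples, "the" and "-" both count as stop words, while "keyword" does not.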

defaultStoplist :: StopwordsMap

The default stop word list (smartStoplist).

smartStoplist :: StopwordsMap

The "smart" stop word list

foxStoplist :: StopwordsMap

The "Fox" stop word list

type NoList = [Text]

The nolist: Symbols in this list count as stop words independently of the chosen stop word list. This list can be used to exclude very specific "words" that may occur in a given domain, for instance mathematical formulas and symbols.

defaultNolist :: NoList

Currently, the default nolist contains only the symbol "-".