Copyright | (c) Tobias Schoofs |
---|---|
License | LGPL |
Stability | experimental |
Portability | portable |
Safe Haskell | Safe-Inferred |
Language | Haskell98 |
The RAKE Text interface. (Currently the only one...)
- type WordScore = (Text, Double)
- candidates :: StopwordsMap -> NoSplit -> NoList -> [Text] -> [WordScore]
- keywords :: Text -> [WordScore]
- sortByScore :: [WordScore] -> [WordScore]
- sortByWord :: [WordScore] -> [WordScore]
- pSplitter :: Text -> [Text]
- type NoSplit = String
- defaultNosplit :: NoSplit
- enNosplit :: NoSplit
- numNosplit :: NoSplit
- othNosplit :: NoSplit
- latin1Nosplit :: NoSplit
- latinExAnosplit :: NoSplit
- latinExBnosplit :: NoSplit
- greekNosplit :: NoSplit
- cyrillicNosplit :: NoSplit
- type StopwordsMap = Map Text ()
- mkStopwords :: [Text] -> StopwordsMap
- mkStopwordsStr :: [String] -> StopwordsMap
- loadStopWords :: FilePath -> IO StopwordsMap
- stopword :: StopwordsMap -> NoList -> Text -> Bool
- defaultStoplist :: StopwordsMap
- smartStoplist :: StopwordsMap
- foxStoplist :: StopwordsMap
- type NoList = [Text]
- defaultNolist :: NoList
Keywords
type WordScore = (Text, Double)
The result is a keyword candidate: a keyword consisting of one or more words, paired with the score associated with that keyword.
candidates :: StopwordsMap -> NoSplit -> NoList -> [Text] -> [WordScore]
This interface provides the most flexibility. It expects a Map of stop words, a nosplit list used by the word splitter, an additional list of words or symbols you want to exclude for a specific document, and a text split into phrases. Users may pass in their own stop word list (e.g. by loading it from a file, see loadStopWords) or one of the predefined lists (smartStoplist, foxStoplist).
keywords :: Text -> [WordScore]
The keywords function is a convenience interface that makes several decisions internally: it uses the defaultStoplist, the English-language nosplit list and the default nolist, and it splits the text into phrases using pSplitter. The function is equivalent to
candidates defaultStoplist defaultNosplit defaultNolist . pSplitter
Utilities
sortByScore :: [WordScore] -> [WordScore]
Sort the WordScore list by score (descending).
sortByWord :: [WordScore] -> [WordScore]
Sort the WordScore list by word (ascending).
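As a rough illustration of the two orderings, the sketch below sorts a small list of scored keywords using only functions from base. The names byScoreDesc and byWordAsc and the sample scores are hypothetical; this is not necessarily how the library implements sortByScore and sortByWord.

```haskell
import Data.List (sortBy)
import Data.Ord (Down (..), comparing)
import Data.Text (Text)
import qualified Data.Text as T

type WordScore = (Text, Double)

-- | Hypothetical stand-in for sortByScore: highest score first.
byScoreDesc :: [WordScore] -> [WordScore]
byScoreDesc = sortBy (comparing (Down . snd))

-- | Hypothetical stand-in for sortByWord: alphabetical by keyword.
byWordAsc :: [WordScore] -> [WordScore]
byWordAsc = sortBy (comparing fst)

-- | Made-up sample scores, for illustration only.
sample :: [WordScore]
sample =
  [ (T.pack "linear constraints", 4.5)
  , (T.pack "systems", 1.0)
  , (T.pack "minimal generating sets", 9.0)
  ]
```

Loaded in GHCi, byScoreDesc sample puts "minimal generating sets" first, while byWordAsc sample starts with "linear constraints".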
pSplitter :: Text -> [Text]
Default phrase splitter. It splits phrases at characters in the punctuation category (those for which isPunctuation is True), with the exception of '-'.
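The splitting rule described above can be sketched as follows. This is a minimal approximation, not the library's code: splitPhrases and isPhraseBreak are hypothetical names, and trimming whitespace and dropping empty phrases are assumptions of the sketch.

```haskell
import Data.Char (isPunctuation)
import Data.Text (Text)
import qualified Data.Text as T

-- | True for punctuation characters other than '-'.
isPhraseBreak :: Char -> Bool
isPhraseBreak c = isPunctuation c && c /= '-'

-- | Hypothetical stand-in for pSplitter. Trimming surrounding
--   whitespace and dropping empty phrases are assumptions.
splitPhrases :: Text -> [Text]
splitPhrases = filter (not . T.null) . map T.strip . T.split isPhraseBreak
```

Under this rule "Red-black trees, in general. Fast!" yields the phrases "Red-black trees", "in general" and "Fast". Note that isPunctuation is also True for the apostrophe, so in this sketch a word like "don't" is split as well.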
Resources
type NoSplit = String
List of characters at which we do not split words. This list is language-dependent.
defaultNosplit :: NoSplit
The default list is for English and considers only ASCII characters, the digits 0..9 and some other symbols.
There are resources for other languages, but they need review and contributions.
numNosplit :: NoSplit
Digits
othNosplit :: NoSplit
Some more symbols ("+-/")
latin1Nosplit :: NoSplit
Latin1
latinExAnosplit :: NoSplit
Latin Extended-A
latinExBnosplit :: NoSplit
Latin Extended-B
greekNosplit :: NoSplit
Greek and Coptic (needs revision)
cyrillicNosplit :: NoSplit
Cyrillic (needs revision)
Stopwords
The very heart of the RAKE algorithm is the use of stop words, a concept defined by NLP pioneer Hans Peter Luhn. Stop words are frequent words in a language that are considered to be void of specific semantics. They, of course, have an important role in the language, but they do not help to determine the topic a specific document is about, e.g. "is", "the", "of" and so on. Stop words depend on the specific context of the documents to be analysed; there are, however, frequently used lists with wide applicability.
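In RAKE as it is commonly described, stop words act as delimiters: a phrase is cut at every stop word, and the remaining runs of consecutive words become keyword candidates. The sketch below illustrates only that idea. The name candidateRuns is hypothetical, and lower-casing and whitespace tokenisation are simplifying assumptions; the library itself works with its word splitter and nosplit list.

```haskell
import qualified Data.Map.Strict as Map
import Data.Text (Text)
import qualified Data.Text as T

type StopwordsMap = Map.Map Text ()

-- | Hypothetical helper: cut a phrase at stop words and return the
--   runs of consecutive non-stop words as candidate keywords.
candidateRuns :: StopwordsMap -> Text -> [Text]
candidateRuns sw = map T.unwords . go . T.words . T.toLower
  where
    isStop w = Map.member w sw
    -- Skip leading stop words, take the next run, then continue.
    go ws = case break isStop (dropWhile isStop ws) of
      ([], _)     -> []
      (run, rest) -> run : go rest
```

With the stop words "of" and "the", the phrase "Compatibility of systems of linear constraints" gives the candidates "compatibility", "systems" and "linear constraints".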
The library comes with two stop word lists built in: the smartStoplist and the foxStoplist, both for English. The list used by default is smartStoplist.
The user is free to define her own stop word list, which can be loaded from a file using loadStopWords.
The file format is simple:
- Lines starting with '#' are ignored (comments);
- Each line contains one word.
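To make the file format concrete, here is a minimal sketch of a parser for it. It is not the library's loadStopWords: parseStopwords is a hypothetical pure helper, and skipping blank lines and trimming surrounding whitespace are assumptions the description above does not spell out.

```haskell
import qualified Data.Map.Strict as Map
import Data.Text (Text)
import qualified Data.Text as T

type StopwordsMap = Map.Map Text ()

-- | Hypothetical parser for the stop word file format:
--   one word per line, lines starting with '#' are comments.
parseStopwords :: Text -> StopwordsMap
parseStopwords =
  Map.fromList
    . map (\w -> (w, ()))
    . filter (not . T.null)                    -- assumption: skip blank lines
    . map T.strip                              -- assumption: trim whitespace
    . filter (not . T.isPrefixOf (T.pack "#")) -- drop comment lines
    . T.lines
```

A file with the lines "# English", "the", "of" and "is" would produce a map whose keys are "is", "of" and "the".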
type StopwordsMap = Map Text ()
Search tree for stop words
mkStopwords :: [Text] -> StopwordsMap
Make a StopwordsMap from a list of stop words encoded as Text.
mkStopwordsStr :: [String] -> StopwordsMap
Make a StopwordsMap from a list of stop words encoded as String.
loadStopWords :: FilePath -> IO StopwordsMap
Load a stop word list from a file.
stopword :: StopwordsMap -> NoList -> Text -> Bool
Search for a chunk of Text in the StopwordsMap. Note that a word or symbol that does not appear in the stop word list may still be on the nolist and then still counts as a stop word (e.g. "-").
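The combined lookup can be pictured as below. This is only a sketch of the behaviour described above, with the hypothetical name isStop; whether the library normalises case before the lookup is not stated here, so the sketch compares words exactly as given.

```haskell
import qualified Data.Map.Strict as Map
import Data.Text (Text)
import qualified Data.Text as T

type StopwordsMap = Map.Map Text ()
type NoList = [Text]

-- | Hypothetical stand-in for stopword: a word counts as a stop word
--   if it is in the stop word map or on the nolist.
isStop :: StopwordsMap -> NoList -> Text -> Bool
isStop sw nolist w = Map.member w sw || w `elem` nolist
```

With a stop word map containing only "the" and a nolist containing only "-", both "the" and "-" count as stop words, while "tree" does not.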
defaultStoplist :: StopwordsMap
The default stop word list (smartStoplist).
smartStoplist :: StopwordsMap
The "smart" stop word list
foxStoplist :: StopwordsMap
The "Fox" stop word list
The nolist: Symbols in this list count as stop words independently of the chosen stop word list. This list can be used to exclude very specific "words" that may occur in a given domain, such as mathematical formulas and symbols.
defaultNolist :: NoList
Currently, the default nolist contains only the symbol "-".