Safe Haskell	None
Language	Haskell2010

Freq

Contents

Frequency table builder (trainer) type
Construction
Training
Using a trained model
Pretty Printing

Description

This library provides a way to train a model that predicts the "randomness" of an input ByteString, and two datatypes to facilitate this:

FreqTrain is a datatype that can be constructed via training functions that take ByteStrings as input, and can be used with the measure function to gather an estimate of the aforementioned probability of "randomness".

Freq is a datatype that is constructed by calling the tabulate function on a FreqTrain. Freqs are meant solely for using (accessing the "randomness" values) the trained model in practise, by making significant increases to speed in exchange for less extensibility; you can neither make a change to a Freq or convert it back to a FreqTrain. In practise this however proves to not be a problem, because training usually only happens once.

Laws:

 measure (f :: FreqTrain) b ≡ measure (tabulate f) b

Below is a simple illustration of how to use this library. We are going to write a small command-line application that trains on some data, and scores ByteStrings according to how random they are. We will say that a ByteString is random if it scores less than 0.05 (on a scale of 0 to 1), and not random otherwise.

First, some imports:

 import Freq
 import Control.Monad (forever)
 
 import qualified Data.ByteString.Char8 as BC

Next, a list of FilePaths containing training data. The training data here is the same as is provided in the sample executable of this library. It consists solely of books in the Public Domain.

 trainTexts :: [FilePath]
 trainText
   = fmap (x -> "txtdocs/" ++ x ".txt")
     -- ^
     -- | this line just tells us that all
     --   of the training data is in the txtdocs
     --   directory, and has a '.txt' file extension.
       [ "2000010"
       , "2city10"
       , "80day10"
       , "alcott-little-261"
       , "byron-don-315"
       , "carol10"
       , "center_earth"
       , "defoe-robinson-103"
       , "dracula"
       , "freck10"
       , "invisman"
       , "kipling-jungle-148"
       , "lesms10"
       , "london-call-203"
       , "london-sea-206"
       , "longfellow-paul-210"
       , "madambov"
       , "monroe-d"
       , "moon10"
       , "ozland10"
       , "plgrm10"
       , "sawy210"
       , "speckldb"
       , "swift-modest-171"
       , "time_machine"
       , "war_peace"
       , "white_fang"
       , "zenda10"
       ]

We are going to use a function provided by this library called trainWithMany. Its type signature is:

 trainWithMany
   :: Foldable t
   => t FilePath   -- ^ FilePaths containing training data
   -> IO FreqTrain -- ^ Frequency table generated as a result of training, inside of IO

In other words, trainWithMany takes a bunch of files, trains a model with all of the training data contained therein, and returns a FreqTrain inside of IO.

And now, we get freaky:

 -- | "passes" returns a message letting the user know whether
 --   or not their input ByteString was most likely random.
 --   Recall that our threshold is 0.05 on a scale of 0 to 1.
 passes :: Double -> String
 passes x
   | x < 0.05  = "Too random!"
   | otherwise = "Looks good to me!"

 main :: IO ()
 main = do
   !freak <- trainWithMany trainTexts
   -- ^
   -- | create the trained model
   
   let !freakTable = tabulate freak
   -- ^
   -- | optimise the trained model for
   --   read access
   
   putStrLn "Done loading frequencies."
   -- ^
   -- | let the user know that our model
   --   is done training and has finished
   --   optimising into a Freq
   
   forever $ do
   -- ^
   -- | make the following code loop forever 
     
     putStrLn "Enter text:"
     -- ^
     -- | ask the user for some text
     
     !bs <- BC.getLine
     -- ^
     -- | bs is the input ByteString to score
     
     let !score = measure freakTable bs
     -- ^
     -- | score of the ByteString!
     
     putStrLn $ "Score: " ++ show score ++ "n"
       ++ passes score
     -- ^  
     -- | print out what the score of the ByteString was,
     --   along with its 'passing status'.

This results in the following interactions, split up for readability:

>>> Done loading frequencies.
>>> Enter text:
>>> freq
>>> Score: 0.10314131395591991
>>> Looks good to me!

>>> Enter text:
>>> kjdslfkajdslkfjsd
>>> Score: 6.693203041828383e-3
>>> Too random!

>>> Enter text:
>>> William
>>> Score: 7.086442245879888e-2
>>> Looks good to me!

>>> Enter text:
>>> 8op3u92jf
>>> Score: 6.687182330334067e-3
>>> Too random!

As we can see, it rejects the keysmashed text as being too random, while the human-readable text is A-OK. I actually made the threshold of 0.05 too high - it should be somewhere between 0.01 and 0.03, but even then the outcomes would have still been the same. The digram-based approach that freq uses may seem ridiculously naive, but still maintains a high degree of accuracy.

As an example of a real-world use case, I wrote freq to use at my workplace (I work at a Network Security company) as a way to score domain names according to how random they are. Malicious users spin up fake domains frequently using strings of random characters. This can also be used to score Windows executables, since those follow the same pattern of malicious naming.

An obvious weakness of this library is that it suffers from what can be referred to as the "xkcd problem". It can score things such as xkcd poorly, even though they are perfectly legitimate domains. The fix I use is to use something like the alexa top 1 million list of domains, along with a HashMap(s) for whitelisting/blacklisting.

As a wise man once told me - "And then I freaked it."

Synopsis

Frequency table builder (trainer) type

data FreqTrain Source #

A FreqTrain is a digram-based frequency table.

One can construct a FreqTrain with train, trainWith, or trainWithMany.

One can use a trained FreqTrain with prob and measure.

mappend == <> will add the values of each of the matching keys.

It is highly recommended to convert a FreqTrain to a Freq with tabulate before using the trained model, because Freqs have O(1) reads as well as significantly faster constant-time operations, however keep in mind that Freqs cannot be neither modified nor converted back to a FreqTrain.

Instances

Semigroup FreqTrain Source #
Methods (<>) :: FreqTrain -> FreqTrain -> FreqTrain # sconcat :: NonEmpty FreqTrain -> FreqTrain # stimes :: Integral b => b -> FreqTrain -> FreqTrain #
Monoid FreqTrain Source #
Methods mempty :: FreqTrain # mappend :: FreqTrain -> FreqTrain -> FreqTrain # mconcat :: [FreqTrain] -> FreqTrain #
Freaky FreqTrain Source #
Methods prob :: FreqTrain -> Word8 -> Word8 -> Double Source #

Construction

empty :: FreqTrain Source #

O(1). The empty frequency table.

singleton Source #

Arguments

:: Word8	Outer key
-> Word8	Inner key
-> Double	Weight
-> FreqTrain	The singleton frequency table

O(1). A Frequency table with a single entry.

Training

train :: ByteString -> FreqTrain Source #

Given a ByteString consisting of training data, build a Frequency table.

trainWith Source #

Arguments

:: FilePath	`FilePath` containing training data
-> IO FreqTrain	Frequency table generated as a result of training, inside of `IO`.

Given a FilePath containing training data, build a Frequency table inside of the IO monad.

trainWithMany Source #

Arguments

:: Foldable t
=> t FilePath	`FilePath`s containing training data
-> IO FreqTrain	Frequency table generated as a result of training, inside of `IO`.

Given a list of FilePath containing training data, build a Frequency table inside of the IO monad.

Using a trained model

tabulate :: FreqTrain -> Freq Source #

Optimise a FreqTrain for O(1) read access.

data Freq Source #

A variant of FreqTrain that holds identical information but is optimised for reads. There are no operations that imbue a Freq with additional information.

Reading from a Freq is orders of magnitude faster than reading from a FreqTrain. It is highly recommended that you use your trained model by first converting a FreqTrain to a Freq with tabulate.

Instances

Show Freq Source #
Methods showsPrec :: Int -> Freq -> ShowS # show :: Freq -> String # showList :: [Freq] -> ShowS #
Freaky Freq Source #
Methods prob :: Freq -> Word8 -> Word8 -> Double Source #

measure :: Freaky a => a -> ByteString -> Double Source #

Given a Frequency table and a ByteString, measure returns the probability that the ByteString is not randomised. The accuracy of measure is is heavily affected by your training data.

prob :: Freaky a => a -> Word8 -> Word8 -> Double Source #

Given a Frequency table and characters c1 and c2, what is the probability that c1 follows c2?

Pretty Printing

prettyFreqTrain :: FreqTrain -> IO () Source #

Pretty-print a FreqTrain.