Safe Haskell | None |
---|---|
Language | Haskell2010 |
This library provides a way to train a model
that predicts the "randomness" of an input
,
and two datatypes to facilitate this:ByteString
is a datatype that can be constructed via
training functions that take FreqTrain
s as input, and
can be used with the ByteString
function to gather an
estimate of the aforementioned probability of "randomness".measure
is a datatype that is constructed by calling the Freq
function on a tabulate
. FreqTrain
s are meant solely for using (accessing
the "randomness" values) the trained model in practise, by making
significant increases to speed in exchange for less extensibility;
you can neither make a change to a Freq
or convert it back to
a Freq
. In practise this however proves to not be a problem,
because training usually only happens once.FreqTrain
Laws:
measure
(f ::FreqTrain
) b ≡measure
(tabulate
f) b
Below is a simple illustration of how to use this library.
We are going to write a small command-line application that
trains on some data, and scores
s according to how
random they are. We will say that a ByteString
is ByteString
random
if it scores less than 0.05 (on a scale of 0 to 1), and not random
otherwise.
First, some imports:
import Freq import Control.Monad (forever) import qualified Data.ByteString.Char8 as BC
Next, a list of
s containing training data.
The training data here is the same as is provided in
the sample executable of this library. It consists solely
of books in the Public Domain.FilePath
trainTexts :: [FilePath]
trainText
= fmap (x -> "txtdocs/" ++ x ++ ".txt")
-- ^
-- | this line just tells us that all
-- of the training data is in the txtdocs
-- directory, and has a '.txt' file extension.
-- | These are the text files from which we wish to train.
-- v
[ "2000010"
, "2city10"
, "80day10"
, "alcott-little-261"
, "byron-don-315"
, "carol10"
, "center_earth"
, "defoe-robinson-103"
, "dracula"
, "freck10"
, "invisman"
, "kipling-jungle-148"
, "lesms10"
, "london-call-203"
, "london-sea-206"
, "longfellow-paul-210"
, "madambov"
, "monroe-d"
, "moon10"
, "ozland10"
, "plgrm10"
, "sawy210"
, "speckldb"
, "swift-modest-171"
, "time_machine"
, "war_peace"
, "white_fang"
, "zenda10"
]
We are going to use a function provided by this library
called
. Its type signature is:trainWithMany
trainWithMany
:: Foldable t
=> t FilePath -- ^ FilePaths containing training data
-> IO FreqTrain -- ^ Frequency table generated as a result of training, inside of IO
In other words,
takes a bunch of files,
trains a model with all of the training data contained therein,
and returns a trainWithMany
inside of FreqTrain
.IO
And now, we get freaky:
-- | "passes" returns a message letting the user know whether -- or not their inputByteString
was most likely random. -- Recall that our threshold is 0.05 on a scale of 0 to 1. passes :: Double -> String passes x | x < 0.05 = "Too random!" | otherwise = "Looks good to me!" main :: IO () main = do !freak <- trainWithMany trainTexts -- ^ -- | create the trained model -- | Note that we do this strictly, -- | so that the model is ready to -- | go when we intuitively expect it -- | to be. let !freakTable = tabulate freak -- ^ -- | optimise the trained model for -- read access putStrLn "Done loading frequencies." -- ^ -- | let the user know that our model -- is done training and has finished -- optimising into aFreq
forever $ do -- ^ -- | make the following code loop forever putStrLn "Enter text:" -- ^ -- | ask the user for some text !bs <- BC.getLine -- ^ -- | bs is the inputByteString
to score let !score = measure freakTable bs -- ^ -- | score of theByteString
! putStrLn $ "Score: " ++ show score ++ "n" ++ passes score -- ^ -- | print out what the score of theByteString
was, -- along with its 'passing status'.
This results in the following interactions, split up for readability:
>>>
Done loading frequencies.
>>>
Enter text:
>>>
freq
>>>
Score: 0.10314131395591991
>>>
Looks good to me!
>>>
Enter text:
>>>
kjdslfkajdslkfjsd
>>>
Score: 6.693203041828383e-3
>>>
Too random!
>>>
Enter text:
>>>
William
>>>
Score: 7.086442245879888e-2
>>>
Looks good to me!
>>>
Enter text:
>>>
8op3u92jf
>>>
Score: 6.687182330334067e-3
>>>
Too random!
As we can see, it rejects the keysmashed text as being too random,
while the human-readable text is A-OK. I actually made the threshold
of 0.05 too high - it should be somewhere between 0.01 and 0.03, but
even then the outcomes would have still been the same. The digram-based
approach that freq
uses may seem ridiculously naive, but still
maintains a high degree of accuracy.
As an example of a real-world use case, I wrote freq
to use at my
workplace (I work at a Network Security company) as a way to score
domain names according to how random they are. Malicious
users spin up fake domains frequently using strings of random characters.
This can also be used to score Windows executables, since
those follow the same pattern of malicious naming.
An obvious weakness of this library is that it suffers from what can
be referred to as the "xkcd problem". It can score things such as xkcd
poorly, even though they are perfectly legitimate domains. The fix I use is
to use something like the alexa top 1 million list of domains, along with a
HashMap(s) for whitelisting/blacklisting.
As a wise man once told me - "And then I freaked it."
Synopsis
- data FreqTrain
- empty :: FreqTrain
- singleton :: Word8 -> Word8 -> Double -> FreqTrain
- train :: ByteString -> FreqTrain
- trainWith :: FilePath -> IO FreqTrain
- trainWithMany :: Foldable t => t FilePath -> IO FreqTrain
- tabulate :: FreqTrain -> Freq
- data Freq
- measure :: Freaky a => a -> ByteString -> Double
- prob :: Freaky a => a -> Word8 -> Word8 -> Double
- prettyFreqTrain :: FreqTrain -> IO ()
Frequency table builder (trainer) type
A
is a digram-based frequency table.FreqTrain
One can construct a
with FreqTrain
,
train
, or trainWith
.trainWithMany
One can use a trained
with FreqTrain
and prob
.measure
will add the values of each
of the matching keys.mappend
== <>
It is highly recommended to convert a
to a FreqTrain
with Freq
before using the trained model,
because tabulate
s have O(1) reads as well as significantly
faster constant-time operations, however keep in mind
that Freq
s cannot be neither modified nor converted
back to a Freq
.FreqTrain
Instances
Eq FreqTrain Source # | |
Data FreqTrain Source # | |
Defined in Freq.Internal gfoldl :: (forall d b. Data d => c (d -> b) -> d -> c b) -> (forall g. g -> c g) -> FreqTrain -> c FreqTrain # gunfold :: (forall b r. Data b => c (b -> r) -> c r) -> (forall r. r -> c r) -> Constr -> c FreqTrain # toConstr :: FreqTrain -> Constr # dataTypeOf :: FreqTrain -> DataType # dataCast1 :: Typeable t => (forall d. Data d => c (t d)) -> Maybe (c FreqTrain) # dataCast2 :: Typeable t => (forall d e. (Data d, Data e) => c (t d e)) -> Maybe (c FreqTrain) # gmapT :: (forall b. Data b => b -> b) -> FreqTrain -> FreqTrain # gmapQl :: (r -> r' -> r) -> r -> (forall d. Data d => d -> r') -> FreqTrain -> r # gmapQr :: (r' -> r -> r) -> r -> (forall d. Data d => d -> r') -> FreqTrain -> r # gmapQ :: (forall d. Data d => d -> u) -> FreqTrain -> [u] # gmapQi :: Int -> (forall d. Data d => d -> u) -> FreqTrain -> u # gmapM :: Monad m => (forall d. Data d => d -> m d) -> FreqTrain -> m FreqTrain # gmapMp :: MonadPlus m => (forall d. Data d => d -> m d) -> FreqTrain -> m FreqTrain # gmapMo :: MonadPlus m => (forall d. Data d => d -> m d) -> FreqTrain -> m FreqTrain # | |
Ord FreqTrain Source # | |
Defined in Freq.Internal | |
Read FreqTrain Source # | |
Show FreqTrain Source # | |
Semigroup FreqTrain Source # | |
Monoid FreqTrain Source # | |
NFData FreqTrain Source # | |
Defined in Freq.Internal | |
Freaky FreqTrain Source # | |
Construction
O(1). A Frequency table with a single entry.
Training
train :: ByteString -> FreqTrain Source #
Given a
consisting of training data,
build a Frequency table.ByteString
Using a trained model
A variant of
that holds identical information but
is optimised for reads. There are no operations that imbue
a FreqTrain
with additional information.Freq
Reading from a
is orders of magnitude faster
than reading from a Freq
. It is highly
recommended that you use your trained model by first
converting a FreqTrain
to a FreqTrain
with Freq
.tabulate
measure :: Freaky a => a -> ByteString -> Double Source #
Given a Frequency table and a
, ByteString
returns the probability that the measure
is not
randomised. The accuracy of ByteString
is is heavily affected
by your training data.measure
prob :: Freaky a => a -> Word8 -> Word8 -> Double Source #
Given a Frequency table and characters c1
and c2
,
what is the probability that c1
follows c2
?