dawgdic
Overview
dawgdic is a library for building and accessing
dictionaries implemented with directed acyclic word
graphs (DAWG).
This is a ported version of C++ dawgdic library.
Features
This library offers DAWG, Dictionary, Guide and Completer data types as well as their builders.
DAWG represents word graph. It is provided as intermediate data structure.
 
Dictionary represents compact layout for DAWG. It offers API to check the presence of the word and associated value in the graph. Also it could be stored into a file and loaded from file.
 
Guide is being used as a tree-like index to get the faster navigation through the dictionary. Also could be stored into a file and loaded from file.
 
Input characters must be in range 0-255.
This port does not contain RankedGuide and RankedCompleter (yet).
Comparison with other DAWG libraries
- dictionary file size, in bytes:
 
| package/lexicon size (words) | 
2.6K | 
26K | 
370K | 
| dawg: Data.DAWG.Dynamic | 
370868 | 
2823660 | 
29933964 | 
| dawg: Data.DAWG.Static | 
133138 | 
1030578 | 
11128690 | 
| dawg-ord: Data.DAWG.Int | 
N/A | 
N/A | 
N/A | 
| dawg-ord: Data.DAWG.Ord | 
N/A | 
N/A | 
N/A | 
| packed-dawg: Data.DAWG.Packed | 
15619 | 
128491 | 
1481671 | 
| dawgdic: Data.DAWG.Dictionary | 
3088 | 
141328 | 
1603600 | 
| dawgdic: Guide (w/ Dictionary) | 
4640 | 
212000 | 
2405408 | 
| feature | 
dawg | 
dawg-ord | 
packed-dawg | 
dawgdic | 
| build from list | 
+ | 
+ | 
+ | 
+ | 
| persistence | 
+ | 
- | 
+ | 
+ | 
| insert | 
Dynamic | 
+ | 
- | 
Builder | 
| delete key | 
Dynamic | 
+ | 
- | 
- | 
| follow character | 
Static | 
~edges | 
lookupPrefix | 
+ | 
| lookup value | 
+ | 
+ | 
- | 
+ | 
| member | 
~lookup | 
~lookup | 
+ | 
+ | 
| keys | 
+ | 
+ | 
+ | 
+ | 
| values | 
+ | 
+ | 
- | 
+ | 
| toList (assocs) | 
+ | 
+ | 
- | 
+ | 
| complete word | 
~submap | 
- | 
~toList | 
+ | 
| fuzzy search | 
- | 
- | 
- | 
- | 
| max size | 
N/A | 
N/A | 
2^22 nodes | 
N/A | 
Installation
Add dawgdic dependency to your project and run cabal build.
Building and Querying
Building DAWG from lexicon of words and ignoring insertion failures is as simple as this:
>>> import Data.DAWG.DAWG
>>> dawg <- fromAscList . lines =<< readFile "/path/to/lexicon"
Otherwise, consider mutable builder. It could also be useful if words are associated with values.
buildOrError content = do
  dawgBuilder <- new
  forM_ content \(word, value) -> do
    result <- insert word (Just value) dawgBuilder
    unless result $ error "Insert failed"
  freeze dawgBuilder
Building dictionary and guide:
>>> import qualified Data.DAWG.Dictionary as D
>>> import qualified Data.DAWG.Guide as G
>>> dict <- D.build' dawg
>>> guide <- G.build' dawg dict
Saving dictionary and guide:
>>> D.write "dict.dawg" dict
>>> G.write "guide.dawg" guide
Loading dictionary and guide:
>>> dict <- D.load "dict.dawg"
>>> guide <- G.load "guide.dawg"
Consider using Completer for auto-complete-like queries or if you need to obtain lexicon back from Dictionary and Guide.
Benchmarks
Utilities
Benchmark                             default(μs)
------------------------------------- ------------
Utilities/10/Dawg.fromAscList             5549.76
Utilities/10/Dict.build'                 26981.64
Utilities/10/Dict.contains                  22.46
Utilities/10/Dict.lookup                    22.68
Utilities/10/Dict.follow                    22.42
Utilities/10/Guide.build'                   76.81
Utilities/10/Completer.completeKeys        385.35
------------------------------------- ------------
Utilities/100/Dawg.fromAscList           66486.21
Utilities/100/Dict.build'               194881.38
Utilities/100/Dict.contains                326.16
Utilities/100/Dict.lookup                  323.45
Utilities/100/Dict.follow                  319.86
Utilities/100/Guide.build'                 782.24
Utilities/100/Completer.completeKeys      7016.36
------------------------------------- ------------
Utilities/1000/Dawg.fromAscList         888061.61
Utilities/1000/Dict.build'             1659798.44
Utilities/1000/Dict.contains              3627.54
Utilities/1000/Dict.lookup                3638.10
Utilities/1000/Dict.follow                3564.64
Utilities/1000/Guide.build'               7992.73
Utilities/1000/Completer.completeKeys    82343.20
------------------------------------- ------------
How to reproduce:
cabal install bench-show --overwrite-policy=always
- Run and wait (it might take around 
3m to complete):
 
time cabal bench +RTS "-N4 -A64m -n4m -qb0" -RTS  --benchmark-options="--output bench.html --csv results.csv"
bench-show --presentation=Solo report results.csv
Comparison
Benchmark                      default(μs)
------------------------------ -----------
10/dawgdic.follow                    21.04
10/dawg.follow                       85.41
10/dawg-ord.follow                  110.42
10/packed-dawg.follow               147.30
100/dawgdic.follow                  311.05
100/dawg.follow                    1367.81
100/dawg-ord.follow                1774.78
100/packed-dawg.follow             2098.12
1000/dawgdic.follow                3273.02
1000/dawg.follow                  18842.47
1000/dawg-ord.follow              21992.03
1000/packed-dawg.follow           28938.01
------------------------------ -----------
10/dawgdic.lookup value              21.74
10/dawg.lookup value                 61.89
10/dawg-ord.lookup value            209.93
100/dawgdic.lookup value            290.13
100/dawg.lookup value               847.27
100/dawg-ord.lookup value          2973.35
1000/dawgdic.lookup value          3551.44
1000/dawg.lookup value            14217.29
1000/dawg-ord.lookup value        40111.28
10/dawgdic.member                    21.02
------------------------------ -----------
10/dawg.member                       56.43
10/dawg-ord.member                  191.74
10/packed-dawg.member               173.97
100/dawgdic.member                  343.32
100/dawg.member                     873.70
100/dawg-ord.member                2863.96
100/packed-dawg.member             2200.33
1000/dawgdic.member                9288.22
1000/dawg.member                  13878.72
1000/dawg-ord.member              31428.47
1000/packed-dawg.member           29089.83
------------------------------ -----------
10/dawgdic.keys                     108.33
10/dawg.keys                        100.23
10/dawg-ord.keys                    160.51
10/packed-dawg.keys                  50.50
100/dawgdic.keys                   1392.78
100/dawg.keys                      1413.00
100/dawg-ord.keys                  2324.99
100/packed-dawg.keys                692.23
1000/dawgdic.keys                 15628.83
1000/dawg.keys                    16541.79
1000/dawg-ord.keys                27612.77
1000/packed-dawg.keys              7415.48
------------------------------ -----------
10/dawgdic.values                    52.26
10/dawg.values                       72.29
10/dawg-ord.values                  134.79
100/dawgdic.values                  616.02
100/dawg.values                    1060.87
100/dawg-ord.values                1794.82
1000/dawgdic.values                6457.32
1000/dawg.values                  11734.25
1000/dawg-ord.values              21532.54
------------------------------ -----------
10/dawgdic.toList                   107.26
10/dawg.toList                      116.89
10/dawg-ord.toList                  202.35
100/dawgdic.toList                 1367.87
100/dawg.toList                    1608.26
100/dawg-ord.toList                2898.26
1000/dawgdic.toList               16334.34
1000/dawg.toList                  18578.59
1000/dawg-ord.toList              36360.44
------------------------------ -----------
10/dawgdic.complete word            367.97
10/dawg.complete word               248.14
10/packed-dawg.complete word        303.69
100/dawgdic.complete word          6151.51
100/dawg.complete word             4494.85
100/packed-dawg.complete word      4913.83
1000/dawgdic.complete word        71152.49
1000/dawg.complete word           58222.76
1000/packed-dawg.complete word    60828.53
------------------------------ -----------
How to reproduce:
cabal install bench-show --overwrite-policy=always
- Run and wait (it might take around 
7m to complete):
 
time cabal bench +RTS "-N4 -A64m -n4m -qb0" -RTS  --benchmark-options="--output bench.html --csv results.csv"
bench-show --presentation=Solo report results.csv
Contributing
In cases of issues please attach callstack, provide minimal dictionary lexicon and provide logs with enabled tracing.
cabal build -ftrace
Acknowledgments