-- |
-- Module : Data.ByteString.Search.BoyerMoore
-- Copyright : Daniel Fischer
-- Chris Kuklewicz
-- Licence : BSD3
-- Maintainer : Daniel Fischer
-- Stability : Provisional
-- Portability : non-portable (BangPatterns)
--
-- Fast overlapping Boyer-Moore search of both strict and lazy
-- 'ByteString' values.
--
-- Descriptions of the algorithm can be found at
--
-- and
--
--
-- Original authors: Daniel Fischer (daniel.is.fischer at googlemail.com) and
-- Chris Kuklewicz (haskell at list.mightyreason.com).
module Data.ByteString.Search.BoyerMoore
{-# DEPRECATED "Use the new interface instead" #-} (
-- * Overview
-- $overview
-- ** Changes
-- $changes
-- ** Deprecation
-- $deprecation
-- ** Parameter and return types
-- $types
-- ** Lazy ByteStrings
-- $lazy
-- ** Performance
-- $performance
-- ** Complexity
-- $complexity
-- ** Partial application
-- $currying
-- ** Integer overflow
-- $overflow
-- * Functions
matchLL
, matchLS
, matchSL
, matchSS
) where
import Data.ByteString.Search.Internal.BoyerMoore
(matchLS, matchSS)
import Data.ByteString.Lazy.Search.Internal.BoyerMoore
(matchLL, matchSL)
-- $overview
--
-- This module exists only for backwards compatibility. Nevertheless
-- there have been small changes in the behaviour of the functions.
-- The module exports four search functions: 'matchLL', 'matchLS',
-- 'matchSL', and 'matchSS'. All of them return the list of all
-- starting positions of possibly overlapping occurrences of a pattern
-- in a string.
-- $changes
--
-- Formerly, all four functions returned an empty list when passed
-- an empty pattern. Now, in accordance with the functions from the other
-- modules, @matchXY \"\" target = [0 .. 'length' target]@.
-- $deprecation
--
-- This module is /deprecated/. You should use the new interface provided
-- in "Data.ByteString.Search" resp. "Data.ByteString.Lazy.Search".
-- $types
--
-- The first parameter is always the pattern string. The second
-- parameter is always the target string to be searched. The returned
-- list contains the offsets of all /overlapping/ patterns.
--
-- A returned @Int@ or @Int64@ is an index into the target string
-- which is aligned to the head of the pattern string. Strict targets
-- return @Int@ indices and lazy targets return @Int64@ indices. All
-- returned lists are computed and returned in a lazy fashion.
-- $lazy
--
-- 'matchLL' and 'matchLS' take lazy bytestrings as patterns. For
-- performance, if the pattern is not a single strict chunk then all
-- the the pattern chunks will copied into a concatenated strict
-- bytestring. This limits the patterns to a length of (maxBound ::
-- Int).
--
-- 'matchLL' and 'matchSL' take lazy bytestrings as targets.
-- These are written so that while they work they will not retain a
-- reference to all the earlier parts of the the lazy bytestring.
-- This means the garbage collector would be able to keep only a small
-- amount of the target string and free the rest.
-- $currying
--
-- These functions can all be usefully partially applied.
-- Given only a pattern the partially applied version will compute
-- the supporting lookup tables only once, allowing for efficient re-use.
-- Similarly, the partially applied 'matchLL' and 'matchLS' will compute
-- the concatenated pattern only once.
-- $performance
--
-- In general, the Boyer-Moore algorithm is the most efficient method to
-- search for a pattern inside a string. The advantage over other algorithms
-- (e.g. Naïve, Knuth-Morris-Pratt, Horspool, Sunday) can be made
-- arbitrarily large for specially selected patterns and targets, but
-- usually, it's a factor of 2–3 versus Knuth-Morris-Pratt and of
-- 6–10 versus the naïve algorithm. The Horspool and Sunday
-- algorithms, which are simplified variants of the Boyer-Moore algorithm,
-- typically have performance between Boyer-Moore and Knuth-Morris-Pratt,
-- mostly closer to Boyer-Moore. The advantage of the Boyer-moore variants
-- over other algorithms generally becomes larger for longer patterns. For
-- very short patterns (or patterns with a very short period), other
-- algorithms, e.g. "Data.ByteString.Search.DFA" can be faster (my
-- tests suggest that \"very short\" means two, maybe three bytes).
--
-- In general, searching in a strict 'S.ByteString' is slightly faster
-- than searching in a lazy 'L.ByteString', but for long targets the
-- smaller memory footprint of lazy 'L.ByteStrings' can make searching
-- those (sometimes much) faster. On the other hand, there are cases
-- where searching in a strict target is much faster, even for long targets.
--
-- On 32-bit systems, 'Int'-arithmetic is much faster than 'Int64'-arithmetic,
-- so when there are many matches, that can make a significant difference.
--
-- Also, the modification to ameliorate the case of periodic patterns
-- is defeated by chunk-boundaries, so long patterns with a short period
-- and many matches exhibit poor behaviour (consider using @indices@ from
-- "Data.ByteString.Lazy.Search.DFA" or "Data.ByteString.Lazy.Search.KMP"
-- in those cases, the former for medium-length patterns, the latter for
-- long patterns; only 'matchLL' and 'matchSL' suffer from
-- this problem, though).
-- $complexity
--
-- Preprocessing the pattern string is O(@patternLength@). The search
-- performance is O(@targetLength@\/@patternLength@) in the best case,
-- allowing it to go faster than a Knuth-Morris-Pratt algorithm. With
-- a non-periodic pattern the worst case uses O(3\*@targetLength@)
-- comparisons. The periodic pattern worst case is quadratic
-- O(@targetLength@\*@patternLength@) complexity for the original
-- Boyer-Moore algorithm.
--
-- The searching functions in this module contain a modification which
-- drastically improves the performance for periodic patterns.
-- I believe that for strict target strings, the worst case is now
-- /O/(@targetLength@) also for periodic patterns and for lazy target
-- strings, my semi-educated guess is
-- /O/(@targetLength@ * (1 + @patternLength@ \/ @chunkSize@)).
-- $overflow
--
-- The current code uses @Int@ to keep track of the locations in the
-- target string. If the length of the pattern plus the length of any
-- strict chunk of the target string is greater or equal to
-- @'maxBound'::Int@ then this will overflow causing an error. We try
-- to detect this and call 'error' before a segfault occurs.