%-------------------------------= -------------------------------------------- \chapter{Sorting by merging} %-------------------------------= -------------------------------------------- %align > module Swish.HaskellRDF.Sort.MergeSort > where > import Swish.HaskellRDF.Sort.LibBase > import Swish.HaskellRDF.Sort.ListLib > infixr {-"\,"-} `merge`, \+/ %align 33 > sort :: (Ord a) => [a] -> [a] > sort = bottomUpMergeSort %-------------------------------= -------------------------------------------- \section{Top-down merge sort} %-------------------------------= -------------------------------------------- The archetypical functional sorting algorithm is without any doubt merge sort. It follows the divide and conquer scheme: the input list is split into two halves, both are sorted recursively and the results are finally merged together. > merge :: (Ord a) => [a] -> [a] -> [a] > merge = mergeBy (<=) > > (\+/) :: (Ord a) => [a] -> [a] -> [a] > (\+/) = mergeBy (<=) > > mergeBy :: Rel a -> [a] -> [a] -> [a] > mergeBy (<=) = merge > where > merge [] bs = bs > merge as@(_ : _) [] = as > merge as@(a : as') bs@(b : bs') > | a <= b = a : merge as' bs > | otherwise = b : merge as bs' > > mergeSort :: (Ord a) => [a] -> [a] > mergeSort = mergeSortBy (<=) > > mergeSortBy :: Rel a -> [a] -> [a] > mergeSortBy (<=) as > | simple as = as > | otherwise = mergeBy (<=) (mergeSortBy (<=) as1) > (mergeSortBy (<=) as2) > where (as1, as2) = halve as Since the divide phase takes $\Theta(n\log n)$ time, |mergeSort| is not lazy: |head . mergeSort| has a running time of $\Theta(n\log n)$. %-------------------------------= -------------------------------------------- \section{Bottom-up merge sort} %-------------------------------= -------------------------------------------- The function |bottomUpMergeSort| improves the divide phase to $\Theta(n)$ and is consequently asymptotically optimal, stable, and lazy, but alas not in any way adaptive. > bottomUpMergeSort :: (Ord a) => [a] -> [a] > bottomUpMergeSort = bottomUpMergeSortBy (<=) > > bottomUpMergeSortBy :: Rel a -> [a] -> [a] > bottomUpMergeSortBy (<=) = gfoldm [] (\a -> [a]) (mergeBy (<=)) %-------------------------------= -------------------------------------------- \section{Straight merge sort} %-------------------------------= -------------------------------------------- Both |mergeSort| and |bottomUpMergeSort| take $\Theta(n\log n)$ irrespective of the presortedness of the input. If we replace the test |simple| by |ordered| we obtain an adaptive variant which is optimal with respect to the measure |Runs| and adaptive wrt |Inv| and |Rem| \cite[p.~449]{ECW92Sur}. > straightMergeSort :: (Ord a) => [a] -> [a] > straightMergeSort = straightMergeSortBy (<=) > straightMergeSortBy :: Rel a -> [a] -> [a] > straightMergeSortBy (<=) as > | orderedBy (<=) as = as > | otherwise = mergeBy (<=) (straightMergeSortBy (<=) as1) > (straightMergeSortBy (<=) as2) > where (as1, as2) = halve as \Todo{To adapt to ascending as well as descending sequences we could reverse the sublists in every step and apply Augustson's stable/anti-stable trick.} %-------------------------------= -------------------------------------------- \section{Odd-even merge sort} %-------------------------------= -------------------------------------------- If we use a different partioning scheme, |uninterleave| instead of |halve|, we obtain an adaptive variant which is optimal wrt |Dis = Max| \cite[p.~450]{ECW92Sur}. > oddEvenMergeSort :: (Ord a) => [a] -> [a] > oddEvenMergeSort = oddEvenMergeSortBy (<=) > > oddEvenMergeSortBy :: Rel a -> [a] -> [a] > oddEvenMergeSortBy (<=) as > | orderedBy (<=) as = as > | otherwise = mergeBy (<=) (oddEvenMergeSortBy (<=) as1) > (oddEvenMergeSortBy (<=) as2) > where (as1, as2) = uninterleave as Unfortunately, |oddEvenMergeSort| is no longer stable. Consider, for instance, |uninterleave [a1, a2, a3] = ([a1, a3], [a2])| and assume that the three elements are equal. However, |oddEvenMergeSort| can be improved so that the divide phase takes only linear time. This is left as an instructive exercise to the reader. %-------------------------------= -------------------------------------------- \section{Split sort} %-------------------------------= -------------------------------------------- A partioning scheme which adapts to |Rem| was given by Levcopoulos and Petersson \cite[p.~451]{ECW92Sur} and is based on a method by Cook and Kim for removing $\Theta(|Rem(as)|)$ elements from a list |as|, such that an ordered sequence is left over. The function |lpDivision as| divides its input into three lists |g|, |s|, and |l| such that |s| is sorted and |g| and |l| have the same length which is at most |Rem(as)|. The tricky thing is to ensure that the splitting is performed in a stable way, ie the order in which the elements in |g| and |l| appear is the same in which they appear in |as|. Consider the sequence |1 2 5 1 4 3 1 9 2 8 9 1|: % \[ \begin{array}{l||l||r} \text{|g| and |s|} & |l| & |as| \\\hline |1 2 5| & & |1 4 3 1 9 2 8 9 1| \\ |1 2 [5]| & |1| & |4 3 1 9 2 8 9 1| \\ |1 2 [5] 4| & |1| & |3 1 9 2 8 9 1| \\ |1 2 [5 4]| & |3 1| & |1 9 2 8 9 1| \\ |1 [2 5 4]| & |1 3 1| & |9 2 8 9 1| \\ |1 [2 5 4] 9| & |1 3 1| & |2 8 9 1| \\ |1 [2 5 4 9]| & |2 1 3 1| & |8 9 1| \\ |1 [2 5 4 9] 8| & |2 1 3 1| & |9 1| \\ |1 [2 5 4 9] 8 9| & |2 1 3 1| & |1| \\ |1 [2 5 4 9] 8 [9]| & |1 2 1 3 1| & \end{array} \] The data type |Region| is designed to represent |g| and |s|. Elements in |g| are grouped to allow for efficient access to the last element in |s|. For instance, |1 [2 5 4 9] 8 [9]| is essentially represented by |G (S (G (S Nil 1) [2, 5, 4, 9]) 8) [9]|. > type Sequ a = [a] -> [a] > > data Region a = Nil > | S (Region a) a > | G (Region a) (Sequ a) > > single :: a -> Sequ a > single a = \x -> a : x > > g :: Region a -> Sequ a -> Region a > g Nil gs = G Nil gs > g (S s a) gs = G (S s a) gs > g (G s gs1) gs2 = G s (gs1 . gs2) > > lpDivisionBy :: Rel a -> [a] -> ([a], [a], [a]) > lpDivisionBy (<=) [] = ([], [], []) > lpDivisionBy (<=) (a : as) = lp (S Nil a) [] as > where > lp s l [] = (g [], reverse s', reverse l) > where (g, s') = lpPart s > lp Nil l (a : as) = lp (S Nil a) l as > lp s@(G Nil gs) l (a : as)= lp (S s a) l as > lp s@(G (S s' m) gs) l (a : as) > | m <= a = lp (S s a) l as > | otherwise = lp (g s' (single m . gs)) (a : l) as > lp (G (G _ _) _) _ (_ : _)= error "lp" > lp s@(S s' m) l (a : as) > | m <= a = lp (S s a) l as > | otherwise = lp (g s' (single m)) (a : l) as > > lpPart :: Region a -> ([a] -> [a], [a]) > lpPart Nil = (id, []) > lpPart (S s a) = (g, a : s') > where (g, s') = lpPart s > lpPart (G s gs) = (gs . g, s') > where (g, s') = lpPart s Unfortunately, the relative order between equal elements in |g ++ s ++ l| is not the same as in |as|. Hence, |lpMergeSort| is not stable either. > lpMergeSort :: (Ord a) => [a] -> [a] > lpMergeSort = lpMergeSortBy (<=) > > lpMergeSortBy :: Rel a -> [a] -> [a] > lpMergeSortBy (<=) as > | simple as = as > | otherwise = mergeBy (<=) (lpMergeSortBy (<=) as1) > (mergeBy (<=) s (lpMergeSortBy (<=) as2)) > where (as1, s, as2) = lpDivisionBy (<=) as \Todo{To adapt to ascending as well as descending sequences we could reverse the sublists in every step. \NB stability is already lost.} %-------------------------------= -------------------------------------------- \section{Adaptive merge sort} %-------------------------------= -------------------------------------------- %format sort1 = sort "_1" %format sort2 = sort "_2" %format sort3 = sort "_3" If we combine |halve|, |uninterleave|, and |lpDivision| we obtain a sorting algorithm which is adaptive wrt |Exc|, |Dis|, |Inv|, |Rem|, and |Runs| \cite[p.~451]{ECW92Sur}. > adaptiveMergeSort :: (Ord a) => [a] -> [a] > adaptiveMergeSort = adaptiveMergeSortBy (<=) > > adaptiveMergeSortBy :: Rel a -> [a] -> [a] > adaptiveMergeSortBy (<=) = sort1 > where > (\+/) = mergeBy (<=) > > sort1 as > | simple as = as > | otherwise = sort2 as1 \+/ s \+/ sort2 as2 > where (as1, s, as2) = lpDivisionBy (<=) as > > sort2 as = sort3 as1 \+/ sort3 as2 > where (as1, as2) = uninterleave as > > sort3 as = sort1 as1 \+/ sort1 as2 > where (as1, as2) = halve as \Todo{How about reversing the lists in the first step in order to adapt to descending sequences as well? cf.~\cite[p.~52]{ECW91Pra}} %-------------------------------= -------------------------------------------- \section{Natural merge sort} %-------------------------------= -------------------------------------------- The function |straightMergeSort| somehow guesses the number of runs. It is more efficient to group the input into runs beforehand. > naturalMergeSort :: (Ord a) => [a] -> [a] > naturalMergeSort = naturalMergeSortBy (<=) > > naturalMergeSortBy :: Rel a -> [a] -> [a] > naturalMergeSortBy (<=) = foldm (mergeBy (<=)) [] . runsBy (<=) > > runsBy :: Rel a -> [a] -> [[a]] > runsBy (<=) [] = [[]] > runsBy (<=) (a : as) = upRun a [] as > where > upRun m r [] = [reverse (m : r)] > upRun m r (a : as) > | m <= a = upRun a (m : r) as > | otherwise = reverse (m : r) : upRun a [] as Natural merge sort was first studied in the context of external sorting. The hbc library contains a similar function. The function |runs| recognized only ascending runs. With little additional effort we can also detect descending runs. > symmetricNaturalMergeSort :: (Ord a) => [a] -> [a] > symmetricNaturalMergeSort = symmetricNaturalMergeSortBy (<=) > > symmetricNaturalMergeSortBy :: Rel a -> [a] -> [a] > symmetricNaturalMergeSortBy (<=) > = foldm (mergeBy (<=)) [] . upDownRunsBy (<=) > > upDownRunsBy :: Rel a -> [a] -> [[a]] > upDownRunsBy (<=) [] = [] > upDownRunsBy (<=) (a : as) = upDownRun a as > where > upDownRun a [] = [[a]] > upDownRun a1 (a2 : as) > | a1 <= a2 = upRun a2 [a1] as > | otherwise = downRun a2 [a1] as > upRun m r [] = [reverse (m : r)] > upRun m r (a : as) > | m <= a = upRun a (m : r) as > | otherwise = reverse (m : r) : upDownRun a as > > downRun m r [] = [m : r] > downRun m r (a : as) > | m <= a = (m : r) : upDownRun a as > | otherwise = downRun a (m : r) as \NB To preserve stability |downRun| uses only \emph{strictly} decreasing sequences: |[n, n, n-1, n-1, .. 1, 1]| is split into |n| runs. Does anybody know of a better solution?