-- Hoogle documentation, generated by Haddock -- See Hoogle, http://www.haskell.org/hoogle/ -- | Phonetic codes: Soundex and Phonix -- -- This package implements the phonetic coding algorithms Soundex -- and Phonix. A phonetic coding algorithm transforms a word into a -- similarity hash based on an approximation of its sounds. Thus, -- similar-sounding words tend to have the same hash. @package phonetic-code @version 0.1 -- | Phonix codes (Gadd 1990) augment slightly improved Soundex codes with -- a preprocessing step for cleaning up certain n-grams. Since the -- preprocessing step contains more than 150 rules processed by a slow -- custom-written scanner, this implementation is not too fast. -- -- This code was based on a number of sources, including the CPAN Phonix -- code calculator Text::Phonetic::Phonix.pm . Because the paper -- describing the codes is not freely available and I'm lazy, I did not -- use it as a reference. Also because Phonix involves over 150 -- substitution rules, I transformed the Perl ones, which was easier than -- generating them from scratch. module Text.PhoneticCode.Phonix -- | Compute a full phonix code; i.e., do not drop any encodable -- characters from the result. The leading character of the code will be -- folded to uppercase. Non-alphabetics are not encoded. If no -- alphabetics are present, the phonix code will be 0. -- -- There appear to be many, many variants of phonix implemented on the -- web, and I'm too cheap and lazy to go find the original paper by Gadd -- (1990) that actually describes the original algorithm. Thus, I am -- taking some big guesses on intent here as I implement. Corrections, -- especially those involving getting me a copy of the article, are -- welcome. -- -- Dropping the trailing sound seems to be an integral part of -- Gadd's technique, but I'm not sure how it is supposed to be done. I am -- currently compressing runs of vowels, and then dropping the trailing -- digit or vowel from the code. -- -- Another area of confusion is whether to compress strings of the same -- code, as in Soundex, or merely strings of the same consonant. I have -- chosen the former. phonix :: String -> String -- | Array of phonix codes for single characters. The array maps uppercase -- letters (only) to a character representing a code in the range -- ['1'..'8'] or ?. phonixCodes :: Array Char Char -- | Substitution rules for Phonix canonicalization. ^ ($) is -- used to anchor a pattern to the beginning (end) of the word. c -- (v, .) at the beginning or end of a pattern match a -- consonant (vowel, arbitrary character). A character matched in this -- fashion is automatically tacked onto the beginning (end) of the -- pattern. phonixRules :: [(String, String)] -- | Soundex is a phonetic coding algorithm. It transforms word into a -- similarity hash based on an approximation of its sounds. Thus, -- similar-sounding words tend to have the same hash. -- -- This implementation is based on a number of sources, including a -- description of soundex at http:wikipedia.orgwikiSoundex -- and in Knuth's The Art of Computer Programming 2nd ed v1 -- pp394-395. A very helpful reference on the details and differences -- among soundex algorithms is Soundex: The True Story, -- http:west-penwith.org.ukmiscsoundex.htm accessed 11 -- September 2008. -- -- This code was originally written for the thimk spelling -- suggestion application in Nickle (http:nickle.org) in July 2002 -- based on a description from -- http:www.geocities.comHeartlandHills3916soundex.html -- which is now http:www.searchforancestors.com/soundex.html The -- code was ported September 2008; the Soundex variants were also added -- at this time. module Text.PhoneticCode.Soundex -- | Compute a full soundex code; i.e., do not drop any encodable -- characters from the result. The leading character of the code will be -- folded to uppercase. Non-alphabetics are not encoded. If no -- alphabetics are present, the soundex code will be 0. -- -- The two commonly encountered forms of soundex are Simplified and -- another known as American, Miracode, NARA or Knuth. This code will -- calculate either---passing True gets NARA, and False gets Simplified. soundex :: Bool -> String -> String soundexSimple :: String -> String soundexNARA :: String -> String -- | Array of soundex codes for single characters. The array maps uppercase -- letters (only) to a character representing a code in the range -- ['1'..'7'] or ?. Code '7' is returned as a coding convenience -- for AmericanMiracodeNARA/Knuth soundex. soundexCodes :: Array Char Char