-- Hoogle documentation, generated by Haddock -- See Hoogle, http://www.haskell.org/hoogle/ -- | Phonetic codes: Soundex and Phonix -- @package phonetic-code @version 0.1.1.1 -- | Phonix codes (Gadd 1990) augment slightly improved Soundex codes with -- a preprocessing step for cleaning up certain n-grams. Since the -- preprocessing step contains around 90 rules processed by a slow -- custom-written scanner, this implementation is not too fast. -- -- This code was based on a number of sources, including the CPAN Phonix -- code calculator `Text::Phonetic::Phonix.pm`. Because the paper -- describing the codes is not freely available and I'm lazy, I did not -- use it as a reference. Also because Phonix involves around 90 -- substitution rules, I transformed the Perl ones, which was easier than -- generating them from scratch. module Text.PhoneticCode.Phonix -- | Compute a "full" phonix code; i.e., do not drop any encodable -- characters from the result. The leading character of the code will be -- folded to uppercase. Non-alphabetics are not encoded. If no -- alphabetics are present, the phonix code will be "0". -- -- There appear to be many, many variants of phonix implemented on the -- web, and I'm too cheap and lazy to go find the original paper by Gadd -- (1990) that actually describes the original algorithm. Thus, I am -- taking some big guesses on intent here as I implement. Corrections, -- especially those involving getting me a copy of the article, are -- welcome. -- -- Dropping the "trailing sound" seems to be an integral part of Gadd's -- technique, but I'm not sure how it is supposed to be done. I am -- currently compressing runs of vowels, and then dropping the trailing -- digit or vowel from the code. -- -- Another area of confusion is whether to compress strings of the same -- code, as in Soundex, or merely strings of the same consonant. I have -- chosen the former. phonix :: String -> String -- | Array of phonix codes for single characters. The array maps uppercase -- letters (only) to a character representing a code in the range -- ['1'..'8'] or ?. phonixCodes :: Array Char Char -- | Substitution rules for Phonix canonicalization. "^" ("$") is used to -- anchor a pattern to the beginning (end) of the word. "c" ("v", ".") at -- the beginning or end of a pattern match a consonant (vowel, arbitrary -- character). A character matched in this fashion is automatically -- tacked onto the beginning (end) of the pattern. phonixRules :: [(String, String)] -- | List of pattern/substitution pairs built from the phonixRules. phonixRulesPatSubsts :: [(String, String)] -- | Apply each of the Phonix preprocessing rules in turn to the target -- word returning the resulting accumulated substitution. applyPhonixRules :: String -> String -- | Soundex is a phonetic coding algorithm. It transforms word into a -- similarity hash based on an approximation of its sounds. Thus, -- similar-sounding words tend to have the same hash. -- -- This implementation is based on a number of sources, including a -- description of soundex at http://wikipedia.org/wiki/Soundex and -- in Knuth's "The Art of Computer Programming" 2nd ed v1 pp394-395. A -- very helpful reference on the details and differences among soundex -- algorithms is "Soundex: The True Story", -- http://west-penwith.org.uk/misc/soundex.htm accessed 11 -- September 2008. -- -- This code was originally written for the "thimk" spelling suggestion -- application in Nickle (http://nickle.org) in July 2002 based on a -- description from -- http://www.geocities.com/Heartland/Hills/3916/soundex.html -- which is now http://www.searchforancestors.com/soundex.html The -- code was ported September 2008; the Soundex variants were also added -- at this time. module Text.PhoneticCode.Soundex -- | Compute a "full" soundex code; i.e., do not drop any encodable -- characters from the result. The leading character of the code will be -- folded to uppercase. Non-alphabetics are not encoded. If no -- alphabetics are present, the soundex code will be "0". -- -- The two commonly encountered forms of soundex are Simplified and -- another known as American, Miracode, NARA or Knuth. This code will -- calculate either---passing True gets NARA, and False gets Simplified. soundex :: Bool -> String -> String soundexSimple :: String -> String soundexNARA :: String -> String -- | Array of soundex codes for single characters. The array maps uppercase -- letters (only) to a character representing a code in the range -- ['1'..'7'] or ?. Code '7' is returned as a coding convenience -- for AmericanMiracodeNARA/Knuth soundex. soundexCodes :: Array Char Char