phonetic-code-0.1.1.1: Phonetic codes: Soundex and Phonix

Safe Haskell: None
Language: Haskell2010

Text.PhoneticCode.Phonix

Description

Phonix codes (Gadd 1990) augment slightly improved Soundex codes with a preprocessing step for cleaning up certain n-grams. Since the preprocessing step contains around 90 rules processed by a slow custom-written scanner, this implementation is not too fast.

This code was based on a number of sources, including the CPAN Phonix code calculator `Text::Phonetic::Phonix.pm`. Because the paper describing the codes is not freely available and I'm lazy, I did not use it as a reference. Also, because Phonix involves around 90 substitution rules, I transformed the Perl rules rather than generating them from scratch, which was easier.

Documentation

phonix :: String -> String

Compute a "full" phonix code; i.e., do not drop any encodable characters from the result. The leading character of the code will be folded to uppercase. Non-alphabetics are not encoded. If no alphabetics are present, the phonix code will be "0".
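The input handling described above can be sketched as a small wrapper. This is only one reading of the description: it assumes non-alphabetics are discarded before encoding, and `frameCode` and its `encode` parameter are hypothetical names that stand in for the real encoder, not part of this module.

```haskell
import Data.Char (isAlpha, toUpper)

-- Hypothetical wrapper; 'encode' stands in for the actual Phonix encoder.
frameCode :: (String -> String) -> String -> String
frameCode encode input =
  case filter isAlpha input of
    []      -> "0"                          -- no alphabetics at all
    letters -> capitalize (encode letters)  -- encode only the letters
  where
    -- Fold the leading character of the code to uppercase.
    capitalize (c:cs) = toUpper c : cs
    capitalize []     = []
```

With `id` as a stand-in encoder, `frameCode id "1234"` yields `"0"`, and an input with punctuation is reduced to its letters before the leading character is capitalized.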

There appear to be many, many variants of phonix implemented on the web, and I'm too cheap and lazy to go find the original paper by Gadd (1990) that actually describes the original algorithm. Thus, I am taking some big guesses on intent here as I implement. Corrections, especially those involving getting me a copy of the article, are welcome.

Dropping the "trailing sound" seems to be an integral part of Gadd's technique, but I'm not sure how it is supposed to be done. I am currently compressing runs of vowels, and then dropping the trailing digit or vowel from the code.
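A minimal sketch of the two steps just described, under the assumption that the intermediate code is a string mixing digits with uppercase vowels. The names `compressVowelRuns`, `dropTrailing`, and `truncateCode` are illustrative and do not come from this module.

```haskell
import Data.List (groupBy)

isVowel :: Char -> Bool
isVowel c = c `elem` "AEIOU"

-- Collapse each run of adjacent vowels to its first vowel.
compressVowelRuns :: String -> String
compressVowelRuns = concatMap squash . groupBy bothVowels
  where
    bothVowels a b = isVowel a && isVowel b
    squash (c:_) | isVowel c = [c]
    squash g = g

-- Drop the last character (digit or vowel); the empty string stays empty.
dropTrailing :: String -> String
dropTrailing s = take (length s - 1) s

-- Compress vowel runs first, then drop the trailing sound.
truncateCode :: String -> String
truncateCode = dropTrailing . compressVowelRuns
```

Under these assumptions, `truncateCode "S3EA5"` first becomes `"S3E5"` and then `"S3E"`.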

Another area of confusion is whether to compress strings of the same code, as in Soundex, or merely strings of the same consonant. I have chosen the former.
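The difference between the two choices can be illustrated with a toy code table. The letter-to-digit assignments in `toyCode` are placeholders chosen for the example, not the table used by this module, and both function names are hypothetical.

```haskell
import Data.List (group)

-- Toy code table (an assumption for illustration only).
toyCode :: Char -> Char
toyCode c
  | c `elem` "BP" = '1'
  | c `elem` "DT" = '3'
  | c `elem` "MN" = '5'
  | otherwise     = '?'

-- Keep one element from each run of equal elements.
squeeze :: String -> String
squeeze = concatMap (take 1) . group

-- Soundex-style: encode first, then collapse runs of the same code.
compressByCode :: String -> String
compressByCode = squeeze . map toyCode

-- Alternative: collapse runs of the same consonant, then encode.
compressByLetter :: String -> String
compressByLetter = map toyCode . squeeze
```

With this table, `compressByCode "BPD"` gives `"13"` because B and P share a code, while `compressByLetter "BPD"` gives `"113"` because B and P are different letters.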

phonixCodes :: Array Char Char

Array of phonix codes for single characters. The array maps uppercase letters (only) to a character representing a code in the range ['1'..'8'] or '?'.
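An array of this shape could be built as follows. This is a sketch only: the letter groups are illustrative placeholders rather than the contents of `phonixCodes`, the use of '?' as the default for unassigned letters is an assumption, and `toyCodes` and `codeOf` are hypothetical names.

```haskell
import Data.Array (Array, accumArray, bounds, inRange, (!))

-- Toy table over 'A'..'Z'; letters not listed default to '?'.
-- The groups below are placeholders, not the library's actual codes.
toyCodes :: Array Char Char
toyCodes =
  accumArray (\_ new -> new) '?' ('A', 'Z')
    [ (c, d) | (cs, d) <- groups, c <- cs ]
  where
    groups = [("BP", '1'), ("DT", '3'), ("MN", '5')]

-- Safe lookup: indexing outside 'A'..'Z' (e.g. lowercase) would
-- otherwise raise an error, since the array covers uppercase only.
codeOf :: Char -> Maybe Char
codeOf c
  | inRange (bounds toyCodes) c = Just (toyCodes ! c)
  | otherwise                   = Nothing
```

Because the array is indexed by uppercase letters only, callers would need to uppercase their input or guard the lookup as `codeOf` does.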

phonixRules :: [(String, String)]

Substitution rules for Phonix canonicalization. "^" ("$") anchors a pattern to the beginning (end) of the word. "c" ("v", ".") at the beginning or end of a pattern matches a consonant (vowel, arbitrary character). A character matched in this fashion is automatically tacked onto the beginning (end) of the substitution, so it is preserved in the result.
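One way to read this notation is to expand each class marker into concrete pattern/substitution pairs, carrying the matched character over into the substitution. The sketch below is a guess at that scheme, not the module's implementation: it assumes literal pattern characters are uppercase (so lowercase "c" and "v" are unambiguous), that vowels are AEIOU, and `expandRule` and the sample rule are made up for illustration.

```haskell
vowels, letters, consonants :: String
vowels     = "AEIOU"
letters    = ['A' .. 'Z']
consonants = filter (`notElem` vowels) letters

-- Characters a class marker stands for, if it is one.
classOf :: Char -> Maybe String
classOf 'v' = Just vowels
classOf 'c' = Just consonants
classOf '.' = Just letters
classOf _   = Nothing

-- Expand a rule into concrete (pattern, substitution) pairs.
-- Anchors are kept in the pattern; a character matched by a class
-- marker is copied onto the same side of the substitution.
expandRule :: (String, String) -> [(String, String)]
expandRule (pat, sub) =
  [ (pre ++ l ++ core ++ r ++ post, l ++ sub ++ r)
  | l <- lefts, r <- rights ]
  where
    (pre, rest1) = case pat of
      ('^':xs) -> ("^", xs)
      xs       -> ("", xs)
    (rest2, post) = case reverse rest1 of
      ('$':xs) -> (reverse xs, "$")
      _        -> (rest1, "")
    (lefts, rest3) = case rest2 of
      (m:xs) | Just cls <- classOf m -> (map (: []) cls, xs)
      _                              -> ([""], rest2)
    (core, rights) = case reverse rest3 of
      (m:xs) | Just cls <- classOf m -> (reverse xs, map (: []) cls)
      _                              -> (rest3, [""])
```

For the made-up rule `("vCH", "K")`, this produces five pairs, one per vowel, such as `("ACH", "AK")`; a rule without class markers, such as `("^KN", "N")`, passes through unchanged.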

phonixRulesPatSubsts :: [(String, String)]

List of pattern/substitution pairs built from the phonixRules.

applyPhonixRules :: String -> String

Apply each of the Phonix preprocessing rules in turn to the target word, returning the result of the accumulated substitutions.
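Applying rules in turn amounts to a left fold over the rule list, with each rule rewriting the output of the previous one. The sketch below shows that shape with plain literal replacement only; it ignores anchors and class markers, and `replaceAll`, `applyRules`, and the sample rules are hypothetical rather than taken from this module.

```haskell
import Data.List (foldl', isPrefixOf)

-- Replace every non-overlapping occurrence of 'pat' with 'sub',
-- scanning left to right. Substituted text is not rescanned.
replaceAll :: String -> String -> String -> String
replaceAll pat sub = go
  where
    go [] = []
    go s@(c:cs)
      | not (null pat) && pat `isPrefixOf` s =
          sub ++ go (drop (length pat) s)
      | otherwise = c : go cs

-- Apply each (pattern, substitution) rule in order, feeding the
-- result of one rule into the next.
applyRules :: [(String, String)] -> String -> String
applyRules rules word =
  foldl' (\w (pat, sub) -> replaceAll pat sub w) word rules
```

Rule order matters in this scheme: with the made-up rules `[("PH", "F"), ("FF", "F")]`, the word `"PHFOO"` becomes `"FFOO"` after the first rule and `"FOO"` after the second.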