phonetic-code-0.1: Phonetic codes: Soundex and Phonix



Phonix codes (Gadd 1990) augment slightly improved Soundex codes with a preprocessing step for cleaning up certain n-grams. Since the preprocessing step contains more than 150 rules processed by a slow custom-written scanner, this implementation is fairly slow.

This code was based on a number of sources, including the CPAN Phonix code calculator. Because the paper describing the codes is not freely available and I'm lazy, I did not use it as a reference. Also, because Phonix involves over 150 substitution rules, I transformed the Perl ones, which was easier than generating them from scratch.



phonix :: String -> String

Compute a full phonix code; i.e., do not drop any encodable characters from the result. The leading character of the code will be folded to uppercase. Non-alphabetics are not encoded. If no alphabetics are present, the phonix code will be 0.
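The input handling described above can be sketched as a small wrapper. This is only an illustration, not the library's code: withCleanInput is a hypothetical name, it folds every letter to uppercase rather than just the leading one, and it assumes "alphabetic" means ASCII letters, which the documentation does not state.

```haskell
import Data.Char (isAlpha, isAscii, toUpper)

-- | Hypothetical helper: drop everything that is not an ASCII letter,
-- fold the rest to uppercase, and pass the letters to an encoder.
-- If no letters remain, the result is "0", as described above.
withCleanInput :: (String -> String) -> String -> String
withCleanInput encode input =
  case map toUpper (filter isAsciiLetter input) of
    []      -> "0"
    letters -> encode letters
  where
    -- Assumption: only ASCII letters count as alphabetic.
    isAsciiLetter c = isAscii c && isAlpha c
```

With id standing in for the encoder, withCleanInput id "o'neil 42" gives "ONEIL" and withCleanInput id "1234" gives "0".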

There appear to be many, many variants of phonix implemented on the web, and I'm too cheap and lazy to go find the original paper by Gadd (1990) that actually describes the original algorithm. Thus, I am taking some big guesses on intent here as I implement. Corrections, especially those involving getting me a copy of the article, are welcome.

Dropping the trailing sound seems to be an integral part of Gadd's technique, but I'm not sure how it is supposed to be done. I am currently compressing runs of vowels, and then dropping the trailing digit or vowel from the code.

Another area of confusion is whether to compress strings of the same code, as in Soundex, or merely strings of the same consonant. I have chosen the former.
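The difference between the two compression choices can be seen in a minimal sketch. The names toyCode, squeezeByCode and squeezeByLetter are hypothetical, and the two-group table is a toy stand-in chosen for illustration, not the library's table.

```haskell
import Data.List (group)

-- | Toy code table (an assumption for this sketch): B and P share
-- code '1', D and T share code '3', everything else is '?'.
toyCode :: Char -> Char
toyCode c
  | c `elem` "BP" = '1'
  | c `elem` "DT" = '3'
  | otherwise     = '?'

-- | Soundex-style choice: encode first, then collapse runs of equal codes.
squeezeByCode :: String -> String
squeezeByCode letters = [d | (d : _) <- group (map toyCode letters)]

-- | Alternative choice: collapse runs of the same letter, then encode.
squeezeByLetter :: String -> String
squeezeByLetter letters = [toyCode c | (c : _) <- group letters]
```

For the input "BPDT", squeezeByCode yields "13" while squeezeByLetter yields "1133": different consonants with the same code are merged only under the first choice.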

phonixCodes :: Array Char Char

Array of phonix codes for single characters. The array maps uppercase letters (only) to a character representing a code in the range ['1'..'8'] or '?'.
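An array of this shape can be built with accumArray from Data.Array. The sketch below is an assumption-laden illustration: exampleCodes and lookupCode are hypothetical names, and the letter groups are the grouping commonly cited for Phonix, not values checked against this package.

```haskell
import Data.Array (Array, accumArray, (!))

-- | Hypothetical stand-in for a code table indexed by 'A'..'Z'.
-- Letters not listed in any group keep the default '?'.
exampleCodes :: Array Char Char
exampleCodes =
  accumArray (\_ new -> new) '?' ('A', 'Z')
    [ (letter, digit) | (digit, letters) <- groups, letter <- letters ]
  where
    -- Assumed grouping; the real table may differ.
    groups =
      [ ('1', "BP"), ('2', "CGJKQ"), ('3', "DT"), ('4', "L")
      , ('5', "MN"), ('6', "R"), ('7', "FV"), ('8', "SXZ") ]

-- | Bounds-checked lookup: the array covers uppercase letters only,
-- so anything else yields Nothing instead of an index error.
lookupCode :: Char -> Maybe Char
lookupCode c
  | c >= 'A' && c <= 'Z' = Just (exampleCodes ! c)
  | otherwise            = Nothing
```

Under these assumed groups, lookupCode 'B' is Just '1', lookupCode 'A' is Just '?', and lookupCode 'b' is Nothing.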

phonixRules :: [(String, String)]

Substitution rules for Phonix canonicalization. ^ anchors a pattern to the beginning of the word, and $ anchors it to the end. At the beginning or end of a pattern, c matches a consonant, v matches a vowel, and . matches an arbitrary character. A character matched in this fashion is automatically tacked onto the beginning (end) of the pattern.
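One way to read this rule format is sketched below. It is not the library's scanner. All names are hypothetical, the example rules in the usage note are made up, words are assumed to be uppercase already, only AEIOU are treated as vowels, and a character matched by c, v or . is assumed to be carried into the output next to the replacement.

```haskell
import Data.Char (isAlpha)
import Data.List (stripPrefix)

data Class = Consonant | Vowel | AnyChar deriving (Eq, Show)

-- | A rule pattern split into anchors, class slots and literal text.
data Pattern = Pattern
  { atStart :: Bool, leading :: Maybe Class, literal :: String
  , trailing :: Maybe Class, atEnd :: Bool } deriving (Eq, Show)

classOf :: Char -> Maybe Class
classOf 'c' = Just Consonant
classOf 'v' = Just Vowel
classOf '.' = Just AnyChar
classOf _   = Nothing

-- | Peel off ^ and $, then an optional class character at each end.
parsePattern :: String -> Pattern
parsePattern s0 = Pattern start lead lit trail end
  where
    (start, s1) = case s0 of
      '^' : r -> (True, r)
      _       -> (False, s0)
    (end, s2) = case reverse s1 of
      '$' : r -> (True, reverse r)
      _       -> (False, s1)
    (lead, s3) = case s2 of
      x : r | Just k <- classOf x -> (Just k, r)
      _                           -> (Nothing, s2)
    (trail, lit) = case reverse s3 of
      x : r | Just k <- classOf x -> (Just k, reverse r)
      _                           -> (Nothing, s3)

inClass :: Class -> Char -> Bool
inClass Vowel c     = c `elem` "AEIOU"  -- assumption: Y is a consonant
inClass Consonant c = isAlpha c && c `notElem` "AEIOU"
inClass AnyChar _   = True

-- | Consume one character for a class slot; an absent slot consumes nothing.
takeClass :: Maybe Class -> String -> Maybe (String, String)
takeClass Nothing s = Just ("", s)
takeClass (Just k) (x : r) | inClass k x = Just ([x], r)
takeClass _ _ = Nothing

-- | Match at the current position, returning the kept leading and
-- trailing class characters and the unconsumed remainder.
matchHere :: Pattern -> String -> Maybe (String, String, String)
matchHere p s = do
  (pre, s1)  <- takeClass (leading p) s
  s2         <- stripPrefix (literal p) s1
  (post, s3) <- takeClass (trailing p) s2
  if atEnd p && not (null s3) then Nothing else Just (pre, post, s3)

-- | Apply one rule left to right, replacing non-overlapping matches.
applyRule :: (String, String) -> String -> String
applyRule (pat, rep) = go True
  where
    p = parsePattern pat
    go _ [] = []
    go first s@(x : r)
      | first || not (atStart p)
      , Just (pre, post, rest) <- matchHere p s
      , length rest < length s      -- guard against empty matches
      = pre ++ rep ++ post ++ go False rest
      | otherwise = x : go False r
```

With the made-up rules ("^KN", "N"), ("MB$", "M") and ("vTIv", "SH"), this sketch turns "KNIGHT" into "NIGHT", "LAMB" into "LAM", and "NATION" into "NASHON", while leaving "UNKNOWN" and "AMBER" unchanged under the first two rules respectively.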