sphinxesc
A small module to prevent user-submitted search expressions from being
mis-parsed into invalid Sphinx Extended Query Expressions.
The module provides a function
module SphinxEscape where
escapeSphinxQueryString :: String -> String
that sanitizes the Sphinx query expression in a way that can be safely submitted to the Sphinx API.
Synopsis
Example from ghci:
ghci> :m SphinxEscape 
ghci> putStrLn $ escapeSphinxQueryString "@tag_list hello OR quick brown fox 7/11"
@tag_list hello | quick brown fox 7 11
ghci> 
ghci> putStrLn $ escapeSphinxQueryString "hello AND quick brown fox 7/11"
hello & quick brown fox 7 11
ghci> 
Explanation
escapeSphinxQueryString performs very simple escaping with the help of a
simplified abtract syntax tree. The abstract syntax tree it builds is:
data Expression = 
        TagFieldSearch String 
      | Literal String
      | Phrase String
      | AndOrExpr Conj Expression Expression 
  deriving Show
The escaping does not parse more advanced Sphinx query expressions such as
NEAR/n, quorum, etc., nor does it recognize arbitrary @field expressions.
The only special expressions recognized are & (AND), | (OR) and @tag_list WORDS.  Except for quoted phrases, non-alphanumeric characters that do not
form part of these specific expressions are simply turned into whitespace.
See the Testing section below for examples of conversions.
Obviously these rules are quite domain specific. The rules can be
made more configurable later.
Testing
The command line executable sphinxesc can be used to test the expression parser
and escaping of the input to the final sphinx search expression.
$ sphinxesc "test OR hello"
test | hello
# -p option shows the parsing result
$ sphinxesc -p "test OR hello"
AndOrExpr Or (Literal "test") (Literal "hello")
There is a suite of Bash-based regression tests in tests.txt, where the input
is on the left, followed by :: surrounded by any whitespace, followed by the
expected escaped output result. To run the tests, execute the script
./test.sh
NOTE This test output may be outdated. Please look at the tests.txt for
the current tests.
./test.sh
INPUT                         EXPECTED                      RESULT                        PASS      
7/11                          7 11                          7 11                          PASS      
hello 7/11                    hello 7 11                    hello 7 11                    PASS      
hello OR 7/11                 hello | 7 11                  hello | 7 11                  PASS      
hello or 7/11                 hello | 7 11                  hello | 7 11                  PASS      
hello | 7/11                  hello | 7 11                  hello | 7 11                  PASS      
hello AND 7/11                hello & 7 11                  hello & 7 11                  PASS      
@tag_list fox tango 7/11      @tag_list fox tango 7 11      @tag_list fox tango 7 11      PASS      
@(tag_list) fox tango 7/11    @tag_list fox tango 7 11      @tag_list fox tango 7 11      PASS      
@(tag_list) AND               @tag_list AND                 @tag_list AND                 PASS      
@other_field AND              other field AND               other field AND               PASS      
hello & @other_field AND      hello &  other field AND      hello &  other field AND      PASS      
hello &                       hello                         hello                         PASS      
& hello &                     hello                         hello                         PASS      
& & hello &                   hello                         hello                         PASS      
| | hello |                   hello                         hello                         PASS      
"hello" hello                 hello  hello                  hello  hello                  PASS      
hello" hello                  hello  hello                  hello  hello                  PASS      
hello' hello                  hello  hello                  hello  hello                  PASS      
hello' @tag_list fox          hello   @tag_list fox         hello   @tag_list fox         PASS      
hello' @tag_list fox &        hello   @tag_list fox         hello   @tag_list fox         PASS      
                                                                                          PASS      
(The last case is hard to see, but the input is a blank string "" and the output is a blank string "".)
Future directions
The escaping function can be made more configurable. The parser and AST data
structure can also be made more sophisticated, so that the AST can cover more
of the Sphinx Extended Query syntax.
Reference