Webrexp-1.1.2: Regexp-like engine to scrape web data

Safe Haskell: None

Text.Webrexp.Exprtypes

Description

Datatypes used to describe webrexps, and some helper functions.

Types

data WebRef

Represents an element.

Constructors

Wildcard

'*' Any subelement.

Elem String

... Search for a named element.

OfClass WebRef String

... . ... Check the value of the 'class' attribute

Attrib WebRef String

@... Check for the presence of an attribute

OfName WebRef String

#... Check the value of the 'id' attribute

data NodeRange

Ranges to be able to filter nodes by position.

Constructors

Index Int

...

Interval Int Int

min-max

data Op

Definitions of the operators available in the actions of the webrexp.

Constructors

OpAdd

+

OpSub

-

OpMul

*

OpDiv

div

OpLt

<

OpLe

<=

OpGt

>

OpGe

>=

OpEq

'=' in webrexp (== in Haskell)

OpNe

'!=' (/= in Haskell)

OpAnd

'&' (&& in Haskell)

OpOr

'|' (|| in Haskell)

OpMatch

'=~' regexp matching

OpContain

'~=' op contain, as the CSS3 operator.

OpBegin

'^=' op beginning, as the CSS3 operator.

OpEnd

'$=' op ending, as the CSS3 operator.

OpSubstring

'*=' op substring, as the CSS3 operator.

OpHyphenBegin

'|=' op hyphen-beginning, as the CSS3 operator.

OpConcat

':' concatenate two strings
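
The string-matching operators above are described only by reference to CSS3. As a rough illustration, the sketch below applies the CSS3 attribute-selector meanings to plain strings. The StrOp type and applyStrOp function are hypothetical stand-ins written for this example, not part of the library, and the semantics are assumed from CSS3 rather than taken from the webrexp evaluator.

```haskell
import Data.List (isInfixOf, isPrefixOf, isSuffixOf)

-- Local stand-in for the string-matching subset of Op (not the library type).
data StrOp = Contain | Begin | End | Substring | HyphenBegin
  deriving (Show, Eq)

-- Apply an operator to an attribute value and a pattern, following the
-- CSS3 attribute-selector meanings (an assumption, not webrexp's own code).
-- Empty patterns are not special-cased here.
applyStrOp :: StrOp -> String -> String -> Bool
applyStrOp Contain     value pat = pat `elem` words value       -- whitespace-separated word
applyStrOp Begin       value pat = pat `isPrefixOf` value       -- prefix
applyStrOp End         value pat = pat `isSuffixOf` value       -- suffix
applyStrOp Substring   value pat = pat `isInfixOf` value        -- anywhere in the value
applyStrOp HyphenBegin value pat =
  value == pat || (pat ++ "-") `isPrefixOf` value                -- exact, or followed by '-'
```

Under these assumptions, applyStrOp HyphenBegin "en-US" "en" holds while applyStrOp HyphenBegin "english" "en" does not.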

data ActionExpr

Represents an action. Each production of the grammar maps more or less to a data constructor of this type.

Constructors

ActionExprs [ActionExpr]

{ ... ; ... ; ... ; ... } A list of actions to execute; each one must return a valid value for the evaluation to continue.

BinOp Op ActionExpr ActionExpr

Basic binary operator application.

ARef String

Find a value of a given attribute for the current element.

CstI Int

An integer constant.

CstS String

A string constant

NodeReplace ActionExpr

'$'... operator. Used to put the action value back into the evaluation pipeline.

OutputAction

The '.' action. Dumps the content of the current element.

DeepOutputAction

Translate a node and all its children into text.

NodeNameOutputAction

Retrieve a node name

Call BuiltinFunc [ActionExpr]

funcName(..., ...)

data WebRexp

Type representation of a webrexp; this is the main type.

Constructors

Branch [WebRexp]

( ... ; ... ; ... )

Unions [WebRexp]

( ... , ... , ... )

List [WebRexp]

... ... (each action followed, no rollback)

Star WebRexp

... *

Repeat RepeatCount WebRexp

... #{ }

Alternative WebRexp WebRexp

'|' Represents two alternative paths: if the first fails, the second one is taken.

Unique Int

'!' Carries a unique index to tell the different uniques apart. Negative values are considered invalid; zero and all positive values are accepted.

Str String

"..." A string constant in the source expression.

Action ActionExpr

"{ ... }"

Range Int [NodeRange]

'[ ... ]' Filtering range. The Int is used as an index for a counter in the DFS evaluator.

Ref WebRef

every tag/class name

DirectChild WebRef

Find children that are direct descendants of the current nodes.

ConstrainedRef WebRef ActionExpr

This constructor is an optimisation: it combines a Ref followed by an action, where every action is a predicate. It helps prune the evaluation tree quickly in DFS evaluation.

DiggLink

'>>' operator in the language, used to follow a hyperlink.

DumpLink

'->' operator in the language, used to follow a hyperlink and dump the resulting content on the hard drive (if permitted).

NextSibling

'+' operator in the language, used to select the next sibling node.

PreviousSibling

'~' operator in the language, used to select the previous sibling node.

Parent

'<' operator in the language, used to select the parent node.

Functions

Transformations

simplifyNodeRanges :: [NodeRange] -> [NodeRange]

Helper function that simplifies the handling of node ranges. After simplification, the ranges are sorted in ascending order and no two ranges overlap.
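
One way to obtain sorted, non-overlapping ranges is to sort by lower bound and merge neighbours that overlap. The sketch below does this on a local mirror of NodeRange. The names simplifySketch, bounds and fromBounds are hypothetical, interval bounds are assumed to be inclusive, and the library's actual implementation may differ.

```haskell
import Data.List (sortOn)

-- Local mirror of the documented NodeRange type.
data NodeRange = Index Int | Interval Int Int
  deriving (Show, Eq)

-- View any range as inclusive (low, high) bounds (inclusiveness is assumed).
bounds :: NodeRange -> (Int, Int)
bounds (Index i)      = (i, i)
bounds (Interval a b) = (min a b, max a b)

-- Rebuild a range from bounds, preferring Index for a single position.
fromBounds :: (Int, Int) -> NodeRange
fromBounds (lo, hi)
  | lo == hi  = Index lo
  | otherwise = Interval lo hi

-- Hypothetical re-implementation: sort by lower bound, then merge overlaps.
simplifySketch :: [NodeRange] -> [NodeRange]
simplifySketch = map fromBounds . merge . sortOn fst . map bounds
  where
    merge ((a, b) : (c, d) : rest)
      | c <= b    = merge ((a, max b d) : rest)   -- overlapping: fuse and retry
      | otherwise = (a, b) : merge ((c, d) : rest)
    merge xs = xs
```

With these definitions, simplifySketch [Interval 5 9, Index 2, Interval 1 3, Interval 8 12] gives [Interval 1 3, Interval 5 12].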

foldWebRexp :: (a -> WebRexp -> (a, WebRexp)) -> a -> WebRexp -> (a, WebRexp)

This function rewrites a webrexp in a depth-first fashion while carrying an accumulator.
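
To make the shape of such a fold concrete, here is a minimal sketch over a reduced stand-in type with only four of the documented constructors. The names Rexp, foldRexp and numberUniques are hypothetical, and the traversal order (children first, left to right) is an assumption; the documentation only says the rewrite is depth-first.

```haskell
import Data.List (mapAccumL)

-- Reduced stand-in for WebRexp with a few of the documented constructors.
data Rexp
  = Branch [Rexp]
  | Star Rexp
  | Unique Int
  | Str String
  deriving (Show, Eq)

-- Rewrite children left to right first, then the node itself (post-order),
-- threading the accumulator through every step.
foldRexp :: (a -> Rexp -> (a, Rexp)) -> a -> Rexp -> (a, Rexp)
foldRexp f acc (Branch subs) =
  let (acc', subs') = mapAccumL (foldRexp f) acc subs
  in f acc' (Branch subs')
foldRexp f acc (Star sub) =
  let (acc', sub') = foldRexp f acc sub
  in f acc' (Star sub')
foldRexp f acc leaf = f acc leaf

-- Example rewrite: give every Unique node a fresh, increasing index and
-- return how many were numbered.
numberUniques :: Rexp -> (Int, Rexp)
numberUniques = foldRexp step 0
  where
    step n (Unique _) = (n + 1, Unique n)
    step n other      = (n, other)
```

The accumulator here plays the role of a counter, which is the kind of bookkeeping an index-assignment pass needs.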

assignWebrexpIndices :: WebRexp -> (Int, Int, WebRexp)

Preparation function for a webrexp: assigns all the indices used when evaluating it as an automaton.

prettyShowWebRef :: WebRef -> String

Pretty printing for WebRef. The output should be reparsable by the WebRexp parser.
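
A printer of this kind can be sketched directly from the notations listed for the WebRef constructors ('*', '.', '@' and '#'). The code below uses a local mirror of the type. The function name showRefSketch is hypothetical, and the exact text produced by the library function may differ.

```haskell
-- Local mirror of the documented WebRef type.
data WebRef
  = Wildcard
  | Elem String
  | OfClass WebRef String
  | Attrib WebRef String
  | OfName WebRef String
  deriving (Show, Eq)

-- Hypothetical printer: emit the inner reference, then the suffix notation.
showRefSketch :: WebRef -> String
showRefSketch Wildcard         = "*"
showRefSketch (Elem name)      = name
showRefSketch (OfClass r cls)  = showRefSketch r ++ "." ++ cls
showRefSketch (Attrib r attr)  = showRefSketch r ++ "@" ++ attr
showRefSketch (OfName r ident) = showRefSketch r ++ "#" ++ ident
```

For instance, showRefSketch (OfName (OfClass (Elem "div") "post") "main") yields "div.post#main".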

Predicates

isInNodeRange :: Int -> [NodeRange] -> Bool

Helper function to check whether a given index falls within the given ranges.
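
A membership test over a list of ranges might look like the sketch below. It assumes that interval bounds are inclusive and that a position is accepted as soon as one range covers it; the description leaves both points open, so treat these as assumptions. The names inRange1 and inRangesSketch are hypothetical.

```haskell
-- Local mirror of the documented NodeRange type.
data NodeRange = Index Int | Interval Int Int
  deriving (Show, Eq)

-- True when the position falls inside one range (bounds assumed inclusive).
inRange1 :: Int -> NodeRange -> Bool
inRange1 i (Index j)        = i == j
inRange1 i (Interval lo hi) = lo <= i && i <= hi

-- Assumed semantics: accept the position if at least one range covers it.
inRangesSketch :: Int -> [NodeRange] -> Bool
inRangesSketch i = any (inRange1 i)
```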

isOperatorBoolean :: Op -> Bool

Tells whether an action operator returns a boolean value. Useful to tell whether an action is a predicate. See isActionPredicate.

isActionPredicate :: ActionExpr -> Bool

Tells whether an action is a predicate and is only used to filter nodes. Expressions can be modified with this information to help prune as early as possible with the DFS evaluator.
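
The relationship between the two predicates can be illustrated on reduced stand-ins for Op and ActionExpr. Which operators count as boolean, and the rule that only a top-level boolean operator makes an action a predicate, are assumptions made for this sketch. The names opIsBoolean and actionIsPredicate are hypothetical.

```haskell
-- Reduced stand-ins for Op and ActionExpr (only a few constructors each).
data Op = OpAdd | OpLt | OpEq | OpAnd | OpMatch | OpConcat
  deriving (Show, Eq)

data ActionExpr
  = BinOp Op ActionExpr ActionExpr
  | ARef String
  | CstI Int
  | CstS String
  | OutputAction
  deriving (Show, Eq)

-- Assumed classification: comparisons, logic and matching yield booleans;
-- arithmetic and concatenation do not.
opIsBoolean :: Op -> Bool
opIsBoolean op = op `elem` [OpLt, OpEq, OpAnd, OpMatch]

-- Assumed rule: an action is a predicate when its top-level operator is boolean.
actionIsPredicate :: ActionExpr -> Bool
actionIsPredicate (BinOp op _ _) = opIsBoolean op
actionIsPredicate _              = False
```

Under these assumptions, BinOp OpEq (ARef "class") (CstS "post") is a predicate, whereas OutputAction and BinOp OpAdd (CstI 1) (CstI 2) are not.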