Safe Haskell	Safe-Inferred

Language.Distfix

Contents

Data Structures
Classes
User Space

Description

We present an algorithm for de-sugaring distributed affixes (distfixes) in a rose-like data structure. Distfixes are also known as mixfixes, but I chose dist- because the parts of the affix are distributed in-order through the root, rather than mixed in (out-of-order connotation) with the root. Now then, let's actually describe a distfix in detail:

By rose-like data structure, we mean any type t such that when an element of t can be unwrapped into a [t], we can perform rewrites according to our distfix algorithm and rewrap the result. If a particular element cannot be unwrapped, then it will be left alone during rewriting. Of course, this library was meant to operate on Hexprs and Quasihexprs, but it could just as well work on a plain list or rose, as well as anything else you're willing to mangle into shape.

A distributed affix consists of a number of alternating keywords and slots. While keywords should match exactly one leaf node, slots can consume multiple nodes (leaves or branches) during a detection. If we denote slots by underscores and keywords by some reasonable programming language identifier (w/o underscores), then some representative distfix examples might be _+_, _?_:_, _!, if_then_else_, and while_do_end.

Using the algorithm requires categorizing the input distfixes in several dimensions: topology, associativity, priority, and precedence. Only precedence need by specified by the user (it is extrinsic to any distfix), the rest are either specified in or calculated from the distfix at hand. We discuss these properties below:

Slots in a distfix are always separated by keywords, but they may also be a leading and/or trailing keyword in a distfix. The presence or absence of certain keywords is the topology of a distfix, and this affects the possibilities of its associativity. There are four options:

Closed: preceded and followed by keywords (e.g. begin_end)
Half-open Left: only followed by a keyword (e.g. _!)
Half-open Right: only preceded by a keyword (e.g. if_then_else_)
Open: neither preceded nor followed by a keyword (e.g. _+_)

As usual, there are three associativities: left-, right-, and non-associative. Open distfixes can take any of these three. Closed distfixes have no associativity. Half-open left distfixes are always left associative, and half-open right are always right associative.

Operators are divided into precedence levels as normal, but there are no limits on the number of precedence levels available for use. In the distfix table, groups of distfixes of the same precedence are sorted in descending order.

When given a list of expressions (the contents of an unwrapped node) and a distfix, the distfix may be detected within the list. When multiple distfixes in a single precedence level are detected at once, an attempt is made to select exactly one of the detected distfixes using a priority scheme calculated from the properties of the distfixes in question. Provided that one distfix has a higher priority than all the other detected distfixes, the highest priority distfix binds least tightly (and is therefore selected first).

The rules for calculating priority are these:

If both distfixes have the same associativity (left- or right-, but not non-associative), the one with the "most significant" keyword "earliest" has priority: for left-associative, most significant means first and earliest means leftmost; for right-associative, most significant means last, earliest means rightmost. If its a still a tie, then the one with the most keywords has priority.
If both distfixes are closed, then they must be non-overlapping, or one must contain the other. It doesn't really matter which has higher priority if they don't overlap (as it happens, we've chosen leftmost for now). If one nests within the other, the outer has priority. If they overlap exactly, then the one with the most keywords has priority.
Other pairs of matches have no priority distinction.

Given that a particular distfix is detected and selected for rewriting, we rewrite the list of terms by extracting the distfix from its slots. Specifically, we take the detected elements and run them through the distfix's rewriter to produce some single element. We then place the rewritten element at the front of the node, followed by each (filled) slot in order and rewrapped in its own node. The re-written list is finally rewrapped and placed back in its original context.

Detections are made recursively. The details are unimportant except that this algorithm is applied at every branch in the structure as made available by unwrap and the recursion respects precedence and priority. Each branch is assumed to have been enclosed by parenthesis during parsing, and therefore unwrapping resets the precedence level. Note that rewriting only adds branches to the structure, never removes them, and so we can see distfixes as adding implicit parenthesis, which can be quite valuable as a conservative tool for increasing the signal-to-noise ratio in a programming language.

Now for some technical notes:

I'm not sure how detection and priority will work if the same keyword appears twice in the same distfix, so it's probably best to avoid that for now. Or work it out and tell me, whatever. Either someone will eventually need this, at which point we'll deal with it, or maybe I'll get bored, or maybe I just won't care enough relative to other problems.

The two-typeclass system might seem a bit strange, but this is so I can avoid making the user involve ghc's FlexibleInstances extension. So, give an instance for DistfixElement SomeType and DistfixElement a => DistfixStructure (Hexpr a), with nodeMatch simply unwrapping Leaf and delegating to match.

Synopsis

Data Structures

data Distfix a Source

A distfix consists of 1) a rewriter, the results of which precede the slots when extracting, 2) a topology and associativity, which is actually merged into a single datatype Shape because the choice of associativity is not independent of topology, and 3) a non-empty list of keywords, each implicitly separated by a slot.

In case a distfix has a closed topology, its list of keywords must actually be at least two elements long (one for the open keyword, and one for the close keyword).

For more detail on these components, see the module documentation.

Constructors

Distfix ([a] -> a) Shape [a]

Instances

Show a => Show (Distfix a)

data Shape Source

Information on both topology and associativity.

The two properties are merged into one datatype because choice of one limits choice of the other. The constructors should make the possibilities clear enough, but the module documentation might better present the reasoning involved.

Constructors

Closed
HalfOpenRight
HalfOpenLeft
OpenRight
OpenLeft
OpenNon

Instances

Eq Shape
Show Shape

type DistfixTable a = [[Distfix a]]Source

A list, in descending order of precedence (ascending of binding tightness) of groups of Distfixes.

How tightly distfixes within a group bind relative to one another is determined by priority (see the module description). Although ambiguous grammars are accepted, it might be best to avoid forcing the user to make lots of priority calculations just to determine if they need to insert disambiguating parenthesis.

Classes

class DistfixStructure f whereSource

These data structures can be de-structured in a rose-like fashion. See the module description for detail on the meaning of "rose-like".

There is one law:

Inverse: maybe node ((xs, rewrap) -> rewrap xs) (unwrap node) === node

In other words, if you can unwrap a node, then rewrapping will perform the inverse.

Methods

unwrap :: f -> Maybe ([f], [f] -> f)Source

Unpack a branch node into a list of that branch's children and a rewrapping function.

defaultRewrap :: [f] -> fSource

Workaround so I can give an instance of Show (DistfixError a).

nodeMatch :: f -> f -> Bool Source

Whereas match operates on elements of the structure, nodeMatch is really just boilerplate that extracts an element and calls match on it.

For example, we might write

    instance DistfixDetect a => DistfixStructure (Hexpr a)
        nodeMatch (Leaf x) (Leaf y) = match x y
        nodeMatch _ _ = False

Very probably, it would not make sense to allow a non-leaf node to match anything (by implication, disallowing non-leaf keywords).

class DistfixElement a whereSource

This class is used for matching instead of Eq so that certain components of the data might be ignored. For example, if a = (SourcePos, b) then the SourcePos should clearly be ignored during matching.

Methods

match :: a -> a -> Bool Source

Whether the two elements are equal with respect to matching a keyword.

User Space

runDistfix :: DistfixStructure a => DistfixTable a -> a -> Either (DistfixError a) aSource

Given a table of distfixes and some input structure, apply the distfix detection/extraction algorithm.

The algorithm may fail with a DistfixError. The module description explains successful results in more detail.

data DistfixError a Source

Report reasons for error in recognizing distfixes. There are two causes of error:

Ambiguity: When there is no single detection that has higher precedence or priority within a set of detections made in a node, this is an ambiguous parse. Note that ambiguous grammars are allowed in this scheme, but should this ambiguity manifest itself in an input, that input is not recognized. Really, this is pretty spiffy: distfixes admit specification fairly near to an arbitrary context-free grammar, but the algorithm will excise ambiguity only where it needs to, completely side-stepping the problem of whether a given grammar is ambiguous.
Leftovers: Once we've detected all the keywords possible in a node, we need to ensure there are no leftover keywords. If there were, this would probably indicate a user forgetting a keyword. For example, suppose [|_|] were a distfix then [| a ] would obtain a LeftoverErr.

There's some fuzziness between AmbiguousErr and LeftoverErr. To illustrate, suppose we have _<_ and _<=_ but not _<=_<_ as a distfix, then both a < b < c and a <= b < c will be errors. The first will result in leftovers, and the second in ambiguity. It would make sense if they were both AmbiguousErr, but doing so under the current structure would sacrifice some efficiency (and possibly complicate matters). Still, at least everything that should be an error is an error.

Constructors

AmbiguousErr [(Distfix a, [a], [a], [[a]], [a])]
LeftoverErr [a]

Instances

(Show a, DistfixStructure a) => Show (DistfixError a)