Safe Haskell | Safe-Inferred |
---|
We present an algorithm for de-sugaring distributed affixes (distfixes) in a rose-like data structure. Distfixes are also known as mixfixes, but I chose dist- because the parts of the affix are distributed in-order through the root, rather than mixed in (out-of-order connotation) with the root. Now then, let's actually describe a distfix in detail:
By rose-like data structure, we mean any type t
such that when an element of t
can be
unwrap
ped into a [t]
, we can perform rewrites according to our distfix algorithm
and rewrap
the result. If a particular element cannot be unwrap
ped, then it will be
left alone during rewriting. Of course, this library was meant to operate on Hexprs
and
Quasihexprs
, but it could just as well work on a plain list or rose, as well as anything else
you're willing to mangle into shape.
A distributed affix consists of a number of alternating keywords and slots. While keywords
should match exactly one leaf node, slots can consume multiple nodes (leaves or branches)
during a detection. If we denote slots by underscores and keywords by some reasonable
programming language identifier (w/o underscores), then some representative distfix examples
might be _+_
, _?_:_
, _!
, if_then_else_
, and while_do_end
.
Using the algorithm requires categorizing the input distfixes in several dimensions: topology, associativity, priority, and precedence. Only precedence need by specified by the user (it is extrinsic to any distfix), the rest are either specified in or calculated from the distfix at hand. We discuss these properties below:
Slots in a distfix are always separated by keywords, but they may also be a leading and/or trailing keyword in a distfix. The presence or absence of certain keywords is the topology of a distfix, and this affects the possibilities of its associativity. There are four options:
- Closed: preceded and followed by keywords (e.g.
begin_end
) - Half-open Left: only followed by a keyword (e.g.
_!
) - Half-open Right: only preceded by a keyword (e.g.
if_then_else_
) - Open: neither preceded nor followed by a keyword (e.g.
_+_
)
As usual, there are three associativities: left-, right-, and non-associative. Open distfixes can take any of these three. Closed distfixes have no associativity. Half-open left distfixes are always left associative, and half-open right are always right associative.
Operators are divided into precedence levels as normal, but there are no limits on the number of precedence levels available for use. In the distfix table, groups of distfixes of the same precedence are sorted in descending order.
When given a list of expressions (the contents of an unwrap
ped node) and a distfix, the
distfix may be detected within the list. When multiple distfixes in a single precedence
level are detected at once, an attempt is made to select exactly one of the detected
distfixes using a priority scheme calculated from the properties of the distfixes in
question. Provided that one distfix has a higher priority than all the other detected
distfixes, the highest priority distfix binds least tightly (and is therefore selected first).
The rules for calculating priority are these:
- If both distfixes have the same associativity (left- or right-, but not non-associative), the one with the "most significant" keyword "earliest" has priority: for left-associative, most significant means first and earliest means leftmost; for right-associative, most significant means last, earliest means rightmost. If its a still a tie, then the one with the most keywords has priority.
- If both distfixes are closed, then they must be non-overlapping, or one must contain the other. It doesn't really matter which has higher priority if they don't overlap (as it happens, we've chosen leftmost for now). If one nests within the other, the outer has priority. If they overlap exactly, then the one with the most keywords has priority.
- Other pairs of matches have no priority distinction.
Given that a particular distfix is detected and selected for rewriting, we rewrite the list of
terms by extracting the distfix from its slots. Specifically, we take the detected elements
and run them through the distfix's rewriter to produce some single element. We then place the
rewritten element at the front of the node, followed by each (filled) slot in order and
rewrap
ped in its own node. The re-written list is finally rewrap
ped and placed back in
its original context.
Detections are made recursively. The details are unimportant except that this algorithm is
applied at every branch in the structure as made available by unwrap
and the recursion
respects precedence and priority. Each branch is assumed to have been enclosed by parenthesis
during parsing, and therefore unwrap
ping resets the precedence level. Note that rewriting
only adds branches to the structure, never removes them, and so we can see distfixes as
adding implicit parenthesis, which can be quite valuable as a conservative tool for
increasing the signal-to-noise ratio in a programming language.
Now for some technical notes:
I'm not sure how detection and priority will work if the same keyword appears twice in the same distfix, so it's probably best to avoid that for now. Or work it out and tell me, whatever. Either someone will eventually need this, at which point we'll deal with it, or maybe I'll get bored, or maybe I just won't care enough relative to other problems.
The two-typeclass system might seem a bit strange, but this is so I can avoid making the user
involve ghc's FlexibleInstances
extension. So, give an instance for
and DistfixElement
SomeType
, with
DistfixElement
a => DistfixStructure
(Hexpr
a)nodeMatch
simply unwrapping Leaf
and delegating to match
.
- data Distfix a = Distfix ([a] -> a) Shape [a]
- data Shape
- = Closed
- | HalfOpenRight
- | HalfOpenLeft
- | OpenRight
- | OpenLeft
- | OpenNon
- type DistfixTable a = [[Distfix a]]
- class DistfixStructure f where
- unwrap :: f -> Maybe ([f], [f] -> f)
- defaultRewrap :: [f] -> f
- nodeMatch :: f -> f -> Bool
- class DistfixElement a where
- runDistfix :: DistfixStructure a => DistfixTable a -> a -> Either (DistfixError a) a
- data DistfixError a
- = AmbiguousErr [(Distfix a, [a], [a], [[a]], [a])]
- | LeftoverErr [a]
Data Structures
A distfix consists of
1) a rewriter, the results of which precede the slots when extracting,
2) a topology and associativity, which is actually merged into a single datatype Shape
because the choice of associativity is not independent of topology, and
3) a non-empty list of keywords, each implicitly separated by a slot.
In case a distfix has a closed topology, its list of keywords must actually be at least two elements long (one for the open keyword, and one for the close keyword).
For more detail on these components, see the module documentation.
Information on both topology and associativity.
The two properties are merged into one datatype because choice of one limits choice of the other. The constructors should make the possibilities clear enough, but the module documentation might better present the reasoning involved.
type DistfixTable a = [[Distfix a]]Source
A list, in descending order of precedence (ascending of binding tightness) of groups of Distfixes.
How tightly distfixes within a group bind relative to one another is determined by priority (see the module description). Although ambiguous grammars are accepted, it might be best to avoid forcing the user to make lots of priority calculations just to determine if they need to insert disambiguating parenthesis.
Classes
class DistfixStructure f whereSource
These data structures can be de-structured in a rose-like fashion. See the module description for detail on the meaning of "rose-like".
There is one law:
- Inverse
-
maybe node ((xs, rewrap) -> rewrap xs) (unwrap node) === node
In other words, if you can unwrap a node, then rewrapping will perform the inverse.
unwrap :: f -> Maybe ([f], [f] -> f)Source
Unpack a branch node into a list of that branch's children and a rewrapping function.
defaultRewrap :: [f] -> fSource
Workaround so I can give an instance of Show (DistfixError a).
nodeMatch :: f -> f -> BoolSource
Whereas match
operates on elements of the structure, nodeMatch
is really just
boilerplate that extracts an element and calls match
on it.
For example, we might write
instanceDistfixDetect
a =>DistfixStructure
(Hexpr
a) nodeMatch (Leaf
x) (Leaf
y) = match x y nodeMatch _ _ = False
Very probably, it would not make sense to allow a non-leaf node to match anything (by implication, disallowing non-leaf keywords).
class DistfixElement a whereSource
This class is used for matching instead of Eq
so that certain components of the data might be
ignored. For example, if a = (SourcePos, b)
then the SourcePos
should clearly be ignored
during matching.
User Space
runDistfix :: DistfixStructure a => DistfixTable a -> a -> Either (DistfixError a) aSource
Given a table of distfixes and some input structure, apply the distfix detection/extraction algorithm.
The algorithm may fail with a DistfixError
. The module description explains successful
results in more detail.
data DistfixError a Source
Report reasons for error in recognizing distfixes. There are two causes of error:
- Ambiguity
- When there is no single detection that has higher precedence or priority within a set of detections made in a node, this is an ambiguous parse. Note that ambiguous grammars are allowed in this scheme, but should this ambiguity manifest itself in an input, that input is not recognized. Really, this is pretty spiffy: distfixes admit specification fairly near to an arbitrary context-free grammar, but the algorithm will excise ambiguity only where it needs to, completely side-stepping the problem of whether a given grammar is ambiguous.
- Leftovers
- Once we've detected all the keywords possible in a node, we need to ensure there
are no leftover keywords. If there were, this would probably indicate a user forgetting a
keyword. For example, suppose
[|_|]
were a distfix then[| a ]
would obtain a LeftoverErr.
There's some fuzziness between AmbiguousErr
and LeftoverErr
. To illustrate, suppose we have
_<_
and _<=_
but not _<=_<_
as a distfix, then both a < b < c
and a <= b < c
will be
errors. The first will result in leftovers, and the second in ambiguity. It would make sense if
they were both AmbiguousErr
, but doing so under the current structure would sacrifice some
efficiency (and possibly complicate matters). Still, at least everything that should be an
error is an error.
AmbiguousErr [(Distfix a, [a], [a], [[a]], [a])] | |
LeftoverErr [a] |
(Show a, DistfixStructure a) => Show (DistfixError a) |