Compiling GF
Aarne Ranta
Proglog meeting, 1 November 2006

% to compile: txt2tags -thtml compiling-gf.txt ; htmls compiling-gf.html

%!target:html
%!postproc(html): #NEW

#NEW
==The compilation task==

GF is a grammar formalism, i.e. a special-purpose programming language for writing grammars. Other grammar formalisms:
- BNF, YACC, Happy (grammars for programming languages);
- PATR, HPSG, LFG (grammars for natural languages).


The grammar compiler prepares a GF grammar for two computational tasks:
- linearization: take syntax trees to strings
- parsing: take strings to syntax trees


The grammar gives a declarative description of these functionalities, on a high abstraction level that improves grammar-writing productivity. For efficiency, the grammar is compiled to lower-level formats.

Type checking is another essential compilation phase. Its purpose is twofold, as usual:
- checking the correctness of the grammar
- type-annotating expressions for code generation

#NEW
==Characteristics of GF language==

A functional language with types, both built-in and user-defined.
```
Str : Type

param Number = Sg | Pl
param AdjForm = ASg Gender | APl

Noun : Type = {s : Number => Str ; g : Gender}
```

Pattern matching.
```
svart_A = table {
  ASg _ => "svart" ;
  _     => "svarta"
  }
```

Higher-order functions. Dependent types.
```
flip : (a, b, c : Type) -> (a -> b -> c) -> b -> a -> c
     = \_,_,_,f,y,x -> f x y ;
```
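As a rough analogue (a hypothetical Haskell sketch, not part of GF itself; the ``Gender`` values are assumed for the example), a GF table over a parameter type behaves like a total function on a finite enumerated type, and this finiteness is what later lets the compiler tabulate it:
```
-- Hypothetical Haskell analogue (illustration only) of the GF table
-- svart_A : AdjForm => Str. The Gender values are assumed here.
data Gender  = Utr | Neutr      deriving (Eq, Show, Enum, Bounded)
data AdjForm = ASg Gender | APl deriving (Eq, Show)

-- The GF pattern match corresponds to a total function on a finite type.
svart_A :: AdjForm -> String
svart_A (ASg _) = "svart"
svart_A _       = "svarta"

-- Because AdjForm is finite, the function can be fully tabulated,
-- which is the key to the compilation techniques described later.
adjTable :: [(AdjForm, String)]
adjTable = [ (f, svart_A f)
           | f <- [ASg g | g <- [minBound .. maxBound]] ++ [APl] ]
```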
#NEW
==The module system of GF==

Main division: abstract syntax and concrete syntax.
```
abstract Greeting = {
  cat Greet ;
  fun Hello : Greet ;
}

concrete GreetingEng of Greeting = {
  lincat Greet = {s : Str} ;
  lin Hello = {s = "hello"} ;
}

concrete GreetingIta of Greeting = {
  param Politeness = Familiar | Polite ;
  lincat Greet = {s : Politeness => Str} ;
  lin Hello = {s = table {
    Familiar => "ciao" ;
    Polite   => "buongiorno"
    }} ;
}
```

Other features of the module system:
- extension and opening
- parametrized modules (cf. ML: signatures, structures, functors)

#NEW
==GF vs. Haskell==

Some things that (standard) Haskell doesn't have:
- records and record subtyping
- regular expression patterns
- dependent types
- ML-style modules


Some things that GF doesn't have:
- infinite (recursive) data types
- recursive functions
- classes, polymorphism

#NEW
==GF vs. most linguistic grammar formalisms==

GF separates abstract syntax from concrete syntax.

GF has a module system with separate compilation.

GF is generation-oriented (as opposed to parsing-oriented).

GF has unidirectional matching (as opposed to unification).

GF has a static type system (as opposed to a type-free universe).

"I was - and I still am - firmly convinced that a program composed out of statically type-checked parts is more likely to faithfully express a well-thought-out design than a program relying on weakly-typed interfaces or dynamically-checked interfaces." (B. Stroustrup, 1994, p. 107)

#NEW
==The computation model: abstract syntax==

An abstract syntax defines a free algebra of trees (using dependent types, recursion, and higher-order abstract syntax: GF includes a complete Logical Framework).
```
cat C (x_1 : A_1)...(x_n : A_n)

a_1 : A_1  ...  a_n : A_n{x_1 := a_1,...,x_n-1 := a_n-1}
--------------------------------------------------------
(C a_1 ... a_n) : Type


fun f : (x_1 : A_1) -> ... -> (x_n : A_n) -> A

a_1 : A_1  ...  a_n : A_n{x_1 := a_1,...,x_n-1 := a_n-1}
--------------------------------------------------------
(f a_1 ... a_n) : A{x_1 := a_1,...,x_n := a_n}


A : Type   x : A |- B : Type      x : A |- b : B          f : (x : A) -> B   a : A
----------------------------    ----------------------   ------------------------
(x : A) -> B : Type             \x -> b : (x : A) -> B    f a : B{x := a}
```

Notice that all syntax trees are in eta-long form.

#NEW
==The computation model: concrete syntax==

A concrete syntax defines a homomorphism (compositional mapping) from the abstract syntax to a system of concrete-syntax objects.
```
cat C _
--------------------
lincat C = C* : Type

fun f : (x_1 : A_1) -> ... -> (x_n : A_n) -> A
----------------------------------------------
lin f = f* : A_1* -> ... -> A_n* -> A*

(f a_1 ... a_n)* = f* a_1* ... a_n*
```

The homomorphism can as such be used as the linearization function. It is a functional program, but a restricted one, since in the end it works on finite data structures only.

But a more efficient program is obtained via compilation to GFC = Canonical GF: the "machine code" of GF.

The parsing problem of GFC can be reduced to that of MPCFG (Multiple Parallel Context-Free Grammars); see P. Ljunglöf's thesis (2004).

#NEW
==The core type system of concrete syntax: basic types==
```
param P        P : PType
---------      ---------      PType : Type      Str : Type
P : PType      P : Type

                              s : Str   t : Str
"foo" : Str      [] : Str     -----------------
                              s ++ t : Str
```

#NEW
==The core type system of concrete syntax: functions and tables==
```
A : Type   x : A |- B : Type      x : A |- b : B          f : (x : A) -> B   a : A
----------------------------    ----------------------   ------------------------
(x : A) -> B : Type             \x -> b : (x : A) -> B    f a : B{x := a}

P : PType   A : Type      t : P => A   p : P
--------------------      ------------------
P => A : Type             t ! p : A

v_1,...,v_n : A
----------------------------------------------   P = {C_1,...,C_n}
table {C_1 => v_1 ; ... ; C_n => v_n} : P => A
```

Pattern matching is treated as an abbreviation for tables. Notice that
```
case e of {...}  ==  table {...} ! e
```

#NEW
==The core type system of concrete syntax: records==
```
A_1,...,A_n : Type
------------------------------------   n >= 0
{r_1 : A_1 ; ... ; r_n : A_n} : Type

a_1 : A_1  ...  a_n : A_n
------------------------------------------------------------
{r_1 = a_1 ; ... ; r_n = a_n} : {r_1 : A_1 ; ... ; r_n : A_n}

r : {r_1 : A_1 ; ... ; r_n : A_n}
---------------------------------   i = 1,...,n
r.r_i : A_i
```

Subtyping: if ``r : R ** {s : A}`` then ``r : R``.

#NEW
==Computation rules==
```
(\x -> b) a = b{x := a}

(table {C_1 => v_1 ; ... ; C_n => v_n} : P => A) ! C_i = v_i

{r_1 = a_1 ; ... ; r_n = a_n}.r_i = a_i
```

#NEW
==Canonical GF==

Concrete syntax type system:
```
Str : Type      Int : Type

A_1 : Type ... A_n : Type
-------------------------      $i : A_i
[A_1, ..., A_n] : Type

a_1 : A_1 ... a_n : A_n            t : [A_1, ..., A_n]
---------------------------------  -------------------   i = 1,...,n
[a_1, ..., a_n] : [A_1, ..., A_n]  t ! i : A_i
```

Tuples represent both records and tables. There are no functions.

Linearization:
```
lin f = f*

(f a_1 ... a_n)* = f*{$1 = a_1*, ..., $n = a_n*}
```

#NEW
==The compilation task, again==

+ From a GF source grammar, derive a canonical GF grammar.
+ From the canonical GF grammar, derive an MPCFG grammar.


The canonical GF grammar can be used for linearization, with linear time complexity (w.r.t. the size of the tree).

The MPCFG grammar can be used for parsing, with (unbounded) polynomial time complexity (w.r.t. the size of the string).

For these target formats, we have also built interpreters in different programming languages (C, C++, Haskell, Java, Prolog).

Moreover, we generate supplementary formats such as grammars required by various speech recognition systems.
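To make the linear-time claim concrete, here is a minimal Haskell sketch (hypothetical ``Term`` type, not the actual GF sources) of linearization over canonical terms: tuples encode both records and tables, ``$i`` stands for the linearization of the i-th argument, and linearization is a single pass over the term.
```
-- A minimal sketch (hypothetical, not the actual GF sources) of
-- canonical GF terms and their linearization.
data Term
  = Token String       -- "foo"
  | Empty              -- []
  | Concat Term Term   -- s ++ t
  | Tuple [Term]       -- [a_1, ..., a_n]: both records and tables
  | Select Term Int    -- t ! i  (1-based)
  | Arg Int            -- $i: the linearization of the i-th argument
  deriving Show

-- linearize args t: substitute argument linearizations for $i and
-- compute. One pass over the canonical term.
linearize :: [Term] -> Term -> Term
linearize args = go
  where
    go (Arg i)      = args !! (i - 1)
    go (Select t i) = case go t of
                        Tuple ts -> ts !! (i - 1)
                        t'       -> Select t' i
    go (Concat s t) = Concat (go s) (go t)
    go (Tuple ts)   = Tuple (map go ts)
    go t            = t
```
For example, ``linearize [Token "hello"] (Select (Tuple [Arg 1, Token "ciao"]) 1)`` computes to ``Token "hello"``.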
#NEW
==An overview of compilation phases==

Legend:
- ellipse node: representation saved in a file
- plain text node: internal representation
- solid arrow or ellipse: essential phase or format
- dashed arrow or ellipse: optional phase or format
- arrow label: the module implementing the phase


[gf-compiler.png]

#NEW
==Using the compiler==

Batch mode (cf. GHC).

Interactive mode, building the grammar incrementally from different files, with the possibility of testing them (cf. GHCi).

The interactive mode came first, built on the model of ALF-2 (L. Magnusson), and there was no file output of compiled grammars.

#NEW
==Modules and separate compilation==

The above diagram shows what happens to each module. (But not quite, since some of the back-end formats must be built for sets of modules: GFCC and the parser formats.)

When the grammar compiler is called, it takes a main module as its argument. It then recursively builds a dependency graph with all the other modules, and decides which ones must be recompiled. The behaviour is rather similar to that of GHC.

Separate compilation is //extremely important// when developing big grammars, especially when using grammar libraries. Example: compiling the GF resource grammar library takes 5 minutes, whereas reading in the compiled image takes 10 seconds.

#NEW
==Module dependencies and recompilation==

(For later use, not for the Proglog talk.)

For each module M, there are 3 kinds of files:
- M.gf, source file
- M.gfc, compiled file ("object file")
- M.gfr, type-checked and optimized source file (for resource modules only)


The compiler reads gf files and writes gfc files (and gfr files if appropriate).

The Main module is the one used as argument when calling GF.

A module M (immediately) depends on the module K, if either
- M is a concrete of K
- M is an instance of K
- M extends K
- M opens K
- M is a completion of K with something
- M is a completion of some module with K instantiated with something


A module M (transitively) depends on the module K, if either
- M immediately depends on K, or
- M depends on some L such that L immediately depends on K.


Immediate dependence is readable from the module header, without parsing the whole module. The compiler recursively reads the headers of all modules that Main depends on. These modules are arranged in a dependency graph, which is checked to be acyclic.

To decide whether a module M has to be compiled (see the sketch below), do:
+ Get the time stamps t() of M.gf and M.gfc (if a file doesn't exist, its time is minus infinity).
+ If t(M.gf) > t(M.gfc), M must be compiled.
+ If M depends on K and K must be compiled, then M must be compiled.
+ If M depends on K and t(K.gf) > t(M.gfc), then M must be compiled.


Decorate the dependency graph with information on whether the gf or the gfc (and gfr) format is to be read. Topologically sort the decorated graph, and read each file in the chosen format.

The gfr file is generated for these module types only:
- resource
- instance


When reading K.gfc, K.gfr is also read if some M depending on K has to be compiled. In other cases, it is enough to read K.gfc.

In an interactive GF session, some modules may already be in memory. When read into memory, each module M is given the time stamp t(M.m). The additional rule now is:
- If M.gfc is to be read, and t(M.m) > t(M.gfc), don't read M.gfc.
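A minimal Haskell sketch of the recompilation decision above (hypothetical names, simplified timestamps; the actual logic lives in modules such as ``ReadFiles``):
```
import qualified Data.Map as Map

-- A minimal sketch (hypothetical, simplified) of the recompilation
-- rules above. Timestamps are Doubles; a missing file gets minus
-- infinity. Assumes the dependency graph is acyclic.
type Module = String
type Time   = Double

minusInfinity :: Time
minusInfinity = -1/0

-- mustCompile deps tGf tGfc m: does module m have to be (re)compiled?
-- deps gives the immediate dependencies; the recursion covers the
-- transitive cases.
mustCompile :: Map.Map Module [Module]  -- immediate dependencies
            -> Map.Map Module Time      -- t(M.gf)
            -> Map.Map Module Time      -- t(M.gfc)
            -> Module -> Bool
mustCompile deps tGf tGfc = go
  where
    time m tab = Map.findWithDefault minusInfinity m tab
    go m =
         time m tGf > time m tGfc               -- source newer than object
      || any (\k -> go k                        -- a dependency is recompiled
                 || time k tGf > time m tGfc)   -- a dependency's source is newer
             (Map.findWithDefault [] m deps)
```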
#NEW
==Techniques used==

The compiler is written in Haskell, with some C foreign function calls in the interactive version (readline, killing threads).

BNFC is used for generating both the parsers and the printers. This has helped to make the formats portable.

"Almost compositional functions" (``composOp``) are used in many compiler passes, making them easier to write and understand. A ``grep`` on the sources reveals 40 uses (outside the definition of ``composOp`` itself).

The key algorithmic ideas are
- type-driven partial evaluation in GF-to-GFC generation
- common subexpression elimination as a back-end optimization
- some ideas in the GFC-to-MCFG encoding

#NEW
==Type-driven partial evaluation==

Each abstract syntax category in GF has a corresponding linearization type:
```
cat C
lincat C = T
```

The general form of a GF rule pair is
```
fun f : C1 -> ... -> Cn -> C
lin f = t
```
with the typing condition, following the ``lincat`` definitions,
```
t : T1 -> ... -> Tn -> T
```

The term ``t`` is in general built by using abstraction methods such as pattern matching, higher-order functions, local definitions, and library functions.

The compilation technique proceeds as follows:
- use eta-expansion on ``t`` to determine the canonical form of the term
```
\ $C1, ...., $Cn -> (t $C1 .... $Cn)
```
with unique variables ``$C1 .... $Cn`` for the arguments; repeat this inside the term for records and tables
- evaluate the resulting term using the computation rules of GF
- what remains is a canonical term with ``$C1 .... $Cn`` as the only variables (the run-time input of the linearization function)

#NEW
==Eta-expanding records and tables==

For records that are valid via subtyping, eta-expansion eliminates superfluous fields:
```
{r1 = t1 ; r2 = t2} : {r1 : T1}   ---->   {r1 = t1}
```

For tables, the effect is always expansion, since pattern matching can be used to represent tables compactly:
```
table {n => "fish"} : Number => Str   --->   table {
                                               Sg => "fish" ;
                                               Pl => "fish"
                                               }
```

This can be helped by back-end optimizations (see below).

#NEW
==Eliminating functions==

"Everything is finite": parameter types, records, tables; a finite number of string tokens per grammar.

But "infinite types" such as function types are useful when writing grammars, to enable abstractions. Since function types do not appear in linearization types, we want functions to be eliminated from linearization terms. This is similar to the **subformula property** in logic.

The main problem is also similar: a function depending on a run-time variable,
```
(table {P => f ; Q => g} ! x) a
```
This is not a redex, but we can bring it closer to one by moving the application inside the table:
```
table {P => f a ; Q => g a} ! x
```

This transformation is the same as Prawitz's (1965) elimination of maximal segments in natural deduction:
```
          A         B
          :         :
A v B   C -> D    C -> D
-------------------------
        C -> D              C
------------------------------
              D

              ===>

          A               B
          :               :
        C -> D   C      C -> D   C
        ----------      ----------
A v B       D               D
----------------------------------
              D
```
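As an illustration, here is a small Haskell sketch (hypothetical ``Term`` type, not the actual compiler sources) of the move-application-inside-the-table step:
```
-- A small sketch (hypothetical Term type, not the actual GF compiler
-- sources) of moving an application inside a table:
--   (table {P => f ; Q => g} ! x) a  --->  table {P => f a ; Q => g a} ! x
data Term
  = Var String              -- variables such as x, f, g, a
  | App Term Term           -- application: f a
  | Select Term Term        -- selection: t ! x
  | Table [(String, Term)]  -- table {P => v ; ...}, patterns simplified
  deriving Show

-- Push the application through the selection, exposing the redexes
-- f a and g a in the branches.
pushApp :: Term -> Term
pushApp (App (Select (Table cases) x) a) =
  Select (Table [ (p, App v a) | (p, v) <- cases ]) x
pushApp t = t
```
For example, ``pushApp (App (Select (Table [("P", Var "f"), ("Q", Var "g")]) (Var "x")) (Var "a"))`` returns the selection with the applications pushed into the branches.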
#NEW
==Size effects of partial evaluation==

Irrelevant table branches are thrown away, which can reduce the size.

But, since tables are expanded and auxiliary functions are inlined, the size can grow exponentially.

How can we keep the first property and eliminate the second?

#NEW
==Parametrization of tables==

Algorithm: for each branch in a table, consider replacing the pattern by a variable:
```
table {             table {
  P => t ;   --->     x => t[P->x] ;
  Q => u              x => u[Q->x]
  }                   }
```

If the resulting branches are all equal, you can replace the table by a lambda abstract
```
\\x => t[P->x]
```

If each created variable ``x`` is unique in the grammar, computation with the lambda abstract is efficient.

#NEW
==Course-of-values tables==

By maintaining a canonical order of the parameters in a type, we can eliminate the left-hand sides of branches:
```
table {             table T [
  P => t ;   --->     t ;
  Q => u              u
  }                   ]
```

The treatment is similar to ``Enum`` instances in Haskell. In the end, all parameter types can be translated to initial segments of the integers.

#NEW
==Common subexpression elimination==

Algorithm (see the sketch below):
+ Go through all terms and subterms in a module, creating a symbol table mapping terms to their numbers of occurrences.
+ For each subterm appearing at least twice, create a fresh constant defined as that subterm.
+ Go through all rules (incl. the rules for the new constants), replacing the largest possible subterms with the new constants.


This algorithm, in a way, creates the strongest possible abstractions.

In general, the new constants have open terms as definitions. But since all variables (and constants) are unique, they can be computed by simple replacement.
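A simplified Haskell sketch of these three steps (hypothetical ``Term`` type, not the actual compiler sources; the real pass also rewrites the definitions of the new constants and avoids trivial replacements):
```
import qualified Data.Map as Map

-- A simplified sketch (hypothetical Term type) of common subexpression
-- elimination over linearization terms.
data Term = Tok String | Concat Term Term | Const String
  deriving (Eq, Ord, Show)

-- Step 1: symbol table mapping each subterm to its number of occurrences.
occurrences :: Term -> Map.Map Term Int
occurrences t = Map.insertWith (+) t 1 $ case t of
  Concat u v -> Map.unionWith (+) (occurrences u) (occurrences v)
  _          -> Map.empty

-- Step 2: a fresh constant for every subterm occurring at least twice.
definitions :: Term -> Map.Map Term String
definitions t = Map.fromList
  [ (u, '_' : show i)
  | (i, u) <- zip [(0 :: Int) ..]
                  (Map.keys (Map.filter (>= 2) (occurrences t))) ]

-- Step 3: replace the largest possible subterms by the new constants.
abstractOut :: Map.Map Term String -> Term -> Term
abstractOut defs = go
  where
    go t = case Map.lookup t defs of
      Just c  -> Const c
      Nothing -> case t of
        Concat u v -> Concat (go u) (go v)
        _          -> t
```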
#NEW
==Size effects of optimizations==

Example: the German resource grammar ``LangGer``.

|| optimization | lines | characters | size % | blow-up |
| none      | 5394 | 3208435 | 100 |  25 |
| all       | 5394 |  750277 |  23 |   6 |
| none_subs | 5772 | 1290866 |  40 |  10 |
| all_subs  | 5644 |  414119 |  13 |   3 |
| gfcc      | 3279 |  190004 |   6 | 1.5 |
| gf source | 3976 |  121939 |   4 |   1 |

Optimization "all" means parametrization + course-of-values. The source code size is an estimate, since it includes potentially irrelevant library modules, and comments. The GFCC format is not reusable in separate compilation.

#NEW
==The shared prefix optimization==

This is currently performed in GFCC only.

The idea works for languages that have a rich morphology based on suffixes. Then we can replace a course of values with a pair of a prefix and a suffix set:
```
["apa", "apan", "apor", "aporna"]   --->   ("ap" + ["a", "an", "or", "orna"])
```

The real gain comes via common subexpression elimination:
```
_34 = ["a", "an", "or", "orna"]

apa    = ("ap" + _34)
blomma = ("blomm" + _34)
flicka = ("flick" + _34)
```

Notice that it now matters a lot how grammars are written. For instance, if German verbs are treated as a one-dimensional table,
```
["lieben", "liebe", "liebst", ...., "geliebt", "geliebter",...]
```
no shared prefix optimization is possible. A better form is separate tables for the non-"ge" and "ge" forms:
```
[["lieben", "liebe", "liebst", ....], ["geliebt", "geliebter",...]]
```

#NEW
==Reuse of grammars as libraries==

The idea of resource grammars: take care of all aspects of surface grammaticality (inflection, agreement, word order).

Reuse in an application grammar works via translations:
```
cat C            --->   oper C : Type = T
lincat C = T

fun f : A        --->   oper f : A* = t
lin f = t
```

The user only needs to know the type signatures (the abstract syntax).

However, this does not quite guarantee grammaticality, because different categories can have the same lincat:
```
lincat Conj = {s : Str}
lincat Adv  = {s : Str}
```
Thus someone may by accident use "and" as an adverb!

#NEW
==Forcing the type checker to act as a grammar checker==

We just have to make the linearization types unique for each category. The technique is reminiscent of Haskell's ``newtype``, but uses records instead: we add **lock fields**, e.g.
```
lincat Conj = {s : Str ; lock_Conj : {}}
lincat Adv  = {s : Str ; lock_Adv : {}}
```

Thanks to record subtyping, the translation is simple:
```
fun f : C1 -> ... -> Cn -> C
lin f = t
--->
oper f : C1* -> ... -> Cn* -> C*
       = \x1,...,xn -> (t x1 ... xn) ** {lock_C = {}}
```

#NEW
==Things to do==

Better compression of the gfc file format.

Type checking of dependent-type pattern matching in abstract syntax.

Compilation-related modules that need rewriting:
- ``ReadFiles``: clarify the logic of dependencies
- ``Compile``: clarify the logic of what to do with each module
- ``Compute``: make the evaluation more efficient
- ``Parsing/*``, ``OldParsing/*``, ``Conversion/*``: reduce the number of parser formats and algorithms