From Resource Grammar to Wide Coverage Translation with GF

Aarne Ranta et al.
January-May 2014

Scope

Wide-coverage interlingual translator for Bulgarian, Chinese, Dutch, English, Finnish, French, German, Hindi, Italian, Spanish, Swedish.

How to use it

If you just want to try it before reading more, here are the main ways to get started:

1. Run on our server. http://www.grammaticalframework.org/demos/translation.html

2. Get an Android app. http://www.grammaticalframework.org/demos/app.html

3. Compile and run in the shell. Get the latest GF sources (with darcs or github) and then

4. To modify the sources, work on the files in

    GF/lib/src/translator/

It is these files that will be explained below.

GF and the RGL

GF, Grammatical Framework, was originally designed for the purpose of multilingual controlled language systems, which would enable high-quality translation on limited domains. The abstract syntax of GF defines the semantic structures relevant for the domain, and the concrete syntaxes map these structures to grammatically correct and idiomatic text in each target language. The reversibility of GF enables both generation and parsing, and thereby translation where the abstract syntax functions as an interlingua.
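
The division of labour described above can be illustrated with a small sketch. It is written in Python rather than GF, and everything in it (the `Is` tree type, the two tiny lexicons, the `parse` and `translate` helpers) is a hypothetical stand-in, not GF code or the GF API. In particular, parsing is done here by brute-force enumeration of a finite tree space, which only works for a toy; it is an assumption of the sketch and not how GF parses.

```python
from dataclasses import dataclass
from itertools import product

# Abstract syntax: language-independent trees that act as the interlingua.
@dataclass(frozen=True)
class Is:
    kind: str      # "Wine" or "Pizza"
    quality: str   # "Good" or "Fresh"

KINDS = ["Wine", "Pizza"]
QUALITIES = ["Good", "Fresh"]

# Concrete syntax 1: English, where no gender agreement is needed.
ENG_KIND = {"Wine": "wine", "Pizza": "pizza"}
ENG_QUAL = {"Good": "good", "Fresh": "fresh"}

def lin_eng(tree: Is) -> str:
    return f"this {ENG_KIND[tree.kind]} is {ENG_QUAL[tree.quality]}"

# Concrete syntax 2: Italian, where determiner and adjective agree
# with the gender of the noun.
ITA_KIND = {"Wine": ("vino", "m"), "Pizza": ("pizza", "f")}
ITA_QUAL = {"Good": {"m": "buono", "f": "buona"},
            "Fresh": {"m": "fresco", "f": "fresca"}}
ITA_THIS = {"m": "questo", "f": "questa"}

def lin_ita(tree: Is) -> str:
    noun, gender = ITA_KIND[tree.kind]
    return f"{ITA_THIS[gender]} {noun} è {ITA_QUAL[tree.quality][gender]}"

LINEARIZERS = {"Eng": lin_eng, "Ita": lin_ita}

def parse(lang: str, text: str) -> list:
    """Invert linearization by enumerating the (tiny, finite) tree space."""
    lin = LINEARIZERS[lang]
    trees = (Is(k, q) for k, q in product(KINDS, QUALITIES))
    return [t for t in trees if lin(t) == text]

def translate(src: str, tgt: str, text: str) -> list:
    """Translate by parsing to abstract trees, then linearizing each one."""
    return [LINEARIZERS[tgt](t) for t in parse(src, text)]

print(translate("Eng", "Ita", "this pizza is fresh"))
print(translate("Ita", "Eng", "questo vino è buono"))
```

The same `translate` function works in both directions because neither language is privileged: each one only knows how to map abstract trees to strings.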

It was soon realized that the bottleneck of GF applications is the definition of concrete syntax, which requires a lot of manual work and linguistic skill because of the complexities of natural language syntax and morphology. Some of the complexities can be ignored in a small system. For instance, in a mathematical system, it may be enough to use verbs in the present tense only. But very much the same linguistic problems must be solved again and again in new applications: French verb inflection is the same in mathematics as in a tourist phrasebook. To solve this problem, the GF Resource Grammar Library (RGL) was developed, to take care of "low-level" linguistic rules such as inflection, agreement, and word order. This enables the authors of application grammars to focus on the semantics (when designing the abstract syntax) and on selecting RGL functions that produce the idioms they want. The RGL grew into an international open-source project: at the time of writing, more than 50 people have contributed to implementing it for 29 languages.
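
As a rough illustration of why a shared library pays off, the following Python sketch puts inflection and agreement into one small "resource" layer and lets two unrelated applications reuse it. The names `mk_noun` and `pred` are hypothetical helpers invented for this sketch. They are not the RGL API, and the plural rule covers only regular English nouns.

```python
# "Resource" layer: low-level linguistic rules, written once.

def mk_noun(sg: str) -> dict:
    """Regular English noun: derive the plural form from the singular."""
    if sg.endswith(("s", "x", "z", "ch", "sh")):
        pl = sg + "es"
    elif len(sg) > 1 and sg.endswith("y") and sg[-2] not in "aeiou":
        pl = sg[:-1] + "ies"
    else:
        pl = sg + "s"
    return {"sg": sg, "pl": pl}

DET = {"sg": "this", "pl": "these"}
COPULA = {"sg": "is", "pl": "are"}

def pred(noun: dict, number: str, adjective: str) -> str:
    """Predication with number agreement on determiner, noun and copula."""
    return f"{DET[number]} {noun[number]} {COPULA[number]} {adjective}"

# Application 1: a mathematical domain.
def is_even(number: str) -> str:
    return pred(mk_noun("number"), number, "even")

# Application 2: a tourist phrasebook.
def is_delicious(number: str) -> str:
    return pred(mk_noun("dish"), number, "delicious")

print(is_even("pl"))
print(is_delicious("pl"))
```

Neither application contains any morphology of its own; each only chooses words and a library function, which is the division of labour the RGL is meant to provide.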

Scaling up GF translation

The RGL was thus originally designed to be used just as its name says: as a library for application grammars. Only the latter were meant to be used as top-level grammars, i.e. for parsing, generation, and translation at run time. Little attention was therefore paid to the usability of the RGL as a top-level grammar by itself. But as applications accumulated, ranging from technical text to spoken dialogue, the coverage of the RGL grew to approximate a "complete grammar" of many of the languages. And recently, there has indeed been success in using the RGL as a wide-coverage translation grammar, mainly due to Krasimir Angelov's efforts to scale up the size of GF applications from language fragments to open-text processing. This success is a result of four lines of development:

Remaining problems

The result of all this work is a wide-coverage translation system, which can be used in the same way as Google Translate, Bing, Systran, and Apertium - to "translate anything", albeit with varying quality. At the moment of writing, the performance is not yet generally on a level with the best of the competition, but it shows some promising improvements in e.g. long-distance agreement and word order. To turn these advantages into absolute improvements, we will need to fix problems that the other systems (or at least some of them) get right but where GF translation often fails:

Advantages of GF translation

Once these issues are resolved, the strengths of the GF approach can be made more visible:

Wanted: more work, new ideas

The recipes for improvement are, as always, more work and new ideas. Each of the four weaknesses mentioned above can be alleviated by more work - in particular, lexical coverage by more work on the lexicon, since automatic extraction methods cannot really be trusted. As for disambiguation, new ideas about probabilistic tree models are being discussed. As for speed, new ideas on parsing (in particular, the integration of disambiguation with parsing) would help, although the complexity of grammatical structures also plays a major role. As for idiomaticity, more work is being done on introducing constructions (non-compositional syntax rules, generalizing the notion of multiword expressions and, in particular, of phrases in SMT), and new ideas are being discussed on how to extract such constructions from e.g. phrase tables.

In the following, we will focus on describing the role of grammar in the GF translation system - in particular, how RGL can be modified to become usable as a top-level grammar for translating open text. As RGL was not meant to be used for parsing open text, but rather for the controlled language generation task, it has serious restrictions:

What speaks for using RGL

Despite these problems, the RGL has proved to be a possible starting point for large-scale translation. A couple of advantages speak for it:

Of course, we are still left with the other option of addressing translation with an application grammar, something similar to the ResourceDemo with flatter and more semantic structures. But this would in turn require replicating many rules, even though much of that could be done with a functor, that is, with just one set of rules covering all languages.
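
The functor idea, one set of rules written against a language-independent interface and instantiated per language, can be sketched as follows. This is Python, not a GF functor: the `Language` record, `make_grammar`, and the two miniature language instances are all hypothetical names chosen for the illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass(frozen=True)
class Language:
    """Interface that every language instance must provide (hypothetical)."""
    this: Callable[[str], str]        # kind key -> noun phrase
    pred: Callable[[str, str], str]   # noun phrase, adjective phrase -> sentence
    very: Callable[[str], str]        # adjective phrase -> intensified phrase
    adj: Dict[str, str]               # quality key -> adjective

def make_grammar(lang: Language) -> dict:
    """The 'functor': domain rules written once, using only the interface."""
    def is_(kind: str, quality: str) -> str:
        return lang.pred(lang.this(kind), lang.adj[quality])

    def is_very(kind: str, quality: str) -> str:
        return lang.pred(lang.this(kind), lang.very(lang.adj[quality]))

    return {"Is": is_, "IsVery": is_very}

# Instance 1: English.
ENG_NOUN = {"Wine": "wine", "Beer": "beer"}
english = Language(
    this=lambda k: f"this {ENG_NOUN[k]}",
    pred=lambda np, ap: f"{np} is {ap}",
    very=lambda ap: f"very {ap}",
    adj={"Good": "good", "Fresh": "fresh"},
)

# Instance 2: German, where the determiner depends on the noun's gender.
GER_NOUN = {"Wine": ("Wein", "dieser"), "Beer": ("Bier", "dieses")}
german = Language(
    this=lambda k: f"{GER_NOUN[k][1]} {GER_NOUN[k][0]}",
    pred=lambda np, ap: f"{np} ist {ap}",
    very=lambda ap: f"sehr {ap}",
    adj={"Good": "gut", "Fresh": "frisch"},
)

for lang in (english, german):
    print(make_grammar(lang)["IsVery"]("Beer", "Good"))
```

Adding a language means supplying one more `Language` instance; the rules in `make_grammar` stay untouched. The approach breaks down where languages need structurally different rules, which is where hand-written, language-specific rules come back in.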

The structure of the wide-coverage translation grammar

Thus the path chosen is a mixture of RGL and application grammar. In brief, the translation grammar consists of

The following picture shows the principal module structure of the translation grammar.

Here is a description of each of the modules:

Where and why the translation grammar differs from the RGL

A guiding principle is thus that the translation grammar preserves as much as possible of the RGL, so that duplicated work is avoided. But as the purposes of the two are different, not everything is possible. Two diverging principles have already been mentioned:

The old design principles of the RGL are thus kept in force, and this is made possible by separating parts of the translation grammar modules from the RGL.