Sunday, 24 February 2013

Meta-II: An Early Meta-Compiler

What is a meta-compiler? Before answering that question we need to take two steps back. First, what's a compiler? A compiler reads a program written in some source language and transforms it into a (more-or-less) equivalent version expressed in a target language. For example, a C compiler takes the usual textual representation of a C program and generates object code in some target language such as assembler code or virtual machine instructions. The resulting object code is equivalent (in some sense) to the source code, but it can be run directly on a physical machine or in an interpreter.

A compiler-compiler reads a "program" which specifies the grammar of a source language and some transformation rules. The object code it generates is a compiler for that source language. (So a compiler-compiler is also compiler, it's just that its source code is your grammar-plus-transformation-rules and its output is object code which directly executes those rules and transformations.) The idea is that writing a grammar and some transformation rules is a lot less work than writing a whole compiler. Instead you write a more declarative description of what you want to happen and the compiler-compiler generates the code that deals with the fiddly details of pattern-matching and code-generation.

So now we can answer the question: "What's a meta-compiler?" A meta-compiler is a compiler-compiler written in its own source language. So when the meta-compiler reads its own description, it generates exactly its own object code, a fixed-point if you want to think of it that way. It's a neat trick, and clearly an easier way to build a compiler-compiler — provided you already have a compiler-compiler to start with.

I first came across Meta-II when I read Experimenting with Programming Languages by Alessandro Warth. In this PhD thesis Warth explains some interesting implementation details of the meta-compiler OMeta. (This meta-compiler has been used to implement many different languages at the Viewpoints Research Institute and elsewhere. These languages and the systems written with them are key evidence to support Alan Kay's assertion that we can build systems with orders of magnitude less source code than is customary today.)

Now, OMeta started out as an experiment based on the much earlier Meta-II, so when I wanted to play with a meta-compiler, this earlier, simpler program seemed like a good place to start out. (I figured that by playing with this program I would come to understand what was wrong with it, what needed changing to make the more advanced OMeta. That's certainly been the case, but I'm not entirely sure it was worth the effort.)

Meta-II is described in Meta II: A Syntax-Oriented Compiler Writing Language by Val Schorre (1964), and although this paper is a nice read, I found that the best place to start playing is James Neighbors' Tutorial: Metacompilers Part 1 and the accompanying Metacompiler Workshop. This "workshop" is a web-page with an extended version of Meta-II implemented in Javascript, so you can put your grammar description in one window, and get the code for a compiler in another window, and then use that compiler in turn to compile a program. If you want to do what I did, you can work your way through Neighbors' tutorial and then port a version of Meta-II off the web-page and into a self-contained program. (You can find the results of my slightly-too-clever-for-its-own-good port to Python here on github.)

I had decided that I wanted a version very close to Schorre's original Meta-II, rather than the extended version constructed by Neighbors. This choice in turn forced me to implement a virtual machine in my Python runtime, rather than generating native Python control constructs. That's because the original Meta-II's output operations are tuned to generating assembler code line-by-line, with only two levels of indentation. Obviously that's not going to fit well with Python, where indentation indicates block structure. So instead my code reads the generated "assembler code" and effectively assembles, loads and links it into a form where it can be interpreted directly. Quite a lot of extra work to stay with the original version of Meta-II, and in the end not worth the effort, other than for the learning experience.

So, what did I learn? And what would I change about Meta-II so that it was more useful but still simple, without going all the way to the complexity of OMeta? (Which adds a full programming language to do the output transformations.)
  • It would be better to have a somewhat more expressive language in which to write the transformation rules, and that way avoid having to put so much work into a virtual machine to interpret the meta-compiler's output code. (Since Neighbors and Warth both came to the same conclusion, this does seem blindingly obvious in retrospect.) Probably the first step would be to follow Neigbors' example and introduce a mechanism for more general indentation of output, so it would then be possible to output nicely formatted representations of a program — for example S-expressions.

  • Both the original Meta-II and Neighbors' extension only have access, in their output operations, to the last token matched. OMeta is a lot more flexible here, since you can give names to the values of any matching rules and then use the values later in a different order. This facility would also remove the need for a special-purpose token-recognition sub-language which Neighbors introduces in his implementation.

  • OMeta uses backtracking when a rule doesn't match, but Meta-II has to make its decisions based only on the next token. In practice, you can still parse lots of languages, but it's an annoying restriction, best removed. Fortunately this is relatively straightforward. Neighbors explains how to do it, though he doesn't give an implementation. OMeta's implementation also improves backtracking performance using a caching technique: it's a "packrat parser". (See for example Packrat Parsing: a Practical Linear-Time Algorithm with Backtracking by Bryan Ford.)

  • Lastly, Meta-II cannot handle left-recursive rule definitions, because they would not terminate. If you have a grammar that is naturally left-recursive, you therefore have to rewrite it without the left-recursion, which you can always do, but this isn't really very satisfactory. How confident can you be that the subsequent parse tree and your interpretation of it really is equivalent to the intent of the original grammar? It's a problem waiting to bite you later. So a further useful improvement would be to use Warth's neat technique that makes left-recursion work in a packrat parser.
You can only really understand things properly by making them yourself, so my next step will be to implement some of these ideas, and I'll report on how that went in some future post. In the mean time I would encourage you to look at the original Meta II paper and to have a play with Neighbors' Metacompiler Workshop. It's great fun!

1 comment:

  1. It's true META II has limitations. You should look at TREEMETA. There is an paper on CWIC in the ACM archives. TREEMETA solves the left recursion. I do not mean you can specify left recursion. TREEMETA builds a tree and the tree can be built as left handed or right handed. A loop as META II

    xp = term $(('+':ADD | '-':SUB) term !2); generates a left handed tree.

    xp = term (('+':ADD | '-':SUB) xp!2); generates a right handed tree.

    CWIC is TREEMETA on steroids. Schorre and Book compined languages. The meta language of TREEMETA combined with LISP 2 (The algol like LISP).

    There were many compiler writing system in the 1960's. As I understand the term metacompiler is that it is a metalanguage compiler.. Simply a contraction of the two making metacompiler. That is basically the original concept. Schoree the author of META II said of that META II is is a base from which better more powerful metalanguages can be created. The first META I compiler was a hand compiler metacompiler. We have an antagonistic discussion going on over at wiki metacompiler topic. I say it is a metalanguage compiler. The TREEMETA document says basically the same thing. An ACM document says nothing about compiling it's self. The Forth people believe forth to be a metacompiler because it is writen in forth and compiles it's self. The BOOK4 and BOOK5 meta compilers were written in LISP 2. I do not think self compiling is all that big a deal. Almost all general purpose language compilers are written in them selves.

    My adversary there is really pushing self compiling. I think it is demeaning to make self compiling the defining attribute. The metacompilers are an elegant soloution to compiler writing. They got so good the government classified the SDC metacompilers top-secret as sensitive technology. CWIC had backtracking only limited by memory available.