This is a followup to my earlier post on converting the Emacs C code into Common Lisp. This one is a bit more technical, diving into some specifics of the conversion process.
One important fact is that we do not need to convert an arbitrary C program to Common Lisp. This might or might not be efficiently possible, but we do not care. We only need to convert Emacs. This is simpler for two reasons. First, we can just ignore any C construct that Emacs does not use. If the translator barfs after some new update, we can fix it then. Second, Emacs is already written in a relatively Lispy style, being a Lisp implementation itself. We further exploit this by allowing the translator to know some details about Emacs. As a trivial example, all the Smumble globals created by the DEFUN macro need not be translated into Common Lisp as structure constants: they are an artifact of the implementation, and will show up directly in the generated
What to ignore
A good portion of Emacs is simply redundant in the CL world. There are a few types (cons, vector, integers, functions) that are shareable — in fact, sharing these is part of the goal of this effort. There are also a number of functions which are effectively identical. There are also entire redundant modules, like the garbage collector, or the bytecode interpreter.
The question is how to have the translator differentiate between what is useful and what is not, without breaking builds of future versions of Emacs.
I don’t currently think there is a high road to solving this problem. For modules like the GC, I plan to have ad hoc translator rules for the particular source files. For functions and data types, I’m adding new GCC attributes that I can use to mark the ignorable definitions.
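The two kinds of rule described above (per-file skips and attribute-marked definitions) could be combined in a single predicate. This is only a minimal sketch: `SKIPPED_FILES`, `IGNORE_ATTRIBUTE`, the attribute name `cl_ignore`, and the choice of files are all assumptions made for illustration, not names taken from the actual translator.

```python
import os

# Assumed rule tables; the post does not list the real file names or attribute.
SKIPPED_FILES = {"alloc.c", "bytecode.c"}  # hypothetical: GC and bytecode sources
IGNORE_ATTRIBUTE = "cl_ignore"             # hypothetical attribute name


def should_translate(source_file, attributes):
    """Return True if a definition should be emitted as Common Lisp.

    source_file: path of the C file that holds the definition.
    attributes:  names of the GCC attributes attached to the definition.
    """
    # Ad hoc rule: whole modules that are redundant in the CL world.
    if os.path.basename(source_file) in SKIPPED_FILES:
        return False
    # Per-definition rule: anything explicitly marked as ignorable.
    return IGNORE_ATTRIBUTE not in attributes
```

Keeping the file list and the attribute name in one place means a new Emacs release that moves code around only needs a change to the tables, not to the translation logic.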
There are two type-related issues that arise when translating the source.
First, how should Emacs-specific types be represented? Primarily these types are structures, like struct buffer or struct string (we cannot use the CL string type, because Emacs adds properties directly to the string, and Emacs has its own idiosyncratic character handling). My answer here is to just straightforwardly translate them to
The other question is what to do with the types of local variables when translating a C function. For the most part I am pretending that they don’t exist. This works fine except for local arrays and structures, but these are easily handled by initializing variables properly. My rationale is that while this is slower, it lets me get something working more quickly, and we can always update the translator to emit CL type declarations later on.
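To make the "initialize variables properly" idea concrete, here is a rough sketch of how a translator might emit an untyped binding for each C local. The function name `emit_local` is hypothetical, and the sketch assumes that structures have a `make-` constructor on the Lisp side (as a `defstruct` would provide); the post does not confirm either detail.

```python
def emit_local(name, c_type, array_length=None):
    """Return a Common Lisp LET binding for a C local, dropping its type.

    Scalars get no type information at all. Arrays and structures need a
    real object to exist before the translated body touches them.
    """
    if array_length is not None:
        # Local array: allocate an untyped CL vector of the same length.
        return "(%s (make-array %d))" % (name, array_length)
    if c_type.startswith("struct "):
        # Assumed convention: struct foo_bar is built by (make-foo-bar).
        ctor = "make-" + c_type[len("struct "):].replace("_", "-")
        return "(%s (%s))" % (name, ctor)
    # Everything else: an untyped variable with a placeholder value.
    return "(%s nil)" % name
```

For example, `emit_local("buf", "char", 16)` yields a binding that calls `make-array`, while a plain `int` local becomes an untyped variable bound to `nil`.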
This simple approach doesn’t actually cover all the needed cases. For example, there is code in Emacs that takes the address of a local variable and passes it somewhere. This is easy to deal with; much of the remaining work is just digging through the code looking for special cases to clean up.
I’m similarly omitting type declarations from the generated structures. One possible nice side effect of this approach is that it will make it easier to lift Emacs’ file-size restrictions, because there will no longer be any code assuming that the size is a
Many low-level details of the Emacs implementation are hidden in macros. For example, Emacs stuffs some type information into the low-order bits of pointers. It uses macros to add or remove this information. For this build, I redefine these macros to do nothing. This makes the GCC Gimple representation much closer to the abstract meaning of the program, and thus simpler to translate.
There are also some macros that are useful to redefine so that we can more easily hook into them from the translator. For example, Emacs has a C macro INTEGERP that is used to check whether its argument is an integer. Normally this macro uses bit twiddling to get its answer, but I redefine it like so:
extern Lisp_Object *INTEGERP (Lisp_Object);
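Once a macro like this is an ordinary external function, the translator sees a plain call and can map it by name. A minimal sketch of that lookup follows; the table `KNOWN_CALLS` and the function `translate_call` are hypothetical names, and only the INTEGERP to `integerp` pairing is taken from the translated output shown later in this post.

```python
# Assumed mapping from redefined Emacs macros to Common Lisp functions.
KNOWN_CALLS = {"INTEGERP": "integerp"}


def translate_call(fn_name, args):
    """Translate a C call into a Lisp form, renaming known hooks.

    fn_name: name of the called C function.
    args:    already-translated argument expressions, as strings.
    """
    # Unknown functions keep their C name and are emitted unchanged.
    lisp_name = KNOWN_CALLS.get(fn_name, fn_name)
    return "(%s)" % " ".join([lisp_name] + list(args))
```

Here `translate_call("INTEGERP", ["n"])` gives `(integerp n)`, and a call that is not in the table passes through under its original name.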
The translator is not nearly complete, but it can already do a fair job at translating simple functions. For example, here is “forward-point” from the Emacs C code:
DEFUN ("forward-point", Fforward_point, Sforward_point, 1, 1, 0,
doc: /* Return buffer position N characters after (before if N negative) point. */)
return make_number (PT + XINT (n));
Here is what the translator comes up with:
(defun Fforward_point (n)
(block nil (tagbody
; no gimple here
; no gimple here
(setf temp-var-0 (integerp n))
(if (== temp-var-0 nil)
(setf Qintegerp.316 Qintegerp)
(wrong_type_argument Qintegerp.316 n)
(setf current_buffer.317 current_buffer)
(setf temp-var-2 (buffer-pt current_buffer.317))
(setf temp-var-1 (+ temp-var-2 n))
(defun elisp:forward-point (arg0)
The output looks pretty weird, because the translator works after GCC’s CFG is built, and so the most straightforward translation is to use this mess with tagbody. I doubt this matters much, but in any case the translator is readily hackable: it is still fewer than 400 lines of Python, including comments.
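The reason a CFG maps so directly onto tagbody is that each basic block becomes a tag followed by its statements, and each edge becomes a `go`. The sketch below shows only the emission step, under the assumption that blocks arrive as (label, statements) pairs with statements already translated; `emit_tagbody` and the `bb-N` label style are made up for illustration.

```python
def emit_tagbody(blocks):
    """Emit a (block nil (tagbody ...)) form from basic blocks.

    blocks: list of (label, [statement strings]) pairs in CFG order.
    Jumps between blocks are expected to appear in the statements
    themselves as (go label) forms, and returns as (return value).
    """
    lines = ["(block nil (tagbody"]
    for label, statements in blocks:
        lines.append(" " + label)        # a bare symbol is a tagbody tag
        for stmt in statements:
            lines.append("  " + stmt)
    lines.append("))")                   # close tagbody, then block
    return "\n".join(lines)
```

A nicer translation would try to recover structured `if` and `loop` forms, but that requires control-flow analysis that this flat scheme avoids.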
One thing to note is the translation of “PT”. This is actually a macro that refers to the current buffer:
#define PT (current_buffer->pt + 0)
The translator properly turns this into a reference to “
Another detail is the handling of packages. My plan is to put the Emacs implementation into one package, and then any elisp into a second package called “
DEFUN in the C code will actually generate two functions: the internal one, and the elisp-visible one; hence the “elisp:” in the translation.
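As a rough illustration of the second, elisp-visible function, the sketch below generates a wrapper that simply forwards its arguments to the internal one. The `arg0` naming follows the translated output shown earlier; the wrapper body and the function name `emit_elisp_wrapper` are assumptions, since the post does not show them.

```python
def emit_elisp_wrapper(lisp_name, c_name, nargs):
    """Emit the elisp-visible defun that forwards to the internal function.

    lisp_name: the name given as the first DEFUN argument, e.g. forward-point.
    c_name:    the internal function name, e.g. Fforward_point.
    nargs:     number of arguments the function takes.
    """
    args = ["arg%d" % i for i in range(nargs)]
    call = " ".join([c_name] + args)
    # Assumed body: a direct call to the internal function.
    return "(defun elisp:%s (%s)\n  (%s))" % (lisp_name, " ".join(args), call)
```

This version only handles a fixed argument count; a DEFUN with optional or unevalled arguments would need a different lambda list.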
There’s still a good amount of work to be done. The converter punts on various constructs; type translation is implemented but not actually wired up to anything; the translator should emit definitions for alien functions; and plenty more.