One component of the incremental compiler is re-using parsed header files. In fact, for C++ this is where I expect to gain performance improvements in the “normal compilation” case. A couple weeks ago I realized that I could easily try this idea, for C, by implementing parts of it and then testing it using --combine. I took a break from removing globals from the C front end and wrote a trial implementation this week.
This code works by conceiving of a C compilation unit as a sequence of “hunks”, currently defined as “tokens between two file change events”. Then it computes the checksum of the tokens in each hunk. (In order to do this I had to change the C compiler to lex the whole compilation unit at once — this was easy and makes it better match the C++ compiler.)
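To make the idea concrete, here is a minimal sketch of what a hunk and its checksum might look like. The token type, the hunk struct, and hunk_checksum are hypothetical stand-ins for illustration, not the C front end’s actual data structures.

```c
/* A hypothetical token: just enough structure to checksum.  The real
   front end's token type is much richer.  */
struct token
{
  int type;               /* Token kind.  */
  const char *spelling;   /* Identifier or literal text, or NULL.  */
};

/* A "hunk": the tokens between two file change events, keyed by a
   checksum of their contents.  */
struct hunk
{
  const struct token *first;  /* First token in the hunk.  */
  const struct token *last;   /* One past the last token.  */
  unsigned int checksum;      /* Hash of the token sequence.  */
};

/* Compute a simple checksum over the tokens of a hunk.  */
static unsigned int
hunk_checksum (const struct token *first, const struct token *last)
{
  unsigned int sum = 0;
  for (const struct token *t = first; t != last; ++t)
    {
      sum = sum * 31 + (unsigned int) t->type;
      if (t->spelling)
        for (const char *p = t->spelling; *p; ++p)
          sum = sum * 31 + (unsigned char) *p;
    }
  return sum;
}
```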
While parsing a hunk for the first time, we note all its top-level declarations by hooking into bind(). We save these for later.
Then in the top-level parsing loop, if we notice that the next hunk matches a hunk in our map, we simply copy in the bindings we saved, set the token pointer to the end of the hunk, and carry on.
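Continuing the sketch above, the save/reuse step might look roughly like this. The binding list, the hunk map, and reinstall_binding are again hypothetical names; the real prototype hooks into the front end’s bind() rather than keeping its own symbol table.

```c
#include <stdlib.h>

struct token;   /* From the earlier sketch.  */

/* A saved top-level declaration, recorded from the bind() hook.  */
struct binding
{
  const char *name;          /* Declared identifier.  */
  void *decl;                /* The tree built for it.  */
  struct binding *next;
};

/* One previously parsed hunk.  */
struct saved_hunk
{
  unsigned int checksum;     /* Key computed over the hunk's tokens.  */
  const struct token *end;   /* Where parsing should resume.  */
  struct binding *bindings;  /* Top-level decls seen in the hunk.  */
  struct saved_hunk *next;
};

static struct saved_hunk *hunk_map;   /* Really a hash table in practice.  */

/* Hypothetical: push a saved declaration back into the current scope.  */
extern void reinstall_binding (const char *name, void *decl);

/* Called while parsing a hunk for the first time: remember a
   top-level declaration for later reuse.  */
static void
save_binding (struct saved_hunk *h, const char *name, void *decl)
{
  struct binding *b = malloc (sizeof *b);
  b->name = name;
  b->decl = decl;
  b->next = h->bindings;
  h->bindings = b;
}

/* In the top-level parsing loop: if the upcoming hunk matches a saved
   one, re-install its bindings and skip straight past its tokens.
   Returns the position to continue parsing from.  */
static const struct token *
maybe_reuse_hunk (unsigned int checksum, const struct token *pos)
{
  struct saved_hunk *h;
  for (h = hunk_map; h; h = h->next)
    if (h->checksum == checksum)
      {
        struct binding *b;
        for (b = h->bindings; b; b = b->next)
          reinstall_binding (b->name, b->decl);
        return h->end;        /* Resume after the hunk.  */
      }
  return pos;                 /* No match: parse the hunk normally.  */
}
```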
There’s some extra bookkeeping here — we only want “top level” hunks, not ones where we saw a file change event in the middle of a function or structure or something; and we can only reuse a hunk if the declarations used by the code in the hunk were also reused. I haven’t yet implemented the latter case, but I’ll have to since I want to test this on GCC itself, and GCC uses constructs like this.
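That dependency check is not implemented yet, but one possible shape for it (purely hypothetical names here) is to record which bindings a hunk’s code used and, at reuse time, verify that each of them still resolves to the same declaration:

```c
/* A dependency recorded while parsing a hunk: the hunk's code used
   NAME, and it resolved to DECL at the time.  */
struct hunk_dep
{
  const char *name;
  void *decl;
  struct hunk_dep *next;
};

/* Hypothetical lookup of NAME in the current scope.  */
extern void *current_binding_for (const char *name);

/* A hunk is only a reuse candidate if every binding it depended on
   still resolves to the same declaration as last time.  */
static int
hunk_deps_still_valid (const struct hunk_dep *deps)
{
  const struct hunk_dep *d;
  for (d = deps; d; d = d->next)
    if (current_binding_for (d->name) != d->decl)
      return 0;   /* Something changed: reparse the hunk.  */
  return 1;
}
```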
I was able to run it on a smaller program (zenity). We skipped parsing 1.4 million tokens in this program. Naturally, whether or not this is a performance win (I didn’t look at that yet, and honestly for C I am not expecting it) depends on the overhead of the bookkeeping. However, it should reduce memory use, making IMA a bit more practical. And, it gives a good idea of the kind of textual redundancy common in C programs.
Note that this is experimental. I don’t like --combine all that much, since it requires users to change their Makefiles, and because it doesn’t work with C++. However, it provided a simple testbed.
6 Comments
Interesting. I really like the idea of being able to share compilation like this. The ability to skip parsing 1.4 million tokens is awfully cool.
Tom,
I briefly pondered this header issue too. I would like to be able to do something similar for static analysis.
CPP makes it next to impossible to analyze one file at a time. I wonder if this “hunks between file change events” idea could be formalized more. Perhaps a preprocessed file could be transformed into a number of standalone smaller “files” that can be analyzed/compiled individually and would somewhat map to the original source code.
The first time a .i file is processed, the compiler could try to figure out which files form individual AST nodes and mark them appropriately so they can be processed separately on subsequent runs.
Basically I’m interested in fighting CPP-induced problems. Does this make any sense?
Hi Tom,
is this true? Are you seriously working on making GCC an incremental compiler?
Robert: I don’t want to say I’m seriously working on it. Let’s say that I’m investigating it… there are still unknowns, including how long Red Hat will let me hack on this 🙂
Taras: Initially I wanted to do a “full” analysis involving the preprocessor. But, this looks hard, and perhaps not necessary — I think cpp incrementalism can be handled in a pipelined way. Anyway, if you want to track back to the original source, then don’t use .i files, but instead use libcpp directly. That way you can get source location information for each token. I think with some modifications you could get something even better, like the “real location” of a piece of text in the source, even if it was originally an argument to a macro. I’m not sure if this answers your question … ?
Thanks! Libcpp could’ve saved me a lot of time. Too bad there is next to no info on it online, so I didn’t come across it earlier 🙁
I think I misunderstood your blog. I thought by file change events you meant the CPP #line stuff, but it looks like you meant events in terms of actual changes to the file.
How do you plan to receive these “events”? Is it going to be by taking the last-compiled .i (or maybe by comparing ASTs) and diffing it against the newer one?
Nope, I did mean cpp #line stuff, more or less. If you use libcpp there is a callback you can set so that you are notified when a file boundary is crossed.
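Roughly, the hookup looks like this (a sketch only; the exact callback signature and line_map field names have varied between libcpp versions, and the cpp_reader setup is left out):

```c
#include <stdio.h>
#include "cpplib.h"

/* Invoked by libcpp whenever a file boundary is crossed, e.g. when
   entering or leaving an #include.  */
static void
on_file_change (cpp_reader *reader, const struct line_map *map)
{
  (void) reader;
  if (map)
    printf ("file boundary: now in %s\n", map->to_file);
}

/* Register the callback on an already-initialized reader.  */
static void
register_file_change_callback (cpp_reader *reader)
{
  cpp_get_callbacks (reader)->file_change = on_file_change;
}
```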
I chose file change events since these are simple and convenient, and because I thought that constructs would rarely cross file boundaries. This may not really be the best breakdown for C++ programs; there it might be nicer to use a finer granularity. That is a bit difficult, though; it might mean some “rough parsing” to get the correct bounds of top-level objects.
How the incrementalism is done will be a subject for a future post. I’m still looking into a few options here.