One component of the incremental compiler is re-using parsed header files. In fact, for C++ this is where I expect to gain performance improvements in the “normal compilation” case. A couple weeks ago I realized that I could easily try this idea, for C, by implementing parts of it and then testing it using --combine. I took a break from removing globals from the C front end and wrote a trial implementation this week.
This code works by conceiving of a C compilation unit as a sequence of “hunks”, currently defined as “tokens between two file change events”. Then it computes the checksum of the tokens in each hunk. (In order to do this I had to change the C compiler to lex the whole compilation unit at once — this was easy and makes it better match the C++ compiler.)
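To make the hunk idea concrete, here is a minimal sketch in C. It is not the code from my patch: the token and hunk structures, the function names, and the choice of FNV-1a as the checksum are all assumptions made for illustration. It splits a token array at file change events and checksums the token spellings in each hunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical token type; a real front end carries much more state. */
enum tok_kind { TOK_FILE_CHANGE, TOK_OTHER };

struct token {
    enum tok_kind kind;
    const char *spelling;   /* token text; unused for TOK_FILE_CHANGE */
};

struct hunk {
    size_t start;           /* index of the first token in the hunk */
    size_t end;             /* one past the last token in the hunk */
    uint64_t checksum;      /* checksum over the token spellings */
};

#define FNV_OFFSET 14695981039346656037ULL
#define FNV_PRIME  1099511628211ULL

/* FNV-1a over one spelling, picked only because it is short
   (an assumption, not what the compiler necessarily uses). */
static uint64_t hash_spelling(uint64_t h, const char *s)
{
    for (; *s; s++) {
        h ^= (unsigned char)*s;
        h *= FNV_PRIME;
    }
    /* Mix in a separator so "a" "bc" and "ab" "c" differ. */
    h ^= 0xff;
    h *= FNV_PRIME;
    return h;
}

/* Split toks[0..n) into hunks at file change events.
   Empty hunks are skipped. Returns the number of hunks stored in out. */
static size_t split_hunks(const struct token *toks, size_t n,
                          struct hunk *out, size_t max)
{
    size_t count = 0, start = 0;
    uint64_t h = FNV_OFFSET;

    for (size_t i = 0; i <= n; i++) {
        int boundary = (i == n) || toks[i].kind == TOK_FILE_CHANGE;
        if (!boundary) {
            h = hash_spelling(h, toks[i].spelling);
            continue;
        }
        if (i > start) {
            assert(count < max);
            out[count++] = (struct hunk){ start, i, h };
        }
        start = i + 1;      /* next hunk begins after the event */
        h = FNV_OFFSET;
    }
    return count;
}
```

Two hunks with the same token spellings in the same order get the same checksum, which is what lets a header included from several source files be recognized the second time around.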
While parsing a hunk for the first time, we note all its top-level declarations by hooking into bind(). We save these for later.
Then in the top-level parsing loop, if we notice that the next hunk matches a hunk in our map, we simply copy in the bindings we saved, set the token pointer to the end of the hunk, and carry on.
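The reuse step can be sketched roughly like this. Again, this is an illustration under assumptions rather than the real patch: the fixed-size map, the scope structure, and names such as try_reuse_hunk are hypothetical, and declarations are reduced to plain names.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_HUNKS 64
#define MAX_DECLS 8
#define MAX_SCOPE 256

/* What we remember about a hunk after parsing it once (hypothetical). */
struct saved_hunk {
    uint64_t checksum;
    size_t ntokens;                 /* length of the hunk in tokens */
    const char *decls[MAX_DECLS];   /* names bound at top level */
    size_t ndecls;
};

struct hunk_map {
    struct saved_hunk entries[MAX_HUNKS];
    size_t count;
};

/* Stand-in for the file scope: just a list of bound names. */
struct scope {
    const char *names[MAX_SCOPE];
    size_t count;
};

/* Record a parsed hunk. Returns 0 on success, -1 if the map is full. */
static int hunk_remember(struct hunk_map *m, const struct saved_hunk *h)
{
    if (m->count == MAX_HUNKS)
        return -1;
    m->entries[m->count++] = *h;
    return 0;
}

/* Linear search keeps the sketch short; a hash table is the obvious choice. */
static const struct saved_hunk *hunk_lookup(const struct hunk_map *m,
                                            uint64_t checksum, size_t ntokens)
{
    for (size_t i = 0; i < m->count; i++)
        if (m->entries[i].checksum == checksum
            && m->entries[i].ntokens == ntokens)
            return &m->entries[i];
    return NULL;
}

/* One step of the top-level loop. If the hunk starting at pos is known,
   copy its saved bindings into the scope and return the position just
   past the hunk. Otherwise return pos unchanged so the caller parses
   the hunk normally. */
static size_t try_reuse_hunk(const struct hunk_map *m, struct scope *sc,
                             size_t pos, uint64_t checksum, size_t ntokens,
                             int *reused)
{
    const struct saved_hunk *h = hunk_lookup(m, checksum, ntokens);

    *reused = 0;
    if (h == NULL || sc->count + h->ndecls > MAX_SCOPE)
        return pos;
    for (size_t i = 0; i < h->ndecls; i++)
        sc->names[sc->count++] = h->decls[i];
    *reused = 1;
    return pos + h->ntokens;
}
```

Comparing the token count along with the checksum is a cheap extra guard against collisions; a more careful version would compare the tokens themselves before trusting a match.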
There’s some extra bookkeeping here — we only want “top level” hunks, not ones where we saw a file change event in the middle of a function or structure or something; and we can only reuse a hunk if the declarations used by the code in the hunk were also reused. I haven’t yet implemented the latter case, but I’ll have to since I want to test this on GCC itself, and GCC uses constructs like this.
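Since the dependency check is not written yet, the following is only a guess at its shape. It assumes each hunk records the ids of the hunks whose declarations it uses, plus a flag saying whether it was a top-level hunk; all of these names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEPS 8

/* Hypothetical per-hunk bookkeeping. */
struct hunk_info {
    size_t deps[MAX_DEPS];  /* ids of hunks whose declarations we use */
    size_t ndeps;
    bool top_level;         /* false if a file change event occurred
                               inside a function, structure, etc. */
};

/* reused[id] is true if hunk id was reused, rather than re-parsed,
   earlier in the current compilation unit. A hunk may be reused only
   if it is a top-level hunk and everything it depends on was reused. */
static bool can_reuse(const struct hunk_info *h, const bool *reused)
{
    if (!h->top_level)
        return false;
    for (size_t i = 0; i < h->ndeps; i++)
        if (!reused[h->deps[i]])
            return false;
    return true;
}
```

The reasoning is that if a hunk's dependencies were re-parsed, they may now refer to different declarations, so the saved bindings for the hunk can no longer be trusted.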
I was able to run it on a smaller program (zenity). We skipped parsing 1.4 million tokens in this program. Naturally, whether or not this is a performance win (I didn’t look at that yet, and honestly for C I am not expecting it) depends on the overhead of the bookkeeping. However, it should reduce memory use, making IMA a bit more practical. And, it gives a good idea of the kind of textual redundancy common in C programs.
Note that this is experimental. I don’t like --combine all that much, since it requires users to change their Makefiles, and because it doesn’t work with C++. However, it provided a simple testbed.