Last week I implemented anti-dependencies in the C front end. This wasn’t difficult enough to write about, but it provides an excuse to describe dependencies and incrementalism in general.
The fundamental idea in speeding up the front end is that it should avoid parsing whenever possible. If we’ve already parsed a function definition in some earlier compilation, it is faster to reuse that definition than to try to parse it again. (At least, that is true if noticing that we can reuse it is faster than parsing. But it always is. Note also that this will be a much bigger win for C++ than for C.)
Reuse works by computing a checksum over a run of tokens (one of these runs is called a “hunk” in the implementation, a name that vaguely embarrasses me). The parse results for each hunk are stored in a table keyed by checksum; if we see a checksum we have seen before, we reuse the stored results instead of parsing again.
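To make that concrete, here is a minimal sketch of the lookup. All the names here (checksum_tokens, hunk_table_find, and so on) are hypothetical stand-ins, not the real implementation:

#include <stddef.h>
#include <stdint.h>

typedef uint64_t checksum_t;
struct parsed_hunk;                  /* saved results of one parse */
struct token_run { const void *tokens; size_t len; };

/* Hypothetical helpers standing in for the real front end.  */
extern checksum_t checksum_tokens (const struct token_run *);
extern struct parsed_hunk *hunk_table_find (checksum_t);
extern void hunk_table_insert (checksum_t, struct parsed_hunk *);
extern struct parsed_hunk *parse_hunk (const struct token_run *);

/* Reuse an earlier parse of RUN if the table has one; otherwise
   parse it and remember the result for next time.  */
static struct parsed_hunk *
get_or_parse (const struct token_run *run)
{
  checksum_t sum = checksum_tokens (run);
  struct parsed_hunk *hunk = hunk_table_find (sum);
  if (hunk != NULL)
    return hunk;                     /* seen before: skip the parse */
  hunk = parse_hunk (run);
  hunk_table_insert (sum, hunk);
  return hunk;
}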
Of course, the situation is more complicated than that. A bit of code may depend on other bits of code, for instance:
typedef int myint;
extern myint variable;
Here we can’t reuse variable in isolation; we can reuse it only if we’ve already reused the previous line. So, each hunk carries with it some dependency information. Before last week, this was just a list of the checksums of all the hunks on which the current hunk depends.
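In sketch form, that scheme looked something like this (again, the names are mine, not the implementation’s):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t checksum_t;

/* Before last week: each hunk recorded the checksums of the hunks
   it depends on.  */
struct hunk
{
  checksum_t sum;            /* checksum of this hunk's tokens */
  checksum_t *deps;          /* checksums of prerequisite hunks */
  size_t n_deps;
};

/* A hunk may be reused only if every prerequisite hunk was itself
   reused in the current compilation.  WAS_REUSED is a hypothetical
   predicate over this compilation's state.  */
static bool
deps_satisfied (const struct hunk *h, bool (*was_reused) (checksum_t))
{
  for (size_t i = 0; i < h->n_deps; i++)
    if (!was_reused (h->deps[i]))
      return false;
  return true;
}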
Even this dependency tracking is not enough for proper operation. Suppose we already compiled this program:
struct x;
struct x { int field; };
… and then try to compile:
struct x;
struct x { int bug; };
struct x { int field; };
With a naive dependency scheme, this would compile without error. The hunk holding the second definition of struct x is one we have parsed before, and its only recorded dependency, the forward declaration, has been met; so we would erroneously reuse it rather than reparse it and report the redefinition.
Anti-dependencies are the fix for this. When considering a hunk for reuse, we check not only that its dependencies have been met, but also that every declaration the hunk would reuse has the correct “prior value”. In fact, these two checks amount to the same thing; rather than recording the checksums of prerequisite hunks, we now simply record declarations.
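A sketch of what the new check might look like; decl_t stands in for the front end’s real representation of declarations, and lookup_decl is a hypothetical name lookup:

#include <stdbool.h>
#include <stddef.h>

typedef struct decl_node *decl_t;   /* stand-in for the real type */

/* Hypothetical: what is NAME currently bound to?  */
extern decl_t lookup_decl (const char *name);

/* One recorded dependency: a name the hunk looked up, plus the
   declaration that lookup produced when the hunk was first parsed
   (NULL if the name was unbound at that point).  */
struct decl_dep
{
  const char *name;
  decl_t prior_decl;
};

/* The anti-dependency check: the hunk is reusable only if every
   name is bound to the same declaration now as when the hunk was
   first parsed.  In the struct x example above, the tag's binding
   has changed, so reuse is refused and an ordinary reparse reports
   the redefinition.  */
static bool
prior_values_match (const struct decl_dep *deps, size_t n_deps)
{
  for (size_t i = 0; i < n_deps; i++)
    if (lookup_decl (deps[i].name) != deps[i].prior_decl)
      return false;
  return true;
}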
Given all this, how will incremental compilation work? My current plan is not to introduce any sort of special case. Instead, the parser will notify the back end when it parses a function, and a function that comes from a reused hunk will simply never be compiled again.
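Roughly, the per-hunk driver could look like this; reusable_p, rebind_declarations, and parse_and_compile are hypothetical:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t checksum_t;
struct parsed_hunk;
struct token_run;

/* Hypothetical helpers.  */
extern checksum_t checksum_tokens (const struct token_run *);
extern struct parsed_hunk *hunk_table_find (checksum_t);
extern bool reusable_p (const struct parsed_hunk *);
extern void rebind_declarations (struct parsed_hunk *);
extern void parse_and_compile (const struct token_run *);

/* A reused hunk re-enters scope without the back end ever hearing
   about it; only a freshly parsed hunk causes functions to be
   compiled.  */
static void
process_hunk (const struct token_run *run)
{
  struct parsed_hunk *hunk = hunk_table_find (checksum_tokens (run));
  if (hunk != NULL && reusable_p (hunk))
    {
      rebind_declarations (hunk);   /* its declarations are visible again */
      return;                       /* nothing new to compile */
    }
  parse_and_compile (run);          /* parse; notify back end per function */
}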
If you think about it a bit, you’ll see that the prerequisites form a dependency model of a compilation unit. If the developer changes a declaration, anything using that declaration will be invalidated, and this invalidation will itself propagate outward. So, the amount of recompilation will be proportional to the “importance” of the change. Change a central typedef, and many things may need recompilation. Change a comment, and nothing will (though handling debug info properly may be a challenge).
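Viewed as a graph, that propagation is just a walk over reverse dependencies. A minimal sketch, again with hypothetical names:

#include <stdbool.h>
#include <stddef.h>

/* Simplified; only the fields this sketch needs.  */
struct hunk
{
  bool invalid;             /* needs reparsing/recompilation */
  struct hunk **rdeps;      /* hunks that depend on this one */
  size_t n_rdeps;
};

/* Mark CHANGED, and everything that transitively depends on it, as
   needing recompilation.  The 'invalid' flag keeps the walk from
   visiting a hunk twice.  */
static void
invalidate (struct hunk *changed)
{
  if (changed->invalid)
    return;
  changed->invalid = true;
  for (size_t i = 0; i < changed->n_rdeps; i++)
    invalidate (changed->rdeps[i]);
}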
A future refinement is the idea of a “semantic dependency”: rather than checking prerequisite declarations by identity, we would check them for ABI compatibility. There are a few tricky details here, so this is a bit lower on my priority list for the moment.
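In terms of the prior-value sketch above, this amounts to relaxing the identity test to some hypothetical abi_compatible_p predicate:

/* Continuing the earlier sketch.  Hypothetical: true if NEW_DECL
   can stand in for OLD_DECL without changing layout, calling
   convention, or anything else the generated code relies on.  */
extern bool abi_compatible_p (decl_t old_decl, decl_t new_decl);

/* Semantic variant of prior_values_match: accept a different
   declaration object, as long as it is ABI-compatible with the one
   recorded at first parse.  */
static bool
prior_values_compatible (const struct decl_dep *deps, size_t n_deps)
{
  for (size_t i = 0; i < n_deps; i++)
    if (!abi_compatible_p (deps[i].prior_decl,
                           lookup_decl (deps[i].name)))
      return false;
  return true;
}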
You might have wondered how we compute the boundaries of a hunk, or how we can actually avoid recompiling functions. I’ll answer both of these in future posts.