How do we decide the boundaries of reuse in the incremental compiler?
The one crucial thing when looking at reuse is that the signature that you choose must be stable across compilations. And, it is better if the boundaries are cheap to compute and roughly line up with declaration boundaries. In the prototype implementation, I accomplished these by using file change events (conveniently noted by libcpp) as hunk boundaries. File change boundaries are roughly stable (moreso for system headers), are free (libcpp already computes them), and very frequently align with declaration boundaries. This last criterion is typically only broken by constructs where a file is included in the middle of a declaration — for instance, in gcc itself by including a
.def file in the middle of an enum, to populate it. I added a special case to the reuse code to account for these situations.
Aside from the convenience — which is very important for prototypes, but much less so for production — there was a historical reason that I chose file boundaries. When I first started thinking about writing an incremental compiler, I focused primarily on header files, and on having a single model which would encompass both the preprocessor and the compiler. After working through a number of cases, with the help of the prototype, I decided that this was “java thinking”; in C the primary unit of reuse is the declaration — not the file. And, it is much, much simpler to reason about C if you look at it after preprocessing.
In earlier posts I described the dependency computation that is needed for sharing declarations. One consequence of this is that the bigger the unit of reuse, the less often it will actually be reusable. At one extreme is gcc’s current PCH implementation, which requires a single header file to be included first in all compilations. With the file-boundary scheme, it turns out that even relatively ordinary programs fail to reuse in many situations. For instance, some headers in gcc have conditional declarations — so, change the order of includes, and an entire header’s worth of declarations must be re-parsed.
Note that failure to reuse is not a bug. Reuse is, after all, a performance and memory optimization. The worst consequence is that gcc will run as slowly, and will use as much memory, as it does now. Still, reuse is desirable. The more we reuse, and the smaller our units of reuse, the less work we have to do when recompiling.
So, this leads to the second implementation, which tries to guess declaration boundaries without parsing. This fast, if somewhat evil-looking, and it yields good results. This still isn’t perfect, though (it does not handle K&R-style definitions properly), so I will probably rewrite it again to do something closer to a quick parse.
I’m not sure whether this same approach makes sense for C++. I assume I will start here, because I understand it and know that it will work at least minimally. But, C++ brings its own problems. For instance, suppose you have an inline function defined in a class definition. In this case, if you change the body of the function, you’d like to be able to reuse the class definition, if possible — otherwise, many kinds of small changes will yield costly compilations. Tom Fitzsimmons suggested that perhaps the hunk computation can be made recursive in some way, meaning that perhaps it could operate inside of classes and namespaces, and not just at the top level. I’ll definitely be researching this once I start on the C++ front end.
The split between front and back end incremental processing means that each front end is free to implement its own approach to incrementality. I’m only looking at C and C++, but other people are free to hack on the other front ends. The actual changes to the middle-end interface are quite small, as it turns out.
What next? I think future posts will cover scalability, multi-processor support, and perhaps what to do about the C preprocessor. If there’s anything you want to know more about, post a comment and I will address it.