
Compile Server Scalability

There are a few aspects to compile server scalability that are important to address.

First, and most obviously, memory use. Because we want to be able to send real programs through the compile server, and because we want it to remain live for relatively long periods of time, it is important that memory use be “acceptably bounded”. Naturally, the server process will grow with each additional compilation unit. At least in the straightforward implementation, there’s no way around that (but see below). However, it is important that the server not leak memory, and that recompilations generally not increase memory use. Also, ideally, all that work on decl sharing will keep memory use in check.

For the most part, this did not take any effort to achieve. GCC has a built-in garbage collector, and most nontrivial data structures are allocated using the GC. This is not a silver bullet, of course, but it has yielded good results with little effort in practice.

In the case of recompilation, we employ a simple heuristic — we store all parsed hunks keyed off the name of the requested object file (note: not the input file; it is common for a project to compile a given source file multiple times, but it is rare to see the same object file name more than once). When recompiling an object, we assume that there will be a lot of reuse against the object’s previous version, so we store those hunks temporarily, but then discard the old ones at the end of compilation. This way, we reuse, but we can also free hunks which are no longer in use.
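
To make that concrete, here is a minimal sketch of the bookkeeping in plain C. The names (hunk, object_entry, and so on) are made up for illustration; the real code on the branch stores parsed trees and relies on the GC rather than malloc/free. Only the shape of the heuristic is the point.

#include <stdlib.h>

struct hunk
{
  unsigned int checksum;   /* Checksum of this run of tokens.  */
  void *parsed;            /* Whatever the parser produced for it.  */
  struct hunk *next;
};

/* One entry per requested object file name -- not per source file.  */
struct object_entry
{
  const char *object_name;
  struct hunk *current;    /* Hunks made or reused in this compilation.  */
  struct hunk *previous;   /* Hunks left over from the previous one.  */
};

/* Look for a reusable hunk from the previous compilation of this object.
   The caller moves a reused hunk onto the 'current' list so that it
   survives end_of_compilation.  */
struct hunk *
find_previous_hunk (struct object_entry *obj, unsigned int checksum)
{
  struct hunk *h;
  for (h = obj->previous; h; h = h->next)
    if (h->checksum == checksum)
      return h;
  return 0;
}

/* At the end of a compilation, anything still on the 'previous' list was
   not reused this time, so it can be dropped.  What was built this time
   becomes the 'previous' list for the next recompilation.  */
void
end_of_compilation (struct object_entry *obj)
{
  struct hunk *h = obj->previous;
  while (h)
    {
      struct hunk *next = h->next;
      free (h);              /* In the server this is left to the GC.  */
      h = next;
    }
  obj->previous = obj->current;
  obj->current = 0;
}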

Results from a few tests are very encouraging here. I compiled gdb with the compile server, then deleted the object files and re-compiled. Memory use (as reported by -fmem-report) stayed flat at around 51M — meaning that recompilation doesn’t grow the image, and the collection approach is working as desired.

I also built gdb using the compiler in “normal” mode and looked at the -fmem-report totals. Summing them up, which I naively expect gives a rough idea of how much memory --combine would use, yields 1.2G. In other words, decl sharing appears to make a huge difference (though I’m not completely confident in this particular number).

If memory use does become a problem for very large compiles, we could look at scaling another way: writing out hunks and reading them back in. Maybe we could use machinery from the LTO project to do this. This would only be useful if it is cheaper to read decls via LTO than it is to parse the source; if this is not cheaper then we could instead try to flush out (and force re-parsing of) objects which are rarely re-used. One special case of this is getting rid of non-inlineable function bodies — when we have incremental code-generation, we’ll never compile a function like that more than once anyway.

Another scalability question is how to exploit multiple processors, whether multi-core machines or compile farms. In an earlier post, I discussed making the compile server multi-threaded. However, that interacts poorly with our code generation approach (fork and do the work in the child), so I am probably not going to pursue it. Instead, for the multi-core case, it looks straightforward to simply run multiple servers — in other words, you would just invoke “gcc --server -j5”. Something similar can be done for compile farms.

An ideal result for this project would be for small changes to result in compilation times beneath your perceptual threshold. I doubt that will happen, but the point is that the absolute turnaround time is important. (This is not really a question of scalability, but I felt like talking about it anyway.)

In the current code, though, we always run the preprocessor for any change. So, even once incremental code generation is implemented, the turnaround time will be bounded below by the time it takes to preprocess the source. This might turn out to be a problem.

In an earlier design (and in some other designs I have heard of), this is handled by making a model of compilation that includes preprocessing. That seems too complicated to me, though, and instead I think that it should be possible to also make an incremental preprocessor (say, one that uses inotify to decide what work must be re-done), and then use it without excessive cooperation from the parser.
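
Just to illustrate the inotify idea, here is a speculative, Linux-only sketch, not code from the branch: the server could watch each file the preprocessor has read and mark the corresponding cached work stale when one of them changes. mark_file_stale is a made-up stand-in for whatever invalidation we would really do.

#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

static int inotify_fd = -1;

/* Made-up stand-in: invalidate the cached preprocessing keyed by WD.  */
static void
mark_file_stale (int wd)
{
  printf ("watch %d changed; its cached work must be redone\n", wd);
}

/* Start watching one file that the preprocessor has read.  */
int
watch_file (const char *path)
{
  if (inotify_fd < 0)
    inotify_fd = inotify_init1 (IN_NONBLOCK);
  return inotify_add_watch (inotify_fd, path,
                            IN_MODIFY | IN_MOVE_SELF | IN_DELETE_SELF);
}

/* Drain any pending events; called before starting a new compilation.  */
void
process_file_events (void)
{
  char buf[4096]
    __attribute__ ((aligned (__alignof__ (struct inotify_event))));
  ssize_t len;

  while ((len = read (inotify_fd, buf, sizeof buf)) > 0)
    {
      char *p = buf;
      while (p < buf + len)
        {
          struct inotify_event *ev = (struct inotify_event *) p;
          mark_file_stale (ev->wd);
          p += sizeof (struct inotify_event) + ev->len;
        }
    }
}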

Tools Happenings

There are some fun things going on in the tools arena right now.

Do you read Taras Glek’s blog? He’s working on GCC Dehydra, which lets you write GCC passes in JavaScript. I think his work is one of the most interesting developments in the GCC world today.

There are a few similar projects in the works right now. The plugin branch lets you write GCC plugins; the authors of this branch have a Python plugin, but as far as I’m aware this is not publicly available.

On a slightly more idiosyncratic note, Basile Starynkevitch made a branch for his MELT project. This is also a plugin system, but it uses a lisp dialect for which he’s written his own lisp-to-C translator. I’m kind of partial to this idea — I think it would be fun to write parts of GCC in lisp, at least if it compiled down to something reasonable.

I’m quite interested in pushing GCC more into source analysis and refactoring. Right now the front ends have some problems that make this difficult, but I think these are surmountable without too much trouble. Maybe when I finish this pesky incremental compiler…

With all this going on I wonder why GCC-XML is not in the tree, at least on a branch.

Vladimir Prus recently made available his prototype which integrates Python into gdb. This is promising work — we’ve needed something like this for years. Maybe we can finally print complex data structures in sensible ways.

Finally, Benjamin Kosnik has checked in a restructuring of the libstdc++ documentation. I browsed the new stuff a bit and I found it much simpler to navigate. I’m very happy about this; good, clear documentation is critical to the success of free software.

Hunk Boundaries

How do we decide the boundaries of reuse in the incremental compiler?

The crucial thing when looking at reuse is that the signature you choose must be stable across compilations. It is also better if the boundaries are cheap to compute and roughly line up with declaration boundaries. In the prototype implementation, I met these criteria by using file change events (conveniently noted by libcpp) as hunk boundaries. File change boundaries are roughly stable (more so for system headers), are free (libcpp already computes them), and very frequently align with declaration boundaries. This last criterion is typically only broken by constructs where a file is included in the middle of a declaration — for instance, in gcc itself, by including a .def file in the middle of an enum to populate it. I added a special case to the reuse code to account for these situations.
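
Here is the shape of that logic, with made-up names (in the real code the notification comes from libcpp and the hunk bookkeeping lives in the C front end):

#include <stdio.h>

/* Updated elsewhere as tokens are consumed (not shown); nonzero means
   we are in the middle of some declaration or definition.  */
static int brace_depth;
static int paren_depth;

/* Made-up stand-ins for the real hunk bookkeeping.  */
static void
finish_current_hunk (void)
{
  printf ("hunk closed\n");
}

static void
start_new_hunk (const char *file)
{
  printf ("new hunk starting in %s\n", file);
}

/* Called whenever the preprocessor enters or leaves a file.  */
void
on_file_change (const char *new_file)
{
  /* The special case: if the file change happens in the middle of a
     declaration (say, a .def file included inside an enum), splitting
     here would create hunks that can never be reused on their own, so
     keep the current hunk open instead.  */
  if (brace_depth > 0 || paren_depth > 0)
    return;

  finish_current_hunk ();
  start_new_hunk (new_file);
}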

Aside from the convenience — which is very important for prototypes, but much less so for production — there was a historical reason that I chose file boundaries. When I first started thinking about writing an incremental compiler, I focused primarily on header files, and on having a single model which would encompass both the preprocessor and the compiler. After working through a number of cases, with the help of the prototype, I decided that this was “java thinking”; in C the primary unit of reuse is the declaration — not the file. And, it is much, much simpler to reason about C if you look at it after preprocessing.

In earlier posts I described the dependency computation that is needed for sharing declarations. One consequence of this is that the bigger the unit of reuse, the less often it will actually be reusable. At one extreme is gcc’s current PCH implementation, which requires a single header file to be included first in all compilations. With the file-boundary scheme, it turns out that even relatively ordinary programs fail to reuse in many situations. For instance, some headers in gcc have conditional declarations — so, change the order of includes, and an entire header’s worth of declarations must be re-parsed.
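
For example, a header containing a conditional declaration along these lines (the names are invented) parses differently depending on which headers came before it, so its preprocessed token stream, and therefore its checksum, changes with include order, and the whole header has to be re-parsed:

/* hypothetical header fragment */
#ifdef SOME_OTHER_HEADER_H   /* defined only if that header was included first */
extern void extra_debug_hook (void);
#endif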

Note that failure to reuse is not a bug. Reuse is, after all, a performance and memory optimization. The worst consequence is that gcc will run as slowly, and will use as much memory, as it does now. Still, reuse is desirable. The more we reuse, and the smaller our units of reuse, the less work we have to do when recompiling.

So, this leads to the second implementation, which tries to guess declaration boundaries without parsing. This is fast, if somewhat evil-looking, and it yields good results. It still isn’t perfect, though (it does not handle K&R-style definitions properly), so I will probably rewrite it again to do something closer to a quick parse.
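
To give the flavor (a deliberately crude sketch, not the code on the branch): walk the preprocessed text, track brace depth, and end a hunk at each top-level semicolon or at the closing brace of a function body. You can already see why K&R-style definitions, with their parameter declarations between the parameter list and the body, trip this up.

#include <stdio.h>

/* A crude boundary guesser: end a hunk at each top-level ';' or at the
   '}' that closes a top-level function body.  The real version works on
   tokens and has to cope with much more than this.  */
void
guess_boundaries (const char *text)
{
  int depth = 0;
  const char *start = text;
  const char *p;

  for (p = text; *p; ++p)
    {
      if (*p == '{')
        ++depth;
      else if (*p == '}' && --depth == 0)
        {
          /* A function body ends its declaration here; a struct or an
             initializer is instead ended by the ';' that follows.  */
          const char *q = p + 1;
          while (*q == ' ' || *q == '\t' || *q == '\n')
            ++q;
          if (*q != ';')
            {
              printf ("hunk: %.*s\n", (int) (p + 1 - start), start);
              start = p + 1;
            }
        }
      else if (*p == ';' && depth == 0)
        {
          printf ("hunk: %.*s\n", (int) (p + 1 - start), start);
          start = p + 1;
        }
    }
}

int
main (void)
{
  guess_boundaries ("typedef int myint; "
                    "extern myint variable; "
                    "int f (void) { return variable; } "
                    "struct x { int field; };");
  return 0;
}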

I’m not sure whether this same approach makes sense for C++. I assume I will start here, because I understand it and know that it will work at least minimally. But, C++ brings its own problems. For instance, suppose you have an inline function defined in a class definition. In this case, if you change the body of the function, you’d like to be able to reuse the class definition, if possible — otherwise, many kinds of small changes will yield costly compilations. Tom Fitzsimmons suggested that the hunk computation could perhaps be made recursive, so that it operates inside of classes and namespaces, and not just at the top level. I’ll definitely be researching this once I start on the C++ front end.

The split between front and back end incremental processing means that each front end is free to implement its own approach to incrementality. I’m only looking at C and C++, but other people are free to hack on the other front ends. The actual changes to the middle-end interface are quite small, as it turns out.

What next? I think future posts will cover scalability, multi-processor support, and perhaps what to do about the C preprocessor. If there’s anything you want to know more about, post a comment and I will address it.

Incremental Code Generation

In an earlier post I said that I’d eventually explain how incremental code generation works.

First, to recap: I’ve already explained how incremental compilation works in the front end. Stopping here would still yield some speedups — I regularly see 20% compile-time improvements on ordinary programs, sometimes much larger, and for C++ I expect the effect will be bigger.

But, for many compilations, particularly when compiling C using optimization, it is the back end which is the major time sink. And since we already know from the front end which functions have changed (i.e., have been re-parsed) and which have not, it seems like it should be easy to also run the back-end incrementally.

There are a few plausible ways to implement this. Essentially, we can preserve the compiled form of the program at any point in the pipeline. By choosing a point after most of the work is done, we can minimize the time spent needlessly recompiling.

My current plan is to do the work at the very latest point possible, by writing out object files. The idea here is that each top-level definition in a translation unit — each global variable or function — will get its own tiny object file. Then, these files will be linked together to form the object file that the user actually asked for.
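
In sketch form, the per-definition cache could be as simple as this. Everything here, the cache directory, the file naming, and the helper, is invented; only the shape of the idea matters.

#include <stdio.h>
#include <unistd.h>

/* Made-up stand-in for running the back end on one function or
   variable and writing the resulting tiny object file to PATH.  */
static void
generate_code_for (void *definition, const char *path)
{
  (void) definition;
  printf ("generating %s\n", path);
}

/* Return the tiny object file for one top-level definition, generating
   it only if it is not already cached.  (Returns a static buffer, which
   is good enough for a sketch.)  */
const char *
object_for_definition (void *definition, unsigned int checksum)
{
  static char path[256];

  snprintf (path, sizeof path, "obj-cache/%08x.o", checksum);
  if (access (path, R_OK) != 0)
    generate_code_for (definition, path);
  return path;
}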

Of course, the current linker is slow and not at all suitable for this sort of thing. So, we’ll use a new, fast, incremental linker — which we’ll need anyway, in order to preserve incrementality through the entire compile cycle. (It isn’t much good to have a fast incremental compiler if you still end up waiting for the linker all the time.)

This approach has the nice property that the gcc work is relatively easy, for the most part. I don’t know yet whether this outweighs the work that must be done elsewhere, so I’m going to implement a prototype and find out. In fact, this is the next major milestone (though it will be a while, because I’m in a bug-fixing phase).

Another fun possibility here is that we can try to be smart about managing the cache of small object files. In particular, we may be able to preserve some benefits of incrementality even across restarts of the server. For C++, this could include template instantiations.

One drawback of the plan is that it may make dealing with debug info somewhat tricky. This ties in to how we compute hunks; if we don’t include source location in the hunk checksum, then we can end up with objects that lie to the debugger about where their corresponding source appears. We’d want to fix that up somehow, but that is not simple. I’m still thinking about this particular sub-problem.

Another oddity is handling inlining. In my planned implementation, gcc would fork and run code generation in a subprocess. (This lets me avoid excessive modifications to gcc.) But, if we inline a function then the compile server needs to know about this in order to properly re-compile when the inlined function is changed.
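
For the curious, the fork-based structure looks roughly like this (a sketch only; generate_object_file is a stand-in for actually running the back end). The child sees the parsed trees via copy-on-write, writes the object file, and exits, so back-end state never accumulates in the long-lived server.

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Made-up stand-in for running the back end on the already-parsed
   translation unit and writing OUTPUT_FILE.  */
static void
generate_object_file (const char *output_file)
{
  printf ("writing %s\n", output_file);
}

/* Run code generation in a child process, so that the long-lived
   server keeps only front-end state.  Returns 0 on success.  */
int
compile_in_child (const char *output_file)
{
  int status;
  pid_t pid = fork ();

  if (pid == -1)
    return -1;

  if (pid == 0)
    {
      /* Child: generate code and exit without returning to the
         server loop.  */
      generate_object_file (output_file);
      _exit (0);
    }

  /* Server: wait for the child and report whether it succeeded.  */
  if (waitpid (pid, &status, 0) == -1)
    return -1;
  return (WIFEXITED (status) && WEXITSTATUS (status) == 0) ? 0 : -1;
}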

Future posts should include information about computing hunk boundaries (pretty short, but maybe I’ll touch on ideas for C++), multi-processor support, and maybe some speculation about scalability.

Anti-dependencies

Last week I implemented anti-dependencies in the C front end. This wasn’t difficult enough to write about, but it provides an excuse to describe dependencies and incrementalism in general.

The fundamental idea in speeding up the front end is that it should avoid parsing whenever possible. If we’ve already parsed a function definition in some earlier compilation, it is faster to reuse that definition than to try to parse it again. (At least, that is true if noticing that we can reuse it is faster than parsing. But it always is. Note also that this will be a much bigger win for C++ than for C.)

Reuse works by computing a checksum of a run of tokens (one of these runs is called a “hunk” in the implementation, a name that vaguely embarrasses me). If we’ve seen the checksum before, we reuse the parsed results, which we store in a table.
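
Concretely, the checksum can be something very cheap computed over the token spellings. The sketch below uses FNV-1a purely for illustration; the hash and the token representation on the branch are different, but the idea is the same: identical runs of tokens produce identical keys into the table of parsed hunks.

#include <stdio.h>

/* Checksum a run of token spellings (FNV-1a, chosen only because it
   is short).  */
unsigned int
hunk_checksum (const char *const *tokens, int n_tokens)
{
  unsigned int hash = 2166136261u;
  int i;

  for (i = 0; i < n_tokens; ++i)
    {
      const char *p;
      for (p = tokens[i]; *p; ++p)
        {
          hash ^= (unsigned char) *p;
          hash *= 16777619u;
        }
      /* Separate tokens so that "in t" and "int" do not collide.  */
      hash ^= 0xff;
      hash *= 16777619u;
    }
  return hash;
}

int
main (void)
{
  const char *decl[] = { "typedef", "int", "myint", ";" };
  printf ("%08x\n", hunk_checksum (decl, 4));
  return 0;
}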

Of course, the situation is more complicated than that. A bit of code may depend on other bits of code, for instance:

typedef int myint;
extern myint variable;

Here we can’t reuse variable in isolation — we can only reuse it if we’ve already reused the previous line. So, each hunk carries with it some dependency information. Before last week, this was just a list of all the checksums of hunks on which the current hunk depends.

Even this dependency tracking is not enough for proper operation. Suppose we already compiled this program:

struct x;
struct x { int field; };

… and then try to compile:

struct x;
struct x { int bug; };
struct x { int field; };

With a naive dependency scheme, this would compile without error: we would erroneously reuse the hunk holding the second definition of struct x, never noticing the conflicting definition just above it. After all, all of the reused hunk’s dependencies have been met.

Anti-dependencies are the fix for this. When considering a hunk for reuse, we check not only that its dependencies have been met, but also that any declarations we might reuse have the correct “prior value”. In fact, these two checks amount to the same thing; rather than record the checksums of prerequisite hunks, we now simply record declarations.
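
In sketch form (the names are invented, and in the real code the bindings come from the C front end’s symbol table), the reuse check walks the recorded pairs of identifier and prior declaration, and refuses to reuse the hunk unless each identifier still binds to exactly what it did when the hunk was first parsed:

#include <stddef.h>

/* One recorded prerequisite: when this hunk was first parsed, NAME was
   bound to PRIOR_DECL (which may be NULL, meaning "no binding yet").  */
struct prereq
{
  const char *name;
  void *prior_decl;
};

struct hunk_deps
{
  struct prereq *prereqs;
  int n_prereqs;
};

/* Made-up stand-in for the front end's symbol-table lookup.  */
static void *
current_binding (const char *name)
{
  (void) name;
  return NULL;
}

/* A hunk may be reused only if every declaration it depends on is
   currently exactly what it was last time.  This is what catches the
   conflicting "struct x" definition in the example above.  */
int
hunk_reusable_p (const struct hunk_deps *deps)
{
  int i;
  for (i = 0; i < deps->n_prereqs; ++i)
    if (current_binding (deps->prereqs[i].name)
        != deps->prereqs[i].prior_decl)
      return 0;
  return 1;
}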

Given all this, how will incremental compilation work? My current plan is not to introduce any sort of special case. Instead, the parser will notify the backend when it parses a function. A function which comes from a reused hunk will simply not be compiled again.

If you think about it a bit, you’ll see that the prerequisites form a dependency model of a compilation unit. If the developer changes a declaration, anything using that declaration will be invalidated, and this invalidation will itself propagate outward. So, the amount of recompilation will be proportional to the “importance” of the change. Change a central typedef, and many things may need recompilation. Change a comment, and nothing will (though handling debug info properly may be a challenge).

A future refinement is the idea of a “semantic dependency”. The idea here is that, rather than check prerequisite declarations by identity, check them by ABI compatibility. There are a few tricks here, so this is a bit lower on my priority list for the moment.

You might have wondered how we compute the boundaries of a hunk, or how we can actually avoid recompiling functions. I’ll answer both of these in future posts.

GCC 4.3 Hacking

I haven’t talked about the incremental compiler in a couple of weeks — first I was out of town, and then I was sick. And then yesterday, I put it off… I don’t want to be that way, but the truth is for the last couple of weeks I haven’t been working on this project much.

Instead, I’ve been doing a bit of work fixing bugs on the trunk, to help make it so that GCC 4.3 can be released in a timely way. I don’t really know much about most of GCC, though, so I’ve pretty much been working on only the bugs in parts I do know: more or less the C front end and the preprocessor.

Working on bugs this way is a humbling experience. Last week I think I fixed four bugs. Looking through the ChangeLog, I think Jakub fixed twenty. Hmm… I don’t even know how he can do that many bootstrap-and-check cycles in a week.

I also put a bit of work into making GCC emit better column number information — there’s some new location code that is more memory efficient (it saves about 5%) and that enables column number output. Unfortunately the parsers, constant folders, and debug output generators know nothing about columns, and fixing this is a big enough job that it won’t happen in 4.3.

The more I work on parts of GCC like this, the more I realize how awesome gcjx really was, if I may say so myself. GCC has several bad design decisions baked into things as basic as location handling. Sad. It will be a lot of work to clean this up… and when I look at GCC I think my eyes are bigger than my stomach. Or in other words, I need help.

I did do a little work on the incremental compiler, starting this week: I did a few cleanups to undo breakage I introduced earlier. My plan is to merge what I’ve got to the trunk during the next Stage 1. So, I’m cleaning up the little details I intentionally ignored during the experimentation phase.

My thinking behind the merge is that, first, the old compile server partly failed due to being on a branch too long, a mistake I don’t want to repeat; and second, even though this code does not do everything, it is reaching a good milestone, and at the very least it greatly speeds up --combine builds. I’ve heard that the Linux kernel uses combine; benchmarking this is on my to-do list.

Threaded GCC

This past week I spent playing around with making the C front end multi-threaded. What this means in particular is that the compile server makes a new thread for each incoming compilation request; the threads are largely independent but do share some results of parsing.

Parts of this work are, I think, reasonably clean. Making gengtype and the GC understand threads was pretty simple. The code to lock the global hunk map (this is where parsed bits of files are registered) was easy.

What is ugly is dealing with GCC’s many global variables. Cleaning this up is a multi-person, multi-year project — so I took the easy route and marked many variables (831 at last count) with __thread.
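
The shape of the change is roughly this (a sketch with invented names, not GCC’s actual variables): per-compilation state becomes thread-local via __thread, while genuinely shared structures like the global hunk map get a lock.

#include <stddef.h>
#include <pthread.h>

/* Formerly-global parser state, now one copy per server thread.  */
__thread int pending_decl_count;

/* The hunk map really is shared between threads, so it gets a lock.
   The toy array below stands in for the real hash table.  */
static pthread_mutex_t hunk_map_lock = PTHREAD_MUTEX_INITIALIZER;

#define HUNK_MAP_SIZE 1024
struct map_entry { unsigned int checksum; void *hunk; };
static struct map_entry hunk_map[HUNK_MAP_SIZE];
static int hunk_map_used;

/* Look up CHECKSUM; if absent and NEW_HUNK is non-null, register it.
   All access to the shared map happens under the lock.  */
void *
lookup_or_insert_hunk (unsigned int checksum, void *new_hunk)
{
  void *result = NULL;
  int i;

  pthread_mutex_lock (&hunk_map_lock);
  for (i = 0; i < hunk_map_used; ++i)
    if (hunk_map[i].checksum == checksum)
      {
        result = hunk_map[i].hunk;
        break;
      }
  if (result == NULL && new_hunk != NULL && hunk_map_used < HUNK_MAP_SIZE)
    {
      hunk_map[hunk_map_used].checksum = checksum;
      hunk_map[hunk_map_used].hunk = new_hunk;
      ++hunk_map_used;
      result = new_hunk;
    }
  pthread_mutex_unlock (&hunk_map_lock);
  return result;
}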

This works semi-well. There are some bugs, especially at higher -j levels, which points to data races somewhere. Since this is not my most pressing project, I’m going to put this patch off to the side for a while.

Also I’ve been wondering whether the other GCC maintainers will really want to see this. For the most part the damage is isolated — the GC, the server, and the pieces of code that explicitly want to share things (for now this just means the C front end) need to know about threads. However, the thread-local variables are pretty ugly… if you’re a GCC maintainer, let me know what you think.

Meanwhile it is back to the grindstone of debugging C front end regressions that I’ve introduced. Also I’m looking at changing the debug output module so that I can defer writing any assembly output until fairly late in compilation. This is one step toward the incremental code generation plan.

Friday GCC Update

I haven’t been blogging much lately, at least partly because I’ve been having fun hacking on GCC. Mark pointed out, though, that I should blog regularly about my progress so people know what is happening.

I’ve got the basic compile server running. Recently I’ve been fixing bugs in it. I’m not seeing the mind-blowing 2x speedup I once was; I need to look into that a bit. This doesn’t matter, though, because the real benefits will come from the next step, which is incremental code generation.

This week, to get away from the bugs a bit, I made GCC’s garbage collector thread-aware. Following the implementation methodology known as “Biting Off More Than You Can Chew”, I also dove in to making the whole front end work multi-threaded. So far this has yielded an evil-looking patch that marks a couple hundred global variables with __thread.

My current plan is to try to merge what I’ve got during the next Stage 1. My thinking is, the current code provides benefits, even if not every planned benefit. It would be better to merge as frequently as possible — not just for testing, but also because smaller patches are simpler to understand, and an early merge would get other GCC maintainers used to the new stuff.

Actually… why don’t I just post my road-map here:

  1. Fix bugs. Make it bootstrap again.
  2. Clean up the hacks I put in. Implement the things I skipped (anti-dependencies, maybe semantic dependencies, fix PCH).
  3. Test on bigger programs.
  4. Make sure --enable-intermodule bootstrap works. Measure performance improvement.
  5. Stage 1 merge.

At this point there are multiple options for what to do next. I’m not sure what order I will tackle them — as the thread experiment shows, I’m prone to following what looks fun, in the absence of other constraints.

  1. Incremental code generation. May require linker changes.
  2. C++ front end. (I’m thinking we should make the front ends into shared libraries that gcc will dlopen.)
  3. Finish multi-threading work (if I don’t do this next week).
  4. Finish the incremental linker (or write one, if there isn’t one yet).

After this comes other fun stuff, like adding a way to query the server for information about your program. I’m thinking http…

I’ve also gotten interested in other aspects of GCC’s C front end. For example, it is not ideal for some kinds of static analysis because it folds too early. It might be fun to fix this.

First Success

I just managed to get through an ordinary make of a real program (zenity, which for some reason is my standard test) using gcc as a server. Yay! It is, as predicted, about twice as fast as an ordinary make.

Now pile on the caveats.

Compiler Progress

I’ve been working steadily on the incremental compiler project these last couple of months, but not blogging about it. So, time for a status report.

I made a branch in the GCC repository, cleaned up my existing patches, and checked it all in. There’s a short web page describing the project on the GCC wiki (branch details there).

I’ve also written a rough draft of the support code for “decl smashing”, and also changed the parser to consider “per-declaration hunks”.

The idea behind per-declaration hunks is that, for code written by the user, we get better incremental behavior if we consider each hunk to hold a single declaration or definition, rather than the previous rule of “everything between two file change events”. The former is better because it means that a change is isolated to its containing declaration and uses of that declaration (and we can do even better here, perhaps, with a bit more analysis); the latter is better for system headers and the like because it has lower overhead.

Finally, I’ve started server-izing GCC. This part should be ready for real testing pretty soon.