Archive for the ‘GCC’ Category

GCC Summit 2007

The 2007 GCC Summit just ended; all that’s left is the wrap party. Lately I haven’t been too interested in writing up comprehensive trip reports (but if you liked that style, post a comment — I’d like to know), and instead I’m just going to write about the bits I found particularly interesting. Overall the summit was good; I think the proceedings are online if you want to read the papers.

I had a few personal goals for the summit and I think I’ve accomplished all of them. I wanted to meet up with the various GCC hackers I’ve met over the years and sort of get them used to the idea that I might be around in GCC-land a bit more than previously. This was the easy part — I talked to tons of people and met many more.

I also wanted to pitch my incremental compiler to anybody who would listen, and get feedback about the idea in general and about a few specific potential problem areas. Thanks here go to Ian Taylor and Alan Modra for some assembler and linker advice; and to Mark Mitchell and Nathan Sidwell for telling me some awful truths about C++.

Nobody really complained much about my plans, not even the evil part where GCC will generate a single object per top level definition, or the even eviler part where code generation is done by fork()ing the server and doing the work in the child, to avoid having to clean up a million global variables. Now I’m left hoping I’m not in a tragedy of manners, where everyone holds back their knowledge of impending doom.
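For the curious, here is a minimal sketch of the fork() part of the plan, with made-up function names and none of the real server plumbing: the parent keeps its parsed state pristine while a throwaway child runs the back end and exits.

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Hypothetical: run one code-generation request in a child process
       so the server never has to reset the back end's global state.  */
    static int
    generate_in_child (void (*generate) (void *), void *request)
    {
      pid_t pid;
      int status;

      pid = fork ();
      if (pid < 0)
        return -1;                /* fork failed */
      if (pid == 0)
        {
          /* Child: clobber whatever globals code generation needs,
             then vanish without cleaning anything up.  */
          generate (request);
          _exit (0);
        }

      /* Parent (the server): wait for the result, keep serving.  */
      waitpid (pid, &status, 0);
      return WIFEXITED (status) ? WEXITSTATUS (status) : -1;
    }

The appeal is that _exit() is the cheapest possible cleanup; no pass has to know how to undo its side effects.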

Diego, appropriately, gave the opening remarks. He showed how GCC’s SPECint performance has remained flat over the last five years, inducing widespread depression. SPECfp looks a bit better, though, and at least everyone can agree that GCC’s internals are much nicer than they used to be.

Most of the talks were about optimizations. This is fun stuff but not my current area of interest. There were a few talks touching on the idea of writing plugins for GCC — one paper discussing a custom version of Lisp, and another having impressive amounts of working code including Python bindings and a GIMPLE visualizer. I’m hoping one or both of these go into GCC eventually; I especially like the idea of writing custom code analysis passes in Lisp.

A couple talks were about debugging. I missed one of these but the other had Daniel Jacobowitz demonstrating gdb stepping backward in time. This is awesome stuff! … though at the moment there are several restrictions and not all the patches are available. Lots of people are concerned about GCC’s debuginfo, which I hear from many sources is rather sucky.

It’s a bit strange to embark on a project that seems so unrelated to the rest of the work in GCC. I keep doing mental management on myself so as not to obsessively worry about this. Anyway the GCC crowd is pretty nice and I’m not worrying about how people will react any more.

Several people asked me about gcj of course. I tried to be clear about the fact that the situation is murky: my own gcj hacking has dropped off precipitously, though I still review patches, and what to do about gcj really depends on the level of involvement of other parties in the future (it is hard to pick a time period here but I would say at least two major releases).

Distributed Compiler

Andrew MacLeod had an interesting idea yesterday — make the compile server interact nicely with distcc. I’ve been thinking about this a bit, and I don’t think there are any major problems with the idea. It would be pretty simple to run a compile server on multiple machines. Maybe it would make sense to ensure that a given file is always sent to the same server, but even this isn’t strictly necessary.
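As a sketch of what “always send a given file to the same server” might look like (purely illustrative; a real distcc-style setup would decide this in the job scheduler, not in code like this):

    #include <stddef.h>

    /* Hypothetical helper: pick a server for a file by hashing its
       name, so repeated compiles of the same file hit the same
       server's cache.  */
    static unsigned int
    hash_string (const char *s)
    {
      unsigned int h = 5381;
      while (*s)
        h = h * 33 + (unsigned char) *s++;
      return h;
    }

    static size_t
    choose_server (const char *filename, size_t n_servers)
    {
      return hash_string (filename) % n_servers;
    }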

One possible problem is that this would not interact well with any indexing feature we add to the server. Also it would not allow cross-module inlining in all cases, or, worse, it might choose to inline or not depending on which jobs end up on which server — something would have to be done about that.

The point behind doing this is to try to scale to very large programs — larger than will fit in memory on your desktop.

Another idea in this space is to somehow serialize hunks to disk, GC “rarely used” hunks when memory is low, and then read hunks back in as needed. Whether this would actually be a win would depend, I think, on the speed of deserialization — it would have to be reliably faster than simply re-parsing (this is more likely if GCC adopts Ian’s proposed pointer-free “PC-rel-like” IR — but I don’t know how likely that is). I think I’ll wait for the LTO project to be a bit further along before investigating this any more; hopefully I’d be able to use LTO’s tree-reading and -writing capabilities to save and restore hunk contents.
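To make that a bit more concrete, here is a rough sketch with entirely hypothetical data structures; the actual serialization would presumably come from LTO’s tree writer:

    /* Each cached hunk remembers how often it has been re-used.  Under
       memory pressure, rarely used hunks are written to disk and their
       trees freed; a later cache hit reads them back in.  */
    struct cached_hunk
    {
      unsigned int checksum;   /* identifies the hunk's token sequence */
      unsigned int uses;       /* bumped on every re-use */
      void *bindings;          /* in-memory trees, or NULL once spilled */
      const char *spill_path;  /* where the serialized form lives */
    };

    static void
    maybe_spill (struct cached_hunk *h, int memory_low,
                 void (*serialize) (struct cached_hunk *),
                 void (*free_trees) (void *))
    {
      if (memory_low && h->uses < 2 && h->bindings != NULL)
        {
          serialize (h);        /* e.g. via LTO's tree-writing code */
          free_trees (h->bindings);
          h->bindings = NULL;   /* deserialize lazily on the next hit */
        }
    }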

Anyway, a distributed compile server is an interesting idea for the future. I’ll revisit it once the server is actually working.

Another FE Test

I added “prerequisite” support to my front end prototype recently — this is code that ensures that a hunk’s contents are only re-used if all the things that the hunk depends on are also re-used. This turned out to be much simpler to implement than I had anticipated.
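In outline (with made-up names, not the prototype’s real ones), the check looks something like this:

    #include <stddef.h>

    /* A hunk records the hunks that supplied the declarations it used.
       It may only be re-used if every one of those prerequisites was
       itself re-used in the current compilation.  */
    struct hunk
    {
      unsigned int checksum;
      struct hunk **prereqs;   /* hunks this hunk's code depends on */
      size_t n_prereqs;
      int reused_here;         /* nonzero once mapped into this unit */
    };

    static int
    hunk_reusable_p (const struct hunk *h)
    {
      size_t i;
      for (i = 0; i < h->n_prereqs; ++i)
        if (!h->prereqs[i]->reused_here)
          return 0;
      return 1;
    }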

With this I still can’t compile the C front end with my compiler, because GCC has many ordering dependencies in its headers (ugly stuff — you should avoid this) and so ends up requiring the decl smashing fix. The decl smashing fix looks like a tricky project, especially given the multi-threading constraint, so I haven’t done it yet.

Meanwhile I took a simpler approach to trying to get some results from this test case: I reordered all the #include statements in the C front end to ensure consistency. Yay hacks!

Even with the additional prerequisites bookkeeping, the results are quite nice, though a bit more modest than the zenity case:

Compiler     Time (M:S)   Memory at exit
Prototype    0:49         43M
Trunk        1:29         83M

PCH Again

The other day Roland asked me why PCH is not a bigger win. While discussing this I realized that I didn’t try to time the very best case for PCH. So, I looked at a couple more things.

First, I spent some time munging the output of -ftime-report and making graphs with gnuplot. I thought it might be interesting to visualize the problem areas. All this told me, though, was that in my earlier tests the PCH build still spent an inordinate amount of time in the parser.

Back when I wrote gcjx (my test case here), I tried to make it PCH-friendly. I made a master file, "typedefs.hh", which declared most things and which included a number of other files. But, it turns out, it didn’t include absolutely everything — which the "all.cc" approach does. So, I made a .gch file that did include everything (not entirely trivial, as there are ordering dependencies between headers to consider) and re-ran my test. This did speed up the PCH case — but the all.cc case is still much faster. A look at the graph here shows the same thing, namely that the PCH build is spending a lot of time in the parser.

I didn’t look much deeper than this. I’ve learned that PCH is not excellent even in its best case, but I don’t know why. Given PCH’s many restrictions, I don’t think it is too interesting to pursue this further. The approach I’m investigating is faster and does not have as many limitations as PCH. Also it will not require changes to your Makefiles.

Conundrum

Today I’ve been thinking about a problem I’ve run across in my compiler front end work.

The problem is a technical issue having to do with how the C compiler is implemented. Currently if there are multiple declarations for an object, the compiler might modify one of the declarations in place. This causes problems when re-using declarations. What if the second declaration is visible in one compilation unit but not another? Remember that since we’re saving object pointers here, other declarations that refer to the multiply-declared object will point to the first declaration — the one that gets modified.
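Here is a made-up C example of the kind of declaration pair I mean; the details of GCC’s merging are simplified:

    /* decl-one.h: seen by every compilation unit.  */
    int get_limit ();           /* old-style declaration, no prototype */

    /* decl-two.h: seen by only some compilation units.  */
    int get_limit (void);       /* the prototype is merged, in place, into
                                   the DECL created for the first
                                   declaration */

A compilation unit that never includes decl-two.h still sees the modified DECL if we blindly re-use the saved bindings.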

I haven’t been able to think of a nice way to handle this that also lets us do multi-threaded parsing. We could redefine the various DECL tree accessors to forward requests to a per-compiler (aka thread) “current declaration” — but this is fairly nasty and, due to the nature of trees, probably also error-prone.

Any ideas? I mean ones that don’t involve redesigning GCC? The old compile server branch planned to solve this via an undo buffer, but that is not very nice for multi-threaded use (something I plan to do that the compile server did not).

Sometimes I think this project would be better if I went ahead with the more ambitious project of eliminating trees from the front end… but that would both be more work and, I suspect, harder to get approved. But, at least that way I could design in a fix, since I’d be touching every line in the front end anyway.

Speedup

I added the needed multi-hunk processing to my front end hack, to try to be able to compile parts of GCC. What this means is that if a construct (say, an enum) spans multiple file changes, the parser will now notice this and record it properly. If a subsequent parse sees the same starting hunk, it will check to see if subsequent hunks are the same as the previous chain, and if so, will map in all the bindings.
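A made-up example of the kind of construct I mean, where a single enum’s tokens span a file change event:

    /* shapes.h: the enum's body comes from an included .def file, so
       the tokens of one declaration are spread across several hunks.  */
    enum shape_kind
    {
    #include "shape-kinds.def"  /* a file change event falls inside the enum */
      LAST_SHAPE_KIND
    };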

Unfortunately this was not the only problem — GCC also has an enum defined in two different header files, meaning that the token stream actually seen depends on the order of header inclusion. Gross stuff! Anyway, handling this was always on the agenda; but it means writing more code and so I am putting it off a little bit.
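Here is an illustration (not the actual GCC headers) of how that happens: whichever header is included first contributes the tokens, so the hunks differ between compilation units even though the resulting declarations are identical.

    /* first.h */
    #ifndef MODE_DEFINED
    #define MODE_DEFINED
    enum mode { MODE_READ, MODE_WRITE };
    #endif

    /* second.h: same enum, same guard.  A unit that includes second.h
       before first.h sees these tokens instead of first.h's.  */
    #ifndef MODE_DEFINED
    #define MODE_DEFINED
    enum mode { MODE_READ, MODE_WRITE };
    #endif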

Meanwhile I compared my hacked compiler with a roughly equivalent unmodified version. I compiled zenity (a Gnome program; the part I compiled is about 5KLOC according to wc — but 475KLOC after preprocessing) with both, using the same command line arguments, including --combine.

The results: the modifications are a big win:

Compiler   Time        Memory (used at exit)
Hacked     11.98 sec   17M
Trunk      23.87 sec   62M

The hacked compiler with --combine is faster than running a make on the same program (though to be fair I didn’t try to measure the build overhead — it would have to be more than 4 seconds, since the normal build took 16.55 sec).

Sharing Parsed Headers

One component of the incremental compiler is re-using parsed header files. In fact, for C++ this is where I expect to gain performance improvements in the “normal compilation” case. A couple weeks ago I realized that I could easily try this idea, for C, by implementing parts of it and then testing it using --combine. I took a break from removing globals from the C front end and wrote a trial implementation this week.

This code works by conceiving of a C compilation unit as a sequence of “hunks”, currently defined as “tokens between two file change events”. Then it computes the checksum of the tokens in each hunk. (In order to do this I had to change the C compiler to lex the whole compilation unit at once — this was easy and makes it better match the C++ compiler.)

While parsing a hunk for the first time, we note all its top-level declarations by hooking into bind(). We save these for later.

Then in the top-level parsing loop, if we notice that the next hunk matches a hunk in our map, we simply copy in the bindings we saved, set the token pointer to the end of the hunk, and carry on.

There’s some extra bookkeeping here — we only want “top level” hunks, not ones where we saw a file change event in the middle of a function or structure or something; and we can only reuse a hunk if the declarations used by the code in the hunk were also reused. I haven’t yet implemented the latter case, but I’ll have to since I want to test this on GCC itself, and GCC uses constructs like this.
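Putting the pieces together, the top-level loop looks roughly like this; every name below is a placeholder rather than the prototype’s real code, and the prerequisite check is the part I have not implemented yet:

    #include <stddef.h>

    struct parser;
    struct hunk_map;

    struct hunk_info
    {
      unsigned int checksum;   /* checksum of the hunk's tokens */
      size_t end_token;        /* index one past the hunk's last token */
      void *saved_bindings;    /* top-level decls recorded via the bind() hook */
    };

    /* Placeholder declarations for the operations described above.  */
    extern int at_end_of_tokens (struct parser *);
    extern unsigned int checksum_next_hunk (struct parser *);
    extern struct hunk_info *lookup_hunk (struct hunk_map *, unsigned int);
    extern int hunk_is_top_level (const struct hunk_info *);
    extern int prerequisites_reused_p (const struct hunk_info *);
    extern void copy_bindings_into_scope (struct parser *, void *);
    extern void set_token_position (struct parser *, size_t);
    extern void parse_and_record_hunk (struct parser *, struct hunk_map *);

    static void
    parse_translation_unit (struct parser *p, struct hunk_map *map)
    {
      while (!at_end_of_tokens (p))
        {
          struct hunk_info *h = lookup_hunk (map, checksum_next_hunk (p));
          if (h && hunk_is_top_level (h) && prerequisites_reused_p (h))
            {
              /* Seen this hunk before: map in its saved bindings and
                 skip straight to the end of the hunk.  */
              copy_bindings_into_scope (p, h->saved_bindings);
              set_token_position (p, h->end_token);
            }
          else
            {
              /* First time, or not reusable: parse normally; the bind()
                 hook records the hunk's top-level declarations.  */
              parse_and_record_hunk (p, map);
            }
        }
    }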

I was able to run it on a smaller program (zenity). We skipped parsing 1.4 million tokens in this program. Naturally, whether or not this is a performance win (I didn’t look at that yet, and honestly for C I am not expecting it) depends on the overhead of the bookkeeping. However, it should reduce memory use, making IMA a bit more practical. And, it gives a good idea of the kind of textual redundancy common in C programs.

Note that this is experimental. I don’t like --combine all that much, since it requires users to change their Makefiles, and because it doesn’t work with C++. However, it provided a simple testbed.

Front End Stuff

I’ve been digging through GCC’s C front end the last couple of weeks, trying to understand how it works and how feasible some parts of my project are.

The parser itself is nice. It is a hand-written recursive descent parser, which means that it is “just plain code” and thus simple to debug. It is written in a typical object-oriented style, where every function in the parser takes a “c_parser” object pointer as its first parameter.
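A simplified illustration of the style (these are not the real declarations):

    #include <stdbool.h>
    #include <stddef.h>

    /* The parser state is carried explicitly rather than living in
       globals; every parsing function takes it as its first argument,
       much like a "this" pointer.  */
    struct c_parser
    {
      struct c_token *tokens;  /* the lexed token stream */
      size_t next_token;       /* index of the next token to consume */
      bool error;              /* true after a parse error */
    };

    static void c_parser_declaration (struct c_parser *parser);
    static void c_parser_statement (struct c_parser *parser);
    static void c_parser_expression (struct c_parser *parser);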

Unfortunately the rest of the C front end is not so nice. There are many (around 200) global variables, and even though the parser itself is quite clean, it implicitly relies on these via functions elsewhere in the front end. This means that making the front end reentrant is going to be a fair amount of work. While much of this will be mechanical, there are some ugly things, like places where lang hooks (a sort of virtual function in the compiler) are designed in a way that implicitly requires a global variable.

I haven’t looked deeply into the C++ front end yet, but a quick look with nm shows that there are about 100 globals there as well.

Making the parser re-entrant is sort of a side plot for me: it will help with part of my project, but it isn’t on the critical path. Still, I’d like to see the C and C++ front ends eventually be written in “gcjx style”: thread-aware reusable libraries, written in an OO style, separated from the back ends.

So why work on this at all right now? I didn’t know the C front end at all, and I thought this would be a way to learn about it without getting bored just reading code. This plan is working pretty well. I’ve also spent a bit of time simply stepping through various parts of the compiler with gdb — another useful technique for learning a foreign code base.

Meanwhile I’ve been working on the design of the more crucial parts of the plan — namely, re-using parsed headers and providing incremental recompilation. My current hope is to have something ready for critique before the GCC Summit.

C++ noodling

A while back I spent a little time playing with g++, trying to understand compilation performance. For this experiment I tried to measure how much speedup a “model-based” compiler could expect to achieve.

I compiled my test program a few different ways. I timed compilation of a plain build (more or less a “make”) both with and without PCH. And, I timed compilation of the “all.cc” approach — I’ve heard many times over the years that C++ shops will cat their sources together into one big compilation unit, and that this reduces overall build times.
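For reference, the “all.cc” file is nothing more than a list of includes; the file names here are invented:

    /* all.cc: the whole program as one compilation unit, so every
       header is parsed exactly once.  */
    #include "lexer.cc"
    #include "parser.cc"
    #include "semantics.cc"
    #include "codegen.cc"
    #include "main.cc"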

So far I’ve only done this experiment by compiling gcjx, a moderately sized, fairly ordinary, C++ program with which I am familiar. I plan to redo this with a couple other programs as well (send suggestions).

The results:

Approach        Time
Plain build     13m50s
Plain + PCH     8m05s
all.cc          3m18s
all.cc + PCH    3m17s

This basically conforms with things I’ve found previously, but I was a little surprised that PCH is not a bigger win than it is, especially considering that I intentionally wrote gcjx with one big header file that includes most things, precisely to make better use of this feature. (I’ve heard from other folks that in some situations PCH is a net lose and they disable it. Any experiences out there?)

Anyway, I consider this promising data for a compile server approach, since what I’m thinking about essentially models the “all.cc” approach in the compiler.