Archive for June, 2007

Conundrum

Today I’ve been thinking about a problem I’ve run across in my compiler front end work.

The problem is a technical issue having to do with how the C compiler is implemented. Currently if there are multiple declarations for an object, the compiler might modify one of the declarations in place. This causes problems when re-using declarations. What if the second declaration is visible in one compilation unit but not another? Remember that since we’re saving object pointers here, other declarations that refer to the multiply-declared object will point to the first declaration — the one that gets modified.
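
Here’s a made-up example of the kind of in-place modification I mean, using plain C rather than anything GCC-specific:

    /* Both declarations are valid C; the second one completes the
       type, and the compiler records that by modifying the original
       DECL in place.  */
    extern int table[];     /* seen by every compilation unit */
    extern int table[10];   /* seen by only one compilation unit */

If another unit later reuses the saved DECL for table, it sees the completed type even though the completing declaration was never in scope there.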

I haven’t been able to think of a nice way to handle this that also lets us do multi-threaded parsing. We could redefine the various DECL tree accessors to forward requests to a per-compiler (i.e., per-thread) “current declaration”, but this is fairly nasty and, due to the nature of trees, probably also error-prone.
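
To make that concrete, here’s a minimal sketch of the forwarding idea. Every name below is invented, and a single thread-local slot stands in for what would really be a per-thread hash table keyed by the canonical declaration:

    #include <stddef.h>

    /* Toy stand-in for a shared tree node.  */
    struct decl
    {
      const char *name;
    };

    /* Per-thread override table (one slot, for illustration).  */
    static __thread struct decl *override_key;
    static __thread struct decl *override_value;

    /* Accessors forward through the per-thread table instead of
       reading the shared node directly.  */
    static struct decl *
    this_threads_decl (struct decl *d)
    {
      if (d == override_key && override_value != NULL)
        return override_value;
      return d;
    }

    /* A DECL_NAME-style accessor would then become something like:
       #define DECL_NAME(D) (this_threads_decl (D)->name)  */

Paying for a lookup in every accessor is part of what makes this unappealing.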

Any ideas? I mean ones that don’t involve redesigning GCC? The old compile server branch planned to solve this via an undo buffer, but that is not very nice for multi-threaded use (something I plan to do that the compile server did not).

Sometimes I think this would go better if I took on the more ambitious project of eliminating trees from the front end… but that would be both more work and, I suspect, harder to get approved. At least that way I could design in a fix, though, since I’d be touching every line in the front end anyway.

Speedup

I added the needed multi-hunk processing to my front end hack, so that I can try compiling parts of GCC. What this means is that if a construct (say, an enum) spans multiple file changes, the parser will now notice this and record it properly. If a subsequent parse sees the same starting hunk, it will check whether the subsequent hunks are the same as the previous chain, and if so, will map in all the bindings.
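
Here’s a made-up example of the kind of construct I mean; the enum’s tokens land in three different hunks, so the parser has to record the chain:

    /* color.h -- illustrative only.  */
    enum color
    {
      COLOR_NONE,
    #include "color-list.h"   /* file change event inside the enum */
      COLOR_MAX
    };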

Unfortunately this was not the only problem — GCC also has an enum defined in two different header files, meaning that the token stream actually seen depends on the order of header inclusion. Gross stuff! Anyway, handling this was always on the agenda; but it means writing more code and so I am putting it off a little bit.
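
The shape of the problem, again as a made-up example, is two headers that each contain the same guarded definition:

    /* Both a.h and b.h contain:  */
    #ifndef SHARED_KIND_DEFINED
    #define SHARED_KIND_DEFINED
    enum kind { KIND_A, KIND_B };
    #endif

Whichever header is included first supplies the enum’s tokens, and the other header’s hunk lexes to nothing, so neither hunk’s token checksum is stable across compilation units.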

Meanwhile I compared my hacked compiler with a roughly equivalent unmodified version. I compiled zenity (a Gnome program; the part I compiled is about 5KLOC according to wc — but 475KLOC after preprocessing) with both, using the same command line arguments, including --combine.

The results: the modifications are a big win:

Compiler   Time       Memory (used at exit)
Hacked     11.98sec   17M
Trunk      23.87sec   62M

The hacked compiler with --combine is faster than running a make on the same program (though to be fair I didn’t try to measure any of the build overhead — but it would have to be more than 4 seconds of overhead, since the normal build took 16.55sec).

Sharing Parsed Headers

One component of the incremental compiler is re-using parsed header files. In fact, for C++ this is where I expect to gain performance improvements in the “normal compilation” case. A couple weeks ago I realized that I could easily try this idea, for C, by implementing parts of it and then testing it using --combine. I took a break from removing globals from the C front end and wrote a trial implementation this week.

This code works by conceiving of a C compilation unit as a sequence of “hunks”, currently defined as “tokens between two file change events”. Then it computes the checksum of the tokens in each hunk. (In order to do this I had to change the C compiler to lex the whole compilation unit at once — this was easy and makes it better match the C++ compiler.)
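
The bookkeeping might look something like this; the structure layout and the hash are mine for illustration, since the real code presumably checksums whatever token fields matter (type, spelling, value):

    #include <stddef.h>

    /* All names here are invented.  */
    typedef struct token
    {
      int type;
      const char *spelling;
    } token;

    struct binding;              /* top-level decls saved via bind() */

    typedef struct hunk
    {
      unsigned int checksum;     /* checksum of the hunk's tokens */
      size_t start, end;         /* token indices delimiting the hunk */
      struct binding *bindings;  /* saved top-level declarations */
      struct hunk *next;         /* chain, for multi-hunk constructs */
    } hunk;

    /* Checksum the tokens between two file change events.  */
    static unsigned int
    hunk_checksum (const token *toks, size_t start, size_t end)
    {
      unsigned int h = 5381;
      size_t i;

      for (i = start; i < end; i++)
        {
          const char *s = toks[i].spelling;

          h = h * 33 + (unsigned int) toks[i].type;
          while (*s)
            h = h * 33 + (unsigned char) *s++;
        }
      return h;
    }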

While parsing a hunk for the first time, we note all its top-level declarations by hooking into bind(). We save these for later.

Then in the top-level parsing loop, if we notice that the next hunk matches a hunk in our map, we simply copy in the bindings we saved, set the token pointer to the end of the hunk, and carry on.
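
Putting those two steps together, the top-level loop looks roughly like this. Everything here is invented scaffolding, with the helpers only declared, not real GCC interfaces:

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct parser parser;   /* stands in for c_parser */
    typedef struct hunk hunk;

    extern bool at_eof (parser *);
    extern size_t token_position (parser *);
    extern size_t next_file_change (parser *);
    extern unsigned int checksum_tokens (parser *, size_t, size_t);
    extern hunk *lookup_hunk (parser *, unsigned int);
    extern bool tokens_identical (parser *, hunk *);
    extern void reuse_bindings (parser *, hunk *);
    extern void skip_to_hunk_end (parser *, hunk *);
    extern void parse_hunk (parser *);  /* first-time path; the bind()
                                           hook saves top-level decls */

    void
    parse_translation_unit (parser *p)
    {
      while (!at_eof (p))
        {
          size_t start = token_position (p);
          size_t end = next_file_change (p);
          hunk *h = lookup_hunk (p, checksum_tokens (p, start, end));

          if (h != NULL && tokens_identical (p, h))
            {
              /* Hit: copy in the saved bindings, skip the tokens.  */
              reuse_bindings (p, h);
              skip_to_hunk_end (p, h);
            }
          else
            parse_hunk (p);   /* miss: parse normally */
        }
    }

The tokens_identical check is there because a checksum match alone doesn’t prove the hunks are the same.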

There’s some extra bookkeeping here — we only want “top level” hunks, not ones where we saw a file change event in the middle of a function or structure or something; and we can only reuse a hunk if the declarations used by the code in the hunk were also reused. I haven’t yet implemented the latter case, but I’ll have to since I want to test this on GCC itself, and GCC uses constructs like this.
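
One way that dependency check might work is to record, for each hunk, the declarations its code resolved to, and then reuse the hunk only if each name still resolves to the identical decl. A sketch, again with invented names:

    #include <stdbool.h>
    #include <stddef.h>

    struct decl;

    /* What a name resolves to in the current parse; placeholder.  */
    extern struct decl *current_binding (const char *name);

    /* Recorded when the hunk was first parsed.  */
    struct used_decl
    {
      const char *name;
      struct decl *decl;
    };

    /* Reuse is safe only if everything the hunk's code referred to
       was itself reused, i.e. still maps to the identical decl.  */
    static bool
    hunk_deps_reused (const struct used_decl *deps, size_t n)
    {
      size_t i;

      for (i = 0; i < n; i++)
        if (current_binding (deps[i].name) != deps[i].decl)
          return false;
      return true;
    }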

I was able to run it on a smaller program (zenity). We skipped parsing 1.4 million tokens in this program. Naturally, whether or not this is a performance win (I didn’t look at that yet, and honestly for C I am not expecting it) depends on the overhead of the bookkeeping. However, it should reduce memory use, making IMA a bit more practical. And, it gives a good idea of the kind of textual redundancy common in C programs.

Note that this is experimental. I don’t like --combine all that much, since it requires users to change their Makefiles, and because it doesn’t work with C++. However, it provided a simple testbed.

idutils

I love hearing the odd details of how other people work. Usually I’m hoping to learn something that will help me be more productive, and, if not, to hear about something delightfully awful, convoluted, and obscure.

Talking to Roland at JavaOne yielded a nice mix of both. One interesting thing is that he uses full-text search on his source and basically (AIUI) doesn’t run grep from inside Emacs. I, on the other hand, was a long-time igrep user, switching to rgrep (new in Emacs 22) recently.

I decided to give Roland’s approach a try.

First I took a look at namazu, since I knew it had some sort of Emacs integration, and because it is packaged in Fedora. Unfortunately it doesn’t seem to understand how to index source code.

So then I tried GNU idutils. I had heard of this years ago, and even tried it once or twice, but somehow it didn’t end up in my repertoire. Anyway, it turns out that this is a pretty nice tool. It isn’t packaged for Fedora, but building it only took a couple minutes. The Emacs support code needed some tweaks (email me if you want them), but it works well enough out of the box.

The result is pretty nice. For small projects it may not matter much, but searches over the code I’m indexing (GCC) are much, much faster. This is especially nice since the work I’m doing on GCC at the moment requires repeated searches for all the users of various global variables.

Idutils could use a little love — right now there’s no way to incrementally update the index, which is a pain. If it affects me enough I suppose I’ll set up a cron job or something using inotify. This is just gloss, though. The important thing is that it has removed a pointless wait from my process.

Front End Stuff

I’ve been digging through GCC’s C front end the last couple of weeks, trying to understand how it works and how feasible some parts of my project are.

The parser itself is nice. It is a hand-written recursive descent parser, which means that it is “just plain code” and thus simple to debug. It is written in a typical object-oriented style, where every function in the parser takes a “c_parser” object pointer as its first parameter.
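
The flavor, paraphrased from memory rather than copied from the source (so don’t trust the helper names to match exactly):

    #include <stdbool.h>

    typedef struct c_parser c_parser;

    extern bool c_parser_next_token_is (c_parser *, int type);
    extern void c_parser_consume_token (c_parser *);
    extern void c_parser_error (c_parser *, const char *gmsgid);

    /* Every function threads the parser through as an explicit
       "this" pointer.  */
    static void
    c_parser_empty_statement (c_parser *parser)
    {
      if (c_parser_next_token_is (parser, ';'))
        c_parser_consume_token (parser);
      else
        c_parser_error (parser, "expected %<;%>");
    }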

Unfortunately the rest of the C front end is not so nice. There are many (around 200) global variables, and even though the parser itself is quite clean, it implicitly relies on these via functions elsewhere in the front end. This means that making the front end reentrant is going to be a fair amount of work. While much of this will be mechanical, there are some ugly things, like places where lang hooks (a sort of virtual function in the compiler) are designed in a way that implicitly requires a global variable.
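
Here’s the shape of that last problem, as an invented sketch rather than GCC’s actual lang hooks structure; a hook whose signature takes no context argument has nowhere to find its data but a global:

    /* Invented sketch.  */
    struct decl_list;

    struct lang_hooks_decls
    {
      /* "Return the declarations of the current scope."  With no
         parameter saying *which* parser's current scope, the
         implementation has to consult a global.  */
      struct decl_list *(*getdecls) (void);
    };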

I haven’t looked deeply into the C++ front end yet, but a quick look with nm shows that there are about 100 globals there as well.

Making the parser re-entrant is sort of a side plot for me: it will help with part of my project, but it isn’t on the critical path. Still, I’d like to see the C and C++ front ends eventually be written in “gcjx style”: thread-aware reusable libraries, written in an OO style, separated from the back ends.

So why work on this at all right now? I didn’t know the C front end at all, and I thought this would be a way to learn about it without getting bored just reading code. This plan is working pretty well. I’ve also spent a bit of time simply stepping through various parts of the compiler with gdb — another useful technique for learning a foreign code base.

Meanwhile I’ve been working on the design of the more crucial parts of the plan — namely, re-using parsed headers and providing incremental recompilation. My current hope is to have something ready for critique before the GCC Summit.

Ocean’s 13

My brother called this “Ocean’s 11.5”.

I was disappointed by 12 — having the plot be a pointless, artificial contest was, in my opinion, a big cop-out by the writers. It was also somewhat incoherent at points, with too much dialog left implied.

13 takes a stab at avoiding these problems. The plot is about revenge, old-fashioned but still fun. The dialog and situations aren’t quite as hard to follow (though I did spend much of the movie wondering what a “kooker” was, only to find out from a review that it was “cougar”, a term the review luckily also explained).

In the end 13 was a bit formulaic, but I could watch the cast of the movie all day long, and it was enjoyable enough to recommend. I own 11 and 12 and this one is comfortable enough that I’ll probably buy it someday and watch it when I want an easy distraction.

Word Swirl

Here’s Word Swirl, a dictionary-based game you can play in your web browser. You get points for every word you can find, and if you find a word that uses all the letters you will be able to go to the next round. You can use the space bar to rearrange your letters.

My sister says this doesn’t work in IE — sorry about that. I don’t have Windows here and so I don’t know a way to debug it. If you know what is wrong, let me know. I suppose real web developers must run Windows, or they must have 2-3 machines on their desktop at any given time.

It may take a little while to download, since it downloads its entire dictionary at once. Word Swirl is GPL.

Knocked Up

Steve said he thought this movie was hilarious and he laughed all the way through it. I don’t really understand how that is possible. It was funny in parts, occasionally too crude for my taste, but then also occasionally touching.