Today I’ve been thinking about a problem I’ve run across in my compiler front end work.
The problem is a technical issue having to do with how the C compiler is implemented. Currently if there are multiple declarations for an object, the compiler might modify one of the declarations in place. This causes problems when re-using declarations. What if the second declaration is visible in one compilation unit but not another? Remember that since we’re saving object pointers here, other declarations that refer to the multiply-declared object will point to the first declaration — the one that gets modified.
I haven’t been able to think of a nice way to handle this that also lets us do multi-threaded parsing. We could redefine the various DECL tree accessors to forward requests to a per-compiler (aka thread) “current declaration” — but this is fairly nasty and, due to the nature of trees, probably also error-prone
Any ideas? I mean ones that don’t involve redesigning GCC? The old compile server branch planned to solve this via an undo buffer, but that is not very nice for multi-threaded use (something I plan to do that the compile server did not).
Sometimes I think this project would be better if I went ahead with the more ambitious project of eliminating trees from the front end… but that would both be more work and, I suspect, harder to get approved. But, at least that way I could design in a fix, since I’d be touching every line in the front end anyway.
5 Comments
How about every time the compiler wants to modify one of these objects, take a copy of it, and use the copy for the thread that needs to modify it, leaving all other instances sharing it using the unmodified original.
If it’s possible to detect how many times it’s already being shared you could optimize away the copy in the case where it’s not, yet, but you’d have to add a flag to indicate that if it’s to be shared in future it’ll need to be recreated fresh…
Yeah, some form of copy-on-write seems to be the best way. Perhaps saving the old version in some form of backlink and somehow tagging each version so you know what compilation unit the modifications came from.
The problem with using copy-on-write is kind of obscure. I keep forgetting it :(. Anyway, I think the problem is that GCC compares trees like this using ==, so object identity is important. This is why smashing decls together is done, and this is why all the uses have to refer to a single decl. I need to investigate this a bit more — maybe this is the weak point where a change can be made.
Alternatively I thought of a complicated solution involving recording all the hunks that modify a decl, and then only allowing hunk reuse if all the modifying hunks are used by the compilation unit. This is kind of nasty though, and harder to reason about; for instance one of these hunks may have unsatisfied dependencies, so you may need a relatively hairy dependency check, or you may need to do all the dependency checks before doing any parsing.
Maybe another idea is to change the game a little bit — warn if two decls are smashed, and use the combined decl. Underlying this approach is the hope that smashing decls is uncommon. Some emperical data here would be useful.
That plan won’t work either. In the full incremental case, there would never be a nice way to unsmash a decl — but we definitely need this.
Instead I’m going to have a per-thread hash table mapping unmodified decls to their smashed variants. Then I’ll audit the C front end to add hash lookups everywhere they are needed. Not fun but it will give the correct answer in all the scenarios.