Faster Faster GDB Startup

A while ago, I wrote about my work to speed up GDB’s DWARF reader. I thought I’d write again with a few updates.

Sharding

Back then, I wrote: “maybe GDB could trade memory for performance and shard the resulting index and do separate canonicalizations in each worker thread”.

I did end up doing this. Recall that the canonicalization step goes through all the discovered DWARF entries of interest — basically, all the objects in the program that both have a name and are not in a function scope (except in some languages, there is always an exception with DWARF) — and ensures the names are in a normal form. For Ada, this step includes synthesizing the package hierarchy (something that should probably be done for Go as well, except nobody really works on the Go support in GDB).

As an aside, sometime in the last few years we realized that this canonicalization has to be done for C as well, because in C there are multiple spellings of types like “short”. This is also implemented.

Because GDB already reads DWARF CUs in chunks in separate threads, the sharding idea is that we can speed up canonicalization a bit by doing this separately. Previously, GDB combined all the results before processing. Sharding means that lookups are a little more complicated; but it turns out not to be too hard, because the number of shards is typically low (for reasons I haven’t yet investigated, the reader doesn’t scale past 8 threads or so).

Background Reading

The other major change I made is to do all the DWARF reading in the background. This is a trick to make gdb feel faster to users. The basic idea here is that in many cases, gdb does not immediately need the DWARF from the various files. So, if we push the reading into worker threads, maybe it will be completely read in by the time gdb does need it.

This also somewhat benefits the situation where several shared libraries are loaded at once into the inferior. In this case, gdb already defers breakpoint re-setting until all the DWARF has been read — and with this change, all that work will be done in parallel.

Making this work wasn’t entirely straightforward. The main issue here is that gdb determines the initial language and location for “list” (et al) based on the debug info. The patches arrange to set these things lazily as well. I also had to add some rudimentary thread-safety to BFD.

Now, this can be defeated in a few ways. If you have a .gdbinit that sets a breakpoint, then that will cause the familiar pause, because setting a breakpoint will wait for the workers to complete. Or, if you debug a large executable and type very quickly, you may have to wait for the parsing to finish.

However, when it does work, it feels like gdb starts instantly.

DWARF Abbrevs Use Too Much Space

I was curious about DWARF abbrev table efficiency the other day, so I instrumented gdb to record some simple stats about abbrevs: how many are seen, how many duplicates are seen, and how many bytes are used.

Running gdb on itself, I discovered that abbrevs are largely redundant. In particular, removing redundant abbrevs will remove 95% of abbrevs (10238 unique of 230714 total). Similarly the size of the abbrev tables reduces similarly (230714 bytes needed for the de-duplicated abbrevs, compared to 3848152 as seen in the executable).

Something to think about when you consider the effort DWARF puts in to save a single byte in .debug_info, say by using a 1-byte form rather than a uleb.

I was considering having gdb intern abbrevs and pre-reading all abbrevs so that later steps wouldn’t have to re-read these; but interning turns out to be too slow and so re-reading on demand seems like the way to go.

Faster GDB Startup

After literally years of false starts and failed attempts, last week I finally checked in a series of patches that speed up GDB’s DWARF reader. The speedup for ordinary C++ code is dramatic — I regularly see a 7x performance improvement. For example, on this machine, startup on gdb itself drops from 2.2 seconds to 0.3 seconds. This seems representative, and I’ve seen even better increases on my work machine, which has more cores. Startup on Ada programs is perhaps the worst case for the current code, due to some oddities in Ada debuginfo, but even there it’s a respectable improvement.

GDB Startup

GDB, essentially, had two DWARF readers. They actually shared a surprisingly small amount of code (which was an occasional source of bugs). For example, while abbrev lookup and name generation (more on that later) was shared, the actual DIE data structures were not.

The first DWARF reader created “partial symbols”, which held a name and some associated, easy-to-compute data, like the kind of symbol (variable, function, struct tag, etc). The second DWARF reader (which is still there now) is called when more information was needed about a particular symbol — say, its type. This reader reads all the DIEs in a DWARF compilation unit and expands them into gdb’s symbol table, block, and type data structures.

Both of these scans were slow, but for the time being I’ve only rewritten the first scan, as it was the one that was first encountered and most obviously painful. (I’ve got a plan to fix up the CU expansion as well, but that’s a lengthy project of its own.)

What Was Slow

The partial symbol reader had several slow points. None of them seemed obviously slow if you looked with a profiler, but each one performed unnecessary work, and they combined in an unfortunate way.

The partial DIE cache. GDB did a scan and saved certain DIEs in a cache. There were some helpful comments that I believe were true at one point that explained why this was useful. However, I instrumented GDB and found that less than 10% of the cached DIEs were ever re-used. Computing and allocating them was largely a waste, just to support a few lookups. And, nearly every DIE that was ever looked up was done so on behalf of a single call — so the cache was nearly useless.
Name canonicalization. DWARF says that C++ names should follow the system demangler. The idea here is to provide some kind of normal form without having to really specify it — this matters because there are multiple valid ways to spell certain C++ names. Unfortunately, GCC has never followed this part of DWARF. And, because GDB wants to normalize user input, so that any spelling will work, the partial reader normalized C++ names coming from the DWARF as well. This area has a whole horrible history (for example, the demangler is crash-prone so GDB installs a SEGV handler when invoking it), but the short form here is that the partial symtab reader first constructed a fully-qualified name, and only then normalized it. This meant that any class or namespace prefix (and there are a lot of them) was re-normalized over and over while constructing names.
The bcache. The partial symbol reader made heavy use of a data structure in GDB called a bcache. This is like a string interner, but it works on arbitrary memory chunks. The bcache was used to intern both the names coming from canonicalization, as well as the partial symbols themselves. This in itself isn’t a problem, except that it requires a lock if you want to use it from multiple threads.

The New Reader

The new reader fixes all the above problems, and implements some other optimizations besides.

There is no more partial DIE cache. Instead, GDB simply scans the DWARF and immediately processes what it finds. While working on this, I realized that whether a given DIE is interesting or not is, largely, a static property of its abbrev. For example, if a DIE does not have a name and does not refer back to another DIE (either via “specification” or “origin” — DWARF is weird), then it can simply be skipped without trying to understand it at all. So, in the new reader, this property is computed once per abbrev and then simply consulted in the scanner, avoiding a lot of repeated checks.

The entire scanner is based on the idea of not trying to form the fully qualified name of a symbol. Now, while the rest of GDB wants the fully-qualified name, there’s no need to store it. Instead, the conversion is handled by the name-lookup code, which splits the searched-for name into components. The scanner creates an index data structures that’s similar to what is described by DWARF 5 (modulo bugs in the standard).

As part of this non-qualifying approach, only the “local” name is stored in each entry. Name canonicalization must still be done for C++ (and a more complicated process for Ada), but this is done on much shorter strings. A form of string interning is still used, but it takes advantage of the fact that the original string comes from the DWARF string table, and so simple pointer comparisons can be done (normally the linker combines identical strings, and if not, this just wastes a little memory). Furthermore, the interning is all done in a worker thread, so in most cases the GDB prompt will return before the work is fully complete — this makes an illusion of speed, and a nicer experience as a user.

Speaking of threads, GDB also now scans all DWARF compilation units in parallel. Specifically, GDB has a parameter that sets the number of worker threads, and it uses a parallel for-each to split the list of compilation units into N groups, and each thread works on a group. I experimented a bit and found that setting N to the number of CPUs on the system works well, at least on the machines I have available.

There’s probably still some room to speed things up some more. Maybe there are some micro-optimizations to be done. Maybe GCC could canonicalize C++ names, and we could eliminate an entire step; or maybe GDB could trade memory for performance and shard the resulting index and do separate canonicalizations in each worker thread.

There’s still an unfortunate amount of hair in there to deal with all the peculiarities of DWARF. DWARF is nicely flexible, but sometimes much too flexible, and actively difficult to read. Also, each version of DWARF yields new modes, which complicate the design. In addition to ordinary DWARF, GDB also deals with split DWARF (two or maybe three kinds), dwz-compressed DWARF (which is standard but has very many inter-CU references, where ordinary compiler-generated DWARF has none), the multi-file dwz extension, and the old debug_types section. Each of these needed special code in the new reader.

Future Work

Full CU expansion is still slow. You don’t see this (much) during GDB startup, but if you’ve ever done a ‘next’ or ‘print’ and then waited interminably — congratulations, you’ve found a bad CU expansion case. Normally these occur when GDB encounters some truly enormous CU… in my experience, most CUs are small, but there are some bogglingly huge outliers.

This is probably the next thing to fix.

The current code still shares less code with the second DWARF reader than you may think. For example, the full symbol reader constructs fully-qualified names according to its own, different algorithm.

My current plan here is to reuse the existing index to construct a sort of skeleton symbol table. Then, we’d further change GDB to fill in the bodies of individual symbols on demand — eliminating the need to ever do a full expansion. (Perhaps this could be extended to types as well, but internally in GDB that may be trickier.) As part of this, the fully-qualified names would be constructed from the index itself, which is also much cheaper than re-computing and re-canonicalizing them.

Summary

GDB is a lot faster to start now. This was done through a combination of removing useless work, smarter data structures, and exploiting the wide availability of multi-core machines.

Warning and Sanitizer Retrospective

One of my hobbies in GDB is cleaning things up. A lot of this is modernizing and C++-ifying the code, but I’ve also enabled a number of warnings and other forms of code checking in the last year or two. I thought it might be interesting to look at the impact, on GDB, of these things.

So, I went through my old warning and sanitizer patch series (some of which are still in progress) to see how many bugs were caught.

This list is sorted by least effective first, with caveats.

-fsanitize=undefined; Score: 0 or 10

You can use -fsanitize=undefined when compiling to have GCC detect undefined behavior in your code. This series hasn’t landed yet (it is pending some documentation updates).

We have a caveat already! It’s not completely fair to put UBsan at the top of the list — the point of this is that it detects situations where the compiler might do something bad. As far as I know, none of the undefined behavior that was fixed in this series caused any visible problem (so from this point of view the score is zero); however, who knows what future compilers might do (and from this point of view it found 10 bugs). So maybe UBSan should be last on the list.

Most of the bugs found were due to integer overflow, for example decoding ULEB128 in a signed type. There were also a couple cases of passing NULL to memcpy with a length of 0, which is undefined but should probably just be changed in the standard.

-Wsuggest-override; Score: 0

This warning will fire if you have a method that could have been marked override, but was not. This did not catch any gdb bugs. It does still have value, like everything on this list, because it may prevent a future bug.

-Wduplicated-cond; Score: 1

This warning detects duplicated conditions in an if-else chain. Normally, I suppose, these would arise from typos or copy/paste in similar conditions. The one bug this caught in GDB was of that form — two identical conditions in an instruction decoder.

GCC has a related -Wduplicated-branches warning, which warns when the arms of an if have identical code; but it turns out that there are some macro expansions in one of GDB’s supporting libraries where this triggers, but where the code is in fact ok.

-Wunused-variable; Score: 2

When I added this warning to the build, I thought the impact would be removing some dead code, and perhaps a bit of fiddling with #ifs. However, it caught a couple of real bugs: cases where a variable was unused, but should have been used.

-D_GLIBCXX_DEBUG; Score: 2

libstdc++ has a debug mode that enables extra checking in various parts of the C++ library. For example, enabling this will check the irreflexivity rule for operator<. While the patch to enable this still hasn’t gone in — I think, actually, it is still pending some failure investigation on some builds — enabling the flag locally has caught a couple of bugs. The fixes for these went in.

-Wimplicit-fallthrough; Score: 3

C made a bad choice in allowing switch cases to fall through by default. This warning rectifies this old error by requiring you to explicitly mark fall-through cases.

Apparently I tried this twice; the first time didn’t detect any bugs, but the second time — and I don’t recall what, if anything, changed — this warning found three bugs: a missing break in the process recording code, and two in MI.

-Wshadow=local; Score: 3

Shadowing is when a variable in some inner scope has the same name as a variable in an outer scope. Often this is harmless, but sometimes it is confusing, and sometimes actively bad.

For a long time, enabling a warning in this area was controversial in GDB, because GCC didn’t offer enough control over exactly when to warn, the canonical example being that GCC would warn about a local variable named “index“, which shadowed a deprecated C library function.

However, now GCC can warn about shadowing within a single function; so I wrote a series (still not checked in) to add -Wshadow=local.

This found three bugs. One of the bugs was found by happenstance: it was in the vicinity of an otherwise innocuous shadowing problem. The other two bugs were cases where the shadowing variable caused incorrect behavior, and removing the inner declaration was enough to fix the problem.

-fsanitize=address; Score: 6

The address sanitizer checks various typical memory-related errors: buffer overflows, use-after-free, and the like. This series has not yet landed (I haven’t even written the final fix yet), but meanwhile it has found 6 bugs in GDB.

Conclusion

I’m generally a fan of turning on warnings, provided that they rarely have false positives.

There’s been a one-time cost for most warnings — a lot of grunge work to fix up all the obvious spots. Once that is done, though, the cost seems small: GDB enables warnings by default when built from git (not when built from a release), and most regular developers use GCC, so build failures are caught quickly.

The main surprise for me is how few bugs were caught. I suppose this is partly because the analysis done for new warnings is pretty shallow. In cases like the address sanitizer, more bugs were found; but at the same time there have already been passes done over GDB using Valgrind and memcheck, so perhaps the number of such bugs was already on the low side.

Emacs JIT Calling Convention

I’ve been working a bit more on my Emacs JIT, in particular on improving function calling. This has been a fun project so I thought I’d talk about it a bit.

Background

Under the hood, the Emacs Lisp implementation has a few different ways to call functions. Calls to or from Lisp are dispatched depending on what is being called:

For an interpreted function, the arguments are bound and then the interpreter is called;
For a byte-compiled function using dynamic binding, the arguments are bound and then the bytecode interpreter is called;
For a byte-compiled function using lexical binding, an array of arguments is passed to the bytecode interpreter;
For a function implemented in C (called a “subr” internally), up to 8 arguments are supported directly — as in, C functions of the form f(arg,arg,...); for more than that, an array of arguments is passed and the function itself must decide which slot means what. That is, there are exactly 10 forms of subr (actually there are 11 but the one missing from this description is used for special forms, which we don’t need to think about here).

Oh, let’s just show the definition so you can read for yourself:

union {
Lisp_Object (*a0) (void);
Lisp_Object (*a1) (Lisp_Object);
Lisp_Object (*a2) (Lisp_Object, Lisp_Object);
Lisp_Object (*a3) (Lisp_Object, Lisp_Object, Lisp_Object);
Lisp_Object (*a4) (Lisp_Object, Lisp_Object, Lisp_Object, Lisp_Object);
Lisp_Object (*a5) (Lisp_Object, Lisp_Object, Lisp_Object, Lisp_Object, Lisp_Object);
Lisp_Object (*a6) (Lisp_Object, Lisp_Object, Lisp_Object, Lisp_Object, Lisp_Object, Lisp_Object);
Lisp_Object (*a7) (Lisp_Object, Lisp_Object, Lisp_Object, Lisp_Object, Lisp_Object, Lisp_Object, Lisp_Object);
Lisp_Object (*a8) (Lisp_Object, Lisp_Object, Lisp_Object, Lisp_Object, Lisp_Object, Lisp_Object, Lisp_Object, Lisp_Object);
Lisp_Object (*aUNEVALLED) (Lisp_Object args);
Lisp_Object (*aMANY) (ptrdiff_t, Lisp_Object *);
} function;

Initial Approach

Initially the JIT worked like a lexically-bound bytecode function: an array of arguments was passed to the JIT-compiled function. The JIT compiler emitted a bunch of code to decode the arguments.

For Lisp functions taking a fixed number of arguments, this wasn’t too bad — just moving values from fixed slots in the array to fixed values in the IR.

Handling optional arguments was a bit uglier, involving a series of checks and branches, so that un-bound arguments could correctly be set to nil. These were done something like:

if nargs < 1 goto nope1
arg0 = array[0]
if nargs < 2 goto nope2
arg1 = array[1]
goto first_bytecode
nope1: arg0 = nil
nope2: arg1 = nil
first_bytecode: ...

&rest arguments were even a bit worse, requiring a call to create a list. (This, I think, can’t be avoided without a much smarter compiler, one that would notice when reifying the list could be avoided.)

Note that calling also has to use the fully generic approach: we make a temporary array of arguments, then call a C function (Ffuncall) that does the dispatching to the callee. This is also a source of inefficiency.

Today

Recently, I changed the JIT from this approach to use the equivalent of the subr calling convention. Now, any function with 8 or fewer (non-&rest) arguments is simply an ordinary function of N arguments, and we let the already-existing C code deal with optional arguments.

Although this often makes the generated assembly simpler, it won’t actually perform any better — the same work is still being done, just somewhere else. However, this approach does use a bit less memory (most JIT-compiled functions are shorter); and it opens the door to an even bigger improvement.

The Future

What I’m implementing now is an approach to removing most of the overhead from JIT-compiled function calls.

Now, ideally what I’d like is to have every call site work “like C”: move the arguments to exactly where the callee expects them to be, and then call. However, while looking at this I found some problems that make it tricky:

We still need to be able to call Lisp functions from C, so we’re limited to, at best, subr-style calling conventions;
While &rest arguments are straightforward (in our simple compiler, somebody just has to make the list); &optional arguments don’t have a good C-like analog. The callee could push extra arguments, but…
In Lisp, a function can be redefined at any time, and it is fine to change the function’s signature.

Consider this example:

(defun callee (x &optional y) (list x y))
(defun caller (callee 23))
(defun callee (x) (list x))

Now, if we compiled caller with a direct call, it would turn out like (callee 23 nil). But then, after the redefinition, we’d have to recompile caller. Note this can go the other way as well — we could redefine callee to have more optional arguments, or even more fixed arguments (meaning that the call should now throw an exception).

Recompiling isn’t such a big deal, right? The compiler is set up very naively: it just compiles every function that is invoked, and in this mode “recompilation” is equivalent to “just abandon the compiled code”.

Except… what do you do if caller is being run when callee is redefined? Whoops!

Actually, of course, this is a known issue in JIT compilation, and one possible solution is “on-stack replacement” (“OSR”) — recompiling a function while it is running.

This to me seemed like a lot of bookkeeping, though: keeping a list of which functions to compile when some function was redefined, and figuring out a decent way to implement OSR.

The Plan

Instead I came up a with a simpler approach, involving — you guessed it — indirection.

On the callee side, I am going to keep the subr calling convention that is in place today. This isn’t ideal in all cases, but it is reasonable for a lot of code. Instead, all the changes will take place at spots where the JIT emits a call.

I am planning to have three kinds of function calls in the JIT:

Indirect. If we see some code where we can’t determine the callee, we’ll emit a call via Ffuncall like we do today.
Fully direct. There are some functions that are implemented in C, and that I think are unreasonable to redefine. For these, we’ll just call the C function directly. Another fully-direct case is where the code dispatches to a byte-code vector coming from the function’s constant pool — here, there’s no possibility to redefine the function, so we can simply always call the JIT-compiled form.
Semi-direct. This will be the convention used when JIT-compiled code calls via a symbol.

The core idea of a semi-direct call is to have multiple possible implementations of a function:

One “true” implementation. If the function has 8 or fewer arguments (of any kind), it will simply have that many arguments. The JIT will simply pretend that an optional argument is fixed. If it has more than 8 arguments, following the subr convention it will just accept an array of arguments.
If the function has optional or rest arguments, there will be trampoline implementations with fewer arguments, that simply supply the required number of additional arguments and then call the true implementation.
Remember how there are exactly 10 relevant kinds of subr? Any forms not covered by the above can simply throw an exception.

A vector of function pointers will be attached to each symbol, and so the JIT-compiled code can simply load the function pointer from the appropriate slot (a single load — the nice thing about a JIT is we can simply hard-code the correct address).

Then, when a function is redefined, we simply define any of the trampolines that are required as well. We won’t even need to define all of them — only the ones that some actually-existing call site has needed.

Of course, no project like this is complete without a rathole, which is why instead of doing this I’m actually working on writing a compiler pre-pass so that the compiler itself can have the appropriate information about the callee at the point of call. This sub-project turns out to feel a lot like writing a Java bytecode verifier…

Further Future

Currently the JIT is only used for lexically-bound bytecode functions. That’s a reasonable restriction, I think — so one thing we should do is make sure that more of the Emacs core is using lexical binding. Currently, only about 1/3 of the Lisp files in Emacs enable this feature; but many more easily could.

Once my current project is done, the JIT will have a decent calling convention by default. Since we’ll have information about callees at points of call, I think it will be a good time to look into inlining. This will require tackling recompilation (and perhaps OSR) and having some sort of tiered optimization approach. There is still a lot for me to learn here — when does it make sense to inline? And what metrics should I use to decide when some code is hot enough to optimize? So, good times head even once the current project is done; and BTW if you have a reading list for any of this I would love to hear about it.

Once this is done, well, I have more ideas for even deeper JIT improvements. Those will have to wait for another post.

JIT Compilation for Emacs

There have been a few efforts at writing an Emacs JIT — the original one, Burton Samograd’s, and also Nick LLoyd’s. So, what else to do except write my own?

Like the latter two, I based mine on GNU libjit. I did look at a few other JIT libraries: LLVM, gcc-jit, GNU Lightning, MyJit. libjit seemed like a nice middle ground between a JIT with heavy runtime costs (LLVM, GCC) and one that is too lightweight (Lightning).

All of these Emacs JITs work by compiling bytecode to native code. Now, I don’t actually think that is the best choice — it’s just the easiest — but my other project to do a more complete job in this area isn’t really ready to be released. So bytecode it is.

Emacs implements a somewhat weird stack-based bytecode. Many ordinary things are there, but seemingly obvious stack operations like “swap” do not exist; and there are bytecodes for very specialized Emacs operations like forward-char or point-max.

Samograd describes his implementation as “compiling down the spine”. What he means by this is that the body of each opcode is implemented by some C function, and the JIT compiler emits, essentially, a series of subroutine calls. This used to be called “jsr threading” in the olden days, though maybe it has some newer names by now.

Of course, we can do better than this, and Lloyd’s JIT does. His emits instructions for the bodies of most bytecodes, deferring only a few to helper functions. This is a better approach because many of these operations are only one or two instructions.

However, his approach takes a wrong turn by deferring stack operations to the compiled code. For example, in this JIT, the Bdiscard opcode, which simply drops some items from the stack, is implemented as:

 CASE (Bdiscard):
 {
   JIT_NEED_STACK;
   JIT_INC (ctxt.stack, -sizeof (Lisp_Object));
   JIT_NEXT;
   NEXT;
 }

It turns out, though, that this isn’t needed — at least, for the bytecode generated by the Emacs byte-compiler, the stack depth at any given PC is a constant. This means that the stack adjustments can be done at compile time, not runtime, leading to a performance boost. So, the above opcode doesn’t need to emit code at all.

(And, if you’re worried about hand-crafted bytecode, it’s easy to write a little bytecode verifier to avoid JIT-compiling invalid things. Though of course you shouldn’t worry, since you can already crash Emacs with bad bytecode.)

So, naturally, my implementation does not do this extra work. And, it inlines more operations besides.

Caveat

I’ve only enabled the JIT for bytecode that uses lexical binding. There isn’t any problem enabling it everywhere, I just figured it probably isn’t that useful, and so I didn’t bother.

Results

The results are pretty good. First of all, I have it set up to automatically JIT compile every function, and this doesn’t seem any slower than ordinary Emacs, and it doesn’t crash.

Using the “silly-loop” example from the Emacs Lisp manual, with lexical binding enabled, I get these results:

Mode	Time
Interpreted	4.48
Byte compiled	0.91
JIT	0.26

This is essentially the best case for this JIT, though.

Future Directions

I have a few ideas for how to improve the performance of the generated code. One way to look at this is to look at Emacs’ own C code, to see what advantages it has over JIT-compiled code. There are really three: cheaper function calls, inlining, and unboxing.

Calling a function in Emacs Lisp is quite expensive. A call from the JIT requires marshalling the arguments into an array, then calling Ffuncall; which then might dispatch to a C function (a “subr”), the bytecode interpreter, or the ordinary interpreter. In some cases this may require allocation.

This overhead applies to nearly every call — but the C implementation of Emacs is free to call various primitive functions directly, without using Ffuncall to indirect through some Lisp symbol.

Now, these direct calls aren’t without a cost: they prevent the modification of some functions from Lisp. Sometimes this is a pain (it might be handy to hack on load from Lisp), but in many cases it is unimportant.

So, one idea for the JIT is to keep a list of such functions and then emit direct calls rather than indirect ones.

Even better than this would be to improve the calling convention so that all calls are less expensive. However, because a function can be redefined with different arguments, it is tricky to see how to do this efficiently.

In the Emacs C code, many things are inlined that still aren’t inlined in the JIT — just look through lisp.h for all the inline functions (and/or macros, lisp.h is “unusual”). Many of these things could be done in the JIT, though in some cases it might be more work than it is worth.

Even better, but also even more difficult, would be inlining from one bytecode function into another. High-performance JITs do this when they notice a hot spot in the code.

Finally, unboxing. In the Emacs C code, it’s relatively normal to type-check Lisp objects and then work solely in terms of their C analogues after that point. This is more efficient because it hoists the tag manipulations. Some work like this could be done automatically, by writing optimization passes for libjit that work on libjit’s internal representation of functions.

Getting the Code

The code is on the libjit branch in my Emacs repository on github. You’ll have to build your own libjit, too, and if you want to avoid hacking on the Emacs Makefile, you will need my fork of libjit that adds pkg-config files.

FOSDEM, Rust, and Debugging

I’ve recently switched groups at Mozilla to start working full-time on improving Rust debugging. To kick this off and to meet people from the various projects I’m working on — the Rust compiler, lldb, llvm, gdb, and (eventually) the DWARF standard — I will speak about this work at FOSDEM. If you’re going and want to meet up, drop me a line.

Shaggy Dogs and SpiderMonkey Unwinders

A year or so ago I was asked to debug a crash in the Firefox devtools. Crashes are easy! I fired up gdb and reproduced the crash… which turned out to be in some code JITted by SpiderMonkey. I was immediately lost; even a simple bt did not work. Someone more familiar with the JIT — hi Shu — had to dig out the answer :-(.

I did take the opportunity to get some information from him about how he found the result, though. He pointed me to the code responsible for laying out JIT stack frames. It turned out that gdb could not unwind through JIT frames, but it could be done by hand — so I resolved then to eventually fix this.

Phase One

I knew from my gdb hacking that gdb has a JIT unwinding API. Actually — and isn’t this the way most programs end up working? — it has two.

The first JIT API requires some extra work on the part of the JIT: it constructs an object file, typically ELF and DWARF, in memory, then calls a hook. GDB sets a breakpoint on this hook and, when hit, it reads the data from the inferior. This lets the JIT provide basically any kind of information — but it’s pretty heavy.

So, I focused my attention on the second API. In this mode, the JIT author would provide a shared library that used some callbacks to inform gdb of the details of what was going on. The set of callbacks was much more limited, but could at least describe how to unwind the registers. So, I figured that this is what I would do.

But… I didn’t really want to write this in C. That would be a real pain! C is fiddly and hard to deal with, and it would mean constant rebuilding of the shared library while debugging, and SpiderMonkey already had a reasonable number of gdb-python scripts — surely this could be done in Python.

So I took the quixotic approach, namely writing a shared library that used the second gdb JIT API but only to expose this API to Python.

Of course, this turned out to be Rube Goldbergian. Various parts of the gdb Python API could not be called from the JIT shared library, because those bits depended on other state in gdb, which wasn’t set properly when the JIT library was being called. So, I had gdb calling into my shared library, which called my Python code, which then invoked a new gdb command (written in Python and supplied by my package) — that existed solely for the purpose of setting this internal state properly — and that in turn invoked the code I wanted to run, say to fetch memory or a register or something.

Computer Science!

Well, that took a while. But it sort of worked! And maybe I could just keep it in github and not put it in Mozilla Central and avoid learning about the Firefox build system and copying in some gdb header file and license review and whatnot.

So I started writing the actual Python code… OMG. And see below since you will totally want to know about this. But meanwhile…

… while I was hacking away on this crazy idea, someone implemented the much more sane idea of just exposing gdb’s unwinder API to gdb’s Python layer.

Hmm… why didn’t I do that? Well, I left gdb under a bit of a cloud, and didn’t really want to be that involved at the time. Plus, you know, gdb is a high quality project; which means that if you write a giant patch to expose the unwinding API, you have to be prepared for 17 rounds of patch review (this really happened once), plus writing documentation and tests. Sometimes it’s just easier to channel one’s inner Rube.

Phase Two

The integrated Python API was a great development. Now I could delete my shared library and my insane trampoline hacks, and focus on my insane unwinding code.

A lot of this work was straightforward, in the sense that the general outline was clear and just the details remained. The details amount to things like understanding the SpiderMonkey frame descriptor (which partly describes the previous frame and partly the new frame; there’s one comment explaining this that somehow eluded me for quite a while); duplicating the SpiderMonkey JIT unwinding code in Python; and of course carefully reading the SpiderMonkey code that JITs the “entry frame” code to understand how registers are spilled.

Naturally, while doing this it turned out that I was maybe the first person to use these gdb APIs in anger. I found some gdb crashes, oops! The docs would have been impenetrable, except I already knew the underlying C APIs on which they were based… whew! The Python API was unexpectedly picky in other areas, too.

But then there was also some funny business, one part in gdb, and one part in SpiderMonkey.

GDB is probably more complicated than you realize. In this case, the complexity is that, in gdb, each stack frame can have its own architecture. This seemingly weird functionality is actually used; I think it was invented for the SPU, but some other chips have multiple modes as well. But what this means is that the question “what architecture is this program?” is not well-defined, and anyway gdb’s Python layer doesn’t provide you a way to find whatever approximation it is that would make sense in your specific case. However, when writing the SpiderMonkey unwinder, it kind of actually is well-defined and we’d like to know the answer to know which unwinder to choose.

For this problem I settled on the probably terrible idea of checking whether a given register is available. That is, if you see “$rip“, you can guess it’s x86-64.

The other problem here is that gdb thinks that, since you wrote an unwinder, it should get the first stab at unwinding. That’s very polite! But for SpiderMonkey, deciding “hey, is this PC in some code the JIT emitted?” is actually a real pain, or at least outside the random bits of it I learned in order to make all this work.

Aha! I know, there’s probably a Python API to say “is this address associated with some shared library?” I remembered reading and/or reviewing a patch… but no, gdb.solib_name is close but doesn’t do the right thing for addresses in the main executable. WAT.

I tried several tricks without success, and in the end I went with parsing /proc/maps to get the mappings to decide whether a given frame should be handled by this unwinder or by gdb. Horrible. And fails with remote debugging.

Luckily, nobody does remote debugging.

Remote Debugging

Oh, wait, people do remote debugging at Mozilla all the time. They don’t call it “remote debugging” though — they call it “using RR“, which while it runs locally, appears to be remote to gdb; and, importantly, during replay mode fakes the PID, and does other deep magic, though not deep enough to extend to making a fake map file that could be read via gdb’s remote get command.

By the way, you should be using RR. It’s the best advance in debugging since, well, gdb. It’s a process record-and-replay program, but unlike gdb’s built-in reverse debugging, it handles threads properly and has decent performance.

Oh Well

Oh well. It just won’t work remotely. Or at least not until fellow Mozillian (this always seems like it should be “Mozillan” to me, but it’s not, there really is that extra “i”) and all-star Nicolas Pierron wrote some additional Python to read some SpiderMonkey tables to make the decision in a more principled way. Now it will all work!

Though looking now I wonder if I dreamed this, because the code isn’t checked in. I know he had a patch but my memory is a bit fuzzy — maybe in the end it didn’t work, because RR didn’t implement the qGetTLSAddr packet, which gdb uses to read thread-local storage. Did I mention the thread-locals?

The Real Start of the Story

So, way back at the beginning, during my initial foray into this code, I found that a crucial bit of information — the appropriately-named TlsPerThreadData — was stashed away in a thread-local variable. Information stored here is needed by the unwinder in order to unwind from a C++ frame into a JIT frame.

Only, Firefox didn’t use “real” thread-local variables, the things that so many glibc and gcc hackers put so much effort into micro-optimizing. No, it just used a template class that wrapped pthread_setspecific and friends in a relatively ergonomic way.

Naturally, for an unwinder this is a disaster. Why? Unwinding is basically the dissection of the stack; but in order to compute the value of one of these thread-local-storage objects, the unwinder would have to make some function calls in the inferior (in fact this prevents it from working on OSX). But these would affect the stack, and also potentially let other inferior code (in other threads — remember, gdb is complicated and you can exert various unusual kinds of control like this) run as well.

So I neglected to mention the very first step: changing Firefox to use __thread. (Ok, I didn’t really neglect to mention it, I was just being lazy and anyway it’s a shaggy dog story.)

Do Not Use libthread_db

RR did not implement qGetTLSAddr, which we needed, because lots of people at Mozilla use RR. So I set out to implement that. This meant a foray into the dangerous world of libthread_db.

For reasons I do not know, and suspect that I do not want to know, glibc has historically followed many Solaris conventions. One such Solaris innovation was libthread_db — a library that debuggers use to find certain information from libc, information like the address of a thread-local variable

On the surface this seems like a great idea: don’t bake the implementation details of the C library into the debugger. Instead, let the debugger use a debugging library that comes with the C library. And, if you designed it that way, it would be a good idea.

Sadly, though, libthread_db was not designed that way. Oh no.

For example, libthread_db has a callback interface. The calling program — gdb or rr — must provide some functions that libthread_db can call, to do some simple things like “read some memory”; or some very complicated things like “find the address of a symbol given its name”. Normal C programmers might implement these callbacks using a structure containing function pointers. But not libthread_db! Instead it uses fixed symbol names that must be provided by the calling application. Not all of these are required for it to work (you get to figure out which, yay!), but some definitely are. And, you have to dlopen a libthread_db that matches the libc of the inferior that you’re debugging (or link against it, but that’s also obviously bad).

Wait, you say. Doesn’t that mess up cross-debugging? Why yes! Yes it does! Which is why qGetTLSAddr has to be in the gdb remote serial protocol to start with.

Hey, maybe the Linux vendors should fix this. They are — see Gary Benson’s Infinity project — but unfortunately that’s still in development and I wanted RR to work sooner.

Ok, so whew. I wrote qGetTLSAddr support for RR. This was a small patch in the end, but an unusual pain in an already painful series. Hopefully this won’t spill out into other programs.

glibc

Hahaha, you are so funny. Of course it spills out: remember how you have to define a bunch of functions with specific names in your program in order to use libthread_db? Well, how do you know you got the types correct?

Yeah, you include <proc_service.h> (a name deliberately chosen to confuse, I suppose, why not, it doesn’t bear any obvious relationship to the library). Only, that was never installed by glibc. Instead, gdb just copied it into the source tree.

So naturally I went and fixed this in glibc. And, even more naturally, this broke the gdb build, which was autoconf’d to check for a file that never existed in the past. LOL.

Thank You Cthulhu

At this point I figured it was only a matter of time until I had to patch the kernel. Thankfully this hasn’t been necessary yet.

It Says What

In gdb the actual unwinding and the display of frames are separate concerns.

And let me digress here to say that gdb’s unwinder design is excellent. I believe it was redone by Andrew Cagney (this was well before my active time in gdb, so apologies if you’re reading this and you did it and I’ve misattributed it). Like much of gdb, many of the details are bizarre and take one back to the byte-counting days of 1987; but the high level design is very solid and has endured with, I think, just one significant change (to support inline functions) in the intervening 15 or so years. I’ve long thought that this is a remarkable accomplishment in the programming world.

So, yes. It’s not enough to just unwind. Simply having an unwinder yields backtraces with lines like:

#5 0xfeefee ???

Better than nothing! But not yet great.

The second part of the SpiderMonkey unwinder is, therefore, a gdb “frame filter”. This is an object that takes raw frames and decorates them with information like a function name, or a file name, or arguments.

Work to add this information is ongoing — I landed one patch just yesterday, and another one, to add more information about interpreted frames, is still in the works. And there are two more bugs filed… maybe this project, like this blog post, will never conclude. It will just scroll endlessly.

But now, with all the code in place, bt can show something like:

#6 0x00007ffff7ff20f3 in <<JitFrame_BaselineJS "f1">> (this=JSVAL_VOID, arg1=$jsval(4700))

This is the call f1(4700).

Let’s Just Have One More

Of course we still couldn’t enable this unwinder by default. You have to enable it by hand.

And by the way, in the first release of gdb’s Python unwinder feature, enabling or disabling an unwinder didn’t flush the frame cache, so it wouldn’t actually take effect until some invisible-to-the-user state change took place. I fixed this bug, but here Pedro Alves also taught me the secret gdb command flushregs, which in fact just flushes the frame cache. (I’m going to go out on a limb and guess that this command predates the already ancient maint prefix command, hence its weird name.)

Anyway, you have to enable it by hand because the unwinder itself doesn’t work properly if the outermost frame is in JIT code. The JIT, in the interest of performance, doesn’t maintain a frame pointer. This means that in the outermost frame, there’s no reliable way to find the object that describes this frame and links to the previous frame.

Now, normally in this case gdb would either resort to debug info (not available here), or in extremis its encyclopedic suite of prologue analyzers (yes, gdb can analyze common function prologues for all architectures developed in the last 25 years to figure out stuff) — but naturally JIT compilers go their own way here as well.

Humans, like Shu back at the start of this story, can do this by dumping parts of the stack and guessing which bytes represent the frame header.

But, I’ve been reluctant and a bit afraid to hack a heuristic into the unwinder.

To sum up — in case you missed it — this means that all the code written during this entire saga would still not have helped with my original bug.

The End

The Deletion of gcj

I originally posted this on G+ but I thought maybe I should expand it a little and archive it here.

The patch to delete gcj went in recently.

When I was put on the gcj project at Cygnus, I remember thinking that Java was just a fad and that this was just a temporary thing for me. I wasn’t that interested in it. Then I ended up working on it for 10 years.

In some ways it was the high point of my career.

Socially it was fantastic, especially once we merged with the Classpath community — I’ve always considered Mark Wielaard’s leadership in that community as the thing that made it so great. I worked with and met many great people while working on gcj and Classpath, but I especially wanted to mention Andrew Haley, who is the single best debugger I’ve ever met, and who stayed in the Java world, now working on OpenJDK.

We also did some cool technical things in gcj. The binary compatibility ABI was great, and the split verifier was very fun to come up with. Per Bothner’s early vision for gcj drove us for quite a while, long after he left Cygnus and stopped working on it.

On the downside, gcj was never quite up to spec with Java. I’ve met Java developers even as recently as last year who harbor a grudge against gcj.

I don’t apologize for that, though. We were trying something difficult: to make a free Java with a relatively small team.

When OpenJDK came out, the Sun folks at FOSDEM were very nice to say that gcj had influenced the opening of the JDK. Now, I never truly believed this — I’m doubtful that Sun ever felt any heat from our ragtag operation — but it was very gracious of them to say so.

Since the gcj days I’ve been searching for basically the same combination that kept me hacking on gcj all those years: cool technology, great social environment, and a worthwhile mission.

This turned out to be harder than I expected. I’m still searching. I never thought it was possible to go back, though, and with this deletion, this is clearer than ever.

There’s a joy in deleting code (though in this case I didn’t get to do the deletion… grrr); but mainly this weekend I’m feeling sad about the final close of this chapter of my life.

GDB Preattach

In firefox development, it’s normal to do most development tasks via the mach command. Build? Use mach. Update UUIDs? Use mach. Run tests? Use mach. Debug tests? Yes, mach mochitest --debugger gdb.

Now, normally I run gdb inside emacs, of course. But this is hard to do when I’m also using mach to set up the environment and invoke gdb.

This is really an Emacs bug. GUD, the Emacs interface to all kinds of debuggers, is written as its own mode, but there’s no really great reason for this. It would be way cooler to have an adaptive shell mode, where running the debugger in the shell would magically change the shell-ish buffer into a gud-ish buffer. And somebody — probably you! — should work on this.

But anyway this is hard and I am lazy. Well, sort of lazy and when I’m not lazy, also unfocused, since I came up with three other approaches to the basic problem. Trying stuff out and all. And these are even the principled ways, not crazy stuff like screenify.

Oh right, the basic problem. The basic problem with running gdb from mach is that then you’re just stuck in the terminal. And unless you dig the TUI, which I don’t, terminal gdb is not that great to use.

One of the ideas, in fact the one this post is about, since this post isn’t about the one that I couldn’t get to work, or the one that is also pretty cool but that I’m not ready to talk about, was: hey, can’t I just attach gdb to the test firefox? Well, no, of course not, the test program runs too fast (sometimes) and racing to attach is no fun. What would be great is to be able to pre-attach — tell gdb to attach to the next instance of a given program.

This requires kernel support. Once upon a time there were some gdb and kernel patches (search for “global breakpoints”) to do this, but they were never merged. Though hmm! I can do some fun kernel stuff with SystemTap…

Specifically what I did was write a small SystemTap script to look for a specific exec, then deliver a SIGSTOP to the process. Then the script prints the PID of the process. On the gdb side, there’s a new command written in Python that invokes the SystemTap script, reads the PID, and invokes attach. It’s a bit hacky and a bit weird to use (the SIGSTOP appears in gdb to have been delivered multiple times or something like that). But it works!

It would be better to have this functionality directly in the kernel. Somebody — probably you! — should write this. But meanwhile my hack is available, along with a few other gdb scxripts, in my gdb helpers github repository.

Older Entries