Today I want to write something about register allocation, however. Register allocation is probably not my favorite problem, on account of being both messy and thankless. It is a messy problem because - aside from being NP-hard to solve optimally - hardware instruction sets and software ABI's introduce all sorts of annoying constraints. And it is a thankless problem because the case in which a good register allocator is useful - for instance, when there's lots of intermediate values used over a long stretch of code - are fairly rare. Much more common are the cases in which either there are trivially sufficient registers, or ABI constraints force a spill to memory anyway (e.g. when calling a function, almost all registers can be overwritten).
So, on account of this being not my favorite problem, and also because I promised to implement optimizations in the register allocator, I've been researching if there is a way to do better. And what better place to look than one of the fastest dynamic language implementations arround, LuaJIT? So that's what I did, and this post is about what I learned from that.
Truth be told, LuaJIT is not at all a learners' codebase (and I don't think it's author would claim this). It uses a rather terse style of C and lots and lots of preprocessor macros. I had somewhat gotten used to the style from hacking dynasm though, so that wasn't so bad. What was more surprising is that some of the steps in code generation that are distinct and separate in the MoarVM JIT - instruction selection, register allocation and emitting bytecode - were all blended together in LuaJIT. Over multiple backend architectures, too. And what's more - all these steps were done in reverse order - from the end of the program (trace) to the beginning. Now that's interesting...
I have no intention of combining all phases of code generation like LuaJIT has. But processing the IR in reverse seems to have some interesting properties. To understand why that is, I'll first have to explain how linear scan allocation currently works in MoarVM, and is most commonly described:
- First, the live ranges of program values are computed. Like the name indicates, these represent the range of the program code in which a value is both defined and may be used. Note that for the purpose of register allocation, the notion of a value shifts somewhat. In the expression DAG IR, a value is the result of a single computation. But for the purposes of register allocation, a value includes all its copies, as well as values computed from different conditional branches. This is necessary because when we actually start allocating registers, we need to know when a value is no longer in use (so we can reuse the register) and how long a value will remain in use -
- Because a value may be computed from distinct conditional branches, it is necessary to compute the holes in the live ranges. Holes exists because if a value is defined in both sides of a conditional branch, the range will cover both the earlier (in code order) branch and the later branch - but from the start of the later branch to its definition that value doesn't actually exist. We need this information to prevent the register allocator from trying to spill-and-load a nonexistent value, for instance.
- Only then can we allocate and assign the actual registers to instructions. Because we might have to spill values to memory, and because values now can have multiple definitions, this is a somewhat subtle problem. Also, we'll have to resolve all architecture specific register requirements in this step.
In the MoarVM register allocator, there's a fourth step and a fifth step. The fourth step exists to ensure that instructions conform to x86 two-operand form (Rather than return the result of an instruction in a third register, x86 reuses one of the input registers as the output register. E.g. all operators are of the form a = op(a, b) rather than a = op(b, c). This saves on instruction encoding space). The fifth step inserts instructions that are introduced by the third step; this is done so that each instruction has a fixed address in the stream while the stream is being processed.
Altogether this is quite a bit of complexity and work, even for what is arguably the simplest correct global register allocation algorithm. So when I started thinking of the reverse linear scan algorithm employed by LuaJIT, the advantages became clear:
- In LuaJIT, the IR maintains its SSA form - there is only a single definition of a value. This means that when iterating in reverse order, computing the live range becomes trivial. When we first encounter a use of a value, then by definition that is the last use. And when we encounter a definition, that is the only and single definition, and we can release the register. So there's no need to compute the live range in advance of allocation.
- Furthermore, rather than merging the values of multiple branches into the same live range, each value on either side becomes an individual live range. As a result, the live range of a value never has a hole, further simplifying code.
- LuaJIT uses register hints to indicate which registers could best be picked for a specific value. This is often determined by how a value is used (e.g., the divisor in a div instruction must be in the rcx register). If the preferred register can't be allocated, the register allocator inserts code to move it to the right place where needed. Having hints can be expected to greatly reduce the need for such code.
There are downsides as well, of course. Not knowing exactly how long a value will be live while processing it may cause the algorithm to make worse choices in which values to spill. But I don't think that's really a great concern, since figuring out the best possible value is practically impossible anyway, and the most commonly cited heuristic - evict the value that is live furthest in the future, because this will release a register over a longer range of code, reducing the chance that we'll need to evict again - is still available. (After all, we do always know the last use, even if we don't necessarily know the first definition).
Altogether, I'm quite excited about this algorithm; I think it will be a real simplification over the current implementation. Whether that will work out remains to be seen of course. I'll let you know!