donderdag 9 februari 2017

Register Allocator Update

Hi everybody, I thought some yof you might be interested in an update regarding the JIT register allocator, which is after all the last missing piece for the new 'expression' JIT backend. Well, the last complicated piece, at least. Because register allocation is such a broad topic, I don't expect to cover all topics relevant to design decisions here, and reserve a future post for that purpose.

I think I may have mentioned earlier that I've chosen to implement linear scan register allocation, an algorithm first described in 1999. Linear scan is relatively popular for JIT compilers because it achieves reasonably good allocation results while being considerably simpler and faster than the alternatives, most notably via graph coloring (unfortunately no open access link available). Because optimal register allocation is NP-complete, all realistic algorithms are heuristic, and linear scan applies a simple heuristic to good effect. I'm afraid fully explaining the nature of that heuristic and the tradeoffs involves is beyond the scope of this post, so you'll have to remind me to do it at a later point.

Commit ab077741 made the new allocator the default after I had ironed out sufficient bugs to be feature-equivalent to the old allocator (which still exists, although I plan to remove it soon).
Commit 0e66a23d introduced support for 'PHI' node merging, which is really important and exciting to me, so I'll have to explain what it means. The expression JIT represents code in a form in which all values are immutable, called single static assignment form, or SSA form shortly. This helps simplify compilation because there is a clear correspondence between operations and the values they compute. In general in compilers, the easier it is to assert something about code, the more interesting things you can do with it, and the better code you can compile. However, in real code, variables are often assigned more than one value. A PHI node is basically an 'escape hatch' to let you express things like:

int x, y;
if (some_condition()) {
    x = 5;
} else {
    x = 10;
y = x - 3;

In this case, despite our best intentions, x can have two different values. In SSA form, this is resolved as follows:

int x1, x2, x3, y;
if (some_condition()) {
    x1 = 5;
} else {
    x2 = 10;
x3 = PHI(x1,x2);
y = x3 - 3;

The meaning of the PHI node is that it 'joins together' the values of x1 and x2 (somewhat like a junction in perl6), and represents the value of whichever 'version' of x was ultimately defined. Resolving PHI nodes means ensuring that, as far as the register allocator is concerned, x1, x2, and x3 should preferably be allocated to the same register (or memory location), and if that's not possible, it should copy x1 and x2 to x3 for correctness. To find the set of values that are 'connected' via PHI nodes, I apply a union-find data structure, which is a very useful data structure in general. Much to my amazement, that code worked the first time I tried it.

Then I had to fix a very interesting bug in commit 36f1fe94 which involves ordering between 'synthetic' and 'natural' tiles. (Tiles are the output of the tiling process about which I've written at some length, they represent individual instructions). Within the register allocator, I've chosen to identify tiles / instructions by their index in the code list, and to store tiles in a contiguous array. There are many advantages to this strategy but they are also beyond the scope of this post. One particular advantage though is that the indexes into this array make their relative order immediately apparent. This is relevant to linear scan because it relies on relative order to determine when to allocate a register and when a value is no longer necessary.

However, because of using this index, it's not so easy to squeeze in new tiles to that array - which is exactly what a register allocator does, when it decides to 'spill' a value to memory and load it when needed. (Because inserts are delayed and merged into the array a single step, the cost of insertion is constant). Without proper ordering, a value loaded from memory could overwrite another value that is still in use. The fix for that is, I think, surprisingly simple and elegant. In order to 'make space' for the synthetic tiles, before comparison all indexes are multiplied by a factor of 2, and synthetic tiles are further offset by -1 or +1, depending on whether they should be handled before or after the 'natural' tile they are inserted for. E.g. synthetic tiles that load a value should be processed before the tile that uses the value they load.

Another issue soon appeared, this time having to do with x86 being, altogether, quaint and antiquated and annoying, and specifically with the use of one operand register as source and result value. To put it simply, where you and I and the expression JIT structure might say:

a = b + c

x86 says:

a = a + b

Resolving the difference is tricky, especially for linear scan, since linear scan processes the values in the program rather than the instructions that generate them. It is therefore not suited to deal with instruction-level constraints such as these. If a, b, and c in my example above are not the same (not aliases), then this can be achieved by a copy:

a = b
a = a + c

If a and b are aliases, the first copy isn't necessary. However, if a and c are aliases, then a copy may or may not be necessary, depending on whether the operation (in this case '+') is commutative, i.e. it holds for '+' but not for '-'. Commit 349b360 attempts to fix that for 'direct' binary operations, but a fix for indirect operations is still work in progress. Unfortunately, it meant I had to reserve a register for temporary use to resolve this, meaning there are fewer available for the register allocator to use. Fortunately, that did simplify handling of a few irregular instructions, e.g. signed cast of 32 bit integers to 64 bit integers.

So that brings us to today and my future plans. The next thing to implement will be support for function calls by the register allocator, which involves shuffling values to the right registers and correct positions on the stack, and also in spilling all values that are still required after the function call since the function may overwrite them. This requires a bit of refactoring of the logic that spills variables, since currently it is only used when there are not enough registers available. I also need to change the linear scan main loop, because it processes values in order of first definition, and as such, instructions that don't create any values are skipped, even if they need special handling like function calls. I'm thinking of solving that with a special 'interesting tiles' queue that is processed alongside the main values working queue.

That was it for today. I hope to write soon with more progress.

Geen opmerkingen:

Een reactie plaatsen