So what news from JIT compiler land? Well, since last I've blogged I had to design and implement the register allocator and tie everything together into the final running compiler. That has been achieved, but there are still many bugs and limitations that prevent it from actually being effective. I will talk about these at length. But first, let me explain the register allocator.
The basic problem of register allocation is that a compiler can assign more values than the CPU has registers. Thus, only some values can reside in registers and others must be stored to memory (spilled). Computing the best set of values to keep in registers is in general an impossible problem. To solve it I use the 'linear scan' allocation heuristic. This heuristic is as simple as can be - determine for each value allocated it's last use, and when it's time to spill a value, spill the one which will be live furthest in the future. These slides also do a good job of explaining it.
Aside from being simple to implement, it's also effective. Values that will expire soon are likely to be used soon as well, so their register will free up soon enough. On the other hand, values that are bound to live a long time will likely be spilled before they expire anyway, so it costs less to spill them now. Another benefit is that this can be evaluated online, i.e. as the JIT tree is being processed, so that it doesn't require a separate step. (One abstraction it did require, though, was order-numbering the tiles, which are currently attached to the JIT tree. This makes the tiles in essence a linear list, and it is my plan to convert them to a linear array in memory as well. That would reduce the number of tree traversals by one as well).
I will not maintain the illusion here that a register allocator is a trivial component, even with a conceptually simple algorithm as just outlined. One of the tricky bits is ensuring that values defined in a conditional branch do not accidentally 'escape' out of their branches, since they will be unavailable if the branch was not taken. In the same vein after a call all 'volatile' registers are invalidated, so that also requires some special treatment.
After the register allocator was finished, all that remained was ironing out the bugs. And they were (and are) many. The most annoying of these were the DynASM machine code generation bug. The DynASM runtime is a tight state machine that uses a mix of data and instructions to generate machine code. The first bug was relatively simple - a missing REX byte marking caused by the preprocessor looking at only one of the two operands. The second bug was positively evil. See, x86 uses so called ModR/M bytes to specify the register and memory access mode used for an instruction. It's not important you know what it stands for, but what is important is that each byte - 8 bits - is divided into a mode of 2 bits and 2 register numbers of 3 bits each. (There are 8 legacy registers in x86, so that fits). Except when the register number is 4 (rsp or the stack pointer register). Then the whole meaning of the byte changes to a SIB byte, which is quite different entirely - it refers to two registers at once. The upshot is that this SIB byte now must be appended to the ModR/M byte and filled with correct data; and that this SIB byte is then interpreted as if it were a ModR/M byte anyway.. I've patched DynASM to do this, but it is really quite sensitive and brittle and I quite expect to have to fix this in the future in another way.
That brings me to today. Unfortunately for my JIT aspirations, my educational obligations have caught up with me again. In other words: my study has started and leaves me with much less time to work on the JIT. So for clarity, I have completed in the last few months the following:
- The expression tree intermediate format and template preprocessor
- Documentation on the use of these
- The tiler and tiler table generator
- A register allocator and assorted support structures in the new JIT
- Dynamic register support for x64 DynASM
- Periodic reports on progress by means of this blog
Here is what I intended (and indeed promised) I would do, and haven't finished yet:
- A test suite - I haven't needed it during development as regular NQP compilation already involves plenty JIT compiler activity. The benefit of a real test suite is of course that specific segments can be tested (like numeric registers)
- An representation-based extension API. I haven't gotten around to it because the JIT has not been stable enough so far. It will be based on the use of templates specialized to a certain in-memory representation.
- Documentation on the use and development of architecture-specific tiles.
- Full floating point operation support (the simplest way is probably to use a wholy separate allocator for numeric registers).
- Automatic within-template ordering to maintain consistent branch semantics (rather than relying on let ordering to build values shared between branches).
- A framework for optimizing transformation. I think the key to this is probably a replacement table allowing different algorithms to insert replacements for given nodes.
- As I said above, a truly linear representation of the selected tiles, in order to allow reordering for the purposes of instruction scheduling.