Wrapping Up

Hi everybody, it's been a while since I've blogged, and I think you deserve an update. Last week, of course, was YAPC::EU, which was awesome. Granada is a very nice place, and the weather was excellent. The tapas lunch was very nice, as was the gala dinner (also with tapas). There were many interesting people and presentations (many more than I could actually see). It was also very interesting to present (slides) about the JIT, which I think went well. One of the comments I heard was that it was quite a high-level talk, so if anyone should ask, I can and will give a talk describing the grueling details in the future. Oh, and I just found out that Larry Wall's keynote has been uploaded to YouTube. Go ahead and watch, this page isn't going anywhere.

So what news from JIT compiler land? Well, since I last blogged I have designed and implemented the register allocator and tied everything together into the final running compiler. That has been achieved, but there are still many bugs and limitations that prevent it from actually being effective. I will talk about these at length. But first, let me explain the register allocator.

The basic problem of register allocation is that a compiler can assign more values than the CPU has registers. Thus, only some values can reside in registers and the others must be stored to memory (spilled). Computing the best set of values to keep in registers is in general an intractable problem. To solve it I use the 'linear scan' allocation heuristic, which is as simple as can be: determine for each allocated value its last use, and when it is time to spill a value, spill the one whose live range extends furthest into the future. These slides also do a good job of explaining it.
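
To make the heuristic concrete, here is a minimal sketch in C of the spill choice. All of the names below (the register count, the array, the function) are hypothetical illustrations, not the actual MoarVM code: when no register is free, evict the value whose last use lies furthest in the future.

    /* Sketch of the linear-scan spill heuristic; names are hypothetical,
     * not MoarVM's actual code. */
    enum { NUM_GPR = 16 };

    /* For each register: the order number of the last use of the value
     * it currently holds, or -1 if the register is free. */
    extern int last_use_of[NUM_GPR];

    /* No register is free: evict the value whose last use lies furthest
     * in the future, i.e. the one with the largest last-use number. */
    int choose_spill(void) {
        int victim = 0;
        for (int reg = 1; reg < NUM_GPR; reg++) {
            if (last_use_of[reg] > last_use_of[victim])
                victim = reg;
        }
        return victim;
    }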

Aside from being simple to implement, it is also effective. Values that will expire soon are likely to be used soon as well, so their registers will free up soon enough. On the other hand, values that are bound to live a long time will likely be spilled before they expire anyway, so it costs less to spill them now. Another benefit is that this can be evaluated online, i.e. as the JIT tree is being processed, so it doesn't require a separate step. (One abstraction it did require, though, was order-numbering the tiles, which are currently attached to the JIT tree. This makes the tiles in essence a linear list, and it is my plan to convert them to a linear array in memory as well, which would also reduce the number of tree traversals by one.)
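
As an illustration of what that order numbering buys, here is a sketch of tiles laid out as a flat array (the field and function names are made up; the real tiles carry much more information). With the tiles in evaluation order, the allocator needs only a single left-to-right pass:

    /* Hypothetical tile record; the real tiles also carry register
     * requirements, the code emitter to invoke, and so on. */
    typedef struct {
        int order;   /* position in evaluation order */
        int node;    /* expression tree node this tile covers */
    } Tile;

    /* Once the selected tiles live in a flat array sorted by 'order',
     * register allocation becomes one pass over the array instead of
     * yet another traversal of the expression tree. */
    void allocate_registers(Tile *tiles, int num_tiles) {
        for (int i = 0; i < num_tiles; i++) {
            /* assign registers for tiles[i], spilling when none are free */
        }
    }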

I will not maintain the illusion here that a register allocator is a trivial component, even with a conceptually simple algorithm like the one just outlined. One of the tricky bits is ensuring that values defined in a conditional branch do not accidentally 'escape' out of their branch, since they will be unavailable if the branch was not taken. In the same vein, after a call all 'volatile' (caller-saved) registers are invalidated, so that also requires some special treatment.
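
For the call case, the special treatment amounts to spilling whatever still lives in a caller-saved register before the call instruction is emitted. A rough sketch follows, with hypothetical names (value_in_reg, spill_value) that are not the actual allocator's API:

    /* Sketch only; names are hypothetical, not MoarVM's actual code. */
    enum { NUM_GPR = 16 };

    extern int value_in_reg[NUM_GPR];    /* -1 means the register is free */
    extern void spill_value(int value);  /* hypothetical: store the value to memory */

    /* Caller-saved ("volatile") registers may be clobbered by the callee,
     * so any value still living in one must be spilled before the call. */
    void invalidate_volatiles(unsigned volatile_mask) {
        for (int reg = 0; reg < NUM_GPR; reg++) {
            if ((volatile_mask & (1u << reg)) && value_in_reg[reg] >= 0) {
                spill_value(value_in_reg[reg]);
                value_in_reg[reg] = -1;  /* the register is free again */
            }
        }
    }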

After the register allocator was finished, all that remained was ironing out the bugs. And they were (and are) many. The most annoying of these were the DynASM machine code generation bugs. The DynASM runtime is a tight state machine that uses a mix of data and instructions to generate machine code. The first bug was relatively simple: a missing REX byte marking, caused by the preprocessor looking at only one of the two operands. The second bug was positively evil. See, x86 uses so-called ModR/M bytes to specify the registers and memory access mode used by an instruction. It's not important that you know what it stands for, but it is important that this byte - 8 bits - is divided into a mode of 2 bits and two register numbers of 3 bits each. (There are 8 legacy registers in x86, so that fits.) Except when the register number is 4 (rsp, the stack pointer register). Then the meaning changes: a SIB byte must follow, which is quite different entirely - it refers to two registers at once. The upshot is that this SIB byte must now be appended to the ModR/M byte and filled with the correct data, and that this SIB byte is then interpreted as if it were a ModR/M byte anyway. I've patched DynASM to do this, but it is really quite sensitive and brittle, and I expect to have to fix this in another way in the future.
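
To see why that register number is so troublesome, here is a purely illustrative sketch in C of how the ModR/M and SIB bytes are laid out and where the special case kicks in. This is not DynASM's code, and the helper names are made up:

    #include <stdint.h>

    /* ModR/M: mod (2 bits) | reg (3 bits) | r/m (3 bits) */
    static uint8_t modrm(unsigned mod, unsigned reg, unsigned rm) {
        return (uint8_t)((mod << 6) | ((reg & 7) << 3) | (rm & 7));
    }

    /* SIB: scale (2 bits) | index (3 bits) | base (3 bits) */
    static uint8_t sib(unsigned scale, unsigned index, unsigned base) {
        return (uint8_t)((scale << 6) | ((index & 7) << 3) | (base & 7));
    }

    /* Encode a memory operand [base + disp8] for a given reg operand.
     * When the base register is 4 (rsp), r/m = 4 no longer selects rsp
     * directly; it means "a SIB byte follows", so a SIB byte with
     * index = 4 (no index) and base = rsp must be appended to get the
     * intended meaning. */
    int encode_mem_disp8(uint8_t *out, unsigned reg, unsigned base) {
        int n = 0;
        out[n++] = modrm(1, reg, base);        /* mod = 01: disp8 follows */
        if ((base & 7) == 4)                   /* rsp (and r12) need a SIB */
            out[n++] = sib(0, 4, base);
        /* the 8-bit displacement itself is appended after these bytes */
        return n;
    }

Presumably the reason this bites DynASM's dynamic register support in particular is that the register number is only known when the machine code is generated, so the decision to append and fill a SIB byte has to be made by the runtime encoder rather than by the preprocessor.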

That brings me to today. Unfortunately for my JIT aspirations, my educational obligations have caught up with me again. In other words: my studies have started again and leave me with much less time to work on the JIT. So, for clarity, here is what I have completed in the last few months:
  • The expression tree intermediate format and template preprocessor
    • Documentation on the use of these
  • The tiler and tiler table generator
  • A register allocator and assorted support structures in the new JIT
  • Dynamic register support for x64 DynASM 
  • Periodic reports on progress by means of this blog
Timo Paulssen has developed a C-source to expression tree format preprocessor, which should in the best case help to quickly convert most of the old JIT compiler segments to the new one, once it is more stable.

Here is what I intended (and indeed promised) I would do, and haven't finished yet:
  • A test suite - I haven't needed one during development, as regular NQP compilation already involves plenty of JIT compiler activity. The benefit of a real test suite is of course that specific segments (like numeric registers) can be tested.
  • A representation-based extension API. I haven't gotten around to it because the JIT has not been stable enough so far. It will be based on the use of templates specialized to a certain in-memory representation.
  • Documentation on the use and development of architecture-specific tiles.
And there are things which I haven't outlined explicitly but which I still want to do:
  • Full floating point operation support (the simplest way is probably to use a wholly separate allocator for numeric registers).
  • Automatic within-template ordering to maintain consistent branch semantics (rather than relying on let ordering to build values shared between branches).
  • A framework for optimizing transformations. I think the key to this is probably a replacement table that allows different algorithms to insert replacements for given nodes.
  • As I said above, a truly linear representation of the selected tiles, in order to allow reordering for the purposes of instruction scheduling.
And there are many more. But as I said, I cannot continue with them as I have in the past few months. I will continue to work to stabilise the JIT, only at a slower pace than before, and I hope to succeed in that before Christmas. I realise it is somewhat disappointing not to be able to fully 'land' the new JIT yet, but I'm confident we'll get there in the end. See you next time!
