FOSDEM and the future
Hi all, I realise I haven't written one of these posts for a long time. Since November, in fact. So you could be forgiven for believing I had stopped working on the MoarVM JIT. Fortunately, that is not entirely true. I have, in fact, been very busy with a project that has nothing to do with perl6, namely SciGRID, for which I've developed GridKit. GridKit is a toolkit for extracting a power network model from OpenStreetMap, which happens to contain a large number of individual power lines and stations. Such a network can then be used to compute the flow of electric power throughout Europe, which in turn can be used to optimize the integration of renewable energy sources, among other things. So that was and is an exciting project, and I have a lot to write about it yet. It is not, however, the topic of my post today.
Let's talk about the expression JIT, where it stands, how it got there, and where it needs to go. Last time I wrote, I had just finished reducing the expression JIT surface area to the point where it could just compile correct code. That helped in making a major change, which I called tile linearisation. Tile linearisation is just one example of a major idea that I missed last summer, so it may be worthwhile to expand a bit on it.
As I've explained at some length before, the expression JIT initially creates trees of low-level operations out of high-level operations, which are then matched (tiled) to machine-level operations. Each low-level operation can be expressed by a machine-level operation, but some machine-level instructions match multiple low-level operations at once. The efficient and optimal matching of low-level to machine-level operations is the tiling step of the compiler, and it is where most of my efforts have been.
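To make that a bit more concrete, here is a minimal sketch in C, with invented node kinds (this is not MoarVM's actual tree representation), of an expression tree and the single machine-level tile that can cover several of its nodes at once:

    /* Hypothetical node kinds for a tree of low-level operations
     * (invented for illustration; not MoarVM's actual node types). */
    enum node_kind { REG, CONST, ADDR, LOAD, ADD };

    struct node {
        enum node_kind  kind;
        struct node    *child[2]; /* operands, if any */
        int             value;    /* register number or constant value */
    };

    /* The tree for "add the value stored at (rbx + 16) to rax":
     *
     *         ADD
     *        /   \
     *     REG     LOAD
     *    (rax)      |
     *             ADDR
     *            /    \
     *         REG      CONST
     *        (rbx)      (16)
     *
     * A naive node-by-node translation needs three instructions:
     *
     *     lea rcx, [rbx+16]    ; ADDR
     *     mov rcx, [rcx]       ; LOAD
     *     add rax, rcx         ; ADD
     *
     * but a single x86-64 tile can cover the ADD together with the whole
     * LOAD/ADDR subtree:
     *
     *     add rax, [rbx+16]
     *
     * Choosing such larger tiles wherever they are cheaper is exactly what
     * the tiling step does. */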
Initially, I had 'tagged' these tiles onto the tree that had been created, relying on tree traversal to get the tiles to emit assembly code. This turned out to be a poor idea, because it introduces an implicit order based on the tree traversal order. That is first of all finicky: it forces the tile numbering order to be the same in the register allocator, in the tile selection algorithm, and again in the code emitter. In practice that meant the last two of these were implemented in a single online step. More importantly, and more troubling, it makes it much harder to determine the exact extent of live ranges and of basic blocks.
The notion of basic blocks is also one that I missed. Expression trees are typically compiled for a single basic block at a time. A basic block is a sequence of instructions that executes from start to end without interruption, with no branches into or out of its middle. This allows for some nice simplifications, because it means that a value placed in a register by one instruction will still be there at the next. (In contrast, if it were possible to 'jump in between' the instructions, this would not be so easy to ensure.) However, these basic blocks are defined at the level of MoarVM instructions. Like most high-level language interpreters, MoarVM instructions are polymorphic and can check and dispatch based on their operands. In other words, a single MoarVM instruction can span multiple machine-level basic blocks. For correct register allocation, it is vital that the register allocator knows about these basic blocks. But this is obscured, to say the least, by the expression 'tree' structure, which really forms a Directed Acyclic Graph, owing to the use of values by multiple consumers.
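As an illustration of why that matters, here is a hedged sketch in C, with invented names (not MoarVM's actual code), of what a single polymorphic 'count the elements' style instruction might expand to. The dispatch check introduces a branch, so the one interpreter-level instruction spans several machine-level basic blocks:

    /* Invented types and names, for illustration only. */
    typedef struct { long count; } ArrayBody;
    typedef struct { int repr_id; void *body; } Object;

    enum { REPR_ARRAY = 1 };

    /* Stub standing in for a generic, representation-independent count. */
    static long generic_elems(Object *obj) { (void)obj; return 0; }

    long elems_op(Object *obj) {
        if (obj->repr_id == REPR_ARRAY)              /* block 1: dispatch check */
            return ((ArrayBody *)obj->body)->count;  /* block 2: fast path      */
        return generic_elems(obj);                   /* block 3: generic C call */
    }

    /* A value placed in a register during block 1 is only guaranteed to
     * survive into blocks 2 and 3 if the register allocator knows about
     * this control flow - which is why the basic block boundaries need to
     * be explicit, even inside a single MoarVM instruction. */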
The point of tile linearisation is to provide an authoritative, explicit order for tiles - and for the code sequences they represent - so that they can be clearly and obviously placed into basic blocks. That, in turn, allows the register allocator to be extended to deal with cross-basic-block compilation. (In the distant future, we might even implement some form of instruction scheduling.) As a side effect, it also means that the register allocation step should be moved out of the code emitter. I've asked around and got some nice papers about that, and it seems like implementing one of these algorithms - I'm still biased towards linear scan - is entirely feasible, once I have the details figured out. Part of the plan is to extract value descriptors from the tree (much like the tile state) and treat them as immutable, introducing copies as necessary (for instance for live range splitting). The current register allocator can survive as the register selector, because it has some interesting properties in that respect.
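For the curious, the core of linear scan is compact enough to sketch here. This is a simplified version with invented names - no spilling decisions, no live range splitting, no calling-convention constraints - purely to show the shape of the algorithm I have in mind:

    /* A value descriptor with a live range [start, end) and an assigned
     * register; -1 means 'no register, would need to be spilled'. */
    typedef struct { int start, end, reg; } LiveRange;

    #define NUM_REGS 6

    /* 'ranges' must be sorted by increasing start position. */
    void linear_scan(LiveRange *ranges, int n) {
        int        free_regs[NUM_REGS], num_free = NUM_REGS;
        LiveRange *active[NUM_REGS];
        int        num_active = 0;

        for (int r = 0; r < NUM_REGS; r++)
            free_regs[r] = r;

        for (int i = 0; i < n; i++) {
            /* Expire ranges that ended before the current one starts,
             * returning their registers to the free pool. */
            for (int j = 0; j < num_active; ) {
                if (active[j]->end <= ranges[i].start) {
                    free_regs[num_free++] = active[j]->reg;
                    active[j] = active[--num_active];
                } else {
                    j++;
                }
            }
            if (num_free > 0) {
                /* A register is available: hand it to the current range. */
                ranges[i].reg = free_regs[--num_free];
                active[num_active++] = &ranges[i];
            } else {
                /* No register available; a real allocator would now spill
                 * either this range or the active range that ends last. */
                ranges[i].reg = -1;
            }
        }
    }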
Aside from that I've implemented a few other improvements, like:
- Refactored the tiler table generator, so that it could be extended to include tile arguments. This considerably simplifies the implementation of tiles. An interesting possibility, I think, is to make the tiler select tile candidates rather than a specific tile, which might allow choosing an optimal tile based on operation arguments rather than only on tree structure. Furthermore, the tiler table generator is now cleanish perl, which should probably help with maintenance.
- Factored tiler state out of the tree. I had initially implemented nearly all tree operations by means of 'tagging' the tree in an 'Info' structure. (Structures named 'Info' are like classes named 'Manager': a sign of a code problem.) However, this meant that the tile information was 'dragged along' with the tree for its entire lifetime, which is not really necessary, because the tiler state is temporary.
- Fixed a number of small issues, some of them centered around operand sizes, libuv versions, and build systems.
- Presented at FOSDEM on how amd64 assembly language works and how it can be used in perl 6.
- Implemented a JIT expression frame bisect tool which allows us to pinpoint precisely where the compilation of a perl6 (or nqp) frame breaks.
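The bisect tool deserves a small illustration, because the idea is very simple: binary-search for the first frame at which enabling the expression JIT makes the program misbehave. Below is a minimal sketch with invented names (the real tool is, of course, specific to MoarVM's frames and the way the JIT can be enabled selectively):

    /* Assumed to run the test program with the expression JIT enabled for
     * only the first 'limit' frames, and to report whether the output is
     * still correct. (Invented name, for illustration.) */
    int runs_correctly(int limit);

    /* Given that compiling 0 frames works and compiling all 'total' frames
     * does not, return the index of the first frame whose expression-JIT
     * compilation breaks the program. */
    int bisect_broken_frame(int total) {
        int good = 0, bad = total;   /* invariant: 'good' works, 'bad' fails */
        while (bad - good > 1) {
            int mid = good + (bad - good) / 2;
            if (runs_correctly(mid))
                good = mid;
            else
                bad = mid;
        }
        return bad;
    }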
There is also an issue with the way labels and 'dynamic control labels' are currently emitted; the sequence of events is roughly as follows:
- We start a basic block with a label
- We append to that a 'dynamic control label' sequence, which updates the JIT 'reentry' label to point to the start of the basic block. This allows various operations in MoarVM which need to inspect the current program position - lexical variables in an inlined frame, exception handlers - to know where we are in the program (a small sketch of this idea follows below the list).
- An instruction is annotated with tags that indicate that it is, for example, the first instruction of an exception handler, or a deoptimisation point.
- Because it is the first instruction of an exception handler, it must be labeled, so a label is inserted prior to the instruction, and a dynamic control label sequence is also inserted prior to it.
- Because the instruction was also the first one of its basic block, it acquires the same label as the basic block. However, between the two identical labels a dynamic control sequence is inserted, which means that the same label is emitted twice, and the two occurrences do not mark the same position.
- Presumably, although I haven't checked it, the last or the first label wins. But all repeated dynamic control labels are redundant. In this case, up to 4 different control labels are stacked on top of each other.
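To make the 'dynamic control label' mentioned above a bit more concrete: conceptually, it boils down to a single store of the current block's address into a per-frame pointer, so that runtime code can map the machine-code position back to a program position. A minimal sketch, with invented names (not MoarVM's actual frame layout):

    /* Invented frame layout, for illustration only. */
    typedef struct {
        void *jit_position;   /* start of the basic block we are currently in */
        /* ... other per-frame state ... */
    } Frame;

    /* What the 'dynamic control label' sequence emitted at the start of a
     * basic block conceptually amounts to. Exception handlers, lexical
     * lookup in inlined frames, and deoptimisation can then consult
     * frame->jit_position to find out where in the JIT-compiled code the
     * frame currently is. */
    static inline void dynamic_control_label(Frame *frame, void *block_label) {
        frame->jit_position = block_label;
    }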
Anyway, that's all for today, and I hope next time I will be able to bring you some good news. See you!