Tuesday 28 March 2017

Function Call Milestone

Hi everybody. It's high time for another update, and this time I have good news. The 'expression' JIT compiler can now compile native ('C') function calls (although it's not able to use the results). This is a major milestone because function calls are hard! (At least from the perspective of a compiler, and especially from the perspective of the register allocator). Also because native function calls are really very important in MoarVM. Most of its 'primitive' operations (like hash table access, string equality, big integer arithmetic) are implemented by invoking native functions, and so to compile almost any program the JIT has to compile many function calls.

What makes function calls 'hard' is that they must implement the 'calling convention' of the relevant 'application binary interface' (ABI). In short, the ABI specifies the locations of function call parameters. A small number of parameters (on Windows, the first 4; on POSIX platforms, the first 6) are placed in registers, and if there are more parameters they are usually placed on the stack. Aside from the calling convention, the ABI also specifies the expected alignment of the stack pointer (per 16 bytes), which registers a function may overwrite ('clobber' in ABI-speak), and which registers must have their original values after the function returns. The last type of registers are called 'callee-saved'. Note that at least a few registers must be callee-saved, especially those related to call stack management, because if the callee function were to overwrite those it would be impossible to return control back to the caller. By the way, manipulating exactly those registers is how the setjmp and longjmp 'functions' work.
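
To make the parameter placement concrete, here is a minimal C sketch of the lookup a compiler might do for integer and pointer arguments on x86-64. The register orders are those of the System V AMD64 ABI (the POSIX case) and the Microsoft x64 calling convention; the type and function names are invented for this example and are not taken from MoarVM.

```c
#include <assert.h>

/* Hypothetical names, for illustration only. */
typedef enum { ABI_SYSV, ABI_WIN64 } abi_kind;

typedef enum {
    REG_RAX, REG_RCX, REG_RDX, REG_RBX, REG_RSP, REG_RBP,
    REG_RSI, REG_RDI, REG_R8, REG_R9,
    REG_NONE = -1               /* argument is passed on the stack */
} x64_reg;

static const x64_reg sysv_int_args[6]  = { REG_RDI, REG_RSI, REG_RDX, REG_RCX, REG_R8, REG_R9 };
static const x64_reg win64_int_args[4] = { REG_RCX, REG_RDX, REG_R8, REG_R9 };

/* Register that holds the n-th (0-based) integer or pointer argument,
 * or REG_NONE when that argument goes on the stack instead. */
x64_reg int_arg_register(abi_kind abi, int n) {
    assert(n >= 0);
    if (abi == ABI_SYSV)
        return n < 6 ? sysv_int_args[n] : REG_NONE;
    return n < 4 ? win64_int_args[n] : REG_NONE;
}
```

Note that the same argument position maps to different registers on the two platforms, so the JIT cannot hardcode either table.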

So the compiler is tasked with generating code that ensures the correct values are placed in the correct registers. That sounds easy enough, but what if these registers are taken by other values, and what if those other values might be required for another parameter? Indeed, what if the value in the %rdx register needs to be in the %rsi register, and the value of the %rsi register is required in the %rdx register? How to determine the correct ordering for shuffling the operands?

One simple way to deal with this would be to eject all values from registers onto the stack, and then to load the values back from the stack into registers when they are needed. However, that would be very inefficient, especially if most function calls have no more than 6 (or 4) parameters and most of these parameters are computed for the function call only. So I thought that solution wouldn't do.

Another way to solve this would be if the register allocator could ensure that values are placed in their correct registers directly - especially for register parameters - i.e. by 'precoloring'. (The name comes from register allocation algorithms that work by 'graph coloring', something I will try to explain in a later post). However, that isn't an option due to my choice of 'linear scan' as the register allocation algorithm. This is a 'greedy' algorithm, meaning that it decides the allocation for a live range as soon as it encounters it, and that it cannot revert that decision once it's been made. (If it could, it would be more like a dynamic programming algorithm). So to ensure that the allocation is valid I'd have to make sure that the information about register requirements is propagated backwards from the instructions to all values that might conflict with it... and at that point we're no longer talking about linear scan, and I would be better off engineering a new algorithm. Not a very attractive option either!

Instead, I thought about it and it occurred to me that this problem seems a lot like unravelling a dependency graph, with a number of restrictions. That is to say, it can be solved by a topological sort. I map the registers to a graph structure as follows:

  • Each register forms a node
  • A transfer from a register to another register, or from a register to a stack location or a local memory location is an edge
  • Each node can have multiple outbound edges, but only one inbound edge (only one value can ever be required in that register)
I linked to the topological sort page for an explanation of the problem, but I think my implementation is really quite different from that presented there. They use a node visitation map and a stack, I use an edge queue and an outbound count. A register transfer (edge) can be enqueued if it is clear that the destination register is not currently used. Transfers from registers to stack locations (as function call parameters) or local memory (to save the value from being overwritten by the called function) are also enqueued directly. As soon as the outbound count of a node reaches zero, it is considered to be 'free' and the inbound edge (if any) is enqueued.

Unlike a 'proper' dependency graph, cycles can and do occur, as in the example where '%rdx' and '%rsi' would need to swap places. Fortunately, because of the single-inbound edge rule, such cycles are 'simple' - all outbound edges not belonging to the cycle can be resolved prior to the cycle-breaking, and all remaining edges are part of the cycle. Thus, the cycle can always be broken by freeing just a single node (i.e. by copying its value to a temporary register).
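
The scheme above can be sketched in C roughly as follows. This is not the MoarVM code: it handles register-to-register transfers only, registers are plain small integers, and the names (move, resolve_moves, TMP_REG) as well as the choice of the highest-numbered register as scratch register are assumptions made for the example.

```c
#include <assert.h>

#define NUM_REGS 8
#define TMP_REG  (NUM_REGS - 1)  /* assumed scratch register, never a move operand */
#define NO_EDGE  (-1)

typedef struct { int src, dst; } move;

/* Orders the parallel moves in[0..n) so that no move overwrites a value that
 * a later move still has to read. Each destination may appear only once.
 * Writes the ordered moves to out (room for 2 * n entries is enough) and
 * returns how many were written. */
int resolve_moves(const move *in, int n, move *out) {
    int inbound[NUM_REGS], outbound[NUM_REGS] = { 0 };
    int src[NUM_REGS], done[NUM_REGS] = { 0 };
    int queue[NUM_REGS], head = 0, tail = 0, emitted = 0;

    assert(n >= 0 && n < NUM_REGS);
    for (int r = 0; r < NUM_REGS; r++) inbound[r] = NO_EDGE;
    for (int i = 0; i < n; i++) {
        assert(in[i].src >= 0 && in[i].src < TMP_REG);
        assert(in[i].dst >= 0 && in[i].dst < TMP_REG);
        assert(inbound[in[i].dst] == NO_EDGE);   /* single inbound edge rule */
        src[i] = in[i].src;
        if (in[i].src == in[i].dst) { done[i] = 1; continue; }  /* no-op move */
        inbound[in[i].dst] = i;
        outbound[in[i].src]++;
    }
    /* A move can go first if nothing still reads from its destination. */
    for (int i = 0; i < n; i++)
        if (!done[i] && outbound[in[i].dst] == 0) queue[tail++] = i;

    for (;;) {
        while (head < tail) {
            int i = queue[head++];
            out[emitted++] = (move){ src[i], in[i].dst };
            done[i] = 1;
            /* The source is free once its last reader has been emitted. */
            if (--outbound[src[i]] == 0 && inbound[src[i]] != NO_EDGE)
                queue[tail++] = inbound[src[i]];
        }
        /* Any move still pending is part of a cycle. */
        int i = 0;
        while (i < n && done[i]) i++;
        if (i == n) break;
        /* Break the cycle: park the destination's value in the scratch
         * register and let its readers read from there instead. */
        int d = in[i].dst;
        out[emitted++] = (move){ d, TMP_REG };
        for (int j = 0; j < n; j++)
            if (!done[j] && src[j] == d) src[j] = TMP_REG;
        outbound[TMP_REG] = outbound[d];
        outbound[d] = 0;
        queue[tail++] = i;
    }
    return emitted;
}
```

For the swap example, with registers 1 and 2 standing in for '%rdx' and '%rsi', the input { 1 -> 2, 2 -> 1 } comes out as three moves: 2 -> TMP_REG, 1 -> 2, TMP_REG -> 1.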

The only thing left to consider are the values that are used after the function call returns (survive the function call) and that are stored in registers that the called function can overwrite (which is all of them, since the register allocator never selects callee-saved registers). So to make sure they are available afterwards, we must spill them. But there are a few spill strategies to choose from (terminology made up by me):

  • full spill replaces all references to the value in a register with references to the value in memory, by inserting load and store operations around every use and definition of the value. Basically, it splits the single value live range in parts.
  • split-and-spill finds the code segment where the spill is not required and splits it off from the rest and only edits the second part. This is actually rather tricky, since doing this correctly requires data-flow analysis, which isn't otherwise required by a 'naive' implementation of linear scan.
  • spill-and-restore will find the shortest piece of code where the spill is necessary, store the value to memory, and load it directly afterwards, as if the spill never happened.
The current register allocator does a full spill when it's run out of registers, and it would make some sense to apply the same logic for function-call related spills. I've decided to use spill-and-restore, however, because a full spill complicates the sorting order (a value that used to be in a register is suddenly only in memory) and it can be wasteful, especially if the call only happens in an alternative branch. This is common for instance when assigning values to object fields, as that may sometimes require a write barrier (to ensure the GC tracks all references from 'old' to 'new' objects). So I'm guessing that it's going to be better to pay the cost of spilling and restoring only in those alternative branches, and that's why I chose to use spill-and-restore.

That was it for today. Although I think being able to call functions is a major milestone, this is not the very last thing to do. We currently cannot allocate any of the registers used for floating-point calculations, which is a relatively minor limitation since those aren't used very frequently. But I also need to do some more work to actually use function return values and apply generic register requirements of tiles. But I do think the day is coming near where we can start thinking about merging the new JIT with the MoarVM master branch, making it available to everybody. Until next time!

Thursday 9 February 2017

Register Allocator Update

Hi everybody, I thought some of you might be interested in an update regarding the JIT register allocator, which is after all the last missing piece for the new 'expression' JIT backend. Well, the last complicated piece, at least. Because register allocation is such a broad topic, I don't expect to cover all topics relevant to design decisions here, and reserve a future post for that purpose.

I think I may have mentioned earlier that I've chosen to implement linear scan register allocation, an algorithm first described in 1999. Linear scan is relatively popular for JIT compilers because it achieves reasonably good allocation results while being considerably simpler and faster than the alternatives, most notably via graph coloring (unfortunately no open access link available). Because optimal register allocation is NP-complete, all realistic algorithms are heuristic, and linear scan applies a simple heuristic to good effect. I'm afraid fully explaining the nature of that heuristic and the tradeoffs involved is beyond the scope of this post, so you'll have to remind me to do it at a later point.

Commit ab077741 made the new allocator the default after I had ironed out sufficient bugs to be feature-equivalent to the old allocator (which still exists, although I plan to remove it soon).
Commit 0e66a23d introduced support for 'PHI' node merging, which is really important and exciting to me, so I'll have to explain what it means. The expression JIT represents code in a form in which all values are immutable, called static single assignment form, or SSA form for short. This helps simplify compilation because there is a clear correspondence between operations and the values they compute. In general in compilers, the easier it is to assert something about code, the more interesting things you can do with it, and the better code you can compile. However, in real code, variables are often assigned more than one value. A PHI node is basically an 'escape hatch' to let you express things like:

int x, y;
if (some_condition()) {
    x = 5;
} else {
    x = 10;
}
y = x - 3;

In this case, despite our best intentions, x can have two different values. In SSA form, this is resolved as follows:

int x1, x2, x3, y;
if (some_condition()) {
    x1 = 5;
} else {
    x2 = 10;
}
x3 = PHI(x1,x2);
y = x3 - 3;

The meaning of the PHI node is that it 'joins together' the values of x1 and x2 (somewhat like a junction in perl6), and represents the value of whichever 'version' of x was ultimately defined. Resolving PHI nodes means ensuring that, as far as the register allocator is concerned, x1, x2, and x3 should preferably be allocated to the same register (or memory location), and if that's not possible, it should copy x1 and x2 to x3 for correctness. To find the set of values that are 'connected' via PHI nodes, I apply a union-find data structure, which is a very useful data structure in general. Much to my amazement, that code worked the first time I tried it.
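
For readers who haven't met union-find before, here is a minimal array-based version in C, applied to the values from the example. It is only a sketch: the names (value_sets and friends) and the fixed capacity are made up, values are identified by small integers, and it uses path halving without union by rank to keep it short.

```c
#include <assert.h>

#define MAX_VALUES 64

/* Hypothetical structure: parent[v] == v means v represents its own set. */
typedef struct {
    int parent[MAX_VALUES];
    int count;
} value_sets;

/* Start with every value in a set of its own. */
void value_sets_init(value_sets *s, int count) {
    assert(count >= 0 && count <= MAX_VALUES);
    s->count = count;
    for (int i = 0; i < count; i++) s->parent[i] = i;
}

/* Returns the representative of the set that contains v. */
int value_sets_find(value_sets *s, int v) {
    assert(v >= 0 && v < s->count);
    while (s->parent[v] != v) {
        s->parent[v] = s->parent[s->parent[v]];  /* path halving */
        v = s->parent[v];
    }
    return v;
}

/* Merges the sets containing a and b and returns the new representative. */
int value_sets_union(value_sets *s, int a, int b) {
    int ra = value_sets_find(s, a);
    int rb = value_sets_find(s, b);
    if (ra != rb) s->parent[rb] = ra;
    return ra;
}
```

With x1, x2, x3 and y numbered 0 to 3, processing x3 = PHI(x1,x2) amounts to value_sets_union(&s, 2, 0) followed by value_sets_union(&s, 2, 1). Afterwards x1, x2 and x3 share one representative, which can carry the common register or memory location, while y stays in a set of its own.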

Then I had to fix a very interesting bug in commit 36f1fe94 which involves ordering between 'synthetic' and 'natural' tiles. (Tiles are the output of the tiling process about which I've written at some length; they represent individual instructions). Within the register allocator, I've chosen to identify tiles / instructions by their index in the code list, and to store tiles in a contiguous array. There are many advantages to this strategy but they are also beyond the scope of this post. One particular advantage though is that the indexes into this array make their relative order immediately apparent. This is relevant to linear scan because it relies on relative order to determine when to allocate a register and when a value is no longer necessary.

However, because of using this index, it's not so easy to squeeze in new tiles to that array - which is exactly what a register allocator does, when it decides to 'spill' a value to memory and load it when needed. (Because inserts are delayed and merged into the array in a single step, the cost of insertion is constant). Without proper ordering, a value loaded from memory could overwrite another value that is still in use. The fix for that is, I think, surprisingly simple and elegant. In order to 'make space' for the synthetic tiles, before comparison all indexes are multiplied by a factor of 2, and synthetic tiles are further offset by -1 or +1, depending on whether they should be handled before or after the 'natural' tile they are inserted for. E.g. synthetic tiles that load a value should be processed before the tile that uses the value they load.
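
In C, the comparison key described above could look something like this. The names are invented for the example and the real allocator may well compute this differently.

```c
#include <assert.h>

/* Hypothetical: where a tile sorts relative to the natural tile at its index. */
typedef enum {
    ORDER_BEFORE  = -1,  /* synthetic tile handled before the natural tile */
    ORDER_NATURAL =  0,  /* the natural tile itself */
    ORDER_AFTER   =  1   /* synthetic tile handled after the natural tile */
} tile_order;

/* Doubling the index leaves a free slot on either side of every natural
 * tile, which the -1 / +1 offset then fills. */
int tile_order_key(int tile_index, tile_order where) {
    assert(tile_index >= 0);
    return tile_index * 2 + (int)where;
}
```

A load inserted before natural tile 3 gets key 5, the tile itself gets 6, and a store inserted after it gets 7, so all three sort correctly between tiles 2 and 4 (keys 4 and 8). One thing this sketch does not settle is a tie: a tile inserted after tile 3 and a tile inserted before tile 4 both get key 7, so their mutual order would have to be decided some other way.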

Another issue soon appeared, this time having to do with x86 being, altogether, quaint and antiquated and annoying, and specifically with the use of one operand register as source and result value. To put it simply, where you and I and the expression JIT structure might say:

a = b + c

x86 says:

a = a + b

Resolving the difference is tricky, especially for linear scan, since linear scan processes the values in the program rather than the instructions that generate them. It is therefore not suited to deal with instruction-level constraints such as these. If a, b, and c in my example above are not the same (not aliases), then this can be achieved by a copy:

a = b
a = a + c

If a and b are aliases, the first copy isn't necessary. However, if a and c are aliases, then a copy may or may not be necessary, depending on whether the operation is commutative: for '+' the operands can simply be swapped, but for '-' they cannot. Commit 349b360 attempts to fix that for 'direct' binary operations, but a fix for indirect operations is still work in progress. Unfortunately, it meant I had to reserve a register for temporary use to resolve this, meaning there are fewer available for the register allocator to use. Fortunately, that did simplify handling of a few irregular instructions, e.g. signed cast of 32 bit integers to 64 bit integers.
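
The case analysis can be written down as a small C function. This is a sketch of one way to do it, not necessarily what commit 349b360 does: the instruction representation, the function name and the way the temporary register is passed in are all assumptions for the example.

```c
#include <assert.h>

/* Hypothetical two-operand instruction forms:
 *   INSN_COPY: dst = src
 *   INSN_OP:   dst = dst op src */
typedef enum { INSN_COPY, INSN_OP } insn_kind;
typedef struct { insn_kind kind; int dst, src; } insn;

/* Lowers the three-operand form a = b op c to two-operand instructions.
 * a, b, c and tmp are register numbers; tmp is a reserved scratch register.
 * Writes at most 3 instructions to out and returns how many. */
int lower_two_operand(int a, int b, int c, int commutative, int tmp, insn out[3]) {
    int n = 0;
    assert(tmp != a && tmp != b && tmp != c);
    if (a == b) {
        /* a = a op c is already in two-operand form */
        out[n++] = (insn){ INSN_OP, a, c };
    } else if (a == c) {
        if (commutative) {
            /* a = b op a can be computed as a = a op b */
            out[n++] = (insn){ INSN_OP, a, b };
        } else {
            /* save c (which lives in a) before a is overwritten with b */
            out[n++] = (insn){ INSN_COPY, tmp, c };
            out[n++] = (insn){ INSN_COPY, a, b };
            out[n++] = (insn){ INSN_OP, a, tmp };
        }
    } else {
        out[n++] = (insn){ INSN_COPY, a, b };
        out[n++] = (insn){ INSN_OP, a, c };
    }
    return n;
}
```

Only the non-commutative case with a and c aliased needs the temporary register, which is why reserving one is enough to cover it.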

So that brings us to today and my future plans. The next thing to implement will be support for function calls by the register allocator, which involves shuffling values to the right registers and correct positions on the stack, and also spilling all values that are still required after the function call, since the function may overwrite them. This requires a bit of refactoring of the logic that spills variables, since currently it is only used when there are not enough registers available. I also need to change the linear scan main loop, because it processes values in order of first definition, and as such, instructions that don't create any values are skipped, even if they need special handling like function calls. I'm thinking of solving that with a special 'interesting tiles' queue that is processed alongside the main values working queue.

That was it for today. I hope to write soon with more progress.

Sunday 6 November 2016

A guide through register allocation: Introduction

This is the first post in what I intend to be a series on the register allocator for the MoarVM JIT compiler. It may be a bit less polished than usual, because I also intend to write more of these posts than I have in the past few months.

The main reason to write a register allocator is that it is needed by the compiler. The original 'lego' MoarVM JIT didn't need one, because it used what is called a 'memory-to-memory' model, meaning that every operation is expected to move operands from and to memory. In this it follows closely the behavior of virtually every other interpreter existing and especially that of MoarVM. However, many of these memory operations are logically redundant (for example, when storing and immediately loading an intermediate value, or loading the same value twice). Such redundancies are inherent to a memory-to-memory code model. In theory some of that can be optimized away, but in practice that involves building an unreasonably complicated state machine.

The new 'expression' JIT compiler was designed with the explicit (well, explicit to me, at least) goals of enabling optimization and specialization of machine code. That meant that a register-to-register code model was preferable, as it makes all memory operations explicit, which in turn enables optimization to remove some of them. (Most redundant 'load' operations can already be eliminated, and I'm plotting a way to remove most redundant 'store' operations, too). However, that also means the compiler must ensure that values can fit into the limited register set of the CPU, and that they aren't accidentally overwritten (for example as a result of a subroutine call). The job of the register allocator is to translate virtual registers to physical registers in a given code segment. This may involve modifying the original code by inserting load, store and copy operations.

Register allocation is known as a hard problem in computer science, and I think there are two reasons for that. The first reason is that finding the optimal allocation for a code segment is (probably) NP-complete. (NP-complete basically means that, as far as anyone knows, you may have to consider all possible solutions in order to find the one you are after. A common feature of NP-complete problems is that the effect of a local choice on the global solution cannot be fully predicted). However, for what I think are excellent reasons, I can sidestep most of that complexity using the 'linear scan' register allocation algorithm. The details of that algorithm are the subject of a later post.

The other reason that register allocation is hard is that the output code must meet the demanding specifications of the target CPU. For instance, some instructions take input only from specific registers, and some implicitly overwrite other registers. Calling conventions can also present a significant source of complexity as values must be placed in the right registers (or on the right stack locations) where the called function may expect them. So the register allocator must somehow encode these specific demands and ensure they are not violated.

Now that I've introduced register allocation, why it is needed, and what the challenges are, the next posts can begin to describe the solutions that I'm implementing.

Saturday 4 June 2016

In praise of incremental change

Hi everybody, welcome back. I'd like to share with you the things that have been happening in the MoarVM JIT since I last posted, which was in fact March. To be brief, the JIT has been adapted to deal with moving frames, and I've started to rewrite the register allocator towards a much better design, if I do say so myself.

First, moving frames. jnthn has already written quite a bit about them. The purpose of it is to make the regular case of subroutine invocation cheaper by making special cases - like closures and continuations - a bit more expensive. I have nothing to add to that except that this was a bit of a problem for the JIT compiler. The JIT compiler liked to keep the current frame in a non-volatile register. These are CPU registers which are supposed not to be overwritten when you call a function. This is useful because it speeds up access to frequently used values. However, if the current frame object is moved, the frame pointer in this register becomes stale. Thus, it had to go, and now we load the frame pointer from the thread context object (i.e. in memory), which never moves.

Unfortunately, that was not sufficient. Because MoarVM is an interpreter, control flow (like returning from a routine) is implemented by updating the pointer to the next instruction (and the next frame). JIT compiled code never deals with this instruction pointer. Hence, whenever this instruction pointer could have been updated - we call this invokish or throwish code - the JIT may need to return control to the interpreter, which can then figure out what to do next. Originally, this had been implemented by comparing the frame pointer of the JIT frame - stored in the non-volatile register - with the frame pointer as understood by the interpreter - i.e., in the thread context object. This check no longer worked, because a): we didn't have a permanent pointer to the current frame anymore, and b): the current frame pointer might change for two different reasons, namely control flow and object movement.

I figured out a solution to this issue when I realized that what we really needed is a way to identify (cheaply) in the JIT whether or not we have changed control flow, i.e. whether we have entered another routine or returned out of the current one. This might be achieved by comparing immutable locations, but lacking those, another method is to simply assign increasing numbers to constructed frames. Such a sequence number then identifies the current position in the control flow, and whenever it is changed the JIT knows to return control to the interpreter. This caused some issues at first when I hadn't correctly updated the code in all places where the interpreter changed the current instruction, but afterwards it worked correctly. Special thanks go to lizmat who allowed me to debug this on Mac OS X, where it was broken.
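
A minimal C sketch of the sequence number idea follows. The structure and field names are simplified stand-ins chosen for this example, not the actual MoarVM definitions, and it assumes the sequence number is copied along with the rest of the frame when a frame is moved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-ins for the real structures. */
typedef struct frame {
    uint32_t sequence_nr;        /* assigned once, when the frame is constructed */
} frame;

typedef struct thread_context {
    frame   *cur_frame;          /* may point to a new address after a move */
    uint32_t next_sequence_nr;
} thread_context;

/* Every newly constructed frame gets the next number in the sequence. */
void frame_init(thread_context *tc, frame *f) {
    f->sequence_nr = tc->next_sequence_nr++;
}

/* Called after invokish or throwish code: has control flow left the frame
 * the JIT code was entered with? The check compares sequence numbers rather
 * than frame addresses, so it is not confused by a frame that was moved. */
int control_flow_changed(const thread_context *tc, uint32_t entry_sequence_nr) {
    assert(tc->cur_frame != NULL);
    return tc->cur_frame->sequence_nr != entry_sequence_nr;
}
```

When the check reports a change, the JIT code would return control to the interpreter, as described above.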

Afterwards, I've focused on improving the register allocator. The primary function of a register allocator is to ensure that the values used in a calculation are placed in (correct) registers prior to that calculation. This involves, among other things, assigning the correct registers (some operations only work on specific registers), spilling registers to memory in order to make place, loading spilled values from memory if necessary, and ensuring that values in volatile registers are spilled prior to a function call. This was rather difficult in the old design because, as it was inlined into the compilation step, it couldn't really look behind or ahead, which is a problem if you want to place something correctly for future use. Furthermore, it allowed only for a one-to-one correspondence between a value that was computed and its current location in a register. That is a problem whenever a value is copied to a different register, or stored in multiple memory locations.

So I've been, very slowly and methodically, in very small steps, moving code and structures through the JIT compiler in order to arrive at a register allocator that can handle these things. The first thing I did was remove the register allocation step out of compilation, into its own step (commit and another commit). Then I extracted the value descriptor structure - which describes in which location a value is stored - out of the expression tree (commit). I stored the tile list in a vector, in order to allow reverse and random access (commit). Because the register allocator works in a single pass and only requires temporary structures, I've 'internalized' it to its own file (commit one and commit two). Finally, I replaced the per-expression value structure with value descriptor structures (commit).

This places me in a position to replace register allocator structures (such as the active stack with an expiry heap), implement loads and stores, record register requirements per tile, implement pre-coloring, and correct allocation over basic blocks. All these things were impossible, or at least very difficult, with the old design.

What I think is interesting here is that in each of these commits, the logic of the program didn't substantially change, and the JIT continued to work just as well as it had before. Nevertheless, all of this is progress - I replaced a rather broken design assumption (online register allocation with a value state machine) with another (offline register allocation with value descriptors) - that allows me to implement the necessary mechanics in a straightforward manner. And that, I think, demonstrates the power of incremental changes.

Sunday 13 March 2016

FOSDEM and the future

Hi all, I realise I haven't written one of these posts for a long time. Since November, in fact. So you could be forgiven for believing I had stopped working on the MoarVM JIT. Fortunately, that is not entirely true. I have, in fact, been very busy with a project that has nothing to do with perl6, namely SciGRID, for which I've developed GridKit. GridKit is a toolkit for extracting a power network model from OpenStreetMap, which happens to contain a large number of individual power lines and stations. Such a network can then be used to compute the flow of electric power throughout Europe, which can then be used to optimize the integration of renewable energy sources, among other things. So that was and is an exciting project and I have a lot of things to write on that yet. It is not, however, the topic of my post today.

Let's talk about the expression JIT, where it stands, how it got there, and where it needs to go. Last time I wrote, I had just finished reducing the expression JIT surface area to the point where it could just compile correct code. And that helped in making a major change, which I called tile linearisation. Tile linearisation is just one example of a major idea that I missed last summer, so it may be worthwhile to expand a bit on it.

As I've explained at some length before, the expression JIT initially creates trees of low-level operations out of high-level operations, which are then matched (tiled) to machine-level operations. The low-level operations can each be expressed by a machine-level operation, but some machine-level instructions match multiple low-level operations. The efficient and optimal matching of low-level to machine-level operations is the tiling step of the compiler, and it is where most of my efforts have been.

Initially, I had 'tagged' these tiles to the tree that had been created, relying on tree traversal to get the tiles to emit assembly code. This turned out to be a poor idea because it introduces implicit order based on the tree traversal order. This is first of all finicky - it forces the order of numbering tiles to be the same in the register allocator and the tile selection algorithm and again for the code emitter. In practice that means that the last two of these were implemented in a single online step. But more importantly and more troubling, it makes it more complex to determine exactly the extent of live ranges and of basic blocks.

The notion of basic blocks is also one that I missed. Expression trees are typically compiled for single basic blocks at a time. The definition of a basic block is a sequence of instructions that is executed without interruption. This allows for some nice simplifications, because it means that a value placed in a register at one instruction will still be there in the next. (In contrast, if it were possible to 'jump in between' the instructions, this is not so easy to ensure). However, these basic blocks are defined at the level of MoarVM instructions. Like most high-level language interpreters, MoarVM instructions are polymorphic and can check and dispatch based on operands. In other words, a single MoarVM instruction can form multiple basic blocks. For correct register allocation, it is vital that the register allocator knows about these basic blocks. But this is obscured, to say the least, by the expression 'tree' structure, which really forms a Directed Acyclic Graph, owing to the use of values by multiple consumers.

The point of tile linearisation is to provide an authoritative, explicit order for tiles - and the code sequences that they represent - so that they can be clearly and obviously placed in basic blocks. This then allows the register allocator to be extended to deal with cross-basic block compilation. (In the distant future, we might even implement some form of instruction scheduling). As a side effect, it also means that the register allocation step should be moved out of the code emitter. I've asked around and got some nice papers about that, and it seems like the implementation of one of these algorithms - I'm still biased towards linear scan - is within the range of reasonable, as soon as I have the details figured out. Part of the plan is to extract value descriptors from the tree (much like the tile state) and treat them as immutable, introducing copies as necessary (for instance for live range splitting). The current register allocator can survive as the register selector, because it has some interesting properties in that aspect.

Aside from that I've implemented a few other improvements, like:
  • Refactored the tiler table generator, so that it could be extended to include tile arguments. This considerably simplifies the implementation of tiles. An interesting possibility, I think, is to make the tiler select tile candidates rather than a specific tile, which might allow choosing an optimal tile based on operation arguments rather than only on tree structure. Furthermore, the tiler table generator is now cleanish perl, which should probably help with maintenance.
  • Factored tiler state out of the tree. I had initially implemented nearly all tree operations by means of 'tagging' the tree in an 'Info' structure. (Structures named 'Info' are like classes named Manager, a sign of a code problem). However, this means that the tile information is 'dragged along' with the tree during its lifetime, which is not really necessary, because the tiler state is temporary.
  • Fixed a number of small issues, some of them centered around operand sizes, libuv versions, and build systems.
  • Presented at FOSDEM about how amd64 assembly language works and can be used in perl 6.
  • Implemented a JIT expression frame bisect tool which allows us to pinpoint precisely where the compilation of a perl6 (or nqp) frame breaks.
From that last bit, I've learned that the way the JIT is currently dealing with annotations is subtly broken, because the following thing can and does happen:
  • We start a basic block with a label
  • We append to that a 'dynamic control label' sequence, which updates the JIT 'reentry' label to point to the start of the basic block. This allows various operations in MoarVM which need to inspect the current program position - lexical variables in an inlined frame, exception handlers - to know where we are in the program.
  • An instruction is annotated with tags that indicate that it is, for example, the first instruction of an exception handler, or a deoptimisation point.
  • Because it is the first instruction of an exception handler, it must be labeled, so a label is inserted prior to the instruction. And, a dynamic control label sequence is also inserted prior to the instruction.
  • Because the instruction was the first one of its basic block, it acquires the same label as the basic block. However, between the two same labels, a dynamic control sequence is inserted, which means that the same labels are inserted twice, not meaning the same thing.
  • Presumably, although I haven't checked it, the last or the first label wins. But all repeated dynamic control labels are redundant. In this case, up to 4 different control labels are stacked on top of each other.
I'm not yet sure how to deal with this. Jonathan implemented a fix last year that introduced a dynamic control label at the start of each basic block. Ultimately, that reinforces this 'stacking' behavior, although it already happened. Ideally, we would not need to store the current location for each basic block just for the few operations that need it. It might instead be possible to refer to the current region in some other way, which is what happens to some extent in exception handling already.

Anyway, that's all for today, and I hope next time I will be able to bring you some good news. See you!

Sunday 29 November 2015

Moar JIT news

Hello there, I thought it high time to write to you again and update you on the world of JITs. Since last I wrote, PyPy 4.0 was released. Also in python-land, Pyston 0.4 was released, and finally Guile 2.1.1 was released and Andy Wingo wrote a nice piece about that, as is his custom. I present these links not only to give these projects the attention they deserve, but also because I think they are relevant to our own project.

In chronological order, the release of PyPy 4.0 marked the first 'production' release of that project's autovectorizer, which was developed over the course of this year's Google Summer of Code. I'd like to take this opportunity to publicly congratulate the PyPy team on this achievement. So-called 'vector' or SIMD operations perform a computation on multiple values in a single step and are an essential component of high-performance numerical computations. Autovectorizing refers to the compiler's capability to automatically use such operations without explicit work by the programmer. This is not of great importance for the average web application, but it is very significant for scientific and deep learning applications.
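To make the idea of autovectorizing concrete, here is a minimal sketch in C. This is only an illustration and not PyPy's mechanism (PyPy vectorizes traces of Python programs, not C); the function name add_arrays is made up for this example. The loop is written one element at a time, and an optimizing C compiler with vectorization enabled may translate it into SIMD instructions that add several floats per step.

```c
#include <assert.h>
#include <stddef.h>

/* Element-wise addition written as a plain scalar loop.
 * 'restrict' promises the compiler that the arrays do not overlap,
 * which is one of the facts it needs before it can safely process
 * several elements per step. Whether it actually does so depends on
 * the compiler, its flags and the target CPU. */
void add_arrays(float *restrict dst, const float *restrict a,
                const float *restrict b, size_t n)
{
    assert(n == 0 || (dst != NULL && a != NULL && b != NULL));
    for (size_t i = 0; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}
```

The programmer never mentions vector registers or instruction widths; that is the 'auto' part.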

More recently, the Pyston project released version 0.4. Pyston is another attempt at an efficient implementation of Python, funded by Dropbox. Pyston is, or I should rather say started out, based on LLVM. Most of my readers know of LLVM; for those who don't, it is a project which has somewhat revolutionised compiler development in the last few years. Its strengths are its high-quality cross-platform code generation and its permissive license. LLVM is also the basis for such languages as Rust and Julia. Notable weaknesses are size, speed, and complexity. To make a long story short, many people have high expectations of LLVM for code generation, and not without reason.

There are a few things that caught my attention in the release post linked above. The first thing is that the Pyston project introduced a 'baseline' JIT compiler that skips the LLVM compilation step, so that JIT compiled code is available faster. They claim that this causes hardly any slowdown compared to the LLVM backend. The second thing is that they have stopped working on implementing LLVM-based optimisation. The third thing is that to support more esoteric Python features, Pyston now resorts to calling the Python C API directly, becoming sort of a hybrid interpreter. I would not be entirely surprised if the end point for Pyston would be life as a CPython extension module, although Conway's law will probably prohibit that.

Pyston is not the first, nor the only current JIT implementation based on LLVM. It might be important to say here that there are many projects which do obtain awesome results from using LLVM; Julia being a prime example. (Julia is also an excellent counterexample to the recent assertion by static typing enthusiasts that 'static types have won', being very dynamic indeed). But Julia was designed to use the LLVM JIT, which probably means that tradeoffs have been made to assure performance; it is also new, so it doesn't have to run as much weird legacy code; and the team is probably highly competent. I don't know why some mileages vary so much (JavaScriptCore also uses LLVM successfully, albeit as its fourth and last tier). But it seems clear that far from being a golden gun, using LLVM for dynamic language implementations is a subtle and complex affair.

Anybody willing to try building an LLVM-backed JIT compiler for MoarVM, NQP, or perl6 in general, will of course receive my full (moral) support, for whatever that may be worth.

The posts by Andy Wingo, about the future of the Guile Scheme interpreter, are also well worth reading. The second post is especially relevant as it discusses the future of the guile interpreter and ways to construct a high-performance implementation of a dynamic language; it generalizes well to other dynamic languages. To summarise, there are roughly two ways of implementing a high-performance high-level programming language, dynamic or not, and the approach of tracing JIT compilers is the correct one, but incredibly complex and expensive and - up until now - mostly suitable for big corporations with megabucks to spend. Of course, we aim to challenge this; but for perl6 in the immediate future correctness far outranks performance in priority (as it should).

That all said, I also have some news on the front of the MoarVM JIT. I've recently fixed a very longstanding and complex bug that presented itself during the compilation of returns with named arguments by rakudo. This manifested itself as a missing inlined frame in the JIT compiler, which was ultimately caused by MoarVM trying to look for a variable using the JIT compiler's 'current location' while the actual frame was running under the interpreter, and - this is the largest mystery - it was not deoptimized. I still do not know why that actually happened, but a very simple check fixed the bug.

I also achieved the goal of running the NQP and rakudo test suites under the new JIT compiler, albeit in a limited way; to achieve this I had to remove the templates of many complex operations, i.e. ops that call a C function or that have internal branches. The reason is that computing the flow of values beyond calls and branches is more complex, and trying to do it inline with a bunch of other things - as the new JIT has tried to do so far - is prone to bugs. This is true especially during tree traversal, since it may not be obvious that computations relying on values may live in a different context than the computations that generate these values.

In order to compile these more complex trees correctly, and understandably, I aim to disentangle the final phases of compilation, that is, the stages of instruction selection, register allocation, and bytecode generation. Next to that I want to make the tiler internals and interface much simpler and more user-friendly, and solve the 'implied costs problem'. The benefit of having the NQP test suite working is that I can demonstrate the effects of changes much more directly, and more importantly, demonstrate whether individual changes work or not. I hope to report some progress on these issues soon, preferably before Christmas.

If you want to check out the progress of this work, check out the even-moar-jit branch of MoarVM. I try, but not always successfully, to keep it up-to-date with the rapid pace of the MoarVM master branch. The new JIT only runs if you set the environment variable MVM_JIT_EXPR_ENABLE to a non-empty value. If you run into problems, please don't hesitate to report on GitHub or on the #moarvm or #perl6 channels on freenode. See you next time!

Monday 12 October 2015

Rumors of JITs' demise are greatly exaggerated.

Earlier this week my attention was brought to an article claiming that the dusk was setting for JIT compilation. Naturally, I disagree. I usually try to steer clear of internet arguments, but this time I think I may have something to contribute. Nota bene, this is not a perl- or perl6 related argument, so if that is strictly your interest this is probably not an interesting post for you.

The main premise of the argument is that people are shifting away from JIT compilation because the technique has failed to live up to its promises. Those promises include, in various forms, high level languages running 'as fast as C', or having more optimization possibilities than ahead-of-time (AOT) compilers do. Now my perspective may be a bit unusual in that I don't actually expect momentous gains from JIT compilation per se. As I've described in the talk I gave at this year's YAPC::EU, by itself JIT compilation removes only the decoding and dispatch steps of interpretation, and - depending on the VM architecture - these may be a larger or smaller proportion of your running time. However, my thesis is that interpretation is not why high-level languages are slow, or rather, that interpretation is only one of the many sources of indirection that make high-level languages slow.
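To illustrate what 'decoding and dispatch' means, here is a minimal sketch in C. The three-instruction bytecode, the register count and all names are invented for this example and have nothing to do with MoarVM's actual instruction set. The interpreter pays for an opcode fetch and a switch on every instruction; the second function shows, written as C rather than machine code, roughly what a naive JIT would produce for the same program: the same operations in the same order, minus the loop and the switch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bytecode, invented for this sketch. */
enum { OP_LOAD_CONST, OP_ADD, OP_RETURN };

/* Interpreter: each instruction is decoded (opcode and operands are read
 * from the byte stream) and dispatched (the switch) before any real work
 * happens. Assumes a well-formed program; register operands are masked
 * so they always index one of the four registers. */
int64_t interpret(const uint8_t *code, size_t len)
{
    int64_t reg[4] = {0};
    size_t pc = 0;
    while (pc < len) {
        switch (code[pc]) {
        case OP_LOAD_CONST:   /* reg[a] = immediate */
            reg[code[pc + 1] & 3] = code[pc + 2];
            pc += 3;
            break;
        case OP_ADD:          /* reg[a] = reg[b] + reg[c] */
            reg[code[pc + 1] & 3] =
                reg[code[pc + 2] & 3] + reg[code[pc + 3] & 3];
            pc += 4;
            break;
        case OP_RETURN:       /* return reg[a] */
            return reg[code[pc + 1] & 3];
        default:
            assert(!"unknown opcode");
            return -1;
        }
    }
    return -1;
}

/* The example program: r0 = 2; r1 = 40; r0 = r0 + r1; return r0. */
static const uint8_t program[] = {
    OP_LOAD_CONST, 0, 2,
    OP_LOAD_CONST, 1, 40,
    OP_ADD, 0, 0, 1,
    OP_RETURN, 0,
};

/* The 'JIT compiled' equivalent of the program above: no opcode fetch,
 * no switch, no program counter. Nothing else has been optimized. */
int64_t compiled_program(void)
{
    int64_t r0 = 2;
    int64_t r1 = 40;
    r0 = r0 + r1;
    return r0;
}
```

Calling interpret(program, sizeof program) and compiled_program() yields the same value. Note that everything else that makes a high-level language slow (boxing, type checks, indirect calls) would still be present in the compiled version, which is the point of the paragraph above.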

First of all, what of the evidence that JITs are actually in demise? The author provides three recent trends as evidence, none of which I hold to be decisive. First, both Windows 10 and the newest versions of Android translate .NET and Dalvik applications respectively to native code at installation time, which is properly considered ahead of time compilation. Second, high-performance JavaScript applications are currently often created using tools like emscripten, which compiles to asm.js, and this is in many ways more similar to object code than it is to a high-level language, implying that the difficult bit of compilation is already behind us. (I agree mostly with that assessment, but not with its conclusion). Finally, on iOS devices JIT compilation is unsupported (except for the JIT compiler in the WebKit browser engine), allegedly because it is insecure.

As to the first piece, the author suggests that the main reason is that JIT compilers are unpredictable in their output, at least relative to optimizing ahead-of-time compilers. I think that is nonsense; JIT compilation patterns tend to be quite reliably the same on different runs of the same program, a property I rely on heavily during e.g. debugging. The output code is also pretty much invariant, with an exception being the actual values of embedded pointers. So in my experience, what you see (as a developer) is also what you get (as a user), provided you're using the same VM. I humbly suggest that the author believes JITs to be unreliable because his work is being compiled by many different VMs using many different strategies. But I see that no differently than any other form of platform diversity. Maybe the author also refers to the fact that often optimization effectiveness and the resultant performance of JIT compiled applications is sensitive to minor and innocuous changes in the application source code. But this is true of any high-level language that relies primarily on optimizing compilers, for C as much as for Python or JavaScript. The main difference between C and Python is that any line of C implies far fewer levels of indirection and abstraction than a similar line of Python.

I think I have a much simpler explanation as to why both Google and Microsoft decided to implement ahead-of-time compilation for their client platforms. The word 'client' is key here; because I think we're mostly talking about laptops, smartphones and tablets. As it turns out, hardware designers and consumers alike have decided to spend the last few years' worth of chip manufacturing improvements on smaller, prettier form factors (and hopefully longer battery life) rather than computing power. Furthermore, what Qualcomm, Samsung etc. have given us, Photoshop has taken away. The result is that current generation portable devices are more portable and more powerful (and cheaper) than ever but are still memory-constrained.

JIT compilation inevitably comes with a significant memory cost from the compiled code itself (which is generally considerably larger than the interpreted code was), even when neglecting the memory usage of the compiler. Using various clever strategies one can improve on that a bit, and well-considered VM design is very important as always. But altogether it probably doesn't make a lot of sense to spend precious memory for JIT-compiled routines in a mobile setting. This is even more true when the JIT compiler in question, like Dalvik's, isn't really very good and the AOT compiler has a good chance of matching its output.

Now to the case of asm.js. As I said, I agree mostly that a significant amount of work has already been done by an ahead-of-time compiler before the browser ever sees the code. It would be a mistake, however, to think that therefore the role of the JIT (or rather the whole system) can be neglected. First of all, JIT-compiled code, even asm.js code, is greatly constrained in comparison to native code, which brings some obvious security benefits. Second of all, it is ultimately the JIT compiler that allows this code to run cross-platform at high performance. I think it is mistaken to suggest that this role is trivial, and so I see asm.js as a success of, rather than evidence against, JIT compilation as a technique.

Next, the iOS restriction on JIT compilation. I think the idea that this would be for security reasons is only plausible if you accept the idea that application security is significantly threatened by dynamic generation of machine code. While I'm sure that the presence of a JIT compiler makes static analysis very difficult - not to say impossible - I don't believe that this is the primary attack vector of our times. The assertion that memory must be both writable and executable for a JIT compiler to work is only superficially true, since there is no requirement that the memory must be both at the same time, and so this doesn't imply much of a threat (so-called W^X memory is becoming a standard feature of operating systems). Vtable pointers stored in the heap, and return addresses on a downward-growing stack, now those are attack vectors of note.
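The point that memory need not be writable and executable at the same time can be sketched in C using the POSIX mmap and mprotect calls. This is a minimal illustration and not how MoarVM or any particular JIT manages its code memory; the function name install_code is made up, and some hardened platforms require extra flags or entitlements before they allow the second step at all.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS with strict glibc settings */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Copy 'len' bytes of already generated machine code into fresh memory.
 * The pages are writable while they are being filled and executable
 * afterwards, but never both at once. Returns NULL on failure. */
void *install_code(const void *code, size_t len)
{
    assert(code != NULL && len > 0);

    /* Step 1: anonymous mapping that is readable and writable,
     * but not executable. */
    void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return NULL;

    memcpy(mem, code, len);

    /* Step 2: drop write permission and add execute permission. */
    if (mprotect(mem, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, len);
        return NULL;
    }
    return mem;
}
```

A caller would cast the returned pointer to a function pointer of the appropriate type, and release the memory with munmap when the code is no longer needed. A JIT that needs to patch code later has to flip the permissions back and forth, which costs a system call but keeps the W^X property.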

But more importantly, that is not how mobile users are being attacked. It is much more interesting, not to mention significantly easier, for attackers to acquire whole contact books, private location information, credentials and private conversations via phishing and other techniques than it is to corrupt a JIT compiler and possibly, hopefully, and generally unreliably gain remote execution. Most of these attack vectors are wide-open indeed and should be prevented by actual security techniques like access control rather than by outlawing entire branches of computing technology. Indeed, an observer not sympathetic to Apple could probably relate this no-JIT compilation rule with the Californian company's general attitude to competing platforms, but I will not go further down that path here.

Finally, I think the claim that JIT compilation can't live up to its promise can readily be disproven by a simple Google search. The reason is simple; the JIT compiler, which runs at runtime, has much more information at its disposal than even the best of ahead-of-time compilers. So-called profile-guided optimization helps to offset the difference, but it is not a common technique, and moreover it still provides only a small subset of the information available to a JIT compiler. The fact that many systems do not match this level of performance (and MoarVM's JIT compiler certainly doesn't) is of course relevant but not, in my opinion, decisive.
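As a rough illustration of what 'more information' buys, consider a call through a function pointer. The sketch below is in C and all names in it are hypothetical; a real JIT would emit machine code rather than C, and I am not claiming any particular VM does exactly this. The generic version is what a compiler must produce when it cannot know the call target. The speculated version is what a JIT could produce after observing at runtime that the target has always been square: a cheap guard, followed by the inlined body, with a fallback to the generic code if the guess turns out to be wrong.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t (*unary_fn)(int64_t);

int64_t square(int64_t x) { return x * x; }
int64_t negate(int64_t x) { return -x; }

/* Generic code: the call target is unknown, so every iteration makes
 * an indirect call that cannot be inlined. */
int64_t sum_generic(unary_fn f, const int64_t *xs, size_t n)
{
    assert(f != NULL && (n == 0 || xs != NULL));
    int64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += f(xs[i]);
    return total;
}

/* Speculated code: assumes f == square, as observed at runtime.
 * The guard checks the assumption once; if it fails we fall back to
 * the generic version (a stand-in for deoptimization). */
int64_t sum_speculated(unary_fn f, const int64_t *xs, size_t n)
{
    if (f != square)
        return sum_generic(f, xs, n);

    assert(n == 0 || xs != NULL);
    int64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += xs[i] * xs[i];   /* body of square, inlined */
    return total;
}
```

Both functions return the same result for any target; the speculated one is merely faster when the guess holds. Profile-guided optimization can make a similar guess ahead of time, but only from the runs that were profiled, not from the run that is actually happening.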

In conclusion, I would agree with the author that there are many cases in which JIT compilation is not suitable and in which AOT compilation is. However, I think the much stronger claim that the dusk is setting on JIT compilation is unwarranted, and that JIT compilers will remain a very important component of computing systems.