zondag 10 juni 2018

Controlled Stack Hacking for the MoarVM JIT Compiler

Hi readers! Today I have a story about a recently-merged set of patches that allows MoarVM to use the on-stack return pointer to reduce the overhead of exception handling and other VM features for JIT compiled code. Maybe you'll find it interesting.

As you might know, MoarVM is a bytecode interpreter. In some situations, MoarVM internals need to know the current position in the execution of the program. For instance, in exception handling, all exceptions thrown within a block are caught by the associated CATCH block, or propagated if no such block exists. Such blocks are indicated as a range within the bytecode, and we find the associated CATCH block by comparing the current position with the known ranges.
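Schematically - with invented names, not MoarVM's actual structures - such a range lookup might look like this:

typedef struct {
    unsigned int start_offset;  /* bytecode range covered by the handler */
    unsigned int end_offset;
    unsigned int category_mask; /* kind(s) of exceptions it handles */
} Handler;

/* assuming handlers are ordered innermost-first; NULL means: propagate
   to the caller frame */
Handler * find_handler(Handler *handlers, int num_handlers,
                       unsigned int pos, unsigned int category) {
    int i;
    for (i = 0; i < num_handlers; i++) {
        if (pos >= handlers[i].start_offset &&
            pos <  handlers[i].end_offset &&
            (handlers[i].category_mask & category))
            return handlers + i;
    }
    return NULL;
}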

This is relatively straightforward to implement for the interpreter, because the interpreter must maintain a 'current position' pointer simply to function. (MoarVM stores a pointer to this pointer in a thread context object so that it is available throughout the VM). For the JIT that is another matter, because the control flow is handled implicitly by the CPU. The instruction pointer register (called %rip on amd64) cannot be read directly. Moreover, as soon as you enter a function that might want to use the current address (like the functions responsible for exception handling), you've left the 'program' code and entered VM code.

So what we used to do instead is take the address of a position within the bytecode (as indicated by a label in the bytecode, a somewhat involved process) and store that in a per-frame field called the jit_entry_label. This field is necessary to support another MoarVM feature - we use the interpreter as a trampoline (in the first or second sense of that definition). Because the interpreter is not recursive, JIT compiled code needs to return to the interpreter to execute a subroutine that was invoked (as opposed to calling an interpreter function, as perl5 does for exception handling). The primary purpose of this label is to continue where we left off after returning from another invoked program. But it can be used just as well for finding where we are in the execution of the program.

The only problem is that we need to keep it up to date, and so we did: on entry to every basic block (an uninterrupted sequence of code), we stored the current position in this field. Such stores are quite common - every conditional statement, loop or other control flow change starts a new basic block - and every exception-handler scope change needed a little snippet storing the current position as well. This was annoying.

Furthermore, there are numerous MoarVM instructions that might change the control flow (or might not). For instance, the instruction responsible for converting an object to a boolean value might need to invoke the Bool method specific to that object's class - or, if no such method exists, fall back to a default implementation. We call such instructions invokish. When compiling code containing such invokish instructions, we installed 'control guards' to check if the VM had in fact invoked another routine, and if so, to return to the interpreter to execute that routine. This too added quite a bit of overhead.

I keep writing in the past tense because all of that is now gone, and that happened due to a simple realization. When we call a function (in C or assembly), we place the return address - the address of the machine instruction just after the call instruction - on the stack. We can read this value from the stack and use it wherever we want to know the current position.

I initially had implemented that using a stack walker function similar to the one in the link, except that I implemented it in assembly instead. (When writing this post I learned of the GCC __builtin_return_address and MSVC _ReturnAddress intrinsic functions, which presumably do the same thing). Unfortunately, that strategy didn't really work - it relies on the frame base pointer (%rbp) being placed right 'on top' of the return address pointer on the stack. Even with special compiler flags intended to preserve that behaviour, this assumption turned out to be unreliable.
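For illustration, this is how those intrinsics could be used (a sketch; log_position is a made-up function):

#include <stdio.h>
#if defined(_MSC_VER)
#include <intrin.h>
#pragma intrinsic(_ReturnAddress)
#define RETURN_ADDRESS() _ReturnAddress()
#else
#define RETURN_ADDRESS() __builtin_return_address(0)
#endif

/* when called from JIT compiled code, the return address identifies
   the position in the compiled code that we were called from */
void log_position(void) {
    void *pos = RETURN_ADDRESS();
    fprintf(stderr, "called from %p\n", pos);
}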

Fortunately I realized later that it was also unnecessary. Because the JIT compiler controls the layout of the compiled code frame, it also controls exactly where the return address will be stored when we compile a (C) function call. That means that we can simply take a pointer to this address and store that in the thread context structure. From that address, we can read exactly the current position in the compiled code, without having to explicitly store it so often. Furthermore, we can also write to this location, changing the address the function will return to. Effectively, this is a controlled 'on-stack goto', an idiom more often used for exploits than for good purposes - clearly this is an exception! We use this to force a return to the interpreter (with proper stack frame cleanup) for 'invokish' instructions that end up invoking. We can change control to go directly to an exception handler if it is in the same frame. This makes all the earlier control 'guard' fragments redundant, allowing us to remove them entirely. Thus, an invokish instruction that doesn't actually invoke now carries no extra cost.
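In pseudo-C, the idea is roughly as follows (the names are invented for this example; in the real JIT, the code that fills the slot pointer is emitted as part of the compiled frame's prologue):

typedef struct ThreadContext {
    void **jit_return_slot; /* points at the return address slot on the C stack */
} ThreadContext;

/* read the current position in JIT compiled code */
void * current_position(ThreadContext *tc) {
    return *tc->jit_return_slot;
}

/* the controlled 'on-stack goto': make the pending C-level return land
   somewhere else, e.g. at the code that exits to the interpreter, or at
   an exception handler in the same frame */
void force_return_to(ThreadContext *tc, void *label) {
    *tc->jit_return_slot = label;
}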

How much does this save? It depends a lot on the exact program, but I estimate about 5% of compiled code size, and from a hopelessly optimal (and fairly unrealistic) benchmark which I lifted from this blog post, approximately 10% of runtime. In real code, the effect is definitely nowhere near what jnthn++ or samcv++ achieved lately, but it's still nice. Also nice is that the code is quite a bit simpler than it was before.

Anyway, that's all I have to tell today. Have fun hacking, and until next time!

dinsdag 27 februari 2018

Some things I want

Lately I've been fixing a few bugs here and there in the JIT compiler, as well as trying to advance the JIT expression optimizer. The story of those bugs is interesting but in this post I want to focus on something else, namely some support infrastructure that I think we should have that would make working on MoarVM and spesh/jit in particular much nicer.

There are a bunch of things related to runtime control of spesh and the JIT:
  • jit-bisect.pl is an awesome tool (if I say so myself) that deserves its own blog post. It can figure out which frame is the first to be miscompiled and cause an execution failure (or, to be precise, the smallest miscompilation that causes a failure). When a JIT issue is brought to me, I reach for it directly. It should be more powerful:
    • Because it does not know in advance how many frames are compiled in a program, it uses an 'exponential probing' strategy to find the first breaking block, and then a binary search between the last two probes. This means that we need log(n) tries to find the first failure during probing, and then another log(n) tries to find the exact frame. Usually, failures are much faster than successful runs, so the first phase takes a disproportionate amount of time. It would probably speed up bisects if we could start probing from a number higher than 1.
    • It would be nice if as part of (optional) output, jit-bisect.pl could create a 'replicator' script that would replicate the failure exactly as the bisect script found it. E.g. if it finds that frame 237 block 6 fails, then it should output a script that executes the program with the expression JIT active to exactly that point.
    • We assign sequence number 0 to the first frame to be JIT-compiled, but can't bisect that because we start with sequence number 1.
  • Both spesh and JIT logging should be much more specific. Currently, when MVM_SPESH_LOG is defined, we log all information on all frames we encounter to the log file, resulting in enormous (easily >20MB) text files. Of those text files, only tiny bits tend to be affected in a way that interests you, and especially as the result of a bisect process, only the last frame will be of any interest. So I propose a MVM_SPESH_LOG_FOI (frame of interest) flag that toggles when a frame should be logged or not. It could be an array as well (comma- or semicolon-separated values). The same goes for the JIT.
  • We can limit the number of frames that pass through spesh using the flag MVM_SPESH_LIMIT, and the number of frames (and basic blocks) that pass through the expression JIT with MVM_JIT_EXPR_LAST_FRAME and MVM_JIT_EXPR_LAST_BB. We cannot do the same for the 'lego' JIT, and we probably should.
  • When MVM_SPESH_BLOCKING is not defined, spesh runs concurrently with the main program, the result of MVM_SPESH_LOG is not repeatable, and MVM_SPESH_LIMIT consequently has no well-defined meaning. Although the jit-bisect.pl program will set MVM_SPESH_LIMIT, it is easy enough to forget. Maybe we should either disable MVM_SPESH_LIMIT without MVM_SPESH_BLOCKING, or MVM_SPESH_LIMIT should imply MVM_SPESH_BLOCKING.
  • Generally we have too much ad-hoc environment setup code that - I feel - does not fit well into the idea of MoarVM being embedded within another program. Not that embedding interpreters in other programs is a common thing to do anymore, but still. We should probably take that out of general instance setup and install it somewhere else. (MVM_spesh_setup_control_flags?)
Then there's more ambitious stuff, that still falls under 'housekeeping':
  • Currently the expression tree is backed by an array of integers, some of which identify the nodes and some of which are internal pointers (relative to the base of the array). There is an additional info array of structures that has to be explicitly maintained and that holds some information (result type and size, mostly). This is rather annoying when operating on the tree structure (as the optimizer does), because both arrays have to be kept synchronized. Also, these info values are only defined for nodes of the tree. We can reasonably fit all information stored in an info node in 32 bits (if we're packing) or 64 bits (if we're sloppy), even together with the node value; see the sketch after this list. Doing this should save memory, be better for caches, and make tree manipulation easier.
  • I want to have explicit cleanup of the spesh worker thread (if requested by program arguments). Not having this makes ASAN and valgrind somewhat less useful, because they will falsely report in-progress memory as being leaked. To do this we'd need to wait, during shutdown, for the worker thread to be stopped, which means that we also need to have a signal to indicate that it has been stopped.
  • Sometimes, the JIT can get into a confused state and panic, bringing down the process. This is great for debugging and awful for users. I feel like we might be able to improve this - we can, during JIT graph construction, bail out without crashing (if we find constructs we can't JIT yet). We may be able to extend and generalize this. (If we find that we're in an illegal state entirely, that should probably still crash).
  • The legacy ('lego') JIT currently works with a two-phase process, first building a JIT graph, then compiling it. This is redundant - without loss of functionality, the lego JIT could be reduced to a single phase (reducing the number of internal structures quite dramatically). I'm not totally sure we should do this, though, because we may want an explicit preparation phase eventually (e.g. when we start doing multiple basic blocks per expression tree).
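As for the packed expression tree nodes mentioned above, a possible layout - field names made up, purely a sketch of the idea - would be:

#include <stdint.h>

typedef struct {
    uint32_t op   : 16; /* node operator */
    uint32_t type : 8;  /* result type */
    uint32_t size : 8;  /* result size */
    int32_t  arg;       /* constant value or internal reference */
} ExprNode;             /* the 'sloppy' 64-bit variant; with tighter
                           packing, 32 bits is feasible */

With something like this, the separate info array - and the bookkeeping to keep it synchronized - disappears entirely.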
And there's more ambitious stuff that would fall under optimizations and general functionality improvements:
  • MoarVM in some places assumes that we have 'easy' access to the current instruction pointer, e.g. to lookup dynamic variables or the scope of the current CATCH block. This is logical for an interpreter but problematic for the JIT. We currently need to maintain a pointer to the last entered basic block to approximate it, i.e. a memory store per block. This is expensive, and it's not often needed:
    • We already have to maintain a 'return pointer' for JIT frames that are high up the call stack, so in those cases the additional stores are redundant.
    • In case we are interested in the current position of the current frame, the current position is already on the stack (as the return address of the function call), and we can read it from stack-walking.
    • Alternatively, we can replace calls to functions that are interested in the current position with calls that receive that position explicitly. However, I kind of prefer using stack walking, since we'll need the return pointer anyway. Also, it doesn't work for anything that might implicitly throw (MVM_exception_throw_adhoc).
  • The register allocator can be much better in a bunch of (simple) ways:
    • We never need to spill a constant value to memory, since we can always just insert a copy instead.
    • It is probably not necessary to insert individual restores before usages in the same basic block, especially for usages prior to the spilling point.
    • We sometimes spill values that we didn't really need to spill, since we're already storing them to memory as part of the basic block.
  • We have a kind-of hacky way to ensure that our three-operand code (c = op(a,b)) is converted to x86's two-operand form (a = op(a,b)). This should be something that the register allocator can solve, or a phase after the register allocator, but it is currently being handled during code generation.
There is definitely more out there I want to do, but this should be enough to keep me busy for a while. And if anyone else wants to take a try at any of these, they'd be very welcome to :-).

donderdag 25 mei 2017

Function call return values

Hi there, it's been about a month since last I wrote on the progress of the even-moar-jit branch, so it is probably time for another update.

Already two months ago I wrote about adding support for function calls in the expression JIT compiler. This was a major milestone, as calling C functions is essential for almost everything that is not pure numerical computing. Now we can also use the return values of function calls (picture that!). The main issue with this was something I've come to call the 'garbage restore' problem, by which I mean that the register allocator would attempt to 'restore' an earlier, possibly undefined, version of a value over a value that would result from a function call.

This has everything to do with the spill strategy used by the compiler. When a value has to be stored to memory (spilled) in order to avoid being overwritten and lost, there are a number of things that can be done. The default, safest strategy is to store a value to memory after every instruction that computes it and to load it from memory before every instruction that uses it. I'll call this a full spill. It is safe because it effectively makes the memory location the only 'true' storage location, with the registers being merely temporary caches. It can also be somewhat inefficient, especially if the code path that forces the spill is conditional and rarely taken. In MoarVM, this happens (for instance) around memory barriers, which are only necessary when creating cross-generation object references.

That's why around function calls the JIT uses another strategy, which I will call a point spill. What I mean by that is that the (live) values which could be overwritten by the function call are spilled to memory just before the function call, and loaded back into their original registers directly after. This is mostly safe, since under normal control flow, the code beyond the function call point will be able to continue as if nothing had changed. (A variant which is not at all safe is to store the values to memory at the point, and load them from memory in all subsequent code, because it isn't guaranteed that the original spill-point-code is reached in practice, meaning that you overwrite a good value with garbage. The original register allocator for the new JIT suffered from this problem).
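In sketch form - with invented helper functions, not MoarVM's actual code - a point spill around a call site amounts to:

typedef struct Compiler Compiler;
typedef struct Value { int reg; } Value;

/* assumed code generation helpers */
extern int  spill_slot(Compiler *cl, Value *v);
extern void emit_store(Compiler *cl, int slot, int reg);
extern void emit_load(Compiler *cl, int reg, int slot);
extern void emit_call(Compiler *cl, void *func);

/* save every live register just before the call, and restore it to
   the same register directly after */
void compile_call_with_point_spill(Compiler *cl, Value **live,
                                   int num_live, void *func) {
    int i;
    for (i = 0; i < num_live; i++)
        emit_store(cl, spill_slot(cl, live[i]), live[i]->reg);
    emit_call(cl, func);
    for (i = 0; i < num_live; i++)
        emit_load(cl, live[i]->reg, spill_slot(cl, live[i]));
}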

It is only safe, though, if the value that is to be spilled-and-restored is both valid (defined in a code path that always precedes the spill) and required (the value is actually used in code paths that follow the restore). This is not the case, for instance, when a value is the result of a conditional function call, as in the following piece of code:

1:  my $x = $y + $z;
2:  if ($y < 0) {
3:      $x = compute-it($x, $y, $z);
4:  }
5:  say "\$x = $x";

In this code, the value in $x is defined first by the addition operation and then, optionally, by the function call to compute-it. The last use of $x is in the string interpolation on line 5. Thus, according to the compiler, $x holds a 'live' value at the site of the function call on line 3, and so to avoid it from being overwritten, it must be spilled to memory and restored. But in fact, loading $x from memory after compute-it would directly overwrite the new value with the old one.

The problem here appears to be that when the JIT decides to 'save' the value of $x around the function call, it does not take into account that - in this code path - the last use of the old value of $x is in fact when it is placed on the parameter list of the compute-it call. From the perspective of the conditional branch, it is only the new value of $x which is used on line 5. Between the use on the parameter list and the assignment from the return value, the value of $x is not 'live' at all. This is called a 'live range hole'. The goal, then, is to find these holes and make sure a value is not treated as live when it is in fact not.

I used an algorithm from a paper by Wimmer and Franz (2010) to find the holes. However, this algorithm relies on having the control flow structure of the program available, which usually requires a separate analysis step. In my case that was fortunately not necessary, since this control flow structure is in fact generated by an earlier step in the JIT compilation process, and all that was necessary was to record it. The algorithm itself is really simple and relies on the following ideas:

  • If a value is defined as the result of a computation, that same version of the value cannot be used in code that precedes it.
  • If a value is used in a computation, it must have been defined in some code that precedes it (otherwise the program is obviously incorrect).
  • If, at a branch in the code path, a value is used in at least one of the branches (and not defined in it), it must have been defined prior to the branch instruction.
I think it goes beyond the scope of this blog post to explain how it works in full, but it is really not very complicated and works very well. At any rate, it was sufficient to prevent the JIT from overwriting good values with bad ones, and allowed me to finally enable functions that return values, which is otherwise really simple.
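That said, the three rules above translate almost directly into code. Here is a much-simplified sketch of such a backwards liveness pass (the structures are invented for the example, it supports at most 64 values, and it ignores loops, which need extra care):

#include <stdint.h>

typedef struct {
    int  num_succ;
    int *succ;                /* indexes of successor blocks */
    int  first_ins, last_ins; /* inclusive instruction range */
} Block;

typedef struct {
    int out;   /* value defined by this instruction, or -1 */
    int in[2]; /* values used by this instruction, or -1 */
} Ins;

/* blocks are assumed ordered so that every block precedes its successors */
void compute_liveness(Block *blocks, int num_blocks, Ins *code,
                      uint64_t *live_in /* one bitmap per block */) {
    int b, i, j, s;
    for (b = num_blocks - 1; b >= 0; b--) {
        uint64_t live = 0;
        /* rule 3: live in any successor implies live at the branch */
        for (s = 0; s < blocks[b].num_succ; s++)
            live |= live_in[blocks[b].succ[s]];
        for (i = blocks[b].last_ins; i >= blocks[b].first_ins; i--) {
            /* rule 1: a value is not live above its definition */
            if (code[i].out >= 0)
                live &= ~(UINT64_C(1) << code[i].out);
            /* rule 2: a use makes the value live in preceding code */
            for (j = 0; j < 2; j++)
                if (code[i].in[j] >= 0)
                    live |= UINT64_C(1) << code[i].in[j];
        }
        live_in[b] = live;
    }
}

A value's live range holes are then exactly the spans where its bit is clear: walking backwards through the example above, the bit for $x is cleared at the assignment of the return value and only set again at the parameter list, so around the call itself the old value is not considered live.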

When that was done, I obviously tried to use it and immediately ran into some bugs. To fix that, I've improved the jit-bisect.pl script, which wasn't very robust before. The jit-bisect.pl script uses two environment variables, MVM_JIT_EXPR_LAST_FRAME and MVM_JIT_EXPR_LAST_BB, to automatically find the code sequence where the expression compiler fails and compiles wrong code. (These variables tell the JIT compiler to stop running the expression compiler after a certain number of frames and basic blocks. If we know that the program fails with N blocks compiled, we can use binary search between 0 and N to find out which frame is broken). The jit-dump.pl script then provides disassembled bytecode dumps that can be compared and with that, it is usually relatively easy to find out where the JIT compiler bug is.

With that in hand I've spent my time mostly fixing existing bugs in the JIT compiler. I am now at a stage in which I feel like most of the core functionality is in place, and what is left is about creating extension points and fixing bugs. More on that, however, in my next post. See you then!

zondag 30 april 2017

Letting templates do what you mean

Hi everybody, today I'd like to promote a minor, but important improvement in the 'expression template compiler' for the new JIT backend. This is a tool designed to make it easy to develop expression templates, which are themselves a way to make it easy to generate the 'expression tree' intermediate representation used by the new JIT backend. This is important because MoarVM instructions operate on a perl-like level of abstraction - single instructions can perform operations such as 'convert object to string', 'find first matching character in string' or 'access the last element of an array'. Such operations require rather more instructions to represent as machine code.

This level of abstraction is rather convenient for the rakudo compiler, which doesn't have to consider low-level details when it processes your perl6 code. But it is not very convenient for the JIT compiler which does. The 'expression' intermediate representation is designed to be much closer to what hardware can support directly. Basic operations include loading from and storing to memory, memory address computation, integer arithmetic, (conditional) branching, and function calls. At some point in the future, floating point operations will also be added. But because of this difference in abstraction level, a single MoarVM instruction will often map to many expression tree nodes. So what is needed is an efficient way to convert between the two representations, and that is what expression templates are supposed to do.

Expression templates are very much like the expression tree structure itself, in that both are represented as arrays of integers. Some of the elements represent instructions, some are constants, and some are references (indexes into the same array), forming a directed acyclic graph (not a tree). The only difference is that the template is associated with a set of instructions that indicate how it should be linked into the tree. (Instruction operands, i.e. the data that each instruction operates on, are prepared and linked by the template application process as well).

Surprisingly, arrays of integers aren't a very user-friendly way to write instruction templates, and so the template compiler was born. It takes as input a text file with expression templates defined as symbolic expressions, best known from the LISP world, and outputs a header file that contains the templates, ready for use by the JIT compiler. Note that the word 'template' has become a bit overloaded, referring to the textual input of the template compiler as well as to the binary input to the JIT compiler. That's okay, I guess, since they're really two representations of the same thing. The following table shows how template text, binary, and expression tree relate to each other:

Text 'Binary' Tree

(template: unless_i
  (when
    (zr $0)
    (branch (label $1))
  ))

template: {
  MVM_JIT_ZR,
  0,
  MVM_JIT_LABEL,
  1,
  MVM_JIT_BRANCH,
  2,
  MVM_JIT_WHEN,
  0,
  4,
},
info: ".f.f.l.ll",
len: 9,
root: 6

I hope it isn't too hard to see how one maps to the other. The unless_i instruction executes a branch, to the label specified by its second argument, if its integer argument is zero. All symbols (like when, label and zr) have been replaced by uppercase prefixed constants (MVM_JIT_WHEN), and all nesting has been replaced by references (indexes) into the template array. The 'info' string specifies how the template is to be linked into the tree. Instruction operands are indicated by an 'f', and internal links by an 'l'. In the tree representation the operands have been linked into the tree by the JIT; they form the LOAD and CONST nodes and everything below them.

Anyway, my improvement concerns a more complex form of template, such as the following example, an instruction to load an object value from the instance field of an object:

(template: sp_p6oget_o
  (let: (($val (load (add (^p6obody $1) $2) ptr_sz)))
    (if (nz $val) $val (^vmnull))))

This template contains a let: expression, which declares the $val variable. This value can be used in the subsequent expression by its name. Without such declarations the result of a computation could only have one reference, its immediate syntactic parent. (Or in other words, without let:, every template can only construct a tree). That is very inconvenient in case a result should be checked for null-ness, as in this case. (vmnull is a macro for the global 'null object' in MoarVM. The null object represents NULL wherever an object is needed, but isn't actually NULL, as that would mean it couldn't be dereferenced; it saves the interpreter from checking if a pointer to an object is NULL everywhere it is accessed).

The let: construct has another purpose: it ensures the ordering of operations. Although most operations can be ordered in whatever way suits the compiler, some cannot, most notably function calls. (Function calls may have numerous unpredictable side effects, after all). All statements declared in the 'let declaration body' are compiled to run before any statements in the 'expression body'. This enables the programmer to ensure that a value is not needlessly computed twice, and more importantly, it ensures that a value that is used in multiple branches of a conditional statement is defined in both of them. For instance:

(let (($foo (...)))
     (if (...)
         (load $foo)
         $foo))

This pseudo-snippet of template code would dereference $foo if some condition is met (e.g. $foo is not NULL) and return $foo directly otherwise. Without let to order the computation of $foo prior to the blocks of if, the first (conditional) child of if would be the first reference to $foo. That would mean that the code to compute $foo is only compiled in the first conditional block, which would not be executed whenever the if condition was not true, meaning that $foo would be undefined in the alternative conditional block. This would mean chaos. So in fact let does order expressions. All is good.

Except... I haven't told you how this ordering works, which is where my change comes in. Prior to commit 7fb1b10 the let expression would insert a hint to the JIT compiler to add the declared expressions as tree roots. The 'tree roots' are where the compiler starts converting the expression tree (graph) to a linear sequence of byte code. Hence the declaring expressions are compiled prior to the dependent expressions. But this has, of course, one big disadvantage, which is that the set of roots is global for the tree. Every declaration, no matter how deep into the tree, was to be compiled prior to the head of the tree. As a result, the following template code would not at all do what you want:


(let (($foo (...)))
     (if (nz $foo)
         (let (($bar (load $foo))) # dereference $foo !
              (... $bar))
         (...)))

The declaration of $bar would cause $foo to be dereferenced prior to checking whether it is non-null, causing a runtime failure. Chaos is back. Well, that's what I've changed. Fortunately, we have another ordering mechanism at our disposal, namely DO lists. These are nodes with a variable number of children that are also promised to be compiled in order. After the patch linked above, the compiler now transforms let expressions into the equivalent DO expressions. Because DO expressions can be nested safely, $bar is not computed prior to the null-check of $foo, as the programmer intended. I had originally intended to implement analysis to automatically order the expressions with regard to the conditionals, but I found that this was more complicated to implement and more surprising to the programmer. I think that in this case, relying on the programmer is the right thing.

One thing that I found interesting is that this reduces the number of mechanisms in the compiler. The 'root-hint' was no longer useful, and was subsequently removed. At the same time, all but the last child of a DO list must be void expressions, i.e. yield no value, because DO can only return the value of its last child. Since all expressions in a let declaration must yield some value - otherwise they would be useless - they required a new operation type: discard. Thus, with one new node type (an extension of the data range), we could remove a whole class of behavior.

After I had implemented this, I've started working on adding basic block analysis. That is a subject for a later post, though. Until next time!

dinsdag 28 maart 2017

Function Call Milestone

Hi everybody. It's high time for another update, and this time I have good news. The 'expression' JIT compiler can now compile native ('C') function calls (although it's not able to use the results). This is a major milestone because function calls are hard! (At least from the perspective of a compiler, and especially from the perspective of the register allocator). Also because native function calls are really very important in MoarVM. Most of its 'primitive' operations (like hash table access, string equality, big integer arithmetic) are implemented by invoking native functions, and so to compile almost any program the JIT has to compile many function calls.

What makes function calls 'hard' is that they must implement the 'calling convention' of the relevant 'application binary interface' (ABI). In short, the ABI specifies the locations of function call parameters. A small number of parameters (on Windows, the first 4; on POSIX platforms, the first 6) are placed in registers, and if there are more parameters they are usually placed on the stack. Aside from the calling convention, the ABI also specifies the expected alignment of the stack pointer (to 16 bytes), the registers a function may overwrite (clobber in ABI-speak), and which registers must have their original values after the function returns. The last type of registers are called 'callee-saved'. Note that at least a few registers must be callee-saved, especially those related to call stack management, because if the callee function would overwrite those it would be impossible to return control back to the caller. By the way, manipulating exactly those registers is how the setjmp and longjmp 'functions' work.

So the compiler is tasked with generating code that ensures the correct values are placed in the correct registers. That sounds easy enough, but what if these registers are taken by other values, and what if those other values might be required for another parameter? Indeed, what if the value in the %rdx register needs to be in the %rsi register, and the value of the %rsi register is required in the %rdx register? How to determine the correct ordering for shuffling the operands?

One simple way to deal with this would be to eject all values from registers onto the stack, and then load them back into registers as they are needed. However, that would be very inefficient, especially since most function calls have no more than 6 (or 4) parameters and most of these parameters are computed for the function call only. So I thought that solution wouldn't do.

Another way to solve this would be if the register allocator could ensure that values are placed in their correct registers directly - especially for register parameters - i.e. by 'precoloring'. (The name comes from register allocation algorithms that work by 'graph coloring', something I will try to explain in a later post). However, that isn't an option due to my choice of 'linear scan' as the register allocation algorithm. This is a 'greedy' algorithm, meaning that it decides the allocation for a live range as soon as it encounters it, and that it cannot revert that decision once it's been made. (If it could, it would be more like a dynamic programming algorithm). So to ensure that the allocation is valid I'd have to make sure that the information about register requirements is propagated backwards from the instructions to all values that might conflict with it... and at that point we're no longer talking about linear scan, and I would be better off engineering a new algorithm. Not a very attractive option either!

Instead, I thought about it and it occurred to me that this problem seems a lot like unravelling a dependency graph, with a number of restrictions. That is to say, it can be solved by a topological sort. I map the registers to a graph structure as follows:

  • Each register forms a node
  • A transfer from a register to another register, or from a register to a stack location or a local memory location is an edge
  • Each node can have multiple outbound edges, but only one inbound edge (only one value can ever be required in that register)
I linked to the topological sort page for an explanation of the problem, but I think my implementation is really quite different from that presented there. They use a node visitation map and a stack; I use an edge queue and an outbound count. A register transfer (edge) can be enqueued if it is clear that the destination register is not currently used. Transfers from registers to stack locations (as function call parameters) or local memory (to save the value from being overwritten by the called function) are also enqueued directly. As soon as the outbound count of a node reaches zero, it is considered to be 'free' and the inbound edge (if any) is enqueued.

Unlike a 'proper' dependency graph, cycles can and do occur, as in the example where %rdx and %rsi would need to swap places. Fortunately, because of the single-inbound-edge rule, such cycles are 'simple' - all outbound edges not belonging to the cycle can be resolved prior to the cycle-breaking, and all remaining edges are part of the cycle. Thus, the cycle can always be broken by freeing just a single node (i.e. by copying its value to a temporary register).
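In simplified, self-contained form - made-up structures, with emit_move standing in for actual code generation - the resolver looks something like this:

#include <stdio.h>

#define NUM_REGS  16
#define TMP_REG   NUM_REGS /* scratch register reserved for breaking cycles */
#define MAX_MOVES 32

typedef struct { int src, dst; } Move; /* dst == -1 means: store to memory */

static void emit_move(int src, int dst) {
    if (dst < 0) printf("store r%d -> [stack]\n", src);
    else         printf("mov   r%d -> r%d\n", src, dst);
}

/* each register has at most one inbound move; a move can be emitted
   once its destination register has no pending outbound moves */
void shuffle(Move *moves, int n) {
    int out_count[NUM_REGS + 1] = {0};
    int inbound[NUM_REGS + 1];
    int emitted[MAX_MOVES] = {0};
    int queue[MAX_MOVES * 2], qhead = 0, qtail = 0;
    int i, r, done = 0;
    for (r = 0; r <= NUM_REGS; r++)
        inbound[r] = -1;
    for (i = 0; i < n; i++) {
        out_count[moves[i].src]++;
        if (moves[i].dst >= 0)
            inbound[moves[i].dst] = i;
    }
    /* moves to memory, or to registers that nothing reads, can go first */
    for (i = 0; i < n; i++)
        if (moves[i].dst < 0 || out_count[moves[i].dst] == 0)
            queue[qtail++] = i;
    while (done < n) {
        while (qhead < qtail) {
            i = queue[qhead++];
            if (emitted[i]) continue;
            emitted[i] = 1; done++;
            emit_move(moves[i].src, moves[i].dst);
            /* freeing the source may unblock the move that targets it */
            if (--out_count[moves[i].src] == 0 && inbound[moves[i].src] >= 0)
                queue[qtail++] = inbound[moves[i].src];
        }
        if (done < n) { /* everything left is part of a cycle */
            for (i = 0; emitted[i]; i++)
                ;
            r = moves[i].src;
            emit_move(r, TMP_REG);  /* evacuate one register... */
            moves[i].src = TMP_REG; /* ...and read it from there later */
            out_count[TMP_REG]++;
            if (--out_count[r] == 0 && inbound[r] >= 0)
                queue[qtail++] = inbound[r]; /* the cycle is now broken */
        }
    }
}

For the %rdx/%rsi example, nothing can be enqueued initially, so the cycle is broken immediately: one register is copied to the scratch register, the other move becomes safe, and the evacuated value is moved into place last.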

The only thing left to consider is the values that are used after the function call returns (i.e. that survive the function call) and that are stored in registers that the called function can overwrite (which is all of them, since the register allocator never selects callee-saved registers). So to make sure they are available afterwards, we must spill them. But there are a few spill strategies to choose from (terminology made up by me):

  • full spill replaces all references to the value in a register with references to the value in memory, by inserting load and store operations around every use and definition of the value. Basically, it splits the single value live range into parts.
  • split-and-spill finds the code segment where the spill is not required, splits it off from the rest, and only edits the second part. This is actually rather tricky, since doing it correctly requires data-flow analysis, which isn't otherwise required by a 'naive' implementation of linear scan.
  • spill-and-restore will find the shortest piece of code where the spill is necessary, store the value to memory, and load it directly afterwards, as if the spill never happened.
The current register allocator does a full spill when it's run out of registers, and it would make some sense to apply the same logic for function-call related spills. I've decided to use spill-and-restore, however, because a full spill complicates the sorting order (a value that used to be in a register is suddenly only in memory) and it can be wasteful, especially if the call only happens in an alternative branch. This is common for instance when assigning values to object fields, as that may sometimes require a write barrier (to ensure the GC tracks all references from 'old' to 'new' objects). So I'm guessing that it's going to be better to pay the cost of spilling and restoring only in those alternative branches, and that's why I chose to use spill-and-restore.

That was it for today. Although I think being able to call functions is a major milestone, this is not the very last thing to do. We currently cannot allocate any of the registers used for floating-point calculations, which is a relatively minor limitation since those aren't used very frequently. But I also need to do some more work to actually use function return values and apply generic register requirements of tiles. But I do think the day is coming near where we can start thinking about merging the new JIT with the MoarVM master branch, making it available to everybody. Until next time!

donderdag 9 februari 2017

Register Allocator Update

Hi everybody, I thought some of you might be interested in an update regarding the JIT register allocator, which is after all the last missing piece for the new 'expression' JIT backend. Well, the last complicated piece, at least. Because register allocation is such a broad topic, I don't expect to cover all the design decisions here, and reserve a future post for that purpose.

I think I may have mentioned earlier that I've chosen to implement linear scan register allocation, an algorithm first described in 1999. Linear scan is relatively popular for JIT compilers because it achieves reasonably good allocation results while being considerably simpler and faster than the alternatives, most notably graph coloring (unfortunately no open access link available). Because optimal register allocation is NP-complete, all realistic algorithms are heuristic, and linear scan applies a simple heuristic to good effect. I'm afraid fully explaining the nature of that heuristic and the tradeoffs involved is beyond the scope of this post, so you'll have to remind me to do it at a later point.

Commit ab077741 made the new allocator the default after I had ironed out sufficient bugs to be feature-equivalent to the old allocator (which still exists, although I plan to remove it soon).
Commit 0e66a23d introduced support for 'PHI' node merging, which is really important and exciting to me, so I'll have to explain what it means. The expression JIT represents code in a form in which all values are immutable, called static single assignment form, or SSA form for short. This helps simplify compilation because there is a clear correspondence between operations and the values they compute. In general in compilers, the easier it is to assert something about code, the more interesting things you can do with it, and the better code you can compile. However, in real code, variables are often assigned more than one value. A PHI node is basically an 'escape hatch' to let you express things like:

int x, y;
if (some_condition()) {
    x = 5;
} else {
    x = 10;
}
y = x - 3;

In this case, despite our best intentions, x can have two different values. In SSA form, this is resolved as follows:

int x1, x2, x3, y;
if (some_condition()) {
    x1 = 5;
} else {
    x2 = 10;
}
x3 = PHI(x1,x2);
y = x3 - 3;

The meaning of the PHI node is that it 'joins together' the values of x1 and x2 (somewhat like a junction in perl6), and represents the value of whichever 'version' of x was ultimately defined. Resolving PHI nodes means ensuring that, as far as the register allocator is concerned, x1, x2, and x3 should preferably be allocated to the same register (or memory location), and if that's not possible, it should copy x1 and x2 to x3 for correctness. To find the set of values that are 'connected' via PHI nodes, I apply a union-find data structure, which is a very useful data structure in general. Much to my amazement, that code worked the first time I tried it.
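(For reference, a minimal union-find sketch - not MoarVM's exact code:)

#define MAX_VALUES 1024
static int parent[MAX_VALUES]; /* initialized so that parent[i] == i */

int find(int x) {
    while (parent[x] != x) {
        parent[x] = parent[parent[x]]; /* path halving keeps trees shallow */
        x = parent[x];
    }
    return x;
}

void unite(int a, int b) {
    parent[find(a)] = find(b);
}

/* for each PHI node 'x3 = PHI(x1, x2)': unite(x3, x1); unite(x3, x2);
   afterwards find(x1) == find(x2) == find(x3), and the allocator can
   treat the whole set as one value */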

Then I had to fix a very interesting bug in commit 36f1fe94, which involves ordering between 'synthetic' and 'natural' tiles. (Tiles are the output of the tiling process, about which I've written at some length; they represent individual instructions). Within the register allocator, I've chosen to identify tiles / instructions by their index in the code list, and to store tiles in a contiguous array. There are many advantages to this strategy, but they are also beyond the scope of this post. One particular advantage, though, is that the indexes into this array make their relative order immediately apparent. This is relevant to linear scan because it relies on relative order to determine when to allocate a register and when a value is no longer necessary.

However, because of this indexing scheme, it's not so easy to squeeze new tiles into that array - which is exactly what a register allocator does when it decides to 'spill' a value to memory and load it when needed. (Because inserts are delayed and merged into the array in a single step, the cost of insertion is constant). Without proper ordering, a value loaded from memory could overwrite another value that is still in use. The fix for that is, I think, surprisingly simple and elegant. In order to 'make space' for the synthetic tiles, before comparison all indexes are multiplied by a factor of 2, and synthetic tiles are further offset by -1 or +1, depending on whether they should be handled before or after the 'natural' tile they are inserted for. E.g. synthetic tiles that load a value should be processed before the tile that uses the value they load.
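Expressed as a comparison key (a sketch):

/* a natural tile at index i sorts as 2*i; a synthetic tile inserted
   for it sorts as 2*i - 1 (handled before) or 2*i + 1 (handled after) */
static int tile_order_key(int idx, int is_synthetic, int after) {
    if (!is_synthetic)
        return 2 * idx;
    return 2 * idx + (after ? 1 : -1);
}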

Another issue soon appeared, this time having to do with x86 being, altogether, quaint and antiquated and annoying, and specifically with its use of one operand register as both source and result. To put it simply, where you and I and the expression JIT structure might say:

a = b + c

x86 says:

a = a + b

Resolving the difference is tricky, especially for linear scan, since linear scan processes the values in the program rather than the instructions that generate them. It is therefore not suited to deal with instruction-level constraints such as these. If a, b, and c in my example above are not the same (not aliases), then this can be achieved by a copy:

a = b
a = a + c

If a and b are aliases, the first copy isn't necessary. However, if a and c are aliases, then a copy may or may not be necessary, depending on whether the operation (in this case '+') is commutative - this holds for '+' but not for '-'. Commit 349b360 attempts to fix that for 'direct' binary operations, but a fix for indirect operations is still work in progress. Unfortunately, it meant I had to reserve a register for temporary use to resolve this, meaning there are fewer available for the register allocator to use. Fortunately, that did simplify handling of a few irregular instructions, e.g. the signed cast of 32-bit integers to 64-bit integers.
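To summarize the cases, here is a sketch of such a lowering (the emit helpers, commutativity test and scratch register number are assumptions for the example):

typedef struct Compiler Compiler;

/* assumed code generation helpers */
extern void emit_copy(Compiler *cl, int dst, int src);
extern void emit_op2(Compiler *cl, int op, int dst, int src); /* dst = op(dst, src) */
extern int  op_is_commutative(int op);
#define SCRATCH_REG 11 /* the reserved temporary mentioned above */

/* lower three-operand 'c = op(a, b)' to x86's two-operand form */
void lower_binop(Compiler *cl, int op, int c, int a, int b) {
    if (c == a) {
        emit_op2(cl, op, c, b);        /* a = op(a, b) fits directly */
    } else if (c == b && op_is_commutative(op)) {
        emit_op2(cl, op, c, a);        /* b = op(b, a) == op(a, b) */
    } else if (c == b) {
        emit_copy(cl, SCRATCH_REG, a); /* non-commutative: go via the
                                          scratch register */
        emit_op2(cl, op, SCRATCH_REG, b);
        emit_copy(cl, c, SCRATCH_REG);
    } else {
        emit_copy(cl, c, a);           /* c = a; c = op(c, b) */
        emit_op2(cl, op, c, b);
    }
}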

So that brings us to today and my future plans. The next thing to implement will be support for function calls by the register allocator, which involves shuffling values to the right registers and correct positions on the stack, and also in spilling all values that are still required after the function call since the function may overwrite them. This requires a bit of refactoring of the logic that spills variables, since currently it is only used when there are not enough registers available. I also need to change the linear scan main loop, because it processes values in order of first definition, and as such, instructions that don't create any values are skipped, even if they need special handling like function calls. I'm thinking of solving that with a special 'interesting tiles' queue that is processed alongside the main values working queue.

That was it for today. I hope to write soon with more progress.

zondag 6 november 2016

A guide through register allocation: Introduction

This is the first post in what I intend to be a series on the register allocator for the MoarVM JIT compiler. It may be a bit less polished than usual, because I also intend to write more of these posts than I have in the past few months.

The main reason to write a register allocator is that it is needed by the compiler. The original 'lego' MoarVM JIT didn't need one, because it used what is called a 'memory-to-memory' model, meaning that every operation is expected to move operands from and to memory. In this it closely follows the behavior of virtually every other interpreter in existence, and especially that of MoarVM. However, many of these memory operations are logically redundant (for example, when storing and immediately loading an intermediate value, or loading the same value twice). Such redundancies are inherent to a memory-to-memory code model. In theory some of that can be optimized away, but in practice that involves building an unreasonably complicated state machine.

The new 'expression' JIT compiler was designed with the explicit (well, explicit to me, at least) goals of enabling optimization and specialization of machine code. That meant that a register-to-register code model was preferable, as it makes all memory operations explicit, which in turn enables optimization to remove some of them. (Most redundant 'load' operations can already be eliminated, and I'm plotting a way to remove most redundant 'store' operations, too). However, that also means the compiler must ensure that values can fit into the limited register set of the CPU, and that they aren't accidentally overwritten (for example as a result of a subroutine call). The job of the register allocator is to translate virtual registers to physical registers in a given code segment. This may involve modifying the original code by inserting load, store and copy operations.

Register allocation is known as a hard problem in computer science, and I think there are two reasons for that. The first reason is that finding the optimal allocation for a code segment is (probably) NP-complete. (NP-complete basically means that you have to consider all possible solutions in order to find the one you are after. A common feature of NP-complete problems is that the effect of a local choice on the global solution cannot be fully predicted). However, for what I think are excellent reasons, I can sidestep most of that complexity using the 'linear scan' register allocation algorithm. The details of that algorithm are subject of a later post.

The other reason that register allocation is hard is that the output code must meet the demanding specifications of the target CPU. For instance, some instructions take input only from specific registers, and some implicitly overwrite other registers. Calling conventions can also present a significant source of complexity as values must be placed in the right registers (or on the right stack locations) where the called function may expect them. So the register allocator must somehow encode these specific demands and ensure they are not violated.

Now that I've introduced register allocation, why it is needed, and what the challenges are, the next posts can begin to describe the solutions that I'm implementing.