tag:blogger.com,1999:blog-78644975988138743552024-03-14T01:06:29.733-07:00brrt to the futureStuff about VM and JIT internalsBart Wiegmanshttp://www.blogger.com/profile/05760700924227974153noreply@blogger.comBlogger43125tag:blogger.com,1999:blog-7864497598813874355.post-17476138003013066182023-06-10T15:33:00.000-07:002023-06-10T15:33:02.737-07:00Retrospective of the MoarVM JIT<p style="text-align: left;">Hi hackers! Today the MoarVM JIT project is nearly 9 years old. I was inspired by Jonathan's <a href="https://youtu.be/3lxuCr54ULE" target="_blank">presentation</a> reflecting on the development of MoarVM, to do the same for the MoarVM JIT, for which I have been responsible.<br /></p><p style="text-align: left;">For those who are unfamiliar, what is commonly understood as 'JIT compilation' for virtual machines is performed by two components in MoarVM.<br /></p><ul style="text-align: left;"><li style="text-align: left;">A framework for runtime type specialization ('<a href="https://6guts.wordpress.com/2017/11/05/moarvm-specializer-improvements-part-3-optimizing-code/">spesh</a>')</li><li style="text-align: left;">A native code generation backend for the specialized code (the 'JIT').</li></ul><p>This post refers only to the native code generation backend component. It, too, is split into two mostly-independent systems:</p><ul style="text-align: left;"><li>A backend that emits code directly from MoarVM instructions from machine code templates (the 'lego' JIT compiler).<br /></li><li>Another backend that transforms MoarVM instructions into an expression-based intermediate representation and compiles machine code based on that (the 'expression' compiler).<br /></li></ul><div style="text-align: left;"><h4 style="text-align: left;">Things that worked well</h4><ul style="text-align: left;"><li style="text-align: left;">Using <a href="https://luajit.org/dynasm.html" target="_blank">DynASM</a> for code generation. Even though we had to extend it to <a href="https://brrt-to-the-future.blogspot.com/2015/09/most-significant-bits.html">support register selection</a>, using DynASM saved a lot of time compared to generating code 'from scratch'.<br /></li><li style="text-align: left;">Textual opcode templates for the expression compiler (mapping from MoarVM instructions to the JIT intermediate representation)</li><li style="text-align: left;"><a href="https://brrt-to-the-future.blogspot.com/2018/09/template-compiler-update.html" target="_blank">Type analysis</a> for opcode templates (preventing bugs)<br /></li><li style="text-align: left;">Tiling for <a href="https://brrt-to-the-future.blogspot.com/2015/07/tiles-and-compiler-compilers.html">instruction selection</a></li><li style="text-align: left;">Using the <a href="https://brrt-to-the-future.blogspot.com/2018/06/controlled-stack-hacking-for-moarvm-jit.html" target="_blank">on-stack return address</a> as a current position marker (for the purpose of exception handling, lexical variable analysis etc.)</li></ul><h4 style="text-align: left;"> Things that didn't work so well </h4><ul style="text-align: left;"><li style="text-align: left;">Testing. MoarVM doesn't have a separate test suite but relies on the NQP and Rakudo tests. For the JIT compiler, this is essentially 'testing in production'. In hindsight, specific unit and integration tests would have been beneficial.</li><li style="text-align: left;">Using an unordered intermediate representation for the 'expression' JIT. 
Against my expectation this <a href="https://brrt-to-the-future.blogspot.com/2019/03/something-about-ir-optimization.html">prevented the implementation</a> of common optimizations. And it made it nearly impossible to extend the 'expression' JIT to code segments longer than a basic block.<br /></li><li style="text-align: left;">Failing to deprecate the 'legacy' JIT compiler. </li><li style="text-align: left;">Despite some attempts I never truly succeeded in sharing the ownership of the JIT. <br /></li></ul><h4 style="text-align: left;">What's kind of ugly</h4><ul style="text-align: left;"><li style="text-align: left;">The 'expression' IR uses a completely different memory model (linear array with integer indexes) from the 'spesh' subsystem of which it is supposedly a part (which uses a more straightforward object graph coupled with arena allocation). The reason to do this was that the expression IR expands each MoarVM instruction into many IR instructions, and if every IR instruction node had to be pointer-based the memory costs would have been significant. But it's still kind of messy.</li><li style="text-align: left;">Some features in MoarVM (like exception handling) rely on being able to identify the section of code which is currently being executed... which is a reasonable choice for an interpreter but not great for compiled code.<br /></li></ul><h4 style="text-align: left;">How did we get here?</h4><p>On one hand, as a result of my limited experience, time and resources; on the other hand, as a result of the design of MoarVM.</p><p>MoarVM was originally designed as a traditional interpreter for a high level language (much like the Perl interpreter), meaning that it has a large number of different instructions and that many instructions operate on high-level data structures like strings, arrays and maps (as opposed to pointers and machine words).<br /></p><p>This is by no means a bad or outdated design. Frequently executed routines (string manipulation, hash table lookups etc.) are implemented using an efficient language (C) and driven by a language that is optimized for usability (Raku). This design is also used in modern machine learning frameworks. More importantly, this was a reasonable design because it is a good target for the Rakudo compiler.<br /></p><p>For the JIT compiler, this means two things:</p><ul style="text-align: left;"><li>Mapping the large number of VM instructions to machine code (or IR) becomes a significant challenge.</li><li>The high level routines and data structures used by the interpreter are mostly opaque to the compiler.</li></ul><p style="text-align: left;">The machine code generated by the JIT compiler will then mostly consist of consecutive function calls to VM routines, which is not the type of code where a compiler can really improve performance much. <br /></p><p style="text-align: left;">In other words, suppose 50% of runtime is spent in interpretation overhead (instruction decoding and dispatch) and 50% is spent in VM routines; then removing interpretation overhead via JIT compilation will at best result in a twofold increase in performance. For many programs, the observed performance increase will be even less.</p><p style="text-align: left;">Mind that I'm specifically referring to the improvements due to machine code generation, and <i>not</i> to those due to type specialization, inlining etc. (the domain of 'spesh'). 
These latter features have resulted in much more significant performance improvements.<br /></p><h4 style="text-align: left;">Was it worth it?</h4>I think it was.<br /><p style="text-align: left;">For me personally, it was a tremendously valuable learning experience which led directly to my current career, writing SQL compilers for Google Cloud.</p><p style="text-align: left;">For the Raku community, even if we never realized the performance improvements that I might have hoped for at the start, I hope that the JIT project (as it exists) has been valuable, if for no other reason than identifying the challenges of JIT compilation for MoarVM. A future effort may be able to do better based on what we learned; and I hope my blog posts are a useful resource from that perspective. <br /></p><h4 style="text-align: left;">What's next?</h4><p style="text-align: left;">Assuming that time and resources were not an issue:<br /></p><ul style="text-align: left;"><li style="text-align: left;">I'd start by adding tests to the JIT backend. This might take the shape of extracting (parts of) the JIT to a separate project, which MoarVM would link to.</li><li style="text-align: left;">I'd re-engineer the expression IR to be linearly ordered instructions, at which point we should be able to construct IR for multiple basic blocks.</li><li style="text-align: left;">I'd deprecate the legacy JIT so we only have one backend left.</li></ul><p>If any of this comes to pass, you'll find my report on it right here. Thanks for reading and until then! <br /></p><p> </p></div>Bart Wiegmanshttp://www.blogger.com/profile/05760700924227974153noreply@blogger.com0tag:blogger.com,1999:blog-7864497598813874355.post-38700392251387771382021-03-14T14:33:00.000-07:002021-03-14T14:33:31.042-07:00Why bother with Scripting?<p>Many years back, Larry Wall shared his <a href="https://www.perl.com/pub/2007/12/06/soto-11.html/" target="_blank">thesis</a> on the nature of scripting. Since even <a href="https://openjdk.java.net/jeps/330">Java</a> has recently gained 'script' support, I thought it would be fitting to revisit the topic, and hopefully it will be relevant to the perl and raku language community.<br /></p><p>The weakness of Larry's treatment (which, to be fair to the author, I think is more intended to be enlightening than to be complete) is the contrast of <i>scripting</i> with <i>programming</i>. This contrast does not permit a clear separation because <i>scripts</i> are <i>programs</i>. That is to say, no matter how long or short, scripts are written commands for a machine to execute, and I think that's a pretty decent definition of a program in general.</p><p>A more useful contrast - and, I think, the intended one - is between <i>scripts</i> and <i>other sorts of programs</i>, because that allows us to compare <i>scripting</i> (writing scripts) with 'programming' (writing non-script programs). And to do that we need to know what other sorts of programs there are.</p><p>The short version of that answer is - systems and applications, and a bunch of other things that aren't really relevant to the working programmer, like (embedded) control algorithms, spreadsheets and database queries. (The definition I provided above is very broad, by design, because I don't want to get stuck on boundary questions). 
Most programmers write applications, some write systems, virtually all write scripts once in a while, though plenty of people who <i>aren't</i> professional programmers also write scripts.</p><p>I think the defining features of applications and systems are, respectively:</p><ul style="text-align: left;"><li>Applications present <i>models</i> to users (for <i>manipulation</i>)</li><li>Systems provide <i>functionality</i> to other programs</li></ul><p>Consider for instance a mail client (like thunderbird) in comparison to a mailer daemon (like sendmail) - one provides an interface to read and write e-mails (the model) and the other provides functionality to send that e-mail to other servers.</p><p>Note that under this (again, broad) definition, <i>libraries</i> are also system software, which makes sense, considering that their users are developers (just as for, say, PostgreSQL) who care about things like performance, reliability, and correctness. Incidentally, libraries as well as 'typical' system software (such as database engines and operating system kernels) tend to be written in languages like C and C++ for much the same reasons.</p><p>What, then, are the differences between scripts, applications, and systems? I think the following is a good list:</p><ul style="text-align: left;"><li>Scripts tend to be short; applications in particular can grow very large.<br /></li><li>Scripts tend to be ad-hoc (written for a specific need); applications and systems tend to be designed for a range of use cases. (Very common example: build scripts)<br /></li><li>Scripts tend to run only in a specific environment; in contrast, many applications are designed for a range of devices/clients; many systems have specific requirements but the intention is that they can be set up in multiple distinct environments. <br /></li><li>Because scripts are ad-hoc, short, and environment-dependent, many of software engineering's standard best practices don't really apply (and are in fact often disregarded).<br /></li></ul><p>Obviously these distinctions aren't really binary - 'short' versus 'long', 'ad-hoc' versus 'general purpose' - and can't be used to conclusively settle the question whether something is a script or an application. (If, indeed, that question ever comes up). More important is that for the 10 or so scripts I've written over the past year - some professionally, some not - all or most of these properties held, and I'd be surprised if the same isn't true for most readers. </p><p>And - finally coming to the point that I'm trying to make today - these features point to a specific <i>niche</i> of programs more than to a specific <i>technology</i> (or set of technologies). To be exact, scripts are (mostly) short, custom programs to automate ad-hoc tasks, tasks that are either too specific or too small to develop and distribute another program for.<br /></p><p>This has further implications for the preferred features of a <i>scripting language</i> (taken to mean, a language designed to enable the development of scripts). In particular:</p><ul style="text-align: left;"><li>It should make programs concise. 
The economic rationalization is that the total expected lifetime value of a script, being ad-hoc and context-dependent, is not very great, so writing it should be cheap, which implies that the script should be short.</li><li>Related to this, the value provided by type systems is generally less than in larger (application) programs, and the value of extensive modelling features (class hierarchies) is similarly low, so many scripting languages have very weak type systems and data modelling features, if they have them at all.</li><li>Interoperation with the environment is on the other hand very important, so I/O features tend to be well-developed. (Contrast C, in which I/O is entirely an afterthought provided by a library).<br /></li><li>It is acceptable to depend on a local environment in implicit ways, since that's what you are going to do anyway.</li><li>It is acceptable to <i>warn</i> on a condition that might've been a fatal error in another programming language.</li><li>In fact, I think that concerns of correctness are often different, meaning relaxed, compared to applications, again because scripters don't necessarily expect their scripts to run on every environment and with every possible input. <br /></li></ul>As an example of the last point - Python 3 requires users to be exact about the encoding of their input, causing <i>all sorts</i> of trouble for unsuspecting scripters when they accidentally try to read ISO-8859-1 data as UTF-8, or vice versa. Python 2 did not, and for most <i>scripts</i> - not applications - I actually think that is the right choice.<p>This niche doesn't always exist. In computing environments where everything of interest is adequately captured by an application, or which lack the ability to effectively automate ad-hoc tasks (I'm thinking in particular of Windows before PowerShell), the practice of scripting tends not to develop. Similarly, in a modern 'cloud' environment, where system setup is controlled by a <a href="https://kubernetes.io/">state machine</a> hosted by <a href="https://cloud.google.com/kubernetes-engine" target="_blank">another organization</a>, scripting doesn't really have much of a future.</p><p>To put it another way, scripting only thrives in an environment that has a lot of 'scriptable' tasks - tasks for which there isn't already a pre-made solution available - that has powerful facilities available for a script to access, and whose users are empowered to automate those tasks. Such qualities are common on Unix/Linux 'workstations' but rather less so on smartphones and (as noted before) cloud computing environments.</p><p>Truth be told I'm a little worried about that development. I could point to, and expound on, the development and popularity of languages like go and rust, which aren't exactly scripting languages, or the replacement of Javascript with TypeScript, to make the point further, but I don't think that's necessary. At the same time I could point to the development of data science as a discipline to demonstrate that scripting is alive and well (and indeed perhaps more economically relevant than before).</p><p>What should be the conclusion for perl 5/7 and raku? I'm not quite sure, mostly because I'm not quite sure whether the broader perl/raku community would prefer their sister languages to be scripting or application languages. (As implied above, I think the Python community chose that they wanted Python 3 to be an application language, and this was not without consequences to their users). 
</p><p>Raku adds a number of features common to application languages (I'm thinking of its powerful type system in particular), continuing a trend that perl 5 arguably pioneered. This is indeed a very powerful strategy - a language can be introduced for scripts and some of those scripts are then extended into applications (or even systems), thereby ensuring its continued usage. But for it to work, a new<i> </i>perl family language must be introduced on its <i>scripting</i> merits, <i>and</i> there must be a plentiful supply of scriptable tasks to automate, some of which - or a combination of which - grow into an application.<br /></p><p>For myself, I would like to see scripting have a bright future. Not just because scripting is the most accessible form of programming, but also because an environment that permits, even requires, scripting is one where not all interesting problems have been solved, one where its users ask it to do tasks so diverse that there <i>isn't </i>an app for that, yet. One where the true potential of the wonderful devices that surround us can be explored.<br /></p><p>In such a world there might well be a bright future for scripting.<br /></p>Bart Wiegmanshttp://www.blogger.com/profile/05760700924227974153noreply@blogger.com0tag:blogger.com,1999:blog-7864497598813874355.post-82092308029157895402019-03-21T15:52:00.001-07:002019-03-21T15:52:03.295-07:00Reverse Linear Scan Allocation is probably a good ideaHi hackers! First of all, I want to thank everybody who gave such useful <a href="https://news.ycombinator.com/item?id=19425485" target="_blank">feedback</a> on my last post. For instance, I found out that the similarity between the expression JIT IR and the Testarossa <a href="https://github.com/eclipse/omr/blob/master/doc/compiler/il/IntroToTrees.md" target="_blank">Trees IR</a> is quite remarkable, and that they have a fix for the problem that is quite different from what I had in mind.<br />
<br />
Today I want to write something about register allocation, however. Register allocation is probably not my favorite problem, on account of being both messy and thankless. It is a messy problem because - aside from being NP-hard to solve optimally - hardware instruction sets and software ABIs introduce all sorts of annoying constraints. And it is a thankless problem because the cases in which a good register allocator is useful - for instance, when there are lots of intermediate values used over a long stretch of code - are fairly rare. Much more common are the cases in which either there are trivially sufficient registers, <i>or </i>ABI constraints force a spill to memory anyway (e.g. when calling a function, almost all registers can be overwritten).<br />
<br />
So, on account of this being not my favorite problem, and also because I promised to implement optimizations in the register allocator, I've been researching whether there is a way to do better. And what better place to look than one of the fastest dynamic language implementations around, <a href="http://luajit.org/luajit.html" target="_blank">LuaJIT</a>? So that's what I did, and this post is about what I learned from that.<br />
<br />
Truth be told, LuaJIT is not at all a learners' codebase (and I don't think its author would claim this). It uses a rather terse style of C and lots and lots of preprocessor macros. I had somewhat gotten used to the style from hacking <a href="https://github.com/MoarVM/dynasm/tree/x64-dynamic-registers" target="_blank">dynasm</a> though, so that wasn't so bad. What was more surprising is that some of the steps in code generation that are distinct and separate in the MoarVM JIT - instruction selection, register allocation and emitting bytecode - were all blended together in LuaJIT. Over multiple backend architectures, too. And what's more - all these steps were done in <i>reverse order</i> - from the end of the program (trace) to the beginning. Now that's interesting...<br />
<br />
I have no intention of combining all phases of code generation like LuaJIT has. But processing the IR in reverse seems to have some interesting properties. To understand why that is, I'll first have to explain how linear scan allocation currently works in MoarVM, and how it is most commonly <a href="http://web.cs.ucla.edu/~palsberg/course/cs132/linearscan.pdf" target="_blank">described</a>:<br />
<br />
<ol>
<li>First, the <b>live ranges </b>of program values are computed. Like the name indicates, these represent the range of the program code in which a value is both defined and may be used. Note that for the purpose of register allocation, the notion of a value shifts somewhat. In the expression DAG IR, a value is the result of a single computation. But for the purposes of register allocation, a value includes all its copies, as well as values computed from different conditional branches. This is necessary because when we actually start allocating registers, we need to know when a value is no longer in use (so we can reuse the register) and how long a value will remain in use.</li>
<li>Because a value may be computed from distinct conditional branches, it is necessary to compute the <b>holes</b><i> </i>in the live ranges. Holes exist because if a value is defined on both sides of a conditional branch, the range will cover both the earlier (in code order) branch and the later branch - but from the start of the later branch to its definition that value doesn't actually exist. We need this information to prevent the register allocator from trying to spill-and-load a nonexistent value, for instance.</li>
<li>Only then can we allocate and assign the actual registers to instructions. Because we might have to <b>spill</b><b style="font-style: italic;"> </b>values to memory, and because values now can have multiple definitions, this is a somewhat subtle problem. Also, we'll have to resolve all architecture specific register requirements in this step.</li>
</ol>
<div>
In the MoarVM register allocator, there's a fourth step and a fifth step. The fourth step exists to ensure that instructions conform to x86 two-operand form (Rather than return the result of an instruction in a third register, x86 reuses one of the input registers as the output register. E.g. all operators are of the form <span style="font-family: Courier New, Courier, monospace;">a = op(a, b)</span> rather than <span style="font-family: Courier New, Courier, monospace;">a = op(b, c)</span>. This saves on instruction encoding space). The fifth step inserts instructions that are introduced by the third step; this is done so that each instruction has a fixed address in the stream while the stream is being processed.</div>
<div>
<br /></div>
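<div>
To make the fourth step a little more concrete, here is a minimal sketch of two-operand conversion. The instruction representation and opcode names below are made up for illustration - this is not MoarVM's actual code - but the shape of the transformation is the same:</div>
<pre><code>
/* Minimal sketch of two-operand conversion; the Instruction type and
   opcode names are hypothetical, not MoarVM's data structures. */
enum { OP_MOV, OP_ADD /* , ... */ };

typedef struct {
    int op;        /* operator, e.g. OP_ADD or OP_MOV */
    int out;       /* output (virtual) register       */
    int in1, in2;  /* input (virtual) registers       */
} Instruction;

/* Rewrite c = op(a, b) into c = a; c = op(c, b).
   Returns the number of instructions written to 'out' (1 or 2). */
int convert_to_two_operand(Instruction ins, Instruction *out) {
    if (ins.out == ins.in1) {  /* already in two-operand form */
        out[0] = ins;
        return 1;
    }
    /* copy the first input into the output register ... */
    out[0] = (Instruction){ .op = OP_MOV, .out = ins.out, .in1 = ins.in1 };
    /* ... then overwrite it with the result: out = op(out, in2) */
    out[1] = (Instruction){ .op = ins.op, .out = ins.out,
                            .in1 = ins.out, .in2 = ins.in2 };
    return 2;
}
</code></pre>
<div>
The extra <span style="font-family: Courier New, Courier, monospace;">mov</span> produced in the second case is exactly the kind of instruction that the fifth step has to splice back into the instruction stream.</div>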
<div>
Altogether this is quite a bit of complexity and work, even for what is arguably the simplest correct global register allocation algorithm. So when I started thinking of the reverse linear scan algorithm employed by LuaJIT, the advantages became clear:</div>
<div>
<ul>
<li>In LuaJIT, the IR maintains its SSA form - there is only a single definition of a value. This means that when iterating in reverse order, computing the live range becomes trivial. When we first encounter a use of a value, then by definition that is the last use. And when we encounter a definition, that is the one and only definition, and we can release the register. So there's no need to compute the live range in advance of allocation.</li>
<li>Furthermore, rather than <i>merging</i> the values of multiple branches into the same live range, each value on either side becomes an individual live range. As a result, the live range of a value never has a hole, further simplifying code.</li>
<li>LuaJIT uses register <i>hints</i> to indicate which registers could best be picked for a specific value. This is often determined by how a value is used (e.g., the dividend of a <span style="font-family: Courier New, Courier, monospace;">div</span> instruction must be in the <span style="font-family: Courier New, Courier, monospace;">rax</span> register). If the preferred register can't be allocated, the register allocator inserts code to move it to the right place where needed. Having hints can be expected to greatly reduce the need for such code.</li>
</ul>
<div>
There are downsides as well, of course. Not knowing exactly how long a value will be live while processing it may cause the algorithm to make worse choices in which values to spill. But I don't think that's really a great concern, since figuring out the best possible value is practically impossible anyway, and the most commonly cited heuristic - evict the value that is live furthest in the future, because this will release a register over a longer range of code, reducing the chance that we'll need to evict again - is still available. (After all, we <i>do</i> always know the last use, even if we don't necessarily know the first definition).</div>
</div>
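<div>
To make the reverse-scan idea concrete, here is a rough sketch of allocation over an SSA-form trace, processed from back to front. All types and helper names are made up for illustration - this is neither LuaJIT's nor MoarVM's implementation - and it leaves out spilling, hints and architecture constraints entirely:</div>
<pre><code>
/* Illustrative sketch only: reverse linear scan over an SSA trace. */
#define NUM_REGS   16
#define MAX_VALUES 1024
#define NO_REG     (-1)

typedef struct {
    int def;                  /* value defined here, or -1   */
    int use[2];               /* values used here, or -1     */
    int reg_def, reg_use[2];  /* filled in during allocation */
} Ins;

static int value_reg[MAX_VALUES];  /* value -> currently assigned register */
static int reg_free[NUM_REGS];     /* 1 if the register is available       */

static int alloc_reg(void) {
    for (int r = 0; r != NUM_REGS; r++)
        if (reg_free[r]) { reg_free[r] = 0; return r; }
    return NO_REG;  /* a real allocator would have to spill here */
}

void allocate_reverse(Ins *code, int len) {
    for (int v = 0; v != MAX_VALUES; v++) value_reg[v] = NO_REG;
    for (int r = 0; r != NUM_REGS; r++)   reg_free[r] = 1;

    for (int i = len - 1; i >= 0; i--) {   /* walk the trace backwards */
        Ins *ins = &code[i];
        if (ins->def != -1) {
            /* the single definition: going backwards, the value is dead
               above this point, so its register becomes available again */
            ins->reg_def = value_reg[ins->def];
            if (ins->reg_def != NO_REG)
                reg_free[ins->reg_def] = 1;
        }
        for (int j = 0; j != 2; j++) {
            int v = ins->use[j];
            if (v == -1) continue;
            if (value_reg[v] == NO_REG)    /* first use we see is the last use */
                value_reg[v] = alloc_reg();
            ins->reg_use[j] = value_reg[v];
        }
    }
}
</code></pre>
<div>
Note that there is no separate live range computation at all: liveness falls out of the traversal order.</div>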
<div>
<br /></div>
<div>
Altogether, I'm quite excited about this algorithm; I think it will be a real simplification over the current implementation. Whether that will work out remains to be seen of course. I'll let you know!</div>
Bart Wiegmanshttp://www.blogger.com/profile/05760700924227974153noreply@blogger.com3tag:blogger.com,1999:blog-7864497598813874355.post-3043554110828180592019-03-17T06:23:00.002-07:002019-03-17T06:57:44.591-07:00Something about IR optimizationHi hackers! Today I want to write about optimizing IR in the MoarVM JIT, and also a little bit about IR design itself.<br />
<br />
One of the (major) design goals for the expression JIT was to have the ability to optimize code over the boundaries of individual MoarVM instructions. To enable this, the expression JIT first expands each VM instruction into a graph of lower-level operators. Optimization then means pattern-matching those graphs and replacing them with more efficient expressions.<br />
<br />
As a running example, consider the <span style="font-family: "courier new" , "courier" , monospace;">idx</span> operator. This operator takes two inputs (<span style="font-family: "courier new" , "courier" , monospace;">base</span> and <span style="font-family: "courier new" , "courier" , monospace;">element</span>) and a constant parameter <span style="font-family: "courier new" , "courier" , monospace;">scale</span> and computes <span style="font-family: "courier new" , "courier" , monospace;">base+element*scale</span>. This represents one of the operands of an 'indexed load' instruction on x86, typically used to process arrays. Such instructions allow one instruction to be used for what would otherwise be two operations (computing an address and loading a value). However, if the element of the <span style="font-family: "courier new" , "courier" , monospace;">idx</span> operator is a constant, we can replace it instead with the <span style="font-family: "courier new" , "courier" , monospace;">addr</span> instruction, which just adds a constant to a pointer. This is an improvement over <span style="font-family: "courier new" , "courier" , monospace;">idx</span> because we no longer need to load the value of element into a register. This saves both an instruction and valuable register space.<br />
<br />
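To illustrate, this is roughly what such a rewrite could look like. The node layout below is hypothetical - the expression JIT stores its IR quite differently - but the transformation itself is the one described above:<br />
<pre><code>
/* Illustrative only; not the expression JIT's actual node representation. */
typedef struct Node Node;
struct Node {
    enum { OP_CONST, OP_IDX, OP_ADDR } op;
    Node *base, *element;  /* operands (element is NULL for OP_ADDR)          */
    long  constant;        /* OP_CONST: value; OP_IDX: scale; OP_ADDR: offset */
};

/* (idx base (const c) scale)  =>  (addr base c*scale) */
Node *simplify_idx(Node *node) {
    if (node->op == OP_IDX && node->element->op == OP_CONST) {
        node->constant = node->element->constant * node->constant;
        node->op       = OP_ADDR;
        node->element  = NULL;   /* the element no longer needs a register */
    }
    return node;
}
</code></pre>
<br />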
Unfortunately this optimization introduces a bug. (Or, depending on your point of view, brings an existing bug out into the open). The expression JIT code generation process selects instructions for subtrees (tiles) of the graph in a bottom-up fashion. These instructions represent the value computed or work performed by that subgraph. (For instance, a tree like <span style="font-family: "courier new" , "courier" , monospace;">(load (addr ? 8) 8)</span> becomes <span style="font-family: "courier new" , "courier" , monospace;">mov ?, qword [?+8]</span><span style="font-family: inherit;">; the question marks are filled in during register allocation). Because an instruction always represents a <i>tree</i>, and because the graph is an arbitrary <i>directed acyclic graph</i>, the code generator projects that graph as a tree by visiting each operator node only once. So each value is computed once, and that computed value is reused by all later references.</span><br />
<span style="font-family: inherit;"><br /></span>
It is worth going into some detail about why the expression graph is not a tree. Aside from transformations that might be introduced by optimizations (e.g. common subexpression elimination), a template may introduce a value that has multiple references via the <span style="font-family: "courier new" , "courier" , monospace;">let:</span> pseudo-operator. See for instance the following (simplified) template:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">(let: (($foo (<span style="color: red;">load</span> (local))))</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> (add <span style="color: red;">$foo</span> (sub <span style="color: red;">$foo</span> (const 1))))</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgSrA73CFVJNjTtip0-9OR3cDZLU3EICP5pkQEDqQiUpmK04BSDWv4xoBMk1JXxqSGLoCUMSTe4kNaPlFgIsVlor7c4XpHEi-rajShraLgiiqoLUPMseLPgSS2v1ugG-Up6O7SsJmvR9c/s1600/tree-1.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="130" data-original-width="493" height="105" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgSrA73CFVJNjTtip0-9OR3cDZLU3EICP5pkQEDqQiUpmK04BSDWv4xoBMk1JXxqSGLoCUMSTe4kNaPlFgIsVlor7c4XpHEi-rajShraLgiiqoLUPMseLPgSS2v1ugG-Up6O7SsJmvR9c/s400/tree-1.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Both ADD and SUB refer to the same LOAD node</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
</div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<br />
In this case, both references to <span style="font-family: "courier new" , "courier" , monospace;">$foo</span> point directly to the <i>same</i> <span style="font-family: "courier new" , "courier" , monospace;">load</span> operator. Thus, the graph is not a tree. Another case in which this occurs is during linking of templates into the graph. The output of an instruction is used, if possible, directly as the input for another instruction. (This is the primary way that the expression JIT can get rid of unnecessary memory operations). But there can be multiple instructions that use a value, in which case an operator can have multiple references. Finally, instruction operands are inserted by the compiler and these can have multiple references as well.<br />
<br />
If each operator is visited only once during code generation, then this may introduce a problem when combined with another feature - conditional expressions. For instance, if two branches of a conditional expression both refer to the same value (represented by name <span style="font-family: "courier new" , "courier" , monospace;">$foo</span>) then the code generator will only emit code to compute its value when it encounters the <i>first</i> reference. When the code generator encounters <span style="font-family: "courier new" , "courier" , monospace;">$foo</span> for the second time in the other branch, <i>no code will be emitted.</i> This means that in the second branch, <span style="font-family: "courier new" , "courier" , monospace;">$foo</span> will effectively have no defined value (because the code in the first branch is never executed), and wrong values or memory corruption is then the predictable result.<br />
<br />
This bug has always existed for as long as the expression JIT has been under development, and in the past the solution has been not to write templates which have this problem. This is made a little easier by a feature of the <span style="font-family: "courier new" , "courier" , monospace;">let:</span> operator, in that it inserts a <span style="font-family: "courier new" , "courier" , monospace;">do</span> operator which ensures that the declared values are computed before the code that references them. So the following is in fact non-buggy:<br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">(let: (($foo (load (local)))) # code to compute $foo is emitted here</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> (if (...) </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> (add $foo (const 1)) # $foo is just a reference</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    (sub $foo (const 2)))) # and here as well</span><br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEil8wV6lSZdDoULUkrk6JawqzLjBL0gODGAh1rEIPN-cnjtHg1IQju4iLzmuAVPmzPIyVY2CnotdPIKQgrtVkrI9YOHSDHX7uZnCSbihzm503nkomghlk3ceVccZSze4M5tUYeqweKgF8E/s1600/tree-2.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="264" data-original-width="613" height="171" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEil8wV6lSZdDoULUkrk6JawqzLjBL0gODGAh1rEIPN-cnjtHg1IQju4iLzmuAVPmzPIyVY2CnotdPIKQgrtVkrI9YOHSDHX7uZnCSbihzm503nkomghlk3ceVccZSze4M5tUYeqweKgF8E/s400/tree-2.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The DO node is inserted for the LET operator. It ensures that the value of the LOAD node is computed before the reference in either branch</td></tr>
</tbody></table>
<br />
<br />
Alternatively, if a value <span style="font-family: "courier new" , "courier" , monospace;">$foo</span> is used in the condition of the if operator, you can also be sure that it is available in both sides of the condition.<br />
<br />
All these methods rely on the programmer being able to predict when a value will be first referenced and hence evaluated. An optimizer breaks this by design. This means that if I want the JIT optimizer to be successful, my options are:<br />
<br />
<ol>
<li>Fix the optimizer so as to not remove references that are critical for the correctness of the program</li>
<li>Modify the input tree so that such references are either copied or moved forward</li>
<li>Fix the code generator to emit code for a value, if it determines that an earlier reference is not available from the current block.</li>
</ol>
<div>
In other words, I first need to decide where this bug really belongs - in the optimizer, the code generator, or even the IR structure itself. The weakness of the expression IR is that expressions don't really impose a particular order. (This is unlike the spesh IR, which is instruction-based, and in which every instruction has a 'previous' and 'next' pointer). Thus, there really isn't a 'first' reference to a value, before the code generator introduces the concept. This property is in fact quite handy for optimization (for instance, we can evaluate operands in whatever order is best, rather than being fixed by the input order) - so I'd really like to preserve it. But it also means that the property we're interested in - a value is computed before it is used, in all possible code flow paths - isn't really expressible by the IR. And there is no obvious <i>local invariant</i> that can be maintained to ensure that this bug does not happen, so any correctness check may have to check the entire graph, which is quite impractical.</div>
<div>
<br /></div>
<div>
I hope this post explains why this is such a tricky problem! I have some ideas for how to get out of this, but I'll reserve those for a later post, since this one has gotten quite long enough. Until next time!</div>
Bart Wiegmanshttp://www.blogger.com/profile/05760700924227974153noreply@blogger.com1tag:blogger.com,1999:blog-7864497598813874355.post-84489775584334415732019-01-14T13:34:00.002-08:002019-01-14T13:34:41.218-08:00A short post about types and polymorphismHi all. I usually write somewhat long-winded posts, but today I'm going to try and make an exception. Today I want to talk about the expression template language used to map the high-level MoarVM instructions to low-level constructs that the JIT compiler can easily work with:<br />
<br />
This 'language' was designed back in 2015 subject to three constraints:<br />
<ul>
<li>It should make it easy to develop 'templates' for MoarVM instructions, so we can map the ~800 different instructions supported by the interpreter to something the JIT compiler can work with.</li>
<li>It should be simple to process and analyze; specifically, it should be suitable as input to the instruction selection process (the tiler).</li>
<li>It should be simple to implement, both from the frontend (meaning the perl program that compiles a template file to a C header) and the backend (meaning the C code that combines templates into the IR that is compiled).</li>
</ul>
Recently I've been working on adding support for floating point operations, and this means working on the type system of the expression language. Because floating point instructions operate on a distinct set of registers from integer instructions, a floating point operator is not interchangeable with an integer (or pointer) operator.<br />
<br />
This type system is enforced in two ways. First, by the template compiler, which attempts to check whether you've used all operands correctly. This operates during development, which is convenient. Second, by instruction selection, as there will simply not be any instructions available that have the wrong combinations of types. Unfortunately, that happens at runtime, and such errors are so annoying to debug that it motivated the development of the first type checker.<br />
<br />
However, this presents two problems. One of the advantages of the expression IR is that, by virtue of having a small number of operators, it is fairly easy to analyze. Having a distinct set of operators for each type would undo that. But more importantly, there are several MoarVM instructions that are <i>generic</i>, i.e. that operate on integer, floating point, and pointer values. (For example, the <span style="font-family: "courier new" , "courier" , monospace;">set</span>, <span style="font-family: "courier new" , "courier" , monospace;">getlex</span> and <span style="font-family: "courier new" , "courier" , monospace;">bindlex</span> instructions are generic in this way). This makes it impossible to know whether their values will be integers, pointers, or floats.<br />
<br />
This is no problem for the interpreter since it can treat values as bags-of-bits (i.e., it can simply copy the <span style="font-family: "courier new" , "courier" , monospace;">union MVMRegister</span> type that holds all values of all supported types). But the expression JIT works differently - it assumes that it can place values in registers, and that it can reorder and potentially skip storing them to memory. (This saves work when the value would soon be overwritten anyway). So we need to know which register class a value belongs in, and we need to have the correct <i>operators</i> to manipulate a value in the right register class.<br />
<br />
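As a rough approximation (simplified - see MoarVM's source for the real <span style="font-family: "courier new" , "courier" , monospace;">MVMRegister</span> definition), the bag-of-bits register looks something like the union below; the interpreter can copy it without any type knowledge, whereas compiled code that wants to keep the value in a machine register has to commit to a register class:<br />
<pre><code>
#include <stdint.h>

/* Simplified approximation of a bag-of-bits VM register; the real
   MVMRegister union has more (and differently named) members. */
typedef union {
    void    *o;    /* object pointer       */
    void    *s;    /* string pointer       */
    int64_t  i64;  /* integer value        */
    uint64_t u64;
    double   n64;  /* floating point value */
} Register;

/* The interpreter can implement 'set' without knowing the type ... */
void interp_set(Register *dst, const Register *src) {
    *dst = *src;   /* ... but compiled code must choose e.g. mov vs. movsd */
}
</code></pre>
<br />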
To summarize, the problem is:<br />
<ul>
<li>We need to know the type of each value, both to ensure we use the correct instructions and the right registers.</li>
<li>There are several cases in which we don't really know (for the template) what type each value has.</li>
</ul>
There are two ways around this, and I chose to use both. First, we know as a fact for each local or lexical value in a MoarVM frame (subroutine) what type it should have. So even a generic operator like <span style="font-family: "courier new" , "courier" , monospace;">set</span> can be resolved to a specific type at runtime, at which point we can select the correct operators. Second, we can introduce generic operators of our own. This is possible so long as we can select the correct instruction for an operator based on the types of the operands.<br />
<br />
For instance, the <span style="font-family: "courier new" , "courier" , monospace;">store</span> operator takes two operands, an address and a value. Depending on the type of the value (<span style="font-family: "courier new" , "courier" , monospace;">reg</span> or <span style="font-family: "courier new" , "courier" , monospace;">num</span>), we can always select the correct instruction (<span style="font-family: "courier new" , "courier" , monospace;">mov</span> or <span style="font-family: "courier new" , "courier" , monospace;">movsd</span>). It is however <i>not</i> possible to select different instructions for the<span style="font-family: inherit;"> <span style="font-family: "courier new" , "courier" , monospace;">load</span> operator based on the type required, because instruction selection works from the bottom up. So we need a special <span style="font-family: "courier new" , "courier" , monospace;">load_num</span> operator, but a <span style="font-family: "courier new" , "courier" , monospace;">store_num</span> operator is not necessary.<span style="font-family: inherit;"> </span>And this is true for a <i>lot</i> more operators than I had initially thought. For instance, aside from the (naturally generic) <span style="font-family: "courier new" , "courier" , monospace;">do</span> and <span style="font-family: "courier new" , "courier" , monospace;">if</span> operators, all arithmetic operators and comparison operators </span>are generic in this way.<br />
<br />
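To sketch that difference (the enum and function below are purely illustrative, not the tiler's actual interface): a <span style="font-family: "courier new" , "courier" , monospace;">store</span> can pick its instruction by looking at the type of its value operand, while a <span style="font-family: "courier new" , "courier" , monospace;">load</span> has no such operand to inspect.<br />
<pre><code>
/* Illustrative only: instruction selection for a generic 'store'. */
typedef enum { TYPE_REG, TYPE_NUM } ValueType;  /* integer/pointer vs. float */

const char *select_store_instruction(ValueType value_type) {
    switch (value_type) {
        case TYPE_REG: return "mov [addr], reg";    /* general purpose register */
        case TYPE_NUM: return "movsd [addr], xmm";  /* SSE register             */
    }
    return 0;
}
/* A bottom-up tiler reaches 'load' before it knows who will consume the
   loaded value, so the register class has to be part of the operator
   itself: load for reg values, load_num for num values. */
</code></pre>
<br />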
I realize that, despite my best efforts, this has become a rather long-winded post anyway.....<br />
<br />
Anyway. For the next week, I'll be taking a slight detour, and I aim to generalize the two-operand form conversion that is necessary on x86. I'll try to write a blog about it as well, and maybe it'll be short and to the point. See you later!Bart Wiegmanshttp://www.blogger.com/profile/05760700924227974153noreply@blogger.com0tag:blogger.com,1999:blog-7864497598813874355.post-3246953214834999722019-01-06T13:15:00.001-08:002019-01-08T06:50:35.945-08:00New years postHi everybody! I recently read <a href="https://6guts.wordpress.com/2019/01/02/my-perl-6-wishes-for-2019/" target="_blank">jnthn</a>s Perl 6 new years resolutions post, and I realized that this was an excellent example to emulate. So here I will attempt to share what I've been doing in 2018 and what I'll be doing in 2019.<br />
<br />
In 2018, aside from the usual refactoring, bugfixing and the like:<br />
<ul>
<li>I added support for the <a href="http://brrt-to-the-future.blogspot.com/2018/10/a-future-for-fork2.html" target="_blank">fork()</a> system call in the MoarVM backend.</li>
<li>I removed the '<a href="http://brrt-to-the-future.blogspot.com/2018/06/controlled-stack-hacking-for-moarvm-jit.html" target="_blank">invokish</a>' control flow mechanism and replaced it with controlled return address modification.</li>
<li>I requested a <a href="http://news.perlfoundation.org/2018/10/grant-proposal-moarvm-jit-comp.html" target="_blank">grant</a> from the Perl Foundation aiming to complete the expression JIT compiler with floating point support, irregular registers and the like.</li>
</ul>
So 2019 starts with me trying to complete the goals specified in that grant request. I've already partially completed one goal (as explained in the <a href="http://news.perlfoundation.org/2018/12/grant-report---moarvm-jit-comp.html" target="_blank">interim report</a>) - ensuring that register encoding works correctly for SSE registers in DynASM. Next up is actually ensuring support for SSE (and floating point) registers in the JIT, which is surprisingly tricky, because it means introducing a type system where there wasn't really one previously. I will have more to report on that in the near future.<br />
<br />
After that, the first thing on my list is the support for irregular register requirements in the register allocator, which should open up the possibility of supporting various instructions.<br />
<br />
I guess that's all for now. Speak you later!Bart Wiegmanshttp://www.blogger.com/profile/05760700924227974153noreply@blogger.com0tag:blogger.com,1999:blog-7864497598813874355.post-20538229632619816442018-10-03T22:18:00.000-07:002018-10-03T22:18:23.502-07:00A future for fork(2)Hi hackers. Today I want to write about a new functionality that I've been developing for MoarVM that has very little to do with the JIT compiler. But it is still about VM internals so I guess it will fit.<br />
<br />
Many months ago, <a href="https://6guts.wordpress.com/2018/01/26/of-sisters-stacks-and-cpan/" target="_blank">jnthn</a> wrote a blog post on the relation between perl 5 and perl 6. And as a longtime and enthusiastic perl 5 user - most of the JIT's compile time support software is written in perl 5 for a reason - I wholeheartedly agree with the 'sister language' narrative. There is plenty of room for all sorts of perls yet, I hope. Yet one thing kept itching me:<br />
<blockquote class="tr_bq">
Moreover, it’s very much the case that Perl 5 and Perl 6 make different
trade-offs. To pick one concrete example, Perl 6 makes it easy to run
code across multiple threads, and even uses multiple threads internally
(for example, performing optimization and JIT compilation on a
background thread). <b>Which is great…except the only winning move in a
game involving both threads and <code>fork()</code> is not to play.
</b>Sometimes one just can’t have their cake and eat it, and if you’re
wanting a language that more directly gives you your POSIX, that’s
probably always going to be a strength of Perl 5 over Perl 6.</blockquote>
(Emphasis mine). You see, I had never realized that MoarVM couldn't implement <span style="font-family: "courier new" , "courier" , monospace;">fork()</span>, but it's true. In POSIX systems, a <span style="font-family: "courier new" , "courier" , monospace;">fork()</span>'d child process inherits the full memory space, as-is, from its parent process. But it inherits only the forking thread. This means that any operations performed by any other thread, including operations that might need to be protected by a mutex (e.g. <span style="font-family: "courier new" , "courier" , monospace;">malloc()</span>), will be interrupted and unfinished (in the child process). This can be a problem. Or, in the words of the linux <a href="http://man7.org/linux/man-pages/man2/semop.2.html" target="_blank">manual page</a> on the subject:<br />
<br />
<pre> * The child process is created with a single thread—the one that
called <b>fork</b>(). The entire virtual address space of the parent is
replicated in the child, including the states of mutexes,
condition variables, and other pthreads objects; the use of
<a href="http://man7.org/linux/man-pages/man3/pthread_atfork.3.html">pthread_atfork(3)</a> may be helpful for dealing with problems that
this can cause.
* After a <b>fork</b>() in a multithreaded program, the child can safely
call only async-signal-safe functions (see <a href="http://man7.org/linux/man-pages/man7/signal-safety.7.html">signal-safety(7)</a>) until
such time as it calls <a href="http://man7.org/linux/man-pages/man2/execve.2.html">execve(2)</a>.</pre>
<br />
Note that the set of signal-safe functions is not that large, and excludes all memory management functions. As noted by jnthn, MoarVM is inherently multithreaded from startup, and will happily spawn as many threads as needed by the program. It also uses <span style="font-family: "courier new" , "courier" , monospace;">malloc()</span> and friends rather a lot. So it would seem that perl 6 cannot implement POSIX fork() (in which both parent and child program continue from the same position in the program) as well as <a href="http://perldoc.perl.org/functions/fork.html" target="_blank">perl 5</a> can.<br />
<br />
I was disappointed. As a longtime (and enthusiastic) perl 5 user, <span style="font-family: "courier new" , "courier" , monospace;">fork()</span> is one of my favorite concurrency tools. Its best feature is that parent and child processes are isolated by the operating system, so each can modify its own internal state without concern for concurrency. This can make it practical to introduce concurrency after development, rather than designing it in from the start. Either process can crash while the other continues. It is also possible (and common) to run a child process with reduced privileges relative to the parent process, which isn't possible with threads. And it is possible to prepare arbitrarily complex state for the child process, unlike <span style="font-family: "courier new" , "courier" , monospace;">spawn()</span> and similar calls. <br />
<br />
Fortunately all hope isn't necessarily lost. The restrictions listed above only apply if there are multiple threads active at the moment that fork() is executed. That means that if we can stop all threads (except for the one planning to fork) before actually calling fork(), then the child process can continue safely. That is well within the realm of possibility, at least as far as <i>system</i> threads are concerned.<br />
<br />
The process itself isn't very exciting to talk about, actually - it involves sending stop signals to system threads, waiting for these threads to join, verifying that the forking thread is really the only active thread, and restarting threads after fork(). Because of locking, it is a bit subtle (there may be another user thread that is <i>also</i> trying to fork), but not actually very hard. When I finally merged the code, it turned out that an ancient (and forgotten) thread list modification function was corrupting the list by not being aware of generational garbage collection... oops. But that was simple enough to fix (thanks to timotimo++ for the hint).<br />
<br />
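In outline, the sequence looks something like the sketch below. The helper names are hypothetical - MoarVM's actual functions are named differently and do considerably more bookkeeping - but the order of operations is the point:<br />
<pre><code>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helpers, standing in for MoarVM's actual machinery. */
extern void stop_system_threads(void);     /* ask the VM's worker threads to park    */
extern void join_system_threads(void);     /* wait until they have actually stopped  */
extern int  active_thread_count(void);     /* threads still running, incl. ourselves */
extern void restart_system_threads(void);  /* bring the worker threads back up       */

pid_t safe_fork(void) {
    stop_system_threads();
    join_system_threads();
    if (active_thread_count() != 1) {
        restart_system_threads();
        return (pid_t)-1;         /* a user thread is still active: refuse to fork */
    }
    pid_t pid = fork();           /* only one thread left, so this is safe         */
    restart_system_threads();     /* both parent and child need their threads back */
    return pid;
}
</code></pre>
<br />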
And now the following oneliner should work on platforms that support fork() (using development versions of MoarVM and Rakudo):<br />
<br />
<code>perl6 -e 'use nqp; my $i = nqp::fork(); say $i;'</code><br />
<br />
The main limitation of this work is that it won't work if there are any <i>user</i> threads active. (If there are any, <code>nqp::fork()</code> will throw an exception). The reason why is simple: while it is possible to adapt the system threads so that I can stop them on demand, user threads may be doing arbitrary work, may hold arbitrary locks and may be blocked (possibly indefinitely) on a system call. So jnthn's comment at the start of this post still applies - threads and fork() don't work together.<br />
<br />
In fact, many programs may be better off with threads. But I think that languages in the perl family should let the user make that decision, rather than the VM. So I hope that this will find some use somewhere. If not, it was certainly fun to figure out. Happy hacking!<br />
<br />
<br />
<b>PS</b>: For the curious, I think there may in fact be a way to make fork() work under a multithreaded program, and it relies on the fact that MoarVM has a multithreaded garbage collector. Basically, stopping all threads before calling fork() is not so different from stopping the world during the final phase of garbage collection. And so - in theory - it would be possible to hijack the synchronization mechanism of the garbage collector to pause all threads. During interpretation, and in JIT compiled code, each thread periodically checks if garbage collection has started. If it has, it will synchronize with the thread that started GC in order to share the work. Threads that are currently blocked (on a system call, or on acquiring a lock) mark themselves as such, and when they are resumed always check the GC status. So it is in fact possible to force MoarVM into a single active thread even with multiple user threads active. However, that still leaves cleanup to deal with, after returning from fork() in the child process. Also, this will not work if a thread is blocked on NativeCall. Altogether I think abusing the GC in order to enable a fork() may be a bit over the edge of insanity :-)Bart Wiegmanshttp://www.blogger.com/profile/05760700924227974153noreply@blogger.com0tag:blogger.com,1999:blog-7864497598813874355.post-71740416129570420442018-09-04T10:30:00.000-07:002018-09-05T01:32:14.997-07:00Template Compiler UpdateHi everybody. After <a href="https://cry.nu/" target="_blank">samcv++</a> released <a href="http://moarvm.org/" target="_blank">MoarVM</a> again, it was finally time to introduce some updates to the expression template compiler.<br />
<br />
As you may or may not know, the expression JIT backend maps MoarVM opcodes to its internal representation via <i>expression templates</i>. At runtime (during the JIT compilation phase), these templates are combined to form an expression tree. The expression template language remains somewhat underdocumented, but quite a few brave developers have still ventured to add templates. With the new template compiler, a few things will change:<br />
<ul>
<li>Template syntax and type checking are much stricter. This means that, for example, it is no longer possible to place a C-macro expression (which evaluates to a constant value) where an expression operand (i.e. code) would be expected.</li>
<li>Templates can contain references to MoarVM opcode operands (written as $0, <span style="font-family: "courier new" , "courier" , monospace;">$1</span>, <span style="font-family: "courier new" , "courier" , monospace;">$2</span> etc.) At runtime, read operand references are substituted with the <i>value</i> of the operand, and write operands with the <i>address</i> of the register they refer to. This was a source of confusion, so to reduce this all write operands must now be written with a '<span style="font-family: "courier new" , "courier" , monospace;">\</span>' prefix, like <span style="font-family: "courier new" , "courier" , monospace;">\$0</span>.</li>
<li>At the same time, the '!' suffix on a template name (like <span style="font-family: "courier new" , "courier" , monospace;">slice!</span>) indicates that the template code does not return the value of the output, but overwrites the register where it is stored. (It tells the expression compiler that the value in that output register must be loaded from memory the next time it is used). Many templates that do not yield an output (like <span style="font-family: "courier new" , "courier" , monospace;">unshift</span>) had this suffix, but although that is harmless at runtime, it is nevertheless confusing, so the template compiler will now raise an error if it finds this.</li>
<li>Last but not least, macro application is now <a href="https://en.wikipedia.org/wiki/Hygienic_macro" target="_blank">hygienic</a>, meaning that a name declared in a macro will never conflict with a name declared in a template. So something like this should now work, even though it is nonsense:</li>
</ul>
<pre><code>
(macro: ^foo (,bar)
(let (($obj ...))
(add $obj ,bar)))
(template: foobar
(let (($obj ...))
        (^foo $obj)))
</code></pre>
<br />
Prior to these patches, this would expand to something like:<br />
<pre><code>
(template: foobar
(let (($obj ...))
(let (($obj ...)) # name conflict is here
          (add $obj $obj)))) # which $obj?
</code></pre>
<br />
Although this code would previously fail during compilation, having to deal with such conflicts in remote code segments is annoying. But now named declarations are resolved prior to expanding the inner macro, and this means that the <code>$obj</code> for macro <code>^foo</code> is resolved to the <code>let:</code> within that macro, and the outer <code>$obj</code> in the template refers to the outer declaration, as if you'd written:<br />
<pre><code>
(template: foobar
  (let (($obj1 ...))
    (let (($obj2 ...))
      (add $obj2 $obj1))))
</code></pre>
<br />
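The usual way to achieve this sort of hygiene is to rename every declaration a macro introduces to a fresh name before splicing in its body, so it can never capture a name from the surrounding template - which is exactly what the <code>$obj1</code>/<code>$obj2</code> rewrite above shows. Purely for illustration (the real template compiler is a Perl script, and the names below are made up), a minimal 'gensym' helper in C could look like this:<br />
<pre><code>#include <stdio.h>

/* Illustrative only: append a unique suffix to a macro-declared name, so
 * "$obj" becomes "$obj_1", "$obj_2", ... and cannot collide with a "$obj"
 * declared in the surrounding template. */
static int gensym_counter = 0;

static void gensym(char *buf, size_t size, const char *base) {
    snprintf(buf, size, "%s_%d", base, ++gensym_counter);
}
</code></pre>
<br />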
I believe that this makes writing templates safer and simpler, catching more errors at compile time and fewer at runtime. Happy hacking!Bart Wiegmanshttp://www.blogger.com/profile/05760700924227974153noreply@blogger.com3tag:blogger.com,1999:blog-7864497598813874355.post-84741557464776734632018-08-25T08:57:00.001-07:002018-08-27T06:50:33.553-07:00A Curious BenchmarkHi hackers! I recently saw a curious benchmark passed round on the
<code>#moarvm</code> channel. (I understand that this originates from Ovid during
his <a href="https://www.youtube.com/watch?v=Y3TH8dJhEwE&t=23878s">Future of Perl 5 and 6</a> presentation, but I haven't seen it myself,
so if I'm wrong, don't hesitate to correct me). It's curious because
it runs on perl5 and perl6, and because the difference
is… significant:<br />
<br />
<div class="org-src-container">
<pre class="src src-perl"><span style="color: firebrick;"># </span><span style="color: firebrick;">reciprocal.pl</span>
<span style="color: #a020f0;">my</span> <span style="color: sienna;">$x</span> = 0;
$x += 1/$_ <span style="color: #a020f0;">for</span> 1..50_000_000;
<span style="color: #66cd00;">print</span> <span style="color: #8b2252;">"$x\n"</span>;
</pre>
</div>
<br />
Perl 5 (5.24.1), on my laptop (Intel i7-4700HQ CPU @ 2.40GHz, 16GB
ram), takes approximately <b>2.7s</b> to print the value
<code>18.3047492382933</code>. Perl 6, on the same script, takes just shy of
<b>2m18s</b>, or nearly 140s. This is 51 times as much. Perl 5 is not known
as a particularly fast language, but in this case perl 6 is really very
slow indeed.<br />
<br />
Let's try and do a little better. As pointed out by <a href="https://wakelift.de/">timotimo</a>, the
benchmark above is a pessimal case for rational numbers, which perl 6
uses by default. Perl 5 uses floating point throughout. So we can do
better by explicitly using floating point calculations in perl6:<br />
<br />
<div class="org-src-container">
<pre class="src src-perl6"><span style="color: firebrick;"># reciprocal.pl6</span>
<span style="color: #a020f0;">my</span> <span style="color: sienna;">$</span><span style="color: sienna;">x</span> <span style="color: darkslateblue;">=</span> <span style="color: darkcyan;">0</span><span style="color: forestgreen;">e</span><span style="color: darkcyan;">0</span><span style="color: darkslateblue;">;</span>
<span style="color: sienna;">$</span><span style="color: sienna;">x</span> <span style="color: darkslateblue;">+=</span> <span style="color: darkcyan;">1</span><span style="color: forestgreen;">e</span><span style="color: darkcyan;">0</span><span style="color: darkslateblue;">/</span><span style="color: sienna;">$</span><span style="color: sienna;">_</span><span style="color: darkslateblue;">.</span><span style="background-color: white; color: black;">Num</span> <span style="color: #a020f0;">for</span> <span style="color: darkcyan;">1</span><span style="color: darkslateblue;">.</span><span style="color: darkcyan;">.50_000_000</span><span style="color: darkslateblue;">;</span>
<span style="background-color: white; color: black;">say</span> <span style="color: sienna;">$</span><span style="color: sienna;">x</span><span style="color: darkslateblue;">;</span>
</pre>
</div>
<br />
This takes approximately 30s on my machine, approximately 5 times
faster, but still over 11 times slower than perl5. (NB: for all these
numbers I'm posting here, I didn't run exhaustive tests and calculate
the statistics, but I feel like this is reliable).<br />
<br />
We can typically avoid some overhead in Perl 6 if we avoid using
<a href="https://docs.perl6.org/language/containers">scalar containers</a> by means of binding. We can avoid the dynamic lookup
of <code>$_</code> by replacing the <code>for</code> with a <code>while</code> loop. And we can skip
the cast from <code>Int</code> to <code>Num</code> by using a <code>Num</code> iterator value. That
gives us the following code:<br />
<br />
<div class="org-src-container">
<pre class="src src-perl6"><span style="color: firebrick;"># reciprocal-while.pl6</span>
<span style="color: #a020f0;">my</span> <span style="color: sienna;">$</span><span style="color: sienna;">x</span> <span style="color: darkslateblue;">:=</span> <span style="color: darkcyan;">0</span><span style="color: forestgreen;">e</span><span style="color: darkcyan;">0</span><span style="color: darkslateblue;">;</span>
<span style="color: #a020f0;">my</span> <span style="color: sienna;">$</span><span style="color: sienna;">i</span> <span style="color: darkslateblue;">:=</span> <span style="color: darkcyan;">0</span><span style="color: forestgreen;">e</span><span style="color: darkcyan;">0</span><span style="color: darkslateblue;">;</span>
<span style="color: #a020f0;">while</span> ((<span style="color: sienna;">$</span><span style="color: sienna;">i</span> <span style="color: darkslateblue;">:=</span> <span style="color: sienna;">$</span><span style="color: sienna;">i</span> <span style="color: darkslateblue;">+</span> <span style="color: darkcyan;">1</span><span style="color: forestgreen;">e</span><span style="color: darkcyan;">0</span>) <span style="color: darkslateblue;"><</span> <span style="color: darkcyan;">5</span><span style="color: forestgreen;">e</span><span style="color: darkcyan;">7</span>) {
<span style="color: sienna;">$</span><span style="color: sienna;">x</span> <span style="color: darkslateblue;">:=</span> <span style="color: sienna;">$</span><span style="color: sienna;">x</span> <span style="color: darkslateblue;">+</span> <span style="color: darkcyan;">1</span><span style="color: forestgreen;">e</span><span style="color: darkcyan;">0</span><span style="color: darkslateblue;">/</span><span style="color: sienna;">$</span><span style="color: sienna;">i</span><span style="color: darkslateblue;">;</span>
}
<span style="background-color: white; color: black;">say</span> <span style="color: sienna;">$</span><span style="color: sienna;">x</span><span style="color: darkslateblue;">;</span>
</pre>
</div>
<br />
This reduces the run time to approximately 26.5s. So instead of well
over 11 times slower than perl 5, perl 6 is now a little less than
10 times slower.
<br />
I tried using native types, but that increased the run time to up to
36s. Native type performance (except for native typed arrays) has so
far not met expectations for perl 6. (I understand that this is due to
excessive boxing and unboxing, unfortunately). So it seems that I
failed to make a perl6 program that performs comparably to perl 5.<br />
<br />
And yet…
<br />
<br />
With all due respect to the perl 5 (and perl 6) developers, I think
MoarVM ought to be able to do better. MoarVM has support for
native-typed values and operators, the perl interpreter does not. It
has a dynamic specializer and JIT compiles to native code, the perl
interpreter does not. I should expect MoarVM to do better on this benchmark. (Otherwise, what have I been doing with my life, the last
few years?)<br />
<br />
Let's try the secret weapon: <a href="https://github.com/perl6/nqp">NQP</a>. NQP stands for Not Quite Perl and it
is the language the rakudo compiler is built in. It acts like a
'bootstrap' language and compatibility layer between the various
backends of rakudo perl 6, such as MoarVM, Java and Javascript, and in
the near future <a href="http://blogs.perl.org/users/pawel_murias/2018/07/graalvmtruffle-backend-update.html">Truffle</a>. I like to think that it relates to MoarVM in
the same way that C relates to <a href="https://queue.acm.org/detail.cfm?id=3212479">contemporary CPUs</a> - it is a sort of low
level, high level language. Although NQP fully supports perl6 classes,
regular expressions and grammars, it has no support for ranges or
C-style for loops, so it does tend to look a bit primitive:<br />
<br />
<div class="org-src-container">
<pre class="src src-perl6"><span style="color: firebrick;"># reciprocal-boxed.nqp</span>
<span style="color: #a020f0;">my</span> <span style="color: sienna;">$</span><span style="color: sienna;">x</span> <span style="color: darkslateblue;">:=</span> <span style="color: darkcyan;">0</span><span style="color: forestgreen;">e</span><span style="color: darkcyan;">0</span><span style="color: darkslateblue;">;</span>
<span style="color: #a020f0;">my</span> <span style="color: sienna;">$</span><span style="color: sienna;">i</span> <span style="color: darkslateblue;">:=</span> <span style="color: darkcyan;">1</span><span style="color: forestgreen;">e</span><span style="color: darkcyan;">0</span><span style="color: darkslateblue;">;</span>
<span style="color: #a020f0;">while</span> (<span style="color: sienna;">$</span><span style="color: sienna;">i</span> <span style="color: darkslateblue;"><</span> <span style="color: darkcyan;">5</span><span style="color: forestgreen;">e</span><span style="color: darkcyan;">7</span>) {
<span style="color: sienna;">$</span><span style="color: sienna;">x</span> <span style="color: darkslateblue;">:=</span> <span style="color: sienna;">$</span><span style="color: sienna;">x</span> <span style="color: darkslateblue;">+</span> <span style="color: darkcyan;">1</span><span style="color: forestgreen;">e</span><span style="color: darkcyan;">0</span><span style="color: darkslateblue;">/</span><span style="color: sienna;">$</span><span style="color: sienna;">i</span><span style="color: darkslateblue;">;</span>
<span style="color: sienna;">$</span><span style="color: sienna;">i</span> <span style="color: darkslateblue;">:=</span> <span style="color: sienna;">$</span><span style="color: sienna;">i</span> <span style="color: darkslateblue;">+</span> <span style="color: darkcyan;">1</span><span style="color: forestgreen;">e</span><span style="color: darkcyan;">0</span><span style="color: darkslateblue;">;</span>
}
<span style="background-color: white; color: black;">nqp::say</span>(<span style="color: darkslateblue;">"</span><span style="color: #8b2252;">$x</span><span style="color: darkslateblue;">"</span>)<span style="color: darkslateblue;">;</span>
</pre>
</div>
<br />
This version uses boxed objects and takes (on my machine)
approximately 12.5s. If you recall, perl 5 took 2.7s, so this is just
a bit less than 5 times slower than perl 5. Still not very satisfying
though.<br />
<br />
Let's improve this version a little bit and add native types here. The
only difference between this code and the code above it is that we
have explicitly opted to use <code>num</code> values for <code>$x</code> and <code>$i</code>.<br />
<br />
<div class="org-src-container">
<pre class="src src-perl6"><span style="color: firebrick;"># reciprocal-native.nqp</span>
<span style="color: #a020f0;">my</span> <span style="color: forestgreen;">num</span> <span style="color: sienna;">$</span><span style="color: sienna;">x</span> <span style="color: darkslateblue;">:=</span> <span style="color: darkcyan;">0</span><span style="color: forestgreen;">e</span><span style="color: darkcyan;">0</span><span style="color: darkslateblue;">;</span>
<span style="color: #a020f0;">my</span> <span style="color: forestgreen;">num</span> <span style="color: sienna;">$</span><span style="color: sienna;">i</span> <span style="color: darkslateblue;">:=</span> <span style="color: darkcyan;">1</span><span style="color: forestgreen;">e</span><span style="color: darkcyan;">0</span><span style="color: darkslateblue;">;</span>
<span style="color: #a020f0;">while</span> (<span style="color: sienna;">$</span><span style="color: sienna;">i</span> <span style="color: darkslateblue;"><</span> <span style="color: darkcyan;">5</span><span style="color: forestgreen;">e</span><span style="color: darkcyan;">7</span>) {
<span style="color: sienna;">$</span><span style="color: sienna;">x</span> <span style="color: darkslateblue;">:=</span> <span style="color: sienna;">$</span><span style="color: sienna;">x</span> <span style="color: darkslateblue;">+</span> <span style="color: darkcyan;">1</span><span style="color: forestgreen;">e</span><span style="color: darkcyan;">0</span><span style="color: darkslateblue;">/</span><span style="color: sienna;">$</span><span style="color: sienna;">i</span><span style="color: darkslateblue;">;</span>
<span style="color: sienna;">$</span><span style="color: sienna;">i</span> <span style="color: darkslateblue;">:=</span> <span style="color: sienna;">$</span><span style="color: sienna;">i</span> <span style="color: darkslateblue;">+</span> <span style="color: darkcyan;">1</span><span style="color: forestgreen;">e</span><span style="color: darkcyan;">0</span><span style="color: darkslateblue;">;</span>
}
<span style="background-color: white; color: black;">nqp::say</span>(<span style="color: darkslateblue;">"</span><span style="color: #8b2252;">$x</span><span style="color: darkslateblue;">"</span>)<span style="color: darkslateblue;">;</span>
</pre>
</div>
<br />
This code, with JIT enabled, consistently runs in approximately
<b>0.28s</b> on my machine. That is not a typing error. It prints the
correct result. I emphatically want you to try this at home: simply
save it as <code>reciprocal.nqp</code> and run <code>time nqp reciprocal.nqp</code>. (With
JIT disabled, it runs in 2.4s, which is (finally) a bit faster than
perl 5 in the interpreter).
<br />
<br />
Just out of curiosity, I tried comparing this result with the
following C code:
<br />
<div class="org-src-container">
<pre class="src src-c"><span style="color: darkslateblue;"> </span></pre>
<pre class="src src-c"><span style="color: darkslateblue;">#include</span> <span style="color: #8b2252;"><stdio.h></span>
<span style="color: forestgreen;">int</span> <span style="color: blue;">main</span>(<span style="color: forestgreen;">int</span> <span style="color: sienna;">argc</span>, <span style="color: forestgreen;">char</span> **<span style="color: sienna;">argv</span>) {
<span style="color: forestgreen;">double</span> <span style="color: sienna;">x</span> = 0.0;
<span style="color: forestgreen;">double</span> <span style="color: sienna;">i</span>;
<span style="color: #a020f0;">for</span> (i = 1.0; i < 5e7; i += 1.0)
x += 1.0/i;
printf(<span style="color: #8b2252;">"%f"</span>, x);
}
</pre>
</div>
<br />
On my machine, this takes approximately <code>0.22s</code> per run, which means
that NQP on MoarVM, using native types, is a little more than 1.3x
slower than compiled C, excluding the compilation time of the C
program. The NQP program does include JIT compilation.
<br />
<br />
For the record, the compiled C code is much simpler and faster than
that generated by the MoarVM JIT - 75 bytes for the C 'main' vs 1139 bytes for the NQP 'mainline' (149 bytes for the hot loop). So there is
much left to improve, for sure. <b>(EDIT: </b>I originally wrote this in a rather confusing way, but the C code was always the shorter version).<br />
<br />
So what does that tell us? Is perl6 50 times slower than perl5, or is
NQP just 30% slower than C? There is actually much more to learn from
the relative performance of these programs:
<br />
<ul class="org-ul">
<li>Regardless of the mathematical merits of rational numbers, they do
not win benchmarks. It is a good thing perl 6 can fall back to
floating point.</li>
<li>The cost of boxing and unboxing in MoarVM appears to be fairly
large. Compare the performance of the boxed floating point
calculations and the native floating point calculations in NQP - the
cost of boxing absolutely dwarfs the actual cost of doing the
computation. I think this deserves further investigation.</li>
<li>The overhead introduced by perl6's flexibility, even after
specialization, accounts for a factor of two slowdown compared to
NQP. If I'd hazard a guess, I would attribute that to (multi)
subroutine call overhead and indirection from scalar containers. I
would speculate that (deep) inlining or trace specialization would
help significantly.</li>
<li>The overhead of specialization and JIT compilation is insignificant
even in a short-running benchmark like this. This is pretty
encouraging I think.</li>
<li>This benchmark, when reduced to native operations, appears to be
dominated by the cost of floating point operations, and I expect
floating point division to be the most expensive of these. So the
value as a benchmark for the JIT is relatively limited - both C and
NQP code appear to spend most of the time waiting for the FPU to
complete.</li>
</ul>
After I wrote this (but before publishing), I reran the benchmarks
with the new branch <code>postrelease-opts</code>. The results were rather
different and encouraging (units in seconds):<br />
<br />
<table border="2" cellpadding="6" cellspacing="0" frame="hsides" rules="groups">
<colgroup>
<col class="org-left"></col>
<col class="org-right"></col>
<col class="org-right"></col>
</colgroup>
<thead>
<tr>
<th class="org-left" scope="col">test</th>
<th class="org-right" scope="col"><code>master</code></th>
<th class="org-right" scope="col"><code>postrelease-opts</code></th>
</tr>
</thead>
<tbody>
<tr>
<td class="org-left">reciprocal.pl</td>
<td class="org-right">140</td>
<td class="org-right">51</td>
</tr>
<tr>
<td class="org-left">reciprocal.pl6</td>
<td class="org-right">31</td>
<td class="org-right">6.6</td>
</tr>
<tr>
<td class="org-left">reciprocal-while.pl6</td>
<td class="org-right">26</td>
<td class="org-right">4.6</td>
</tr>
<tr>
<td class="org-left">reciprocal-boxed.nqp</td>
<td class="org-right">12</td>
<td class="org-right">3.6</td>
</tr>
<tr>
<td class="org-left">reciprocal-native.nqp</td>
<td class="org-right">0.27</td>
<td class="org-right">0.25</td>
</tr>
</tbody>
</table>
<br />
Congratulations to jnthn++ and timotimo++ for this fantastic
result. As I understand it, we expect to merge this branch after the
release of MoarVM 2018.08.<br />
<br />
As always, benchmarks should be taken with a grain of salt - results
do not always generalize cleanly to production code. But I hope this
has given some insights into the performance characteristics of perl
6, and I think it highlights some areas of interest.<br />
<br />
<b>PS.</b> The <code>postrelease-opts</code> branch has taken the place of the <code>master</code>
branch as the mainline for development, as we aim to get <code>master</code> fit
for a release. Personally, I'd rather have <code>master</code> open for
development and a <code>release</code> branch for cutting a stable release. I
expect most users get MoarVM from a specific revision via
<code>MOAR_REVISION</code> anyway, so having <code>master</code> be the 'development' branch
should cause little harm.
Bart Wiegmanshttp://www.blogger.com/profile/05760700924227974153noreply@blogger.com2tag:blogger.com,1999:blog-7864497598813874355.post-69684486951328137292018-07-05T08:32:00.000-07:002018-07-05T08:32:02.693-07:00Perl 6 on MoarVM has had a JIT for a few yearsDear readers, I recently came across the following comment on the internet (via <a href="https://p6weekly.wordpress.com/2018/07/02/2018-27-surveyed/" target="_blank">Perl 6 weekly</a>):<br />
<br />
<blockquote class="tr_bq">
<a href="https://news.ycombinator.com/item?id=17410471" target="_blank"><span class="c00">perl6 on MoarVM is starting to grow a JIT</span></a></blockquote>
<br />
Which is very interesting because:<br />
<ol>
<li>Someone had at least heard of our efforts, so that's a win.</li>
<li>But they had been left with the impression that it was still in an early phase. </li>
</ol>
Clearly, we have some PR work to do. Consider this my attempt.<br />
<br />
When people talk about a 'JIT' colloquially, especially in the context of dynamic languages, they usually refer to a system that has both a dynamic specialization component and a machine-code emitting component. MoarVM has had support for both since 2014. Historically we've called the specializer 'spesh' and the machine code emitter 'jit'. Maybe for communicating with the rest of the world it is better to call both parts 'the JIT' - the frontend and the backend, respectively.<br />
<br />
Without further ado, let me list the history of the MoarVM JIT compiler:<br />
<ol>
<li>The April 2014 release of MoarVM introduced the dynamic specialization framework <a href="https://6guts.wordpress.com/2014/04/12/optimization-concurrency-and-moar/" target="_blank">spesh</a> (the 'frontend').</li>
<li>In June 2014, I started working in earnest on the machine code emitter (the part that we call the JIT), and halfway through that month had compiled <a href="http://brrt-to-the-future.blogspot.com/2014/06/progress.html" target="_blank">the first function</a>. </li>
<li>In July 2014 we saw the introduction of <a href="https://6guts.wordpress.com/2014/06/25/what-ive-been-working-on-and-whats-coming-up/" target="_blank">inlining</a> and on-stack-replacement in spesh.</li>
<li>Also July 2014 saw the introduction of the <a href="http://brrt-to-the-future.blogspot.com/2014/07/moar-jit-progress.html" target="_blank">invokish</a> mechanism by which control flow can be returned to the interpreter.</li>
<li>In August 2014 the JIT backend <a href="http://brrt-to-the-future.blogspot.com/2014/08/" target="_blank">was completed</a> and merged into MoarVM master.</li>
<li>In June 2015, I started work on a new JIT compiler backend, first by <a href="http://brrt-to-the-future.blogspot.com/2015/06/adventerus-in-instruction-encoding.html" target="_blank">patching</a> the assembler library we use (<a href="https://corsix.github.io/dynasm-doc/tutorial.html" target="_blank">DynASM</a>), then proceeding with a new <a href="http://brrt-to-the-future.blogspot.com/2015/07/intermediate-progress-on-intermediate.html" target="_blank">intermediate representation</a> and <a href="http://brrt-to-the-future.blogspot.com/2015/07/tiles-and-compiler-compilers.html" target="_blank">instruction selection</a>.</li>
<li>After that progress slowed until in March 2017 I completed the <a href="http://brrt-to-the-future.blogspot.com/2017/02/register-allocator-update.html" target="_blank">register allocator</a>. (I guess register allocation was harder than I thought it would be). A little later, the register allocator would support <a href="http://brrt-to-the-future.blogspot.com/2017/03/" target="_blank">call argument placement</a>. </li>
<li>The new 'expression' JIT backend was merged in August 2017. Since then, many contributors have stepped up to develop expression JIT templates (that are used to generate machine code for MoarVM opcodes).</li>
<li>In August 2017, Jonathan started working on reorganizing spesh, starting with moving <a href="https://6guts.wordpress.com/2017/08/06/moarvm-specializer-improvements-part-1-gathering-data/" target="_blank">specialization to a separate thread</a>, <a href="https://6guts.wordpress.com/2017/09/17/moarvm-specializer-improvements-part-2-planning/" target="_blank">central optimization planning</a>, <a href="https://6guts.wordpress.com/2017/11/05/moarvm-specializer-improvements-part-3-optimizing-code/" target="_blank">improving optimizations</a> and installing <a href="https://6guts.wordpress.com/2017/11/09/moarvm-specializer-improvements-part-4-argument-guards/" target="_blank">argument guards</a> (which makes the process of selecting the correct specialized variant more efficient).</li>
<li>Somewhere after this - I'm not exactly sure when - nine implemented inlining the 'NativeCall' foreign function interface into JIT compiled code.</li>
<li>Most recently Jonathan has started work to write <a href="https://6guts.wordpress.com/2018/06/09/faster-dispatches-with-moarvm-specializer-plugins/" target="_blank">specializer plugins</a> - ways for rakudo to inform MoarVM on how to cache the result of method lookups, which should help increase the effectiveness of optimization for Perl 6.</li>
<li>Around the same time I reworked the way that the interpreter handles <a href="http://brrt-to-the-future.blogspot.com/2018/06/" target="_blank">control flow changes</a> for JIT compiled code (e.g. in exception handling).</li>
</ol>
We (mostly Jonathan and I) have also given presentations on the progress of the JIT compiler:<br />
<ol>
<li>At <a href="http://jnthn.net/papers/2014-yapceu-performance.pdf" target="_blank">YAPC::EU 2014</a> and <a href="http://jnthn.net/papers/2015-fosdem-static-dynamic.pdf" target="_blank">FOSDEM 2015</a>, Jonathan gave a <a href="https://www.youtube.com/watch?v=id4pDstMu1s" target="_blank">presentation</a> on the MoarVM dynamic specialization system. </li>
<li>At YAPC::EU 2015 (Granada) I also gave a presentation. Sadly I can no longer find the presentation online or offline.</li>
<li>At the Swiss Perl Workshop in 2017 Jonathan gave a presentation on <a href="http://jnthn.net/papers/2017-spw-deopt.pdf" target="_blank">deoptimization</a> and how it is used for supporting speculative optimizations. </li>
<li>At The Perl Conference in Amsterdam (2017) I gave a presentation on how to <a href="https://youtu.be/N5_drt7TEqE" target="_blank">implement</a> JIT expression templates in MoarVM.</li>
</ol>
I'm sure there is more out there that I haven't yet linked to. And there's still a lot of exciting work going on and still more work to do. However, I hope that after this post there can be no doubt about the reality of a Perl6 implementation backed by a specializing JIT compiler ;-)<br />
<br />
<br />
<b>PS</b>: You might read this and be reasonably surprised that Rakudo Perl 6 is not, after all this, very fast yet. I have a - not entirely serious - explanation for that:<br />
<ol>
<li>All problems in computer science can be solved with a layer of indirection.</li>
<li>Many layers of indirection make programs slow.</li>
<li>Perl 6 solves many computer science problems for you ;-) </li>
</ol>
In the future, we'll continue to solve those problems, just faster.<br />
<br />
<b>PPS:</b> Also, it is important to note that many of the practical speed improvements that Rakudo users have come to enjoy did not come from VM improvements per se, but from better use of the VM by core library routines, for which many volunteers are responsible.<br />
Bart Wiegmanshttp://www.blogger.com/profile/05760700924227974153noreply@blogger.com1tag:blogger.com,1999:blog-7864497598813874355.post-62858949018403536222018-06-10T09:29:00.003-07:002018-06-10T09:41:51.713-07:00Controlled Stack Hacking for the MoarVM JIT CompilerHi readers! Today I have a story about a recently-merged set of patches that allows <a href="http://www.moarvm.org/index.html" target="_blank">MoarVM</a> to use the on-stack return pointer to reduce the overhead of exception handling and other VM features for JIT compiled code. Maybe you'll find it interesting.<br />
<br />
As you might know, MoarVM is a bytecode interpreter. In some situations, MoarVM internals need to know the current position in the execution of the program. For instance in <a href="https://docs.perl6.org/language/exceptions#Catching_exceptions" target="_blank">exception handling</a>, all exceptions thrown within a block are caught by the associated <span style="font-family: "courier new" , "courier" , monospace;">CATCH</span> block or propagated if no such block exists. Such blocks are indicated as a range within the bytecode, and we find the associated <span style="font-family: "courier new" , "courier" , monospace;">CATCH</span> block by comparing the current position with the known ranges.<br />
<br />
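To give an idea of what that range lookup involves, here is a minimal C sketch; the structure and function below are invented for illustration and are not MoarVM's actual data structures:<br />
<pre><code>#include <stddef.h>

/* Illustrative only: each handler covers a range of bytecode offsets. */
typedef struct {
    unsigned start_offset;  /* first bytecode offset covered           */
    unsigned end_offset;    /* one past the last offset covered        */
    unsigned catch_offset;  /* where the associated CATCH block starts */
} HandlerRange;

/* Return the first handler whose range covers the current position,
 * or NULL if there is none and the exception must propagate upward. */
static const HandlerRange *find_handler(const HandlerRange *ranges, size_t n,
                                        unsigned current_pos) {
    for (size_t i = 0; i < n; i++) {
        if (current_pos >= ranges[i].start_offset
            && current_pos < ranges[i].end_offset)
            return &ranges[i];
    }
    return NULL;
}
</code></pre>
<br />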
This is relatively straightforward to implement for the interpreter, because the interpreter must maintain a 'current position' pointer simply to function. (MoarVM stores a pointer to this pointer in a thread context object so that it is available throughout the VM). For the JIT that is another matter, because the control flow is handled implicitly by the CPU. The instruction pointer register (called <span style="font-family: "courier new" , "courier" , monospace;">%rip </span>on amd64) cannot be read directly. Moreover, as soon as you enter a function that might want to use the current address (like the functions responsible for exception handling), you've left the 'program' code and entered VM code.<br />
<br />
So what we used to do instead is take the address of a position within the bytecode (as indicated by a label in the bytecode, a somewhat involved process) and store that in a per-frame field called the <span style="font-family: "courier new" , "courier" , monospace;">jit_entry_label<span style="font-family: inherit;"></span></span>. This field is necessary to support another MoarVM feature - we use the interpreter as a <a href="https://en.wikipedia.org/wiki/Trampoline_(computing)" target="_blank">trampoline</a> (in the first or second sense of that definition). Because the interpreter is not recursive, JIT compiled code needs to <i>return</i> to the interpreter to execute a subroutine that was invoked (as opposed to <i>calling</i> an interpreter function, as perl5 does for <a href="https://perldoc.perl.org/perlinterp.html#Exception-handing" target="_blank">exception handling</a>). The primary purpose of this label is to continue where we left off after returning from another invoked program. But it can be used just as well for finding where we are in the execution of the program.<br />
<br />
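To make the 'trampoline' idea a bit more concrete, here is a highly simplified sketch of a non-recursive dispatch loop. Everything here is invented for illustration - only the <span style="font-family: "courier new" , "courier" , monospace;">jit_entry_label</span> field corresponds to the real per-frame field discussed above, and the real types and calling conventions differ:<br />
<pre><code>/* Illustrative only: the top-level loop never recurses. Compiled code
 * returns to it whenever it finishes or needs another frame executed,
 * after updating tc->cur_frame; jit_entry_label remembers where the
 * compiled code should be resumed later. */
typedef struct Frame Frame;
typedef struct { Frame *cur_frame; } ThreadContext;
struct Frame {
    void  *jit_entry_label;                     /* resume point in compiled code */
    void (*jit_code)(ThreadContext *, void *);  /* NULL if not JIT compiled      */
};

void interp_step(ThreadContext *tc, Frame *frame);  /* ordinary bytecode dispatch */

void run(ThreadContext *tc) {
    while (tc->cur_frame) {
        Frame *frame = tc->cur_frame;
        if (frame->jit_code)
            frame->jit_code(tc, frame->jit_entry_label);
        else
            interp_step(tc, frame);
    }
}
</code></pre>
<br />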
The only problem then is that we needed to keep this field up to date, which we did. On the entry of every basic block (an uninterrupted sequence of code), we stored the current position in this field. Such blocks are quite common - every conditional statement, loop or other control flow change starts a new one - and every exception-handler scope change also needed a little snippet storing the current position. This was annoying.<br />
<br />
Furthermore, there are numerous MoarVM instructions that <i>might</i> change the control flow (or might not). For instance, the instruction responsible for converting an object to a boolean value might need to invoke the <a href="https://docs.perl6.org/type/Mu#routine_Bool" target="_blank">Bool</a> method specific to that object's class - or, if no such method exists, fall back to a default implementation. We call such instructions <i>invokish.</i> When compiling code that contains such invokish instructions, we installed 'control guards' to check if the VM had in fact invoked another routine, and if so, to return to the interpreter to execute that routine. This too added quite a bit of overhead.<br />
<br />
I keep writing in the past tense because all of that is now gone, and that happened due to a simple realization. When we call a function (in C or assembly), we place the return address (the machine instruction after the <a href="https://c9x.me/x86/html/file_module_x86_id_26.html" target="_blank">call</a> instruction) on the stack. We can read this value from the stack and use it wherever we want to know about the current position.<br />
<br />
I initially had implemented that using a <a href="https://yosefk.com/blog/getting-the-call-stack-without-a-frame-pointer.html" target="_blank">stack walker function</a> similar to the one in the link, except that I implemented it in <a href="https://github.com/MoarVM/MoarVM/blob/1780ee9ece4d9b7c19ba5ffc8718e33f12413b4b/src/jit/x64/stack_walk.s" target="_blank">assembly</a> instead. (When writing this post I learned of the GCC <a href="https://gcc.gnu.org/onlinedocs/gcc/Return-Address.html" target="_blank">__builtin_return_address</a> and MSVC <a href="https://docs.microsoft.com/en-us/cpp/intrinsics/returnaddress" target="_blank">_ReturnAddress</a> intrinsic functions, which presumably do the same thing). Unfortunately, that strategy didn't really work - it relies on the frame base pointer (<span style="font-family: "courier new" , "courier" , monospace;">%rbp</span>) being placed right 'on top' of the return address pointer on the stack. Even with special <a href="https://docs.microsoft.com/en-us/cpp/build/reference/oy-frame-pointer-omission" target="_blank">compiler</a> <a href="https://clang.llvm.org/docs/ClangCommandLineReference.html" target="_blank">flags</a> intended to preserve that behaviour, this assumption turned out to be unreliable. <br />
<br />
Fortunately I realized later that it was also unnecessary. Because the JIT compiler controls the layout of the compiled code frame, it also controls exactly where the return address will be stored when we compile a (C) function call. That means that we can simply take a pointer to this address and store that in the thread context structure. From that address, we can read exactly the current position in the compiled code, without having to explicitly store it so often. Furthermore, we can also <i>write</i> to this location, changing the address the function will return to. Effectively, this is a controlled 'on-stack goto', an idiom more often used for exploits than for good purposes - clearly this is an exception! We use this to force a return to the interpreter (with proper stack frame cleanup) for 'invokish' instructions that end up invoking. We can change control to go directly to an exception handler if it is in the same frame. This makes all the earlier control 'guard' fragments redundant, allowing us to remove them entirely. Thus, an invokish instruction that doesn't actually invoke now carries no extra cost.<br />
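In (hypothetical) C terms, the mechanism boils down to something like the following sketch; the field and function names are made up, and the real code lives in the JIT's frame setup and exception handling routines:<br />
<pre><code>/* Illustrative only: the thread context keeps a pointer to the stack slot
 * in which the return address of the current compiled frame is stored. */
typedef struct {
    void **jit_return_address_slot;
} ThreadContext;

/* Reading the slot tells us the current position in compiled code. */
static void *current_position(ThreadContext *tc) {
    return *tc->jit_return_address_slot;
}

/* Writing the slot is the controlled 'on-stack goto': the C function whose
 * return address lives in that slot will return to wherever we point it -
 * for example an exception handler in the same frame, or the exit sequence
 * that hands control back to the interpreter. */
static void set_resume_address(ThreadContext *tc, void *address) {
    *tc->jit_return_address_slot = address;
}
</code></pre>
<br />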
<br />
How much does this save? It depends a lot on the exact program, but I estimate about 5% of compiled code size, and from a <a href="https://gist.github.com/bdw/b3f265c0829f98c3087d5b93d4d064af" target="_blank">hopelessly optimal</a> (and fairly unrealistic) benchmark which I lifted from <a href="https://medium.com/square-corner-blog/rubys-new-jit-91a5c864dd10" target="_blank">this blog post</a>, approximately 10% of runtime. In real code, the effect is definitely nowhere near what <a href="https://6guts.wordpress.com/2018/06/09/faster-dispatches-with-moarvm-specializer-plugins/" target="_blank">jnthn++</a> or <a href="https://cry.nu/" target="_blank">samcv++</a> achieved lately, but it's still nice. Also nice is that the code is quite a bit simpler than it was before.<br />
<br />
Anyway, that's all I have to tell today. Have fun hacking, and until next time!Bart Wiegmanshttp://www.blogger.com/profile/05760700924227974153noreply@blogger.com0tag:blogger.com,1999:blog-7864497598813874355.post-28662679876700222892018-02-27T08:38:00.000-08:002018-02-28T00:06:21.917-08:00Some things I wantLately I've been fixing a few bugs here and there in the JIT compiler, as well as trying to advance the JIT expression optimizer. The story of those bugs is interesting but in this post I want to focus on something else, namely some support infrastructure that I think we should have that would make working on MoarVM and <span style="font-family: "courier new" , "courier" , monospace;">spesh/jit</span> in particular much nicer.<br />
<br />
There are a bunch of things related to runtime control of spesh and the JIT:<br />
<ul>
<li><span style="font-family: "courier new" , "courier" , monospace;">jit-bisect.pl</span><span style="font-family: inherit;"> is an awesome tool (if I say so myself) that deserves it's own blog post. It can figure out which frame is the first to be miscompiled and cause an execution failure. (Or to be specific, the least miscompile to cause a failure). When a JIT issue is </span>brought<span style="font-family: inherit;"> to me, I reach for it directly. It should be more powerful:</span></li>
<ul>
<li><span style="font-family: inherit;">Because it does not know in advance how many frames are compiled in a program, it uses an 'exponential probing' strategy to find the first breaking block, and then binary search between the last two probes. This means that we need to have </span><span style="font-family: "courier new" , "courier" , monospace;">log(n)</span><span style="font-family: inherit;"> tries to find the first failure during probing, and then another </span><span style="font-family: "courier new" , "courier" , monospace;">log(n)</span><span style="font-family: inherit;"> tries to find the exact frame. Usually, failures are much faster than successful runs, so the first phase takes a disproportionate amount of time. It would probably speed up bisects if we could start probing from a number higher than 1.</span></li>
<li><span style="font-family: inherit;">It would be nice if as part of (optional) output, </span><span style="font-family: "courier new" , "courier" , monospace;">jit-bisect.pl</span><span style="font-family: inherit;"> could create a 'replicator' script that would replicate the failure exactly as the bisect script found it. E.g. if it finds that frame 237 block 6 fails, then it should output a script that executes the program with the expression JIT active to exactly that point.</span></li>
<li>We assign sequence number 0 to the first frame to be JIT-compiled, but can't bisect that because we start with sequence number 1.</li>
</ul>
<li>Both spesh and JIT logging should be much more specific. Currently, when <span style="font-family: "courier new" , "courier" , monospace;">MVM_SPESH_LOG</span><span style="font-family: inherit;"> is defined, we log all information on all frames we encounter onto the log file, resulting in enormous (easily >20MB) text files. Of those text files, only tiny bits tend to be affected in a way that interests you, and especially as the result of bisect process, only the last frame will be of any interest. So I propose a </span><span style="font-family: "courier new" , "courier" , monospace;">MVM_SPESH_LOG_FOI</span><span style="font-family: inherit;"> (frame of interest) flag that toggles when a frame should be logged or not. Could be an array as well (comma- or semicolon-separated-values). Same goes for the JIT.</span></li>
<li><span style="font-family: inherit;">We can limit the number of frames that passes through </span><span style="font-family: "courier new" , "courier" , monospace;">spesh</span><span style="font-family: inherit;"> using the flag </span><span style="font-family: "courier new" , "courier" , monospace;">MVM_SPESH_LIMIT,</span><span style="font-family: inherit;"> and the number of frames (and basic blocks) that pass through the expression JIT with </span><span style="font-family: "courier new" , "courier" , monospace;">MVM_JIT_EXPR_LAST_FRAME</span><span style="font-family: inherit;"> and </span><span style="font-family: "courier new" , "courier" , monospace;">MVM_JIT_EXPR_LAST_BB</span><span style="font-family: inherit;">. We cannot do the same for the 'lego' JIT, and we probably should.</span></li>
<li><span style="font-family: inherit;">When </span><span style="font-family: "courier new" , "courier" , monospace;">MVM_SPESH_BLOCKING</span><span style="font-family: inherit;"> is not defined, spesh runs concurrently with the main program, the result of </span><span style="font-family: "courier new" , "courier" , monospace;">MVM_SPESH_LOG</span><span style="font-family: inherit;"> is not repeatable, and </span><span style="font-family: "courier new" , "courier" , monospace;">MVM_SPESH_LIMIT</span><span style="font-family: inherit;"> has as a result no well-defined meaning. Although the </span><span style="font-family: "courier new" , "courier" , monospace;">jit-bisect.pl</span><span style="font-family: inherit;"> program will set </span><span style="font-family: "courier new" , "courier" , monospace;">MVM_SPESH_LIMIT</span><span style="font-family: inherit;">, it is easy enough to forget. Maybe we should either disable MVM_SPESH_LIMIT without </span><span style="font-family: "courier new" , "courier" , monospace;">MVM_SPESH_BLOCKING</span><span style="font-family: inherit;">, or </span><span style="font-family: "courier new" , "courier" , monospace;">MVM_SPESH_LIMIT</span><span style="font-family: inherit;"> should imply </span><span style="font-family: "courier new" , "courier" , monospace;">MVM_SPESH_BLOCKING</span><span style="font-family: inherit;">.</span></li>
<li><span style="font-family: inherit;">Generally we have too much ad-hoc environment setup code that - I feel - does not fit well into the idea of having MoarVM being embedded within another program. Not that embedding interpreters in other programs is a common thing to do anymore, but still. We should probably take that out of general instance setup and install it somewhere else. (</span><span style="font-family: "courier new" , "courier" , monospace;">MVM_spesh_setup_control_flags</span><span style="font-family: inherit;">?)</span></li>
</ul>
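<div>
As promised in the list above, here is a rough sketch of that probing strategy. The <span style="font-family: "courier new" , "courier" , monospace;">run_with_limit</span> callback is an assumed stand-in for "run the program with at most <i>n</i> frames JIT-compiled and report whether it failed" - <span style="font-family: "courier new" , "courier" , monospace;">jit-bisect.pl</span> itself is a Perl script, so this C version is purely illustrative:</div>
<pre><code>/* Illustrative only: exponential probing finds an upper bound for the first
 * failing frame; binary search between the last two probes pins it down.
 * Starting 'hi' above 1 would skip some of the slow, successful probe runs
 * mentioned above. */
static int find_first_failing_frame(int (*run_with_limit)(int n)) {
    int lo = 0, hi = 1;
    while (!run_with_limit(hi)) {   /* still succeeds: double the limit */
        lo = hi;
        hi *= 2;
    }
    while (hi - lo > 1) {           /* bisect between lo (good) and hi (bad) */
        int mid = lo + (hi - lo) / 2;
        if (run_with_limit(mid))
            hi = mid;
        else
            lo = mid;
    }
    return hi;                      /* lowest limit at which the failure appears */
}
</code></pre>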
<div>
Then there's more ambitious stuff, that still falls under 'housekeeping':</div>
<div>
<ul>
<li>Currently the expression tree is backed by an array of integers, some of which identify the nodes and some of which are internal pointers (relative to the base of the array). There is an additional <span style="font-family: "courier new" , "courier" , monospace;">info</span><span style="font-family: inherit;"> array of structures that has to be explicitly maintained and that holds some information (result type and size, mostly). This is rather annoying when operating on the tree structure (as the optimizer does) because both arrays have to be kept synchronized. Also, these </span><span style="font-family: "courier new" , "courier" , monospace;">info</span><span style="font-family: inherit;"> values are only defined for nodes of the tree. We can reasonably fit all information stored in an </span><span style="font-family: "courier new" , "courier" , monospace;">info</span><span style="font-family: inherit;"> node in 32 bits (if we're packing) or 64 bits (if we're sloppy), even together with the node value. Doing this should save memory, be better for caches, and make tree manipulation easier.</span></li>
<li>I want to have explicit cleanup of the spesh worker thread (if requested by program arguments). Not having this makes <span style="font-family: "courier new" , "courier" , monospace;">ASAN</span> and <span style="font-family: "courier new" , "courier" , monospace;">valgrind</span> somewhat less useful, because they will falsely report in-progress memory as being leaked. To do this we'd need to wait, during shutdown, for the worker thread to be stopped, which means that we also need to have a signal to indicate that it has been stopped.</li>
<li>Sometimes, the JIT can get into a confused state and panic, bringing down the process. This is great for debugging and awful for users. I feel like we might be able to improve this - we can, during JIT graph construction, bail out without crashing (if we find constructs we can't JIT yet). We may be able to extend and generalize this. (If we find that we're in an illegal state entirely, that should probably still crash).</li>
<li>The legacy ('lego') JIT currently works with a two-phase process, first building a JIT graph, then compiling it. This is redundant - without loss of functionality, the lego JIT could be reduced to a single phase (reducing the number of internal structures quite dramatically). This is also something I'm not totally sure we should do, because we <i>may</i> want to have an explicit preparation phase eventually (e.g. when we start doing multiple basic blocks per expression tree).</li>
</ul>
<div>
And there's more ambitious stuff that would fall under optimizations and general functionality improvements:</div>
</div>
<div>
<ul>
<li>MoarVM in some places assumes that we have 'easy' access to the current instruction pointer, e.g. to lookup dynamic variables or the scope of the current CATCH block. This is logical for an interpreter but problematic for the JIT. We currently need to maintain a pointer to the last entered basic block to approximate it, i.e. a memory store per block. This is expensive, and it's not often needed:</li>
<ul>
<li>We already have to maintain a 'return pointer' for JIT frames that are high up the call stack, so in those cases the additional stores are redundant.</li>
<li>In case we are interested in the current position of the current frame, the current position is already on the stack (as the return address of the function call), and we can read it from stack-walking.</li>
<li>Alternatively, we can replace calls to functions that are interested in the current position to calls that receive that position explicitly. However, I kind of prefer using stack walking since we'll need the return pointer anyway. Also, it doesn't work for anything that might implicitly throw (<span style="font-family: "courier new" , "courier" , monospace;">MVM_exception_throw_adhoc</span>)</li>
</ul>
<li>The register allocator can be much better in a bunch of (simple) ways:</li>
<ul>
<li>We never need to spill a constant value to memory, since we can <i>always</i> just insert a copy instead.</li>
<li>It is probably not necessary to insert individual restores before usages in the same basic block, especially for usages prior to the spilling point.</li>
<li>We sometimes spill values that we didn't really need to spill, since we're already storing them to memory as part of the basic block.</li>
</ul>
<li>We have a kind-of hacky way to ensure that our 3-operand code (<span style="font-family: "courier new" , "courier" , monospace;">c = op(a,b)</span>) is converted to x86 2-operand code (<span style="font-family: "courier new" , "courier" , monospace;">b = op(b,a)</span>). This should be something that the register allocator can solve, or a phase <i>after</i> the register allocator, but it is currently being handled during code generation.</li>
</ul>
<div>
There is definitely more out there I want to do, but this should be enough to keep me busy for a while. And if anyone else wants to take a try at any of these, they'd be very welcome to :-).</div>
</div>
Bart Wiegmanshttp://www.blogger.com/profile/05760700924227974153noreply@blogger.com0tag:blogger.com,1999:blog-7864497598813874355.post-45308506121341092312017-05-25T12:02:00.001-07:002017-05-25T12:02:06.910-07:00Function call return valuesHi there, it's been about a month since last I wrote on the progress of the <span style="font-family: Courier New, Courier, monospace;">even-moar-jit</span> branch, so it is probably time for another update.<br />
<br />
Already two months ago I <a href="http://brrt-to-the-future.blogspot.nl/2017/03/function-call-milestone.html" target="_blank">wrote</a> about adding support for function calls in the expression JIT compiler. This was a major milestone as calling C functions is essential for almost everything that is not pure numerical computing. Now we can also use the return values of function calls (picture that!) The main issue with this was something I've come to call the 'garbage restore' problem, by which I mean that the register allocator would attempt to 'restore' an earlier, possibly undefined, version of a value over a value that would result from a function call.<br />
<br />
This has everything to do with the <i>spill strategy</i> used by the compiler. When a value has to be stored to memory (<i>spilled</i>) in order to avoid being overwritten and lost, there are a number of things that can be done. The default, safest strategy is to store a value to memory after every instruction that computes it and to load it from memory before every instruction that uses it. I'll call this a <i>full spill</i>. It is safe because it effectively makes the memory location the only 'true' storage location, with the registers being merely temporary caches. It can also be somewhat inefficient, especially if the code path that forces the spill is conditional and rarely taken. In MoarVM, this happens (for instance) around memory barriers, which are only necessary when creating <a href="http://wiki.c2.com/?GenerationalGarbageCollection" target="_blank">cross-generation</a> object references.<br />
<br />
That's why around function calls the JIT uses another strategy, which I will call a <i>point spill</i>. What I mean by that is that the (live) values which could be overwritten by the function call are spilled to memory just before the function call, and loaded back into their original registers directly after. This is <i>mostly</i> safe, since under normal control flow, the code beyond the function call point will be able to continue as if nothing had changed. (A variant which is not at all safe is to store the values to memory at the point, and load them from memory in all subsequent code, because it isn't guaranteed that the original spill-point-code is reached in practice, meaning that you overwrite a good value with garbage. The original register allocator for the new JIT suffered from this problem).<br />
<br />
It is only safe, though, if the value that is to be spilled-and-restored is both valid (defined in a code path that <i>always</i> precedes the spill) and required (the value is actually used in code paths that follow the restore). This is <i>not</i> the case, for instance, when a value is the result of a conditional function call, as in the following piece of code:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">1: my $x = $y + $z;</span><br />
<span style="font-family: Courier New, Courier, monospace;">2: if ($y < 0) {</span><br />
<span style="font-family: Courier New, Courier, monospace;">3: $x = compute-it($x, $y, $z);</span><br />
<span style="font-family: Courier New, Courier, monospace;">4: }</span><br />
<span style="font-family: Courier New, Courier, monospace;">5: say "\$x = $x";</span><br />
<br />
In this code, the value in <span style="font-family: Courier New, Courier, monospace;">$x</span> is defined first by the addition operation and then, optionally, by the function call to <span style="font-family: Courier New, Courier, monospace;">compute-it</span>. The last use of <span style="font-family: Courier New, Courier, monospace;">$x</span> is in the string interpolation on line 5. Thus, according to the compiler, <span style="font-family: Courier New, Courier, monospace;">$x</span> holds a 'live' value at the site of the function call on line 3, and so to avoid it from being overwritten, it must be spilled to memory and restored. But in fact, loading <span style="font-family: Courier New, Courier, monospace;">$x</span> from memory after <span style="font-family: Courier New, Courier, monospace;">compute-it</span> would directly overwrite the new value with the old one.<br />
<br />
The problem here appears to be that when the JIT decides to 'save' the value of <span style="font-family: Courier New, Courier, monospace;">$x</span> around the function call, it does not take into account that - in this code path - the last use of the <i>old</i> value of <span style="font-family: Courier New, Courier, monospace;">$x</span> is in fact when it is placed on the parameter list to the compute-it call. From the perspective of the conditional branch, it is only the <i>new</i> value of <span style="font-family: Courier New, Courier, monospace;">$x</span> which is used on line 5. Between the use on the parameter list and the assignment from the return value, the value of <span style="font-family: Courier New, Courier, monospace;">$x</span> is not 'live' at all. This is called a 'live range hole'. It is then the goal to find these holes and to make sure a value is not treated as live when it is in fact not.<br />
<br />
I used an algorithm from a paper by Wimmer and Franz (<a href="http://www.christianwimmer.at/Publications/Wimmer10a/Wimmer10a.pdf" target="_blank">2010</a>) to find the holes. However, this algorithm relies on having the control flow structure of the program available, which usually requires a separate analysis step. In my case that was fortunately not necessary since this control flow structure is in fact generated by an earlier step in the JIT compilation process, and all that was necessary is to record it. The algorithm itself is really simple and relies on the following ideas:<br />
<br />
<ul>
<li>If a value is <i>defined</i> as the result of a computation, that same (version) of the value cannot be used in code that precedes it.</li>
<li>If a value is <i>used</i> in a computation, it must have been defined in some code that precedes it (otherwise the program is obviously incorrect).</li>
<li>If, at a branch in the code path, a value is used in at least one of the branches (and not defined in it), it must have been defined prior to the branch instruction.</li>
</ul>
<div>
I think it goes beyond the scope of this blog post to explain how it works in full, but it is really not very complicated and works very well. At any rate, it was sufficient to prevent the JIT from overwriting good values with bad ones, and allowed me to finally enable functions that return values, which is otherwise really simple.</div>
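<div>
For the curious, here is a minimal sketch of the core idea - the backward scan over a single basic block. The data structures are invented for illustration; the actual implementation works on the JIT's expression IR:</div>
<pre><code>#include <stdint.h>

/* Illustrative only: each instruction defines at most one value and uses at
 * most two, identified by small integers; -1 means 'none'. */
typedef struct { int def; int use[2]; } Ins;

/* Walk a block backwards, starting from the set of values live at its end
 * (the union of the successors' live-in sets). A use makes a value live
 * above that point; a definition kills it - which is exactly the 'hole'
 * between a call's argument use and the assignment of its result. */
static uint64_t scan_block(const Ins *ins, int n_ins, uint64_t live_out,
                           uint64_t *live_before /* one entry per instruction */) {
    uint64_t live = live_out;
    for (int i = n_ins - 1; i >= 0; i--) {
        if (ins[i].def >= 0)
            live &= ~(1ull << ins[i].def);
        for (int j = 0; j < 2; j++)
            if (ins[i].use[j] >= 0)
                live |= 1ull << ins[i].use[j];
        live_before[i] = live;
    }
    return live;   /* this becomes the block's live-in set */
}
</code></pre>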
<div>
<br /></div>
<div>
When that was done, I obviously tried to use it and immediately ran into some bugs. To fix that, I've improved the <span style="font-family: Courier New, Courier, monospace;">jit-bisect.pl</span> script, which wasn't very robust before. The <span style="font-family: Courier New, Courier, monospace;">jit-bisect.pl</span> script uses two environment variables, <span style="font-family: Courier New, Courier, monospace;">MVM_JIT_EXPR_LAST_FRAME</span> and <span style="font-family: Courier New, Courier, monospace;">MVM_JIT_EXPR_LAST_BB</span><span style="font-family: inherit;">, to automatically find the code sequence where the expression compiler fails and compiles wrong code. (These variables tell the JIT compiler to stop running the expression compiler after a certain number of frames and basic blocks. If we know that the program fails with N blocks compiled, we can use binary search between 0 and N to find out which frame is broken). The </span><span style="font-family: Courier New, Courier, monospace;">jit-dump.pl</span><span style="font-family: inherit;"> script then provides disassembled bytecode dumps that can be compared and with that, it is usually relatively easy to find out where the JIT compiler bug is.</span></div>
<div>
<br /></div>
<div>
With that in hand I've spent my time mostly fixing existing bugs in the JIT compiler. I am now at a stage in which I feel like most of the core functionality is in place, and what is left is about creating extension points and fixing bugs. More on that, however, in my next post. See you then!</div>
<br />
Bart Wiegmanshttp://www.blogger.com/profile/05760700924227974153noreply@blogger.com0tag:blogger.com,1999:blog-7864497598813874355.post-17417365036309161802017-04-30T15:12:00.004-07:002017-05-01T04:04:24.905-07:00Letting templates do what you mean<div dir="ltr">
Hi everybody, today I'd like to promote a minor, but important improvement in the '<a href="https://github.com/MoarVM/MoarVM/blob/c7e7b8981e2769d226a3629c8fec0167c75237e5/tools/expr-template-compiler.pl" target="_blank">expression template compiler</a>' for the new JIT backend. This is a tool designed to make it easy to develop expression templates, which are themselves a way to make it easy to generate the '<a href="https://github.com/MoarVM/MoarVM/blob/c7e7b8981e2769d226a3629c8fec0167c75237e5/docs/jit/ir.md" target="_blank">expression tree</a>' intermediate representation used by the new JIT backend. This is important because MoarVM instructions operate on a perl-like level of abstraction - single instructions can perform operations such as 'convert object to string', 'find first matching character in string' or 'access the last element of an array'. Such operations require rather more instructions to represent as machine code.<br />
<br />
This level of abstraction is rather convenient for the <a href="http://rakudo.org/" target="_blank">rakudo</a> compiler, which doesn't have to consider low-level details when it processes your <a href="https://perl6.org/" target="_blank">perl6</a> code. But it is not very convenient for the JIT compiler which does. The 'expression' intermediate representation is designed to be much closer to what hardware can support directly. Basic operations include loading from and storing to memory, memory address computation, integer arithmetic, (conditional) branching, and function calls. At some point in the future, floating point operations will also be added. But because of this difference in abstraction level, a single MoarVM instruction will often map to many expression tree nodes. So what is needed is an efficient way to convert between the two representations, and that is what expression <i>templates</i> are supposed to do.<br />
<br />
Expression templates are very much like the expression tree structure itself, in that both are represented as arrays of integers. Some of the elements represent instructions, some are constants, and some are references (indexes into the same array), forming a directed acyclic graph (not a tree). The only difference is that the template is associated with a set of instructions that indicate how it should be linked into the tree. (Instruction <i>operands</i>, i.e. the data that each instruction operates on, are prepared and linked by the template application process as well).<br />
<br />
Surprisingly, arrays of integers aren't a very user-friendly way to write instruction templates, and so the template <i>compiler</i> was born. It takes as input a text file with expression templates defined as <a href="https://en.wikipedia.org/wiki/S-expression" target="_blank">symbolic expressions</a>, best known from the LISP world, and outputs a header file that contains the templates, ready for use by the JIT compiler. Note that the word 'template' has become a bit overloaded, referring to the textual input of the template compiler as well as to the binary input to the JIT compiler. That's okay, I guess, since they're really two representations of the same thing. The following table shows how template text, binary, and expression tree relate to each other:<br />
<br /></div>
<table>
<thead>
<tr>
<th>Text </th>
<th>'Binary'</th>
<th>Tree </th>
</tr>
</thead>
<tbody>
<tr><td><div>
<code></code><br />
<pre><code>(template: unless_i
  (when
    (zr $0)
    (branch (label $1))
))
</code></pre>
</div>
</td>
<td><code></code><br />
<pre><code>template: {
    MVM_JIT_ZR,
    0,
    MVM_JIT_LABEL,
    1,
    MVM_JIT_BRANCH,
    2,
    MVM_JIT_WHEN,
    0,
    4,
},
info: ".f.f.l.ll",
len: 9,
root: 6
</code></pre>
</td>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdV4YgvT1-c-mpSus2ltrqPsocafoqVN0swt8b8O7-nkU4CtRNshfH7ruQ1X_M6-kuA1d8Vjp_VH6v0PQ_q8buhI9RyODPL1_L_QwP4TcOYJQ-97t2H3VEvRwg49ChaFefIJcZfBiF-r8/s1600/unless_i.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdV4YgvT1-c-mpSus2ltrqPsocafoqVN0swt8b8O7-nkU4CtRNshfH7ruQ1X_M6-kuA1d8Vjp_VH6v0PQ_q8buhI9RyODPL1_L_QwP4TcOYJQ-97t2H3VEvRwg49ChaFefIJcZfBiF-r8/s200/unless_i.png" width="179" /></a>
</td></tr>
</tbody></table>
<div>
<br />
I hope it isn't too hard to see how one maps to the other. The <code>unless_i</code> instruction branches to the label given by its second (constant) argument if its first (integer) argument is zero. All symbols (like <code>when</code>, <code>label</code> and <code>zr</code>) have been replaced by uppercase prefixed constants (<code>MVM_JIT_WHEN</code>), and all nesting has been replaced by references (indexes) into the template array. The 'info' string specifies how the template is to be linked into the tree. Instruction operands are indicated by an 'f', and internal links by an 'l'. In the tree representation the operands have been linked into the tree by the JIT; they form the <code>LOAD</code> and <code>CONST</code> nodes and everything below them.
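<br />
<br />
As a rough sketch of what that linking amounts to - hypothetical structures and names, not the actual MoarVM code - the template application loop walks the 'info' string and either copies a literal, fills in an instruction operand, or rebases an internal link so that it points into the final tree:<br />
<pre><code>struct expr_template { int len; int root; const int *code; const char *info; };

void apply_template(int *tree, int offset, const struct expr_template *tmpl,
                    const int *operand_nodes) {
    int i;
    for (i = 0; i < tmpl->len; i++) {
        if (tmpl->info[i] == 'l')        /* internal link: rebase the index */
            tree[offset + i] = offset + tmpl->code[i];
        else if (tmpl->info[i] == 'f')   /* operand: insert the node prepared for it */
            tree[offset + i] = operand_nodes[tmpl->code[i]];
        else                             /* '.': opcode or constant, copy verbatim */
            tree[offset + i] = tmpl->code[i];
    }
}
</code></pre>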
<br />
<br />
Anyway, my improvement concerns a more complex form of template, such as the following example, an instruction to load an object value from the instance field of an object:<br />
<code></code><br />
<pre><code>(template: sp_p6oget_o
  (let: (($val (load (add (^p6obody $1) $2) ptr_sz)))
    (if (nz $val) $val (^vmnull))))
</code></pre>
<br />
This template contains a <code>let:</code> expression, which declares the <code>$val</code> variable. This value can be used in the subsequent expression by its name. Without such declarations the result of a computation could only have one reference, its immediate syntactic parent. (Or in other words, without <code>let:</code>, every template can only construct a <a href="https://en.wikipedia.org/wiki/Tree_(data_structure)">tree</a>). That is very inconvenient in case a result should be checked for null-ness, as in this case. (<code>vmnull</code> is a macro for the global '<a href="https://en.wikipedia.org/wiki/Null_Object_pattern">null object</a>' in MoarVM. The null object represents NULL wherever an object is needed, but isn't actually NULL, as that would mean it couldn't be dereferenced; it saves the interpreter from checking if a pointer to an object is NULL everywhere it is accessed).<br />
<br />
The <code>let:</code> construct has another purpose: it ensures the ordering of operations. Although most operations can be ordered in whatever way suits the compiler, some cannot, most notably function calls. (Function calls may have numerous unpredictable side effects, after all). All statements declared in the 'let declaration body' are compiled to run before any statements in the 'expression body'. This enables the programmer to ensure that a value is not needlessly computed twice, and more importantly, it ensures that a value that is <i>used</i> in multiple branches of a conditional statement is <i>defined</i> in both of them. For instance:<br />
<br />
<code></code><br />
<pre><code>(let (($foo (...)))
  (if (...)
      (load $foo)
      $foo))
</code></pre>
<div>
<br /></div>
This pseudo-snippet of template code would dereference <code>$foo</code> if some condition is met (e.g. <code>$foo</code> is not NULL) and return <code>$foo</code> directly otherwise. Without <code>let</code> to order the computation of <code>$foo</code> prior to the branches of <code>if</code>, the first reference to <code>$foo</code> would be inside the first conditional branch. That would mean that the code to compute <code>$foo</code> is only compiled in that first branch, which is not executed whenever the <code>if</code> condition is false, so <code>$foo</code> would be undefined in the alternative branch. This would mean chaos. So in fact <code>let</code> does order expressions. All is good.<br />
<br />
Except... I haven't told you <i>how</i> this ordering works, which is where my change comes in. Prior to <a href="https://github.com/MoarVM/MoarVM/commit/7fb1b1011ff51487e0935fdbbf853900f52f55bc" target="_blank">commit 7fb1b10</a> the <code>let</code> expression would insert a hint to the JIT compiler to add the declared expressions as <i>tree roots</i>. The 'tree roots' are where the compiler starts converting the expression tree (graph) to a linear sequence of byte code. Hence the declaring expressions are compiled prior to the dependent expressions. But this has, of course, one big disadvantage, which is that the set of roots is <i>global</i> for the tree. Every declaration, no matter how deep into the tree, was to be compiled prior to the head of the tree. As a result, the following template code would not at all do what you want:<br />
<br />
<code></code><br />
<pre><code>(let (($foo (...)))
  (if (nz $foo)
      (let (($bar (load $foo))) # dereference $foo !
        (... $bar))
      ...))
</code></pre>
<br />
<br />
The declaration of <code>$bar</code> would cause <code>$foo</code> to be dereferenced prior to checking whether it is non-null, causing a runtime failure. Chaos is back. Well, that's what I've changed. Fortunately, we have another ordering mechanism at our disposal, namely <code>DO</code> lists. These are nodes with a variable number of children that are also promised to be compiled in order. After the patch linked above, the compiler now transforms let expressions into the equivalent <code>DO</code> expressions. Because <code>DO</code> expressions can be nested safely, <code>$bar</code> is not computed prior to the null-check of <code>$foo</code>, as the programmer intended. I had originally intended to implement analysis to automatically order the expressions with regard to the conditionals, but I found that this was more complicated to implement and more surprising to the programmer. I think that in this case, relying on the programmer is the right thing.<br />
<br />
One thing that I found interesting is that this reduces the number of mechanisms in the compiler. The 'root-hint' was no longer useful, and subsequently <a href="https://github.com/MoarVM/MoarVM/commit/46b117fa0ffd4a11b6a8662ab651f6ab3414e5e9" target="_blank">removed</a>. At the same time, all but the last child of a <code>DO</code> list must be void expressions, i.e. yield no value, because <code>DO</code> can only return the value of its last child. Since all expressions in a let declaration must yield some value - otherwise they would be useless - they required a new operation type: <code>discard</code>. Thus with a new node type (extension of data range) we can remove a class of behavior.<br />
<br />
After I had implemented this, I've started working on adding basic block analysis. That is a subject for a later post, though. Until next time!</div>
Bart Wiegmanshttp://www.blogger.com/profile/05760700924227974153noreply@blogger.com0tag:blogger.com,1999:blog-7864497598813874355.post-83000543805215218372017-03-28T09:14:00.001-07:002017-04-11T02:58:58.729-07:00Function Call MilestoneHi everybody. It's high time for another update, and this time I have good news. The 'expression' JIT compiler can now compile native ('C') function calls (although it's not able to use the results). This is a major milestone because function calls are hard! (At least from the perspective of a compiler, and especially from the perspective of the register allocator). Also because native function calls are really very important in MoarVM. Most of its 'primitive' operations (like hash table access, string equality, big integer arithmetic) are implemented by invoking native functions, and so to compile almost any program the JIT has to compile many function calls.<br />
<br />
What makes function calls 'hard' is that they must implement the 'calling convention' of the relevant '<a href="http://wiki.osdev.org/System_V_ABI" target="_blank">application binary interface</a>' (ABI). In short, the ABI specifies the locations of function call parameters. A small number of parameters (on <a href="https://msdn.microsoft.com/en-us/library/ms235286.aspx" target="_blank">Windows</a> the first 4, on POSIX platforms the first 6) are placed in registers, and if there are more parameters they are usually placed on the stack. Aside from the calling convention, the ABI also specifies the expected alignment of the stack pointer (to 16 bytes) and the registers a function may overwrite (<i>clobber</i> in ABI-speak) and which registers must have their original values after the function returns. Registers of the last kind are called 'callee-saved'. Note that at least a few registers must be callee-saved, especially those related to call stack management, because if the called function overwrote those it would be impossible to return control back to the caller. By the way, manipulating exactly those registers is how the <a href="http://man7.org/linux/man-pages/man3/longjmp.3.html" target="_blank">setjmp and longjmp</a> 'functions' work.<br />
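For reference, this is roughly what those rules look like on x86-64. The register lists below are the standard System V (POSIX) and Windows conventions; the snippet itself is just an illustration, not MoarVM code:<br />
<pre><code>/* Integer/pointer argument registers, in order. */
static const char *sysv_arg_regs[6]  = { "rdi", "rsi", "rdx", "rcx", "r8", "r9" };
static const char *win64_arg_regs[4] = { "rcx", "rdx", "r8", "r9" };

/* Callee-saved registers under the System V ABI: a called function must
 * restore these before it returns; all other registers may be clobbered. */
static const char *sysv_callee_saved[] = { "rbx", "rbp", "r12", "r13", "r14", "r15" };

/* Further arguments (the 7th and up on POSIX, the 5th and up on Windows) are
 * passed on the stack, and the stack pointer must be 16-byte aligned at the
 * point of the call instruction. */
</code></pre>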
<br />
So the compiler is tasked with generating code that ensures the correct values are placed in the correct registers. That sounds easy enough, but what if these registers are taken by other values, and what if those other values might be required for another parameter? Indeed, what if the value in the <span style="font-family: "courier new" , "courier" , monospace;">%rdx</span> register needs to be in the <span style="font-family: "courier new" , "courier" , monospace;">%rsi</span> register, and the value of the <span style="font-family: "courier new" , "courier" , monospace;">%rsi</span> register is required in the <span style="font-family: "courier new" , "courier" , monospace;">%rdx</span> register? How to determine the correct ordering for shuffling the operands?<br />
<br />
One simple way to deal with this would be to eject all values from registers onto the stack, and then to load the values back into registers when they are necessary. However, that would be very inefficient, especially since most function calls have no more than 6 (or 4) parameters and most of these parameters are computed for the function call only. So I thought that solution wouldn't do.<br />
<br />
Another way to solve this would be if the register allocator could ensure that values are placed in their correct registers directly - especially for register parameters - i.e. by 'precoloring'. (The name comes from register allocation algorithms that work by 'graph coloring', something I will try to explain in a later post). However, that isn't an option due to my choice of 'linear scan' as the register allocation algorithm. This is a 'greedy' algorithm, meaning that it decides the allocation for a live range as soon as it encounters it, and that it cannot revert that decision once it's been made. (If it could, it would be more like a dynamic programming algorithm). So to ensure that the allocation is valid I'd have to make sure that the information about register requirements is propagated backwards from the instructions to all values that might conflict with it... and at that point we're no longer talking about linear scan, and I would be better off re-engineering a new algorithm. Not a very attractive option either!<br />
<br />
Instead, I thought about it and it occurred to me that this problem seems a lot like unravelling a dependency graph, with a number of restrictions. That is to say, it can be solved by a <a href="http://www.geeksforgeeks.org/topological-sorting/" target="_blank">topological sort</a>. I map the registers to a graph structure as follows:<br />
<br />
<ul>
<li>Each register forms a <span style="font-family: "courier new" , "courier" , monospace;">node</span></li>
<li><span style="font-family: inherit;">A transfer from a register to another register, or from a register to a stack location or a local memory location is an </span><span style="font-family: "courier new" , "courier" , monospace;">edge</span></li>
<li>Each <span style="font-family: "courier new" , "courier" , monospace;">node</span><span style="font-family: inherit;"> can have multiple <i>outbound</i> </span><span style="font-family: "courier new" , "courier" , monospace;">edge</span><span style="font-family: inherit;">s, but only one <i>inbound</i> </span><span style="font-family: "courier new" , "courier" , monospace;">edge</span><span style="font-family: inherit;"> (only one value can ever be required in that register)</span></li>
</ul>
<div>
I linked to the topological sort page for an explanation of the problem, but I think my implementation is really quite different from that presented there. They use a node visitation map and a stack; I use an edge queue and an outbound count. A register transfer (edge) can be enqueued if it is clear that the destination register is not currently used. Transfers from registers to stack locations (as function call parameters) or local memory (to save the value from being overwritten by the called function) are also enqueued directly. As soon as the outbound count of a node reaches zero, it is considered to be 'free' and the inbound edge (if any) is enqueued.</div>
<br />
<br />
Unlike a 'proper' dependency graph, cycles can and do occur, as in the example where '<span style="font-family: "courier new" , "courier" , monospace;">%rdx</span>' and '<span style="font-family: "courier new" , "courier" , monospace;">%rsi</span>' would need to swap places. Fortunately, because of the single-inbound edge rule, such cycles are 'simple' - all outbound edges not belonging to the cycle can be resolved prior to the cycle-breaking, and all remaining edges are part of the cycle. Thus, the cycle can always be broken by freeing just a single node (i.e. by copy to a temporary register).<br />
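The following condenses that idea into code - hypothetical data structures and helper names, not the MoarVM implementation, covering only register-to-register transfers and iterating rather than using a proper queue - but the core is the same: a transfer into a register may be emitted once nothing still needs to read that register, and a leftover cycle is broken by parking one value in a scratch register:<br />
<pre><code>enum { NUM_REGS = 16, FREE = -1 };
void emit_mov(int dst_reg, int src_reg);   /* assumed code-emitter helper */

/* src_of[d] = s means: register d must receive the value currently in s.
 * Destinations that need no transfer hold FREE. */
void shuffle_registers(int src_of[NUM_REGS], int scratch) {
    int outbound[NUM_REGS] = { 0 };
    int i, j, pending = 0, progress;
    for (i = 0; i < NUM_REGS; i++) {
        if (src_of[i] != FREE) {
            outbound[src_of[i]]++;          /* something still reads src_of[i] */
            pending++;
        }
    }
    while (pending) {
        progress = 0;
        for (i = 0; i < NUM_REGS; i++) {
            /* a transfer into i may run once nothing reads i's old value */
            if (src_of[i] != FREE && outbound[i] == 0) {
                emit_mov(i, src_of[i]);
                outbound[src_of[i]]--;
                src_of[i] = FREE;
                pending--;
                progress = 1;
            }
        }
        if (!progress) {
            /* all remaining transfers form simple cycles: break one by
             * parking a register's value in the scratch register */
            for (i = 0; i < NUM_REGS; i++) {
                if (src_of[i] == FREE)
                    continue;
                emit_mov(scratch, i);
                for (j = 0; j < NUM_REGS; j++) {
                    if (src_of[j] == i) {   /* readers of i now read scratch */
                        src_of[j] = scratch;
                        outbound[i]--;
                        outbound[scratch]++;
                    }
                }
                break;
            }
        }
    }
}
</code></pre>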
<br />
The only things left to consider are the values that are used after the function call returns (<i>survive</i> the function call) and that are stored in registers that the called function can overwrite (which is all of them, since the register allocator never selects callee-saved registers). So to make sure they are available afterwards, we must <i>spill</i> them. But there are a few spill strategies to choose from (terminology made up by me):<br />
<br />
<ul>
<li><b>full spill</b> replaces all references to the value in a register with references to the value in memory, by inserting load and store operations around every use and definition of the value. Basically, it splits the value's single live range into parts.</li>
<li><b>split-and-spill</b> finds the code segment where the spill is not required, splits it off from the rest, and only edits the second part. This is actually rather tricky, since doing this correctly requires data-flow analysis, which isn't otherwise required by a 'naive' implementation of linear scan.</li>
<li><b>spill-and-restore</b> will find the shortest piece of code where the spill is necessary, store the value to memory, and load it directly afterwards, as if the spill never happened.</li>
</ul>
<div>
The current register allocator does a full spill when it's run out of registers, and it would make some sense to apply the same logic for function-call related spills. I've decided to use spill-and-restore, however, because a full spill complicates the sorting order (a value that used to be in a register is suddenly only in memory) and it can be wasteful, especially if the call only happens in an alternative branch. This is common for instance when assigning values to object fields, as that may sometimes require a write barrier (to ensure the GC tracks all references from 'old' to 'new' objects). So I'm guessing that it's going to be better to pay the cost of spilling and restoring only in those alternative branches, and that's why I chose to use spill-and-restore.</div>
<div>
<br /></div>
<div>
That was it for today. Although I think being able to call functions is a major milestone, this is not the very last thing to do. We currently cannot allocate any of the registers used for floating-point calculations, which is a relatively minor limitation since those aren't used very frequently. But I also need to do some more work to actually use function return values and apply generic register requirements of tiles. But I do think the day is coming near where we can start thinking about merging the new JIT with the MoarVM master branch, making it available to everybody. Until next time!</div>
Bart Wiegmanshttp://www.blogger.com/profile/05760700924227974153noreply@blogger.com0tag:blogger.com,1999:blog-7864497598813874355.post-49750550919916441032017-02-09T08:19:00.001-08:002017-02-09T13:01:52.038-08:00Register Allocator UpdateHi everybody, I thought some of you might be interested in an update regarding the JIT register allocator, which is after all the last missing piece for the new 'expression' JIT backend. Well, the last complicated piece, at least. Because register allocation is such a broad topic, I don't expect to cover all topics relevant to design decisions here, and reserve a future post for that purpose.<br />
<br />
I think I may have mentioned earlier that I've chosen to implement <a href="https://www.cs.purdue.edu/homes/suresh/502-Fall2008/papers/linear-scan.pdf" target="_blank">linear scan register allocation</a>, an algorithm first described in 1999. Linear scan is relatively popular for JIT compilers because it achieves reasonably good allocation results while being considerably simpler and faster than the alternatives, most notably those based on <a href="http://dl.acm.org/citation.cfm?id=989403" target="_blank">graph coloring</a> (unfortunately no open access link available). Because optimal register allocation is NP-complete, all realistic algorithms are heuristic, and linear scan applies a simple heuristic to good effect. I'm afraid fully explaining the nature of that heuristic and the tradeoffs involved is beyond the scope of this post, so you'll have to remind me to do it at a later point.<br />
<br />
Commit <a href="https://github.com/MoarVM/MoarVM/commit/ab077741ec1822740fff450e7754fcd46b98bf4f" target="_blank">ab077741</a> made the new allocator the default after I had ironed out sufficient bugs to be feature-equivalent to the old allocator (which still exists, although I plan to remove it soon).<br />
Commit <a href="https://github.com/MoarVM/MoarVM/commit/0e66a23dd236af9facb5cf281b3bc65446e2f33d" target="_blank">0e66a23d</a> introduced support for 'PHI' node merging, which is really important and exciting to me, so I'll have to explain what it means. The expression JIT represents code in a form in which all values are immutable, called <a href="https://en.wikipedia.org/wiki/Static_single_assignment_form" target="_blank">static single assignment form</a>, or SSA form for short. This helps simplify compilation because there is a clear correspondence between operations and the values they compute. In general in compilers, the easier it is to assert something about code, the more interesting things you can do with it, and the better code you can compile. However, in real code, variables are often assigned more than one value. A PHI node is basically an 'escape hatch' to let you express things like:<br />
<br />
<pre><code>int x, y;
if (some_condition()) {
    x = 5;
} else {
    x = 10;
}
y = x - 3;</code></pre>
<br />
In this case, despite our best intentions, x can have two different values. In SSA form, this is resolved as follows:
<br />
<br />
<pre><code>int x1, x2, x3, y;
if (some_condition()) {
    x1 = 5;
} else {
    x2 = 10;
}
x3 = PHI(x1,x2);
y = x3 - 3;</code></pre>
<br />
The <i>meaning</i> of the PHI node is that it 'joins together' the values of <code>x1</code> and <code>x2</code> (somewhat like a <a href="https://docs.perl6.org/type/Junction" target="_blank">junction</a> in perl6), and represents the value of whichever 'version' of x was ultimately defined. <i>Resolving</i> PHI nodes means ensuring that, as far as the register allocator is concerned, <code>x1</code>, <code>x2</code>, and <code>x3</code> should preferably be allocated to the same register (or memory location), and if that's not possible, it should copy <code>x1</code> and <code>x2</code> to <code>x3 </code>for correctness. To find the set of values that are 'connected' via PHI nodes, I apply a <a href="https://en.wikipedia.org/wiki/Disjoint-set_data_structure" target="_blank">union-find</a> data structure, which is a very useful data structure in general. Much to my amazement, that code worked the first time I tried it.<br />
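For the curious, here is a minimal union-find sketch - illustrative code, not the actual MoarVM implementation - of how PHI-connected values can be grouped so that they end up with a single representative:<br />
<pre><code>/* parent[v] starts out as v for every value; values joined by a PHI node
 * end up sharing one representative. */
int uf_find(int *parent, int x) {
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];   /* path compression */
        x = parent[x];
    }
    return x;
}

void uf_union(int *parent, int a, int b) {
    parent[uf_find(parent, a)] = uf_find(parent, b);
}

/* For the example above: uf_union(parent, x1, x3); uf_union(parent, x2, x3);
 * afterwards x1, x2 and x3 all report the same representative, so the
 * register allocator can treat them as a single value. */
</code></pre>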
<br />
Then I had to fix a very interesting bug in commit <a href="https://github.com/MoarVM/MoarVM/commit/36f1fe946097dc5595327d420ea30d7853e59ee6" target="_blank">36f1fe94</a> which involves ordering between 'synthetic' and 'natural' tiles. (Tiles are the output of the tiling process about which I've <a href="https://brrt-to-the-future.blogspot.co.uk/2015/07/tiles-and-compiler-compilers.html" target="_blank">written</a> <a href="https://brrt-to-the-future.blogspot.co.uk/2015/08/tiler-update.html" target="_blank">at some length</a>; they represent individual instructions). Within the register allocator, I've chosen to identify tiles / instructions by their index in the code list, and to store tiles in a contiguous array. There are many advantages to this strategy but they are also beyond the scope of this post. One particular advantage though is that the indexes into this array make their relative order immediately apparent. This is relevant to linear scan because it relies on relative order to determine when to allocate a register and when a value is no longer necessary.<br />
<br />
However, because of using this index, it's not so easy to squeeze new tiles into that array - which is exactly what a register allocator does, when it decides to 'spill' a value to memory and load it when needed. (Because inserts are <a href="https://github.com/MoarVM/MoarVM/blob/bf3765f5805a58e446550ab4709a526cb5ef3e5f/src/jit/tile.c#L533" target="_blank">delayed and merged</a> into the array in a single step, the cost of insertion is constant). Without proper ordering, a value loaded from memory could overwrite another value that is still in use. The <a href="https://github.com/MoarVM/MoarVM/commit/36f1fe946097dc5595327d420ea30d7853e59ee6" target="_blank">fix</a> for that is, I think, surprisingly simple and elegant. In order to 'make space' for the synthetic tiles, before comparison all indexes are multiplied by a factor of 2, and synthetic tiles are further offset by -1 or +1, depending on whether they should be handled before or after the 'natural' tile they are inserted for. E.g. synthetic tiles that load a value should be processed before the tile that uses the value they load.<br />
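A tiny sketch of the resulting comparison key (the names are mine, not MoarVM's):<br />
<pre><code>/* 'Natural' tiles keep their list index, scaled by two; synthetic tiles
 * inserted for a natural tile sort just before (-1) or just after (+1) it. */
int tile_sort_key(int list_index, int synthetic_offset /* -1, 0 or +1 */) {
    return 2 * list_index + synthetic_offset;
}
/* E.g. a load inserted for the tile at index 5 gets key 9, the tile itself
 * gets key 10, and a store inserted after it gets key 11. */
</code></pre>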
<br />
Another issue soon appeared, this time having to do with x86 being, altogether, quaint and antiquated and annoying, and specifically with the use of one operand register as source and result value. To put it simply, where you and I and the expression JIT structure might say:<br />
<pre><code>
a = b + c
</code></pre>
<br />
x86 says:
<br />
<pre><code>
a = a + b
</code></pre>
<br />
Resolving the difference is tricky, especially for linear scan, since linear scan processes the <i>values</i> in the program rather than the <i>instructions</i> that generate them. It is therefore not suited to deal with instruction-level constraints such as these. If a, b, and c in my example above are not the same (not aliases), then this can be achieved by a copy:<br />
<pre><code>
a = b
a = a + c
</code></pre>
<br />
If a and b are aliases, the first copy isn't necessary. However, if a and c are aliases, then a copy may or may not be necessary, depending on whether the operation (in this case '+') is <a href="https://en.wikipedia.org/wiki/Commutative_property" target="_blank">commutative</a>, i.e. it holds for '+' but not for '-'. Commit <a href="https://github.com/MoarVM/MoarVM/commit/349b3605d1a9a943ba71448aeaaf8b6da91ed284" target="_blank">349b360</a> attempts to fix that for 'direct' binary operations, but a fix for indirect operations is still work in progress. Unfortunately, it meant I had to reserve a register for temporary use to resolve this, meaning there are fewer available for the register allocator to use. Fortunately, that did simplify handling of a few irregular instructions, e.g. signed cast of 32 bit integers to 64 bit integers.<br />
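A sketch of the case analysis involved - with hypothetical emitter helpers, and SCRATCH standing for the reserved temporary register mentioned above - might look like this:<br />
<pre><code>void emit_mov(int dst, int src);                /* assumed emitter helpers */
void emit_binop(int op, int dst_src, int src2); /* dst_src = dst_src OP src2 */
enum { SCRATCH = 11 };                          /* e.g. reserve %r11 */

/* Lower the three-operand form "a = b OP c" to x86-64's two-operand form. */
void lower_binop(int op, int a, int b, int c, int commutative) {
    if (a == b) {
        emit_binop(op, a, c);         /* a = a OP c */
    } else if (a == c && commutative) {
        emit_binop(op, a, b);         /* a = a OP b, same result */
    } else if (a == c) {
        emit_mov(SCRATCH, c);         /* writing a would clobber c, so save it */
        emit_mov(a, b);
        emit_binop(op, a, SCRATCH);   /* a = b OP c */
    } else {
        emit_mov(a, b);
        emit_binop(op, a, c);         /* a = b OP c */
    }
}
</code></pre>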
<br />
So that brings us to today and my future plans. The next thing to implement will be support for function calls by the register allocator, which involves shuffling values to the right registers and correct positions on the stack, and also in spilling all values that are still required after the function call since the function may overwrite them. This requires a bit of refactoring of the logic that spills variables, since currently it is only used when there are not enough registers available. I also need to change the linear scan main loop, because it processes <i>values</i> in order of first definition, and as such, <i>instructions</i> that don't create any values are skipped, even if they need special handling like function calls. I'm thinking of solving that with a special 'interesting tiles' queue that is processed alongside the main values working queue.<br />
<br />
That was it for today. I hope to write soon with more progress.Bart Wiegmanshttp://www.blogger.com/profile/05760700924227974153noreply@blogger.com0tag:blogger.com,1999:blog-7864497598813874355.post-30691658296153382932016-11-06T02:29:00.002-08:002016-11-06T02:40:11.204-08:00A guide through register allocation: IntroductionThis is the first post in what I intend to be a series on the register allocator for the MoarVM JIT compiler. It may be a bit less polished than usual, because I also intend to write more of these posts than I have in the past few months.<br />
<br />
The main reason to write a register allocator is that it is needed by the compiler. The original 'lego' MoarVM JIT didn't need one, because it used what is called a 'memory-to-memory' model, meaning that every operation is expected to move operands from and to memory. In this it follows closely the behavior of virtually every other interpreter existing and especially that of MoarVM. However, many of these memory operations are logically redundant (for example, when storing and immediately loading an intermediate value, or loading the same value twice). Such redundancies are inherent to a memory-to-memory code model. In theory some of that can be optimized away, but in practice that involves building an unreasonably complicated state machine.<br />
<br />
The new 'expression' JIT compiler was designed with the explicit (well, explicit to me, at least) goals of enabling optimization and specialization of machine code. That meant that a register-to-register code model was preferable, as it makes all memory operations explicit, which in turn enables optimization to remove some of them. (Most redundant 'load' operations can already be eliminated, and I'm plotting a way to remove most redundant 'store' operations, too). However, that also means the compiler must ensure that values can fit into the limited register set of the CPU, and that they aren't accidentally overwritten (for example as a result of a subroutine call). The job of the register allocator is to translate virtual registers to physical registers in a given code segment. This may involve modifying the original code by inserting load, store and copy operations.<br />
<br />
Register allocation is known as a hard problem in computer science, and I think there are two reasons for that. The first reason is that finding the optimal allocation for a code segment is (probably) NP-complete. (NP-complete basically means that you have to consider all possible solutions in order to find the one you are after. A common feature of NP-complete problems is that the effect of a local choice on the global solution cannot be fully predicted). However, for what I think are excellent reasons, I can sidestep most of that complexity using the 'linear scan' register allocation algorithm. The details of that algorithm are subject of a later post.<br />
<br />
The other reason that register allocation is hard is that the output code must meet the demanding specifications of the target CPU. For instance, some instructions take input only from specific registers, and some implicitly overwrite other registers. Calling conventions can also present a significant source of complexity as values must be placed in the right registers (or on the right stack locations) where the called function may expect them. So the register allocator must somehow encode these specific demands and ensure they are not violated.<br />
<br />
Now that I've introduced register allocation, why it is needed, and what the challenges are, the next posts can begin to describe the solutions that I'm implementing.Bart Wiegmanshttp://www.blogger.com/profile/05760700924227974153noreply@blogger.com0tag:blogger.com,1999:blog-7864497598813874355.post-68583017582131135852016-06-04T10:01:00.002-07:002016-06-04T10:01:28.834-07:00In praise of incremental changeHi everybody, welcome back. I'd like to share with you the things that have been happening in the MoarVM JIT since I've last posted, which was in fact March. To be brief, the JIT has been adapted to deal with moving frames, and I've started to rewrite the register allocator towards a much better design, if I do say so myself.<br />
<br />
First, moving frames. <a href="https://6guts.wordpress.com/2016/04/30/refactoring-and-torture/" target="_blank">jnthn</a> has already written quite a bit about them. The purpose of it is to make the regular case of subroutine invocation cheaper by making special cases - like closures and continuations - a bit more expensive. I have nothing to add to that except that this was a bit of a problem for the JIT compiler. The JIT compiler liked to keep the current frame in a non-volatile register. These are CPU registers which are supposed not to be overwritten when you call a function. This is useful because it speeds up access of frequently used values. However, if the current frame <i>object</i> is moved, the frame <i>pointer</i> in this register becomes stale. Thus, it had to go, and now we load the frame pointer from the thread context object (i.e. in memory), which never moves.<br />
<br />
Unfortunately, that was not sufficient. Because MoarVM is an interpreter, control flow (like returning from a routine) is implemented by updating the pointer to the next instruction (and the next frame). JIT compiled code never deals with this instruction pointer. Hence, whenever this instruction pointer could have been updated - we call this <i>invokish</i> or <i>throwish</i> code - the JIT may need to return control to the interpreter, which can then figure out what to do next. Originally, this had been implemented by comparing the frame pointer of the JIT frame - stored in the non-volatile register - with the frame pointer as understood by the interpreter - i.e., in the thread context object. This check no longer worked, because a): we didn't have a permanent pointer to the current frame anymore, and b): the current frame pointer might change for two different reasons, namely control flow and object movement.<br />
<br />
I figured out a solution to this issue when I realized that what we really needed is a way to identify (cheaply) in the JIT whether or not we have changed control flow, i.e. whether we have entered another routine or returned out of the current one. This might be achieved by comparing immutable locations, but lacking those, another method is to simply assign increasing numbers to constructed frames. Such a <i>sequence number</i> then identifies the current position in the control flow, and whenever it is changed the JIT knows to return control to the interpreter. This caused some issues at first when I hadn't correctly updated the code in all places where the interpreter changed the current instruction, but afterwards it worked correctly. Special thanks go to lizmat who allowed me to debug this on Mac OS X, where it was broken.<br />
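In code, the check reduces to something like the following - the types and field names are purely illustrative, not MoarVM's actual ones:<br />
<pre><code>/* Illustrative only: compare the sequence number recorded when the JIT code
 * was entered against the frame the interpreter considers current. */
struct frame { unsigned int sequence_nr; /* ... */ };
struct thread_context { struct frame *cur_frame; /* ... */ };

int control_flow_changed(struct thread_context *tc, unsigned int entry_sequence_nr) {
    return tc->cur_frame->sequence_nr != entry_sequence_nr;
}
/* When this returns true, the JIT-compiled code hands control back to the
 * interpreter instead of continuing with a stale notion of the current frame. */
</code></pre>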
<br />
Afterwards, I've focused on improving the register allocator. The primary function of a register allocator is to ensure that the values used in a calculation are placed in (correct) registers prior to that calculation. This involves, among other things, assigning the correct registers (some operations only work on specific registers), spilling registers to memory in order to make room, loading spilled values from memory if necessary, and ensuring that values in volatile registers are spilled prior to a function call. This was rather difficult in the old design because, as it was inlined into the compilation step, it couldn't really look behind or ahead, which is a problem if you want to place something correctly for future use. Furthermore, it allowed only for a one-on-one correspondence between a value that was computed and its current location in a register. That is a problem whenever a value is copied to a different register, or stored in multiple memory locations.<br />
<br />
So I've been, very slowly and methodically, in very small steps, moving code and structures through the JIT compiler in order to arrive at a register allocator that can handle these things. The first thing I did was remove the register allocation step out of compilation, into its own step (<a href="https://github.com/MoarVM/MoarVM/commit/40add843636ede5c61efa2ea09a53e5ed0bdb7fb" target="_blank">commit</a> and <a href="https://github.com/MoarVM/MoarVM/commit/c9bf2fcc71d7f07eeb54abf75202f729f9cb98b1" target="_blank">another commit</a>). Then I extracted the value descriptor structure - which describes in which location a value is stored - out of the expression tree (<a href="https://github.com/MoarVM/MoarVM/commit/37211ac4dd7083e4e5e103ac8ac19b9b7878a095" target="_blank">commit</a>). I stored the tile list in a vector, in order to allow reverse and random access (<a href="https://github.com/MoarVM/MoarVM/commit/fcb2edb40ecc737c4de162f21666dd6557d8eb00" target="_blank">commit</a>). Because the register allocator works in a single pass and only requires temporary structures, I've 'internalized' it to its own file (<a href="https://github.com/MoarVM/MoarVM/commit/9a6ca8108108595442063772fa266753903d7e26" target="_blank">commit one</a> and <a href="https://github.com/MoarVM/MoarVM/commit/34da4f632463018159cfdada3e77968634eaaa04" target="_blank">commit two</a>). Finally, I replaced the per-expression value structure with value descriptor structures (<a href="https://github.com/MoarVM/MoarVM/commit/fcc270b76494260c5849430684fce95f00ba4866" target="_blank">commit</a>).<br />
<br />
This places me in a position to replace register allocator structures (such as the active stack with an expiry heap), implement loads and stores, record register requirements per tile, implement pre-coloring, and correct allocation over basic blocks. All these things were impossible, or at least very difficult, with the old design.<br />
<br />
What I think is interesting here is that in each of these commits, the logic of the program didn't substantially change, and the JIT continued to work just as well as it had before. Nevertheless, all of this is progress - I replaced a rather broken design assumption (online register allocation with a value state machine) with another (offline register allocation with value descriptors) - that allows me to implement the necessary mechanics in a straightforward manner. And that, I think, demonstrates the power of incremental changes.Bart Wiegmanshttp://www.blogger.com/profile/05760700924227974153noreply@blogger.com0tag:blogger.com,1999:blog-7864497598813874355.post-28686172579697193912016-03-13T04:20:00.000-07:002016-03-13T04:20:28.099-07:00FOSDEM and the future Hi all, I realise I haven't written one of these posts for a long time. Since November, in fact. So you could be forgiven for believing I had stopped working on the MoarVM JIT. Fortunately, that is not entirely true. I have, in fact, been very busy with a project that has nothing to do with perl6, namely <a href="http://scigrid.de/">SciGRID</a>, for which I've developed <a href="https://github.com/bdw/GridKit">GridKit</a>. GridKit is a toolkit for extracting a power network model from <a href="http://www.openstreetmap.org/">OpenStreetMap</a>, which happens to contain a large number of individual power lines and stations. Such a network can then be used to compute the flow of electric power throughout Europe, which can then be used to optimize the integration of renewable energy sources, among other things. So that was and is an exciting project and I have a lot of things to write on that yet. It is not, however, the topic of my post today.<br />
<br />
Let's talk about the expression JIT, where it stands, how it got there, and where it needs to go. Last time I wrote, I had just finished in reducing the expression JIT surface area to the point where it could <i>just</i> compile correct code. And that helped in making a major change, which I called <i>tile linearisation</i>. Tile linearisation is just one example of a major idea that I missed last summer, so it may be worthwhile to expand a bit on it.<br />
<br />
As I've explained at some length before, the expression JIT initially creates trees of low-level operations out of high-level operations, which are then matched (tiled) to machine-level operations. The low-level operations can each be expressed by a machine-level operation, but some machine-level instructions match multiple low-level operations. The efficient and optimal matching of low-level to machine-level operations is the tiling step of the compiler, and it is where most of my efforts have been.<br />
<br />
Initially, I had 'tagged' these tiles to the tree that had been created, relying on tree traversal to get the tiles to emit assembly code. This turned out to be a poor idea because it introduces <i>implicit order</i> based on the tree traversal order. This is first of all <i>finicky</i> - it forces the order of numbering tiles to be the same in the register allocator and the tile selection algorithm and again for the code emitter. In practice that means that the last two of these were implemented in a single online step. But more importantly and more troubling, it makes it more complex to determine exactly the extent of live ranges and of basic blocks.<br />
<br />
The notion of basic blocks is also one that I missed. Expression trees are typically compiled for single basic blocks at a time. The definition of a basic block is a sequence of instructions that is executed without interruption. This allows for some nice simplifications, because it means that a value placed in a register at one instruction will still be there in the next. (In contrast, if it were possible to 'jump in between' the instructions, this is not so easy to ensure). However, these basic blocks are defined at the level of MoarVM instructions. Like most high-level language interpreters, MoarVM instructions are polymorphic and can check and dispatch based on operands. In other words, a single MoarVM instruction can form multiple basic blocks. For correct register allocation, it is vital that the register allocator knows about these basic blocks. But this is obscured, to say the least, by the expression 'tree' structure, which really forms a <a href="http://en.wikipedia.org/wiki/Directed_Acyclic_Graph" target="_blank">Directed Acyclic Graph</a>, owing to the use of values by multiple consumers.<br />
<br />
The point of tile linearisation is to provide an authoritative, explicit order for tiles - and the code sequences that they represent - so that they can be clearly and obviously placed in basic blocks. This then allows the register allocator to be extended to deal with cross-basic block compilation. (In the distant future, we might even implement some form of instruction scheduling). As a side effect, it also means that the register allocation step should be moved out of the code emitter. I've asked around and got <a href="http://web.cs.ucla.edu/~palsberg/paper/PereiraPalsberg08.pdf" target="_blank">some</a> <a href="http://www.christianwimmer.at/Publications/Wimmer10a/Wimmer10a.pdf" target="_blank">nice</a> papers about that, and it seems like the implementation of one of these algorithms - I'm still biased towards linear scan - is within the range of reasonable, as soon as I have the details figured out. Part of the plan is to extract value descriptors from the tree (much like the tile state) and treat them as immutable, introducing copies as necessary (for instance for live range splitting). The current register allocator can survive as the register selector, because it has some interesting properties in that aspect.<br />
<br />
Aside from that I've implemented a few other improvements, like:<br />
<ul>
<li>Refactored the tiler table generator, so that it could be extended to include tile arguments. This considerably simplifies the implementation of tiles. An interesting possibility, I think, is to make the tiler select tile <i>candidates</i> rather than a specific tile, which might allow choosing an optimal tile based on operation arguments rather than only on tree structure. Furthermore, the tiler table generator is now cleanish perl, which should probably help with maintenance.</li>
<li>Factor tiler state out of the tree. I had initially implemented nearly all tree operations by means of 'tagging' the tree in an 'Info' structure. (Structures named 'Info' are like classes named Manager, and a sign of a code problem). However, this means that the tile information is 'dragged along' with the tree during its lifetime, which is not really necessary, because the tiler state is temporary. </li>
<li>Fixed a number of small issues, some of them centered around operand sizes, libuv versions, and build systems.</li>
<li><a href="http://video.fosdem.org/2016/h2214/amd64-assembly-programming-for-perl-programmers.mp4" target="_blank">Presented</a> on <a href="http://video.fosdem.org/2016/h2214/" target="_blank">FOSDEM</a> about how amd64 assembly language works and can be used in perl 6.</li>
<li>Implemented a JIT expression frame <a href="https://github.com/MoarVM/MoarVM/blob/even-moar-jit/tools/jit-bisect.pl" target="_blank">bisect tool</a> which allows us to pinpoint precisely where the compilation of a perl6 (or nqp) frame breaks.</li>
</ul>
From that last bit, I've learned that the way the JIT is currently dealing with annotations is subtly broken, because the following thing can and does happen:<br />
<ul>
<li>We start a basic block with a label</li>
<li>We append to that a 'dynamic control label' sequence, which updates the JIT 'reentry' label to point to the start of the basic block. This allows various operations in MoarVM which need to inspect the current program position - lexical variables in an inlined frame, exception handlers - to know where we are in the program.</li>
<li>An instruction is <i>annotated</i> with tags that indicate that it is, for example, the first instruction of an exception handler, or a deoptimisation point.</li>
<li>Because it is the first instruction of an exception handler, it must be labeled, so a label is inserted prior to the instruction. And, a dynamic control label sequence is <i>also</i> inserted prior to the instruction.</li>
<li>Because the instruction was the first one of its basic block, it acquires the same label as the basic block. However, between the two same labels, a dynamic control sequence is inserted, which means that the same labels are inserted twice, not meaning the same thing.</li>
<li>Presumably, although I haven't checked it, the last or the first label wins. But all repeated dynamic control labels are redundant. In this case, up to 4 different control labels are stacked on top of each other.</li>
</ul>
I'm not yet sure how to deal with this. Jonathan implemented a fix last year that introduced a dynamic control label at the start of each basic block. Ultimately, that reinforces this 'stacking' behavior, although it already happened. Ideally, we would not need to store the current location for each basic block just for the few operations that need it. It might instead be possible to refer to the current region in some other way, which is what happens to some extent in exception handling already.<br />
<br />
Anyway, that's all for today, and I hope next time I will be able to bring you some good news. See you!Bart Wiegmanshttp://www.blogger.com/profile/05760700924227974153noreply@blogger.com0tag:blogger.com,1999:blog-7864497598813874355.post-78902188881160341022015-11-29T07:22:00.002-08:002016-01-19T04:03:18.438-08:00Moar JIT newsHello there, I thought it high time to write to you again and update you on the world of JITs. Since last I wrote, <a href="http://morepypy.blogspot.nl/2015/10/pypy-400-released-jit-with-simd.html">PyPy 4.0</a> was released. Also in python-land, <a href="http://blog.pyston.org/2015/11/03/102/">Pyston 0.4</a> was released, and finally <a href="https://lists.gnu.org/archive/html/guile-devel/2015-11/msg00007.html">Guile 2.1.1</a> was released and Andy Wingo wrote a <a href="http://wingolog.org/archives/2015/11/03/two-paths-one-peak-a-view-from-below-on-high-performance-language-implementations">nice piece</a> about that, as is his custom. I present these links not only to give these projects the attention they deserve, but also because I think they are relevant to our own project.<br />
<br />
In chronological order, the release of PyPy 4.0 marked the first 'production' release of that project's autovectorizer, which was developed over the course of this year's Google Summer of Code. I'd like to take this opportunity to publicly congratulate the PyPy team on this achievement. So called 'vector' or SIMD operations perform a computation on multiple values in a single step and are an essential component of high-performance numerical computations. Autovectorizing refers to the compiler capability to automatically use such operations without explicit work by the programmer. This is not of great importance for the average web application, but it is very significant for scientific and deep learning applications.<br />
<br />
More recently, the Pyston project released version 0.4. <a href="https://github.com/dropbox/pyston">Pyston</a> is another attempt at an efficient implementation of Python, funded by Dropbox. Pyston is - or I should rather say, started out - based on <a href="http://llvm.org/">llvm</a>. Most of my readers know of LLVM; for those who don't, it is a project which has somewhat revolutionised compiler development in the last few years. Its strengths are high-quality cross-platform code generation and a permissive license. LLVM is also the basis for such languages as <a href="http://rust-lang.org/">rust</a> and <a href="http://julialang.org/">julia</a>. Notable weaknesses are size, speed, and complexity. To make a long story short, many people have high expectations of LLVM for code generation, and not without reason.<br />
<br />
There are a few things that called my attention in the release post linked above. The first thing is that the Pyston project introduced a 'baseline' JIT compiler that skips the LLVM compilation step, so that JIT compiled code is available faster. They claim that this provides hardly a slowdown compared to the LLVM backend. The second thing is that they have stopped working on implementing LLVM-based optimisation. The third thing is that to support more esoteric python features, Pyston now resorts to calling the Python C API directly, becoming sort of a <i>hybrid interpreter</i>. I would not be entirely surprised if the end point for Pyston would be life as a CPython extension module, although Conway's law will probably prohibit that.<br />
<br />
Pyston is not the first, nor the only current JIT implementation based on LLVM. It might be important to say here that there are many projects which <i>do</i> obtain awesome results from using LLVM; julia being a prime example. (Julia is also an excellent counterexample to the recent <strike>elitist celebration of self-declared victory by static typing enthusiasts</strike> assertion that 'static types have won', being very dynamic indeed). But Julia was designed to use the LLVM JIT, which probably means that tradeoffs have been made to assure performance; and it is also new, so it doesn't have to run as much weird legacy code; the team is probably highly competent. I don't know <i>why</i> some mileages vary so much (JavascriptCore also uses LLVM successfully, albeit as its fourth and last tier). But it seems clear that far from being a golden gun, using LLVM for dynamic language implementations is a subtle and complex affair.<br />
<br />
Anybody willing to try building an LLVM-backed JIT compiler for MoarVM, NQP, or perl6 in general, will of course receive my full (moral) support, for whatever that may be worth.<br />
<br />
The posts by Andy Wingo, about the future of the Guile Scheme interpreter, are also well worth reading. The second post is especially relevant as it discusses the future of the guile interpreter and ways to construct a high-performance implementation of a dynamic language; it generalizes well to other dynamic languages. To summarise, there are roughly two ways of implementing a high-performance high-level programming language, dynamic or not, and the approach of tracing JIT compilers is the correct one, <i>but incredibly complex and expensive </i>and - up until now - mostly suitable for big corporations with megabucks to spend.<br />
Of course, we aim to challenge this; but for perl6 in the immediate future correctness far outranks performance in priority (as it should).<br />
<br />
That all said, I also have some news on the front of the MoarVM JIT. I've recently fixed a very longstanding and complex bug that presented itself during the compilation of returns with named arguments by rakudo. This ultimately fell out as a missing inlined frame in the JIT compiler, which in turn was caused by MoarVM trying to look for a variable using the JIT-compiler 'current location', <i>while the actual frame was running under the interpreter</i>, and - this is the largest mystery - it was <i>not deoptimized</i>. I still do not know why that actually happened, but a very simple check fixed the bug.<br />
<br />
I also achieved the goal of running the NQP and rakudo test suites under the new JIT compiler, albeit in a limited way; to achieve this I had to remove the templates of many complex operations, i.e. ops that call a C function or that have internal branches. The reason is that computing the flow of values beyond calls and branches is more complex, and trying to do it inline with a bunch of other things - as the new JIT has tried to do so far - is prone to bugs. This is true especially during tree traversal, since it may not be obvious that computations relying on values may live in a different context than the computations that generate those values.<br />
<br />
In order to compile these more complex trees correctly, and understandably, I aim to disentangle the final phases of compilation, that is, the stages of instruction selection, register allocation, and bytecode generation. Next to that I want to make the tiler internals and interface much simpler and user-friendly, and solve the 'implied costs problem'. Having the NQP test suite working means I can demonstrate the effects of changes much more directly, and more importantly, <i>demonstrate whether individual changes work or not. </i>I hope to report some progress on these issues soon, hopefully before christmas.<br />
<br />
If you want to check out the progress of this work, checkout the <a href="https://github.com/MoarVM/MoarVM/tree/even-moar-jit/">even-moar-jit</a> branch of MoarVM. I try, but not always successfully, to keep it up-to-date with the rapid pace of the MoarVM master branch. The new JIT only runs if you set the environment variable <b>MVM_JIT_EXPR_ENABLE</b> to a non-empty value. If you run into problems, please don't hesitate to report on github or on the #moarvm or #perl6 channels on freenode. See you next time!Bart Wiegmanshttp://www.blogger.com/profile/05760700924227974153noreply@blogger.com0tag:blogger.com,1999:blog-7864497598813874355.post-31574121465613414282015-10-12T12:38:00.001-07:002016-01-19T04:03:27.254-08:00Rumors of JITs' demise are greatly exaggerated.Earlier this week my attention was brought to an article claiming that the <a href="http://blog.metaobject.com/2015/10/jitterdammerung.html" target="_blank">dusk was setting for JIT compilation</a>. Naturally, I disagree. I usually try to steer clear of internet arguments, but this time I think I may have something to contribute. Nota bene, this is not a perl- or perl6 related argument, so if that is strictly your interest this is probably not an interesting post for you.<br />
<br />
The main premise of the argument is that people are shifting away from JIT compilation because the technique has failed to live up to its promises. Those promises include, in various forms, high level languages running 'as fast as C', or having more optimization possibilities than ahead-of-time (AOT) compilers do. Now my perspective may be a bit unusual in that I don't actually expect momentous gains from JIT compilation <i>per se</i>. As I've described in the talk I gave at this years' YAPC::EU, by itself JIT compilation removes only the decoding and dispatch steps of interpretation, and - depending on the VM architecture - these may be a larger or smaller proportion of your running time. However, my thesis is that interpretation is not why high-level languages are slow, or rather, that interpretation is only one of the many sources of indirection that make high-level languages slow.<br />
<br />
First of all, what of the evidence that JITs are actually in decline? The author provides three recent trends as evidence, none of which I hold to be decisive. First, both Windows 10 and the newest versions of Android translate .NET and Dalvik applications respectively to native code at installation time, which is properly considered ahead of time compilation. Second, high-performance javascript applications are currently often created using tools like <a href="https://github.com/kripken/emscripten" target="_blank">emscripten</a>, which compiles to <a href="http://asmjs.org/">asm.js</a>, and this is in many ways more similar to object code than it is to a high-level language, implying that the difficult bit of compilation is already behind us. (I agree mostly with that assessment, but not with its conclusion). Finally, on iOS devices JIT compilation is unsupported (except for the JIT compiler in the webkit browser engine), allegedly because it is insecure.<br />
<br />
As to the first piece, the author suggests that the main reason is that JIT compilers are unpredictable in their output, at least relative to optimizing ahead-of-time compilers. I think that is nonsense; JIT compilation patterns tend to be quite reliably the same on different runs of the same program, a property I rely on heavily during e.g. debugging. The output code is also pretty much invariant, with an exception being the actual values of embedded pointers. So in my experience, what you see (as a developer) is also what you get (as a user), provided you're using the same VM. I humbly suggest that the author believes JITs to be unreliable because his work is being compiled by many different VMs using many different strategies. But I see that no differently than any other form of platform diversity. Maybe the author also refers to the fact that often optimization effectiveness and the resultant performance of JIT compiled applications is sensitive to minor and innocuous changes in the application source code. But this is true of any high-level language that relies primarily on optimizing compilers, for C as much as for python or javascript. The main difference between C and python is that any line of C implies far fewer levels of indirection and abstraction than a similar line of python.<br />
<br />
I think I have a much simpler explanation as to why both Google and Microsoft decided to implement ahead-of-time compilation for their client platforms. The word 'client' is key here, because I think we're mostly talking about laptops, smartphones and tablets. As it turns out, hardware designers and consumers alike have decided to spend the last few years' worth of chip manufacturing improvements on smaller, prettier form factors (and hopefully longer battery life) rather than computing power. Furthermore, what Qualcomm, Samsung etc. have given us, Photoshop has taken away. The result is that current-generation portable devices are more portable and more powerful (and cheaper) than ever but are still memory-constrained.<br />
<br />
JIT compilation inevitably comes with a significant memory cost from the compiled code itself (which is generally considerably larger than the interpreted code was), even when neglecting the memory usage of the compiler. Using various clever strategies one can improve on that a bit, and well-considered VM design is very important as always. But altogether it probably doesn't make a lot of sense to spend precious memory on JIT-compiled routines in a mobile setting. This is even more true when the JIT compiler in question, like Dalvik's, isn't really very good and the AOT compiler has a good chance of matching its output.<br />
<br />
Now to the case of asm.js. As I said, I mostly agree that a significant amount of work has already been done by an ahead-of-time compiler before the browser ever sees the code. It would be a mistake to think that the role of the JIT (or rather the whole system) can therefore be neglected. First of all, JIT-compiled code, even asm.js code, is greatly constrained in comparison to native code, which brings some obvious security benefits. Second of all, it is ultimately the JIT compiler that allows this code to run cross-platform at high performance. I think it is mistaken to suggest that this role is trivial, and so I see asm.js as a success of, rather than evidence against, JIT compilation as a technique.<br />
<br />
Next, the iOS restriction on JIT compilation. I think the idea that this would be for security reasons is only plausible if you accept that application security is significantly threatened by dynamic generation of machine code. While I'm sure that the presence of a JIT compiler makes static analysis very difficult - not to say impossible - I don't believe that this is the primary attack vector of our times. The assertion that memory must be both writable and executable for a JIT compiler to work is only superficially true, since there is no requirement that the memory must be both at the same time, and so this doesn't imply much of a threat (So-called <a href="https://en.wikipedia.org/wiki/W%5EX">W^X</a> memory is becoming a standard feature of operating systems). Vtable pointers stored in the heap, and return addresses on a downward-growing stack, now those are attack vectors of note.<br />
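<br />
To illustrate that point, here is a minimal sketch of the W^X discipline for JIT-ted code, assuming an x86-64 POSIX system with <code>mmap</code> and <code>mprotect</code>; this is a toy example rather than MoarVM's code, and error handling is omitted. The buffer is writable while it is being filled and executable only afterwards, never both at once.<br />
<br />
<pre><code>
/* W^X sketch: write phase (W, not X), then execute phase (X, not W). */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    /* mov eax, 42; ret */
    static const uint8_t code[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };

    void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memcpy(buf, code, sizeof code);              /* write phase: W, not X */
    mprotect(buf, 4096, PROT_READ | PROT_EXEC);  /* execute phase: X, not W */

    int (*fn)(void) = (int (*)(void))buf;
    printf("%d\n", fn());                        /* prints 42 */
    return 0;
}
</code></pre>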
<br />
But more importantly, <i>that is not how mobile users are being attacked.</i> It is much more interesting, not to mention significantly easier, for attackers to acquire whole contact books, private location information, credentials and private conversations via phishing and other techniques than it is to corrupt a JIT compiler and possibly, hopefully, and generally unreliably gain remote execution. Most of these attack vectors are wide-open indeed and should be prevented by actual security techniques like access control rather than by outlawing entire branches of computing technology. Indeed, an observer not sympathetic to Apple could probably relate this no-JIT-compilation rule to the Californian company's general attitude towards competing platforms, but I will not go further down that path here.<br />
<br />
Finally, I think the claim that JIT compilation can't live up to its promise can readily be disproven by a <a href="https://wingolog.org/archives/2011/06/10/v8-is-faster-than-gcc">simple</a> <a href="http://julialang.org/">google</a> <a href="http://luajit.org/performance_x86.html">search</a>. The reason is simple: the JIT compiler, which runs at runtime, has much more information at its disposal than even the best of ahead-of-time compilers. So-called profile-guided optimization helps to offset the difference, but it is not a common technique, and even then it provides only a small subset of the information available to a JIT compiler. The fact that many systems do not match this level of performance (and MoarVM's JIT compiler certainly doesn't) is of course relevant but not, in my opinion, decisive. <br />
<br />
In conclusion, I would agree with the author that there are many cases in which JIT compilation is not suitable and in which AOT compilation is. However, I think the much stronger claim that the dusk is setting on JIT compilation is unwarranted, and that JIT compilers will remain a very important component of computing systems.<br />
<br />
<b>Most Significant Bits</b><br />
<br />
This week I think I fixed irregular behavior in the x64 instruction encoding register selection of DynASM. I think it'll be a fun story to share, so I thought it'd be time to blog.<br />
<br />
The astonishingly irregular thing about x64 instruction encoding is that it is mostly very regular. Ignoring for the moment instruction prefixes and constants, an x86 instruction consists of two bytes, one for the instruction proper, and one for its two operands. Take for example the addition of two registers:<br />
<br />
<pre>+------+----------+
| add  | eax, ecx |
+------+----------+
| 0x01 | 0xc8     |
+------+----------+
</pre>
<br />
We take the instruction byte as given. It is the second byte that concerns me because it determines which operands to use and how. Like a good <a href="https://en.wikipedia.org/wiki/Complex_instruction_set_computing" target="_blank">CISC</a> architecture, the x86 supports a number of <i>addressing modes</i>, meaning that a register can be used as a <i>value</i> but also as a (part of a) <i>pointer</i>. One of the reasons C does pointer arithmetic so freely is that this reflects the nature of the CPUs which were current when C was designed. The (relevant) x86 addressing modes are shown in the following table. (There are more, but you shouldn't use them):<br />
<br />
<table>
<tbody>
<tr>
<th>Addressing Mode</th><th>Byte flag</th><th>Meaning</th>
</tr>
<tr>
<td>Direct</td><td><code>0xc0</code></td><td>Both operands are used as direct values</td>
</tr>
<tr>
<td>Indirect</td><td><code>0x00</code></td><td>One of the operands is used as a memory reference; whether that is the source or the destination operand depends on the instruction</td>
</tr>
<tr>
<td>Indirect with offset</td><td><code>0x80</code> or <code>0x40</code></td><td>One of the operands is used as a memory reference, offset by a constant which is encoded directly after the instruction (<code>0x40</code> for an 8-bit offset, <code>0x80</code> for a 32-bit offset).</td>
</tr>
<tr>
<td>Indexed</td><td><code>0x04</code></td><td>One operand consists of two registers, a <i>base</i> and an <i>index</i>; the index is multiplied by a <i>scale</i> and added to the base to provide a memory reference. Optionally also has an offset, in which case the code byte is <code>0x44</code> or <code>0x84</code>, depending on whether the offset fits in a single byte or not.</td>
</tr>
<tr>
<td>Instruction-relative</td><td><code>0x05</code></td><td>One of the operands is a reference to the current location in the code offset by a constant (and the other refers to a register as usual).</td>
</tr>
</tbody></table>
<br />
Readers who are more careful than can be reasonably expected will have noticed that in the first three addressing modes, the lowest nibble is zero, whereas it is nonzero for the last two addressing modes. This is in fact the source of irregularities in instruction encoding. To appreciate this it helps to unpack the operand byte in octal rather than hexadecimal. Octal is much closer to how x86 thinks about the world. As demonstrated in this table, the two lowest groups of 3 bits each encode the actual registers that should be used.<br />
<br />
<pre>+------+------+-----+-----+------------------+
| Byte | Mode | Op2 | Op1 | Meaning          |
+------+------+-----+-----+------------------+
| 0xc0 |  3   |  0  |  0  | Direct           |
+------+------+-----+-----+------------------+
| 0x80 |  2   |  0  |  0  | Indirect +offset |
+------+------+-----+-----+------------------+
| 0x04 |  0   |  0  |  4  | Indexed          |
+------+------+-----+-----+------------------+
</pre>
<br />
The upshot of this is that in case the operand <i>mode</i> isn't direct and the first operand register is either 4 or 5, the meaning of the operand byte is <i>completely different</i>. If that register is 4, x86 suddenly expects another operand byte (a so-called <i>SIB</i> byte) to specify which register shall be base and which shall be index; if it is 5 in indirect mode without offset, the byte means instruction-relative addressing instead.<br />
<br />
Normally this isn't much of a problem, since the registers referred to by numbers 4 and 5 are <code>rsp</code> and <code>rbp</code> respectively; meaning the <i>stack top</i> and <i>stack bottom</i> registers. Fun fact: the x86 stack grows downward, so <code>rbp > rsp</code> in basically all cases. Also fun fact: because of this, writing from an <code>rsp</code>-relative reference upwards can overwrite the return pointer stored just above <code>rbp</code>, which is the basis of most <a href="http://insecure.org/stf/smashstack.html" target="_blank">buffer overflow</a> attacks. You thought NULL was a billion-dollar mistake? Consider how the engineers who decided the stack should grow downward must feel.<br />
<br />
Anyway, considering that rbp and rsp play such a pivotal role in your program, it's actually unlikely you'll encode them by mistake. So as long as you don't do that, it's safe to ignore this complexity and just 'add' the correct bits into the operand byte. Thus, for instance, to refer to registers 7 and 1 respectively in direct mode, we generate:<br />
<br />
<pre>0300 + (07 << 3) + (01) == 0371 == 0xf9
</pre>
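<br />
For concreteness, here is a tiny helper mirroring that arithmetic; the function and parameter names are my own (after the columns of the table above), not anything DynASM defines.<br />
<br />
<pre><code>
#include <stdio.h>

/* compose an operand (ModR/M) byte; 'mode' is 0-3, 'op2' and 'op1' are
 * 3-bit register numbers, named after the columns in the table above */
static unsigned char modrm(unsigned mode, unsigned op2, unsigned op1) {
    return (unsigned char)((mode << 6) | (op2 << 3) | op1);
}

int main(void) {
    /* direct mode, registers 7 and 1: reproduces the 0xf9 above */
    printf("0x%02x\n", modrm(3, 7, 1));
    return 0;
}
</code></pre>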
<br />
However, in the land of x64, things are not so happy. I have <a href="http://brrt-to-the-future.blogspot.com/2015/06/adventerus-in-instruction-encoding.html" target="_blank">blogged earlier</a> about how the x64 architecture gives you 8 extra registers on top of the 8 legacy x86 registers, and how the difference between those registers lies only in that x64 specifies a <i>prefix byte called REX</i>. REX byte encoding is not so difficult, the trick is to find it reliably. But because the lower three bits of the 4-bit register number are placed in the operand byte, registers <code>r12</code> and <code>r13</code> look exactly like <code>rsp</code> and <code>rbp</code> to the CPU. Well, that's where the fun really starts, because it's all too easy to encode these 'accidentally'. They are after all perfectly regular registers.<br />
<br />
For those not keeping score, we have two special cases to handle. First, whenever the first operand is either rsp or r12 and we're not using direct mode, an extra <i>SIB</i> byte needs to be encoded to specify that we really mean to use <code>rsp</code>/<code>r12</code> as the base register. This is done by encoding rsp as both the base and the index, which the x86 understands because using <code>rsp</code> as an index is usually illegal. (The magic byte is thus <code>0044</code> or <code>0x24</code>). Second, whenever the first operand is <code>rbp</code> or <code>r13</code> and we're using indirect access without an offset, we need to encode indirect access <i>with</i> an offset instead, just with the offset at zero. This of course requires another byte. Somewhat byzantine, but manageable.<br />
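<br />
As a sketch of how those two cases might play out - a simplified illustration of mine, not DynASM's actual code - consider encoding an indirect memory operand <code>[base]</code> for any of the 16 general-purpose registers (the REX prefix needed for the high registers is left out):<br />
<br />
<pre><code>
#include <stdio.h>

/* emit the operand bytes for an indirect access through 'base' (0-15),
 * with 'reg' (0-15) as the other, register operand */
static int emit_indirect(unsigned char *out, unsigned reg, unsigned base) {
    unsigned char *p    = out;
    unsigned       rm   = base & 7;  /* low 3 bits go into the ModR/M byte     */
    unsigned       rlow = reg  & 7;  /* high bits would go into the REX prefix */

    if (rm == 4) {                   /* rsp or r12: force a SIB byte           */
        *p++ = (unsigned char)(0000 | (rlow << 3) | 4);  /* mode 0, op1 = 4    */
        *p++ = 0x24;                 /* SIB 0044: base = rsp/r12, no index     */
    } else if (rm == 5) {            /* rbp or r13: use offset mode, offset 0  */
        *p++ = (unsigned char)(0100 | (rlow << 3) | 5);  /* mode 1, op1 = 5    */
        *p++ = 0x00;                 /* the 8-bit offset, zero                 */
    } else {                         /* the regular case: plain indirect       */
        *p++ = (unsigned char)(0000 | (rlow << 3) | rm);
    }
    return (int)(p - out);           /* number of bytes written                */
}

int main(void) {
    unsigned char buf[2];
    int n = emit_indirect(buf, 0, 12);  /* e.g. the memory operand of mov rax, [r12] */
    printf("%d bytes:", n);
    for (int i = 0; i < n; i++) printf(" %02x", buf[i]);
    printf("\n");                       /* prints: 2 bytes: 04 24 */
    return 0;
}
</code></pre>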
<br />
We are, unfortunately, not completely OK yet. It is my central hypothesis of this post that DynASM was not designed to handle register selection at runtime. Evidence for this hypothesis is that DynASM does 'weird' things like mixing data and instructions, and linking prior to encoding. Usually one encodes first and links afterwards, especially when during encoding you may need to make decisions that influence the final positions of certain segments. DynASM does it the other way around, which means that during linking we should be able to calculate exactly how much space we need for each instruction. Which is a pain, because DynASM mixes the data stream (which we need for inspection) with the instruction stream (which tells the runtime what to do with its input). It's possible to hack around this - basically by copying data into the instructions - but it's not elegant. Starting with <a href="https://github.com/MoarVM/dynasm/commit/348ee123640393df01489173793c2f80dbda483e" target="_blank">this commit</a>, I'm reasonably confident that this stuff works; a test case is provided <a href="https://github.com/bdw/lab/blob/8af6c4cfa6e67f665af866b50ea019260c3678d3/c/dasm/example.dasc" target="_blank">here</a>.<br />
<br />
That almost concludes this week's madness. The only thing left is to question the method. Why should x86 use this convoluted scheme? I could do detailed historical research, but I prefer to speculate it is caused by economy in memory. After all, in the regular case you need only 2 bytes, which is - conveniently - equal to 16 bits, the original register size of the 8086. And since that chip was designed in the 1970s, it makes sense that instructions should be as space-efficient as possible. In contrast, ARM uses 32-bit instructions with 3 operands. So space economy seems a plausible cause to me.<br />
<br />
See you next time!<br />
<br />
<b>Wrapping Up</b><br />
<br />
Hi everybody, it's been a while since I've blogged, and I think you deserve an update. Last week, of course, was YAPC::EU, which was awesome. Granada is a very nice place, and the weather was excellent. Tapas lunch was very nice, as was the gala dinner (also with tapas). There were many interesting people and presentations (many more than I could actually see). It was also very interesting to present (<a href="https://github.com/bdw/slides/blob/master/Bart%20Wiegmans%20-%20YAPC::EU%202015-09-02.pdf" target="_blank">slides</a>) about the JIT, which I think went well. One of the comments I heard was that it was a quite high-level talk, so if anyone should ask, I can and will give a talk describing the grueling details in the future. Oh, and I just found that Larry Wall's keynote has been uploaded to <a href="https://www.youtube.com/watch?v=RvCkvXvqi3U" target="_blank">youtube</a>. Go ahead and watch; this page isn't going anywhere.<br />
<br />
So what news from JIT compiler land? Well, since I last blogged I have had to design and implement the register allocator and tie everything together into the final running compiler. That has been achieved, but there are still many bugs and limitations that prevent it from actually being effective. I will talk about these at length. But first, let me explain the register allocator.<br />
<br />
The basic problem of register allocation is that a compiler can assign more values than the CPU has registers. Thus, only some values can reside in registers and others must be stored to memory (spilled). Computing the best set of values to keep in registers is in general a computationally intractable problem. To solve it I use the '<a href="http://web.cs.ucla.edu/~palsberg/course/cs132/linearscan.pdf" target="_blank">linear scan</a>' allocation heuristic. This heuristic is as simple as can be - determine for each value allocated its last use, and when it's time to spill a value, spill the one which will be live furthest into the future. <a href="http://cs.au.dk/~mis/dOvs/slides/Kevin-linear-scan-reg-alloc.pdf" target="_blank">These slides</a> also do a good job of explaining it.<br />
<br />
Aside from being simple to implement, it's also effective. Values that will expire soon are likely to be used soon as well, so their register will free up soon enough. On the other hand, values that are bound to live a long time will likely be spilled before they expire anyway, so it costs less to spill them now. Another benefit is that this can be evaluated online, i.e. as the JIT tree is being processed, so that it doesn't require a separate step. (One abstraction it did require, though, was order-numbering the tiles, which are currently attached to the JIT tree. This makes the tiles in essence a linear list, and it is my plan to convert them to a linear array in memory as well. That would reduce the number of tree traversals by one as well).<br />
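<br />
As a toy illustration of the spill choice - my own simplified example, with none of MoarVM's actual data structures - suppose there are only two registers and every live value is known by the position of its last use:<br />
<br />
<pre><code>
/* Linear-scan spill choice: when no register is free, evict the live
 * value whose last use lies furthest in the future. */
#include <stdio.h>

#define NREGS 2

typedef struct { int value; int last_use; } LiveReg;   /* value < 0 = free */

static int choose_spill(const LiveReg regs[NREGS]) {
    int victim = 0;
    for (int i = 1; i < NREGS; i++)
        if (regs[i].last_use > regs[victim].last_use)
            victim = i;
    return victim;
}

static int allocate(LiveReg regs[NREGS], int value, int last_use) {
    for (int i = 0; i < NREGS; i++)
        if (regs[i].value < 0) {                 /* a register is still free */
            regs[i] = (LiveReg){ value, last_use };
            return i;
        }
    int victim = choose_spill(regs);             /* otherwise spill one      */
    printf("spill value %d (last use at %d)\n",
           regs[victim].value, regs[victim].last_use);
    regs[victim] = (LiveReg){ value, last_use };
    return victim;
}

int main(void) {
    LiveReg regs[NREGS] = { { -1, 0 }, { -1, 0 } };
    allocate(regs, 100, 9);   /* long-lived value              */
    allocate(regs, 101, 3);   /* short-lived value             */
    allocate(regs, 102, 5);   /* no register free: spills 100  */
    return 0;
}
</code></pre>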
<br />
I will not maintain the illusion here that a register allocator is a trivial component, even with a conceptually simple algorithm like the one just outlined. One of the tricky bits is ensuring that values defined in a conditional branch do not accidentally 'escape' out of their branches, since they will be unavailable if the branch was not taken. In the same vein, after a call all 'volatile' registers are invalidated, so that also requires some special treatment.<br />
<br />
After the register allocator was finished, all that remained was ironing out the bugs. And they were (and are) many. The most annoying of these were the DynASM machine code generation bugs. The DynASM runtime is a tight state machine that uses a mix of data and instructions to generate machine code. The first bug was relatively simple - a missing REX byte marking, caused by the preprocessor looking at only one of the two operands. The second bug was positively evil. See, x86 uses so-called ModR/M bytes to specify the register and memory access mode used for an instruction. It's not important that you know what it stands for, but what is important is that each byte - 8 bits - is divided into a mode of 2 bits and 2 register numbers of 3 bits each. (There are 8 legacy registers in x86, so that fits). Except when the register number is 4 (rsp or the <i>stack pointer register</i>). Then the meaning changes entirely: a SIB byte must follow, which is quite different - it refers to two registers at once. The upshot is that this SIB byte must be appended to the ModR/M byte and filled with correct data; and that this SIB byte is then interpreted as if it were a ModR/M byte anyway. I've patched DynASM to do this, but it is really quite sensitive and brittle, and I quite expect to have to fix this in the future in another way.<br />
<br />
That brings me to today. Unfortunately for my JIT aspirations, my educational obligations have caught up with me again. In other words: my studies have started and leave me with much less time to work on the JIT. So for clarity, in the last few months I have completed the following:<br />
<ul>
<li>The expression tree intermediate format and template preprocessor</li>
<ul>
<li>Documentation on the use of these </li>
</ul>
<li>The tiler and tiler table generator </li>
<li>A register allocator and assorted support structures in the new JIT</li>
<li>Dynamic register support for x64 DynASM </li>
<li>Periodic reports on progress by means of this blog </li>
</ul>
Timo Paulssen has developed a preprocessor from C source to the expression tree format, which should in the best case help quickly convert most of the old JIT compiler segments to the new one, once it is more stable.<br />
<br />
Here is what I intended (and indeed promised) I would do, and haven't finished yet:<br />
<ul>
<li>A test suite - I haven't needed one during development, as regular NQP compilation already involves plenty of JIT compiler activity. The benefit of a real test suite is of course that specific segments can be tested (like numeric registers)</li>
<li>A representation-based extension API. I haven't gotten around to it because the JIT has not been stable enough so far. It will be based on the use of templates specialized to a certain in-memory representation.</li>
<li>Documentation on the use and development of architecture-specific tiles.</li>
</ul>
And there are things which I haven't outlined explicitly but which I still want to do:<br />
<ul>
<li>Full floating-point operation support (the simplest way is probably to use a wholly separate allocator for numeric registers).</li>
<li>Automatic within-template ordering to maintain consistent branch semantics (rather than relying on <i>let</i> ordering to build values shared between branches). </li>
<li>A framework for optimizing transformation. I think the key to this is probably a <i>replacement table</i> allowing different algorithms to insert replacements for given nodes.</li>
<li>As I said above, a truly linear representation of the selected tiles, in order to allow reordering for the purposes of instruction scheduling.</li>
</ul>
And there are many more. But as I said, I cannot continue with them as I have in the past few months. I will continue to work to stabilise the JIT and I hope to succeed in that before Christmas, only at a slower pace than before. I realise it is somewhat disappointing not to be able to fully 'land' the new JIT yet, but I'm confident we'll get there in the end. See you next time!<br />
<br />
<b>Tiler Update</b><br />
<br />
In my last blog I talked a bit about a problem I had with generating all possible sets of rules from the tile 'grammar'. I've since solved that problem, and thought it'd be nice to share my results. (NB: I use the word 'grammar' loosely today and on other days. In compiler jargon a grammar is usually a thing that matches a linear stream of lexical tokens to a tree structure. In this case, it's really the other way around, because it's used to match a tree and output a linear sequence. However, these things are quite similar, so I hope you'll forgive me for using the term).<br />
<br />
It may help if I state the problem as clearly as I can. The input is a list of <i>rules</i> mapping a <i>head</i> and up to two <i>child nodes</i> to a <i>tile</i> and a <i>nonterminal symbol</i>. A <i>tile</i>, to reiterate, is ultimately a piece of code that emits machine code. The <i>nonterminal symbol</i> (from now on: <i>symbol</i>) is what takes the place of a node after a tile has been selected for it. This abstraction is what allows the tiler to determine the rule(set) for a node by looking only at its direct children. Note that tile rules that refer to nested structures are 'flattened' by replacing nested children by special one-off symbols.<br />
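<br />
To make that concrete, a flattened rule could be represented along these lines; this is a hypothetical sketch of mine, not MoarVM's actual table format, and the symbol and tile names are invented:<br />
<br />
<pre><code>
/* Hypothetical representation of flattened tile rules (not MoarVM's). A rule
 * matches a head node plus the symbols of up to two children, and yields a
 * tile (code emitter) and a replacement symbol. */
#include <stdio.h>

typedef enum { SYM_NONE, SYM_REG, SYM_MEM } Symbol;

typedef struct {
    const char *head;       /* node type this rule matches                  */
    Symbol      child[2];   /* required symbol per child; SYM_NONE if leaf  */
    const char *tile;       /* name of the machine-code-emitting tile       */
    Symbol      produces;   /* symbol that replaces the node after tiling   */
} TileRule;

static const TileRule rules[] = {
    { "const", { SYM_NONE, SYM_NONE }, "const_to_reg", SYM_REG },
    { "load",  { SYM_REG,  SYM_NONE }, "load_to_reg",  SYM_REG },
    { "add",   { SYM_REG,  SYM_REG  }, "add_reg_reg",  SYM_REG },
};

int main(void) {
    printf("%zu rules\n", sizeof rules / sizeof rules[0]);
    return 0;
}
</code></pre>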
<br />
Many rules can be matched together. Take for instance these tiles starting with an <code>add</code> node. For x64, we can compile a register-to-register addition, a constant-to-register addition, and a memory-to-register addition. (x86 can also add from register to memory, but we ignore that).<br />
<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPaG_xIi-NfYaX3GFWnaQINixfLDwz4rC_ywfpn36Muvds9p-AY0o6xcaQPEiTfO2qvTAmflIbmfr7QtufM55BjVlWRzF-qc0sCiXnHsPUOuqnSwO4-5UY6mTOfTtIhBHS9MsoDkw_N-c/s1600/reg.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPaG_xIi-NfYaX3GFWnaQINixfLDwz4rC_ywfpn36Muvds9p-AY0o6xcaQPEiTfO2qvTAmflIbmfr7QtufM55BjVlWRzF-qc0sCiXnHsPUOuqnSwO4-5UY6mTOfTtIhBHS9MsoDkw_N-c/s1600/reg.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Rule 1: Adding register to register</td></tr>
</tbody></table>
<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiimSUCg-ECMQgDmLtF0r4wWy4QoAo2TmlSezZA4qNbgcURA1jpcX8vCBA2r-xvcZAWlDcaN_N6X4l2aZKy6USr5xx33CCSMSGdy8-aBqYd2jCD7__dpFMF6fETPWqQUEef_TtASCSfOjo/s1600/con.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiimSUCg-ECMQgDmLtF0r4wWy4QoAo2TmlSezZA4qNbgcURA1jpcX8vCBA2r-xvcZAWlDcaN_N6X4l2aZKy6USr5xx33CCSMSGdy8-aBqYd2jCD7__dpFMF6fETPWqQUEef_TtASCSfOjo/s1600/con.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Rule 2: A tile that adds a constant value to a register</td></tr>
</tbody></table>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRVWLyz54DAn_qtc7rjmdq4gac_O9CodntjAgekuH-dDyHLZtkgM7IQhlVJgK02Mkm3xA1-4Lh7lha4lOCgyvTZ2MpPjK_daMitg52WuIvCrwXDHexS97eIvbgGH3eKRetqEoSu7da6_I/s1600/bar.png" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRVWLyz54DAn_qtc7rjmdq4gac_O9CodntjAgekuH-dDyHLZtkgM7IQhlVJgK02Mkm3xA1-4Lh7lha4lOCgyvTZ2MpPjK_daMitg52WuIvCrwXDHexS97eIvbgGH3eKRetqEoSu7da6_I/s1600/bar.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Rule 3: A tile that adds a register to a value loaded from memory</td></tr>
</tbody></table>
<br />
Given that both a <code>(load mem)</code> and a <code>(const)</code> node can be tiled to <i>reg</i> symbols, we can always match <code>(add reg reg)</code> when matching an <code>add</code> node. But it's not always possible to match <code>(add reg (const))</code> or <code>(add reg (load mem))</code>. And the last two certainly cannot be combined. So when the tiler inspects the <code>add</code> node, it can emit the following three combinations of tiles: <code>{1}, {1,2} and {1,3}</code>. The problem I had was <i>given the complete set of rules, how can we find all permissible combinations that the tiler can generate?</i><br />
<br />
(As an aside, I don't consider generating only the necessary rulesets to be an optimization; it is very much necessary for the feasibility of the tiler. The tiler tables otherwise become either prohibitively large or incomplete).<br />
<br />
Why can we combine 1 and 3? Because the <code>(load mem)</code> node can generate a <i>reg</i> symbol as well as the 'placeholder' symbol generated in rule 3 (let's call that the <i>3a</i> symbol).<i> </i>Similarly, the <code>(const)</code> node can generate a <i>reg</i> as well as the <i>2a</i> symbol. Indeed,<b> rules can be combined whenever the symbols they refer to can be combined.</b> And they are separate whenever the symbols they refer to can occur separately. So the key to finding all permissible combinations is finding all possible combinations of symbols. That is something of a simpler problem.<br />
<br />
First of all, all rules that refer to the same combination of symbols can be combined themselves, generating new combinations of symbols. Second, some symbol combinations may not be generated at all (<i>3a</i> will never occur in isolation, for example). Finally, when we <i>do</i> have all possible sets of symbols, applying all rules to each set will continue to generate only these combinations. We can find rules which combine in such a way easily using a <a href="https://en.wikipedia.org/wiki/Trie" target="_blank">trie</a>. So the solution to finding all permissible combinations of rules is this:<br />
<ol>
<li>Apply each rule to a trie of symbol combinations and store the rule number and symbol in each leaf.</li>
<li>Extract all combinations of symbols that resolve to a single trie leaf.</li>
<li>Compare the symbol combinations to those used in step <i>1</i>. If there are any changes, repeat from step <i>1</i>.</li>
<li>The trie leaves now also contain all possible rule combinations, which simply need to be given a number.</li>
</ol>
The idea for this algorithm I derived from the table generation algorithm for parser generators, to which it is also quite similar.<br />
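<br />
To give a feel for the fixed-point nature of the procedure, here is a much-simplified sketch that leaves out the trie and the rule-set numbering, and merely enumerates - for a toy version of the add/const/load rules above - every combination of symbols a node can yield. The names and the bitset representation are my own invention:<br />
<br />
<pre><code>
#include <stdio.h>
#include <string.h>

enum { REG = 1, SYM_2A = 2, SYM_3A = 4 };     /* symbols as bits in a set  */

typedef struct {
    const char *head;         /* node type the rule matches                */
    unsigned    left, right;  /* symbol required of each child, 0 = absent */
    unsigned    produces;     /* symbol the rule yields                    */
} Rule;

static const Rule rules[] = {
    { "const", 0,   0,      REG    }, { "const", 0,   0,      SYM_2A },
    { "load",  REG, 0,      REG    }, { "load",  REG, 0,      SYM_3A },
    { "add",   REG, REG,    REG    }, { "add",   REG, SYM_2A, REG    },
    { "add",   REG, SYM_3A, REG    },
};
#define NRULES (sizeof rules / sizeof rules[0])

/* union of symbols produced by all rules for 'head', given the symbol
 * combinations of its children */
static unsigned apply(const char *head, unsigned lc, unsigned rc) {
    unsigned out = 0;
    for (size_t i = 0; i < NRULES; i++) {
        if (strcmp(rules[i].head, head))              continue;
        if (rules[i].left  && !(lc & rules[i].left))  continue;
        if (rules[i].right && !(rc & rules[i].right)) continue;
        out |= rules[i].produces;
    }
    return out;
}

static unsigned combos[32];
static int      ncombos;

static void add_combo(unsigned c) {
    if (!c) return;
    for (int i = 0; i < ncombos; i++) if (combos[i] == c) return;
    combos[ncombos++] = c;
}

int main(void) {
    int before;
    add_combo(apply("const", 0, 0));         /* leaf nodes seed the set     */
    do {                                     /* iterate to a fixed point    */
        before = ncombos;
        for (int i = 0; i < before; i++) {
            add_combo(apply("load", combos[i], 0));
            for (int j = 0; j < before; j++)
                add_combo(apply("add", combos[i], combos[j]));
        }
    } while (ncombos != before);
    for (int i = 0; i < ncombos; i++)        /* {reg,2a}, {reg,3a}, {reg}   */
        printf("combination %d: 0x%x\n", i, combos[i]);
    return 0;
}
</code></pre>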
<br />
A further change I had to make was to split the finding of rulesets - a bottom-up procedure - from the selection of the right rules, which is a top-down procedure. It is top-down because the rule selected at a node determines the rules that can be applied for its child nodes. All that is needed is a table that stores the minimum-cost rule to generate a given symbol for a given tile set, and a root rule to start with. (NB: I'm not 100% sure that's true, but it's my working hypothesis, and it can't generate wrong code, only suboptimal code.)<br />
<br />
So that was my update. See you next time!<br />
<br />
<b>Inching Closer</b><br />
<br />
Hi everybody, I realise I haven't blogged in a while now, and it'd be a good time to write you an update. I've hit a few problems, but am still making progress.<br />
<br />
I've written the tiler algorithm for the JIT compiler. This caused a difficulty because the input graph is not a tree but a DAG, meaning that a node may have more than one 'parent' or 'consumer' node. Since consumers decide the tiling of their child nodes, this may mean that the tiles conflict. I resolve such tile conflicts simply by adding a new node to replace the old one.<br />
<br />
I've also been busy adding the tile implementations. These are pieces of C code that emit assembly code corresponding to the part of the input tree they represent. This caused some challenges in that tiles sometimes refer to nodes deep in the tree. I solve this by adding traced paths to the nonterminals to the JIT tile table, allowing the JIT compiler to find the nodes refered to and make them easily available. For example:<br />
<pre><code>
(add reg (load mem))
</code></pre>
<br />
The second 'value' argument to this tile is the <code>mem</code> node, not the <code>load</code>. These value arguments are important for managing state between tiles, like the register containing the computed value, or a computed memory address.<br />
<br />
To the expression template compiler, I've added the facility to mark certain templates as 'destructive', meaning that they write the value they yield directly to memory. Thus, this value does not become directly available for the JIT tree. Many MoarVM instructions are actually implemented that way, which I noticed when timotimo kindly offered to help write an automated translation tool from the 'old' JIT graph to the new format.<br />
<br />
I've changed the JIT-to-interpreter interface to be simpler and deal more sensibly with exits. It used to be the case that the interpreter had to return control out from the JIT-ted frame, so that it'd know when it had unwound the last stack and had to exit. Now the JIT-ted frame is responsible for unwinding the stack itself, much like interpreted code is. The unwound stack check was simple enough to replicate in the JIT interpreter driver instruction. <br />
<br />
I've completed the changes to DynASM that allow us to address all x64 general-purpose registers in all different sizes. This required the addition of a new facility that registered whether a so-called REX byte - indicating the use of extended registers - was required for addressing the registers supplied to DynASM. Because many fields in MoarVM internal structures are less than 64 bits in size, this was more important than I had realized. At the same time, I've also added value size information to the expression tree, so that these sizes are used correctly.<br />
<br />
I've designed, but not yet implemented, a proper register allocator. The key abstraction, I think, is to distinguish registers which are in use from registers which are merely allocated. Allocated registers may be spilled and reused, but in-use registers may not.<br />
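<br />
As a small sketch of that distinction - my own illustration, not the eventual MoarVM design - an allocated register holds a live value and may be spilled to free it up, while an in-use register is needed by the code currently being emitted and may not be taken away:<br />
<br />
<pre><code>
#include <stdio.h>

typedef struct {
    unsigned allocated;   /* bit i set: register i holds some live value   */
    unsigned in_use;      /* bit i set: register i may not be spilled now  */
} RegState;

/* pick a register: prefer a completely free one, otherwise spill an
 * allocated register that is not currently in use */
static int take_register(RegState *s) {
    for (int i = 0; i < 16; i++)
        if (!(s->allocated & (1u << i))) {
            s->allocated |= 1u << i;
            s->in_use    |= 1u << i;
            return i;
        }
    for (int i = 0; i < 16; i++)
        if (!(s->in_use & (1u << i))) {
            printf("spilling register %d\n", i);
            s->in_use |= 1u << i;
            return i;
        }
    return -1;            /* every register is in use: a compiler bug      */
}

int main(void) {
    RegState s = { 0xffff, 0x000f };    /* all allocated, r0-r3 in use      */
    printf("got register %d\n", take_register(&s));   /* spills register 4 */
    return 0;
}
</code></pre>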
<br />
I've documented the syntax and structure of the expression tree format. I hope this will help other people to write JIT extensions and/or optimizations. Documentation on the tiler is work in progress.<br />
<br />
My current problem is that the method I had designed last time to derive all possible tiler states is wrong, because it tried to apply a topological sort on a graph that had cycles. Breaking these cycles leads to not generating all permissible tile states, which leads to runtime failure for the tiler. So this has to be avoided; at the same time, it is also very important not to generate more tiler states than necessary, because otherwise the tile table (and the time spent constructing it) becomes prohibitively large. Despite helpful suggestions by the community, I'm still not sure how to solve this. (It's a similar problem to parser table construction, but... not the same).<br />
<br />
All in all, I think I've come very close to really starting to apply the new code generator. I had expected to reach this state earlier, but what I had not realized is that a 'real' compiler requires <i>many moving parts</i> working correctly. Fortunately, I've tried to test new parts as they were being written, so I'm confident that most of them work as expected. Nevertheless, the next few weeks will be quite exciting, as I'm sure there will be many bugs to find.<br />
<br />
Finally, the <a href="http://act.yapc.eu/ye2015/schedule" target="_blank">schedule</a> for this year's <a href="http://act.yapc.eu/" target="_blank">YAPC::EU</a> is online, and I'm due to present at 12:00 on Wednesday, in the Aula Magna. For those who'll be there, I hope to have an interesting story to tell. See you next week!