woensdag 3 oktober 2018

A future for fork(2)

Hi hackers. Today I want to write about a new functionality that I've been developing for MoarVM that has very little to do with the JIT compiler. But it is still about VM internals so I guess it will fit.

Many months ago, jnthn wrote a blog post on the relation between perl 5 and perl 6. And as a longtime and enthusiastic perl 5 user - most of the JIT's compile time support software is written in perl 5 for a reason - I wholeheartedly agree with the 'sister language' narrative. There is plenty of room for all sorts of perls yet, I hope. Yet one thing kept itching me:
Moreover, it’s very much the case that Perl 5 and Perl 6 make different trade-offs. To pick one concrete example, Perl 6 makes it easy to run code across multiple threads, and even uses multiple threads internally (for example, performing optimization and JIT compilation on a background thread). Which is great…except the only winning move in a game involving both threads and fork() is not to play. Sometimes one just can’t have their cake and eat it, and if you’re wanting a language that more directly gives you your POSIX, that’s probably always going to be a strength of Perl 5 over Perl 6.
(Emphasis mine). You see, I had never realized that MoarVM couldn't implement fork(), but it's true. In POSIX systems, a fork()'d child process inherits the full memory space, as-is, from its parent process. But it inherits only the forking thread. This means that any operations performed by any other thread, including operations that might need to be protected by a mutex (e.g. malloc()), will be interrupted and unfinished (in the child process). This can be a problem. Or, in the words of the linux manual page on the subject:

       *  The child process is created with a single thread—the one that
          called fork().  The entire virtual address space of the parent is
          replicated in the child, including the states of mutexes,
          condition variables, and other pthreads objects; the use of
          pthread_atfork(3) may be helpful for dealing with problems that
          this can cause.

       *  After a fork() in a multithreaded program, the child can safely
          call only async-signal-safe functions (see signal-safety(7)) until
          such time as it calls execve(2).

Note that the set of signal-safe functions is not that large, and excludes all memory management functions. As noted by jnthn, MoarVM is inherently multithreaded from startup, and will happily spawn as many threads as needed by the program. It also uses malloc() and friends rather a lot. So it would seem that perl 6 cannot implement POSIX fork() (in which both parent and child program continue from the same position in the program) as well as perl 5 can.

I was disappointed. As a longtime (and enthusiastic) perl 5 user, fork() is one of my favorite concurrency tools. Its best feature is that parent and child processes are isolated by the operating system, so each can modify its own internal state without concern for concurrency,. This can make it practical to introduce concurrency after development, rather than designing it in from the start. Either process can crash while the other continues. It is also possible (and common) to run a child process with reduced privileges relative to the parent process, which isn't possible with threads. And it is possible to prepare arbitrarily complex state for the child process, unlike spawn() and similar calls.

Fortunately all hope isn't necessarily lost. The restrictions listed above only apply if there are multiple threads active at the moment that fork() is executed. That means that if we can stop all threads (except for the one planning to fork) before actually calling fork(), then the child process can continue safely. That is well within the realm of possibility, at least as far as system threads are concerned.

The process itself isn't very exciting to talk about, actually - it involves sending stop signals to system threads, waiting for these threads to join, verifying that the forking thread is the really only active thread, and restarting threads after fork(). Because of locking, it is a bit subtle (tbere may be another user thread that is also trying to fork), but not actually very hard. When I finally merged the code, it turned out that an ancient (and forgotten) thread list modification function was corrupting the list by not being aware of generational garbage collection... oops. But that was simple enough to fix (thanks to timotimo++ for the hint).

And now the following oneliner should work on platforms that support fork():(using development versions of MoarVM and Rakudo):

perl6 -e 'use nqp; my $i = nqp::fork(); say $i;'

The main limitation of this work is that it won't work if there are any user threads active. (If there are any, nqp::fork() will throw an exception). The reason why is simple: while it is possible to adapt the system threads so that I can stop them on demand, user threads may be doing arbitrary work, hold arbitrary locks and may be blocked (possibly indefinitely) on a system call. So jnthn's comment at the start of this post still applies - threads and fork() don't work together.

In fact, many programs may be better off with threads. But I think that languages in the perl family should let the user make that decision, rather than the VM. So I hope that this will find some use somewhere. If not, it was certainly fun to figure out. Happy hacking!


PS: For the curious, I think there may in fact be a way to make fork() work under a multithreaded program, and it relies on the fact that MoarVM has a multithreaded garbage collector. Basically, stopping all threads before calling fork() is not so different from stopping the world during the final phase of garbage collection. And so - in theory - it would be possible to hijack the synchronization mechanism of the garbage collector to pause all threads. During interpretation, and in JIT compiled code, each thread periodically checks if garbage collection has started. If it has, it will synchronize with the thread that started GC in order to share the work. Threads that are currently blocked (on a system call, or on acquiring a lock) mark themselves as such, and when they are resumed always check the GC status. So it is in fact possible to force MoarVM into a single active thread even with multiple user threads active. However, that still leaves cleanup to deal with, after returning from fork() in the child process. Also, this will not work if a thread is blocked on NativeCall. Altogether I think abusing the GC in order to enable a fork() may be a bit over the edge of insanity :-)

dinsdag 4 september 2018

Template Compiler Update

Hi everybody. After samcv++ released MoarVM again, it was finally time to introduce some updates to the expression template compiler.

As you may or may not know, the expression JIT backend maps MoarVM opcodes to its internal representation via expression templates. At runtime (during the JIT compilation phase), these templates are combined to form an expression tree. The expression template language remains somewhat underdocumented, but quite a few brave developers have still ventured to add templates. With the new template compiler, a few things will change:
  • Template syntax and type checking is much stricter. This means that, for example, it is no longer possible to place a C-macro expression (which evaluates to a constant value) where an expression operand (i.e. code) would be expected.
  • Templates can contain references to MoarVM opcode operands (written as $0, $1, $2 etc.) At runtime, read operand references are substituted with the value of the operand, and write operands with the address of the register they refer to. This was a source of confusion, so to reduce this all write operands must now be written with a '\' prefix, like \$0.
  • At the same time, the '!' suffix on a template name (like slice!) indicate that the template code does not return the value of the output, but overwrite the register where it is stored. (It tells the expression compiler that the value in that output register must be loaded from memory the next time it is used). Many templates that do not yield an output (like unshift)  had this suffix, but although that is harmless at runtime, it is nevertheless confusing, so the template compiler will now raise an error if it finds this.
  • Last but not least, macro application is now hygienic, meaning that a name declared in a macro will never conflict with a name declared in a template. So something like this should now work, even though it is nonsense:

(macro: ^foo (,bar)
  (let (($obj ...))
    (add $obj ,bar)))
(template: foobar
  (let (($obj ...))
    (^foo $obj))

Prior to these patches, this would expand to something like:

(template: foobar
  (let (($obj ...))
    (let (($obj ...)) # name conflict is here
      (add $obj $obj))) # which $obj?

Although this code would previously fail during compilation, having to deal with such conflicts in remote code segments is annoying. But now named declarations are resolved prior to expanding the inner macro, and this means that the $obj for macro ^foo is resolved to the let: within that macro, and the outer $obj in the template refers to the outer declaration, as if you'd written:

(template: foobar
  (let (($obj1 ...))
    (let (($obj2 ...))
      (add $obj2 $obj1)))

I believe that this makes writing templates safer and simpler, catching more errors at compile time and fewer at runtime. Happy hacking!

zaterdag 25 augustus 2018

A Curious Benchmark

Hi hackers! I recently saw a curious benchmark passed round on the #moarvm channel. (I understand that this originates from Ovid during his Future of Perl 5 and 6 presentation, but I haven't seen it myself, so if I'm wrong, don't hesitate to correct me). It's curious because it runs on perl5 and perl6, and because the difference is… significant:

# reciprocal.pl
my $x = 0;
$x += 1/$_ for 1..50_000_000;
print "$x\n";

Perl 5 (5.24.1), on my laptop (Intel i7-4700HQ CPU @ 2.40GHz, 16GB ram), takes approximately 2.7s to print the value 18.3047492382933. Perl 6, on the same script, takes just shy of 2m18s, or nearly 140s. This is 51 times as much. Perl 5 is not known as a particularly fast language, but in this case perl 6 is really very slow indeed.

Let's try and do a little better. As pointed out by timotimo, the benchmark above is a pessimal case for rationals numbers, which perl 6 uses by default. Perl 5 uses floating point throughout. So we can do better by explicitly using floating point calculations in perl6:

# reciprocal.pl6
my $x = 0e0;
$x += 1e0/$_.Num for 1..50_000_000;
say $x;

This takes approximately 30s on my machine, approximately 5 times faster, but still over 11 times slower than perl5. (NB: for all these numbers I'm posting here, I didn't run exhaustive tests and calculate the statistics, but I feel like this is reliable).

We can typically avoid some overhead in Perl 6 if we avoid using scalar containers by means of binding. We can avoid the dynamic lookup of $_ by replacing the for with a while loop. And we can skip the cast from Int to Num by using a Num iterator value. That gives us the following code:

# reciprocal-while.pl6
my $x := 0e0;
my $i := 0e0;
while (($i := $i + 1e0) < 5e7) {
    $x := $x + 1e0/$i;
}
say $x;

This reduces the run time to approximately 26.5s. So instead of well over 11 times slower than perl 5, perl 6 is now a little less than 10 times slower.
I tried using native types, but that increased the run time to up to 36s. Native type performance (except for native typed arrays) has so far not met expectations for perl 6. (I understand that this is due to excessive boxing and unboxing, unfortunately). So it seems that I failed to make a perl6 program that performs comparably to perl 5.

And yet…

With all due respect to the perl 5 (and perl 6) developers, I think MoarVM ought to be able to do better. MoarVM has support for native-typed values and operators, the perl interpreter does not. It has a dynamic specializer and JIT compiles to native code, the perl interpreter does not. I should expect MoarVM to do better on this benchmark. (Otherwise, what have I been doing with my life, the last few years?)

Let's try the secret weapon: NQP. NQP stands for Not Quite Perl and it is the language the rakudo compiler is built in. It acts like a 'bootstrap' language and compatibility layer between the various backends of rakudo perl 6, such as MoarVM, Java and Javascript, and in the near future Truffle. I like to think that it relates to MoarVM in the same way that C relates to contemporary CPUs - it is a sort of low level, high level language. Although NQP fully support perl6 classes, regular expressions and grammars, it has no support for ranges or C-style for loops, so it does tend to look a bit primitive:

# reciprocal-boxed.nqp
my $x := 0e0;
my $i := 1e0;
while ($i < 5e7) {
    $x := $x + 1e0/$i;
    $i := $i + 1e0;
}
nqp::say("$x");

This version uses boxed objects and takes (on my machine) approximately 12.5s. If you recall, perl 5 took 2.7s, so this is just a bit less than 5 times slower than perl 5. Still not very satisfying though.

Let's improve this version a little bit and add native types here. The only difference between this code and the code above it, is that we have explicitly opted to use num values for $x and $i.

# reciprocal-native.nqp
my num $x := 0e0;
my num $i := 1e0;
while ($i < 5e7) {
    $x := $x + 1e0/$i;
    $i := $i + 1e0;
}
nqp::say("$x");

This code, with JIT enabled, consistently runs in approximately 0.28s on my machine. That is not a typing error. It prints the correct result. I emphatically want you to try this at home: simply save it as reciprocal.nqp and run time nqp reciprocal.nqp. (With JIT disabled, it runs in 2.4s, which is (finally) a bit faster than perl 5 in the interpreter).

Just out of curiosity, I tried comparing this result with the following C code:
 
#include <stdio.h>

int main(int argc, char **argv) {
    double x = 0.0;
    double i;
    for (i = 1.0;  i < 5e7; i += 1.0)
        x += 1.0/i;
    printf("%f", x);
}

On my machine, this takes approximately 0.22s per run, which means that NQP on MoarVM, using native types, is a little more than 1.3x slower than compiled C, excluding the compilation time of the C program. The NQP program does include JIT compilation.

For the record, the C compiled code is much simpler and faster than that generated by the MoarVM JIT - 75 bytes for C 'main' vs 1139 bytes for the NQP 'mainline' (149 bytes for the hot loop). So there is much to improve left, for sure. (EDIT: I originally wrote this in a rather confusing way, but the C code was always the shorter version).

So what does that tell us? Is perl6 50 times slower than perl5, or is NQP just 30% slower than C? There is actually much more to learn from the relative performance of these programs:
  • Regardless of the mathematical merits of rational numbers, they do not win benchmarks. It is a good thing perl 6 can fall back to floating point.
  • The cost of boxing and unboxing in MoarVM appears to be fairly large. Compare the performance of the boxing-floating point calculations and the native-floating point calculations in NQP - the cost of boxing absolutely dwarves the actual cost of doing the computation. I think this deserves further investigation.
  • The overhead introduced by perl6's flexibility, even after specialization, accounts for a factor of two slowdown compared to NQP. If I'd hazard a guess, I would attribute that to (multi) subroutine call overhead and indirection from scalar containers. I would speculate that (deep) inlining or trace specialization would help significantly.
  • The overhead of specialization and JIT compilation is insignificant even in a short-running benchmark like this. This is pretty encouraging I think.
  • This benchmark, when reduced to native operations, appears to be dominated by the cost of floating point operations, and I expect floating point division to be the most expensive of these. So the value as a benchmark for the JIT is relatively limited - both C and NQP code appear to spend most of the time waiting for the FPU to complete.
After I wrote this (but before publishing), I reran the benchmarks with the new branch postrelease-opts. The results were rather different and encouraging (units in seconds):

test master postrelease-opts
reciprocal.pl 140 51
reciprocal.pl6 31 6.6
reciprocal-while.pl6 26 4.6
reciprocal-boxed.nqp 12 3.6
reciprocal-native.nqp 0.27 0.25

Congratulations to jnthn++ and timotimo++ for this fantastic result. As I understand it, we expect to merge this branch after the release of MoarVM 2018.08.

As always, benchmarks should be taken with a grain of salt - results do not always generalize cleanly to production code. But I hope this has given some insights into the performance characteristics of perl 6, and I think it highlights some areas of interest.

PS. The postrelease-opts has taken the place of the master branch as the mainline for development, as we aim to get master fit for a release. Personally, I'd rather have master open for development and a release branch for cutting a stable release. I expect most users get MoarVM from a specific revision via MOAR_REVISION anyway, so having master be the 'development' branch should cause little harm.

donderdag 5 juli 2018

Perl 6 on MoarVM has had a JIT for a few years

Dear readers, I recently came across the following comment on the internet (via Perl 6 weekly):

perl6 on MoarVM is starting to grow a JIT

 Which is very interesting because:
  1. Someone had at least heard of our efforts, so that's a win.
  2. But they had left with the impression that it was still in a beginning phase.
Clearly, we have some PR work to do. Consider this my attempt.

When people are talking about a 'JIT' colloquially and especially in the context of dynamic languages, they usually refer to a system that has both a dynamic specialization functionality, as well as a machine-code emitting functionality. MoarVM has had support for both since 2014. Historically we've called the specializer 'spesh' and the machine code emitter 'jit'. Maybe for communicating with the rest of the world it is better to call both 'the JIT' frontend and backend, respectively.

Without further ado, let me list the history of the MoarVM JIT compiler:
  1. The April 2014 release of MoarVM introduced the dynamic specialization framework spesh (the 'frontend').
  2. In June 2014, I started working in earnest on the machine code emitter (the part that we call the JIT), and halfway through that month had compiled the first function.
  3. In July 2014 we saw the introduction of inlining and on-stack-replacement in spesh.
  4. Also July 2014 saw the introduction of the invokish mechanism by which control flow can be returned to the interpreter.
  5. In August 2014 the JIT backend was completed and merged into MoarVM master.
  6. In June 2015, I started work on a new JIT compiler backend, first by patching the assembler library we use (DynASM), then proceeded with a new intermediate representation and instruction selection.
  7. After that progress slowed until in March 2017 I completed the register allocator. (I guess register allocation was harder than I thought it would be). A little later, the register allocator would support call argument placement
  8. The new 'expression' JIT backend was merged in August 2017. Since then, many contributors have stepped up to develop expression JIT templates (that are used to generate machine code for MoarVM opcodes).
  9. In August 2017, Jonathan started working on reorganizing spesh, starting with moving specialization to a separate thread, central optimization planning, improving optimizations and installing argument guards (which makes the process of selecting the correct specialized variant more efficient).
  10. Somewhere after this - I'm not exactly sure when - nine implemented inlinig the 'NativeCall' foreign function interface into JIT compiled code.
  11. Most recently Jonathan has started work to write specializer plugins - ways for rakudo to inform MoarVM on how to cache the result of method lookups, which should help increase the effectiveness of optimization for Perl 6.
  12. Arround the same time I reworked the way that the interpreter handles control flow changes for  JIT compiled code (e.g. in exception handling).
We (mostly Jonathan and I) have also given presentations on the progress of the JIT compiler:
  1. At YAPC::EU 2014 and FOSDEM 2015, Jonathan gave a presentation on the MoarVM dynamic specialization system.
  2. At YAPC::EU 2015 (Granada) I also gave a presentation. Sadly I can no longer find the presentation online or offline.
  3. At the Swiss Perl Workshop in 2017 Jonathan gave a presentation on deoptimization  and how it is used for supporting speculative optimizations.
  4. At The Perl Conference in Amsterdam (2017) I gave a presentation on how to implement JIT expression templates in MoarVM.
I'm sure there is more out there that I haven't yet linked to. And there's still a lot of exciting work going on and still more work to do. However, I hope that after this post there can be no doubt about the reality of a Perl6 implementation backed by a specializing JIT compiler ;-)


PS: You might read this and be reasonably surprised that Rakudo Perl 6 is not, after all this, very fast yet. I have a - not entirely serious - explanation for that:
  1. All problems in computer science can be solved with a layer of indirection.
  2. Many layers of indirection make programs slow.
  3. Perl 6 solves many computer science problems for you ;-) 
In the future, we'll continue to solve those problems, just faster.

PPS: Also, it is important to note that many of practical speed improvements that Rakudo users have come to enjoy did not come from VM improvements per se, but from better use of the VM by core library routines, for which many volunteers are responsible.

zondag 10 juni 2018

Controlled Stack Hacking for the MoarVM JIT Compiler

Hi readers! Today I have a story about a recently-merged set of patches that allows MoarVM to use the on-stack return pointer to reduce the overhead of exception handling and other VM features for JIT compiled code. Maybe you'll find it interesting.

As you might know, MoarVM Is a bytecode interpreter. In some situations, MoarVM internals need to know the current position in the execution of the program. For instance in exception handling, all exception thrown within a block are caught by the associated CATCH block or propagated if no such block exists. Such blocks are indicated as a range within the bytecode, and we find the associated CATCH block by comparing the current position with the known ranges.

This is relatively straightforward to implement for the interpreter, because the interpreter must maintain a 'current position' pointer simply to function. (MoarVM stores a pointer to this pointer in a thread context object so that it is available throughout the VM). For the JIT that is another matter, because the control flow is handled implicitly by the CPU. The instruction pointer register (called %rip on amd64) cannot be read directly. Moreover, as soon as you enter a function that might want to use the current address (like the functions responsible for exception handling), you've left the 'program' code and entered VM code.

So what we used to do instead is take the address of a position within the bytecode (as indicated by a label in the bytecode, a somewhat involved process) and store that in a per-frame field called the jit_entry_label. This field is necessary to support another MoarVM feature  - we use the interpreter as a trampoline (in the first or second sense of that definition). Because the interpreter is not recursive, JIT compiled code needs to return to the interpreter to execute a subroutine that was invoked (as opposed to calling an interpreter function, as perl5 does for exception handling). The primary purpose of this label is to continue where we left off after returning from another invoked program. But it can be used just as well for finding where we are in the execution of the program.

Only problem then is that we need to keep it up to date, which we did. On the entry of every basic block (uninterrupted sequence of code), we stored the current position in this field. This is quite common - every conditional statement, loop or other control flow change needs one, as well as every exception-handler scope change needed a little snippet storing the current position. This was annoying.

Furthermore, there are numerous MoarVM instructions that might change the control flow (or might not). For instance, the instruction responsible for converting an object to a boolean value might need to invoke the Bool method specific to that objects' class - or, if no such method exists, fallback to a default implementation. We call such instructions invokish. When compiling code that contains such invokish instructions, we installed 'control guards' to check if the VM had in fact invoked another routine, and if so, to return to the interpreter to execute that routine. This too added quite a bit of overhead.

I keep writing in the past tense because all of that is now gone, and that happened due to a simple realization. When we call a function (in C or assembly), we place the return address (the machine instruction after the call instruction) on the stack. We can read this value from the stack and use it wherever we want to know about the current position.

I initially had implemented that using a stack walker function similar to the one in the link, except that I implemented it in assembly instead. (When writing this post I learned of the GCC __builtin_return_address and MSVC _ReturnAddress intrinsic functions, which presumably do the same thing). Unfortunately, that strategy didn't really work - it relies on the frame base pointer (%rbp) being placed right 'on top' of the return address pointer on the stack. Even with special compiler flags intended to preserve that behaviour, this assumption turned out to be unreliable.

Fortunately I realized later that it was also unnecessary. Because the JIT compiler controls the layout of the compiled code frame, it also controls exactly where the return address will be stored when we compile a (C) function call. That means that we can simply take a pointer to this address and store that in the thread context structure. From that address, we can read exactly the current position in the compiled code, without having to explicitly store it so often. Furthermore, we can also write to this location, changing the address the function will return to. Effectively, this is a controlled 'on-stack goto', an idiom more often used for exploits than for good purposes - clearly this is an exception! We use this to force a return to the interpreter (with proper stack frame cleanup) for 'invokish' instructions that end up invoking. We can change control to go directly to an exception handler if it is in the same frame. This makes all the earlier control 'guard' fragments redundant, allowing us to remove them entirely. Thus, an invokish instruction that doesn't actually invoke now carries no extra cost.

How much does this save? It depends a lot on the exact program, but I estimate about 5% of compiled code size, and from a hopelessly optimal (and fairly unrealistic) benchmark which I lifted from this blog post, approximately 10% of runtime. In real code, the effect is definitely nowhere near what jnthn++ or samcv++ achieved lately, but it's still nice. Also nice is that the code is quite a bit simpler than it was before.

Anyway, that's all I have to tell today. Have fun hacking, and until next time!

dinsdag 27 februari 2018

Some things I want

Lately I've been fixing a few bugs here and there in the JIT compiler, as well as trying to advance the JIT expression optimizer. The story of those bugs is interesting but in this post I want to focus on something else, namely some support infrastructure that I think we should have that would make working on MoarVM and spesh/jit in particular much nicer.

There are a bunch of things related to runtime control of spesh and the JIT:
  • jit-bisect.pl is an awesome tool (if I say so myself) that deserves it's own blog post. It can figure out which frame is the first to be miscompiled and cause an execution failure. (Or to be specific, the least miscompile to cause a failure). When a JIT issue is brought to me, I reach for it directly. It should be more powerful:
    • Because it does not know in advance how many frames are compiled in a program, it uses an 'exponential probing' strategy to find the first breaking block, and then binary search between the last two probes. This means that we need to have log(n) tries to find the first failure during probing, and then another log(n) tries to find the exact frame. Usually, failures are much faster than successful runs, so the first phase takes a disproportionate amount of time. It would probably speed up bisects if we could start probing from a number higher than 1.
    • It would be nice if as part of (optional) output, jit-bisect.pl could create a 'replicator' script that would replicate the failure exactly as the bisect script found it. E.g. if it finds that frame 237 block 6 fails, then it should output a script that executes the program with the expression JIT active to exactly that point.
    • We assign sequence number 0 to the first frame to be JIT-compiled, but can't bisect that because we start with sequence number 1.
  • Both spesh and JIT logging should be much more specific. Currently, when MVM_SPESH_LOG is defined, we log all information on all frames we encounter onto the log file, resulting in enormous (easily >20MB) text files. Of those text files, only tiny bits tend to be affected in a way that interests you, and especially as the result of bisect process, only the last frame will be of any interest. So I propose a MVM_SPESH_LOG_FOI (frame of interest) flag that toggles when a frame should be logged or not. Could be an array as well (comma- or semicolon-separated-values). Same goes for the JIT.
  • We can limit the number of frames that passes through spesh using the flag MVM_SPESH_LIMIT, and the number of frames (and basic blocks) that pass through the expression JIT with MVM_JIT_EXPR_LAST_FRAME and MVM_JIT_EXPR_LAST_BB. We cannot do the same for the 'lego' JIT, and we probably should.
  • When MVM_SPESH_BLOCKING is not defined, spesh runs concurrently with the main program, the result of MVM_SPESH_LOG is not repeatable, and MVM_SPESH_LIMIT has as a result no well-defined meaning. Although the jit-bisect.pl program will set MVM_SPESH_LIMIT, it is easy enough to forget. Maybe we should either disable MVM_SPESH_LIMIT without MVM_SPESH_BLOCKING, or MVM_SPESH_LIMIT should imply MVM_SPESH_BLOCKING.
  • Generally we have too much ad-hoc environment setup code that - I feel - does not fit well into the idea of having MoarVM being embedded within another program. Not that embedding interpreters in other programs is a common thing to do anymore, but still. We should probably take that out of general instance setup and install it somewhere else. (MVM_spesh_setup_control_flags?)
Then there's more ambitious stuff, that still falls under 'housekeeping':
  • Currently the expression tree is backed by an array of integers, some of which identify the nodes and some of which are internal pointers (relative to the base of the array). There is an additional info array of structures that has to be explicitly maintained and that holds some information (result type and size, mostly). This is rather annoying when operating on the tree structure (as the optimizer does) because both arrays have to be kept synchronized. Also, these info values are only defined for nodes of the tree. We can reasonably fit all information stored in an info node in 32 bits (if we're packing) or 64 bits (if we're sloppy), even together with the node value. Doing this should save memory, be better for caches, and make tree manipulation easier.
  • I want to have explicit cleanup of the spesh worker thread (if requested by program arguments). Not having this makes ASAN and valgrind somewhat less useful, because they will falsely report in-progress memory as being leaked. To do this we'd need to wait, during shutdown, for the worker thread to be stopped, which means that we also need to have a signal to indicate that it has been stopped.
  • Sometimes, the JIT can get into a confused state and panic, bringing down the process. This is great for debugging and awful for users. I feel like we might be able to improve this - we can, during JIT graph construction, bail out without crashing (if we find constructs we can't JIT yet). We may be able to extend and generalize this. (If we find that we're in an illegal state entirely, that should probably still crash).
  • The legacy ('lego') JIT currently works with a two phase process, first building a JIT graph, then compiling it. This is redundant - without loss of functionality, the lego JIT could be reduced to a single phase (reducing the number of internal structures quite dramatically). This is also something that I'm not totally sure of whether we should do, because we may want having an explicit preparation phase eventually (e.g. when we start doing multiple basic blocks per expression tree).
And there's more ambitious stuff that would fall under optimizations and general functionality improvements:
  • MoarVM in some places assumes that we have 'easy' access to the current instruction pointer, e.g. to lookup dynamic variables or the scope of the current CATCH block. This is logical for an interpreter but problematic for the JIT. We currently need to maintain a pointer to the last entered basic block to approximate it, i.e. a memory store per block. This is expensive, and it's not often needed:
    • We already have to maintain a 'return pointer' for JIT frames that are high up the call stack, so in those cases the additional stores are redundant.
    • In case we are interested in the current position of the current frame, the current position is already on the stack (as the return address of the function call), and we can read it from stack-walking.
    • Alternatively, we can replace calls to functions that are interested in the current position to calls that receive that position explicitly. However, I kind of prefer using stack walking since we'll need the return pointer anyway. Also, it doesn't work for anything that might implicitly throw (MVM_exception_throw_adhoc)
  • The register allocator can be much better in a bunch of (simple) ways:
    • We never need to spill a constant value to memory, since we can always just insert a copy instead.
    • It is probably not necessary to insert individual restores before usages in the same basic block, especially for usages prior to the spilling point.
    • We sometimes spill values that we didn't really need to spill, since we're already storing them to memory as part of the basic block.
  • We have a kind-of hacky way to ensure that our 3-instruction code (c = op(a,b)) is converted to x86 2-instruction code (b = op(b,a)). This should be something that the register allocator can solve, or a phase after the register allocator, but it is currently being handled during code generation.
There is definitively more out there I want to do, but this should be enough to keep me busy for a while. And if anyone else wants to take a try at any of these, they'd be very welcome to :-).

donderdag 25 mei 2017

Function call return values

Hi there, it's been about a month since last I wrote on the progress of the even-moar-jit branch, so it is probably time for another update.

Already two months ago I wrote about adding support for function calls in the expression JIT compiler. This was a major milestone as calling C functions is essential for almost everything that is not pure numerical computing. Now we can also use the return values of function calls (picture that!) The main issue with this was something I've come to call the 'garbage restore' problem, by which I mean that the register allocator would attempt to 'restore' an earlier, possibly undefined, version of a value over a value that would result from a function call.

This has everything to do with the spill strategy used by the compiler. When a value has to be stored to memory (spilled) in order to avoid being overwritten and lost, there are a number of things that can be done. The default, safest strategy is to store a value to memory after every instruction that computes it and to load it from memory before every instruction that uses it. I'll call this a full spill. It is safe because it effectively makes the memory location the only 'true' storage location, with the register being merely temporary caches. It can also be somewhat inefficient, especially if the code path that forces the spill is conditional and rarely taken. In MoarVM, this happens (for instance) around memory barriers, which are only necessary when creating cross-generation object references.

That's why around function calls the JIT uses another strategy, which I will call a point spill. What I mean by that is that the (live) values which could be overwritten by the function call are spilled to memory just before the function call, and loaded back into their original registers directly after. This is mostly safe, since under normal control flow, the code beyond the function call point will be able to continue as if nothing had changed. (A variant which is not at all safe is to store the values to memory at the point, and load them from memory in all subsequent code, because it isn't guaranteed that the original spill-point-code is reached in practice, meaning that you overwrite a good value with garbage. The original register allocator for the new JIT suffered from this problem).

It is only safe, though, if the value that is to be spilled-and-restored is both valid (defined in a code path that always precedes the spill) and required (the value is actually used in code paths that follow the restore). This is not the case, for instance, when a value is the result of a conditional function call, as in the following piece of code:

1:  my $x = $y + $z;
2:  if ($y < 0) {
3:      $x = compute-it($x, $y, $z);
4:  }
5:  say "\$x = $x";

In this code, the value in $x is defined first by the addition operation and then, optionally, by the function call to compute-it. The last use of $x is in the string interpolation on line 5. Thus, according to the compiler, $x holds a 'live' value at the site of the function call on line 3, and so to avoid it from being overwritten, it must be spilled to memory and restored. But in fact, loading $x from memory after compute-it would directly overwrite the new value with the old one.

The problem here appears to be that when the JIT decides to 'save' the value of $x around the function call, it does not take into account that - in this code path - the last use of the old value of $x is in fact when it is placed on the parameter list to the compute-it call. From the perspective of the conditional branch, it is only the new value of $x which is used on line 5. Between the use on the parameter list and the assignment from the return value, the value of $x is not 'live' at all. This is called a 'live range hole'. It is then the goal to find these holes and to make sure a value is not treated as live when it is in fact not.

I used an algorithm from a paper by Wimmer and Franz (2010) to find the holes. However, this algorithm relies on having the control flow structure of the program available, which usually requires a separate analysis step. In my case that was fortunately not necessary since this control flow structure is in fact generated by an earlier step in the JIT compilation process, and all that was necessary is to record it. The algorithm itself is really simple and relies on the following ideas:

  • If a value is defined as the result of a computation, that same (version) of the value cannot be used in code that precedes it.
  • If a value is used in a computation, it must have been defined in some code that precedes it (otherwise the program is obviously incorrect).
  • If, at a branch in the code path, a value is used in at least one of the branches (and not defined in it), it must have been defined prior to the branch instruction.
I think it goes beyond the scope of this blog post to explain how it works in full, but it is really not very complicated and works very well. At any rate, it was sufficient to prevent the JIT from overwriting good values with bad ones, and allowed me to finally enable functions that return values, which is otherwise really simple.

When that was done, I obviously tried to use it and immediately ran into some bugs. To fix that, I've improved the jit-bisect.pl script, which wasn't very robust before. The jit-bisect.pl script uses two environment variables, MVM_JIT_EXPR_LAST_FRAME and MVM_JIT_EXPR_LAST_BB, to automatically find the code sequence where the expression compiler fails and compiles wrong code. (These variables tell the JIT compiler to stop running the expression compiler after a certain number of frames and basic blocks. If we know that the program fails with N blocks compiled, we can use binary search between 0 and N to find out which frame is broken). The jit-dump.pl script then provides disassembled bytecode dumps that can be compared and with that, it is usually relatively easy to find out where the JIT compiler bug is.

With that in hand I've spent my time mostly fixing existing bugs in the JIT compiler. I am now at a stage in which I feel like most of the core functionality is in place, and what is left is about creating extension points and fixing bugs. More on that, however, in my next post. See you then!