Monday, August 18, 2014

Some Lessons

It's wrap-up time! This will be my final post in the GSoC 2014 program: the program ends at 19:00 UTC today, which is early enough that I won't be writing another post before the deadline. It's hard enough to write a post every week as it is. So what I'd like to do today is wrap up what we've achieved and what I've learned over the past three months.

Let's start with a summary of what has been achieved. MoarVM now has an optional JIT compiler based on spesh, which is the optimization framework I've mentioned often by now. The JIT-compiled code cannot handle all instructions yet, but it can handle loops, invocations, perl6 container ops, some regular expressions, expressions, and deoptimisation. Reported benefits range anywhere from nothing to 50% faster code, which is awesome. I assume very few people run truly computation-intensive code on rakudo perl6 yet, but I suspect that is where it would help most.

The code generator that we currently have is not awesome. It is in fact pretty naive. Values that are computed into a CPU register must be stored back in the MoarVM register space before we can use them in another instruction. We make no attempt to detect loops and allocate registers accordingly, we don't do common subexpression elimination, nor do we make any attempt to dynamically select different instructions.
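To make the cost of this concrete, here is a tiny sketch in C of what that naive strategy implies. These are hypothetical structures and names, not MoarVM's actual code: the point is only that every op loads its operands from the frame's register file in memory and stores its result straight back, so no value survives in a CPU register between two ops.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the interpreter's register space. */
typedef struct { int64_t r[8]; } RegFile;

/* "add r0, r1, r2" followed by "mul r3, r0, r0", compiled naively:
 * the intermediate value in r0 makes a round-trip through memory
 * instead of staying in a CPU register between the two ops. */
static void compiled_ops(RegFile *f) {
    int64_t tmp = f->r[1] + f->r[2]; /* load operands, compute */
    f->r[0] = tmp;                   /* store the result back immediately */
    tmp = f->r[0] * f->r[0];         /* reload the value we just stored */
    f->r[3] = tmp;
}
```

A smarter code generator would notice that the value of r0 is still live in a CPU register and skip the reload; ours does not, yet.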

Well, one aspect of it is awesome, and that is that by using DynASM, there is a direct correspondence between the code we write and the code that is executed. I cannot over-emphasize how user-friendly this really is. Writing a compiler as if I'm writing regular x64 assembly is a great benefit and allows for much faster development times.

Anyway, this discussion of the code generator brings me to another topic I'd like to discuss, and that's why we wrote our own code generator rather than borrowing the one from LuaJIT (or LLVM, or v8 for that matter). I think this is a very fair question and so it deserves some discussion. The first point I'd like to call out is the idea that LLVM (or any of the alternatives) is a magic black box into which you feed intermediate-level code, wait, and out rolls totally optimal machine code that will run faster than C. I'm obviously totally biased, but based on my limited experience I'd say that's not how this works.

First of all, code generation is hardly the most difficult part of doing a JIT compiler. By far most of the complexity in this project has come from integration with the existing MoarVM infrastructure, and that's where the most complex bugs have come from, too. (I've written at length about such bugs already). For example, MoarVM call frame mechanics involve a lot of bookkeeping that simply doesn't exist for C call frames. Likewise, the JIT compiler must make objects available and understandable for the garbage collector just as the interpreter does (or change the GC to understand how to read a JITted frame, not an easy task by itself). Exception throwing and handling form a similar challenge, one which I've talked about at length in another post.

Second, using another project's code generator comes with specific costs of its own. A great example of this is the FTL JIT that has recently been added to JavaScriptCore (the JavaScript engine embedded in WebKit). As described here, the JavaScriptCore team essentially uses LLVM as a plug-in code generation backend after converting their intermediate representation into LLVM IR. And to be fair to them, this seems to cause quite a speedup. However, this also comes at a cost. For one thing, LLVM optimization and compilation is sufficiently slow that it had to be moved into a separate thread (so that it doesn't slow down execution of the unoptimised code). For another, the garbage collector used by JavaScriptCore is semi-conservative (i.e. the stack is scanned conservatively, whereas heap objects are scanned precisely), and to avoid the retention of lots of dead objects, unused portions of the stack have to be zeroed explicitly. For a third example, they apparently had to go to great lengths to deal with on-stack replacement, something we handle more simply. In short, using LLVM is costly.

As for using LuaJIT or v8, I'd argue that there is no truly language-independent compiler. Both v8 and LuaJIT heavily use properties of their respective languages to optimise code, properties which do not hold for perl6. And similarly, these languages may have quirks or unusual properties which perl6 does not. An example of this is the truth-value of NaN: for perl6 NaN is true, because it is not equal to 0.0, while for javascript it is false.
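That NaN divergence is easy to demonstrate. The sketch below uses hypothetical helper names (these functions are mine, not either VM's API) to show the two rules side by side: perl6 derives a number's truth from "is it unequal to 0.0?", while javascript treats NaN itself as falsy. Since NaN compares unequal to everything, including 0.0, the two rules disagree on NaN and agree everywhere else.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>

/* perl6-style: a number is true iff it is not equal to 0.0.
 * NaN != 0.0 holds (NaN is unequal to everything), so NaN is true. */
static bool perl6_style_truth(double n) { return n != 0.0; }

/* javascript-style: NaN is explicitly falsy, as is 0.0. */
static bool javascript_style_truth(double n) { return !isnan(n) && n != 0.0; }
```

A code generator borrowed from v8 would bake in the second rule, and we would have to undo that work for perl6.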

I would also note that, impressive as the results of the aforementioned projects may be, what they do isn't magic. Algorithms for code generation are quite well known and well researched, and there is no reason why we couldn't - in time - implement them. More important than code generation, however, seems to be optimizing memory access, as it is my suspicion that this is where most of the time is actually spent (Timotimo is actually looking into this as I'm writing this). Because none of these backends knows anything about perl6, none of them can use any of the special properties of perl6 to optimize code, which we can (and do).

In conclusion, the choice of a code generation backend is a trade-off like any other. There is no magical silver bullet that will make your code run fast. You could argue against writing our own (even using the help of DynASM) and such an argument could be fair. I would argue as I have above that the trade-off was such that using DynASM was right, especially given the ease of use that it provides.

So what are the next steps for the MoarVM JIT? The first thing is to keep adding instructions while trying to remain bug-free. This isn't as easy as it looks, because it seems like every third instruction we add enables the execution of poorly-tested branches that contain new bugs :-). After that, I'd still like to add full register selection for x64 by modifying DynASM. This won't be a very simple task for me, since I know very little of x64 instruction encoding or of Lua. With that, we can add proper instruction selection and low-level optimization, which should result in a speed-up. However, the most important work is probably to be done in spesh, by transforming polymorphic code into simpler, monomorphic, efficient code. For instance, the instruction to test the truth-value of an iterator could be transformed into an iterator-specific instruction that would be much simpler, possibly eliminating a function call. Such transformations should also benefit users who do not use the JIT compiler yet. It seems to me that spesh is the place where MoarVM has the most to gain.
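The iterator example can be sketched roughly as follows. This is a hypothetical illustration, not MoarVM's real object model or spesh's real output: the generic truth-test op has to dispatch through the object's type table on every call, while a specialized version, valid once spesh has proven the operand's type, reads the relevant field directly and eliminates the indirect call.

```c
#include <assert.h>

typedef struct Obj Obj;
typedef struct { int (*is_true)(Obj *); } TypeOps;
struct Obj { const TypeOps *ops; int items_left; };

/* Polymorphic: must follow a function pointer on every call. */
static int generic_istrue(Obj *o) { return o->ops->is_true(o); }

/* Monomorphic: valid once the type is known to be our iterator,
 * so the truth test collapses to a single field load and compare. */
static int iter_istrue(Obj *o) { return o->items_left > 0; }
```

Both the interpreter and the JIT would benefit from such a rewrite, which is why spesh is the right layer for it.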

Monday, August 11, 2014

Names and Labels

I compensate for the infrequency of my blog posts with their length. Or the other way around. Anyway, I have some good news to report, so let's do that first. The JIT branch of MoarVM (moar-jit), which has been my work for the last few months, was merged into the main branch just this morning, after we found that it doesn't crash and burn while building nqp and rakudo and running the spectests. This means that, with some fiddling, you too can now run JIT for MoarVM, NQP, and Perl 6. World domination for camelia!

Well, unless we find new bugs, of course. We've had plenty of those in the last few weeks, and most of them had the same cause, which can be summarized simply: semantic use of the interpreter's program counter is not really compatible with JIT compilation. But maybe that requires some elaboration.

Simply put, while interpreting code, MoarVM sometimes needs to know exactly where in the program the interpreter is. This happens, for example, when using handlers, which are the mechanism MoarVM uses for try-catch constructs. In the following frame, for example, lines 1 and 4 mark the start and end of the handler, and the block within CATCH would be invoked if do-something-dangerous() actually threw something. On the other hand, if another-dangerous-thing() were to throw something, the CATCH block should clearly not catch it.

0: sub a-handler-frame() {
1:     try {
2:         do-something-dangerous();
3:         CATCH { say($_); }
4:     } 
5:     another-dangerous-thing();
6: }

To determine what should be done in the event that either of these dangerous functions raises an error, MoarVM inspects the current program counter of the frame and determines whether or not the try block applies. This works very well in practice, as long as MoarVM is actually interpreting code. When the same code is compiled - as in, JIT-compiled - the interpreter's instruction pointer is merely used for entering the JIT code, and it never changes as we move through the frame. So the interpreter instruction pointer can no longer be used to tell where we are. As a result, exception handling didn't quite go smoothly.
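To illustrate why the lookup breaks, here is a sketch of how an interpreter can resolve a thrown exception from the program counter alone. The structures and numbers are hypothetical, not MoarVM's own: each handler covers a half-open range of instruction positions, and the innermost handler whose range contains the current position wins. The bug described above follows directly: in a JIT-compiled frame the stored position never advances, so this lookup always sees the frame's entry point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A handler guards the instructions in [start, end). */
typedef struct { uint32_t start, end; } Handler;

static const Handler *find_handler(const Handler *tab, size_t n, uint32_t pc) {
    const Handler *match = NULL;
    for (size_t i = 0; i < n; i++) {
        if (tab[i].start <= pc && pc < tab[i].end)
            match = &tab[i]; /* later entries taken as more deeply nested */
    }
    return match;
}
```

With the example frame above, a throw at line 2 finds the try handler, while a throw at line 5 correctly finds nothing.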

A similar problem existed with dynamic variable caches. These are used to make lexical lookup of dynamic variables cheaper by caching them locally in the frame. It is not normally necessary to know where the interpreter is in the frame, except when we're dealing with inlined frames. Put briefly, before inlining, the inlined frames may have had different ideas of the contents of the dynamic lexical cache. Because of this, MoarVM needs to know in which of the inlined frames we're actually working in order to figure out which cache to use. (NB: I'm not totally sure this explanation is 100% correct. Please correct me if not). So again, when running the JIT, MoarVM couldn't figure this out and would use the wrong cache. By the way, I should note that jnthn has been extremely helpful in finding the cause of this and several other bugs.

Because the third time is a charm (as I think the saying goes), another, very similar version of the same bug appeared just a bit earlier, with deoptimization. As I had never implemented any operation that caused a 'global deoptimization', I naively thought I wouldn't have to deal with it yet. After all, global deoptimization means that all frames in the call stack have to be deoptimized. And as you may have guessed, to do that correctly, you have to know precisely where you are in the deoptimising frame. This one was not only found but also very helpfully fixed by jnthn.

All this implied that it became necessary for me to solve this problem - where are we in the JIT code? - once and for all. In fact, parts of a solution already existed: the JIT already used a special label to store the place we should return to after invoking another frame. So to determine where we are in the program, all we need to do is map those pointers back to the original structures that refer to them - that is to say, the inlined frames, the handlers, and the deoptimization structures. And so it was done, just this week. I'd be lying if I said this went without a hitch, because exception handling in particular presented some challenges, but I think this morning I've ironed out the last issue. And because today is - according to the GSoC 2014 timeline - the 'soft pencils-down date', in other words the deadline, we felt it was time to merge moar-jit into master and let you enjoy my work.
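The shape of that mapping can be sketched as follows. The names are invented for illustration, not MoarVM's actual API: the JIT records, for every label where control can re-enter compiled code, the bytecode position it correspondsds to, and a lookup on the current label then recovers "where we are" for the handler, cache, and deoptimization machinery to consume as before.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per re-entry label emitted into the compiled code. */
typedef struct { const void *label; uint32_t bytecode_offset; } LabelMapping;

/* Recover the bytecode position for a machine-code label,
 * or -1 if the label was never recorded. */
static int64_t resolve_position(const LabelMapping *map, size_t n,
                                const void *label) {
    for (size_t i = 0; i < n; i++)
        if (map[i].label == label)
            return map[i].bytecode_offset;
    return -1;
}
```

Because the labels already existed for returning into compiled frames, most of the work was in wiring this mapping into the structures that used to read the instruction pointer.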

And people have! This gist shows the relative speedup from spesh and JIT compilation in an admittedly overly simple example. As a counterexample, the compilation of CORE.setting - the most time-intensive part of building rakudo - seems to take slightly longer with the JIT than without it. Still, tight and simple loops such as the one in the gist do seem to occur, so I hope that in real-world programs the MoarVM JIT will give better performance. Quite possibly not as good as the JVM or other systems, certainly not as good as it could be, but better than it used to be.

There is still quite a lot to be done, of course. Many instructions are not readily compiled by the JIT. Fortunately, many of these can be compiled into function calls, because this is exactly what they are for the interpreter, too. Many people, including timotimo and jnthn, have already added instructions this way. Some instructions may have to be refactored a bit and I'm sure we'll encounter new bugs, but I do hope that my work can be a starting point.