Friday, July 3, 2015

Intermediate Progress on an Intermediate Representation

In which I write about my mad-scientist approach to building an expression tree and muse about the recursive nature of compilation.

Last week I reported some success in hacking register addressing for the amd64 extended registers into the DynASM assembler and preprocessor. This week I thought I could do better, and implement something very much like DynASM myself. Well, that wasn't part of the original plan at all, so I think that justifies an explanation.

Maybe you'll recall that the original aim of my proposal was to make the JIT compiler generate better code. In this case, we're lucky enough to know what better means: smaller and with less memory traffic. The old JIT had two principal limitations preventing it from achieving this goal: it couldn't dynamically address registers and it had no way to decouple computations from memory access. The first of these problems is what the aforementioned DynASM hackery addresses. The second is the topic for the rest of this post.

To generate good code, a compiler needs to know how values are used and how memory is used. For instance, it is useless to commit a temporary variable to memory, especially if it is used again directly afterwards, and it is especially wasteful to load a value from memory if it already exists in a register. It is my opinion that a tree structure that explicitly represents the memory access and value usage in a code segment (a basic block in compiler jargon) is the best way to discover opportunities for generating efficient code. I call this tree structure the 'expression tree'. This is especially relevant as the x86 architecture, being a CISC architecture, has many ways of encoding the same computation, so that finding the optimal way is not obvious. In a way, the same thing that makes it easy for a human to program x86 makes it more difficult for a compiler.

As a sidenote: programming x86, especially the amd64 dialect, really is easy, and I would suggest that learning it is an excellent investment of time. There are literally hundreds of guides, most of them quite reasonable (although few have been updated for amd64).

It follows that if one wants to generate code from an expression tree one must first acquire or build such a tree from some input. The input for the JIT compiler is a spesh graph, which is a graphical-and-linear representation of MoarVM bytecode. It is very suitable for analysis and manipulation, but it is not so well suited for low-level code generation (in my opinion), because all memory access is implicit, as are relations between values. (Actually, they are encoded using SSA form, but it takes explicit analysis to find the relations). To summarise, before we can compile an expression tree to machine code, we should first compile the MoarVM bytecode to an expression tree.

I think a good way to do that is to use templates for the expression tree that correspond to particular MoarVM instructions, which are then filled in with information from the specific instruction. Using a relatively simple algorithm, computed values from earlier instructions are then associated with their use in later instructions, forming a tree structure. (Actually, a DAG structure, because these computed values can be used by multiple computations). Whenever a value is first loaded from memory - or we know the register values to have been invalidated somehow - an explicit load node is inserted. Similarly, an 'immediate' value node is inserted whenever an instruction has a constant value operand. This ensures that the use of a value is always linked to the most recent computation of it.
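
To make the idea concrete, here is a minimal sketch in C of how such a value-linking pass might work; the node types and names are my own invention, not the actual MoarVM code. A small table maps each MoarVM local to the node that last computed it, and an explicit load node is inserted on first use.

#include <stdio.h>

#define MAX_LOCALS 64
#define MAX_NODES  256

/* Hypothetical node kinds; the real node set is defined by the JIT. */
enum node_op { NODE_LOAD, NODE_CONST, NODE_ADD };

static struct { enum node_op op; int left, right; } nodes[MAX_NODES];
static int num_nodes;

/* computed[i]: index of the node that most recently defined local i
 * in this basic block, or -1 if it has not been defined here yet. */
static int computed[MAX_LOCALS];

static int add_node(enum node_op op, int left, int right) {
    nodes[num_nodes].op    = op;
    nodes[num_nodes].left  = left;
    nodes[num_nodes].right = right;
    return num_nodes++;
}

/* Resolve an instruction operand to a node: reuse the last definition,
 * or insert an explicit load node on first use. */
static int resolve(int local) {
    if (computed[local] < 0)
        computed[local] = add_node(NODE_LOAD, local, -1);
    return computed[local];
}

int main(void) {
    for (int i = 0; i < MAX_LOCALS; i++)
        computed[i] = -1;
    /* 'add_i r0, r1, r2' followed by 'add_i r3, r0, r2': the second
     * instruction links directly to the node computing r0, and the load
     * of r2 is shared, which is what makes the structure a DAG. */
    computed[0] = add_node(NODE_ADD, resolve(1), resolve(2));
    computed[3] = add_node(NODE_ADD, resolve(0), resolve(2));
    printf("built %d nodes\n", num_nodes);
    return 0;
}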

Another aside: the use of 'template filling' is made significantly easier by the use of a linear tree representation. Rather than use pointers, I use indices into the array to refer to child nodes. This has several advantages, realloc safety for one, and trivial linking of templates into the tree for another. I use 64 bit integers for holding each tree node, which is immensely wasteful for the tree nodes, but very handy for holding immediate values. Finally, generating the tree in this manner into a linear array implies that the array can be used directly for code generation - because code using an operand is always preceded by code defining it.
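
As an illustration of the flat format - again a sketch under an invented encoding, not the real node layout - child references are just indices into the same 64-bit array, and since every definition is emitted before its use, a code generator can walk the array front to back:

#include <stdint.h>
#include <stdio.h>

/* Every slot is a 64-bit integer, so a slot can hold a node tag, a child
 * index, or an immediate value.  Indices rather than pointers mean the
 * array can be realloc'd (or a template copied in wholesale) without
 * fixing anything up. */
enum { T_CONST, T_LOAD, T_ADD };

int64_t tree[64];
int     len = 0;

int emit(int64_t tag, int64_t a, int64_t b) {
    int idx = len;
    tree[len++] = tag;
    tree[len++] = a;
    tree[len++] = b;
    return idx;          /* children are referred to by this index */
}

int main(void) {
    /* (add (load 1) (const 42)) as a flat array */
    int load = emit(T_LOAD,  1, 0);
    int cnst = emit(T_CONST, 42, 0);
    int add  = emit(T_ADD,   load, cnst);
    printf("root at index %d, %d slots used\n", add, len);
    return 0;
}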

If you agree with me that template filling is a good method for generating the low-level IR - considering the most obvious alternative is coding the tree by hand - then maybe you'll also agree that a lookup table is the most obvious way to map MoarVM instructions to templates. And maybe you'll agree that hand-writing a linear tree representation can be a huge pain, because it requires you to exactly match nodes to indices. Moreover, because in C one cannot declare the template array inline to a struct declaration - although one can declare a string inline - these trees would either have to be stored in nearly a thousand separate variables, or in a single giant array. For the purpose of not polluting the namespace unnecessarily, the latter solution is preferable.

I'm not sure I can expect my reader to follow me this deep into the rabbit hole. But my narrative isn't done yet. It was clear to me now that I had to use some form of preprocessor to generate the templates (as well as the lookup tables and some runtime usage instructions). (Of course, the language of this preprocessor had to be perl). The last question then was how to represent the template trees. Since these templates could have a tree structure themselves, using the linear array format would've been rather annoying. A lot of people today would probably choose JSON. (That would've been a fine choice, to be honest). Not me; I pick s-expressions. S-expressions are not only trivial to parse (the current implementation costs 23 lines), they are also compact and represent trees without any ambiguity. Using just the tiniest bit of syntactic sugar, I've added macro facilities and let statements. This preprocessor is now complete, but I still need to implement the template filling algorithm, define all the node types for the IR, and of course hook it into the current JIT. With so much still left to do, I'm hoping (but reasonably confident) that this detour of writing an expression template generator will eventually be worth the time. (For one thing, I expect it to make creating the extension API a bit easier).
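
For a feel of what the preprocessor's output could look like, here is a hypothetical example in C; the real node names, template syntax, and table layout are certain to differ. An s-expression template for a single MoarVM instruction is flattened into one shared array, with a lookup table giving each opcode's offset, length, and root node.

#include <stdint.h>

/* Hypothetical node tags and a made-up template for a MoarVM 'add_i'
 * instruction. The s-expression source might read something like:
 *     (template: add_i (add (copy $1) (copy $2)))
 * and the preprocessor would flatten it into one big array plus a
 * lookup table of offsets and lengths, roughly as below. */
enum { T_COPY, T_ADD };

/* One flat array holding every template back to back.  Negative values
 * here stand for 'fill in operand N of the instruction' at template-
 * filling time; everything else is a node tag or a child index
 * relative to the start of the template. */
static const int64_t templates[] = {
    /* add_i, starting at offset 0 */
    T_COPY, -1, 0,      /* node 0: operand 1 of the instruction */
    T_COPY, -2, 0,      /* node 3: operand 2 of the instruction */
    T_ADD,   0, 3,      /* node 6: add of the two nodes above   */
};

/* Lookup table indexed by MoarVM opcode (sketched for one opcode). */
static const struct { int offset, length, root; } template_info[] = {
    /* [MVM_OP_add_i] = */ { 0, 9, 6 },
};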

Next week I plan to finish the IR tree generation and write a simple code generator for it. That code generator will not produce optimal code just yet, but it will demonstrate that the tree structure works, and it will serve to probe difficulties in implementing a more advanced tree-walking code generator. See you then!

Tuesday, June 23, 2015

Adventures in Instruction Encoding

In which I update you on my progress in patching DynASM to Do What I Mean.

Some of you may have noticed that I officially started working on the MoarVM JIT compiler a little over a week ago. I've been spending that time catching up on reading, thinking about the proper way to represent the low-level IR, tiles, registers, and other bits and pieces. At the advice of Jonathan, I also confronted the problem of dynamic register addressing head-on, which was an interesting experience.

As I have mentioned in my earlier posts, we use DynASM for generating x86-64 machine code, something which it does very well. (Link goes to the unofficial rather than the official documentation, since the former is much more useful than the latter). The advantage of DynASM is that it allows you to write snippets of assembly code just as you would for a regular assembler, and at runtime these are then assembled into real machine code. As such, it hides the gory details of instruction encoding from the user. I think it's safe to say using DynASM made developing the JIT dramatically simpler.

However, DynASM as we used it has an important limitation. The x86-64 instruction set architecture specifies 16 general-purpose registers, but the dynamic addressing feature of DynASM (which allows you to specify at runtime which registers are the operands of an instruction) was limited to the 8 registers already present in x86. This is an effect of the way instructions are encoded in x86 - namely, using 3 bits (an octal digit) per register. 3 bits are not enough to specify 16 registers, so how are the extra registers dealt with?

The answer is: using a special bit in a special prefix byte (the REX byte). This byte signifies the use of 64 bit operands or the use of the extended registers. To make matters more difficult, an x86 instruction can use up to three registers, so it is kind of important to know which bit to set, and which not to set. Furthermore, at instruction encoding time you will need to know where that REX byte is, because any number of instruction parameters might have come between the byte that holds the register address and the REX byte. (You might notice I've been talking about bytes a lot by now. Instruction encoding is a byte business).
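
To illustrate where those bits end up, here is a hand-rolled encoding of a single instruction, written from the x86-64 manual rather than taken from DynASM: the fourth bit of each register number has to be split off into the REX prefix, while the low three bits go into the ModRM byte.

#include <stdint.h>
#include <stdio.h>

/* Encode 'add <rm64>, <r64>' by hand.  A register is numbered 0-15; the
 * low three bits go into the ModRM byte and the fourth bit must be
 * tucked into the REX prefix (bit R for the reg field, bit B for rm). */
static void emit_add_reg64(uint8_t *buf, int reg, int rm) {
    uint8_t rex = 0x48                      /* 0100 1000: REX.W = 64-bit */
                | (((reg >> 3) & 1) << 2)   /* REX.R: 4th bit of 'reg'   */
                | ((rm  >> 3) & 1);         /* REX.B: 4th bit of 'rm'    */
    uint8_t modrm = 0xC0                    /* mod=11: register-register */
                  | ((reg & 7) << 3)
                  | (rm  & 7);
    buf[0] = rex;
    buf[1] = 0x01;                          /* ADD r/m64, r64 */
    buf[2] = modrm;
}

int main(void) {
    uint8_t code[3];
    emit_add_reg64(code, 9 /* r9 */, 0 /* rax */);  /* add rax, r9 */
    printf("%02x %02x %02x\n",                      /* prints: 4c 01 c8 */
           (unsigned)code[0], (unsigned)code[1], (unsigned)code[2]);
    return 0;
}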

I finally implemented this by specifically marking REX bytes whenever they are followed by a dynamic register declaration, and then adding in the required bits at address encoding time. This required some coordination between the Lua part and the C part of DynASM, but ultimately it worked out. Patches are here. In due course I plan to backport this branch to LuaJIT. This patch is not entirely complete, though, because the REX byte is not always present when using dynamic registers, only if we use 64 bit operands. Thus, it needs to be conditionally added when using extended registers with 32 bit operands. We don't really expect to use that, though, since MoarVM uses 64 bit almost exclusively, especially on a 64 bit platform.

The importance of all this is that it unblocks the development of a tiler, register selection, register allocation, and in general all the nice stuff of compiler development. In the next weeks, I'll start by developing a prototype tiler, which I can then integrate into the MoarVM JIT. There are still plenty of issues to deal with before that is done, and so I'll just try to keep you up to date.

Finally, if you're going to YAPC::EU, note that I'll be presenting on the topic of JIT compilers (especially in the context of MoarVM and perl6). If this interests you, be sure to check it out.

Sunday, June 7, 2015

Studying

Odds are that if you read my blog, you also read either perl6 weekly or the perl foundation blog, in which case you already know that my grant application has been accepted. Yay! I should really have blogged somewhat earlier about that, but I've been very busy the last few weeks. But for clarity, that means I start working on the MoarVM JIT compiler ('expression compiler') on the 14th of June, and I hope to have reached a first milestone 5 weeks later, and a number of 'inchstones' before that.

In the meantime, I have just merged a branch that timotimo and I worked on to JIT-compile a larger number of frames containing exceptions. That caused some problems because, as it turns out, there is more than one way to throw and catch exceptions in MoarVM. To be specific, catching an exception sometimes means invoking a handler routine and sometimes means jumping to a specific point within a frame. To make matters more confusing, sometimes that means we descend in the stack (call from our current frame) and sometimes that means we ascend, and sometimes we just jump around in our current frame. And to top it off, perl6 (as one of the few languages I know of) allows you to resume an exception, like so:


sub foo() {
    try {
        say "TRY";
        die "DIE";
        say "RESUMED";
        CATCH {
            say "CATCH";
            default {
                $_.resume;
            }
        }
    }
}

loop (my int $i = 0; $i < 500; $i++) {
    foo();
}
say "FINISHED";

There is no question to me that this is super-cool, of course, but it can be a bit tricky to implement. For the JIT, it meant storing a pointer to a label *just* after the current 'throwish' operation in the exception body. (A 'throwish' op is a VM-level operation that throws an exception and thus causes flow control to jump around relatively unpredictably. This is also why (some) C++ programmers dislike exceptions, by the way). This way, when an exception is resumed, the JIT trampolining mechanisms will ensure that control is resumed where we left off. Unless of course we have jumped to a handler in the same frame, in which case we jump there directly. And because timotimo has implemented the auxiliary ops that come with exception handling we can now JIT quite a few more frames.
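
In case it helps to visualize the bookkeeping, here is a rough sketch with invented structure and function names - not the actual MoarVM API - of the label mapping for throwish ops: record the machine-code address just past each such op, and when an exception is resumed, look up that address and jump there via the trampoline.

#include <stdint.h>
#include <stddef.h>

typedef struct {
    int32_t  bytecode_offset;   /* position of the throwish op          */
    void    *reentry_label;     /* JIT code address just after that op  */
} ResumePoint;

/* Find where to re-enter JIT code when an exception thrown at 'offset'
 * is resumed; NULL would mean 'fall back to the interpreter'. */
void *find_reentry(const ResumePoint *points, size_t n, int32_t offset) {
    for (size_t i = 0; i < n; i++)
        if (points[i].bytecode_offset == offset)
            return points[i].reentry_label;
    return NULL;
}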

Anyway, there are two reasons this post has its name. The first is of course that I'm still busy finishing the final week of this study year, which is stressful in itself. And the second reason is that I've started reading up on articles concerning JIT compilation and code generation. A short list of these includes:
Finally, I've submitted a talk proposal to YAPC::EU 2015 to discuss all the interesting bits of JIT compilation. I hope it will be accepted because (I think) there is no shortage of interesting stuff to talk about.

Sunday, April 19, 2015

Grant Application

Hi everybody, I just wanted to point you to my Grant Application at the Perl Foundation to develop a better JIT compiler. Feedback is highly appreciated.

Tuesday, March 24, 2015

Advancing the JIT compiler

It is customary to reintroduce oneself after a long blogging hiatus like the one I've had here for some 6 months. So here I am, ready to talk JIT compilers again. Or compilers in general, if you wish it. Today I want to discuss possible ways to improve the MoarVM JIT compiler for great performance wins! Or so I hope.

It is no secret (at least not to me) that the MoarVM JIT compiler isn't as good as, say, any of {v8, luajit, HotSpot, Spidermonkey, PyPy}. There are many reasons why, and it is helpful to consider the structure of the Rakudo-MoarVM 'stack'.
So to help you do that I've drawn this picture:
A picture may be worth more than a thousand words, but this is just a diagram, and I can summarize the important points as:
  1. MoarVM is the 'end of the pipe'. Many optimizations may be applied to your program before your code ever reaches this layer. Many more optimizations cannot be proven safe due to perl6 semantics, however, which is why spesh is a speculative optimizer.
  2. The JIT only kicks in after spesh has applied several optimizations and in the current system only after spesh has generated optimized bytecode. (This is a flaw that we are looking to correct). This is an important point that is often missed, possibly because in many other systems the optimization and JIT compilation steps are not so clearly delineated.
  3. Within MoarVM, 6model plays a central role. 6model is the name for Rakudo's implementation of the perl6 (meta-) object system. Many operations such as array indexes and hash lookups are implemented as 6model operations. In general, these ultimately resolve to 6model 'representation' objects, which are implemented in C. Many steps in the execution of a perl6 program are really (virtually) dispatched to these 'REPR' objects.
So what benefit can an improved JIT compiler bring? The JIT has an advantage over all other components in that it has, on the one hand, all the information of the layers above and, on the other hand, the full flexibility of the underlying machine. Many tricks are easy and possible at the level of machine code which are awkward (and compiler-specific) to express at higher levels. Generating code at runtime is more-or-less a superpower, which is probably what LISP fans will tell you too.

The current JIT cannot really take advantage of this power because it is really very simple. All operations in a given piece of code are compiled independently and stitched together to form the entire subroutine. You can think of it as a 'cut-and-paste' compiler. Because all pieces are independent, each piece has to load and store values directly to memory, causing unnecessary memory traffic. There is also no way to reorder operations or coalesce operations into a smaller number of instructions. Indeed most forms of machine-level optimization are virtually impossible.
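
The memory-traffic problem is easiest to see in C. The snippet below is only an illustration of the principle, with an invented two-instruction sequence: the 'cut-and-paste' version round-trips every intermediate value through the in-memory register file, while an expression-based generator can keep it in a CPU register.

#include <stdint.h>

/* What the 'cut-and-paste' approach effectively produces for the two
 * bytecode ops 'add_i r2, r0, r1; add_i r3, r2, r1': every op loads its
 * operands from the in-memory register file and stores its result back,
 * even when the very next op wants that result again. */
void cut_and_paste(int64_t *reg) {
    {   /* add_i r2, r0, r1 */
        int64_t a = reg[0], b = reg[1];
        reg[2] = a + b;                 /* result forced out to memory */
    }
    {   /* add_i r3, r2, r1 */
        int64_t a = reg[2], b = reg[1]; /* ...and immediately reloaded */
        reg[3] = a + b;
    }
}

/* What an expression-based code generator could emit instead: one load
 * per value, one store per result that is still needed. */
void expression_jit(int64_t *reg) {
    int64_t a = reg[0], b = reg[1];
    int64_t t = a + b;
    reg[2] = t;           /* only if r2 is actually used later */
    reg[3] = t + b;
}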

I propose to replace - step-by-step - our current 'cut and paste JIT' with a more advanced 'expression JIT'. The idea is very simple - in order to generate good code for a machine it is first necessary to reify the operations on that machine. Compilation is then the process of ordering these operations in a (directed acyclic) graph and selecting instructions to perform those operations. There is nothing new about that idea, 'real' compilers have been written that way for ages, and there is very little preventing us from doing the same. (Currently, DynASM doesn't support dynamic selection of registers other than those present in x86, which are rather limited. This is something that can be fixed, though).


To be sure, I think we should leave large parts of the current JIT infrastructure intact, with maybe an improvement here or there. And I think this 'expression JIT' can be introduced piecemeal, with the current 'cut-and-paste' code snippets serving as a template. The expression JIT will function as another node type for the current JIT, although probably most code will eventually be converted to it. When this is done it opens up a lot of possibilities:
  • Load and store elision, and in general 'optimal' code generation by register selection and allocation. We are aided in a way by the fact that x86-64 has become more 'RISC-y' over the years, which makes instruction selection simpler.
  • JIT compilation of REPR object methods. The 'expression tree' on which the JIT operates is architecture-independent, so it's feasible to have REPR objects generate a tree for specific operations, effectively 'inlining' such operations into the compilation frame. In real terms, this may translate a high-level array access into a simple pointer reference.
  • NativeCall calls may be more easily converted into JIT-level C calls, which is ultimately the same thing, just much more efficient.
  • Many optimizations become much simpler, even if they were possible before. For example, the 'small bigint' optimization, which lets us execute small integer operations with big integer semantics, courtesy of cheap overflow checking at the machine level (see the sketch after this list). Or possibly transforming tight operation loops into equivalent SIMD instructions, although that one is in fact rather more involved.
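
As promised above, here is a sketch of the small-bigint idea in C. The fallback and the exact strategy are invented for illustration; the JIT would emit the equivalent machine instructions (an add followed by a jump-on-overflow) directly rather than call a C builtin.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical slow path: in MoarVM this would promote both operands
 * to a real big-integer representation. */
static int64_t bigint_fallback_add(int64_t a, int64_t b) {
    fprintf(stderr, "overflow: promoting %lld + %lld to bigint\n",
            (long long)a, (long long)b);
    return 0; /* placeholder */
}

static int64_t smallint_add(int64_t a, int64_t b) {
    int64_t result;
    /* GCC/Clang builtin; the hardware sets the overflow flag for free. */
    if (__builtin_add_overflow(a, b, &result))
        return bigint_fallback_add(a, b);   /* rare, slow path   */
    return result;                          /* common, fast path */
}

int main(void) {
    printf("%lld\n", (long long)smallint_add(40, 2));
    return 0;
}
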
To conclude, I think the current MoarVM JIT compiler can be radically improved. I also think that this is nothing groundbreaking intellectually speaking and that there are few barriers to implementation. And finally, that with these improvements MoarVM will have a strong base for future development. I'd love to hear your thoughts.

Monday, August 18, 2014

Some Lessons

It's wrap-up time! This will be my final post in the GSoC 2014 program, because the GSoC 2014 program ends at 19:00 UTC today, so that's early enough that I won't be writing another post today. It's hard enough to write a post every week as it is. So what I'd like to do today is wrap up what we've reached and what I've learned in the past three months.

Let's start with a summary of what has been reached. MoarVM has an optional JIT compiler based on spesh, which is the optimization framework I've mentioned often by now. The JIT compiled code cannot handle all instructions yet, but it can handle loops, invocations, perl6 container ops, some regular expressions, expressions and deoptimisation. Reported benefits are anywhere between nothing and 50% faster code, which is awesome. I assume very few people have a lot of truly computation-intensive code running on rakudo perl6 yet, but I suspect that is where it would help most.

The code generator that we currently have is not awesome. It is in fact pretty naive. Values that are computed in a CPU register must be stored in the MoarVM register space before we can use them in another instruction. We make no attempt to detect loops and allocate registers, we don't do common sub-expression elimination, nor do we make any attempt to dynamically select different instructions.

Well, one aspect of it is awesome, and that is that by using DynASM, there is a direct correspondence between the code we write and the code that is executed. I cannot over-emphasize how user-friendly this really is. Writing a compiler as if I'm writing regular x64 assembly is a great benefit and allows for much faster development times.

Anyway, this discussion of the code generator brings me to another topic I'd like to discuss, and that's why we wrote our own code generator rather than borrowing the one from LuaJIT (or LLVM, or v8 for that matter). I think this is a very fair question and so it deserves some discussion. The first point I'd like to call out is the idea that LLVM (or any of the alternatives) is a magic black box into which you enter intermediate-level code, wait, and out rolls totally optimal machine code that will run faster than C. I'm obviously totally biased but based on my limited experience I'd say that's not how this works.

First of all, code generation is hardly the most difficult part of doing a JIT compiler. By far most of the complexity in this project has come from integration with the existing MoarVM infrastructure, and that's where the most complex bugs have come from, too. (I've written at length about such bugs already). For example, MoarVM call frame mechanics involve a lot of bookkeeping that simply doesn't exist for C call frames. Likewise, the JIT compiler must make objects available and understandable for the garbage collector just as the interpreter does (or change the GC to understand how to read a JITted frame, not an easy task by itself). Exception throwing and handling form a similar challenge, one which I've talked about at length in another post.

Second, using another project's code generator comes with specific costs of its own. A great example of this is the FTL JIT that has recently been added to JavascriptCore (the javascript interpreter that is embedded in webkit). As described here, the JavascriptCore team essentially use LLVM as a plug-in code generation backend after converting their intermediate representation into LLVM IR. And to be fair to them, this seems to cause quite a speedup. However this also comes at a cost. For one thing, LLVM optimization and compilation is sufficiently slow for it to be moved into a separate thread (so that it doesn't slow down execution of the unoptimised code). For another example, the garbage collector used by JavascriptCore is semi-conservative (i.e. the stack is scanned conservatively, whereas heap objects are scanned precisely), and to avoid the retention of lots of dead objects unused portions of the stack have to be zeroed explicitly. For a third example, they apparently had to go to great lengths to deal with on-stack replacement, something we handle more simply. In short, using LLVM is costly.


As for using the LuaJIT or v8, I'd argue that there is no truly language-independent compiler. Both v8 and LuaJIT heavily used properties of their respective languages to optimise code, properties which do not hold for perl6. And similarly these languages may have quirks or unusual properties which perl6 does not. An example of these is the truth-value of NaN values. For perl6 these are true because NaN is not equal to 0.0, and for javascript they are false.

I would also note that impressive as the results of the aforementioned projects may be, what they do isn't magic. Algorithms to do code generation are really quite well-known and researched, and there is no reason why we couldn't - in time - implement them. More important than code generation however seems to be optimizing memory access, as it is my suspicion that this is where most of the time is actually spent (Timotimo is actually looking into this as I'm writing this). Because none of these backends know anything about perl6, none of them can use any of the special properties of perl6 to optimize code, which we can (and do).

In conclusion, the choice of a code generation backend is a trade-off like any other. There is no magical silver bullet that will make your code run fast. You could argue against writing our own (even using the help of DynASM) and such an argument could be fair. I would argue as I have above that the trade-off was such that using DynASM was right, especially given the ease of use that it provides.

So what are the next steps for the MoarVM JIT? The first thing is to keep adding instructions while trying to remain bug-free. This isn't as easy as it looks because it seems like every third instruction we add enables the execution of poorly-tested branches that contain new bugs :-). After that, I'd still like to add full register selection for x64 by modifying DynASM. This won't be a very simple task for me since I know very little of x64 instruction encoding or of lua. With that, we can add proper instruction selection and low-level optimization, which should result in a speed-up. However, the most important work is probably to be done in spesh by transforming polymorphic code into simpler, monomorphic, efficient code. For instance, the instruction to test the truth-value of an iterator could be transformed into an iterator-specific instruction that would be much simpler, possibly eliminating a function call. Such transformations should also benefit users who do not use the JIT compiler yet. It seems to me spesh is the place where MoarVM can gain most.

Monday, August 11, 2014

Names and Labels

I compensate for the infrequency of my blog posts with their length. Or the other way around. Anyway, I have some good news to report, so let's do that first. The JIT branch of MoarVM (moar-jit), which has been my work for the last few months, has been merged into the main branch just this morning, after we've found it not to crash and burn on building nqp, rakudo, and running spectests. This means that, with some fiddling, you too can now run JIT for MoarVM, NQP, and Perl 6. World domination for camelia!

Well, unless we find new bugs, of course. We've had plenty of these the last few weeks, and most of them had the same cause, which can be summarized simply: semantic use of the interpreter's program counter is not really compatible with JIT compilation. But maybe that requires some elaboration.

Simply put, while interpreting, MoarVM sometimes needs to know exactly where in the program the interpreter is. This happens, for example, when using handlers, which is the mechanism MoarVM uses for try-catch constructs. In the following frame, for example, lines 1 and 4 would be the start and end of the handler, and the block within CATCH would be invoked if do-something-dangerous() actually threw something. On the other hand, if another-dangerous-thing() were to throw something, the CATCH block should clearly not catch it.

0: sub a-handler-frame() {
1:     try {
2:         do-something-dangerous();
3:         CATCH { say($_); }
4:     } 
5:     another-dangerous-thing();
6: }

To determine what should be done in the event that either of these dangerous functions raises an error, MoarVM inspects the current program counter of the frame, and determines whether or not the try block applies. And this works very well in practice, so long as MoarVM is actually interpreting code. When the same code is compiled - as in, JIT compiled - the interpreter is merely used for entering into the JIT code, and its program counter never changes as we move through the frame. So the interpreter's instruction pointer can no longer be used to tell where we are. As a result, exception handling didn't quite go smoothly.
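
For reference, handler lookup in the interpreter boils down to a range check on the program counter, roughly like the sketch below (the field and function names are mine, not MoarVM's):

#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint32_t start_offset;   /* first instruction covered by the try    */
    uint32_t end_offset;     /* first instruction no longer covered     */
    uint32_t goto_offset;    /* where to jump / what to invoke on catch */
} Handler;

const Handler *find_handler(const Handler *handlers, size_t n, uint32_t pc) {
    for (size_t i = 0; i < n; i++)
        if (pc >= handlers[i].start_offset && pc < handlers[i].end_offset)
            return &handlers[i];
    return NULL;   /* no handler in this frame; keep unwinding */
}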

A similar problem existed with dynamic variable caches. These are used to make lexical lookup of dynamic variables cheaper by caching them locally in the frame. It is not normally necessary to know where the interpreter is in the frame, except when we're dealing with inlined frames. Put shortly, before inlining, the inlined frames may have had different ideas about the contents of the dynamic lexical cache. Because of this, MoarVM needs to know in which of the inlined frames we're actually working to figure out which cache to use. (NB: I'm not totally sure this explanation is 100% correct. Please correct me if not). So again, when running the JIT, MoarVM couldn't figure this out and would use the wrong cache. By the way, I should note that jnthn has been extremely helpful in finding the cause of this and several other bugs.

Because the third time is a charm (as I think the saying goes), another, very similar, version of the same bug appeared just a bit earlier with deoptimization. As I had never implemented any operation that caused a 'global deoptimization', I naively thought I wouldn't have to deal with it yet. After all, global deoptimization means that all frames in the call stack would have to be deoptimized. And as you may have guessed, to do that correctly, you'll have to know precisely where you are in the deoptimising frame. This one was not only found, but also very helpfully fixed by jnthn.

All this implied that it became necessary for me to just solve this problem - where are we in the JIT code - once and for all. And in fact, there already existed parts of a solution to this problem. After all, the JIT already used a special label to store the place we should return to after we'd invoked another frame. So to determine where we are in the program, all we need to do is map those pointers back to the original structures that refer to them - that is to say, the inlined frames, the handlers, and the deoptimization structures. So it was done, just this week. I'd be lying if I said that this went without a hitch, because especially exception handling presented some challenges, but I think this morning I've ironed out the last issue. And because today is - according to the GSoC 2014 timeline - the 'soft pencils-down date' - in other words, the deadline - we felt it was time to merge moar-jit into master and let you enjoy my work.
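
The reverse mapping itself is conceptually simple; a sketch (with illustrative names only, not the real MoarVM structures) might look like this: every recorded label pairs a JIT code address with the bytecode position it stands for, and a scan over that table recovers 'where we are' from the address we would return to.

#include <stddef.h>

typedef struct {
    void     *label;            /* address inside the JIT-compiled code */
    unsigned  bytecode_offset;  /* the position it stands for           */
} LabelMapping;

/* Given the JIT-code address we will return to, recover the bytecode
 * position, from which the usual handler / inline / deopt lookups can
 * proceed exactly as they do in the interpreter. */
int jit_code_position(const LabelMapping *map, size_t n, void *return_addr) {
    for (size_t i = 0; i < n; i++)
        if (map[i].label == return_addr)
            return (int)map[i].bytecode_offset;
    return -1;   /* not a known label; should not happen */
}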

And people have! This gist shows the relative speedup caused by spesh and JIT compilation in an admittedly overly simple example. As a counterexample, the compilation of CORE.setting - the most time-intensive part of building rakudo - seems to take slightly longer with the JIT than without it. Still, tight and simple loops such as these do seem to occur, so I hope that in real-world programs the MoarVM JIT will give better performance. Quite possibly not as good as the JVM or other systems, certainly not as good as it could be, but better than it used to be.

There is still quite a lot to be done, of course. Many instructions are not readily compiled by the JIT. Fortunately, many of these can be compiled into function calls, because this is exactly what they are for the interpreter, too. Many people, including timotimo and jnthn, have already added instructions this way. Some instructions may have to be refactored a bit and I'm sure we'll encounter new bugs, but I do hope that my work can be a starting point.