Wednesday, 29 July 2015

Tiles and Compiler Compilers

This week I made progress on the central 'tiling' algorithm used for code generation. I think it makes for an interesting story (and a bit of theory), so I will try to explain it in this post. Unfortunately, I don't think I can explain it very well without resorting to computer-science jargon, but I hope you can forgive me.

What is tiling?
Tiling is the process of breaking an input tree into many pieces. Here the tree is the data structure representing the code that the JIT should compile. As I wrote a few weeks back, these trees are ultimately generated from templates that are mapped to MoarVM opcodes. Let's suppose we have a tree for the expression result = LOCAL[8] + 12, and that it looks like the code below:

(add 
  (load
    (addr
      (local)
      8)
    int_sz)
  (const
     12
     int_sz))

I realize most people are probably not familiar with the syntax used here. These are called s-expressions, and there is fortunately little to know: '(' opens a list, ')' closes a list, and words represent symbols that are reproduced verbatim (as are the numbers). If we act as if the first word in the list represents the name of a function and the rest of the list its arguments, it hopefully becomes clear how the text represents the expression (and you can add LISP to your LinkedIn profile at the same time).

For those who prefer a visual layout (which is most people, I guess), the following graph represents the exact same code:

(Figure: the code fragment under consideration.)

Tiling is necessary because there are many ways to implement this expression in machine code. For instance, every node can be mapped to a single instruction, as in the naive code below. Compare this to the improved code, which is only two instructions long:
Tree:

(add 
  (load
    (addr
      (local)
      8)
    int_sz)
  (const
     12
     int_sz))

Naive code:

mov r0, 12
lea r1, [rbx+8]
mov r2, [r1]
mov r3, r0
add r3, r2

Improved code:

mov r0, [rbx+8]
add r0, 12


As you can hopefully see, in the improved code each instruction covers multiple parts of the tree. As a result, the improved code is much shorter, and probably faster to execute.
How do we do it?
There are two basic abstractions in tiling. The first of these is the tile grammar. A very simple grammar is shown below. Each of these tile rules maps a tree fragment to an action, a terminal and a cost. A tile rule that matches a tree fragment conceptually replaces the fragment with its terminal; the resulting terminal can be used to match other tiles.

1:  (tile: local    (local) reg 1)
2:  (tile: addr_reg (addr reg) mem 1)
3:  (tile: addr_mem (addr reg) reg 2)
4:  (tile: const    (const) reg 2)
5:  (tile: load_reg (load reg) reg 5)
6:  (tile: load_mem (load mem) reg 5)
7:  (tile: add_reg  (add reg reg) reg 2)
8:  (tile: add_cnst (add reg (const)) reg 3)
9:  (tile: add_ldrg (add reg (load reg)) reg 6)
10: (tile: add_ldmm (add reg (load mem)) reg 6)

The second abstraction is the rule set. A given tree node can potentially be matched by any number of rules. For instance, (const) may be matched by rule 4, or it may be matched as the 'subrule' of rule 8. (We denote the subrule of rule 8 as rule 8a.) In fact, there is no way the matcher can distinguish between these two options when it evaluates the (const) node. The matcher can only distinguish between the two options when evaluating the parent node of (const). (Parents often know more than their children). Hence, matching a node often yields multiple rules.

This is something of a challenge for matching the parent node. In the case of the (load) node in the graph above, do we match rule 6 (load mem) or rule 5 (load reg)? The (addr) node can map to either a reg or a mem terminal, so it does not reduce the ambiguity. The answer is that rather than trying to cut through the ambiguity, we should embrace it. That is to say, we represent the combination of rules as a single ruleset, and the ruleset represents all possible matching rules.

For example, in the grammar above, a (const) node by itself always matches rule 4 or rule 8a. So the (const) matches a single ruleset consisting of { 4, 8a }. Similarly, an (addr) always takes a reg terminal and maps to both rules { 2, 3 }. In contrast, the (load) node can match rule 5 - if its child matches to a reg terminal - or rule 6 if its child matches to a mem terminal. (It can also match rule 9a and 10a, but ignore that for simplicity). Since all nodes that generate a mem terminal (i.e. the (addr)) can also generate a reg terminal, rule 6 is always combined with rule 5, but the inverse is not the case. (It's perfectly possible to (load) the result of an (add) operation, for example). Thus, (load) maps to two distinct rulesets: { 5 } and { 5, 6 }.

Table generation
It is pretty easy to determine whether a rule will match a node and a combination of rulesets: just check whether any of those rulesets can generate the required terminals. Checking this for all available rules then gives you a new combination of rules, which is also represented as a ruleset. Better yet, knowing the costs associated with each rule, one can determine the optimal rule to reduce a node to the required terminal. For instance, in the tree above:
  1. (local) can only be matched by rule 1 (ruleset {1}).
  2. (addr {1}) can be matched by rule 2 and 3 equally. (ruleset {2,3})
  3. (load {2,3}) can be matched by rule 5, 6, 9a and 10a, because the ruleset {2,3} from (addr) generates both mem and reg terminals. (ruleset {5,6,9a,10a}).
  4. (const) can be matched by rule 4 and 8a (ruleset: {4, 8a}).
  5. (add) can be matched by rule 7 and 8, because ruleset {5,6,9a,10a} can generate a reg, and ruleset {4, 8a} can generate the (const) placeholder expected by rule 8. Hence (add {5,6,9a, 10a} {4, 8a}) yields ruleset {7,8}.
Now to determine the optimum code:
  1. (add) can best be represented by rule 8, because this has lower cost than the combination of rule 4 and rule 7.
  2. (load) can best be represented by rule 6, because this has lower cost than rule 3 and 5 combined.
  3. (const) requires no specific representation, being embedded in rule 8.
  4. The same goes for (addr) and (local).
In fact, you can compute this information prior to the actual compilation process, and stick it in a table - simply by applying all rules to all combinations of rulesets. Doing this transforms the ambiguous, nondeterministic matching process into a deterministic table lookup. In CS jargon, it transforms the NFA represented by the tile grammar into a DFA. As in all such transformations, this takes up significant amounts of space.
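
To make this a bit more concrete, here is a minimal sketch in C of what a tiler driven by such a precomputed table could look like. All names and the toy state table are my own invention (the real table is generated and far bigger), constant operands like the offset and size are treated as node attributes rather than children, and the state numbers stand for the rulesets found in the worked example above.

#include <stdint.h>

enum { OP_LOCAL, OP_ADDR, OP_CONST, OP_LOAD, OP_ADD };

typedef struct {
    int op;
    int child[2];  /* indices into the node array, -1 if absent */
    int state;     /* ruleset number assigned by the tiler */
} ExprNode;

typedef struct {
    int op, left_state, right_state;  /* lookup key */
    int state;                        /* resulting ruleset number */
} StateEntry;

/* toy stand-in for the generated table; states 1-5 correspond to the
   rulesets {1}, {4,8a}, {2,3}, {5,6,9a,10a} and {7,8} derived above */
static const StateEntry state_table[] = {
    { OP_LOCAL, -1, -1, 1 },
    { OP_CONST, -1, -1, 2 },
    { OP_ADDR,   1, -1, 3 },
    { OP_LOAD,   3, -1, 4 },
    { OP_ADD,    4,  2, 5 },
};

static int lookup_state(int op, int left, int right) {
    int i, n = sizeof state_table / sizeof state_table[0];
    for (i = 0; i < n; i++)
        if (state_table[i].op == op && state_table[i].left_state == left &&
            state_table[i].right_state == right)
            return state_table[i].state;
    return -1;  /* the grammar cannot tile this combination */
}

/* post-order walk: label the children first, so that the node itself needs
   only a single deterministic lookup instead of a nondeterministic match */
int tile_node(ExprNode *nodes, int i) {
    int left  = nodes[i].child[0] >= 0 ? tile_node(nodes, nodes[i].child[0]) : -1;
    int right = nodes[i].child[1] >= 0 ? tile_node(nodes, nodes[i].child[1]) : -1;
    return nodes[i].state = lookup_state(nodes[i].op, left, right);
}

With per-rule costs added, the same lookup can also record the cheapest rule for producing each terminal, which is how the optimal tiling above falls out of the table.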

Let's keep it simple
So much space, in fact, that we're not home and dry just yet. A table mapping the combination of a node and two children - indexed by ruleset - must be at least O(n_child × n_ruleset²) large. If we naively generate all combinations of rules that start with a given head node, we generate 2^n rulesets per type of head. Some heads are potentially involved with over 10 rules (consider (load), which is allowed in nearly all x86 instructions), giving - naively - 1024 rulesets. Most of these rulesets are impossible to generate. For example, in our miniature grammar, a ruleset containing {8,9} clearly cannot be generated. It is therefore in our keen interest to generate the minimum number of rulesets.

But that is pretty complicated: it either requires rather sensitive analysis of the grammar, which isn't algorithmically cheap by itself, or we can simply read off all the rulesets that can be generated from the grammar by constructing the table that generates them. Clearly that is a chicken-and-egg problem: to find the rulesets that can be generated by a grammar, we have to make a table based on just those rulesets. Fortunately, chicken-and-egg problems like this can usually be solved by some form of topological sorting. To put it in other words, we don't need to have all rulesets available to find the combination of rules the grammar can produce, just some that generate all the terminals needed by a given node. In our grammar above, we can start by generating all rules for (const) and (local), noting that they generate one ruleset each. After that is done, we can generate all rules that rely only on reg, which is the (addr) rule (generating mem). We continue this process until all rulesets have been generated. This dramatically reduces the size of the table, which is still pretty large. Without this procedure, however, the time taken to build the table tends to explode on relatively small grammars.
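
Sketched in C (though the real generator is perl), the process looks roughly like the fixpoint loop below. Everything here is hypothetical - the helpers don't exist under these names - but it shows the idea: seed the set of known rulesets with the leaf rules, then keep combining known rulesets under every head node until nothing new turns up.

#include <stdint.h>

#define MAX_RULESETS 1024

typedef uint64_t RuleSet;  /* bit i set means rule i can match */

/* hypothetical helpers: the number of head node types in the grammar, and
   the set of rules that match a head whose children carry these rulesets
   (operands the head does not have are simply ignored) */
extern int     grammar_num_heads(void);
extern RuleSet grammar_match(int head, RuleSet left, RuleSet right);

/* 'known' must be seeded with the leaf rulesets, e.g. those of (const)
   and (local); returns the total number of reachable rulesets */
int generate_rulesets(RuleSet *known, int num_known) {
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int head = 0; head < grammar_num_heads(); head++)
            for (int l = 0; l < num_known; l++)
                for (int r = 0; r < num_known; r++) {
                    RuleSet rs = grammar_match(head, known[l], known[r]);
                    int k, seen = 0;
                    for (k = 0; k < num_known; k++)
                        if (known[k] == rs) { seen = 1; break; }
                    if (rs && !seen && num_known < MAX_RULESETS) {
                        known[num_known++] = rs;  /* a new ruleset appeared */
                        changed = 1;
                    }
                }
    }
    return num_known;
}

Only the rulesets that can actually occur end up as table keys, which is what keeps the table from exploding to the naive 2^n size.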

Ultimately the tiler table must be available to the JIT compiler, which is written in C. The tile table generator is written in perl5 (just like the expression template compiler), because of manipulexity and whipuptitude, and because it runs everywhere, you know? In fact, perl5 is already a requirement for building MoarVM, which means I'm not introducing any new build-time dependencies. (Nobody likes installing a build-time dependency, least of all me). Perl5 natively supports hash tables; C does not. So I chose to represent the table as a sorted array of key + value pairs and use binary search to find the right items. There are certainly more efficient representations, but this is very simple and still guarantees quite adequate lookup times. This is important in ensuring the JIT compiler won't become a bottleneck itself.
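
As an illustration, the runtime lookup can be as simple as the snippet below; the struct and field names are made up, but the shape (a sorted array of packed keys, searched in O(log n)) is what the paragraph above describes.

#include <stdint.h>

typedef struct {
    uint32_t key;    /* e.g. node type and child rulesets packed together */
    uint32_t value;  /* index of the matching ruleset / tile information  */
} TableEntry;

/* returns 1 and fills *value if the key is present, 0 otherwise */
int table_lookup(const TableEntry *table, int num, uint32_t key, uint32_t *value) {
    int lo = 0, hi = num - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (table[mid].key == key) {
            *value = table[mid].value;
            return 1;
        }
        if (table[mid].key < key)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return 0;  /* no tile for this combination */
}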

So that is the story of how I wrote the tiler table generator (and incidentally, a story of how perl saved the day). With these in place, I can implement the final tiler quite trivially (I already did, again, in perl). I conclude by noting that while the Aho algorithm guarantees optimal tiling (within the limitations of the grammar), it is not an optimization method by itself. For truly good code - say, like GCC or LLVM can produce - much more is needed: instruction scheduling, register allocation, and true optimisation techniques. Until my next report, I'll be working on these topics. See you then!

Monday, 20 July 2015

Of Values


In the interest of the common interest in my little project, I think it's time for me to blog again. Today marks the end of the fifth week of my project period; since my project was planned to take 10 weeks, that means I'm now halfway. So what have I achieved, and what have I learned?

Well, I promised to deliver the following things:
  1. An implementation of the code generation algorithm described in my grant proposal, including instruction selection tables for the x64 architecture
  2. A runtime representation of machine operations ('expressions') that will form the input to the code generator and is suitable for targeting to different architectures
  3. A patched version of DynASM capable of addressing the extended registers of x64
  4. Conversion of large parts of the current JIT to the new algorithm
  5. An extension API for REPR ops to insert (inline) expressions into the JIT in place of some operations
  6. A set of automated tests known to trigger JIT compilation to either the NQP or Rakudo Perl 6 test suite.
  7. Reports and documentation explaining the JIT and its API
I have delivered 2 and 3, and the main reason I haven't done 4 yet is that 1 is not completely done. (The consequence of converting everything to the expression tree format would be that testing the soundness of compilation algorithms would become much more difficult). I have delivered tooling (i.e. a preprocessor) to elegantly and efficiently transform MoarVM bytecode segments to the expression tree format.

I think my almost-weekly blog reports do something for 7, but real documentation is still lacking. In the case of 6 (the test suite), it turns out that practically any NQP program - including bootstrapping NQP itself - already exercises the JIT quite a bit, including the new expression tree pseudocompiler. Thus, during development it has not yet been necessary to develop an explicit test suite, but I expect it will become more useful when the core compiler has stabilized. So in short, although I am not quite near the finish line, I think I am well underway to delivering a usable and useful compiler.

What have I learned that I think will help me go forward?
  1. A lot of things that look like simple expressions in MoarVM are quite complex underneath. Some things include conditional evaluation, some include evaluation lists. Many things have common subexpressions. Many other things are really statements.
  2. A MoarVM basic block is a much larger unit than a machine basic block, and the former may include many of the latter. A basic block in the expression tree is also quite conceptually difficult, given that
  3. Lazy evaluation is not compatible with single evaluation in case of conditional evaluation.
  4. The issues of evaluation order, value management, register management and instruction selection are distinct but closely related; each greatly influences the others. For instance:
    1. A register can hold multiple values (values can be aliased to the same register).
    2. Values may be classified as intermediaries (single use not representing a final variable), temporaries (multiple uses not representing a final variable) and locals (that do represent a final variable).
    3. Value uniqueness falls directly out of the expression tree format.
    4. Instruction selection influences the amount of registers required and should precede register management.
    5. Register selection benefits from a consistent way to get a 'free register', and either a heap or a stack is a decent way to provide this; more importantly, it benefits from a way to efficiently subset the register set.
  5. It's nearly impossible to compile decent code using only a single-pass traversal, because you don't know where values will end up, or which of the 3 classes above they belong to.
  6. The expression tree is really a Directed Acyclic Graph, and traversal and compilation can be significantly more complex for a DAG than they are for a tree.
Accordingly, I've spent most of my last week learning these things, some in easy ways and some in hard ways. This is why, as far as features are concerned, I don't have much news to report this week. I hope to have more exciting news next week. See you then!

Monday, 13 July 2015

A Progress Report

Another week, another moment for reporting JIT compiler progress. I don't really have an interesting story to tell, so I'll keep it to a list of goals achieved.

I implemented a macro facility in the expression tree template builder, and changed the parser to use s-expressions throughout, making it much simpler.  I've used it to implement some opcode templates, learning much about what is required of the expression tree.

I've introduced a macro-based dynamic array implementation and refactored the JIT graph builder and expression tree builder to use it. This is necessary to allow the expression tree builder to use JIT graph labels. (For the record, the graph is the structure representing the whole routine or frame, and the expression tree represents a small set of interconnected expressions or statements. Expression tree is a misnomer for the data type I'm manipulating, because it is a DAG rather than a tree, and it holds statements rather than expressions. But the name is there, and I haven't really got a replacement ready).
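
For the curious, a macro-based dynamic array in C usually looks something like the sketch below. These are not the macro names used in MoarVM - consider it an illustration of the idiom: declare three fields per array, and let the macros handle growth for any element type.

#include <stdlib.h>

#define DYNARR_DECL(type, name) \
    type *name; size_t name ## _num; size_t name ## _alloc

#define DYNARR_INIT(name, n) do { \
    (name) = malloc((n) * sizeof *(name)); \
    name ## _num = 0; name ## _alloc = (n); \
} while (0)

#define DYNARR_PUSH(name, value) do { \
    if (name ## _num == name ## _alloc) { \
        name ## _alloc *= 2; \
        (name) = realloc((name), name ## _alloc * sizeof *(name)); \
    } \
    (name)[name ## _num++] = (value); \
} while (0)

/* usage: a builder that collects (for example) label indices */
struct label_list {
    DYNARR_DECL(int, labels);
};

void example(struct label_list *list) {
    DYNARR_INIT(list->labels, 4);
    DYNARR_PUSH(list->labels, 42);
}

The token-pasting trick means one set of macros serves every builder structure, which is why both the JIT graph builder and the expression tree builder can share it.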

I've implemented a 'generic' tree-walking mechanism on the expression tree, and designed some algorithms for it to help compiling, such as live-range calculations and common subexpression elimination (CSE). CSE is not just a useful optimization, but as a result of it, all sorts of useful information can be calculated, informing register allocation and/or spill decisions. Another useful optimization, and not a very difficult one, is constant folding.
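
As a taste of what such an algorithm looks like on this structure, below is a toy version of CSE by 'value numbering' over a linear node array (the field names are invented). It relies on the property that children always precede their uses; the real thing also has to respect memory and side effects, which this toy happily ignores.

#include <string.h>

typedef struct {
    int op;
    int child[2];  /* indices of child nodes in the array, -1 if absent */
} ExprNode;

/* after the call, remap[i] holds the index all uses of node i should point
   to; identical pure computations collapse onto their first occurrence,
   turning the tree into a DAG */
void cse(ExprNode *nodes, int num, int *remap) {
    for (int i = 0; i < num; i++) {
        int c, j;
        /* rewrite children to their canonical copies first */
        for (c = 0; c < 2; c++)
            if (nodes[i].child[c] >= 0)
                nodes[i].child[c] = remap[nodes[i].child[c]];
        remap[i] = i;
        /* then look for an identical earlier node; a hash table would
           replace this linear scan in a non-toy version */
        for (j = 0; j < i; j++)
            if (nodes[j].op == nodes[i].op &&
                memcmp(nodes[j].child, nodes[i].child, sizeof nodes[i].child) == 0) {
                remap[i] = j;
                break;
            }
    }
}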

I've added and changed and removed a bunch of expression tree node types and macros. There are some interesting language-design details there; for instance, that all and any can stand in for boolean and and or respectively when these are used for binary computations, as is the case for machine code.

I've started writing a 'pseudocompiler', that is to say, a routine that logs the machine code statements that would be produced by the expression tree compiler to the JIT log, allowing me to inspect the logs to find bugs, rather than having to dig for them deep down in GDB. Predictably, there were many bugs, most of which I think I've now fixed.

I've implemented the world's most naive register allocator, based on a ring of usable registers and spilling to stack. This was more complex than I had assumed, so doing so was another learning experience. I noticed that without use information, there is no way to insert spills.
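
For illustration, the core of such a ring allocator can be as small as the sketch below (the names and the spill mechanism are invented; the emitter is just a stand-in). It also shows the problem: the allocator evicts blindly, and without use information it cannot know whether, or where, the spilled value will need to be reloaded.

enum { NUM_ALLOCATABLE = 6 };  /* registers the allocator may hand out */

/* hypothetical emitter: store the value held in 'reg' to stack slot 'slot' */
extern void emit_spill(int value, int reg, int slot);

typedef struct {
    int reg_value[NUM_ALLOCATABLE];  /* value held per register, -1 = free */
    int next;                        /* current position in the ring */
    int spill_slots;                 /* stack slots handed out so far */
} RingAlloc;

int alloc_register(RingAlloc *ra, int for_value) {
    int reg = ra->next;
    ra->next = (ra->next + 1) % NUM_ALLOCATABLE;
    if (ra->reg_value[reg] >= 0) {
        /* evict the current occupant to the stack before reuse */
        emit_spill(ra->reg_value[reg], reg, ra->spill_slots++);
    }
    ra->reg_value[reg] = for_value;
    return reg;
}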

I've also encountered some difficulties. The notion of a basic block - an uninterrupted sequence of operations - differs between the JIT compiler and spesh, because many MoarVM-level instructions are implemented as function calls. Function calls imply spills (because registers are not preserved across calls); but because the call may be conditional, there is potentially a path with spills and a path without, implying that a later load may read garbage. In other words, spills should precede conditionals, because conditionals break up the basic block. I think the SSA information from spesh could be useful here, but I have so far not figured out how to combine this information with the expression tree.

Some things (pure operations without side effects) can potentially be recalculated rather than spilled. Address calculations, which can be done inline (for the most part) in x64 instructions, are a prime example of this. (The current pseudocompiler computes these values into real registers, because the current pseudocompiler is dumb).

That is most of it, I guess. See you next week, when it's milestone time.

Friday, 3 July 2015

Intermediate Progress on an Intermediate Representation

In which I write about my mad-scientist's approach to building an expression tree and muse about the recursive nature of compilation.

Last week I reported some success in hacking register addressing for the amd64 extended registers into the DynASM assembler and preprocessor. This week I thought I could do better, and implement something very much like DynASM myself. Well, that wasn't part of the original plan at all, so I think that justifies an explanation.

Maybe you'll recall that the original aim of my proposal was to make the JIT compiler generate better code. In this case, we're lucky enough to know what better means: smaller and with less memory traffic. The old JIT has two principal limitations that prevent it from achieving this goal: it can only address a limited set of registers dynamically, and it has no way to decouple computations from memory access. The first of these problems involves the aforementioned DynASM hackery. The second of these problems is the topic for the rest of this post.

To generate good code, a compiler needs to know how values are used and how memory is used. For instance, it is useless to commit a temporary variable to memory, especially if it is used directly afterwards, and it is wasteful to load a value from memory if it already exists in a register. It is my opinion that a tree structure that explicitly represents the memory access and value usage in a code segment (a basic block in compiler jargon) is the best way to discover opportunities for generating efficient code. I call this tree structure the 'expression tree'. This is especially relevant as the x86 architecture, being a CISC architecture, has many ways of encoding the same computation, so that finding the optimal way is not obvious. In a way, the same thing that makes it easy for a human to program x86 makes it more difficult for a compiler.

As a sidenote: programming x86, especially the amd64 dialect, really is easy, and I would suggest that learning it is an excellent investment of time. There are literally hundreds of guides, most of them quite reasonable (although few have been updated for amd64).

It follows that if one wants to generate code from an expression tree one must first acquire or build such a tree from some input. The input for the JIT compiler is a spesh graph, which is a graphical-and-linear representation of MoarVM bytecode. It is very suitable for analysis and manipulation, but it is not so well suited for low-level code generation (in my opinion), because all memory access is implicit, as are relations between values. (Actually, they are encoded using SSA form, but it takes explicit analysis to find the relations). To summarise, before we can compile an expression tree to machine code, we should first compile the MoarVM bytecode to an expression tree.

I think a good way to do that is to use templates for the expression tree that correspond to particular MoarVM instructions, which are then filled in with information from the specific instruction. Using a relatively simple algorithm, computed values from earlier instructions are then associated with their use in later instructions, forming a tree structure. (Actually, a DAG structure, because these computed values can be used by multiple computations). Whenever a value is first loaded from memory - or we know the register values to have been invalidated somehow - an explicit load node is inserted. Similarly, an 'immediate' value node is inserted whenever an instruction has a constant value operand. This ensures that the use of a value is always linked to the most recent computation of it.

Another aside: the use of 'template filling' is made significantly easier by the use of a linear tree representation. Rather than use pointers, I use indices into the array to refer to child nodes. This has several advantages, realloc safety for one, and trivial linking of templates into the tree for another. I use 64-bit integers to hold each tree node, which is immensely wasteful for most nodes, but very handy for holding immediate values. Finally, generating the tree in this manner into a linear array implies that the array can be used directly for code generation - because code using an operand is always preceded by code defining it.
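
A sketch of that linear representation, with invented field names, looks roughly like this. The important details are that children are indices rather than pointers (so a realloc cannot invalidate them) and that linking a template in is just a copy plus an offset fix-up of its internal references.

#include <stdint.h>
#include <string.h>

typedef int64_t TreeNode;  /* a node tag, an index, or an immediate value */

typedef struct {
    TreeNode *nodes;
    int32_t   num, alloc;  /* the array grows; indices stay valid */
} ExprTree;

/* append a template of 'len' nodes; 'links' lists the positions inside the
   template that are child references and must be shifted to the new base.
   The caller is assumed to have ensured there is enough room. */
int32_t tree_append_template(ExprTree *tree, const TreeNode *tmpl, int32_t len,
                             const int32_t *links, int32_t num_links) {
    int32_t base = tree->num;
    memcpy(tree->nodes + base, tmpl, len * sizeof(TreeNode));
    for (int32_t i = 0; i < num_links; i++)
        tree->nodes[base + links[i]] += base;  /* relocate child references */
    tree->num += len;
    return base;  /* where the template now lives in the array */
}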

If you agree with me that template filling is a good method for generating the low-level IR - considering the most obvious alternative is coding the tree by hand - then maybe you'll also agree that a lookup table is the most obvious way to map MoarVM instructions to templates. And maybe you'll agree that hand-writing a linear tree representation can be a huge pain, because it requires you to exactly match nodes to indices. Moreover, because in C one cannot declare the template array inline in a struct declaration - although one can declare a string inline - these trees would either have to be stored in nearly a thousand separate variables, or in a single giant array. For the purpose of not polluting the namespace unnecessarily, the latter solution is preferable.

I'm not sure I can expect my reader to follow me this deep into the rabbit hole. But my narrative isn't done yet. It was clear to me by now that I had to use some form of preprocessor to generate the templates (as well as the lookup tables and some runtime usage instructions). (Of course, the language of this preprocessor had to be perl). The last question then was how to represent the template trees. Since these templates can have a tree structure themselves, using the linear array format would've been rather annoying. A lot of people today would probably choose JSON. (That would've been a fine choice, to be honest). Not me; I pick s-expressions. S-expressions are not only trivial to parse (the current implementation costs 23 lines), they are also compact and represent trees without any ambiguity. Using just the tiniest bit of syntactic sugar, I've added macro facilities and let statements. This preprocessor is now complete, but I still need to implement the template filling algorithm, define all the node types for the IR, and of course hook it into the current JIT. With so much still left to do, I'm hoping (but reasonably confident) that this detour of writing an expression template generator will eventually be worth the time. (For one thing, I expect it to make creating the extension API a bit easier).

Next week I plan to finish the IR tree generation and write a simple code generator for it. That code generator will not produce optimal code just yet, but it will demonstrate that the tree structure works, and it will serve to probe difficulties in implementing a more advanced tree-walking code generator. See you then!

Tuesday, 23 June 2015

Adventures in Instruction Encoding

In which I update you on my progress in patching DynASM to Do What I Mean.

As some of you may have noticed, I officially started working on the MoarVM JIT compiler a little over a week ago. I've been spending that time catching up on reading, thinking about the proper way to represent the low-level IR, tiles, registers, and other bits and pieces. On the advice of Jonathan, I also confronted the problem of dynamic register addressing head-on, which was an interesting experience.

As I have mentioned in my earlier posts, we use DynASM for generating x86-64 machine code, something which it does very well. (The link goes to the unofficial rather than the official documentation, since the former is much more useful than the latter). The advantage of DynASM is that it allows you to write snippets of assembly code just as you would for a regular assembler, and at runtime these are then assembled into real machine code. As such, it shields the user from the gory details of instruction encoding. I think it's safe to say using DynASM made developing the JIT dramatically simpler.

However, DynASM as we used it has an important limitation. The x86-64 instruction set architecture specifies 16 general-purpose registers, but the dynamic addressing feature of DynASM (which allows you to specify at runtime which registers are the operands of an instruction) was limited to using only the 8 registers already present in x86. This is an effect of the way instructions are encoded in x86 - namely, using 3 bits (in octal). 3 bits are not enough to specify 16 registers, so how are the extra registers dealt with?

The answer is: using a special bit in a special prefix byte (the REX byte). This byte signifies the use of 64-bit operands or the use of the extended registers. To make matters more difficult, x86 instructions can use up to three registers per instruction, so it is kind of important to know which bit to set, and which not to set. Furthermore, at instruction encoding time you will need to know where that REX byte is, because any number of instruction parameters might have come between the byte that holds the register address and the REX byte. (You might notice I've been talking about bytes a lot by now. Instruction encoding is a byte business).
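
For the curious, the bit-twiddling itself is not the hard part. The snippet below shows how a 4-bit register number is split over the REX prefix and the ModRM byte in ordinary x86-64 encoding; this is generic encoding knowledge rather than DynASM's internal representation - the hard part, as described above, is finding that REX byte again at encoding time.

#include <stdint.h>

/* REX is 0100WRXB: W selects 64-bit operand size, and R, X and B carry
   bit 3 of the ModRM.reg, SIB.index and ModRM.rm/base register numbers */
static uint8_t rex_byte(int wide, int reg, int index, int base) {
    return 0x40
        | ((wide & 1)         << 3)   /* REX.W */
        | (((reg   >> 3) & 1) << 2)   /* REX.R */
        | (((index >> 3) & 1) << 1)   /* REX.X */
        |  ((base  >> 3) & 1);        /* REX.B */
}

/* the low three bits of each register number go where they always went */
static uint8_t modrm_byte(int mod, int reg, int rm) {
    return (uint8_t)(((mod & 3) << 6) | ((reg & 7) << 3) | (rm & 7));
}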

I finally implemented this by specifically marking REX bytes whenever they are followed by a dynamic register declaration, and then adding in the required bits at address encoding time. This required some coordination between the Lua part and the C part of DynASM, but ultimately it worked out. Patches are here. In due course I plan to backport this branch to LuaJIT. This patch is not entirely complete, though, because the REX byte is not always present when using dynamic registers, only if we use 64-bit operands. Thus, it needs to be conditionally added when using extended registers with 32-bit operands. We don't really expect to use that, though, since MoarVM uses 64-bit values almost exclusively, especially on a 64-bit platform.

The importance of all this is that it unblocks the development of a tiler, register selection, register allocation, and in general all the nice stuff of compiler development. In the coming weeks, I'll start by developing a prototype tiler, which I can then integrate into the MoarVM JIT. There are still plenty of issues to deal with before that is done, and so I'll just try to keep you up to date.

Finally, if you're going to YAPC::EU, note that I'll be presenting on the topic of JIT compilers (especially in the context of MoarVM and perl6). If this interests you, be sure to check it out.

Sunday, 7 June 2015

Studying

Odds are that if you read my blog, you also read either perl6 weekly or the perl foundation blog, in which case you already know that my grant application has been accepted. Yay! I should really have blogged somewhat earlier about that, but I've been very busy the last few weeks. But for clarity, that means I start working on the MoarVM JIT compiler ('expression compiler') on the 14th of June, and I hope to have reached a first milestone 5 weeks later, and a number of 'inchstones' before that.

In the meantime, I have just merged a branch that timotimo and I worked on to JIT-compile a larger number of frames containing exceptions. That caused some problems because, as it turns out, there is more than one way to throw and catch exceptions in MoarVM. To be specific, catching an exception sometimes means invoking a handler routine and sometimes means jumping to a specific point within a frame. To make matters more confusing, sometimes that means we descend in the stack (call from our current frame) and sometimes that means we ascend, and sometimes we just jump around in our current frame. And to top it off, perl6 (as one of the few languages I know of) allows you to resume an exception, like so:


sub foo() {
    try {
        say "TRY";
        die "DIE";
        say "RESUMED";
        CATCH {
            say "CATCH";
            default {
                $_.resume;
            }
        }
    }
}

loop (my int $i = 0; $i < 500; $i++) {
    foo();
}
say "FINISHED";

There is no question to me that this is super-cool, of course, but it can be a bit tricky to implement. For the JIT, it meant storing a pointer to a label *just* after the current 'throwish' operation in the exception body. (A 'throwish' op is a VM-level operation that throws an exception and thus causes flow control to jump around relatively unpredictably. This is also why (some) C++ programmers dislike exceptions, by the way). This way, when an exception is resumed, the JIT trampolining mechanisms will ensure that control is resumed where we left off. Unless of course we have jumped to a handler in the same frame, in which case we jump there directly. And because timotimo has implemented the auxiliary ops that come with exception handling, we can now JIT quite a few more frames.

Anyway, there are two reasons this post has its name. The first is of course that I'm still busy finishing this study year for a final week, which is stressful in itself. And the second reason is that I've started reading up on articles concerning JIT compilation and code generation. A short list of these includes:
Finally, I've submitted a talk proposal to YAPC::EU 2015 to discuss all the interesting bits of JIT compilation. I hope it will be accepted because (I think) there is no shortage of interesting stuff to talk about.

Sunday, 19 April 2015

Grant Application

Hi everybody, I just wanted to point you to my Grant Application at the Perl Foundation to develop a better JIT compiler. Feedback is highly appreciated.