What is tiling?
Tiling is the process of breaking an input tree into many pieces. The tree is the data structure representing the code that the JIT should compile. As I wrote a few weeks back, these trees are ultimately generated from templates mapped to the MoarVM opcodes. Let's suppose we have a tree for the expression:
result = LOCAL[8] + 12
and that it'd look like the code below:
(add
  (load
    (addr
      (local)
      8)
    int_sz)
  (const
    12
    int_sz))
I realize most people probably are not familiar with the syntax used here. These are called s-expressions, and there is fortunately little to know: '(' opens a list, ')' closes a list, and words represent symbols that are reproduced verbatim (as are the numbers). If we act as if the first word in a list represents the name of a function and the rest of the list its arguments, it hopefully becomes clear how the text represents the expression (and you can add LISP to your LinkedIn profile at the same time).
For those who prefer a visual layout (which is most people, I guess), the following graph represents the exact same code:
[Figure: the code fragment under consideration.]
Tiling is necessary because there are many ways to implement this expression in machine code. For instance, every node can be mapped to a single instruction, as in the naive code below. Compare this to the improved code, which is only two instructions long:
[Table: the tree, shown beside the naive code and the improved code.]
As you can hopefully see from the color, in the improved code each instruction refers to multiple parts of the tree. As a result, the improved code is much shorter, and probably faster to execute.
How do we do it?
There are two basic abstractions in tiling. The first of these is the tile grammar. A very simple grammar is shown below. Each of these tile rules maps a tree fragment to an action, a terminal, and a cost. A tile rule that matches a tree fragment conceptually replaces the fragment with its terminal; the resulting terminal can then be used to match other tiles.
1: (tile: local (local) reg 1)
2: (tile: addr_reg (addr reg) mem 1)
3: (tile: addr_mem (addr reg) reg 2)
4: (tile: const (const) reg 2)
5: (tile: load_reg (load reg) reg 5)
6: (tile: load_mem (load mem) reg 5)
7: (tile: add_reg (add reg reg) reg 2)
8: (tile: add_cnst (add reg (const)) reg 3)
9: (tile: add_ldrg (add reg (load reg)) reg 6)
10: (tile: add_ldmm (add reg (load mem)) reg 6)
The second abstraction is the rule set. A given tree node can potentially be matched by any number of rules. For instance, (const) may be matched by rule 4, or it may be matched as the 'subrule' of rule 8. (We denote the subrule of rule 8 as rule 8a.) In fact, there is no way the matcher can distinguish between these two options when it evaluates the (const) node. It can only distinguish between them when evaluating the parent node of the (const). (Parents often know more than their children.) Hence, matching a node often yields multiple rules.

This is something of a challenge when matching the parent node. In the case of the (load) node in the graph above, do we match rule 6 (load mem) or rule 5 (load reg)? The (addr) node can map to either a reg or a mem terminal, so it does not reduce the ambiguity. The answer is that rather than trying to cut through the ambiguity, we should embrace it. That is to say, we represent the combination of rules as a single ruleset, and the ruleset represents all possible matching rules.

For example, in the grammar above, a (const) node by itself always matches rule 4 or rule 8a, so it matches the single ruleset { 4, 8a }. Similarly, an (addr) always takes a reg terminal and maps to both rules, i.e. the ruleset { 2, 3 }. In contrast, the (load) node can match rule 5 if its child matches a reg terminal, or rule 6 if its child matches a mem terminal. (It can also match rules 9a and 10a, but let's ignore that for simplicity.) Since every node that generates a mem terminal (i.e. the (addr)) can also generate a reg terminal, rule 6 is always combined with rule 5, but the inverse is not the case. (It is perfectly possible to (load) the result of an (add) operation, for example.) Thus, (load) maps to two distinct rulesets: { 5 } and { 5, 6 }.

Table generation
It is pretty easy to determine whether a rule matches a node and a combination of child rulesets: just check whether any of those rulesets can generate the required terminals. Checking this against all available rules then gives you a new combination of rules, which is again represented as a ruleset. Better yet, knowing the costs associated with each rule, one can determine the optimal rule for computing a node to the required terminal. For instance, in the tree above:

- (local) can only be matched by rule 1 (ruleset { 1 }).
- (addr { 1 }) can be matched by rules 2 and 3 equally well (ruleset { 2, 3 }).
- (load { 2, 3 }) can be matched by rules 5, 6, 9a and 10a, because the ruleset { 2, 3 } from the (addr) generates both mem and reg terminals (ruleset { 5, 6, 9a, 10a }).
- (const) can be matched by rules 4 and 8a (ruleset { 4, 8a }).
- (add) can be matched by rules 7 and 8, because the ruleset { 5, 6, 9a, 10a } can generate a reg, and the ruleset { 4, 8a } can generate the (const) placeholder expected by rule 8. Hence (add { 5, 6, 9a, 10a } { 4, 8a }) yields the ruleset { 7, 8 }.
Once the rulesets are known, the costs decide which rule each node is ultimately assigned:

- (add) can best be represented by rule 8, because this has lower cost than the combination of rules 4 and 7.
- (load) can best be represented by rule 6, because this has lower cost than rules 3 and 5 combined.
- (const) requires no specific representation, being embedded in rule 8. The same goes for (addr) and (local).
Let's keep it simple
This table takes space, though. So much space, in fact, that we're not home and dry just yet. A table mapping the combination of a node and two children, indexed by ruleset, must be at least O(n_child × n_ruleset^2) large. If we naively generate all combinations of rules that start with a given head node, we generate 2^n rulesets per type of head. Some heads are potentially involved with over 10 rules (consider (load), which is allowed in nearly all x86 instructions), giving, naively, 1024 rulesets. Most of these rulesets are impossible to generate. For example, in our miniature grammar, a ruleset containing { 8, 9 } clearly cannot be generated. It is therefore in our keen interest to generate the minimum number of rulesets.

But that is pretty complicated: it either requires rather sensitive analysis of the grammar, which isn't algorithmically cheap by itself; or we can simply read off all the rulesets that can be generated from the grammar by constructing the table that generates them. Clearly that is a chicken-and-egg problem: to find the rulesets that can be generated by a grammar, we have to build a table based on just those rulesets. Fortunately, chicken-and-egg problems can usually be solved by some form of topological sorting. To put it in other words, we don't need to have all rulesets available to find the combination of rules the grammar can produce, just some that generate all the terminals needed by a given node. In our grammar above, we can start by generating all rules for
(const) and (local), noting that they generate one ruleset each. After that is done, we can generate all rules that rely only on the reg terminal, which is the (addr) rule (generating mem). We continue this process until all rulesets have been generated. This dramatically reduces the size of the table, which is still pretty large. Without this procedure, however, the time taken to build the table tends to explode even on relatively small grammars.

Ultimately the tiler table must be available to the JIT compiler, which is written in C. The tile table generator is written in Perl 5 (just like the expression template compiler), because of its manipulexity and whipuptitude, and because it runs everywhere, you know? In fact, Perl 5 is already a requirement for building MoarVM, which means I wouldn't be introducing any new build-time dependencies. (Nobody likes installing a build-time dependency, least of all me.) Perl 5 natively supports hash tables; C does not. So I chose to represent the table as a sorted array of key + value pairs and to use binary search to find the right items. There are certainly more efficient representations, but this one is very simple and still guarantees quite adequate lookup times. This is important in ensuring the JIT compiler won't become a bottleneck itself.
So that is the story of how I wrote the tiler table generator (and, incidentally, a story of how Perl saved the day). With these pieces in place, I can implement the final tiler quite trivially (in fact I already did, again in Perl). I'll conclude by noting that while the Aho algorithm guarantees optimal tiling (within the limitations of the grammar), it is not an optimization method by itself. For truly good code, like GCC or LLVM can produce, much more is needed: instruction scheduling, register allocation, and true optimization techniques. Until my next report, I'll be working on these topics. See you then!