Sunday, November 29, 2015

Moar JIT news

Hello there, I thought it high time to write to you again and update you on the world of JITs. Since last I wrote, PyPy 4.0 was released. Also in python-land, Pyston 0.4 was released, and finally Guile 2.1.1 was released and Andy Wingo wrote a nice piece about that, as is his custom. I present these links not only to give these projects the attention they deserve, but also because I think they are relevant to our own project.

In chronological order, the release of PyPy 4.0 marked the first 'production' release of that project's autovectorizer, which was developed over the course of this year's Google Summer of Code. I'd like to take this opportunity to publicly congratulate the PyPy team on this achievement. So-called 'vector' or SIMD operations perform a computation on multiple values in a single step and are an essential component of high-performance numerical computation. Autovectorization refers to a compiler's capability to automatically use such operations without explicit work by the programmer. This is not of great importance for the average web application, but it is very significant for scientific and deep learning applications.
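To make the idea concrete, here is a schematic sketch in Python. A real autovectorizer rewrites machine-level loops; the vadd4 helper below is only a stand-in for a single SIMD instruction that adds four 'lanes' at once, and the names are mine, not PyPy's.

```python
def vadd4(a, b):
    # stand-in for one SIMD instruction adding four lanes in a single step
    return [a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]]

def add_vectorized(a, b):
    # roughly what an autovectorizer makes of 'for i: out[i] = a[i] + b[i]'
    out = []
    main = len(a) - len(a) % 4
    for i in range(0, main, 4):
        out += vadd4(a[i:i+4], b[i:i+4])   # one 'instruction' per 4 elements
    for i in range(main, len(a)):
        out.append(a[i] + b[i])            # scalar tail for the remainder
    return out
```

The payoff is that the main loop runs a quarter of the iterations; on real hardware, each vadd4 would be a single instruction.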

More recently, the Pyston project released version 0.4. Pyston is another attempt at an efficient implementation of Python, funded by Dropbox. Pyston is, or rather started out, based on LLVM. Most of my readers know of LLVM; for those who don't, it is a project which has somewhat revolutionised compiler development in the last few years. Its strengths are high-quality cross-platform code generation and a permissive license. LLVM is also the basis for languages such as Rust and Julia. Notable weaknesses are size, speed, and complexity. To make a long story short, many people have high expectations of LLVM for code generation, and not without reason.

There are a few things that called my attention in the release post linked above. The first is that the Pyston project introduced a 'baseline' JIT compiler that skips the LLVM compilation step, so that JIT-compiled code is available faster. They claim that this causes hardly any slowdown compared to the LLVM backend. The second is that they have stopped working on implementing LLVM-based optimisation. The third is that to support more esoteric Python features, Pyston now resorts to calling the Python C API directly, becoming a sort of hybrid interpreter. I would not be entirely surprised if the end point for Pyston would be life as a CPython extension module, although Conway's law will probably prohibit that.

Pyston is not the first, nor the only current, JIT implementation based on LLVM. It might be important to say here that there are many projects which do obtain awesome results from using LLVM; Julia being a prime example. (Julia is also an excellent counterexample to the recent assertion by static-typing enthusiasts that 'static types have won', being very dynamic indeed.) But Julia was designed to use the LLVM JIT, which probably means that tradeoffs have been made to assure performance; it is also new, so it doesn't have to run as much weird legacy code; and the team is probably highly competent. I don't know why some mileages vary so much (JavaScriptCore also uses LLVM successfully, albeit as its fourth and last tier). But it seems clear that, far from being a golden gun, using LLVM for dynamic language implementations is a subtle and complex affair.

Anybody willing to try building an LLVM-backed JIT compiler for MoarVM, NQP, or perl6 in general, will of course receive my full (moral) support, for whatever that may be worth.

The posts by Andy Wingo, about the future of the Guile Scheme interpreter, are also well worth reading. The second post is especially relevant as it discusses ways to construct a high-performance implementation of a dynamic language, and it generalizes well to other dynamic languages. To summarise: there are roughly two ways of implementing a high-performance high-level programming language, dynamic or not, and the approach of tracing JIT compilers is the correct one - but it is incredibly complex and expensive and, up until now, mostly suitable for big corporations with megabucks to spend. Of course, we aim to challenge this; but for perl6 in the immediate future correctness far outranks performance in priority (as it should).

That all said, I also have some news on the front of the MoarVM JIT. I've recently fixed a longstanding and complex bug that presented itself during the compilation of returns with named arguments by rakudo. This ultimately fell out as a missing inlined frame in the JIT compiler, which in turn was caused by MoarVM trying to look up a variable using the JIT compiler's 'current location' while the actual frame was running under the interpreter, and - this is the largest mystery - had not been deoptimized. I still do not know why that actually happened, but a very simple check fixed the bug.

I also achieved the goal of running the NQP and rakudo test suites under the new JIT compiler, albeit in a limited way; to achieve this I had to remove the templates of many complex operations, i.e. ops that call a C function or that have internal branches. The reason is that computing the flow of values across calls and branches is more complex, and trying to do it inline with a bunch of other things - as the new JIT has tried to do so far - is prone to bugs. This is especially true during tree traversal, since it may not be obvious that computations relying on certain values may live in a different context than the computations that generate those values.

In order to compile these more complex trees correctly, and understandably, I aim to disentangle the final phases of compilation, that is, the stages of instruction selection, register allocation, and bytecode generation. In addition, I want to make the tiler internals and interface much simpler and more user-friendly, and solve the 'implied costs problem'. The benefit of having the NQP test suite working is that I can demonstrate the effects of changes much more directly, and more importantly, demonstrate whether individual changes work or not. I hope to report some progress on these issues soon, hopefully before Christmas.

If you want to check out the progress of this work, check out the even-moar-jit branch of MoarVM. I try, though not always successfully, to keep it up-to-date with the rapid pace of the MoarVM master branch. The new JIT only runs if you set the environment variable MVM_JIT_EXPR_ENABLE to a non-empty value. If you run into problems, please don't hesitate to report them on GitHub or on the #moarvm or #perl6 channels on freenode. See you next time!

Monday, October 12, 2015

Rumors of JITs' demise are greatly exaggerated.

Earlier this week my attention was brought to an article claiming that the dusk was setting for JIT compilation. Naturally, I disagree. I usually try to steer clear of internet arguments, but this time I think I may have something to contribute. Nota bene: this is not a perl- or perl6-related argument, so if that is strictly your interest, this is probably not an interesting post for you.

The main premise of the argument is that people are shifting away from JIT compilation because the technique has failed to live up to its promises. Those promises include, in various forms, high-level languages running 'as fast as C', or having more optimization possibilities than ahead-of-time (AOT) compilers do. Now my perspective may be a bit unusual in that I don't actually expect momentous gains from JIT compilation per se. As I've described in the talk I gave at this year's YAPC::EU, by itself JIT compilation removes only the decoding and dispatch steps of interpretation, and - depending on the VM architecture - these may be a larger or smaller proportion of your running time. However, my thesis is that interpretation is not why high-level languages are slow, or rather, that interpretation is only one of the many sources of indirection that make high-level languages slow.
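To illustrate the decode-and-dispatch overhead, here is a toy stack-based bytecode interpreter in Python; the opcode set is invented for this example. Every iteration pays for decoding the instruction and branching on the opcode before any useful work happens. Compiling this program to straight-line code removes exactly those two costs - but not, for instance, the dictionary lookup behind 'load', which is the kind of remaining indirection the thesis above refers to.

```python
def interpret(code, env):
    """Run a tiny, made-up stack bytecode given as (opcode, argument) pairs."""
    stack, pc = [], 0
    while pc < len(code):
        op, arg = code[pc]            # decode step
        if op == 'push':              # dispatch step
            stack.append(arg)
        elif op == 'load':
            stack.append(env[arg])    # indirection a JIT does NOT remove
        elif op == 'add':
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == 'ret':
            return stack.pop()
        pc += 1

# bytecode for the expression 'x + 3'
program = [('load', 'x'), ('push', 3), ('add', None), ('ret', None)]
```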

First of all, what of the evidence that JITs are actually in decline? The author provides three recent trends as evidence, none of which I hold to be decisive. First, both Windows 10 and the newest versions of Android translate .NET and Dalvik applications respectively to native code at installation time, which is properly considered ahead-of-time compilation. Second, high-performance javascript applications are currently often created using tools like emscripten, which compiles to asm.js, and asm.js is in many ways more similar to object code than to a high-level language, implying that the difficult bit of compilation is already behind us. (I agree mostly with that assessment, but not with its conclusion.) Finally, on iOS devices JIT compilation is unsupported (except for the JIT compiler in the webkit browser engine), allegedly because it is insecure.

As to the first piece, the author suggests that the main reason is that JIT compilers are unpredictable in their output, at least relative to optimizing ahead-of-time compilers. I think that is nonsense; JIT compilation patterns tend to be quite reliably the same on different runs of the same program, a property I rely on heavily during e.g. debugging. The output code is also pretty much invariant, one exception being the actual values of embedded pointers. So in my experience, what you see (as a developer) is also what you get (as a user), provided you're using the same VM. I humbly suggest that the author believes JITs to be unreliable because his work is being compiled by many different VMs using many different strategies. But I see that no differently than any other form of platform diversity. Maybe the author also refers to the fact that the optimization effectiveness, and hence the resulting performance, of JIT-compiled applications is often sensitive to minor and innocuous changes in the application source code. But this is true of any high-level language that relies primarily on optimizing compilers, for C as much as for python or javascript. The main difference between C and python is that any line of C implies far fewer levels of indirection and abstraction than a similar line of python.

I think I have a much simpler explanation as to why both Google and Microsoft decided to implement ahead-of-time compilation for their client platforms. The word 'client' is key here, because I think we're mostly talking about laptops, smartphones and tablets. As it turns out, hardware designers and consumers alike have decided to spend the last few years' worth of chip manufacturing improvements on smaller, prettier form factors (and hopefully longer battery life) rather than computing power. Furthermore, what Qualcomm, Samsung etc. have given us, Photoshop has taken away. The result is that current-generation portable devices are more portable and more powerful (and cheaper) than ever, but are still memory-constrained.

JIT compilation inevitably comes with a significant memory cost from the compiled code itself (which is generally considerably larger than the interpreted code was), even when neglecting the memory usage of the compiler. Using various clever strategies one can improve on that a bit, and well-considered VM design is very important as always. But altogether it probably doesn't make a lot of sense to spend precious memory on JIT-compiled routines in a mobile setting. This is even more true when the JIT compiler in question, like Dalvik's, isn't really very good and the AOT compiler has a good chance of matching its output.

Now to the case of asm.js. As I said, I agree mostly that a significant amount of work has already been done by an ahead-of-time compiler before the browser ever sees the code. But it would be a mistake to think that the role of the JIT (or rather the whole system) can therefore be neglected. First of all, JIT-compiled code, even asm.js code, is greatly constrained in comparison to native code, which brings some obvious security benefits. Second of all, it is ultimately the JIT compiler that allows this code to run cross-platform at high performance. I think it is mistaken to suggest that this role is trivial, and so I see asm.js as a success of, rather than evidence against, JIT compilation as a technique.

Next, the iOS restriction on JIT compilation. I think the idea that this would be for security reasons is only plausible if you accept the idea that application security is significantly threatened by dynamic generation of machine code. While I'm sure that the presence of a JIT compiler makes static analysis very difficult - not to say impossible - I don't believe that this is the primary attack vector of our times. The assertion that memory must be both writable and executable for a JIT compiler to work is only superficially true, since there is no requirement that the memory be both at the same time, and so this doesn't imply much of a threat (so-called W^X memory is becoming a standard feature of operating systems). Vtable pointers stored in the heap, and return addresses on a downward-growing stack - now those are attack vectors of note.

But more importantly, that is not how mobile users are being attacked. It is much more interesting, not to mention significantly easier, for attackers to acquire whole contact books, private location information, credentials and private conversations via phishing and other techniques than it is to corrupt a JIT compiler and possibly, hopefully, and generally unreliably gain remote execution. Most of these attack vectors are wide open indeed and should be prevented by actual security techniques like access control, rather than by outlawing entire branches of computing technology. Indeed, an observer not sympathetic to Apple could probably relate this no-JIT rule to the Californian company's general attitude toward competing platforms, but I will not go further down that path here.

Finally, I think the claim that JIT compilation can't live up to its promise can readily be disproven by a simple Google search. The reason is simple: the JIT compiler, which runs at runtime, has much more information at its disposal than even the best of ahead-of-time compilers. So-called profile-guided optimization helps to offset the difference, but it is not a common technique, and even then it uses only a small subset of the information available to a JIT compiler. The fact that many systems do not match this level of performance (and MoarVM's JIT compiler certainly doesn't) is of course relevant, but not, in my opinion, decisive.

In conclusion, I would agree with the author that there are many cases in which JIT compilation is not suitable and AOT compilation is. However, I think the much stronger claim that the dusk is setting on JIT compilation is unwarranted, and that JIT compilers will remain a very important component of computing systems.

Sunday, September 20, 2015

Most Significant Bits

This week I think I fixed irregular behavior in DynASM's x64 register selection for instruction encoding. I think it'll be a fun story to share, so I thought it'd be time to blog.

The astonishingly irregular thing about x64 instruction encoding is that it is mostly very regular. Ignoring for the moment instruction prefixes and constants, an x86 instruction consists of two bytes: one for the instruction proper, and one for its two operands. Take for example the addition of two registers:

| add  | eax, ecx |
| 0x01 |     0xc8 |

We take the instruction byte as given. It is the second byte that concerns me, because it determines which operands to use and how. Like a good CISC architecture, the x86 supports a number of addressing modes, meaning that a register can be used as a value but also as (part of) a pointer. One of the reasons C does pointer arithmetic so freely is that this reflects the nature of the CPUs which were current when C was designed. The (relevant) x86 addressing modes are shown in the following table. (There are more, but you shouldn't use them):

| Addressing mode | Byte flag | Meaning |
| Direct | 0xc0 | Both operands are used as direct values |
| Indirect | 0x00 | One of the operands is used as a memory reference; whether that is the source or destination operand depends on the instruction |
| Indirect with offset | 0x80 or 0x40 | One of the operands is used as a memory reference, offset by a constant which is encoded directly after the instruction |
| Indexed | 0x04 | One operand consists of two registers, base and index, the latter multiplied by a scale, to provide a memory reference; optionally also has an offset, in which case the code byte is 0x44 or 0x84, depending on whether the offset fits in a single byte or not |
| Instruction-relative | 0x05 | One of the operands is a reference to the current location in the code, offset by a constant (and the other refers to a register as usual) |
Readers who are more careful than can reasonably be expected will have noticed that in the first three addressing modes the lowest nibble is zero, whereas it is nonzero for the last two. This is in fact the source of the irregularities in instruction encoding. To appreciate this it helps to unpack the operand byte in octal rather than hexadecimal; octal is much closer to how x86 thinks about the world. As demonstrated in this table, the two lowest groups of 3 bits each encode the actual registers that should be used.

| Byte | Mode | Op2 | Op1 | Meaning  |
| 0xc0 |    3 |   0 |   0 | Direct   |
| 0x80 |    2 |   0 |   0 | Indirect with offset |
| 0x04 |    0 |   0 |   4 | Indexed  |

The upshot of this is that in case the operand mode isn't direct, and the first operand register is either 4 or 5, the meaning of the operand byte is completely different. For register 4, x86 suddenly expects another operand byte (a so-called SIB byte) to specify which register shall be the base and which the index; for register 5 in plain indirect mode, it expects a constant offset to follow instead.

Normally this isn't much of a problem, since the registers referred to by numbers 4 and 5 are rsp and rbp respectively, meaning the stack top and stack bottom registers. Fun fact: the x86 stack grows downward, so rbp > rsp in basically all cases. Also fun fact: because of this, writing upwards from an rsp-relative reference can overwrite the return pointer held somewhere near rbp, which is the basis of most buffer overflow attacks. You thought NULL was a billion dollar mistake? Consider how the engineers who decided the stack should grow downward must feel.

Anyway, considering that rbp and rsp take up such a pivotal role in your program, it's actually unlikely you'll encode them by mistake. So as long as you don't do that, it's safe to ignore this complexity and just 'add in' the correct bits into the operand byte. Thus, for instance, to refer to registers 7 and 1 respectively in direct mode, we generate:

0300 + (07 << 3) + (01) == 0371 == 0xf9
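That computation is easy to express in code; a minimal sketch (the function name is mine):

```python
def modrm(mode, op2, op1):
    # pack a ModR/M operand byte: 2 bits of mode, then two 3-bit register fields
    assert 0 <= mode < 4 and 0 <= op2 < 8 and 0 <= op1 < 8
    return (mode << 6) | (op2 << 3) | op1
```

For the add eax, ecx example above, modrm(3, 1, 0) yields the operand byte 0xc8.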

However, in the land of x64, things are not so happy. I have blogged earlier about how the x64 architecture gives you 8 extra registers on top of the 8 legacy x86 registers, and how the difference between those registers lies only in a prefix byte called REX. REX byte encoding is not so difficult; the trick is knowing when to emit it. But because only the lower three bits of the 4-bit register number are placed in the operand byte, registers r12 and r13 look exactly like rsp and rbp to the CPU. Well, that's where the fun really starts, because it's all too easy to encode these 'accidentally'. They are, after all, perfectly regular registers.

For those not keeping score, we have two special cases to handle. First, whenever the first operand is either rsp or r12 and we're not using direct mode, an extra SIB byte needs to be encoded to specify that we are really talking about accessing rsp/r12 directly. This is done by encoding rsp as both the base and index, which the x86 understands because using rsp as an index is usually illegal. (The magic byte is thus 0044 or 0x24). Second, whenever the first operand is rbp or r13 and we're using indirect access without an offset, we need to encode indirect access with an offset instead, just with the offset at zero. This of course requires another byte. Somewhat byzantine, but manageable.
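In code, the two special cases look roughly like this. This is a simplified sketch of my own, not DynASM's actual implementation, and it leaves out the REX prefix that distinguishes r12/r13 from rsp/rbp:

```python
RSP, RBP, R12, R13 = 4, 5, 12, 13

def encode_indirect(reg, rm):
    """Encode 'reg, [rm]' (plain indirect access) as a list of bytes."""
    low = rm & 7                       # only the low 3 bits fit in ModR/M
    if low == 4:                       # rsp or r12: force a SIB byte
        modrm = (0 << 6) | ((reg & 7) << 3) | 4
        sib = (0 << 6) | (4 << 3) | 4  # index=rsp (i.e. none), base=rsp -> 0x24
        return [modrm, sib]
    if low == 5:                       # rbp or r13: indirect+offset, offset zero
        modrm = (1 << 6) | ((reg & 7) << 3) | 5
        return [modrm, 0x00]           # one-byte displacement of zero
    return [(0 << 6) | ((reg & 7) << 3) | low]
```

Note that both rsp and r12 produce the same two bytes here; only the REX prefix (emitted elsewhere) tells them apart.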

We are, unfortunately, not completely OK yet. It is my central hypothesis of this post that DynASM was not designed to handle register selection at runtime. Evidence for this hypothesis is that DynASM does 'weird' things like mix data and instructions and linking prior to encoding. Usually one encodes first and links afterwards, especially when during encoding you may need to make decisions that influence the final positions of certain segments. DynASM does it the other way around, which means that during linking we should be able to calculate exactly how much space we need for each instruction. Which is a pain, because DynASM mixes the data stream (which we need for inspection) with the instruction stream (which tells the runtime what to do with its input). It's possible to hack around this - basically by copying data into the instructions - but it's not elegant. Starting with this commit, I'm reasonably confident that this stuff works, a test case is provided here.

That almost concludes this week's madness. The only thing left is to question the method: why should x86 use this convoluted scheme? I could do detailed historical research, but I prefer to speculate that it is caused by economy of memory. After all, in the regular case you need only 2 bytes, which is - conveniently - equal to 16 bits, the original register size of the 8086. And since that chip was designed in the 1970s, it makes sense that instructions should be as space-efficient as possible. In contrast, ARM uses 32-bit instructions with 3 operands. So space economy seems a plausible cause to me.

See you next time!

Monday, September 14, 2015

Wrapping Up

Hi everybody, it's been a while since I've blogged, and I think you deserve an update. Last week, of course, was YAPC::EU, which was awesome. Granada is a very nice place, and the weather was excellent. The tapas lunch was very nice, as was the gala dinner (also with tapas). There were many interesting people and presentations (many more than I could actually see). It was also very interesting to present (slides) about the JIT, which I think went well. One of the comments I heard was that it was a quite high-level talk, so if anyone should ask, I can and will give a talk describing the grueling details in the future. Oh, and I just found that Larry Wall's keynote has been uploaded to youtube. Go ahead and watch, this page isn't going anywhere.

So what news from JIT compiler land? Well, since last I've blogged I had to design and implement the register allocator and tie everything together into the final running compiler. That has been achieved, but there are still many bugs and limitations that prevent it from actually being effective. I will talk about these at length. But first, let me explain the register allocator.

The basic problem of register allocation is that a compiler can assign more values than the CPU has registers. Thus, only some values can reside in registers and others must be stored to memory (spilled). Computing the best set of values to keep in registers is in general an intractable problem. To solve it I use the 'linear scan' allocation heuristic. This heuristic is as simple as can be: determine for each value its last use, and when it's time to spill a value, spill the one which will be live furthest into the future. These slides also do a good job of explaining it.

Aside from being simple to implement, it's also effective. Values that will expire soon are likely to be used soon as well, so their register will free up soon enough. On the other hand, values that are bound to live a long time will likely be spilled before they expire anyway, so it costs less to spill them now. Another benefit is that this can be evaluated online, i.e. as the JIT tree is being processed, so that it doesn't require a separate step. (One abstraction it did require, though, was order-numbering the tiles, which are currently attached to the JIT tree. This makes the tiles in essence a linear list, and it is my plan to convert them to a linear array in memory as well. That would reduce the number of tree traversals by one as well).
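A minimal sketch of the heuristic, operating on precomputed live intervals. The interval representation and the names are my own simplification; the real allocator works on the order-numbered tiles described above:

```python
import heapq

def linear_scan(intervals, nregs):
    """intervals: list of (name, start, end); returns (assignment, spills)."""
    active = []                        # min-heap of (end, name, reg)
    free = list(range(nregs))
    assignment, spills = {}, set()
    for name, start, end in sorted(intervals, key=lambda iv: iv[1]):
        while active and active[0][0] < start:      # expire finished intervals
            _, _, reg = heapq.heappop(active)
            free.append(reg)
        if free:
            reg = free.pop()
        else:
            # spill the value that lives furthest into the future
            victim = max(active)
            if victim[0] <= end:                    # current value lives longest
                spills.add(name)
                continue
            active.remove(victim)
            heapq.heapify(active)
            spills.add(victim[1])
            del assignment[victim[1]]
            reg = victim[2]
        assignment[name] = reg
        heapq.heappush(active, (end, name, reg))
    return assignment, spills
```

With one register and the intervals a=[0,10], b=[1,3], the allocator spills a (live furthest into the future) and keeps b in the register, exactly as the heuristic prescribes.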

I will not maintain the illusion here that a register allocator is a trivial component, even with a conceptually simple algorithm as just outlined. One of the tricky bits is ensuring that values defined in a conditional branch do not accidentally 'escape' out of their branches, since they will be unavailable if the branch was not taken. In the same vein, after a call all 'volatile' registers are invalidated, which also requires some special treatment.

After the register allocator was finished, all that remained was ironing out the bugs, and they were (and are) many. The most annoying of these were the DynASM machine code generation bugs. The DynASM runtime is a tight state machine that uses a mix of data and instructions to generate machine code. The first bug was relatively simple: a missing REX byte marking, caused by the preprocessor looking at only one of the two operands. The second bug was positively evil. See, x86 uses so-called ModR/M bytes to specify the registers and memory access mode used for an instruction. It's not important that you know what the name stands for, but what is important is that each such byte - 8 bits - is divided into a mode of 2 bits and 2 register numbers of 3 bits each. (There are 8 legacy registers in x86, so that fits.) Except when the register number is 4 (rsp, the stack pointer register); then the byte that follows changes entirely in meaning to a SIB byte, which is quite different - it refers to two registers at once. The upshot is that this SIB byte must then be appended to the ModR/M byte and filled with correct data, and that this SIB byte is then interpreted as if it were a ModR/M byte anyway. I've patched DynASM to do this, but it is really quite sensitive and brittle, and I quite expect to have to fix it in the future in another way.

That brings me to today. Unfortunately for my JIT aspirations, my educational obligations have caught up with me again. In other words: my studies have started and leave me with much less time to work on the JIT. So for clarity, in the last few months I have completed the following:
  • The expression tree intermediate format and template preprocessor
    • Documentation on the use of these
  • The tiler and tiler table generator
  • A register allocator and assorted support structures in the new JIT
  • Dynamic register support for x64 DynASM 
  • Periodic reports on progress by means of this blog
Timo Paulssen has developed a C-source to expression tree format preprocessor, which should in the best case help quickly convert most of the old JIT compiler segments to the new, when this is more stable.

Here is what I intended (and indeed promised) I would do, and haven't finished yet:
  • A test suite - I haven't needed it during development as regular NQP compilation already involves plenty of JIT compiler activity. The benefit of a real test suite is of course that specific segments can be tested (like numeric registers)
  • A representation-based extension API. I haven't gotten around to it because the JIT has not been stable enough so far. It will be based on the use of templates specialized to a certain in-memory representation.
  • Documentation on the use and development of architecture-specific tiles.
And there are things which I haven't outlined explicitly but which I still want to do:
  • Full floating point operation support (the simplest way is probably to use a wholly separate allocator for numeric registers).
  • Automatic within-template ordering to maintain consistent branch semantics (rather than relying on let ordering to build values shared between branches).
  • A framework for optimizing transformations. I think the key to this is probably a replacement table allowing different algorithms to insert replacements for given nodes.
  • As I said above, a truly linear representation of the selected tiles, in order to allow reordering for the purposes of instruction scheduling.
And there are many more. But as I said, I cannot continue with them as I have in the past few months. I will continue to work to stabilise the JIT, and I hope to succeed in that before Christmas, though at a slower pace than before. I realise it is somewhat disappointing not to be able to fully 'land' the new JIT yet, but I'm confident we'll get there in the end. See you next time!

Sunday, August 16, 2015

Tiler Update

In my last blog I talked a bit about a problem I had with generating all possible sets of rules from the tile 'grammar'. I've since solved that problem, and thought it'd be nice to share my results. (NB: I use the word 'grammar' loosely today and on other days. In compiler jargon a grammar is usually a thing that matches a linear stream of lexical tokens to a tree structure. In this case, it's really the other way around, because it's used to match a tree and output a linear sequence. However, these things are quite similar, so I hope you'll forgive me for using the term).

It may help if I state the problem as clearly as I can. The input is a list of rules mapping a head and up to two child nodes to a tile and a nonterminal symbol. A tile, to reiterate, is ultimately a piece of code that emits machine code. The nonterminal symbol (from now on: symbol) is what takes the place of a node after a tile has been selected for it. This abstraction is what allows the tiler to determine the rule(set) for a node by looking only at its direct children. Note that tile rules that refer to nested structures are 'flattened' by replacing nested children with special one-off symbols.

Many rules can be matched together. Take for instance these tiles starting with an add node. For x64, we can compile a register-to-register addition, a constant-to-register addition, and a memory-to-register addition. (x86 can also add from register to memory, but we ignore that).

Rule 1: Adding register to register

Rule 2: A tile that adds a constant value to a register

Rule 3: A tile that adds a register to a value loaded from memory

Given that both a (load mem) and a (const) node can be tiled to reg symbols, we can always match (add reg reg) when matching an add node. But it's not always possible to match (add reg (const)) or (add reg (load mem)), and the last two certainly cannot be combined. So when the tiler inspects the add node, it can emit the following three combinations of tiles: {1}, {1,2} and {1,3}. The problem I had was: given the complete set of rules, how can we find all permissible combinations that the tiler can generate?

(As an aside, I don't think generating only the necessary rulesets is merely an optimization; it is very much necessary for the feasibility of the tiler. The tiler tables would otherwise become either prohibitively large or incomplete.)

Why can we combine 1 and 3? Because the (load mem) node can generate a reg symbol as well as the 'placeholder' symbol generated in rule 3 (let's call that the 3a symbol). Similarly, the (const) node can generate a reg as well as the 2a symbol. Indeed, rules can be combined whenever the symbols they refer to can be combined, and they are separate whenever the symbols they refer to can occur separately. So the key to finding all permissible combinations of rules is finding all possible combinations of symbols, which is a somewhat simpler problem.

First of all, all rules that refer to the same combination of symbols can be combined themselves, generating new combinations of symbols. Second, some symbol combinations may not be generated at all (3a will never occur in isolation, for example). Finally, when we do have all possible sets of symbols, applying all rules to each set will only continue to generate those same combinations. We can find rules which combine in such a way easily using a trie. So the solution to finding all permissible combinations of rules is this:
  1. Apply each rule to a trie of symbol combinations and store the rule number and symbol in each leaf.
  2. Extract all combinations of symbols that resolve to a single trie leaf.
  3. Compare the symbol combinations to those used in step 1. If there are any changes, repeat from step 1.
  4. The trie leaves now also contain all possible rule combinations, which simply need to be given a number.
I derived the idea for this algorithm from the table generation algorithm for parser generators, to which it is also quite similar.
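To make the fixed-point idea concrete, here is a sketch in Python (the real table generator is written in perl and tracks rule numbers in trie leaves; this toy version tracks only symbol sets, and flattens subrules like 8a into zero-cost placeholder symbols of my own naming):

```python
from itertools import product

# Flattened mini-grammar from the July post: (head, child symbols, produced
# symbol). Subrule placeholders like '8a' are modelled as extra symbols.
RULES = [
    ('local', (), 'reg'),        # rule 1
    ('addr',  ('reg',), 'mem'),  # rule 2
    ('addr',  ('reg',), 'reg'),  # rule 3
    ('const', (), 'reg'),        # rule 4
    ('const', (), '8a'),         # subrule 8a
    ('load',  ('reg',), 'reg'),  # rule 5
    ('load',  ('mem',), 'reg'),  # rule 6
    ('load',  ('reg',), '9a'),   # subrule 9a
    ('load',  ('mem',), '10a'),  # subrule 10a
    ('add',   ('reg', 'reg'),  'reg'),  # rule 7
    ('add',   ('reg', '8a'),   'reg'),  # rule 8
    ('add',   ('reg', '9a'),   'reg'),  # rule 9
    ('add',   ('reg', '10a'),  'reg'),  # rule 10
]

def reachable_symbol_sets(rules):
    """Fixed-point iteration: apply every rule to every known combination
    of child symbol-sets until no new symbol-set appears."""
    heads = {}
    for head, kids, _ in rules:
        heads.setdefault(head, len(kids))
    known = set()
    changed = True
    while changed:
        changed = False
        for head, arity in heads.items():
            for combo in product(sorted(known, key=sorted), repeat=arity):
                produced = frozenset(
                    out for h, kids, out in rules
                    if h == head and all(req in cs
                                         for req, cs in zip(kids, combo)))
                if produced and produced not in known:
                    known.add(produced)
                    changed = True
    return known

sets = reachable_symbol_sets(RULES)
for s in sorted(sets, key=sorted):
    print(sorted(s))
```

On this grammar the iteration terminates with just five reachable symbol sets - among them {reg, 9a, 10a}, which corresponds to the {5, 6, 9a, 10a} ruleset from the worked example in the July post - rather than the exponential number a naive enumeration would produce.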

A further change I had to make was to split the finding of rulesets - a bottom-up procedure - from the selection of the right rules, which is a top-down procedure. It is top-down because the rule selected at a node determines the rules that can be applied for its child nodes. All that is needed is a table that stores the minimum-cost rule to generate a given symbol for a given tile set, and a root rule to start with. (NB: I'm not 100% sure that's true, but it's my working hypothesis, and it can't generate wrong code, only suboptimal code.)
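The top-down half can be sketched like this (toy Python; the ruleset names and table contents below are entirely made up for illustration - the real table maps rulesets and symbols to rule numbers):

```python
# Hypothetical toy: a node is (op, ruleset_id, children). The ruleset ids and
# the (ruleset_id, goal symbol) -> (rule, child goals) table would come from
# the offline table generator; here they are invented for illustration.
TABLE = {
    ('rs_add_78',   'reg'):   ('add_cnst',     ('reg', 'const')),  # rule 8
    ('rs_load_56',  'reg'):   ('load_mem',     ('mem',)),          # rule 6
    ('rs_addr_23',  'mem'):   ('addr_reg',     ('reg',)),          # rule 2
    ('rs_addr_23',  'reg'):   ('addr_mem',     ('reg',)),          # rule 3
    ('rs_local_1',  'reg'):   ('local',        ()),                # rule 1
    ('rs_const_48', 'const'): ('const_inline', ()),  # folded into rule 8
}

def select(node, goal='reg'):
    """Top-down selection: the rule chosen for a node dictates the goal
    symbols that its children must produce."""
    op, ruleset, children = node
    rule, child_goals = TABLE[(ruleset, goal)]
    picked = [(op, rule)]
    for child, g in zip(children, child_goals):
        picked += select(child, g)
    return picked

tree = ('add', 'rs_add_78',
        (('load', 'rs_load_56',
          (('addr', 'rs_addr_23', (('local', 'rs_local_1', ()),)),)),
         ('const', 'rs_const_48', ())))
print(select(tree))
```

Starting from the root with the goal of producing a reg, each lookup fixes the goals for the children, so a single downward pass assigns every node its tile.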

So that was my update. See you next time!

donderdag 13 augustus 2015

Inching Closer

Hi everybody, I realise I haven't blogged in a while now, and it'd be a good time to write you an update. I've hit a few problems, but am still making progress.

I've written the tiler algorithm for the JIT compiler. This caused a difficulty because the input graph is not a tree but a DAG, meaning that a node may have more than one 'parent' or 'consumer' nodes. Since consumers decide the tiling of their child nodes, this may mean that the tiles conflict. I resolve such tile conflicts simply by adding a new node to replace the old one.
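In sketch form (toy Python, not MoarVM's actual data structures), resolving such a conflict by duplicating the shared node looks something like this:

```python
# Toy sketch: a DAG node may have several consumers, each demanding a tiling
# 'shape' for it. When the demands conflict, duplicate the node so that each
# demand gets its own copy.
class Node:
    def __init__(self, op, children=()):
        self.op = op
        self.children = list(children)

def resolve_conflicts(consumers):
    """consumers: (parent, child_index, demanded_tile) triples that all
    point at the same shared node."""
    seen = {}
    for parent, idx, tile in consumers:
        shared = parent.children[idx]
        if tile not in seen:
            # first demand reuses the node; later, different demands copy it
            seen[tile] = shared if not seen else Node(shared.op, shared.children)
        parent.children[idx] = seen[tile]
    return len(seen)  # number of distinct copies in use

load = Node('load', [Node('addr')])
a = Node('add', [load, Node('const')])
b = Node('store', [Node('addr'), load])
n = resolve_conflicts([(a, 0, 'fold_into_add'), (b, 1, 'as_register')])
print(n)  # two conflicting demands, so the load node is duplicated
```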

I've also been busy adding the tile implementations. These are pieces of C code that emit assembly code corresponding to the part of the input tree they represent. This caused some challenges, in that tiles sometimes refer to nodes deep in the tree. I solve this by adding, to the JIT tile table, traced paths to the nonterminals, allowing the JIT compiler to find the nodes referred to and make them easily available. For example:

(add reg (load mem))

The second 'value' argument to this tile is the mem node, not the load. These value arguments are important for managing state between tiles, like the register containing the computed value, or a computed memory address.
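As a sketch of the idea (in Python; the actual tile table is generated by perl and consumed from C, and the representation below is my own), the paths can be computed by walking the tile pattern, and later used to fetch the corresponding node from a concrete tree:

```python
# Assumed toy representation: tile patterns as nested tuples, where
# lower-case symbols like 'reg' and 'mem' are nonterminal placeholders.
NONTERMINALS = {'reg', 'mem'}

def paths_to_values(pattern, path=()):
    """Collect child-index paths to every nonterminal in a tile pattern."""
    found = []
    for i, part in enumerate(pattern[1:]):       # pattern[0] is the op name
        if isinstance(part, tuple):
            found += paths_to_values(part, path + (i,))
        elif part in NONTERMINALS:
            found.append(path + (i,))
    return found

def fetch(tree, path):
    """Follow a traced path of child indices into a concrete tree."""
    for i in path:
        tree = tree[1:][i]
    return tree

pattern = ('add', 'reg', ('load', 'mem'))
print(paths_to_values(pattern))      # [(0,), (1, 0)]

tree = ('add', ('local',), ('load', ('addr', ('local',), 8)))
print(fetch(tree, (1, 0)))           # ('addr', ('local',), 8)
```

The second traced path, (1, 0), leads straight to the mem node inside the load, which is exactly the value argument the tile needs.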

To the expression template compiler, I've added the facility to mark certain templates as 'destructive', meaning that they write the value they yield directly to memory. Thus, this value does not become directly available for the JIT tree. Many MoarVM instructions are actually implemented that way, which I noticed when timotimo kindly offered to help write an automated translation tool from the 'old' JIT graph to the new format.

I've changed the JIT-to-interpreter interface to be simpler and deal more sensibly with exits. It used to be the case that the interpreter had to return control out from the JIT-ted frame, so that it'd know when it had unwound the last stack and had to exit. Now the JIT-ted frame is responsible for unwinding the stack itself, much like interpreted code is. The unwound stack check was simple enough to replicate in the JIT interpreter driver instruction.

I've completed the changes to DynASM that allow us to address all x64 general-purpose registers in all different sizes. This required the addition of a new facility that records whether a so-called REX byte - indicating the use of an extended register - is required for addressing the registers supplied to DynASM. Because many fields on MoarVM internal structures are less than 64 bits in size, this was more important than I had realized. At the same time, I've also added value size information to the expression tree, so that these sizes are used correctly.
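For reference, the REX prefix has the form 0b0100WRXB, so the check boils down to a few bit tests. A sketch of the encoding rules (this illustrates the x64 instruction format, not DynASM's actual implementation, and it glosses over the corner case where a bare 0x40 prefix is needed for byte access to sil/dil/spl/bpl):

```python
# Compute the x64 REX prefix byte (0b0100WRXB) for an instruction.
def rex_byte(wide, reg, index, base):
    """wide: 64-bit operand size; reg/index/base: register numbers 0-15."""
    rex = 0x40
    if wide:
        rex |= 0x8   # REX.W: 64-bit operand size
    if reg >= 8:
        rex |= 0x4   # REX.R extends the ModRM.reg field
    if index >= 8:
        rex |= 0x2   # REX.X extends the SIB.index field
    if base >= 8:
        rex |= 0x1   # REX.B extends ModRM.rm / SIB.base
    return rex if rex != 0x40 else None  # simplification: skip a bare 0x40

# add r10, rax: needs REX.W (64-bit) and REX.B (r10 is register number 10)
print(hex(rex_byte(True, 0, 0, 10)))   # 0x49
```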

I've designed, but not yet implemented, a proper register allocator. The key abstraction, I think, is to distinguish between registers which are in use in contrast with registers that are allocated. Allocated registers may be spilt and reused, but used registers may not.

I've documented the syntax and structure of the expression tree format. I hope this will help other people to write JIT extensions and/or optimizations. Documentation on the tiler is work in progress.

My current problem is that the method I had designed last time to derive all possible tiler states is wrong, because it tried to apply a topological sort to a graph that has cycles. Breaking these cycles leads to not generating all permissible tile states, which leads to runtime failures in the tiler. So this has to be avoided. At the same time, it is very important not to generate more tiler states than necessary, because otherwise the tile table (and the time spent constructing it) becomes prohibitively large. Despite helpful suggestions by the community, I'm still not sure how to solve this. (It's a similar problem to parser table construction, but... not the same).

All in all, I think I've come very close to really starting to apply the new code generator. I had expected to reach this state earlier, but what I had not realized is that a 'real' compiler requires many moving parts working correctly. Fortunately, I've tried to test new parts as they were being written, so I'm confident that most of them work as expected. Nevertheless, the next weeks will be quite exciting, as I'm sure there will be many bugs to find.

Finally, the schedule for this year's YAPC::EU is online, and I'm due to present at 12:00 on Wednesday, in the Aula Magna. For those who'll be there, I hope to have an interesting story to tell. See you next week!

woensdag 29 juli 2015

Tiles and Compiler Compilers

This week I made progress on the central 'tiling' algorithm used for code generation. I think it makes an interesting story and theory, so I will try to explain it in this post. Unfortunately, I don't think I can explain it very well without resorting to computer-science jargon, but I hope you can forgive me.

What is tiling?
Tiling refers to the process of breaking an input tree into many pieces. The tree refers to the data structures representing the code that the JIT should compile. As I wrote a few weeks back, these trees are ultimately generated from templates, mapped to the MoarVM opcodes. Let's suppose we have a tree for the expression: result = LOCAL[8] + 12 and that it'd look like the code below:
(add (load (addr (local) 8)) (const 12))


I realize most people probably are not familiar with the syntax used here. These are called s-expressions, and there is fortunately little to know: '(' opens a list, ')' closes a list, and words represent symbols that are reproduced verbatim (as are the numbers). If we pretend that the first word in the list represents the name of a function and the rest of the list its arguments, it hopefully becomes clear how the text represents the expression (and you can add LISP to your LinkedIn profile at the same time).
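If you want to play along at home, a minimal s-expression reader fits in a dozen lines of Python (this is just an illustration; MoarVM's actual template compiler is written in perl):

```python
import re

# A minimal s-expression reader, just enough for fragments like the one above.
def parse_sexpr(text):
    tokens = re.findall(r'\(|\)|[^\s()]+', text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == '(':
            lst = []
            while tokens[pos] != ')':
                lst.append(read())
            pos += 1                 # consume the closing ')'
            return tuple(lst)
        return int(tok) if tok.lstrip('-').isdigit() else tok

    return read()

print(parse_sexpr('(add (load (addr (local) 8)) (const 12))'))
# ('add', ('load', ('addr', ('local',), 8)), ('const', 12))
```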

For those who prefer a visual layout (which is most people, I guess), the following graph represents the exact same code:

The code fragment under consideration.

Tiling is necessary because there are many ways to implement this expression in machine code. For instance, every node can be mapped to a single instruction, as in the naive code below. Compare this to the improved code, which is only two instructions long:
Naive code:

mov r0, 12
lea r1, [rbx+8]
mov r2, [r1]
mov r3, r0
add r3, r2

Improved code:

mov r0, [rbx+8]
add r0, 12

As you can hopefully see, each instruction in the improved code covers multiple parts of the tree at once. As a result, the improved code is much shorter, and probably faster to execute.
How do we do it?
There are two basic abstractions in tiling. The first of these is the tile grammar. A very simple grammar is shown below. Each of these tile rules maps a tree fragment to an action, a terminal, and a cost. A tile rule that matches a tree fragment conceptually replaces the fragment with its terminal; the resulting terminal can be used to match other tiles.

1:  (tile: local    (local) reg 1)
2:  (tile: addr_reg (addr reg) mem 1)
3:  (tile: addr_mem (addr reg) reg 2)
4:  (tile: const    (const) reg 2)
5:  (tile: load_reg (load reg) reg 5)
6:  (tile: load_mem (load mem) reg 5)
7:  (tile: add_reg  (add reg reg) reg 2)
8:  (tile: add_cnst (add reg (const)) reg 3)
9:  (tile: add_ldrg (add reg (load reg)) reg 6)
10: (tile: add_ldmm (add reg (load mem)) reg 6)

The second abstraction is the rule set. A given tree node can potentially be matched by any number of rules. For instance, (const) may be matched by rule 4 or it may be matched as the 'subrule' of rule 8. (We denote the subrule of rule 8 as rule 8a). In fact, there is no way the matcher can distinguish between these two options when it evaluates the (const) node. The matcher can only distinguish between the two options when evaluating the parent node of const. (Parents often know more than their children). Hence, mapping a node often yields multiple rules.

This is something of a challenge for matching the parent node. In case of the (load) node in the graph above, do we match rule 6 (load mem) or rule 5 (load reg)? The (addr) node can map to either a reg or a mem terminal, so it does not reduce the ambiguity. The answer is that rather than trying to cut through the ambiguity, we should embrace it. That is to say, we represent the combination of rules as a single ruleset, and the ruleset represents all possible matching rules.

For example, in the grammar above, a (const) node by itself always matches rule 4 or rule 8a. So the (const) matches a single ruleset consisting of { 4, 8a }. Similarly, an (addr) always takes a reg terminal and maps to both rules { 2, 3 }. In contrast, the (load) node can match rule 5 - if its child matches to a reg terminal - or rule 6 if its child matches to a mem terminal. (It can also match rule 9a and 10a, but ignore that for simplicity). Since all nodes that generate a mem terminal (i.e. the (addr)) can also generate a reg terminal, rule 6 is always combined with rule 5, but the inverse is not the case. (It's perfectly possible to (load) the result of an (add) operation, for example). Thus, (load) maps to two distinct rulesets: { 5 } and { 5, 6 }.
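The matching step itself is simple to express. A Python sketch over a fragment of the grammar above (omitting the subrules for brevity; the tuple representation is my own):

```python
# Which rules match a node, given the terminals its children can produce?
# Rules as (number, op, required child terminals, produced terminal).
RULES = [
    (5, 'load', ('reg',), 'reg'),
    (6, 'load', ('mem',), 'reg'),
    (7, 'add',  ('reg', 'reg'), 'reg'),
]

def ruleset(op, child_terminal_sets):
    """All rules whose required child terminals the children can supply."""
    return {num for num, head, kids, _ in RULES
            if head == op
            and all(t in ts for t, ts in zip(kids, child_terminal_sets))}

# A (load) of an (addr) child, which yields both reg and mem:
print(ruleset('load', [{'reg', 'mem'}]))   # {5, 6}
# A (load) of an (add) child, which yields only reg:
print(ruleset('load', [{'reg'}]))          # {5}
```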

Table generation
It is pretty easy to determine whether a rule will match a node and a combination of rulesets: just check whether any of those rulesets can generate the required terminals. Checking this for all available rules then gives you a new combination of rules, which is also represented as a ruleset. Better yet, knowing the costs associated with each rule, one can determine the optimal rule to compute a node to the terminal required. For instance, in the tree above:
  1. (local) can only be matched by rule 1 (ruleset {1}).
  2. (addr {1}) can be matched by rule 2 and 3 equally. (ruleset {2,3})
  3. (load {2,3}) can be matched by rule 5, 6, 9a and 10a, because the ruleset {2,3} from (addr) generates both mem and reg terminals. (ruleset {5,6,9a,10a}).
  4. (const) can be matched by rule 4 and 8a (ruleset: {4, 8a}).
  5. (add) can be matched by rule 7 and 8, because ruleset {5,6,9a,10a} can generate a reg, and ruleset {4, 8a} can generate the (const) placeholder expected by rule 8. Hence (add {5,6,9a, 10a} {4, 8a}) yields ruleset {7,8}.
Now to determine the optimum code:
  1. (add) can best be represented by rule 8, because this has lower cost than the combination of rule 4 and rule 7.
  2. (load) can best be represented by rule 6, because this has lower cost than rule 3 and 5 combined.
  3. (const) requires no specific representation, being embedded in rule 8.
  4. The same goes for (addr) and (local).
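The computation above is the classic bottom-up 'labelling' pass of a BURS-style tiler, and can be sketched as follows (in Python; subrules such as 8a are flattened into zero-cost rules producing a placeholder terminal, a simplification of my own):

```python
# Bottom-up minimum-cost labelling over the example tree.
# Rules as (rule, op, required child terminals, produced terminal, cost);
# 'c8' is a zero-cost placeholder terminal standing in for subrule 8a.
RULES = [
    (1,    'local', (), 'reg', 1),
    (2,    'addr',  ('reg',), 'mem', 1),
    (3,    'addr',  ('reg',), 'reg', 2),
    (4,    'const', (), 'reg', 2),
    ('8a', 'const', (), 'c8', 0),
    (5,    'load',  ('reg',), 'reg', 5),
    (6,    'load',  ('mem',), 'reg', 5),
    (7,    'add',   ('reg', 'reg'), 'reg', 2),
    (8,    'add',   ('reg', 'c8'),  'reg', 3),
]

def label(tree):
    """Return {terminal: (cost, rule)} for this node, computed bottom-up."""
    op, children = tree[0], [label(c) for c in tree[1:]]
    best = {}
    for rule, head, kids, out, cost in RULES:
        if head != op or len(kids) != len(children):
            continue
        if not all(k in c for k, c in zip(kids, children)):
            continue
        total = cost + sum(c[k][0] for k, c in zip(kids, children))
        if out not in best or total < best[out][0]:
            best[out] = (total, rule)
    return best

tree = ('add', ('load', ('addr', ('local',))), ('const',))
print(label(tree))   # {'reg': (10, 8)}
```

The cheapest way to produce a reg at the root is indeed rule 8, at a total cost of 10: rule 8 (3) plus rule 6 (5) plus rule 2 (1) plus rule 1 (1), with the (const) folded in for free.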
In fact, you can compute this information prior to the actual compilation process, and stick it in a table - simply by applying all rules to all combinations of rulesets. Doing this transforms the ambiguous, nondeterministic matching process into a deterministic table lookup. In CS jargon, it transforms the NFA represented by the tile grammar into a DFA. As in all such transformations, this takes up significant amounts of space.

Let's keep it simple
So much space, in fact, that we're not home and dry just yet. A table mapping the combination of a node and two children - indexed by ruleset - must be at least O(n_child × n_ruleset^2) large. If we naively generate all combinations of rules that start with a given head node, we generate 2^n rulesets per type of head. Some heads are potentially involved with over 10 rules (consider (load), which is allowed in nearly all x86 instructions), giving - naively - 1024 rulesets. Most of these rulesets are impossible to generate. For example, in our miniature grammar, a ruleset containing {8,9} clearly cannot be generated. It is therefore in our keen interest to generate the minimum number of rulesets.

But that is pretty complicated: it either requires rather subtle analysis of the grammar, which isn't algorithmically cheap by itself; or we can simply read off all the rulesets that can be generated from the grammar, by constructing the table that generates them. Clearly that is a chicken-and-egg problem: to find the rulesets that can be generated by a grammar, we have to make a table based on just those rulesets. Fortunately, chicken-and-egg problems can usually be solved by using some form of topological sorting. To put it in other words, we don't need to have all rulesets available to find the combination of rules the grammar can produce, just some that generate all the terminals needed by a given node. In our grammar above, we can start by generating all rules for (const) and (local), noting that they generate one ruleset each. After that is done, we can generate all rules that rely only on reg, which is the (addr) rule (generating mem). We continue this process until all rulesets have been generated. This dramatically reduces the size of the table, which is still pretty large. Without this procedure, however, the time taken to build the table tends to explode on even relatively small grammars.

Ultimately the tiler table must be available for the JIT compiler, which is written in C. The tile table generator is written in perl5 (just like the expression template compiler), because, manipulexity and whipuptitude, and it runs everywhere, you know? In fact, perl5 is already a requirement for building MoarVM, which means I wouldn't introduce new build-time dependencies. (Nobody likes installing a build-time dependency, least of all me). Perl5 natively supports hash tables; C does not. So I chose to represent the table as a sorted array of key + value and use binary search to find the right items. There are certainly more efficient representations, but this is very simple and still guarantees quite adequate lookup times. This is important in ensuring the JIT compiler won't become a bottleneck itself.
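The lookup scheme, sketched in Python for brevity (the real thing is plain C over arrays, and the keys and tile names below are purely illustrative):

```python
import bisect

# A sorted array of (key, value) pairs, searched with binary search -
# the same scheme as the generated C table, in miniature.
table = sorted([
    (0x2001, 'tile_add_reg'),
    (0x1003, 'tile_load_mem'),
    (0x0007, 'tile_local'),
    (0x30ff, 'tile_addr_reg'),
])
keys = [k for k, _ in table]

def lookup(key):
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return table[i][1]
    return None          # key not present in the table

print(lookup(0x1003))    # tile_load_mem
print(lookup(0xdead))    # None
```

Binary search gives O(log n) lookups with no extra memory beyond the sorted array itself, which is why it is such a comfortable fit for a table baked into a C program.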

So that is the story of how I wrote the tiler table generator (and incidentally, a story how perl saved the day). With these in place, I can implement the final tiler quite trivially (I already did, again, in perl). I conclude with noting that while the Aho algorithm guarantees optimal tiling (within the limitations of the grammar), it is not an optimization method by itself. For truly good code - say, like GCC or LLVM can produce - much more is needed: instruction scheduling, register allocation, and true optimisation techniques. Until my next report, I'll be working on these topics. See you then!