Friday 25 July 2014

Bugs, Updates, and ABIs

Once upon a time, way too long ago, I blogged and notified you of my progress. There's been plenty of progress since then, so it is time to write again. Since I last wrote, I've added support for invocations, 'invokish' instructions - including decontainerization and object conditionals, which appear really frequently - and OSR. I've also had to fix bugs which had crept into the code but had never been properly tested for, because we typically need to implement a whole set of ops before any particular frame is compiled, and if those frames then break it is unclear which change caused it. I'll talk about these bugs a bit first and then about my next challenges.

The first bug that I fixed seemed to have something to do with smart numification specifically. This is an example of a so-called 'invokish' instruction, in which an object is coerced into a primitive such as a number or a string. Some types of objects override the default methods of coercion and as such will need to run code in a method. Because the JIT doesn't know beforehand if this is so - many objects are coerced without invoking code at all - a special guard was placed to ensure that the JIT code falls out into the interpreter to deal with an 'unexpected' method invocation. Freakishly, it seemed to work about half of the time, depending on the placement of variables in the frame.

Naturally, I suspected the guard to be wrong. But (very) close inspection in gdb assured me that this was not the case, that the guard in fact worked exactly as intended. What is more, usually JIT bugs cause unapologetic crashes, segmentation faults, bus errors and the like, but this bug didn't. The code ran perfectly, just printing the wrong number consistently. Ultimately I tracked it down to the differences in parameter passing between POSIX and Windows. On both platforms, the first few parameters to a function are passed in registers. These registers differ between platforms, but that's easy enough to deal with using macro definitions. On both platforms, floating-point arguments are passed via the 'SSE' registers as opposed to the general-purpose registers. However, the relative positions are assigned differently. On Windows, they are assigned in the order of the declaration. In other words, the following function declaration

void foo(int i, double d, int j, double f);

assigns i to the first general-purpose register (GPR), d to the second SSE register, j to the third GPR, and f to the fourth SSE register. On POSIX platforms (Mac OS X, Linux, and the rest), they are first classified by type - integer, memory, or floating point - and then assigned to consecutive registers. In other words, i and j are passed in the first and second GPR, and d and f are passed in the first and second SSE register. Now my code implemented the Windows behavior for both platforms, so on POSIX, functions expecting their first floating point argument in the first SSE register would typically find nothing there. However, because the same register is often used for computations, there typically would be a valid value, and often the value we wanted to print. Not so in smart numification, so these functions failed visibly.
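To make the difference concrete, here is a minimal sketch in C of the two assignment rules. It is not the moar-jit code: the names `ArgKind`, `win64_arg_reg` and `sysv_arg_reg` are made up for illustration, and the sketch only models integer and floating-point arguments that fit in a single register (no structs, no varargs).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical classification: only register-sized integers and doubles. */
typedef enum { ARG_INT, ARG_FLOAT } ArgKind;

static const char *const WIN64_GPR[4] = { "rcx", "rdx", "r8", "r9" };
static const char *const WIN64_SSE[4] = { "xmm0", "xmm1", "xmm2", "xmm3" };
static const char *const SYSV_GPR[6]  = { "rdi", "rsi", "rdx", "rcx", "r8", "r9" };
static const char *const SYSV_SSE[8]  = { "xmm0", "xmm1", "xmm2", "xmm3",
                                          "xmm4", "xmm5", "xmm6", "xmm7" };

/* Windows x64: the argument's position selects the slot,
 * its kind selects the register file. */
static const char *win64_arg_reg(const ArgKind *kinds, size_t idx) {
    assert(kinds != NULL);
    if (idx >= 4)
        return "stack";
    return kinds[idx] == ARG_FLOAT ? WIN64_SSE[idx] : WIN64_GPR[idx];
}

/* POSIX (System V): each kind has its own counter, so the slot is the
 * number of earlier arguments of the same kind. */
static const char *sysv_arg_reg(const ArgKind *kinds, size_t idx) {
    size_t same_kind_before = 0;
    size_t i;
    assert(kinds != NULL);
    for (i = 0; i < idx; i++)
        if (kinds[i] == kinds[idx])
            same_kind_before++;
    if (kinds[idx] == ARG_FLOAT)
        return same_kind_before < 8 ? SYSV_SSE[same_kind_before] : "stack";
    return same_kind_before < 6 ? SYSV_GPR[same_kind_before] : "stack";
}
```

For the declaration of foo above, with kinds { ARG_INT, ARG_FLOAT, ARG_INT, ARG_FLOAT }, the Windows rule puts d in xmm1 while the POSIX rule puts it in xmm0, which is exactly the mismatch described.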

The second bug had (ultimately) to do with write barriers. I had written the object accessors a long time ago and had tested the code frequently since then, so I had not expected anything to be wrong with them. However, because I had implemented only very few string instructions, I had never noticed that string slots require write barriers just as object slots do. (I should have known, this was clear from the code in the interpreter). Adding new string instructions thus uncovered an unused code path. After comparing the frames that were compiled with the new string instructions with those without, and testing the new string instructions in isolation, I figured that the accessors had something to do with it. And as it turned out, they had.
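For readers unfamiliar with write barriers, the general idea in a generational collector can be sketched as follows. This is a simplified illustration, not MoarVM's implementation: the `Obj` and `RememberedSet` types and both function names are hypothetical, and a real collector would also avoid recording the same object twice.

```c
#include <assert.h>
#include <stddef.h>

#define REMEMBERED_MAX 16

/* Hypothetical object with a single reference slot; the slot may point to
 * another object or to a string, the barrier does not care which. */
typedef struct Obj {
    int         in_old_gen;  /* 1 if the object lives in the old generation */
    struct Obj *slot;
} Obj;

typedef struct {
    Obj   *items[REMEMBERED_MAX];
    size_t count;
} RememberedSet;

/* Record old-generation objects that start pointing at young objects, so a
 * collection of the young generation can still find those references. */
static void write_barrier(RememberedSet *rs, Obj *target, Obj *value) {
    if (value != NULL && target->in_old_gen && !value->in_old_gen) {
        assert(rs->count < REMEMBERED_MAX);
        rs->items[rs->count++] = target;
    }
}

/* Every store into a reference slot has to go through the barrier. */
static void store_ref(RememberedSet *rs, Obj *target, Obj *value) {
    write_barrier(rs, target, value);
    target->slot = value;
}
```

Skipping the barrier for one kind of slot, as happened here with string slots, only goes wrong when an old object is made to point at a young one, which is why such a bug can stay hidden for a long time.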

The third bug, which puzzled me for over a week, really shouldn't have, but it involved the other type of object accessors - REPR accessors. These accessors are hidden behind functions, but these functions did not take into account the proper behavior on type objects. Long story short, type objects (classes and the like) don't have any attributes to look up, so they should return NULL when asked for any. Not returning NULL will cause a subsequent check for nullity to pass when it shouldn't. Funnily enough, this one didn't actually cause a crash, just a (MoarVM) exception.
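The fix amounts to a check of roughly this shape. The `Obj` layout and the `get_attr` name are invented for this sketch and do not correspond to MoarVM's actual REPR functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical object header: a type object carries no attribute data. */
typedef struct {
    int   is_type_object;  /* 1 for a type object (class), 0 for an instance */
    void *attrs[4];        /* attribute slots, meaningless for type objects */
} Obj;

/* Accessor that answers NULL for type objects instead of reading whatever
 * happens to be stored where the attributes would have been. */
static void *get_attr(const Obj *obj, size_t slot) {
    assert(obj != NULL && slot < 4);
    if (obj->is_type_object)
        return NULL;
    return obj->attrs[slot];
}
```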

I suppose that's enough about bugs and new features though, so let's talk about the next steps. One thing that would help rakudo perl 6 performance - and what jnthn has been bugging me about for the last weeks :-) - is implementing 'extops'. In short, extops are a way to dynamically load new instructions into the interpreter. For the interpreter, they are just function calls, but for the JIT they pose special challenges. For example, within the interpreter an extop can just branch to another location in the bytecode, because this is ultimately just a pointer update. But such a jump would be lost to the JIT code, which after all doesn't know about the updated pointer. Of course, extops may also invoke a routine, and do all sorts of interesting stuff. So for the JIT, the challenge will not be so much executing the extops as figuring out what to do afterwards. My hope is that the information provided about the meaning of the operands - that is, whether they are literal integers, registers, or coderefs - will provide sufficient information to compile correct code, probably using guards. A similar approach is probably necessary for instructions that (may) throw or catch exceptions.

What's more directly relevant is that moar-jit tends to fail - crash and burn - on Windows platforms. Now as I've already mentioned, there are only a few differences between Windows and POSIX on assembly level. These differences are register usage and calling conventions. For the most part, I've been able to abstract these away, and life was good (except for the floating point functions, but I've already explained that at length). However, there are only so many arguments that can fit in registers, and the rest of them typically go to the stack. I tacitly assumed that all arguments that are pushed on the stack should be 64 bits wide (i.e. as wide as a register). But that's not true: smaller arguments take only as many bytes as they need. The ubiquitous MVMint32 type - an integer 32 bits wide - takes only 4 bytes. Which means that a function expecting two 32-bit numbers on the stack would receive the value of only one, and simply miss the other. As POSIX has 6 GPRs available, and Win64 only 4, it is clear this problem only occurs on Windows because there aren't any functions with more than 7 arguments.

Seems like a decent explanation, doesn't it? Unfortunately it is also wrong, because the size of the argument only counts for POSIX platforms. On Windows, stack arguments are indeed all 64 bits wide, presumably for alignment (and performance) reasons. So what is the problem, then? I haven't implemented the solution yet, so I'm not 100% sure that what I'm about to write is true, but I figure the problem is that after pushing a bunch of stack arguments, we never pop them. In other words, every time we call a function that takes more than 4 parameters, the stack top grows a few bytes, and never shrinks. Even that wouldn't be a problem - we'd still need to take care of alignment issues, but that's no big deal.

However, the JIT happens to use so-called non-volatile or callee-save registers extensively. As their name implies, the callee function is responsible for restoring these registers to their 'original' value upon exit, either by saving these values on the stack or by not using them at all. Contrary to popular opinion, this mechanism works quite well, moreover many C compilers preferentially do not use these registers, so using them is quite cheap in comparison to stack usage. And simple as well. But I store and restore them using push and pop operations, respectively. It is vital the stack top pointer (rsp register) is in the right place, otherwise the wrong values end up in these registers. But when the stack keeps on growing, on Windows as well as on POSIX systems, the stack pointer ends up in the wrong place, and I overwrite the callee-save register with gibberish. Thus, explosions.

From my review of available literature - and unfortunately, there is less literature available than one might think - and the behavior of C compilers, it seems the proper solution is to allocate sufficient stack space on JIT code entry, and store both the callee-save registers as well as the stack parameters within that space. That way, there's no need to worry about stack alignment issues, and it's always clear just where the values of the callee-save registers are. But as may be clear from this discussion, that will be quite a bit of work, and complex too. Testing might also be challenging, as I myself work on Linux. But that's ultimately what VMs are for :-). Well, I hope to write again soon with some good news.
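As a rough sketch of what 'allocate sufficient stack space on entry' could mean, the size of such a frame might be computed like this. The function name is hypothetical and so are the assumptions: every slot is 8 bytes, the result is rounded up to the 16-byte alignment that both x86-64 ABIs require at call sites, and `shadow_bytes` stands for the 32 bytes of register home space that the Windows x64 convention reserves (0 on POSIX). It also assumes the stack pointer is already 16-byte aligned when this amount is subtracted, for instance right after pushing the frame pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes to reserve once, on entry, for callee-save registers, outgoing
 * stack arguments and (on Windows) the shadow space. */
static size_t jit_frame_size(size_t n_callee_saved, size_t max_stack_args,
                             size_t shadow_bytes) {
    size_t raw;
    assert(shadow_bytes % 8 == 0);
    raw = 8 * (n_callee_saved + max_stack_args) + shadow_bytes;
    return (raw + 15) & ~(size_t)15;  /* round up to a multiple of 16 */
}
```

With a fixed frame like this, saved registers and stack arguments live at constant offsets from the stack pointer, so nothing is pushed or popped around individual calls.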

Sunday 6 July 2014

Moar JIT progress

So, it seems I haven't blogged in 3 weeks - or in other words, far too long. It seems time to blog again. Obviously, timotimo++ has helpfully blogged my and others' progress in the meantime. But to recap, since my last blog the following abilities have been added to the JIT compiler:

  • Conditionals and looping
  • Fast argument list access
  • Floating point and integer arithmetic
  • Reading and writing lexicals, and accessing 'world values'.
  • Fast reading and writing of object attributes
  • Improved logging and bytecode dumping.
  • Specialization guards and deoptimisation
The last of these points was done just this week, and the problem that caused it and the solution it involves are relevant to what I want to discuss today, namely invocation.

The basic idea of speculative optimization - that is, what spesh does - is to assume that if all objects in the variable $foo have been of class FooBar before, they'll continue to be FooBar in the future. If that is true, it is often possible to generate optimized code, because if you know the type of an object you typically know its layout too. Sometimes this assumption doesn't hold, and then the interpreter must undo the optimization - basically, return the state of the interpreter to where it would've been if no optimization had taken place at all.
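In C terms, a guard followed by a fast path might look like the sketch below. All names are invented for illustration, and falling back to a generic function is a simplification: real deoptimization resumes the unoptimized code in the middle of the frame rather than calling a separate routine.

```c
#include <assert.h>
#include <stddef.h>

enum { TYPE_FOOBAR = 1, TYPE_OTHER = 2 };

typedef struct { int type_id; long value; } Obj;
typedef enum { RAN_OPTIMIZED, DEOPTIMIZED } Outcome;

/* Generic ('duck-typing') path: has to inspect the type on every call. */
static long generic_get(const Obj *o) {
    switch (o->type_id) {
    case TYPE_FOOBAR: return o->value;
    case TYPE_OTHER:  return -o->value;
    default:          return 0;
    }
}

/* Specialized path: a single guard, then code that relies on the layout. */
static long speculative_get(const Obj *o, Outcome *outcome) {
    assert(o != NULL && outcome != NULL);
    if (o->type_id != TYPE_FOOBAR) {  /* guard failed: assumption broken */
        *outcome = DEOPTIMIZED;
        return generic_get(o);
    }
    *outcome = RAN_OPTIMIZED;
    return o->value;                  /* FooBar layout is known here */
}
```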

All the necessary calculations have already been done by the time spesh hands the code graph over to the JIT compiler, so compiling the guards ought to be simple (and it is). However, an important assumption broke because of it. The MoarVM term for a piece of executable code is a 'frame', and the JIT compiler compiles whole frames at a time. Sometimes frames can be inlined to create bigger frames, but the resulting code always represents a single new frame. So when I wrote the code responsible for entering JIT-ted code from the interpreter, I assumed that the JIT-ted code represented an entire frame, at the end of which the interpreter should return control to its caller.

During deoptimization, however, the interpreter jumps from optimized, type-specific code, to safe, unoptimized 'duck-typing' code. And so it must jump out of the JIT into the interpreter, because the JIT only deals with the optimized code. However, when doing so, the JIT 'driver' code assumed that control had reached the end of the frame and it ought to return to the caller frame. But the frame hadn't completed yet, so where the caller had expected a return value there was none.

The solution was - of course - to make the return from the current frame optional. But in true perl style, there is more than one way to do that. My current solution is to rely on the return value of the JIT code. Another solution is to return control to the caller frame - which is, after all, just a bit of pointer updating, and encapsulated in a function call, too - from the JIT code itself. Either choice is good, but they have their drawbacks, too. Obviously, having the driver do it means that you might return inappropriately (as in the bug), and having the JIT code might mean that you'd forget it when it is appropriate. (Also, it makes the JIT code bigger). Moreover, the invoked frame might be the toplevel frame in which case we shouldn't return to the interpreter at all - the program has completed, is finished, done. So this has to be communicated to the interpreter somehow if the JIT-code is considered responsible for returning to the frame itself.

The issues surrounding a JIT-to-interpreter call are much the same. Because MoarVM doesn't 'nest runloops', the JIT code must actually return to the interpreter to execute the called code. Afterwards the interpreter must return control back to the JIT code. Obviously, the JIT-ted frame hasn't completed when we return to the interpreter during a callout, so it can't return to its caller for the same reason. What is more, when calling out to the interpreter, the caller (which is JIT code) must store a return address somewhere, so the JIT driver knows where to continue executing after the callee returns.

I think by now it is too late to try and spare you from the boring details, but the summary of it is this: who or what should be responsible for returning control from the JIT-frame to the caller frame is ultimately an issue of API design, specifically with regards to the 'meaning' of the return value of the JIT code. If the 'driver' is responsible, the return value must indicate whether the JIT code has 'finished'. If the JIT code is responsible, the return value must indicate whether the whole program has finished, instead. I'm strongly leaning towards the first of these, as 'is my own frame finished' seems a more 'local' question than 'is the entire program finished'.
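The first option, where the driver interprets the return value, can be sketched like this. The status codes, the `Frame` fields and the function names are all hypothetical, and the loop is a simplification: in MoarVM the interpreter would actually run the called frame before control comes back to the JIT code, whereas here that step is only counted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical meanings of the JIT code's return value. */
typedef enum { JIT_FRAME_DONE, JIT_CALLOUT, JIT_DEOPT } JitStatus;

typedef struct {
    int resume_label;       /* where the JIT code continues after a callout */
    int returned_to_caller; /* set once the driver has left the frame */
    int callouts_handled;   /* stand-in for 'the interpreter ran the callee' */
} Frame;

typedef JitStatus (*JitCode)(Frame *frame);

/* Driver: only a finished frame returns to its caller. */
static void enter_jit(Frame *frame, JitCode code) {
    assert(frame != NULL && code != NULL);
    for (;;) {
        JitStatus status = code(frame);
        if (status == JIT_FRAME_DONE) {
            frame->returned_to_caller = 1;
            return;
        }
        if (status == JIT_DEOPT)
            return;  /* the interpreter continues this frame itself */
        frame->callouts_handled++;  /* JIT_CALLOUT: run callee, re-enter */
    }
}

/* Fake 'compiled' frame: one callout, then it finishes. */
static JitStatus fake_jit_code(Frame *frame) {
    if (frame->resume_label == 0) {
        frame->resume_label = 1;  /* continuation to use on re-entry */
        return JIT_CALLOUT;
    }
    return JIT_FRAME_DONE;
}

/* Fake 'compiled' frame whose guard fails immediately. */
static JitStatus fake_deopt_code(Frame *frame) {
    (void)frame;
    return JIT_DEOPT;
}
```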

With that said, what can you expect of me the coming week? With object access and specialization guards complete, the next step is indeed calling to interpreted code from the JIT, which I started yesterday. I should also get at argument passing, object creation, decontainerization, 'special conditionals', and many other features of MoarVM. The goal is to find 'compilation blockers', i.e., operations which can't be compiled yet but are common, and work through them to support ever greater segments of compiled code.

In the long run, there are other interesting things I want to do. As I mentioned a few posts earlier, I'd like to evolve the 'Jit Graph' - which is a linked list, for now - into a 'real' graph, ultimately to compile better bytecode. An important part of that is determining for any point in the code which variables are 'live' and used, and which are not. This will allow us to generate code to load important variables - e.g., the pointer to the input arguments buffer - temporarily in a register so that further instructions won't have to load it again. It will also allow us to avoid storing a computed value in a local if we know that it will be overwritten in the next instruction anyway (i.e., is temporary). Because copy-instructions are both very frequent and potentially very costly (because they access memory), eliminating them as best as possible should result in great speed advantages. Ultimately, this will also allow us to move more logic out of the architecture-specific parts and into the generic graph-manipulating parts, which should make the architecture-dependent parts simpler. I won't promise all this will be done in a single summer, but I do hope to be able to start with it.

Saturday 14 June 2014


Those of you who have followed #moarvm or github closely may already know, but this week I've finally checked in code that calculates 2 + 2 = 4 and returns that value to its caller. To be very specific, I can make a frame that does the following operations:

const_i64_16 r0, 2
const_i64_16 r1, 2
add_i r2, r1, r0
return_i r2

As a proof of concept, this is a breakthrough, and it shows that the strategy we've chosen can pay off. I didn't quite succeed without help of FROGGS, jnthn, nwc10, timotimo and others, but we're finally there. I hope. (I'll have to see about windows x64 support). The next thing to do is cleanup and extension. Some objectives for the following week are:
  • Cleanup. The JIT compiler still dumps stuff to stderr for my debugging purposes, but we shouldn't really have that. I've tried moving all output to the spesh log but I can hardly find the data in there, so I think I'll make a separate JIT log file instead. Similarly, the file for the JIT compiler's machine code dump - if any - should be specified. And I should add padding to the dump, so that more than one block can be dumped.
  • Adding operations to compile. MoarVM supports no fewer than 638 opcodes, and I support only 4 so far. That is about 0.63% of all opcodes :-). Obviously in those terms, I have a long way to go. jnthn suggested that the specialized sp_getarg opcodes are a good way to progress, and I agree - they'll allow us to pass actual arguments to a compiled routine.
  • Translate the spesh graph out of SSA form into the linear form that we use for the JIT 'graph' (which is really a labeled linked list so far).
  • Compile more basic blocks and add support for branching. This is probably the trickiest thing of the bunch.
  • Fix myself a proper windows-x64 virtual machine, and do the windows testing myself.
  • Bring the moar-jit branch up-to-date with moarvm master, so that testers don't have such a hard time.
As for longer-term goals, we've had some constructive contact with Mike Pall (of LuaJit / DynASM fame), and he suggested ways to extend DynASM to support dynamic registers. As I tried to explain last week, this is important for 'good' instruction selection. On further reflection, it will probably do just fine to introduce expression trees - and the specialized compiler backend for them, which would need register selection - gradually, i.e. per supported instruction rather than all at once.

However, the following features are more important still:
  • Support for deoptimisations. Up until now (and for the foreseeable future) we keep the memory layout exactly the same.
  • JIT-to-interpreter calls. This is a bit tricky - MoarVM doesn't support nesting interpreters. What we'll have to do instead is return to the interpreter with a label that stores our continuation, and continue at that continuation when we return.
  • At some point, JIT-to-JIT calls. Much the same problems apply - in theory, this doesn't have to differ from JIT-to-interpreter calls, although obviously we'd rather optimise the interpreter out of this loop.
  • Support for exceptions, obviously, which - I hope - won't be as tricky as it seems, as it ultimately depends on jumping in the bytecode at the right place.
  • Support for simple optimisations, such as merging various MoarVM opcodes into a single opcode if that is more suitable.
So that is it for now. See you next week!

    Monday 9 June 2014


    Today is the day I've both created an implementation of the 'JIT graph' and destroyed it. (Or rather stashed it away in a safe branch, but you get the point). The current HEAD of moar-jit has nothing that should deserve a name like 'JIT graph'. It is merely a thin layer around MVMSpeshGraph. So I thought maybe I should explain why I did this, what the consequences are, and what I'll do next.

    First of all, let me explain why we wanted a 'JIT graph' in the first place, and what I think it ought to be. MoarVM contains a bytecode specialization framework called spesh. My current project to write a JIT compiler can be seen as an extension of this framework. The core data structure of spesh - namely, MVMSpeshGraph - is also the input to the JIT compiler. I've promised a thorough walkthrough of spesh and you'll get it, but not today; today I have another point to make. That point is that although the spesh graph applies some sophisticated transformations upon the source bytecode, it is in essence still MoarVM bytecode. It still refers to MoarVM instructions and MoarVM registers.

    Now that is perfectly alright if you want to eventually emit MoarVM instructions, as it has done up until now. However, there is still quite a layer of abstraction between MoarVM and the physical processor that runs your instructions. For example, in MoarVM acquiring the value of a lexical is as simple as a single getlex instruction. For the CPU there are several levels of indirection involved to do the same, and quite possibly a loop. The goal of the 'JIT graph' then was to bridge these levels of abstraction. In effect, it is to make the job of the (native) code generator much simpler.

    I think the best way to explain this is with an example. Given the following MoarVM instruction:
    add_i r0, r1, r2

    I'd like to construct the following tree:
    store --> address --> moar-register(r0)
          \-> value --> add --> load --> moar-register(r1)
                            \-> load --> moar-register(r2)

    I think we can all criticize this structure for being verbose, and you'd be correct, but there is a point here. This structure is suitable for tree-matching and rewriting during code generation - in short, for generating good code. (Simpler algorithms that emit lousy code work too :-)). There are too many nice things I have to say about this structure. But it depends critically on my capability to select the registers on which operations take place. And as it turns out, on x86_64, I can't. Or on any other architecture than x86. Oh, and LuaJit doesn't actually use DynASM to compile its JIT, what do you know.
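One possible in-memory shape for that tree is sketched below. This is not the structure from the stashed branch: the `Node` type, its field names and `build_add_i` are made up here, and the nodes come from a caller-supplied array to keep the example free of allocation code.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    NODE_STORE, NODE_ADDRESS, NODE_VALUE, NODE_ADD, NODE_LOAD, NODE_MOAR_REG
} NodeKind;

typedef struct Node {
    NodeKind     kind;
    int          reg;    /* MoarVM register number, NODE_MOAR_REG only */
    struct Node *left;
    struct Node *right;
} Node;

/* Build the tree for 'add_i dst, src_a, src_b' out of nine pool nodes. */
static Node *build_add_i(Node pool[9], int dst, int src_a, int src_b) {
    pool[0] = (Node){ NODE_MOAR_REG, dst,   NULL,     NULL };
    pool[1] = (Node){ NODE_ADDRESS,  -1,    &pool[0], NULL };
    pool[2] = (Node){ NODE_MOAR_REG, src_a, NULL,     NULL };
    pool[3] = (Node){ NODE_LOAD,     -1,    &pool[2], NULL };
    pool[4] = (Node){ NODE_MOAR_REG, src_b, NULL,     NULL };
    pool[5] = (Node){ NODE_LOAD,     -1,    &pool[4], NULL };
    pool[6] = (Node){ NODE_ADD,      -1,    &pool[3], &pool[5] };
    pool[7] = (Node){ NODE_VALUE,    -1,    &pool[6], NULL };
    pool[8] = (Node){ NODE_STORE,    -1,    &pool[1], &pool[7] };
    return &pool[8];
}

/* Count nodes, mostly to show how verbose the tree is for one instruction. */
static size_t count_nodes(const Node *n) {
    return n == NULL ? 0 : 1 + count_nodes(n->left) + count_nodes(n->right);
}
```

A tree matcher would walk such a structure looking for patterns, for example 'add of two loads', and replace each match with the cheapest machine instruction sequence it knows for it.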

    Actually, I kind-of could've guessed that from the luajit source. But I didn't, and that is my own dumb fault.

    So, what to do next? There are two - or three, or four - options, depending on your level of investment in the given tools. One such option is to forgo register selection altogether and use static register allocation, which is what I did next. If we do that, there is truly no point in having a complicated graph, because all information is already contained in the MoarVM instructions themselves, and because you can't do anything sensible between instructions. After all, static register allocation means they're always the same. In essence, it means translating the interpreter into assembly. For most instructions, this approach is trivial - it could be done by a script. It is also rather unambitious and will never lead to much better performance than what the interpreter can do. Maybe 2x, but not 10x, which is what I think should be doable.

    The other option is to do register selection anyway, on top of DynASM, just because. I'm... not sure this is a great idea, but it isn't a terrible idea, either. In essence, it involves writing or generating giant nested switch structures that emit the right code to DynASM, like so, but everywhere, for every instruction in which you'd want this. I don't think that is particularly tractable, but it would be for a preprocessor.
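To give an impression of the shape of such code, here is a stand-in sketch in plain C. It does not use DynASM: where a real DynASM source file would have an assembly line (those start with `|`) with the register names written out, this version simply returns the corresponding text. The `Reg` type and `emit_add` are hypothetical, and only two registers are covered.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical register identifiers chosen at run time. */
typedef enum { REG_RAX, REG_RCX } Reg;

/* One switch per operand: every combination needs its own fixed line. */
static const char *emit_add(Reg dst, Reg src) {
    switch (dst) {
    case REG_RAX:
        switch (src) {
        case REG_RAX: return "add rax, rax";
        case REG_RCX: return "add rax, rcx";
        }
        break;
    case REG_RCX:
        switch (src) {
        case REG_RAX: return "add rcx, rax";
        case REG_RCX: return "add rcx, rcx";
        }
        break;
    }
    return NULL;  /* unknown register */
}
```

With N usable registers a two-operand instruction needs N * N cases, which is why this only seems workable if the switches are generated rather than written by hand.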

    The third option is to fix DynASM to do dynamic register allocation on x86_64 and any architecture you need it. This is possible - we maintain a fork of DynASM - but it'd involve deep diving into the internals of DynASM. What is more, Mike Pall who is vastly more capable than I am decided not to do it, and I'm fairly sure he had his reasons. The fourth option is to look for another solution than what DynASM provides. For while it is certainly elegant and nice, it may not be what we ultimately want.

    Wednesday 4 June 2014

    Goals and subgoals

    Hi everybody! As it seems that a JIT compiler doesn't fall into place fully formed in a weekend, I've decided to set myself a few goals - along with smaller subgoals that I hope will help keep me on track. The immediate goal for the week is to compile a subroutine that adds two numbers and returns the result, like so:
    sub foo() {
        return 3 + 4;
    }

    Which is literally as basic as you can get it. Nevertheless, quite a few parts have to be up and moving to get this to work. Hence the list. So without further ado, I present you:

    • Modifying the Configure / Make files to run DynASM and link the resulting file.
    I've actually already done this, and it was more complicated than it seems, and I'm still not completely happy about it.
    • Obtaining writable memory that can be marked executable
    • Marking said memory executable and non-writable (security folks!)
    I plan to do this by hijacking MVM_platform_allocate_pages(), which nobody uses right now.
    • Determine, for a given code graph, whether we can JIT compile it. 
      • Called MVM_can_jit_graph(MVMSpeshGraph*)
    • Transforming a Spesh graph into a JIT graph
      • Note that I don't know yet what that JIT graph will look like.
      • I think it will hold values along with their sizes, though. I'm not sure the spesh graph does that. 
    • Directly construct our very simple code graph, by hand, using MAST.
    • JIT compiling the very simple code graph of our code.
    • UPDATE: attach a JIT code segment to a MVMStaticFrame
    • Calling and returning from that code.
    This... will probably be a bit experimental - it's of no use to throw in a full-fledged register allocation and instruction selection algorithm to add two constant numbers. We can - in principle - also do without these, but it will lead to rather poor machine code. 
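The two memory-related steps in the list can be sketched with plain POSIX calls. This is an illustration only: it uses mmap and mprotect directly rather than MVM_platform_allocate_pages(), the name `load_code` is made up, and Windows would need VirtualAlloc and VirtualProtect instead.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Copy machine code into fresh pages, then flip them from writable to
 * executable so the memory is never both at the same time.
 * Returns NULL on failure. */
static void *load_code(const unsigned char *code, size_t len) {
    void *mem;
    assert(code != NULL && len > 0);
    mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return NULL;
    memcpy(mem, code, len);
    if (mprotect(mem, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, len);
        return NULL;
    }
    return mem;
}
```

On x86-64, passing the six bytes B8 07 00 00 00 C3 (mov eax, 7; ret) and calling the returned pointer through a function pointer of type int (*)(void) would yield 7; on other architectures the bytes would of course have to differ.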

    I've probably forgotten quite a few things in here. But this seems like a start. If there's something you think I missed, please comment :-)

    Sunday 18 May 2014

    MoarVM as a Machine

    If you read my blog, you'll likely know what MoarVM is and what it does. For readers who do not, MoarVM is a virtual machine that is designed to execute Perl 6 efficiently. Like a real computer, a virtual machine provides the following:
    • A 'processor', that is to say, something that reads a file and executes a program. This simulation is complete with registers and an instruction set.
    • An infinite amount of memory, using a garbage collection scheme.
    • IO ports, including file and network access.
    • Concurrency (the simulation of an infinite amount of processors via threads)
    In this post I'll focus on the 'processor' aspect of MoarVM. MoarVM is a 'register virtual machine'. This means simply that all instructions operate on a limited set of storage locations in which all variables reside. These storage locations are called registers. Every instruction in the bytecode stream contains the address of the memory locations (registers) on which it operates. For example, the MoarVM instruction for adding two integers is called add_i, and it takes three 'operands': two for the source registers to be added together and a third for the destination register to store the result. Many instructions are like that.
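A toy version of such a register machine fits in a few lines of C. The opcode names mirror MoarVM's const_i64_16, add_i and return_i, but the `Instr` encoding, the fixed set of eight registers and the function name are assumptions of this sketch and have nothing to do with MoarVM's real bytecode format.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { OP_CONST_I64_16, OP_ADD_I, OP_RETURN_I } Opcode;

/* Hypothetical encoding: every instruction names the registers it uses. */
typedef struct { Opcode op; int a, b, c; } Instr;

#define NUM_REGS 8

static long long run_register_vm(const Instr *code, size_t n) {
    long long regs[NUM_REGS] = { 0 };
    size_t pc;
    for (pc = 0; pc < n; pc++) {
        const Instr *i = &code[pc];
        assert(i->a >= 0 && i->a < NUM_REGS);
        switch (i->op) {
        case OP_CONST_I64_16:  /* regs[a] = literal b */
            regs[i->a] = i->b;
            break;
        case OP_ADD_I:         /* regs[a] = regs[b] + regs[c] */
            assert(i->b >= 0 && i->b < NUM_REGS);
            assert(i->c >= 0 && i->c < NUM_REGS);
            regs[i->a] = regs[i->b] + regs[i->c];
            break;
        case OP_RETURN_I:      /* return regs[a] to the caller */
            return regs[i->a];
        }
    }
    return 0;  /* fell off the end without a return */
}
```

Note that the add needs no extra copies: both sources and the destination are named in the instruction itself.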

    A register VM is often contrasted with a stack VM. The Java Virtual Machine is a well-known stack VM, as is the .NET CLR. In a stack VM values are held on an ever growing and shrinking stack. Instructions typically operate only on the top of the stack and do not contain any references to memory addresses. A typical stack VM would add two numbers by popping two values off the stack and pushing the result.
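For comparison, a toy stack machine in C could look like this. It is a generic sketch and does not follow the JVM or CLR instruction sets; the opcodes, the `StackInstr` encoding and the fixed stack size are all invented for the example.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { OP_PUSH, OP_ADD, OP_RET } StackOp;

/* Only OP_PUSH carries an operand; OP_ADD and OP_RET work on the stack top. */
typedef struct { StackOp op; long long imm; } StackInstr;

#define STACK_MAX 16

static long long run_stack_vm(const StackInstr *code, size_t n) {
    long long stack[STACK_MAX];
    size_t sp = 0;  /* number of values currently on the stack */
    size_t pc;
    for (pc = 0; pc < n; pc++) {
        switch (code[pc].op) {
        case OP_PUSH:
            assert(sp < STACK_MAX);
            stack[sp++] = code[pc].imm;
            break;
        case OP_ADD: {  /* pop two values, push their sum */
            long long b, a;
            assert(sp >= 2);
            b = stack[--sp];
            a = stack[--sp];
            stack[sp++] = a + b;
            break;
        }
        case OP_RET:
            assert(sp >= 1);
            return stack[sp - 1];
        }
    }
    return 0;  /* fell off the end without a return */
}
```

The add instruction itself carries no operands at all, but every value has to be brought to the top of the stack first.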

    Why was the choice for a register VM made? I'm not certain, but I think it likely that it was chosen because register machines are frequently faster in execution. In brief, the trade-off is between instruction size on one hand and total number of instructions needed to execute a given program. Because stack VM instructions do not contain any addresses (their operands are implicitly on the stack), they are smaller and the VM has to spend less time to decode them. However, values frequently have to be copied to the top of the stack in order for the stack machine to operate on them. In contrast, a register machine can just summon the right registers whenever they are required and only rarely has to copy a value. In most VM's, the time spent executing an instruction is much larger than the time spent decoding it, so register VM's are often faster. 

    From the point of view of somebody writing a (JIT) compiler (like myself), both architectures are abstractions, and somewhat silly too. All actual silicon processor architectures have only a limited number of registers, yet most 'register' VM's - including MoarVM - happily dole out a new set of registers for every routine. In some cases, such as the Dalvik VM, these registers are explicitly stack-allocated, too! The 'register' abstraction in MoarVM does not translate into the registers of a real machine in any way.

    Nonetheless, even for a compiler writer there is a definitive advantage to the register VM architecture. To the compiler, MoarVM's instructions are input, that is to be transformed into native instructions. The register VM's instructions are in this sense very similar to something called Three Address Code. (Actually, some MoarVM instructions take more than three operands, but I'll get to that in a later post). A very convenient property of TAC and MoarVM instructions alike is that every variable already has its own memory location. In contrast, in a stack VM the same variable may have many copies on the stack. This is inconvenient for efficient code generation for two reasons. 

    First of all, naively copying values as they would be in the stack VM will lead to inefficient code. It may not be obvious which copies are necessary and which are redundant. Nor is it immediately obvious how much run-time memory a compiled routine would use. To efficiently compile stack VM code a compiler might do best to translate it into Three Address Code first.

    But the second reason is perhaps more profound. Modern JIT compilers use a technique called type feedback compilation. Briefly, the idea is that a compiler that is integrated into the runtime of the system can exploit information on how the program is actually used to compile more efficient code than would be possible on the basis of the program source code alone. A simple example in JavaScript would be the following routine:

    function foo(a) {
        var r = 0;
        for (var i = 1; i < a.length; i++) {
            r += (a[i] * a[i-1]);
        }
        return r;
    }

    If all calls to foo happen to have a single argument consisting of an array of integers, the semantics of this routine become much simpler than they are otherwise. (For example, in JavaScript, the addition of a number and a string produces a well-defined result, so it is totally valid to call foo with an array of strings). A type-feedback compiler might notice a large number of calls to foo, all with integer arrays as their sole argument, assume this will always be so, and compile a much faster routine. In order to correctly handle arrays of strings too, the compiler inserts a 'guard clause' that checks if a is really an array of integers. If not, the routine must be 'de-optimised'. Note that spesh, which is the optimisation framework for MoarVM, also works this way.

    The goal of de-optimisation is to resume the execution of the interpreted (slow) routine where the assumptions of the compiled routine have failed. A typical place in our 'foo' function would be on entry or on the addition to r. The idea is that the values that are calculated in the optimised routine are copied to the locations of the values of the interpreted routine. In a register machine, this is conceptually simple because all variables already have a fixed location. However, the layout of the stack in a stack VM is dynamic and changes with the execution of the routine, and mapping between compiled and interpreted values may not be very simple at all. It is certainly doable - after all, the JVM famously has an efficient optimising JIT compiler - but not simple.

    And in my opinion, simplicity wins.

    Sunday 11 May 2014

    DynASM is awesome

    As part of my ‘community bonding’ period, I’ve taken it upon me to write a small series of blog posts explaining the various parts I’ll be using to add a JIT compiler to MoarVM. Today I’d like to focus on the DynASM project that originates from the awesome LuaJIT project.

    DynASM is probably best described as a run-time assembler in two parts. One part is written in Lua and acts as a source preprocessor. It takes a C source file in which special directives are placed that take the form of assembly-language statements. Here is a fully worked-out example. These are then transformed into run-time calls that construct the desired machine code. The generated machine code can be called like you would a regular function pointer.

    DynASM has no run-time dependencies. But to run the preprocessor you will need Lua as well as the Lua BitOp module (also from the LuaJIT project). The run-time part is contained within the headers. DynASM is licensed under the MIT license and supports many different architectures, including x86, x64, ppc, and arm. DynASM also integrates neatly into a Makefile-based build.

    In many respects DynASM is an ideal tool for this particular job. However, it also has a few drawbacks. The most important of these is the lack of documentation. With the exception of a few scattered blog posts, there is barely any documentation at all. For many of the simple operations, this is sufficient. For the more complex things, such as dynamic register selection, or dynamic labels, it seems there is no other option than to ask directly. (FWIW, the 'dynamic registers' question was asked and answered only two days ago on the LuaJIT mailing list). However, I think the benefits of using DynASM outweigh these issues.

    For my next blog, I'll be looking at the MoarVM machine model and bytecode set, especially in relation to x64. Hope to see you then.