Bugs, Updates, and ABIs
Once upon a time, way too long ago, I blogged and notified you of my progress. There's been plenty of progress since then, so it is time to write again. Since I last wrote, I've added support for invocations, 'invokish' instructions - including decontainerization and object conditionals, which appear really frequently - and OSR (on-stack replacement). I've also had to fix bugs that had crept into the code but had never been properly tested, because we typically need to implement a whole set of ops before any particular frame is compiled, and when those frames then break it is unclear which change caused it. I'll talk about these bugs first and then about my next challenges.
The first bug that I fixed seemed to have something to do with smart numification specifically. This is an example of a so-called 'invokish' instruction, in which an object is coerced into a primitive such as a number or a string. Some types of objects override the default methods of coercion and as such need to run code in a method. Because the JIT doesn't know beforehand if this is so - many objects are coerced without invoking code at all - a special guard was placed to ensure that the JIT code falls out into the interpreter to deal with an 'unexpected' method invocation. Freakishly, the compiled code seemed to work about half of the time, depending on the placement of variables in the frame.
Naturally, I suspected the guard to be wrong. But (very) close inspection in gdb assured me that this was not the case, and that the guard in fact worked exactly as intended. What is more, JIT bugs usually cause unapologetic crashes - segmentation faults, bus errors and the like - but this bug didn't. The code ran perfectly, just printing the wrong number consistently. Ultimately I tracked it down to the differences in parameter passing between POSIX and Windows. On both platforms, the first few parameters to a function are passed in registers. These registers differ between platforms, but that's easy enough to deal with using macro definitions. On both platforms, floating-point arguments are passed via the 'SSE' registers as opposed to the general-purpose registers. However, the relative positions are assigned differently. On Windows, they are assigned in the order of the declaration. In other words, the following function declaration
void foo(int i, double d, int j, double f);
assigns i to the first general-purpose register (GPR), d to the second SSE register, j to the third GPR, and f to the fourth SSE register. On POSIX platforms (Mac OS X, Linux, and the rest), arguments are first classified by type - integer, memory, or floating point - and then assigned to consecutive registers of their class. In other words, i and j are passed in the first and second GPR, and d and f are passed in the first and second SSE register. Now my code implemented the Windows behavior for both platforms, so on POSIX, functions expecting their first floating-point argument in the first SSE register would typically not find it there. However, because the same register is often used for computations, there typically would be a valid value in it, and often the very value we wanted to print. Not so in smart numification, so these functions failed visibly.
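To make the difference concrete, here is a minimal sketch of the two assignment rules. It is not the JIT's actual code; the names ArgClass, ArgLoc, assign_win64 and assign_sysv are made up for this illustration, and only integer and floating-point arguments are modelled (no structs or memory-class arguments).

```c
#include <assert.h>

/* Hypothetical argument classes; not MoarVM's own types. */
typedef enum { ARG_INT, ARG_FLOAT } ArgClass;

/* Where one argument ends up. reg_index counts within the GPR argument
 * registers for ARG_INT and within the SSE registers for ARG_FLOAT. */
typedef struct {
    int on_stack;   /* 1 if the argument is passed on the stack */
    int reg_index;  /* register index, or -1 when on the stack */
} ArgLoc;

enum { WIN64_SLOTS = 4, SYSV_GPRS = 6, SYSV_SSES = 8 };

/* Win64: the register index is simply the position in the declaration;
 * the class only decides whether that slot is a GPR or an SSE register. */
void assign_win64(const ArgClass *args, int n, ArgLoc *out) {
    assert(args && out && n >= 0);
    for (int i = 0; i < n; i++) {
        out[i].on_stack  = (i >= WIN64_SLOTS);
        out[i].reg_index = out[i].on_stack ? -1 : i;
    }
}

/* System V (POSIX): every class keeps its own counter. */
void assign_sysv(const ArgClass *args, int n, ArgLoc *out) {
    int gpr = 0, sse = 0;
    assert(args && out && n >= 0);
    for (int i = 0; i < n; i++) {
        if (args[i] == ARG_INT && gpr < SYSV_GPRS) {
            out[i].on_stack  = 0;
            out[i].reg_index = gpr++;
        } else if (args[i] == ARG_FLOAT && sse < SYSV_SSES) {
            out[i].on_stack  = 0;
            out[i].reg_index = sse++;
        } else {
            out[i].on_stack  = 1;
            out[i].reg_index = -1;
        }
    }
}
```

For the signature of foo above (int, double, int, double), assign_win64 yields the indices 0, 1, 2, 3, whereas assign_sysv yields 0, 0, 1, 1. Using the first rule on a platform that expects the second puts d in the second SSE register while the callee reads the first.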
The second bug had (ultimately) to do with write barriers. I had written the object accessors a long time ago and had tested the code frequently since then, so I had not expected anything to be wrong with them. However, because I had implemented only very few string instructions, I had never noticed that string slots require write barriers just as object slots do. (I should have known; this was clear from the code in the interpreter.) Adding new string instructions thus exposed a previously unused code path. After comparing the frames that were compiled with the new string instructions with those without, and testing the new string instructions in isolation, I figured that the accessors had something to do with it. And as it turned out, they had.
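For readers unfamiliar with write barriers, the following is a rough sketch of the idea in a generational collector, not MoarVM's implementation. All names here (Collectable, Holder, FLAG_OLD_GEN, write_barrier, the remembered array) are hypothetical. The point is only that the setter for a string slot needs the same barrier as the setter for an object slot, because a string is a garbage-collected value too.

```c
#include <assert.h>
#include <stddef.h>

#define FLAG_OLD_GEN 1u   /* hypothetical flag: value lives in the old generation */

typedef struct Collectable { unsigned flags; } Collectable;

/* A hypothetical object with one object slot and one string slot. */
typedef struct {
    Collectable  header;
    Collectable *obj_slot;
    Collectable *str_slot;
} Holder;

/* Old-generation owners that point at young values; a young-generation
 * collection must treat these as extra roots. */
#define REMEMBERED_MAX 16
Collectable *remembered[REMEMBERED_MAX];
size_t remembered_count = 0;

/* Record the owner if an old object is about to reference a young value. */
void write_barrier(Collectable *owner, Collectable *value) {
    if ((owner->flags & FLAG_OLD_GEN) && value && !(value->flags & FLAG_OLD_GEN)) {
        assert(remembered_count < REMEMBERED_MAX);
        remembered[remembered_count++] = owner;
    }
}

void set_obj_slot(Holder *h, Collectable *v) {
    write_barrier(&h->header, v);
    h->obj_slot = v;
}

/* The string setter needs exactly the same barrier as the object setter. */
void set_str_slot(Holder *h, Collectable *v) {
    write_barrier(&h->header, v);
    h->str_slot = v;
}
```

Leaving the write_barrier call out of set_str_slot gives code that works until a young string is stored in an old object and a collection happens to run, which is why it only shows up once enough string instructions are compiled.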
The third bug, which puzzled me for over a week, really shouldn't have. It involved the other type of object accessors - REPR accessors. These accessors are hidden behind functions, but these functions did not take into account the proper behavior on type objects. Long story short, type objects (classes and the like) don't have any attributes to look up, so the accessors should return NULL when asked for any. Not returning NULL will cause a subsequent check for nullity to pass when it shouldn't. Funnily enough, this one didn't actually cause a crash, just a (MoarVM) exception.
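In code, the missing check is tiny. The sketch below uses a made-up Obj struct and get_attribute function rather than the real REPR functions; the only thing it is meant to show is the early return for type objects.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_ATTRS 4

/* Hypothetical object layout: a type object has no attribute storage
 * worth reading, an instance does. */
typedef struct {
    int   is_concrete;        /* 0 for a type object, 1 for an instance */
    void *attrs[NUM_ATTRS];
} Obj;

/* Attribute lookup that returns NULL for type objects, so that the
 * caller's NULL check behaves as intended. */
void *get_attribute(const Obj *o, int slot) {
    assert(o && slot >= 0 && slot < NUM_ATTRS);
    if (!o->is_concrete)
        return NULL;          /* type objects have no attributes */
    return o->attrs[slot];
}
```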
I suppose that's enough about bugs and new features though, so let's talk about the next steps. One thing that would help Rakudo Perl 6 performance - and what jnthn has been bugging me about for the last few weeks :-) - is implementing 'extops'. In short, extops are a way to dynamically load new instructions into the interpreter. For the interpreter, they are just function calls, but for the JIT they pose special challenges. For example, within the interpreter an extop can just branch to another location in the bytecode, because this is ultimately just a pointer update. But such a jump would be lost on the JIT code, which after all doesn't know about the updated pointer. Of course, extops may also invoke a routine, and do all sorts of interesting stuff. So for the JIT, the challenge will not be so much executing the extops as figuring out what to do afterwards. My hope is that the information provided about the meaning of the operands - that is, whether they are literal integers, registers, or coderefs - will be sufficient to compile correct code, probably using guards. A similar approach is probably necessary for instructions that (may) throw or catch exceptions.
What's more directly relevant is that moar-jit tends to fail - crash and burn - on Windows platforms. Now as I've already mentioned, there are only a few differences between Windows and POSIX at the assembly level. These differences are register usage and calling conventions. For the most part, I've been able to abstract these away, and life was good (except for the floating-point functions, but I've already explained that at length). However, there are only so many arguments that can fit in registers, and the rest of them typically go on the stack. I tacitly assumed that all arguments that are pushed on the stack should be 64 bits wide (i.e. as wide as a register). But that didn't seem to be true: smaller arguments, I figured, take only as many bits as they need. The ubiquitous MVMint32 type - an integer 32 bits wide - takes only 4 bytes. Which means that a function expecting two 32-bit numbers on the stack would receive the value of only one, and simply miss the other. As POSIX has 6 GPRs available for arguments, and Win64 only 4, it is clear this problem only occurs on Windows, because there aren't any functions with more than 7 arguments.
Seems like a decent explanation, doesn't it? Unfortunately it is also wrong, at least where Windows is concerned: there, stack arguments are indeed all 64 bits wide, presumably for alignment (and performance) reasons. So what is the problem, then? I haven't implemented the solution yet, so I'm not 100% sure that what I'm about to write is true, but I figure the problem is that after pushing a bunch of stack arguments, we never pop them. In other words, every time we call a function that takes more than 4 parameters, the stack grows by a few bytes, and never shrinks. Even that by itself wouldn't be a problem - we'd still need to take care of alignment issues, but that's no big deal.
However, the JIT happens to use so-called non-volatile or callee-save registers extensively. As their name implies, the callee is responsible for preserving these registers, either by saving their 'original' values on the stack and restoring them upon exit, or by not using them at all. Contrary to popular opinion, this mechanism works quite well. Moreover, many C compilers preferentially do not use these registers, so using them is quite cheap in comparison to stack usage. And simple as well. But I store and restore them using push and pop operations, respectively. It is vital that the stack pointer (the rsp register) is in the right place, otherwise the wrong values end up in these registers. But when the stack keeps on growing, on Windows as well as on POSIX systems, the stack pointer ends up in the wrong place, and I overwrite the callee-save registers with gibberish. Thus, explosions.
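The failure mode is easy to reproduce in a toy model. The sketch below is plain C that simulates a downward-growing stack, not JIT output; Stack, push, pop and run_frame are names invented for this illustration. It saves one 'callee-save register' in the prologue, pushes some stack arguments for a call, and then restores the register in the epilogue, with or without releasing the argument space first.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_SLOTS 64

/* A toy stack of 64-bit slots that grows downward, like the x64 stack. */
typedef struct {
    uint64_t slots[STACK_SLOTS];
    int sp;                        /* index of the current top of stack */
} Stack;

void push(Stack *s, uint64_t v) {
    assert(s->sp > 0);
    s->slots[--s->sp] = v;
}

uint64_t pop(Stack *s) {
    assert(s->sp < STACK_SLOTS);
    return s->slots[s->sp++];
}

/* Prologue, one call with nargs stack arguments, epilogue.
 * Returns the value that ends up back in the callee-save register. */
uint64_t run_frame(uint64_t saved, int nargs, int cleanup) {
    Stack s;
    s.sp = STACK_SLOTS;
    push(&s, saved);                       /* prologue: save the register */
    for (int i = 0; i < nargs; i++)
        push(&s, 0xDEAD0000u + (unsigned)i);   /* stack arguments */
    /* ... the call itself would happen here ... */
    if (cleanup)
        s.sp += nargs;                     /* caller releases the arguments */
    return pop(&s);                        /* epilogue: restore the register */
}
```

With cleanup, run_frame(42, 2, 1) hands back 42. Without it, run_frame(42, 2, 0) pops the last stack argument instead, which is exactly the gibberish described above.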
From my review of the available literature - and unfortunately, there is less literature available than one might think - and the behavior of C compilers, it seems the proper solution is to allocate sufficient stack space on JIT code entry, and to store both the callee-save registers and the stack parameters within that space. That way, there's no need to worry about stack alignment issues, and it's always clear just where the values of the callee-save registers are. But as may be clear from this discussion, that will be quite a bit of work, and complex too. Testing might also be challenging, as I myself work on Linux. But that's ultimately what VMs are for :-). Well, I hope to write again soon with some good news.
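As a back-of-the-envelope sketch of what such a fixed frame could look like: the code below computes a frame size and slot offsets. It assumes a prologue of the form push rbp; mov rbp, rsp; sub rsp, frame_size, which is my assumption for this example and not something I have implemented yet. FrameSpec and the function names are likewise hypothetical. The 32 bytes of shadow space that Win64 reserves below the stack arguments are modelled with shadow_bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of what one JIT frame needs. */
typedef struct {
    int saved_regs;      /* callee-save registers the JIT code uses */
    int max_stack_args;  /* most stack arguments of any call in the frame */
    int shadow_bytes;    /* 32 on Win64, 0 on System V */
} FrameSpec;

/* Round n up to a multiple of a, where a is a power of two. */
size_t align_up(size_t n, size_t a) {
    assert(a != 0 && (a & (a - 1)) == 0);
    return (n + a - 1) & ~(a - 1);
}

/* Bytes to subtract from rsp once, on entry. A multiple of 16 keeps rsp
 * aligned at every call site under the assumed prologue. */
size_t frame_size(const FrameSpec *f) {
    size_t raw = 8 * (size_t)f->saved_regs
               + (size_t)f->shadow_bytes
               + 8 * (size_t)f->max_stack_args;
    return align_up(raw, 16);
}

/* Saved register i lives at a fixed negative offset from rbp. */
long saved_reg_offset(int i) {
    return -8L * (i + 1);
}

/* Outgoing stack argument j lives at a fixed positive offset from rsp,
 * above the shadow space if there is any. */
size_t stack_arg_offset(const FrameSpec *f, int j) {
    return (size_t)f->shadow_bytes + 8 * (size_t)j;
}
```

Because rsp never moves between the prologue and the epilogue, the saved registers are restored with plain loads from known offsets, and stack arguments are written with stores rather than pushes, so nothing is left behind after a call.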
Thank you for taking the time to post during what must be the most trying of circumstances.
I aim to please :-). For what it's worth, the Windows ABI bugs seem to have been fixed, but new bugs have popped up under the covers. Such is life, it seems.