Function Call Milestone
Hi everybody. It's high time for another update, and this time I have good news. The 'expression' JIT compiler can now compile native ('C') function calls (although it's not able to use the results). This is a major milestone because function calls are hard! (At least from the perspective of a compiler, and especially from the perspective of the register allocator). Also because native function calls are really very important in MoarVM. Most of its 'primitive' operations (like hash table access, string equality, big integer arithmetic) are implemented by invoking native functions, and so to compile almost any program the JIT has to compile many function calls.
What makes function calls 'hard' is that they must implement the 'calling convention' of the relevant 'application binary interface' (ABI). In short, the ABI specifies the locations of function call parameters. A small number of parameters (on Windows, the first 4, for POSIX platforms, the first 6) are placed in registers, and if there are more parameters they are usually placed on the stack. Aside from the calling convention, the ABI also specifies the expected alignment of the stack pointer (per 16 bytes) and the registers a functions may overwrite (clobber in ABI-speak) and which registers must have their original values after the function returns. The last type of registers are called 'callee-saved'. Note that at least a few registers must be callee-saved, especially those related to call stack management, because if the callee function would overwrite those it would be impossible to return control back to the caller. By the way, manipulating exactly those registers is how the setjmp and longjmp 'functions' work.
So the compiler is tasked with generating code that ensures the correct values are placed in the correct registers. That sounds easy enough, but what if the these registers are taken by other values, and what if those other values might be required for another parameter? Indeed, what if the value in the %rdx register needs to be in the %rsi register, and the value of the %rsi register is required in the %rdx register? How to determine the correct ordering for shuffling the operands?
One simple way to deal with this would be to eject all values from registers onto the stack, and then to load the values from registers if they are necessary. However, that would be very inefficient, especially if most function calls have no more than 6 (or 4) parameters and most of these parameters are computed for the function call only. So I thought that solution wouldn't do.
Another way to solve this would be if the register allocator could ensure that values are placed in their correct registers directly,- especially for register parameters - i.e. by 'precoloring'. (The name comes from register allocation algorithms that work by 'graph coloring', something I will try to explain in a later post). However, that isn't an option due to my choice of 'linear scan' as the register allocation algorithm. This is a 'greedy' algorithm, meaning that it decides the allocation for a live range as soon as it encounters them, and that it cannot revert that decision once it's been made. (If it could, it would be more like a dynamic programming algorithm). So to ensure that the allocation is valid I'd have to make sure that the information about register requirements is propagated backwards from the instructions to all values that might conflict with it... and that point we're no longer talking about linear scan, and I would be better off re-engineering a new algorithm. Not a very attractive option either!
Instead, I thought about it and it occurred to me that this problem seems a lot like unravelling a dependency graph, with a number of restrictions. That is to say, it can be solved by a topological sort. I map the registers to a graph structure as follows:
Unlike a 'proper' dependency graph, cycles can and do occur, as in the example where '%rdx' and '%rsi' would need to swap places. Fortunately, because of the single-inbound edge rule, such cycles are 'simple' - all outbound edges not belonging to the cycle can be resolved prior to the cycle-breaking, and all remaining edges are part of the cycle. Thus, the cycle can always be broken by freeing just a single node (i.e. by copy to a temporary register).
The only thing left to consider are the values that are used after the function call returns (survive the function call) and that are stored in registers that the called function can overwrite (which is all of them, since the register allocator never selects callee-saved registers). So to make sure they are available afterwards, we must spill them. But there are a few spill strategies to choose from (terminology made up by me):
What makes function calls 'hard' is that they must implement the 'calling convention' of the relevant 'application binary interface' (ABI). In short, the ABI specifies the locations of function call parameters. A small number of parameters (on Windows, the first 4, for POSIX platforms, the first 6) are placed in registers, and if there are more parameters they are usually placed on the stack. Aside from the calling convention, the ABI also specifies the expected alignment of the stack pointer (per 16 bytes) and the registers a functions may overwrite (clobber in ABI-speak) and which registers must have their original values after the function returns. The last type of registers are called 'callee-saved'. Note that at least a few registers must be callee-saved, especially those related to call stack management, because if the callee function would overwrite those it would be impossible to return control back to the caller. By the way, manipulating exactly those registers is how the setjmp and longjmp 'functions' work.
So the compiler is tasked with generating code that ensures the correct values are placed in the correct registers. That sounds easy enough, but what if the these registers are taken by other values, and what if those other values might be required for another parameter? Indeed, what if the value in the %rdx register needs to be in the %rsi register, and the value of the %rsi register is required in the %rdx register? How to determine the correct ordering for shuffling the operands?
One simple way to deal with this would be to eject all values from registers onto the stack, and then to load the values from registers if they are necessary. However, that would be very inefficient, especially if most function calls have no more than 6 (or 4) parameters and most of these parameters are computed for the function call only. So I thought that solution wouldn't do.
Another way to solve this would be if the register allocator could ensure that values are placed in their correct registers directly,- especially for register parameters - i.e. by 'precoloring'. (The name comes from register allocation algorithms that work by 'graph coloring', something I will try to explain in a later post). However, that isn't an option due to my choice of 'linear scan' as the register allocation algorithm. This is a 'greedy' algorithm, meaning that it decides the allocation for a live range as soon as it encounters them, and that it cannot revert that decision once it's been made. (If it could, it would be more like a dynamic programming algorithm). So to ensure that the allocation is valid I'd have to make sure that the information about register requirements is propagated backwards from the instructions to all values that might conflict with it... and that point we're no longer talking about linear scan, and I would be better off re-engineering a new algorithm. Not a very attractive option either!
Instead, I thought about it and it occurred to me that this problem seems a lot like unravelling a dependency graph, with a number of restrictions. That is to say, it can be solved by a topological sort. I map the registers to a graph structure as follows:
- Each register forms a node
- A transfer from a register to another register, or from a register to a stack location or a local memory location is an edge
- Each node can have multiple outbound edges, but only one inbound edge (only one value can ever be required in that register)
I linked to the topological sort page for an explanation of the problem, but I think my implementation is really quite different from that presented there. They use a node visitation map and a stack, I use an edge queue and and outbound count. A register transfer (edge) can be enqueued if it is clear that the destination register is not currently used. Transfers from registers to stack locations (as function call parameters) or local memory (to save the value from being overwritten by the called function) are also enqueued directly. As soon as the outbound count of a node reaches zero, it is considered to be 'free' and the inbound edge (if any) is enqueued.
Unlike a 'proper' dependency graph, cycles can and do occur, as in the example where '%rdx' and '%rsi' would need to swap places. Fortunately, because of the single-inbound edge rule, such cycles are 'simple' - all outbound edges not belonging to the cycle can be resolved prior to the cycle-breaking, and all remaining edges are part of the cycle. Thus, the cycle can always be broken by freeing just a single node (i.e. by copy to a temporary register).
The only thing left to consider are the values that are used after the function call returns (survive the function call) and that are stored in registers that the called function can overwrite (which is all of them, since the register allocator never selects callee-saved registers). So to make sure they are available afterwards, we must spill them. But there are a few spill strategies to choose from (terminology made up by me):
- full spill replaces all references to the value in register, with references to the value in memory, by inserting load and store operations around every use and definition of the value. Basically, it splits the single value live range in parts.
- split-and-spill finds the code segment where the spill is not required and splits it off from the rest and only edits the second part. This is actually rather tricky since to do this correctly requires data-flow analysis, which isn't otherwise required by a 'naive' implementation of linear scan
- spill-and-restore will find the shortest piece of code where the spill is necessary, store the value to memory, and load it directly afterwards, as if the spill never happened.
The current register allocator does a full spill when it's run out of registers, and it would make some sense to apply the same logic for function-call related spills. I've decided to use spill-and-restore, however, because a full spill complicates the sorting order (a value that used to be in a register is suddenly only in memory) and it can be wasteful, especially if the call only happens in an alternative branch. This is common for instance when assigning values to object fields, as that may sometimes require a write barrier (to ensure the GC tracks all references from 'old' to 'new' objects). So I'm guessing that it's going to be better to pay the cost of spilling and restoring only in those alternative branches, and that's why I chose to use spill-and-restore.
That was it for today. Although I think being able to call functions is a major milestone, this is not the very last thing to do. We currently cannot allocate any of the registers used for floating-point calculations, which is a relatively minor limitation since those aren't used very frequently. But I also need to do some more work to actually use function return values and apply generic register requirements of tiles. But I do think the day is coming near where we can start thinking about merging the new JIT with the MoarVM master branch, making it available to everybody. Until next time!
Reacties
Een reactie posten