Some things I want
Lately I've been fixing a few bugs here and there in the JIT compiler, as well as trying to advance the JIT expression optimizer. The story of those bugs is interesting but in this post I want to focus on something else, namely some support infrastructure that I think we should have that would make working on MoarVM and spesh/jit in particular much nicer.
There are a bunch of things related to runtime control of spesh and the JIT:
There are a bunch of things related to runtime control of spesh and the JIT:
- jit-bisect.pl is an awesome tool (if I say so myself) that deserves it's own blog post. It can figure out which frame is the first to be miscompiled and cause an execution failure. (Or to be specific, the least miscompile to cause a failure). When a JIT issue is brought to me, I reach for it directly. It should be more powerful:
- Because it does not know in advance how many frames are compiled in a program, it uses an 'exponential probing' strategy to find the first breaking block, and then binary search between the last two probes. This means that we need to have log(n) tries to find the first failure during probing, and then another log(n) tries to find the exact frame. Usually, failures are much faster than successful runs, so the first phase takes a disproportionate amount of time. It would probably speed up bisects if we could start probing from a number higher than 1.
- It would be nice if as part of (optional) output, jit-bisect.pl could create a 'replicator' script that would replicate the failure exactly as the bisect script found it. E.g. if it finds that frame 237 block 6 fails, then it should output a script that executes the program with the expression JIT active to exactly that point.
- We assign sequence number 0 to the first frame to be JIT-compiled, but can't bisect that because we start with sequence number 1.
- Both spesh and JIT logging should be much more specific. Currently, when MVM_SPESH_LOG is defined, we log all information on all frames we encounter onto the log file, resulting in enormous (easily >20MB) text files. Of those text files, only tiny bits tend to be affected in a way that interests you, and especially as the result of bisect process, only the last frame will be of any interest. So I propose a MVM_SPESH_LOG_FOI (frame of interest) flag that toggles when a frame should be logged or not. Could be an array as well (comma- or semicolon-separated-values). Same goes for the JIT.
- We can limit the number of frames that passes through spesh using the flag MVM_SPESH_LIMIT, and the number of frames (and basic blocks) that pass through the expression JIT with MVM_JIT_EXPR_LAST_FRAME and MVM_JIT_EXPR_LAST_BB. We cannot do the same for the 'lego' JIT, and we probably should.
- When MVM_SPESH_BLOCKING is not defined, spesh runs concurrently with the main program, the result of MVM_SPESH_LOG is not repeatable, and MVM_SPESH_LIMIT has as a result no well-defined meaning. Although the jit-bisect.pl program will set MVM_SPESH_LIMIT, it is easy enough to forget. Maybe we should either disable MVM_SPESH_LIMIT without MVM_SPESH_BLOCKING, or MVM_SPESH_LIMIT should imply MVM_SPESH_BLOCKING.
- Generally we have too much ad-hoc environment setup code that - I feel - does not fit well into the idea of having MoarVM being embedded within another program. Not that embedding interpreters in other programs is a common thing to do anymore, but still. We should probably take that out of general instance setup and install it somewhere else. (MVM_spesh_setup_control_flags?)
Then there's more ambitious stuff, that still falls under 'housekeeping':
- Currently the expression tree is backed by an array of integers, some of which identify the nodes and some of which are internal pointers (relative to the base of the array). There is an additional info array of structures that has to be explicitly maintained and that holds some information (result type and size, mostly). This is rather annoying when operating on the tree structure (as the optimizer does) because both arrays have to be kept synchronized. Also, these info values are only defined for nodes of the tree. We can reasonably fit all information stored in an info node in 32 bits (if we're packing) or 64 bits (if we're sloppy), even together with the node value. Doing this should save memory, be better for caches, and make tree manipulation easier.
- I want to have explicit cleanup of the spesh worker thread (if requested by program arguments). Not having this makes ASAN and valgrind somewhat less useful, because they will falsely report in-progress memory as being leaked. To do this we'd need to wait, during shutdown, for the worker thread to be stopped, which means that we also need to have a signal to indicate that it has been stopped.
- Sometimes, the JIT can get into a confused state and panic, bringing down the process. This is great for debugging and awful for users. I feel like we might be able to improve this - we can, during JIT graph construction, bail out without crashing (if we find constructs we can't JIT yet). We may be able to extend and generalize this. (If we find that we're in an illegal state entirely, that should probably still crash).
- The legacy ('lego') JIT currently works with a two phase process, first building a JIT graph, then compiling it. This is redundant - without loss of functionality, the lego JIT could be reduced to a single phase (reducing the number of internal structures quite dramatically). This is also something that I'm not totally sure of whether we should do, because we may want having an explicit preparation phase eventually (e.g. when we start doing multiple basic blocks per expression tree).
And there's more ambitious stuff that would fall under optimizations and general functionality improvements:
- MoarVM in some places assumes that we have 'easy' access to the current instruction pointer, e.g. to lookup dynamic variables or the scope of the current CATCH block. This is logical for an interpreter but problematic for the JIT. We currently need to maintain a pointer to the last entered basic block to approximate it, i.e. a memory store per block. This is expensive, and it's not often needed:
- We already have to maintain a 'return pointer' for JIT frames that are high up the call stack, so in those cases the additional stores are redundant.
- In case we are interested in the current position of the current frame, the current position is already on the stack (as the return address of the function call), and we can read it from stack-walking.
- Alternatively, we can replace calls to functions that are interested in the current position to calls that receive that position explicitly. However, I kind of prefer using stack walking since we'll need the return pointer anyway. Also, it doesn't work for anything that might implicitly throw (MVM_exception_throw_adhoc)
- The register allocator can be much better in a bunch of (simple) ways:
- We never need to spill a constant value to memory, since we can always just insert a copy instead.
- It is probably not necessary to insert individual restores before usages in the same basic block, especially for usages prior to the spilling point.
- We sometimes spill values that we didn't really need to spill, since we're already storing them to memory as part of the basic block.
- We have a kind-of hacky way to ensure that our 3-instruction code (c = op(a,b)) is converted to x86 2-instruction code (b = op(b,a)). This should be something that the register allocator can solve, or a phase after the register allocator, but it is currently being handled during code generation.
There is definitively more out there I want to do, but this should be enough to keep me busy for a while. And if anyone else wants to take a try at any of these, they'd be very welcome to :-).