Most Significant Bits

This week I think I fixed irregular behavior in the x64 instruction encoding register selction of DynASM. I think it'll be a fun story to share, so I thought it'd be time to blog.

The astonishingly irregular thing about x64 instruction encoding is that it is mostly very regular. Ignoring for the moment instruction prefixes and constants, an x86 instruction consists of two bytes, one for the instruction proper, and one for it's two operands. Take for example the addition of two registers:

+-----------------+
| add  | eax, ecx |
+-----------------+
| 0x01 |     0xc8 |
+-----------------+

We take the instruction byte as given. It is the second byte that concerns me because it determines which operands to use and how. Like a good CISC architecture, the x86 supports a number of addressing modes, meaning that a register can be used as a value but also as a (part of a) pointer. One of the reasons C does pointer arithmetic so freely is that this reflects the nature of the CPU's which where current when C was designed. The (relevant) x86 addressing modes are shown in the following table. (There are more, but you shouldn't use them):

Addressing ModeByte flagMeaning
Direct0xc0Both operands are used as direct values
Indirect0x00One of the operands is used as a memory reference, whether the source of destination operand depends on the instruction
Indirect with offset0x80 or 0x40One of the operands is used as a memory reference, offset by a constant which is encoded directly after the instruction.
Indexed0x04One operand consists of two registers base and index, which is multiplied by a scale, to provide a memory reference. Optionally also has an offset, in which case the code byte is 0x44 or 0x84, depending on whether the offset fits in a single byte or not.
Instruction-relative0x05One of the operands is a reference to the current location in the code offset by a constant (and the other refers to a register as usual).

Readers which are more careful than can be reasonably expected will have noticed that in the first three addressing modes, the lowest nibble is zero, whereas it is nonzero for the lower two addressing modes. This is in fact the source of irregularities in instruction encoding. To appreciate this it helps to unpack the operand byte in octal rather than hexadecimal. Octal is much closer to how x86 thinks about the world. As demonstrated in this table, the lowest two pairs of 3 bits each encode the actual registers that should be used.

+------+------+-----+-----+----------+
| Byte | Mode | Op2 | Op1 | Meaning  |
+------+------+-----+-----+----------+
| 0xc0 |    3 |   0 |   0 | Direct   |
+------+------+-----+-----+----------+
| 0x80 |    2 |   0 |   0 | Indirect |
+------+------+-----+-----+----------+
| 0x04 |    0 |   0 |   4 | Indexed  |
+------+------+-----+-----+----------+

The upshot of this is that in case the operand mode isn't direct, and the first operand register is either 4 or 5, the meaning of the operand byte is completely different. x86 suddenly expects another operand byte (a so-called SIB byte) to specify which register shall be base and which shall be index.

Normally this isn't much of a problem since the registers refered to by number 4 and 5 are rsp and rbp respectively; meaning the stack top and stack bottom registers. Fun fact: the x86 stack grows downward, so rbp > rsp in basically all cases. Also fun fact: because of this, writing from a rsp relative reference upwards can overwrite the return pointer held somewhere below rbp, which is the basis of most buffer overflow attacks. You thought NULL was a billion dollar mistake? Consider how the engineers that decided the stack should grow downward must feel.

Anyway, considering that rbp and rsp take up such a pivotal role in your program, it's actually unlikely you'll encode them by mistake. So as long as you don't do that, it's safe to ignore this complexity and just 'add in' the correct bits into the operand byte. Thus, for instance, to refer to the 7th and first register respectively in direct mode, we generate:

0300 + (07 << 3) + (01) == 0371 == 0xf9

However, in the land of x64, things are not so happy. I have blogged earlier about how the x64 architecture gives you 8 extra registers on top of the 8 legacy x86 registers, and how the difference between those registers lies only in that x64 specifies a prefix byte called REX. REX byte encoding is not so difficult, the trick is to find it reliably.  But because the lower three bits of the 4-bit register number are placed in the operand byte, register r12 and r13 look exactly like rsp and rbp to the CPU. Well, that's where the fun really starts, because it's all too easy to encode these 'accidentally'. They are after all perfectly regular registers.

For those not keeping score, we have two special cases to handle. First, whenever the first operand is either rsp or r12 and we're not using direct mode, an extra SIB byte needs to be encoded to specify that we are really talking about accessing rsp/r12 directly. This is done by encoding rsp as both the base and index, which the x86 understands because using rsp as an index is usually illegal. (The magic byte is thus 0044 or 0x24). Second, whenever the first operand is rbp or r13 and we're using indirect access without an offset, we need to encode indirect access with an offset instead, just with the offset at zero. This of course requires another byte. Somewhat byzantine, but manageable.

We are, unfortunately, not completely OK yet. It is my central hypothesis of this post that DynASM was not designed to handle register selection at runtime. Evidence for this hypothesis is that DynASM does 'weird' things like mix data and instructions and linking prior to encoding. Usually one encodes first and links afterwards, especially when during encoding you may need to make decisions that influence the final positions of certain segments. DynASM does it the other way around, which means that during linking we should be able to calculate exactly how much space we need for each instruction. Which is a pain, because DynASM mixes the data stream (which we need for inspection) with the instruction stream (which tells the runtime what to do with its input). It's possible to hack around this - basically by copying data into the instructions - but it's not elegant. Starting with this commit, I'm reasonably confident that this stuff works, a test case is provided here.

That almost concludes this weeks madness. The only thing left is to question the method. Why should x86 use this convoluted scheme? I could go on a detailed historical research, but I prefer to speculate it is caused by economy in memory. After all, in the regular case you need only 2 bytes, which is - conveniently - equal to 16 bit, the original register size of the 8086. And since that chip was designed in the 1970s, it makes sense instructions should be as space-efficient as possible. In contrast, ARM uses 32 bit instructions with 3 operands. So space economy seems a plausible cause to me.

See you next time!

Reacties

Populaire posts van deze blog

Reverse Linear Scan Allocation is probably a good idea

Retrospective of the MoarVM JIT

Something about IR optimization