The other day I discovered that 32-bit FreeBSD 11.2 has strange trouble running in an emulated environment. Utilities like top would just hang when trying to print floating-point numbers; the dtoa() library routine was getting stuck in an endless loop. (FreeBSD has excellent support for debugging the binaries shipped with the OS, so finding out where things were going wrong was unexpectedly easy.)
Closer inspection identified the following instruction sequence:
```
fldz
fxch   st(1)
fucom  st(1)
fstp   st(1)
fnstsw ax
sahf
jne    ...
jnp    ...
```
This code relies on “undefined” behavior. The FUCOM instruction compares two floating-point values and sets the FPU condition code bits. The FNSTSW instruction stores the status word, including those bits, into the AX register, where they can be tested directly; alternatively, the SAHF instruction copies them into the flags register, where they can be conveniently tested by conditional jump instructions (after SAHF, C3 lands in ZF, C2 in PF, and C0 in CF, which is exactly what the JNE and JNP tests examine).
The problem is the FSTP instruction in between. According to Intel and AMD documentation, the FSTP instruction leaves the FPU condition codes in undefined state. So the FreeBSD library is testing undefined bits… but it just happens to work on all commonly available CPUs, in a very predictable and completely deterministic manner, because the FSTP instruction in reality leaves the condition bits alone. What is going on?
To be honest, I failed to find why the condition codes are supposedly left in undefined state by most FPU instructions. What I did discover is that in the Intel 8087 and 287 documentation, there is no hint that FSTP might change the condition bits in any way. Although it is not entirely explicit, the Intel 287 documentation leaves a strong impression that only a select few instructions set the FPU condition bits and most instructions do not modify them at all. Which would actually be a very logical behavior.
For some unclear reason, the Intel 387 documentation (1987) very clearly says that most FPU instructions leave the condition codes “undefined”. This appears to have been copied by Cyrix, AMD, and just about any 3rd-party x87 FPU documentation that goes into sufficient detail.
At the same time, the book Programming the 80386 by Crawford and Gelsinger (1987), written by actual 386 designers at Intel, gives no hint that most FPU instructions might modify the condition code bits at all.
It would be misleading to read “undefined” in this context as “random” or “unpredictable”. When it comes to CPU documentation, “undefined” can mean several different things, including the following:
- We couldn’t be bothered to document the behavior because it’s too complicated
- The behavior actually changed between product families in the past
- The behavior has been 100% consistent, but we might want to change it in the future
- The behavior is so strange that we really, really don’t want anyone using it
In any case, the implication for programmers is “please do not rely on this behavior”. Yet sometimes programmers end up relying on it anyway, and it need not be done knowingly at all (including, I strongly suspect, the FreeBSD case described above).
The SHLD/SHRD instructions are a good example of behavior that changed in the past. It is possible to use these instructions with a 16-bit destination register and a shift count greater than 16. This is arguably a 386 design flaw (the shift count could have been limited), but in any case, the “undefined” behavior did change. According to Sandpile, SHLD/SHRD behaves one way on the Pentium (and likely the 386/486) and a different way on the P6 and P4 families, with additional differences in flag behavior between P6 and P4.
The behavior will be 100% deterministic and predictable on any given Intel CPU. Because the behavior changed across CPU generations, programs that are intended to run on a wide range of CPUs cannot rely on it. Documentation calls this “undefined”, but that is really misleading.
Another example of “undefined” behavior is the BSWAP instruction with a 16-bit operand. On all known processors, it behaves completely consistently: It reads the 16 bits of the operand, zero extends them to 32 bits, byte swaps the resulting DWORD which has the high half zeroed, and writes the result (zeros) into the 16-bit operand. This behavior is arguably completely useless, because it doesn’t depend on the input and there are better ways to zero a 16-bit register anyway. It is possible that the behavior is “undefined” because Intel wanted to keep the possibility of redefining it in the future, or because it’s not validated and no one can say with 100% certainty that all x86 CPUs really behave the same.
Whatever the reason, the “undefined” BSWAP behavior keeps confusing developers and wasting their time. Emulator developers end up playing silly cat and mouse games with anti-emulation software (see VMProtect note here) because correctly emulating undefined behavior is non-obvious, yet “undefined” behavior on real CPUs has a curious tendency to be anything but.
The FreeBSD runtime library relying on “undefined” FPU condition code behavior brings up interesting philosophical questions. Is the code wrong? Can it be said to be wrong if it works correctly on all supported hardware? How likely are CPU designers to change the “undefined” behavior in the future, knowing that existing software relies on it? (Answer: Extremely unlikely.)
In the end, documenting processor behavior as “undefined” is just a poor excuse. Everyone would be much better served if the documentation told the real story.
If the behavior is different across CPU generations, just say so. Even better, give developers some sense about what those generations may be—if the behavior changed between Pentium and P6 but stayed the same since then, it won’t be relevant for 64-bit software, for example.
As an example, Intel documents that the behavior with regard to executing instructions that cross the 4GB boundary on 32-bit processors differs between the P6 and P4 processor families. The documentation could simply have said “undefined”, but in this case it does not.
If you really don’t want developers to use certain opcodes, then again, just say so explicitly; better yet, make them raise a #UD exception.
As in the initial example, undefined behavior of CPU/FPU flags is one of the worst offenders. As Sandpile shows, there really are differences, but the behavior is very far from “undefined”. Flag bits are almost always either set to a fixed value, changed based on the results of an operation, or left unchanged.
In the old days before CPUID, the detection of Cyrix processors relied on the state of flag bits after dividing 5 by 2—although the behavior was “undefined”, all Intel and AMD processors of the same class (486) in fact behaved 100% predictably, and could therefore be reliably distinguished from Cyrix CPUs.
It does not help that even Intel’s documentation keeps changing. For example, the FCOMI, FCOMIP, FUCOMI, and FUCOMIP instructions, added in the Pentium Pro, are documented in the 1999 Intel SDM as leaving the FPU condition codes C0, C2, and C3 “undefined”. But a newer Intel SDM says that those flags are in fact “not affected”. It is almost certain that those instructions never modified the C0, C2, and C3 flags, but Intel simply didn’t bother documenting that fact. At some point in the early 2000s, Intel changed the documentation to reflect reality. Of course the old documentation was not wrong per se, because leaving flags unmodified is a perfectly valid instance of “undefined” behavior!
It is perhaps not a coincidence that FCOMI, FCOMIP, FUCOMI, and FUCOMIP are just about the only FPU instructions documented to leave the condition codes unmodified. It is possible that Intel simply does not know how all of their old FPUs behave, or cannot easily prove that they behave the same. The FCOMI family of instructions is relatively new, and Intel may have been able to ascertain that those instructions indeed never change the condition codes; at the time the documentation was updated, only the P6 and P4 architectures would have implemented them.
The bottom line is that it is very wrong to understand “undefined” in this context as “random”, and even taking it to mean “unpredictable” is at best misleading. More often than not, CPU behavior is documented as “undefined” not because it is random, or unpredictable, or in any way unknowable, but rather because it either changed in the past or because the vendor does not consider it useful and won’t commit to keeping the existing behavior unchanged. It may be “undefined”, but it is very far from unreliable.