The other day I discovered that 32-bit FreeBSD 11.2 has strange trouble running in an emulated environment. Utilities like top would just hang when trying to print floating-point numbers: the dtoa() library routine was getting stuck in an endless loop (FreeBSD has excellent support for debugging the binaries shipped with the OS, so finding out where things were going wrong was unexpectedly easy).
Closer inspection identified the following instruction sequence:
fldz
fxch st(1)
fucom st(1)
fstp st(1)
fnstsw ax
sahf
jne ...
jnp ...
This code relies on “undefined” behavior. The FUCOM instruction compares two floating-point values and sets the FPU condition code bits. The FNSTSW instruction stores the bits into the AX register, where they can be tested directly; alternatively, the SAHF instruction copies them from AH into the flags register, where they can be conveniently tested by conditional jump instructions.
The problem is the FSTP instruction in between. According to Intel and AMD documentation, the FSTP instruction leaves the FPU condition codes in undefined state. So the FreeBSD library is testing undefined bits… but it just happens to work on all commonly available CPUs, in a very predictable and completely deterministic manner, because the FSTP instruction in reality leaves the condition bits alone. What is going on?
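To make the failure mode concrete, here is a minimal Python sketch of the sequence above. It is only a model, not emulator or FreeBSD code: the function names are invented for illustration, and the "zeroing" FSTP is an assumption about what an emulator might do with "undefined" bits. The bit positions of C0/C2/C3 in the status word and of CF/PF/ZF in the flags register are the architectural ones.

```python
# FPU status word bit positions of the condition codes.
C0, C1, C2, C3 = 1 << 8, 1 << 9, 1 << 10, 1 << 14
CC_MASK = C0 | C1 | C2 | C3

# Flag bits that SAHF loads from AH and that jne/jnp look at.
CF, PF, ZF = 1 << 0, 1 << 2, 1 << 6


def fucom(st0, st1):
    """Return the C3/C2/C0 bits FUCOM sets when comparing ST(0) with ST(1)."""
    if st0 != st0 or st1 != st1:  # a NaN operand means "unordered"
        return C3 | C2 | C0
    if st0 > st1:
        return 0
    if st0 < st1:
        return C0
    return C3  # equal


def fstp_real(status):
    """FSTP as observed on real CPUs: condition codes are left alone."""
    return status


def fstp_zeroing(status):
    """Hypothetical emulator: treats 'undefined' as 'safe to clear'."""
    return status & ~CC_MASK


def sahf(status):
    """FNSTSW AX + SAHF: status bits 8..15 (AH) land in the low flags byte."""
    ah = (status >> 8) & 0xFF
    return ah & (CF | PF | ZF)


def looks_equal_to_zero(value, fstp):
    """Compare value with 0.0, pop via the given FSTP model, then test flags."""
    status = fstp(fucom(value, 0.0))
    flags = sahf(status)
    # "Equal and ordered" means ZF set (jne not taken) and PF clear (jnp taken).
    return bool(flags & ZF) and not (flags & PF)


print(looks_equal_to_zero(0.0, fstp_real))
print(looks_equal_to_zero(0.0, fstp_zeroing))
```

With the zeroing model, ZF can never be set after the compare, so a test for "equal to zero" never succeeds. A loop waiting for that condition would then spin forever, which is consistent with the hang described above.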
To be honest, I failed to find why the condition codes are supposedly left in undefined state by most FPU instructions. What I did discover is that in the Intel 8087 and 287 documentation, there is no hint that FSTP might change the condition bits in any way. Although it is not entirely explicit, the Intel 287 documentation leaves a strong impression that only a select few instructions set the FPU condition bits and most instructions do not modify them at all. Which would actually be a very logical behavior.
For some unclear reason, the Intel 387 documentation (1987) very clearly says that most FPU instructions leave the condition codes “undefined”. This appears to have been copied by Cyrix, AMD, and just about any 3rd-party x87 FPU documentation that goes into sufficient detail.
At the same time, the book Programming the 80386 (1987) by Crawford and Gelsinger, both actual 386 designers at Intel, gives no hint that most FPU instructions might modify the condition code bits at all.
It would be misleading to read “undefined” in this context as “random” or “unpredictable”. When it comes to CPU documentation, “undefined” can mean several different things, including the following:
- We couldn’t be bothered to document the behavior because it’s too complicated
- The behavior actually changed between product families in the past
- The behavior has been 100% consistent, but we might want to change it in the future
- The behavior is so strange that we really, really don’t want anyone using it
In any case, the implication for programmers is “please do not rely on this behavior”. Yet sometimes programmers end up relying on it anyway, and it need not be done knowingly at all (including, I strongly suspect, the FreeBSD case described above).
The SHLD/SHRD instructions are a good example of behavior that changed in the past. It is possible to use these instructions with a 16-bit destination register and using a shift count greater than 16. This is arguably a 386 design flaw (the shift count could have been limited), but in any case, the “undefined” behavior did change. According to Sandpile, SHLD/SHRD behaves one way on the Pentium (and likely 386/486) and a different way on the P6 and P4 families, with additional different flag behavior between P6 and P4.
The behavior will be 100% deterministic and predictable on any given Intel CPU. Because the behavior changed across CPU generations, programs that are intended to run on a wide range of CPUs cannot rely on it. Documentation calls this “undefined”, but that is really misleading.
Another example of “undefined” behavior is the BSWAP instruction with a 16-bit operand. On all known processors, it behaves completely consistently: It reads the 16 bits of the operand, zero extends them to 32 bits, byte swaps the resulting DWORD which has the high half zeroed, and writes the result (zeros) into the 16-bit operand. This behavior is arguably completely useless, because it doesn’t depend on the input and there are better ways to zero a 16-bit register anyway. It is possible that the behavior is “undefined” because Intel wanted to keep the possibility of redefining it in the future, or because it’s not validated and no one can say with 100% certainty that all x86 CPUs really behave the same.
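The 16-bit BSWAP behavior described above is easy to model. The following Python sketch simply follows the description in this article (zero extend, swap as a DWORD, write back the low half); the function names are made up for illustration and it makes no claim about any specific CPU.

```python
def bswap32(value):
    """Byte swap a 32-bit value."""
    return int.from_bytes((value & 0xFFFFFFFF).to_bytes(4, "little"), "big")


def bswap16_observed(reg16):
    """Model of BSWAP with a 16-bit operand, as described in the text."""
    widened = reg16 & 0xFFFF      # zero extend the 16-bit operand to 32 bits
    swapped = bswap32(widened)    # the original bytes move to the high half
    return swapped & 0xFFFF       # only the low 16 bits are written back


print(hex(bswap32(0x12345678)))
print(hex(bswap16_observed(0x1234)))
```

Because the zero-extended upper half ends up in the lower half after the swap, the model returns 0 for every input, which is why the result does not depend on the operand.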
Whatever the reason, the “undefined” BSWAP behavior keeps confusing developers and wasting their time. Emulator developers end up playing silly cat and mouse games with anti-emulation software (see VMProtect note here) because correctly emulating undefined behavior is non-obvious, yet “undefined” behavior on real CPUs has a curious tendency to be anything but.
The FreeBSD runtime library relying on “undefined” FPU condition code behavior brings up interesting philosophical questions. Is the code wrong? Can it be said to be wrong if it works correctly on all supported hardware? How likely are CPU designers to change the “undefined” behavior in the future, knowing that existing software relies on it? (Answer: Extremely unlikely.)
In the end, documenting processor behavior as “undefined” is just a poor excuse. Everyone would be much better served if the documentation told the real story.
If the behavior is different across CPU generations, just say so. Even better, give developers some sense about what those generations may be—if the behavior changed between Pentium and P6 but stayed the same since then, it won’t be relevant for 64-bit software, for example.
As an example, Intel documents that the behavior with regard to executing instructions that cross the 4GB boundary on 32-bit processors differs between P6 and P4 processor families. The documentation could have said “undefined”, but it doesn’t always.
If you really don’t want developers to use certain opcodes—again, just say so explicitly, and much better, make them throw a #UD exception.
As in the initial example, undefined behavior of CPU/FPU flags is one of the worst offenders. As Sandpile shows, there really are differences, but the behavior is very far from “undefined”. Flag bits are almost always either set to a fixed value, changed based on the results of an operation, or left unchanged.
In the old days before CPUID, the detection of Cyrix processors relied on the state of flag bits after dividing 5 by 2—although the behavior was “undefined”, all Intel and AMD processors of the same class (486) in fact behaved 100% predictably, and could therefore be reliably distinguished from Cyrix CPUs.
It does not help that even Intel’s documentation keeps changing. For example, the FCOMI, FCOMIP, FUCOMI, and FUCOMIP instructions, added in the Pentium Pro, are documented in the 1999 Intel SDM as leaving the FPU condition codes C0, C2, and C3 “undefined”. But a newer Intel SDM says that those flags are in fact “not affected”. It is almost certain that those instructions never modified the C0, C2, and C3 flags, but Intel simply didn’t bother documenting that fact. At some point in the early 2000s, Intel changed the documentation to reflect reality. Of course the old documentation was not wrong per se, because leaving flags unmodified is a perfectly valid instance of “undefined” behavior!
It is perhaps not a coincidence that FCOMI, FCOMIP, FUCOMI, and FUCOMIP are just about the only FPU instructions documented to leave the condition codes unmodified. It is possible that Intel simply does not know how all of their old FPUs behave, or cannot easily prove that they behave the same. The FCOMI family of instructions is relatively new, and Intel may have been able to ascertain that those instructions indeed never change the condition codes; at the time the documentation was updated, only the P6 and P4 architectures would have implemented those instructions.
The bottom line is that it is very wrong to understand “undefined” in this context as “random”, and even taking it to mean “unpredictable” is at best misleading. More often than not, CPU behavior is documented as “undefined” not because it is random, or unpredictable, or in any way unknowable, but rather because it either changed in the past or because the vendor does not consider it useful and won’t commit to keeping the existing behavior unchanged. It may be “undefined”, but it is very far from unreliable.
Another thing that trips up emulator writers: undefined values that are supposed to be random/unreliable. It was common for old software to read memory areas on a floating data bus as a pseudo-random number generator. Many times the emulator doesn’t simulate the behavior and the memory area stays ’00’ or ‘FF’. This was a problem with Apple II emulators.
That is truly undefined behavior. I’m told that at least some old PCs also exhibited this, but newer machines (1990s and later?) don’t. An unconnected data bus acted as a capacitor: if you wrote something and quickly read it again, you likely got back the same value. But the data pretty quickly “dissipated” and after a short time you’d read back something else.
According to this https://groups.google.com/g/comp.sys.apple2/c/3gH0dUpLI3Q/m/JJYnhRYBrY4J the “floating bus” on the Apple II actually ended up reading some semi-unpredictable screen data. I’m not sure if that was really a floating bus.
This https://www.cpcwiki.eu/forum/amstrad-cpc-hardware/cpc-bus-tests/ discussion about Amstrad CPC and Plus describes a proper floating bus, where on some machines reading an unassigned I/O port returned the last opcode byte of the port input instruction. But the behavior could depend on the exact board revision and such.
Is it FreeBSD hand-written asm, or really clang compiled C code?
The Apple II really does have a floating bus. Because of the weird way Woz implemented things, the 6502 accesses the memory on one half of the clock cycle and the video circuitry accesses it on the other (and it was set up so that it read out in a pattern that refreshed the DRAM for “free” while updating the display).
Because of that, if you read from an address w/o RAM connected you got the floating value, which was the last byte read out by the video update circuitry.
The original Apple II had no way to let you know when VBLANK happened, but by monitoring the floating bus (doing this was sometimes called “vapor lock”) you could find out where in the video update you were, cycle count, and then do all kinds of racing-the-beam effects.
I know this works because I’ve written a few demoscene demos that use this.
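For emulator writers, the mechanism described in this comment could be sketched roughly as follows. This is a toy Python model under stated assumptions: the class, its memory sizes, and the method names are hypothetical and do not describe real Apple II address decoding or timing. It only shows the idea that a CPU read from an address with nothing connected returns the byte most recently fetched by the video circuitry, rather than a constant 00 or FF.

```python
class FloatingBusModel:
    """Toy memory map: RAM below ram_size, nothing connected above it."""

    def __init__(self, ram_size, video_base):
        self.ram = bytearray(ram_size)
        self.video_base = video_base
        self.last_video_byte = 0

    def video_fetch(self, offset):
        """Video half of the cycle: fetch a screen byte and leave it on the bus."""
        self.last_video_byte = self.ram[self.video_base + offset]
        return self.last_video_byte

    def cpu_read(self, address):
        """CPU half of the cycle: unconnected addresses see the stale bus value."""
        if address < len(self.ram):
            return self.ram[address]
        return self.last_video_byte


bus = FloatingBusModel(ram_size=0x800, video_base=0x400)
bus.ram[0x400 + 3] = 0xA5      # some screen data
bus.video_fetch(3)             # video circuitry reads it out
print(hex(bus.cpu_read(0xC000)))
```

An emulator that returns a fixed value from `cpu_read` for unconnected addresses would break software that watches the bus to work out where the video update currently is.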
Cool, thanks for the added detail!
I believe it’s compiled code; gdb tells me that __dtoa is in /usr/obj/usr/src/lib/libc/gdtoa_dtoa.c but I have no such source file, maybe it’s generated?
I see similar fucom/fstp/fnstsw sequences here https://reviews.llvm.org/D44091 which makes me think the sequence does come from the compiler, but perhaps it’s hand-written in a way.
It is contrib/gdtoa/gdtoa.c, which is symlinked to lib/libc/gdtoa_dtoa.c by the build process.
Indeed it is hand-written in a sense, but the code comes from the compiler.
OK, that source file is what I thought it probably was.
At any rate, given that this was code emitted by clang/LLVM, I’m sure people would have noticed if it wasn’t working on any reasonably current CPU.
Did you really mean to say ‘the “undefined” behavior did change change’ with a double ‘change’?
No, but my proofreader is on vacation. Thanks for stepping in! Fixed.
I think undefined is largely correct because Intel may not have known what the results would be. The 8087 included a number of unusual techniques to reduce the size of the chip that may have had surprising results. Any given revision might seem to produce a consistent result but it would be impossible to be sure that the actual results might not be random. Later chips had the transistor budget to squander some on reducing the probability of random data alteration though there would still be considerable bad press if 1% of chips occasionally returned the wrong value even if the value was not used. One can’t be wrong if the documentation does not establish an unnecessary value as correct.
The recently released development memos for the TI-88 provide an exemplary warning of how a chip that seems to work may have serious non-deterministic problems that are very difficult to discover or correct.
That sounds like a possible explanation, except that the note about condition code bits only appeared in the 387 documentation in 1987. It’s nowhere to be found in 8087/287 documents.
However, I am entirely prepared to believe that in some cases Intel documents something as “undefined” because Intel is not 100% certain of actual behavior across product generations. If it’s not something they actively validated, it could be difficult to establish the behavior retroactively. Not difficult as “impossible” but difficult as “not worth the effort”.
Condition Code information appears in the 287 manual as Table 1-4. The general purpose table shows some codes can be undefined. The 387 did redesign some parts of the Condition Codes especially with the replacement trigonometric functions and some of the later documentation tends to pretend that the 8087 and 80287 worked like the 387. The post-387 x87 literature did improve things by listing every function which changes condition codes instead of requiring the programmer to know which line of the generic table applies to a given function.
The Richard Startz book mentions the 8087 condition codes and the values therein but in a chatty form that makes it challenging to find information. Having an incomplete FPREM setting C2 is mentioned on the page before the table of quotients is shown but no mention is given that C0, C1, and C3 would be filled with gibberish when C2 = 1. Intel’s own 8087 documentation mentions that Condition Codes can be set but gives no values for them, at least in the early version I can find. I guess Intel released an update of some kind to programmers that requested it once Intel had a working 8087.
All of which continues to make me happy that I was working on a PDP-11 during the early days of the IBM PC. Let someone else spend the time clearing out the surprises.
So the issue is that the emulated FSTP instruction is changing FPU condition codes where the real hardware doesn’t? Presumably the emulator author did it on purpose, because I’d have thought it would be less code to leave them alone. I wonder why.
More by accident. I believe the thinking was approximately “if the condition code bits are undefined, it should be safe to zero them”.
Ok so given the behavior is consistent across x86 CPUs but different in emulators, has anyone submitted bug reports or patches to the emulators?
Yes 🙂 Of course as the article discusses, “accurate emulation” is problematic if vendors refuse to document the behavior and exploring actual hardware can never give 100% authoritative answers.
I have one clear explanation of this topic.
Intel documentation about combining different memory cache modes:
Overlapping variable MTRR ranges are not supported generically. However, two variable
ranges are allowed to overlap, if the following conditions are present:
• If both of them are UC (uncached).
• If one range is of type UC and the other is of type WB (write back).
In both cases above, the effective type for the overlapping region is UC. The processor’s
behavior is undefined for all other cases of overlapping variable ranges.
Ok, but what do we REALLY get if we try to combine WC (write combining) and WB (write back), for example? WC ("less cache level") or WB? Neither of them! UC (uncached)…
So this is also “undefined” but in reality quite consistent?
>in reality quite consistent?
I haven’t checked it on AMD yet (I will for sure, but I bet the result will be the same), but on Intel it works the same way (uncached) on Core2, Core Gen 3, Core Gen 8 and Core Gen 10.
So basically the behavior is consistent on all relevant Intel CPUs. It’s very likely that if Core 2 did it that way, Core and Pentium M did too, and most likely P6 as well. The P4 is always a wild card, Intel went off the reservation there 😀
AMD K7, WB+WC->UC
>the behavior is consistent on all relevant Intel CPUs
Probably on all IA32 CPUs, I can say now.
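The difference between the documented and the reported MTRR overlap behavior in this thread can be summarized in a small lookup. This Python sketch encodes only the two rules from the quoted text plus the single WC+WB observation reported by commenters here; the names are illustrative, and the "observed" table is an assumption based on those reports, not on vendor documentation.

```python
UC, WC, WB = "UC", "WC", "WB"


def documented_overlap(a, b):
    """Effective type per the quoted text; None stands for 'undefined'."""
    pair = {a, b}
    if pair == {UC} or pair == {UC, WB}:
        return UC
    return None


# Reported in the comments above for several Intel CPUs and AMD K7.
OBSERVED = {frozenset((WC, WB)): UC}


def observed_overlap(a, b):
    """Documented result if there is one, else what commenters measured."""
    documented = documented_overlap(a, b)
    if documented is not None:
        return documented
    return OBSERVED.get(frozenset((a, b)))


print(documented_overlap(WC, WB))
print(observed_overlap(WC, WB))
```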