Failing to Fail

The other day I was going over various versions of the venerable DOS/16M DOS extender from Rational Systems (later Tenberry Software). The DOS/16M development kit comes with a utility called PMINFO.EXE which is meant to give the user some idea about the performance of a system running in protected mode.

I know that the utility has trouble on faster CPUs, and I expected it to fail roughly like this:

DOS/16M PMINFO.EXE failing with a FP division by 0

But running the utility on an older laptop with an Intel Haswell processor, I instead got this:

The same PMINFO.EXE crashing with a #GP fault

Rather than cleanly exiting after catching a floating-point division by zero, the program crashed with a general protection fault. That looks like a bug, but why would it be happening? And where is the bug?

To get a better sense of the problem, I used the Instant-D debugger shipped with DOS/16M. There I could see the faulting code:

Faulting instruction in the Instant-D debugger

It didn’t take me too long to determine that the code is part of the floating-point exception handling logic, and that it comes from the EMOEM.OBJ file shipped with DOS/16M.

Now, the EMOEM module is provided in source form with many Microsoft compilers, including Microsoft C 5.1 and 6.0 (one of those was likely used to build PMINFO.EXE). But the crashing code fragment is not in the code provided by Microsoft. So why is it there and what is it supposed to do?

It took me a little while to understand what the code is doing, but once I did, it was obvious why it’s there. The problem the code is trying to solve is caused by the fact that the x87 environment differs between real and protected mode. The original real-mode-only format used by the 8087 stores a linear 20-bit address of the floating-point instruction (because the 8087 does not know what the original segmented 16:16 address was!) plus 11 bits of the FPU instruction opcode (the remaining five bits are always the ESC opcode and need not be stored). The 20-bit linear address and the 11-bit opcode are stored in two consecutive 16-bit words, with one bit left unused.

In protected mode on the 80287, a 20-bit linear address isn’t enough. Intel changed the x87 environment format to store the full 16:16 segmented address, and the FPU opcode is no longer stored.
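To make the difference concrete, here is a minimal Python sketch that decodes the instruction pointer fields of a 14-byte, 16-bit x87 environment image under both interpretations. It assumes the 16-bit layouts documented by Intel (control, status and tag words first, then the instruction and operand pointer words). The function name `decode_env16` and the dictionary keys are my own, purely for illustration.

```python
import struct


def decode_env16(env: bytes, protected_mode: bool) -> dict:
    """Decode the instruction pointer part of a 14-byte, 16-bit x87 environment."""
    # Seven little-endian words: control, status, tag, then four pointer words.
    cw, sw, tw, ip_lo, w4, dp_lo, w6 = struct.unpack("<7H", env[:14])
    if protected_mode:
        # Protected mode: word 4 holds the CS selector; no opcode is stored.
        return {"cs": w4, "ip": ip_lo, "opcode": None}
    # Real mode: word 4 packs linear address bits 19..16 into bits 15..12
    # and the 11 opcode bits into bits 10..0; bit 11 is the unused one.
    linear = ((w4 >> 12) << 16) | ip_lo
    return {"linear_ip": linear, "opcode": w4 & 0x07FF}
```

The same 14 bytes therefore mean entirely different things depending on the mode in which they were stored, which is exactly what trips up a real-mode library running under a DOS extender.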

DOS/16M was designed to work with compilers producing real-mode DOS code. Hence libraries shipped with those compilers expect the original 8087 environment format when handling floating-point exceptions. But because DOS/16M applications in fact run in protected mode, the FPU will be storing the x87 environment in the newer, protected-mode format.

The extra code in the DOS/16M EMOEM.OBJ is clearly meant to read the opcode from the stored CS:IP address, possibly skip one byte of a prefix, and then modify the stored environment, writing the 11 opcode bits right where real-mode exception handling code expects to find them. (Note that the code makes no attempt to produce a 20-bit linear address, since that wouldn’t work anyway.)
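The opcode reconstruction can be sketched roughly as follows. This is not the DOS/16M code, only a Python illustration of the same idea, assuming that at most one prefix byte precedes the ESC opcode; `fpu_opcode_at` is a hypothetical name.

```python
def fpu_opcode_at(code: bytes) -> int:
    """Rebuild the 11-bit x87 opcode from the instruction bytes at the saved CS:IP."""
    pos = 0
    # ESC opcodes are D8h..DFh (top five bits 11011). Anything else at the
    # saved address is assumed to be a single prefix byte and is skipped.
    if code[pos] & 0xF8 != 0xD8:
        pos += 1
    if code[pos] & 0xF8 != 0xD8:
        raise ValueError("no ESC opcode found at the saved address")
    # Low three bits of the ESC byte, followed by the entire second byte.
    return ((code[pos] & 0x07) << 8) | code[pos + 1]
```

For example, the bytes D8 F1 yield the opcode 0F1h, and the same result is obtained from 26 D8 F1, where 26h is a segment override prefix. The crucial input is the CS:IP address itself: if the stored CS is zero, the read goes through a null selector and faults.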

So why does this code not work on my Haswell laptop? Because the CPU is not quite backwards compatible.

99% Backward Compatibility

The original 8087 always kept the FPU environment up to date, including the FPU opcode as well as instruction and data addresses. That reflected the internal working of the 8087.

The 287 already changed things from a software perspective, which was a result of the different interface between the CPU and FPU. On the 8087, the stored instruction address points to the ESC opcode. On the 287 and later, it points to any prefixes that might precede the ESC opcode. This change was clearly an improvement, and although it had the potential to upset existing floating-point exception handlers, in practice it probably didn’t, because most FPU instructions that are likely to fault (division, multiplication, transcendental instructions) aren’t used with prefixes anyway.

The FXSAVE instruction added in the later Pentium II models subtly changed how the processor saves the last FP instruction opcode and code/data addresses. Rather than saving these data items every time, they’re only saved when there is a pending floating-point exception. This reflects the actual usage, since only FP exception handlers are likely to need this information.

In the P4 microarchitecture, Intel added a (presumably) performance optimization called “fopcode compatibility mode”. Bit 2 in the multi-purpose IA32_MISC_ENABLE MSR determines whether the CPU tracks the FP opcode (aka fopcode) for every instruction as before, or whether it’s updated only upon encountering an exception. Newer Intel CPUs no longer support constant updating of the FP opcode at all and only update it when exceptions occur.

None of that is a problem for PMINFO.EXE. But the next step that Intel took to reduce x87 backward compatibility actually is.

In the Haswell and later CPUs, Intel introduced a new CPUID bit. When (in Intel parlance) CPUID.(EAX=07H,ECX=0H):EBX[bit 13] is set, the processor still tracks the last FP instruction code and data addresses, but no longer saves segment register values; that is, the code and data segment values are always stored as zeroes.
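Testing for the bit is trivial once the EBX value of CPUID leaf 7, subleaf 0 is in hand. Python cannot execute CPUID directly, so this sketch assumes the register value was obtained by some other means; the names `FPU_CS_DS_DEPRECATED` and `fpu_segments_dropped` are made up for the example.

```python
# CPUID.(EAX=07H,ECX=0H):EBX[bit 13]
FPU_CS_DS_DEPRECATED = 1 << 13


def fpu_segments_dropped(leaf7_ebx: int) -> bool:
    """True if the CPU reports that FPU CS and DS are always saved as zero."""
    return bool(leaf7_ebx & FPU_CS_DS_DEPRECATED)
```

A tool that depends on segmented math exception handling could use such a check to warn the user up front instead of crashing later.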

This problem most visibly impacts segmented protected-mode exception handlers, such as the one in DOS/16M.

While earlier changes, such as not always tracking the last FP opcode, are easily visible by software, they do not cause trouble in practice. But not saving the segment registers does in fact upset legacy off-the-shelf software. Not often, but it does. PMINFO.EXE is one of the victims, but far from the only one.

Possible Workarounds?

Working around the deficient CPUs is quite difficult. A naive approach would be to intercept the #MF (math fault) exception and record the current CS and DS, but that would be only sometimes correct.

The reason why the FPU separately tracks the instruction and data pointers is that, historically, the FPU was a completely separate chip running in parallel with the CPU. Math exceptions were reported asynchronously through the interrupt controller. The CPU could be doing more or less anything when the math interrupt arrived; the FPU itself had to provide the instruction pointer so that the math error handler could find out what actually faulted.

Even on modern CPUs where everything is one piece of silicon and floating-point errors are reported via #MF exceptions, the problem remains. The #MF exception is reported at some point after the instruction which caused it, namely on the next floating-point instruction or a WAIT instruction. But such an instruction could be executed in a different segment, or in a multi-tasking OS, in a different task.

That is in fact the case with the DOS/16M PMINFO.EXE. The #MF exception is triggered on a WAIT instruction in a floating-point emulator segment, which is different from the segment where the instruction causing the FP exception is.

The upshot is that by the time the #MF happens, it is too late to record the code and data segment values. The only possibility might be to force math instruction emulation with the CR0.EM bit, and track the current code and data pointers, but that would be quite intrusive and slow. At that point it may be simpler to just run the legacy code through software emulation.

Fortunately the impact of this problem is fairly limited. It is rare for software to handle math exceptions during normal operation; more often than not, math exceptions cause a fatal error, and in such cases the practical difference between terminating a program due to a math fault versus a general protection fault isn’t significant. While failing to fail properly is annoying, the program still fails either way.

There is a possible workaround that users may apply in some cases. Once upon a time, Microsoft provided a package called WINFLOAT.EXE described in KB article Q97265. Said package includes a utility called HIDE87.COM which hides a math co-processor from Windows 3.x applications, and possibly from some DOS applications. This forces software emulation built into Windows to be used, avoiding the deficiency of newer Intel CPUs.

Note that the WINFLOAT package can be used to get some sense of whether math exception handling works at all in a given setup. Here it is not working (as expected) on a Haswell CPU:

No math exception handling for you!

For comparison, here it is running on a non-crippled CPU:

Correctly working math exception handling

To date, AMD processors provide better backward compatibility and do not suffer from this particular problem.

Addendum: Same Symptom, Different Cause

Around 2013, users of several virtualization products (VMware, VirtualBox, KVM, XP mode in Windows 7) complained of crashes in WIN87EM.DLL and similar. The symptom was identical: a math fault handler crashing because the code segment of a faulting FPU instruction was zero.

But the cause was quite different. It specifically affected 64-bit hypervisors running 32-bit or 16-bit guest software. In the course of normal operation, a hypervisor often needs to save and restore the FPU state, using FXSAVE/FXRSTOR or similar instructions.

These instructions can all save the FPU state in different formats. The two relevant ones are the 64-bit format, with 64-bit offsets and no segments, and the 32-bit format, with 16-bit segments and 32-bit offsets.

A hypervisor can save the state twice, once in 32-bit and once in 64-bit format. That way it is possible to recover both the segments and 64-bit offsets. But when restoring state, the hypervisor is faced with a binary choice: Either restore the 64-bit format, zeroing the segment registers, or restore the 32-bit format, keeping the segment values but zeroing any high bits of 64-bit offsets.

It should now be apparent that if a 64-bit hypervisor only uses the 64-bit form of FPU save/restore instructions, the segment register contents stored in the FPU state will be lost after saving and restoring the FPU state. Depending on the hypervisor and guest combination, this loss can be rare and unpredictable, or it can happen with 100% reproducibility.

Hypervisors were fixed to selectively save and restore either 32-bit or 64-bit state. One possible approach is as follows: Save the 64-bit FPU state. If the high DWORD of either the code or data pointer is non-zero, keep this state and restore 64-bit state again. Otherwise save the FPU state again in 32-bit format, and restore it as 32-bit. This approach works well in practice and adapts to the software running in the guest.
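The decision step of that approach might look like the sketch below. It only inspects an already saved 64-bit FXSAVE image, assuming the documented layout with the 64-bit instruction pointer at byte offset 8 and the 64-bit data pointer at byte offset 16; the function names are hypothetical and no real hypervisor's code is implied.

```python
import struct


def needs_64bit_restore(fxsave64: bytes) -> bool:
    """True if the 64-bit FXSAVE image holds pointers that do not fit in 32 bits."""
    # FIP occupies bytes 8..15 and FDP bytes 16..23 in the 64-bit format.
    fip, fdp = struct.unpack_from("<QQ", fxsave64, 8)
    return (fip >> 32) != 0 or (fdp >> 32) != 0


def choose_format(fxsave64: bytes) -> str:
    """Pick the format in which to keep and later restore the guest FPU state."""
    # With non-zero high DWORDs the guest must be running 64-bit code, so the
    # segments are irrelevant. Otherwise save again in 32-bit format to
    # capture the FCS/FDS selectors as well.
    return "64-bit" if needs_64bit_restore(fxsave64) else "32-bit"
```

The check costs a second save only in the 32-bit case, where the offsets lose nothing by being stored in 32 bits.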

As usual, the devil is in the details.

Update: Real Mode Is Broken Too

Readers pointed out that in real mode, recent Intel CPUs also save the state incorrectly, and do not save the full 20-bit (or 32-bit) linear address. This fact is not clearly documented by Intel, but the behavior has been confirmed on at least Haswell and Skylake CPUs.

Experimentation shows that the behavior in real mode is somewhat logical. The processor simply does not keep track of the segment register, ever. When in real mode, FSAVE simply saves the 16-bit IP value as the code pointer. Note that this is usually not the same value as the low 16 bits of the linear address would be.
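A small worked example shows the discrepancy. The CS:IP values are arbitrary, chosen only for illustration, and `linear_address` is a helper written for this sketch.

```python
def linear_address(segment: int, offset: int) -> int:
    """Real-mode linear address: segment * 16 + offset, truncated to 20 bits."""
    return ((segment << 4) + offset) & 0xFFFFF


cs, ip = 0x1234, 0x5678           # arbitrary example values
linear = linear_address(cs, ip)   # 0x12340 + 0x5678 = 0x179B8
expected_low16 = linear & 0xFFFF  # 0x79B8, what the original format would hold
stored_low16 = ip                 # 0x5678, what these CPUs store instead
```

The two values agree only when the low 12 bits of the segment are zero, that is, when CS is a multiple of 1000h.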

In real mode, the consequences of not properly storing the FP code and data pointers aren’t as obvious. An exception handler will end up reading some more or less random memory location; it won’t crash, but it may not handle the exception correctly. This failure mode is, in a way, even worse, because it isn’t apparent that things are failing.


11 Responses to Failing to Fail

  1. Rudolf says:

    There are likely many x86 FPU related devils. I suspect that the optimization to throw away segment selectors in the x87 state is done only for FXSAVE. Can you try with fstenv?
    Thanks,
    Ruik

  2. Michal Necasek says:

    If that were the case, old software (which has no idea about FXSAVE) would have no trouble. But it does, because the CPU internally does not track the segments. Which is also what Intel’s documentation claims.

  3. crazyc says:

    > This does not affect real-mode code (since only linear addresses are stored)

    I just tried this on a skylake cpu in virtualized real mode and it didn’t store the full linear address just the ip and data offset. Maybe it’s different in real real mode?

  4. Michal Necasek says:

    It is. There are separate 16-bit and 32-bit layouts of the FPU state for real and protected mode.

    Again, the original 8087 format (16-bit real mode) stored the linear address because that was the only information the FPU had.

  5. crazyc says:

    Sure, but I meant it’s running in vmx mode with the virtual machine in real mode so the fsave fpu state would be the 16bit real mode one.

  6. Michal Necasek says:

    OK, then yes. And you’re presumably really talking about FSAVE, not FXSAVE. The Intel documentation does not indicate that the FPU instruction/data offsets would be chopped to 16 bits in real mode on Haswell and later, but maybe they are?

    OK, looking at what my Haswell CPU stores, yes, in real mode it’s messed up too. Only the address corresponding to the 16-bit offset is stored, and the segment is lost. This is not clearly documented.

    In real mode it just doesn’t cause crashes. It almost certainly does cause subtle failures.

  7. Joe says:

    Hmmm … from the people that brought you the Pentium FDIV bug …

  8. Michal Necasek says:

    To be clear, the FDIV bug was not exactly well documented 🙂 This behavior, although it does break backward compatibility, was well documented by Intel. Which, in a way, makes it worse because they’re definitely not going to fix it!

  9. crazyc says:

    It’s kind of a preview of x86-s when they intend to take a sledgehammer to backward compatibility.

  10. Richard Wells says:

    x86-s may be the response to problems like this. There may be obscure pieces of backward compatibility that haven’t been relevant for the last 20 years and therefore missed in testing. If one doesn’t have to test for something because it doesn’t exist, one can’t fail the test.

    The FDIV and the Sandybridge SATA bug were both caused by similar late in development changes that had no benefit to being rushed through. Yes, the next revision of the Pentium could have been smaller and new motherboards could have been built with fewer layers but taking advantage of the changes wasn’t going to happen for another year. I hope the Haswell bug was caused by not realizing the intricacies of x87 instead of altering at the last minute a design that should have been locked down.

  11. crazyc says:

    I suspect they could have fixed it in microcode had they really wanted to. You’re right that it probably would be a large reduction in qa testing and it might permit them to toss a bunch of microcode in critical places where bases and limits have to be checked for every memory access when not in long mode.
