A few weeks ago I had the questionable pleasure of diving into the math exception handler of WIN87EM.DLL, the Windows 3.1 math emulator and FPU support library. Actually WIN87EM.DLL appears to have been first shipped with Windows 3.0, and the version in Windows 3.1 is more or less identical. The version shipped with Windows 3.11 is byte for byte identical to the one in Windows 3.1.
The main function of WIN87EM.DLL is, as the name suggests, x87 floating-point emulation. It appears to be an outgrowth of the math emulation packages shipped with many Microsoft languages. But even on a system with x87 hardware, WIN87EM.DLL has some work to do.
Namely WIN87EM.DLL intercepts math errors (exceptions). Depending on the system type, WIN87EM.DLL hooks either NMI vector 2 (PC and PC/XT) or IRQ 13 (PC/AT and compatibles).
The math interrupt handler in WIN87EM.DLL is very, very strange. It bears all the hallmarks of code that was written, rewritten, rewritten again, hacked, tweaked, modified, and eventually beaten into submission even if the author(s) had no real idea why it finally worked.
Now, x87 math error handling is a very tricky subject where numerous details changed over FPU and CPU generations, and Intel had a fair share of bugs in this area. But the code in WIN87EM.DLL looks very much like the result of changes made in desperation until it worked somehow, even though the changes made little or no sense.
What WIN87EM.DLL Does
The IRQ 13 handler in WIN87EM.DLL performs the following steps:
- Disable interrupts (CLI)
- PUSH and POP the AX register 70 times, maybe it will slow things down
- Write zero to I/O port F0h to clear the PC/AT FPU error latch
- Mask a selection of interrupts
- Write an EOI to the master interrupt controller (but not slave)
- Execute the FNSTSW instruction to store the FPU status word (but only that)
- PUSH and POP the AX register 16 times, because speed kills
- Write zero to I/O port F0h again, in case it didn’t work the first time
- Execute the FNCLEX instruction to clear pending FPU exceptions
- Write zero to I/O port F0h again, because third time’s the charm
- PUSH and POP the AX register 16 times, because it was so much fun last time
- Execute the FNCLEX instruction again, just to be really sure
- PUSH and POP the AX register 16 times, because it’s the thing to do
- Write zero to I/O port F0h again, because three times might not have been enough
- Finally jump to code that doesn’t look crazy
The not so crazy looking code executes FNSTENV to store the FPU environment, swaps in the previously saved status word (which was subsequently changed by FNCLEX), examines the code at the stored FPU error pointer to see if instruction prefixes should be skipped (that is where things may crash), enables interrupts with STI, and finally jumps to common code that’s executed on both PCs and ATs and handles the actual math error.
All in all, the math error interrupt handler in WIN87EM.DLL makes very little sense. I am extremely doubtful that, for example, writing to I/O port F0h four times does anything useful, or that executing FNCLEX twice is better than doing it once. However, it is entirely possible that the extra time it takes might do something. Likewise the slowdown loops that push and pop AX are very unlikely to be necessary, but may have done something seemingly useful on some particular system.
The code really does look like a desperate attempt to make things work, and there is some possibility that it is the result of trying to support broken hardware. Other similar error handlers I looked at don’t appear nearly as convoluted and are much more “by the book”.
We’ll probably never know why the code ended up so crazy, but we can speculate.
This kind of thing triggers all sorts of flashbacks from 40 years ago….
I spent time digging through this while trying to get fpu error handling to work in winevdm with visual basic (which expects a precise stack layout because it installs a handler that never returns to win87em). The origin of the fpu emulator seems to go back to the earliest microsoft c compilers, and it supports two different register allocation methods: one where it would try to keep everything in the x87 stack and one where it would spill to memory. This means it has to keep the invalid exception always enabled so it can catch a stack over/underflow and handle it properly. This also means it has to potentially replay any fault, so it needs to get the addresses from fsave, which is impossible on newer intel cpus thanks to fcs/fds deprecation. Thankfully all win16 programs i’ve seen use the keep in fpu registers allocation method so it can be ignored entirely in winevdm.
Do tell.
It’s probably older than Microsoft C. MS Pascal 3.0 had an emulation library, and that came out in early 1983 or late 1982. Well before MS C 3.0 (the first C compiler written at Microsoft).
I wonder if some of the code goes back even to MS BASIC. Maybe not, or probably not exactly, since the oldest MS tools used a different format for binary numbers. MS Pascal 3.0 used the IEEE 754 format and also supported a real 8087. MS Pascal 2.0 AFAICT did not support emulation or a real 8087. Not too surprising since the 8087 was not available in the early PC days.
And yes, old MS languages used an “infinite” FPU stack which may trigger exceptions during normal operation. I don’t believe that was used anymore by the late 1980s or so.
MSBASIC incorporated 8087 support in 1987 with QuickBASIC 3 and QuickBASIC 4 defaulted to using the 8087 code along with an emulator. So BASIC got it after C and probably was using a variation on the C version of the 8087 emulator. BASIC 6 switched back to including something similar to the pre-8087 floating point libraries since those were faster than the 8087 emulator.
There were a number of third party libraries that did interface MS BASIC with 8087 but I never used one and information on them does not turn up in a quick search. The Startz book shows some techniques for incorporating ASM routines to interface with 8087 which look cumbersome. Makes me happy I stuck to FORTRAN for most of the code that used numeric coprocessors back in the 80s.
I haven’t found the WIN87EM code but I suspect that something has to replace the floating WAITs that require an installed 8087 to respond.
Haha, it’s nostalgic to read such a post. Reminds me of coding on a 1990s desktop.
You may be amused by my little novella Cheap Complex Devices, which contains a long short story called “Bees, or The Floating Point Error”.
In fact, on one of several possible interpretations of the authorship of Cheap Complex Devices, the whole novella may be the output from the brain of a comatose computer designer, as processed by a somewhat sentient, and definitely bug-infested floating point processor. In any event you’ll find some geeky jokes about floating point errors that only true floating point error obsessives can appreciate.
I won’t spam you with links but the book can be found for purchase (with lots of glowing reviews) on the usual sites. It’s also for free download here: https://dl.bookfunnel.com/1e7142jomu
Seems to me it is trying really, really hard to not use FWAIT. This is the kind of code I would write in case the 80287-80286 connection was unreliable (but reliable enough that ordinary 80286 code worked). Were there common problems with 80287 chips that were unreliable or just not socketed correctly?
The comments there do smell of desperation (you can find the source for this handler by googling for “emoemwin.asm”, and then see __fpIRQ13) but there are some interesting bits like
“486 bug – must wait till after last “out f0″ to clear fp exceptions or IGNNE# will be permanently active.”
Given that Microsoft was at the forefront of the early-stepping 386 and 486 weirdness, it could be more than just nonsense.
I think that although the finished code looks very much like nonsense, it most likely did solve some poorly understood problem. And yes, it may well have been a problem on pre-release hardware that was itself buggy. The 486 with a built-in FPU required different motherboard support circuitry compared to the older CPUs/FPUs. It’s fairly certain that Microsoft had pre-release 486 systems and it’s likely such systems had “interesting” bugs.
Intel made changes to the FERR# and IGNNE# documentation in the early revisions of the 486 datasheet, which suggests that the documentation was if not incorrect then poorly understood.
The source code also illustrates a major problem with such code: It doesn’t explain what problem exactly it’s trying to solve. That means the next programmer can’t easily ditch such nonsensical looking code because they can’t be sure that it isn’t solving some relatively widespread and legitimate problem. Testing won’t help too much because it can not be assumed that the hardware in question is actually available for testing.
When I see in retro stuff those things that appear to be madness, I instead admire them as some form of twisted brilliance.
If before the age of CPUID, the same piece of code needed to run in systems spanning multiple CPU + FPU generations with a bunch of different -known- errata (plus even further NDA errata, and perhaps even discovering bugs on your own before they were officially documented as errata), I would assume that the lowest common denominator that could run in every system would be a total hodgepodge. The whole POP/PUSH between I/O port writes seems like some form of NOP/wait states, which I recall were necessary in some fast systems to work around slow I/O accesses, like this: https://forum.vcfed.org/index.php?threads/io-delays-on-8088-class-computers.45713/
I imagine that it would be fun to see the developer’s notes to see how that piece of code evolved as fewer test systems failed until they settled on that.
It’s not brilliance, it’s desperation. I’ve been there myself; when the hardware doesn’t behave the way it’s supposed to, you just try different things until something (hopefully) works. But you may never know why it works, and you definitely won’t know if it’s the best workaround.
Yes, the PUSH/POP loops are meant to slow things down, more so than just an “empty” LOOP instruction. The trouble is that the speed of these loops is highly, highly variable across CPU models. So they end up wasting a lot of time on slower CPUs, and potentially may still run too fast on newer CPUs.
All those delays around port I/O, I’m pretty convinced half of it was simply a cargo cult. The other half was working around hardware which was a bit too slow and not smart enough to insert wait states. As that vcfed thread mentions, newer machines don’t really have that problem because the JMP $+2 trick stopped doing anything useful on the i486 and the chipset had to take care of things.
There are chips that require random delays, like the Yamaha OPL chips, but that’s not a problem with the bus design. More sanely designed hardware has status bits which can be polled and these bits tell the host if the chip is ready or not. Older chip datasheets often simply documented that certain situations require the host to wait some microseconds or milliseconds. That’s what’s best achieved by dummy port I/O because using pure CPU delays is quite problematic (due to the vast speed differential between slow and fast x86 processors, plus CPUs stopped running at constant speeds anyway).
My impression from for example reading the readme for DOS/windows re options for himem.sys and other places is that Microsoft probably got paid by various hardware vendors to specifically support broken/odd hardware behavior. Sure, floating point exception handling and A20 gate handling are two very different things, but I think that what the readme says about A20 probably gives a clue for other things too. In particular that larger companies like HP probably paid Microsoft for them to support their hardware, while smaller companies just had to make things 100% compatible or just ignore that their hardware might not work with updated/new software.
A super long term test would be to recompile without these “desperation measures” and distribute the recompiled version among the vintage computing communities and ask everyone who has old computers to test if the recompiled version works or not, and thus try to find what actual hardware might require these desperate measures.
Finding a system with an intermittent connection between the x86 and x87 processors seems difficult these days. The code doesn’t just cover the no coprocessor test at the beginning but makes the short waits that can be recovered by the x86 processor frequently. With the 486 and later chips, the introduced waits probably have little impact on performance and changing the design risks unexpectedly breaking something.
Someone out there has found an interesting link which may explain things: https://www.cs.earlham.edu/~dusko/cs63/prepentium.html
80486 errata:
A bug in the FPU creates three cases when the FERR# error is generated by a floating point operation, but it is not reported correctly.
If an unmasked exception occurs when the numeric exception bit in CR0 is clear and the IGNNE# pin is active, the performance of the FPU will be retarded as long as the exception remains pending.
Thus it may be less a matter of desperation and more a matter of precise knowledge. And it doesn’t matter that faster CPUs would become available: they are not supposed to have this bug and thus wouldn’t need a fix for an 80486-only bug!
The information given at the link unfortunately has one fundamental deficiency: It does not list which 486 steppings are affected by those bugs. For most of the bugs, the information is not present in the source (Hummel) either, and the text just says that “some versions of the 80486 may be affected”.
WIN87EM was also around in the Windows 2.x days.
It was available as an optional EXE file, though.
There also was an update/fix available in the Windows 3.1x days (Q86869, WW0548.EXE).
WIN87EM also virtualized the physical x87, for multitasking with multiple Windows programs.
“Virtualizing” is a strange term for FPU context switching. Why would you call it that?
Thanks for the pointer though, I see that the old WIN87EM (with EXE extension, although it was a DLL of course) was for whatever bad reason not shipped with Windows 2.x itself, but rather with the Windows 2.x SDK, and applications were expected to ship it if they needed it.
Q86869 is interesting because there’s absolutely no hint as to what it’s supposed to fix, just that it only applies to systems with a 387 and not 486 machines.
Don’t forget the external hardware needed to do FERR# and IGNNE#.