It is well known that Win9x variants prior to Windows 98 have a tendency to crash on fast CPUs. The definition of “fast” is of course fuzzy but the problems were known to occur on AMD K6-2 processors running at 350 MHz or faster as early as 1998. This led to some acrimony when Microsoft attempted to charge $35 for the fix. The crashes were intermittent on the 350 MHz parts but harder to avoid with faster clock speeds.
The problem soon started affecting other CPUs with higher frequencies, but it didn’t affect Intel processors for a while. Were Intel CPUs somehow better? Not exactly, but there was a reason for that; more about it later.
I have been long aware of this problem but never looked into the details. And when I did, at first I didn’t realize it. An acquaintance mentioned that Windows 3.11 for Workgroups no longer works in a VM.
After some investigation it turned out that the issue is related to the host CPU. An older Intel i7-2600 host exhibited the crash, but rarely. A newer Ryzen 7 3800X crashed every time. Some unexpected Intel/AMD difference? Well, yes and no…
That said, “crashed” describes more the cause than the symptoms. After running ‘win’, WfW 3.11 would show the Windows logo and pretty quickly drop back to the DOS prompt without giving any hints as to why.
After hunting down the WfW 3.11 debug kernel (not so easy) and engaging WDEB386, the cause became more apparent. A division by zero occurred in protected mode, causing Windows to shut itself down. Earlier experimentation showed that ‘win /n’ runs without problems, so networking was the prime suspect.
Interestingly, with the debug bits in place, Window did show an error message:
Now, that error message was not really any more enlightening than just silently dropping back to DOS. And it was not clear from WDEB386 where the code was crashing either.
So I resorted to looking for the zero-dividing code sequence in Windows binaries and soon enough found the culprit: NDIS.386. And that explained why the problem only showed up with the protected-mode networking stack in WfW 3.11, but didn’t actually depend on the network driver or protocols or any configuration details.
After a few minutes with IDA Pro, the real cause became apparent. The NDIS module calibrates a delay loop for the NdisStallExecution API. Note that this is 32-bit protected-mode (VxD) code. The core algorithm is as follows:
- Call the Get_System_Time VxD API in a loop until the time changes.
- Run 10000h (1,048,576) iterations of the LOOP instruction.
- Read the current time through Get_System_Time again and calculate the time delta.
- Calculate the number of LOOP iterations per millisecond.
Now, the problem with this approach is that although Get_System_Time reads the time directly using the 8254 PIT (and it could provide microsecond accuracy), it is only accurate to one millisecond. If it takes less than one millisecond to call Get_System_Time twice and run the LOOP instruction about one million times in between, there will be a problem—because the starting and ending millisecond may be the same, causing the delta to be zero, and directly leading to a division by zero (the code is not very careful). As long as Get_System_Time does not return the same value, everything will be fine.
This is exactly the kind of thing that would have been 100% solid on circa 100 MHz and slower CPUs, but when the CPU clock speed went up several times and the LOOP instruction also took fewer cycles to execute… trouble. Given the WfW 3.11 release date (1993), the code could not have been tested on anything faster than a 66 MHz Pentium, if at all. A 350 MHz K6-2 would be more than five times faster on the basis on clock speed alone, but the performance differential was much bigger in practice.
Note that the calls to Get_System_Time make things even more interesting. As mentioned above, these would access the PIT, which means port I/O. There could well have been differences in how quickly different chipsets handled these reads, with faster access being more likely to trigger failures.
What about Windows 95?
As it turns out, NDIS.VXD in Windows 95 has the exact same code for calibrating NdisStallExecution. That is hardly surprising given the close relationship between Windows 95 and Windows for Workgroups 3.11. And thus Windows 95 users may have seen the following screen:
It should be noted that Windows 95 at least had the decency to point the finger firmly in the direction of the troublemaker.
To make things more interesting, Windows 95 also added the same logic in several other modules, namely ESDI_506.PDR and SCSIPORT.PDR.
Microsoft fixed the problems for Windows 95 OSR2 and provided an update. It was somewhat unfortunate that this was called an “AMD fix” (the file containing the solution was called AMDK6UPD.EXE), even though Microsoft was clear that this was not a problem in AMD CPUs but rather in their own code.
The Intel LOOP
Why weren’t Intel CPUs affected before or around the same time as AMDs? In August 1998, Intel already had a Pentium II running at 450 MHz. Shouldn’t it have had more trouble than a 350 MHz K6-2? It didn’t, and to find out why one needs to look at the optimization manuals.
But first let’s consider the original Pentium, the fastest processor available when the code was written. According to the Pentium Processor Family Developer’s Manual Volume 3: Architecture and Programming Manual (Intel order no. 241430), the LOOP instruction. The absolute best case when the branch is taken is 6 clock cycles. The Intel manual notes that “[t]he unconditional LOOP instruction takes longer to execute than a two-instruction sequence which decrements the count register and jumps if the count does not equal zero”.
1,048,576 iterations will then take at minimum 6,291,456 clock cycles on a Pentium, which at 66 MHz frequency would take a bit over 94 milliseconds to execute. Note that that’s the best case, which is actually the worst case (shortest execution time, most likely to cause division overflow).
Now consider the AMD K6-2’s contemporary, a 350 MHz Intel Pentium II. The source of information is the Intel Architecture Optimization Manual (Intel order no. 242816-003). Again the manual advises: Avoid using complex instructions (for example, enter, leave, loop). Use sequences of simple instructions instead. Of course for Microsoft’s purposes, the fact that the LOOP instruction was slow on Intel CPUs was, if anything, desirable.
The simple LOOP instruction decodes to 4 μops on the P6 architecture. Figuring out from the Intel manuals how many clock cycles is… difficult. Agner Fog appears to have the same difficulty and does not present the LOOP throughput for Pentium II/III in his work. But he does present the Pentium M throughput for LOOP—6 cycles. Chances are that that’s what the Pentium II/III also needed.
And experimentation on a Pentium II processor indeed shows that that’s the case. A LOOP loop with 1,048,576 iterations takes just under 6 million clock cycles to execute, indicating throughput of more or less exactly 6 cycles. A 350 MHz Pentium II would then take about 17 milliseconds to execute the loop. A Pentium III running at 1 GHz would still take about 6 milliseconds to run the loop.
What about the K6-2 then? Unlike Intel, the AMD-K6 Processor Code Optimization Application Note (AMD publication 21924 Rev. D, January 2000) actually recommends using the LOOP instruction where applicable and on page 89 says: JCXZ takes 2 cycles when taken and 7 cycles when not taken. LOOP takes 1 cycle.
What that means is that the K6-2 executed the LOOP instruction six times faster than contemporary Intel CPUs at the same clock speed. That’s quite a big difference.
In other words, an AMD K6 running at 350 MHz will chew through 350,000,000 LOOP iterations per second, and 1,048,576 iterations will take just under 3 milliseconds.
Other Win95 Problem Spots
But that does not add up, does it? Even if the NDIS stall calibration loop takes slightly less than three milliseconds to run, there’s no way the measured time delta will be zero, unless Get_System_Time is completely broken. And it’s not. Therefore we just found out why Windows for Workgroups 3.11 did not crash on 350 MHz CPUs. Yet Win95 is known to have have had the same kind of problems… but why?
Because similar code is in other Win95 components, as mentioned above. And the key word here is similar.
For instance ESDI_586.PDR (the port driver for IDE disks, very commonly used) contains the following logic:
- Call the Get_System_Time VxD API (without waiting until the time changes).
- Run 1,000,000) iterations of the LOOP instruction.
- Read the current time through Get_System_Time again and calculate the time delta.
- Divide (1,000,000 * 10,000) by the time delta.
- Divide the result by 153 (that is 10,000,000 / 65536).
- Store the calculated constant.
The SCSI port driver SCSIPORT.PDR contains identical code to implement the ScsiPortStallExecution API. And a Win95 machine would be almost guaranteed to use either ESDI_506.PDR or SCSIPORT.PDR, except in safe mode.
The storage driver algorithm is less generic because ScsiPortStallExecution is specified to only support stalls of less than one milliseconds, whereas the NDIS variant can support arbitrarily long stalls (implemented using different approaches). The final division by 153 is done because ScsiPortStallExecution takes the argument (number of microseconds to stall), multiplies it by the calculated constant, and shifts the result right by 16 bits.
The storage port calibration algorithm is more susceptible to problems. It does not “align” the start of the calibration loop execution with a time tick, which causes the measurement results to be unstable. It uses slightly fewer LOOP iterations, perhaps just enough to make a difference. Most importantly, it divides a very large number (10,000,000,000 or 2540BE400h) by a potentially small count of milliseconds.
The biggest trouble with the storage algorithm then is that it is not only susceptible to division by zero, but it is—unlike the algorithm used in NDIS—also susceptible to division overflows when the divisor is 1 or 2. And that’s exactly what could happen on a 350 MHz AMD K6. Depending on exactly how the LOOP loop aligned with millisecond ticks, the measured result was probably often 3 milliseconds (due to the measurement overhead and inaccuracy) but sometimes only two. That is exactly what would cause a division overflow. Mystery solved.
Again Intel CPUs were effectively impervious to this problem at the time due to their much slower LOOP instruction execution. Today’s Intel CPUs naturally would also crash because although they still execute the LOOP instruction somewhat slowly, their clock speeds are roughly 10 times faster than those 350 MHz Pentium IIs.
Safe Mode Detour
On a modern (“extremely fast” from the Windows 9x perspective) machine, Winows 95 can boot to safe mode, but not safe mode with networking. In the latter case, it still reports a “Windows protection error” in NDIS.
In light of the above, it is easy to see why. In the plain safe mode, networking is skipped and therefore NDIS isn’t loaded. The native Windows 95 storage drivers are not used either. That sidesteps the components which cause division overflows.
Safe mode with networking won’t use the native storage drivers but still uses NDIS, which means it will divide by zero when NDIS initializes.
Later Win9x Versions
Windows 98 (First Edition) appears to have fixed the division overflows in the storage drivers, but not in NDIS. That’s almost certainly because the storage driver crashes could be observed on hardware available in 1998, but the NDIS crashes could not. In fact the “AMD fix” for Windows 95 OSR2 likewise corrected the problems in the storage drivers but left NDIS untouched.
And even the storage driver fixes weren’t great. The calibration algorithm was modified to avoid the possibility of overflow when dividing by anything other than zero, and it was changed to run 10 million LOOP cycles for calibration rather than the original 1 million. If 10 million LOOP iterations completed in under 1 millisecond (no hardware capable of that exists in 2020), the code would still crash with a division by zero.
In 2001, Microsoft issued a fix for the NDIS crash in Windows 98 (but not Windows 95). The fixed calibration algorithm retries once if the first measurement resulted in a zero millisecond delta. If the second try also results in a zero, it’s simply forced to one to avoid the crash. The fixed NDIS calibration therefore won’t crash, regardless of how fast the CPU is.
Windows 98 SE (1999) already came with fixes for the NDIS crashes as well as the storage driver crashes and does not have major speed-related problems on today’s (2020) hardware; it does have other, unrelated problems on recent AMD CPUs.
Update: Intel Pentium 4 (tested on Irwindale Xeon) appears to execute the LOOP instruction in two cycles, noticeably faster than older and newer Intel CPUs. That explains why NDIS crashed on 2.2 GHz P4s in 2001—those were the first Intels capable of executing 1,048,576 LOOP iterations in under one millisecond. Ironically, AMD’s K7 (Athlon) also needed two cycles per LOOP iteration (according to Agner Fog’s tables), and that’s why Intel hit that particular barrier first, since the Pentium 4 ran at significantly higher clock speed, if not higher performance.
Once again, this is the sort of problem that no amount of testing could have caught when the code was written. That said, a code review could and should have asked questions like “what happens if the calibration loop executes in under one millisecond?”. Either that didn’t happen or the possibility was considered sufficiently unlikely to ignore.
It’s likewise interesting to compare the NDIS algorithm with the storage port algorithm. Both use the exact same core logic (run the LOOP instruction a number of times to stall a given number of microseconds) but the storage port code is much more susceptible to problems, because it is both less careful when measuring the delay length and because additional input values trigger a division overflow.
The issue also illustrates how seemingly solid assumptions made by software and hardware engineers sometimes aren’t. Software engineers look at the currently available CPUs, see how the fastest ones behave, and assume that CPUs can’t get faster by a factor of 100 anytime soon. Except they can when the clock speed goes up several times and the instructions execute several times quicker.
In this particular case, it took only 5 years to get from a 66 MHz Intel Pentium to a 350 MHz AMD K6-2, dropping the execution time of the calibration loop from almost 100 milliseconds to under 3 milliseconds.
Hardware engineers on the other hand assume that making instructions faster is a good thing. In this case AMD no doubt optimized the LOOP instruction long before the increased clock speed triggered the crashes.
It is not known whether Intel simply did not bother making the LOOP instruction execute fast (and just effectively told everyone not to use it), or whether Intel knew that making LOOP fast could trigger problems in poorly written software. Either or both is possible.
Windows for Workgroups 3.11 was not exactly the first piece of software using the LOOP instruction for software timing. The IBM PC LAN Program 1.3 from 1988 used LOOP in a similar way, and similarly crashes (in the NETWORK1.CMD component) with a division by zero on CPUs significantly faster than those available at the time.
A slightly different variation on the theme was used by Sierra On-Line’s Sound Blaster drivers. The drivers used the LOOP instruction to wait for an interrupt to arrive. On some late-1990s machines, the delay was insufficient and the drivers failed to load, thinking that interrupts weren’t working.
Many of these LOOP uses were unsafe and implicitly assumed that the LOOP loop cannot execute faster than some arbitrary limit. Time has proven the assumption-makers wrong one by one.