Those Win9x Crashes on Fast Machines…

Posted on June 2, 2020 by Michal Necasek

It is well known that Win9x variants prior to Windows 98 have a tendency to crash on fast CPUs. The definition of “fast” is of course fuzzy but the problems were known to occur on AMD K6-2 processors running at 350 MHz or faster as early as 1998. This led to some acrimony when Microsoft attempted to charge $35 for the fix. The crashes were intermittent on the 350 MHz parts but harder to avoid with faster clock speeds.

The problem soon started affecting other CPUs with higher frequencies, but it didn’t affect Intel processors for a while. Were Intel CPUs somehow better? Not exactly, but there was a reason for that; more about it later.

I have long been aware of this problem but never looked into the details. And when I did, I didn’t realize at first that it was the same problem. An acquaintance mentioned that Windows 3.11 for Workgroups no longer works in a VM.

That’s as far as WfW 3.11 would get

After some investigation it turned out that the issue is related to the host CPU. An older Intel i7-2600 host exhibited the crash, but rarely. A newer Ryzen 7 3800X crashed every time. Some unexpected Intel/AMD difference? Well, yes and no…

That said, “crashed” describes more the cause than the symptoms. After running ‘win’, WfW 3.11 would show the Windows logo and pretty quickly drop back to the DOS prompt without giving any hints as to why.

After hunting down the WfW 3.11 debug kernel (not so easy) and engaging WDEB386, the cause became more apparent. A division by zero occurred in protected mode, causing Windows to shut itself down. Earlier experimentation showed that ‘win /n’ runs without problems, so networking was the prime suspect.

Interestingly, with the debug bits in place, Windows did show an error message:

Windows for Workgroups 3.11 crashed. Can you guess why?

Now, that error message was not really any more enlightening than just silently dropping back to DOS. And it was not clear from WDEB386 where the code was crashing either.

So I resorted to looking for the zero-dividing code sequence in Windows binaries and soon enough found the culprit: NDIS.386. And that explained why the problem only showed up with the protected-mode networking stack in WfW 3.11, but didn’t actually depend on the network driver or protocols or any configuration details.

After a few minutes with IDA Pro, the real cause became apparent. The NDIS module calibrates a delay loop for the NdisStallExecution API. Note that this is 32-bit protected-mode (VxD) code. The core algorithm is as follows:

Call the Get_System_Time VxD API in a loop until the time changes.
Run 100000h (1,048,576) iterations of the LOOP instruction.
Read the current time through Get_System_Time again and calculate the time delta.
Calculate the number of LOOP iterations per millisecond.

Now, the problem with this approach is that although Get_System_Time reads the time directly using the 8254 PIT (and it could provide microsecond accuracy), it is only accurate to one millisecond. If it takes less than one millisecond to call Get_System_Time twice and run the LOOP instruction about one million times in between, there will be a problem—because the starting and ending millisecond may be the same, causing the delta to be zero, and directly leading to a division by zero (the code is not very careful). As long as Get_System_Time does not return the same value, everything will be fine.
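The failure mode can be reproduced with a small model. The following Python sketch is not the NDIS code, only an illustration of the steps above; the function names and the whole-millisecond timer model are assumptions, and the overhead of the Get_System_Time calls is ignored.

```python
ITERATIONS = 0x100000  # 1,048,576 LOOP iterations, as in the calibration described above


def elapsed_ms(cpu_hz, cycles_per_loop):
    """Whole milliseconds that pass while the LOOP run executes.

    Assumes the run starts right on a millisecond boundary (the real code
    waits for the time to change first) and that the timer truncates to
    whole milliseconds.
    """
    return ITERATIONS * cycles_per_loop * 1000 // cpu_hz


def calibrate(cpu_hz, cycles_per_loop):
    """LOOP iterations per millisecond, with no guard against a zero delta."""
    delta = elapsed_ms(cpu_hz, cycles_per_loop)
    return ITERATIONS // delta  # ZeroDivisionError when the run takes under 1 ms


print(calibrate(66_666_667, 6))  # slow CPU: the delta is 94 ms, the division works
try:
    calibrate(3_000_000_000, 1)  # hypothetical 3 GHz CPU, 1 cycle per LOOP
except ZeroDivisionError:
    print("delta was 0 ms: division by zero")
```

The 3 GHz, one-cycle CPU is a made-up example chosen only to push the run under one millisecond.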

This is exactly the kind of thing that would have been 100% solid on circa 100 MHz and slower CPUs, but when the CPU clock speed went up several times and the LOOP instruction also took fewer cycles to execute… trouble. Given the WfW 3.11 release date (1993), the code could not have been tested on anything faster than a 66 MHz Pentium, if at all. A 350 MHz K6-2 would be more than five times faster on the basis of clock speed alone, but the performance differential was much bigger in practice.

Note that the calls to Get_System_Time make things even more interesting. As mentioned above, these would access the PIT, which means port I/O. There could well have been differences in how quickly different chipsets handled these reads, with faster access being more likely to trigger failures.

What about Windows 95?

As it turns out, NDIS.VXD in Windows 95 has the exact same code for calibrating NdisStallExecution. That is hardly surprising given the close relationship between Windows 95 and Windows for Workgroups 3.11. And thus Windows 95 users may have seen the following screen:

It should be noted that Windows 95 at least had the decency to point the finger firmly in the direction of the troublemaker.

To make things more interesting, Windows 95 also added the same logic in several other modules, namely ESDI_506.PDR and SCSIPORT.PDR.

Microsoft fixed the problems for Windows 95 OSR2 and provided an update. It was somewhat unfortunate that this was called an “AMD fix” (the file containing the solution was called AMDK6UPD.EXE), even though Microsoft was clear that this was not a problem in AMD CPUs but rather in their own code.

The Intel LOOP

Why weren’t Intel CPUs affected before or around the same time as AMDs? In August 1998, Intel already had a Pentium II running at 450 MHz. Shouldn’t it have had more trouble than a 350 MHz K6-2? It didn’t, and to find out why one needs to look at the optimization manuals.

But first let’s consider the original Pentium, the fastest processor available when the code was written. According to the Pentium Processor Family Developer’s Manual Volume 3: Architecture and Programming Manual (Intel order no. 241430), the absolute best case for the LOOP instruction when the branch is taken is 6 clock cycles. The Intel manual notes that “[t]he unconditional LOOP instruction takes longer to execute than a two-instruction sequence which decrements the count register and jumps if the count does not equal zero”.

1,048,576 iterations will then take at minimum 6,291,456 clock cycles on a Pentium, which at 66 MHz frequency would take a bit over 94 milliseconds to execute. Note that this is the best case for the CPU, which is actually the worst case for the calibration code (shortest execution time, most likely to cause division overflow).

Now consider the AMD K6-2’s contemporary, a 350 MHz Intel Pentium II. The source of information is the Intel Architecture Optimization Manual (Intel order no. 242816-003). Again the manual advises: “Avoid using complex instructions (for example, enter, leave, loop). Use sequences of simple instructions instead.” Of course for Microsoft’s purposes, the fact that the LOOP instruction was slow on Intel CPUs was, if anything, desirable.

The simple LOOP instruction decodes to 4 μops on the P6 architecture. Figuring out from the Intel manuals how many clock cycles that takes is… difficult. Agner Fog appears to have the same difficulty and does not present the LOOP throughput for Pentium II/III in his work. But he does present the Pentium M throughput for LOOP: 6 cycles. Chances are that that’s what the Pentium II/III also needed.

And experimentation on a Pentium II processor indeed shows that that’s the case. A LOOP loop with 1,048,576 iterations takes just over 6 million clock cycles to execute, indicating throughput of more or less exactly 6 cycles per iteration. A 350 MHz Pentium II would then take about 18 milliseconds to execute the loop. A Pentium III running at 1 GHz would still take about 6 milliseconds to run the loop.

What about the K6-2 then? Unlike Intel’s manuals, the AMD-K6 Processor Code Optimization Application Note (AMD publication 21924 Rev. D, January 2000) actually recommends using the LOOP instruction where applicable and on page 89 says: “JCXZ takes 2 cycles when taken and 7 cycles when not taken. LOOP takes 1 cycle.”

What that means is that the K6-2 executed the LOOP instruction six times faster than contemporary Intel CPUs at the same clock speed. That’s quite a big difference.

In other words, an AMD K6 running at 350 MHz will chew through 350,000,000 LOOP iterations per second, and 1,048,576 iterations will take just under 3 milliseconds.
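The numbers quoted in this section are easy to re-derive. The sketch below assumes the cycle counts given above (6 cycles per LOOP on the Pentium and Pentium II/III, 1 cycle on the K6-2) and treats the nominal 66 MHz Pentium as 66.67 MHz; the function name is made up for illustration.

```python
ITERATIONS = 1_048_576  # LOOP iterations in one calibration run


def loop_time_ms(cpu_mhz, cycles_per_loop):
    """Time in milliseconds to run the calibration loop once."""
    total_cycles = ITERATIONS * cycles_per_loop
    return total_cycles / (cpu_mhz * 1000.0)  # MHz * 1000 = cycles per millisecond


cpus = [
    ("Pentium 66", 66.67, 6),
    ("Pentium II 350", 350.0, 6),
    ("Pentium III 1000", 1000.0, 6),
    ("K6-2 350", 350.0, 1),
]
for name, mhz, cycles in cpus:
    print(f"{name:18s} {loop_time_ms(mhz, cycles):7.2f} ms")
```

Only the loop itself is modeled; port I/O time for reading the PIT would add to each figure.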

Safe Mode Detour

On a modern (“extremely fast” from the Windows 9x perspective) machine, Windows 95 can boot to safe mode, but not safe mode with networking. In the latter case, it still reports a “Windows protection error” in NDIS.

In light of the above, it is easy to see why. In the plain safe mode, networking is skipped and therefore NDIS isn’t loaded. The native Windows 95 storage drivers are not used either. That sidesteps the components which cause division overflows.

Safe mode with networking won’t use the native storage drivers but still uses NDIS, which means it will divide by zero when NDIS initializes.

Later Win9x Versions

Windows 98 (First Edition) appears to have fixed the division overflows in the storage drivers, but not in NDIS. That’s almost certainly because the storage driver crashes could be observed on hardware available in 1998, but the NDIS crashes could not. In fact the “AMD fix” for Windows 95 OSR2 likewise corrected the problems in the storage drivers but left NDIS untouched.

And even the storage driver fixes weren’t great. The calibration algorithm was modified to avoid the possibility of overflow when dividing by anything other than zero, and it was changed to run 10 million LOOP iterations for calibration rather than the original 1 million. If 10 million LOOP iterations completed in under 1 millisecond (no hardware capable of that exists in 2020), the code would still crash with a division by zero.

In 2001, Microsoft issued a fix for the NDIS crash in Windows 98 (but not Windows 95). The fixed calibration algorithm retries once if the first measurement resulted in a zero millisecond delta. If the second try also results in a zero, it’s simply forced to one to avoid the crash. The fixed NDIS calibration therefore won’t crash, regardless of how fast the CPU is.
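The retry-then-clamp logic described above can be sketched in a few lines of Python. This is a paraphrase of the described behavior, not Microsoft's code; `measure_delta_ms` is a hypothetical callable standing in for one timed run of the LOOP sequence.

```python
def calibrate_fixed(measure_delta_ms, iterations=0x100000):
    """Return LOOP iterations per millisecond without ever dividing by zero.

    measure_delta_ms is a hypothetical callable that runs the LOOP sequence
    once and returns the elapsed time in whole milliseconds.
    """
    delta = measure_delta_ms()
    if delta == 0:
        delta = measure_delta_ms()  # retry once
    if delta == 0:
        delta = 1  # still zero: force the delta to one millisecond
    return iterations // delta


print(calibrate_fixed(lambda: 0))  # CPU too fast to measure: no crash
print(calibrate_fixed(lambda: 3))  # ordinary case: plain division
```

The price is accuracy: with the delta forced to one, the computed rate is lower than the real one, so delays based on it would presumably come out shorter than requested.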

Windows 98 SE (1999) already came with fixes for the NDIS crashes as well as the storage driver crashes and does not have major speed-related problems on today’s (2020) hardware; it does have other, unrelated problems on recent AMD CPUs.

Update: Intel Pentium 4 (tested on Irwindale Xeon) appears to execute the LOOP instruction in two cycles, noticeably faster than older and newer Intel CPUs. That explains why NDIS crashed on 2.2 GHz P4s in 2001—those were the first Intels capable of executing 1,048,576 LOOP iterations in under one millisecond. Ironically, AMD’s K7 (Athlon) also needed two cycles per LOOP iteration (according to Agner Fog’s tables), and that’s why Intel hit that particular barrier first, since the Pentium 4 ran at significantly higher clock speed, if not higher performance.
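The 2.2 GHz figure can be checked with a back-of-the-envelope calculation. The sketch below computes the clock frequency above which 1,048,576 LOOP iterations finish in under one millisecond, for a given number of cycles per iteration. It ignores the overhead of the Get_System_Time calls, and the function name is invented for illustration.

```python
ITERATIONS = 1_048_576  # LOOP iterations in one calibration run


def crash_threshold_ghz(cycles_per_loop):
    """Clock frequency (GHz) at which the loop takes exactly one millisecond."""
    cycles_per_ms = ITERATIONS * cycles_per_loop
    return cycles_per_ms * 1000 / 1e9  # cycles per ms -> cycles per second -> GHz


for cycles in (6, 2, 1):
    print(f"{cycles} cycles per LOOP: faster than {crash_threshold_ghz(cycles):.2f} GHz")
```

With two cycles per LOOP the threshold comes out at about 2.1 GHz, which fits the 2.2 GHz Pentium 4 observation; at six cycles per LOOP it would be above 6 GHz.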

Lessons Learned?

Once again, this is the sort of problem that no amount of testing could have caught when the code was written. That said, a code review could and should have asked questions like “what happens if the calibration loop executes in under one millisecond?”. Either that didn’t happen or the possibility was considered sufficiently unlikely to ignore.

It’s likewise interesting to compare the NDIS algorithm with the storage port algorithm. Both use the exact same core logic (run the LOOP instruction a number of times to stall a given number of microseconds) but the storage port code is much more susceptible to problems, both because it is less careful when measuring the delay length and because additional input values trigger a division overflow.

The issue also illustrates how seemingly solid assumptions made by software and hardware engineers sometimes aren’t. Software engineers look at the currently available CPUs, see how the fastest ones behave, and assume that CPUs can’t get faster by a factor of 100 anytime soon. Except they can when the clock speed goes up several times and the instructions execute several times quicker.

In this particular case, it took only 5 years to get from a 66 MHz Intel Pentium to a 350 MHz AMD K6-2, dropping the execution time of the calibration loop from almost 100 milliseconds to under 3 milliseconds.

Hardware engineers on the other hand assume that making instructions faster is a good thing. In this case AMD no doubt optimized the LOOP instruction long before the increased clock speed triggered the crashes.

It is not known whether Intel simply did not bother making the LOOP instruction execute fast (and just effectively told everyone not to use it), or whether Intel knew that making LOOP fast could trigger problems in poorly written software. Either or both is possible.

Addendum

Windows for Workgroups 3.11 was not exactly the first piece of software using the LOOP instruction for software timing. The IBM PC LAN Program 1.3 from 1988 used LOOP in a similar way, and similarly crashes (in the NETWORK1.CMD component) with a division by zero on CPUs significantly faster than those available at the time.

A slightly different variation on the theme was used by Sierra On-Line’s Sound Blaster drivers. The drivers used the LOOP instruction to wait for an interrupt to arrive. On some late-1990s machines, the delay was insufficient and the drivers failed to load, thinking that interrupts weren’t working.

Many of these LOOP uses were unsafe and implicitly assumed that the LOOP loop cannot execute faster than some arbitrary limit. Time has proven the assumption-makers wrong one by one.

This entry was posted in AMD, Bugs, Intel, Microsoft.

24 Responses to Those Win9x Crashes on Fast Machines…

Darkstar says:

June 2, 2020 at 2:19 pm

That reminds me of the well-known “Run-time error 200” that Turbo Pascal programs crash with, when they are run on a reasonably-fast computer. In that case the calibration loop is inside the runtime, and it fails for very much the same reason (division by zero). I don’t remember if it was also based on the LOOP instruction or if it was a simple JNZ-type of loop.
zeurkous says:

June 2, 2020 at 3:09 pm

@Necasek: You did it again! s/0f/of/-1

@Darkstar: Yeah. There’s TPPATCH for that. Or should me say TPAMK6UP?
zeurkous says:

June 2, 2020 at 3:11 pm

(Imagine Windoze 3 being written in Pascal. Wouldn’t that be a delight?)
calvin says:

June 2, 2020 at 4:22 pm

Huh, that Windows 98 issue you linked might be why I’ve had crashes with it on Ryzen. (Interestingly, ME works though – go figure.)
Michal Necasek says:

June 2, 2020 at 5:07 pm

I can confirm that Win9x definitely has that TLB trouble on my Ryzen. But not, I believe, all versions. I did not see the issue with Win95 but it definitely shows up in Win98 SE. There are random crashes, which is exactly the symptom one would expect as a result of TLB mismanagement. In VirtualBox, the problem magically goes away when nested paging is turned off.
Michal Necasek says:

June 2, 2020 at 5:08 pm

Indeed, my fonts make the difference a bit too subtle, especially when I “know” what I had written.
Michal Necasek says:

June 2, 2020 at 5:09 pm

That, and a similar crash in Norton Sysinfo, is on my to-be-investigated list.
Rich Shealer says:

June 2, 2020 at 5:09 pm

I love articles like this. It amazes me how you are able to relatively quickly pinpoint specific code areas the way you do and provide the detailed analysis.

I wish I had something to add, other than I remember when this was a problem and it is nice to know why 20 years later.

Will you be making any commentary on the recently released GW-BASIC source code?
Michal Necasek says:

June 2, 2020 at 6:52 pm

The last time I did BASIC was on a Commodore 64, never on a PC (I came to PCs when Pascal/C/assembler was the hobbyist’s weapon of choice, no longer BASIC). The GW-BASIC source release is interesting because it’s so incomplete. There are bits missing and there’s no sign of the portable source template it had been made from. It’s also quite old (1982?).

I briefly looked at the source and realized that I don’t have a lot of binaries of similar vintage. The source code seems to be newer than IBM’s ROM but older than the typical Microsoft GW-BASIC. The closest thing I could quickly find was a BASIC executable from Compaq DOS 1.1. I may have a look how close the source code is to that binary but can’t promise anything.
Richard Wells says:

June 2, 2020 at 9:41 pm

AMD slowed the LOOP instruction in K8 and K10 designs. AMD also recommended not using LOOP for those CPUs because the LOOP duration differed in 32-bit and 64-bit mode. Isn’t writing portable code exciting?

GW-BASIC seems fitting as a segue here since GW-BASIC omits what used to be a programmer’s first introduction to the evils of timing loops: program controlled cassette operation.
Yuhong Bao says:

June 3, 2020 at 12:19 am

I wonder about the vfbackup.vxd problem in https://jeffpar.github.io/kbarchive/kb/234/Q234259/
Chris M. says:

June 3, 2020 at 4:38 pm

Another notable “bug” with the Windows 9x TCP/IP stack is the brain dead DHCP client. It would hold up booting for like a minute before it timed out if it didn’t acquire an IP address.

Regarding TLB problems, it appears Windows 98SE runs bare metal on Zen+ (Ryzen 2000) machines without a problem. One OS that has seemingly aged well is Windows NT 4.0. I had no problems (compared to 9x!) running a fully patched version on Core2Duo era hardware with the UniATA driver.
Chris says:

June 3, 2020 at 4:53 pm

If I load your site via HTTPS, I can’t see the images, I get a blocked:mixed-content warning in the console. If I load it via HTTP it works.
Simon says:

June 3, 2020 at 10:20 pm

In Microsoft’s GW-BASIC source release, the cassette driver is stubbed out to just return a “device not available” error – https://github.com/microsoft/GW-BASIC/blob/09ad7bc671c90f0eeff4cb7593121ad6f170d903/GIOCAS.ASM – but if you disassemble the 3.23 executable, it isn’t stubbed out, the cassette driver is present and calls the INT 15H BIOS cassette API, you can see those BIOS calls in the disassembly. So that is at least one way in which the binary differs from the available source.
Yuhong Bao says:

June 3, 2020 at 11:19 pm

Note that from https://www.theregister.com/1998/11/19/win95_bug_could_spread/ : “We couldn’t find the patch so we called Microsoft and to get the patch they were telling us that we would have to set up a phone support service account which would cost us $US35.”
I believe that PSS typically refunds this kind of support incident.
Michal Necasek says:

June 4, 2020 at 1:43 pm

The OS/2 DHCP client is just as bad or even worse, it holds up boot for about a minute and then requires the user to press a key to continue. What were they thinking…

I have some trouble believing that Win98SE runs on bare metal Ryzen flawlessly but has TLB trouble in a VM. That said, TLB bugs are nasty and may require specific circumstances to trigger.

NT was always much more solid in this regard, even NT 3.1 is stable on fast machines.
John D. says:

June 5, 2020 at 9:59 pm

Simon (GW-BASIC): But that’s obvious. GW-BASIC wasn’t for the actual IBM-PC, but the generic version for MS-DOS: all the clones. IBM’s was BASICA, and relies on the ROM version. No clone actually built cassette port hardware, and even IBM dropped it after the first model. After that point Microsoft probably updated the source to remove it, since anything built from the source after that point (except the separate PCjr) would never need it.
Richard Wells says:

June 5, 2020 at 11:31 pm

Does any code in GW-BASIC 3.23 call on the Int 15h routines? The standard PC and exact clones include file might incorporate it even if the hardware doesn’t support it. There were programs that called on entry points within BASIC directly so unused cassette interface code would keep the rest of GW-BASIC at the proper memory locations.

Don’t forget IBM also had the PC JX with cassette support. There were near clones of the XT with a cassette interface made but those were from the Soviet Union. See the Poisk 1 and Elektronika MS1502. These were similar enough to the 5150 that cassette programs from them were successfully loaded on a 5150 though sometimes with the addition of a BASIC loader program to make up for IBM not having a monitor ROM like the Soviet machines did. IIRC, IBM had a contractual requirement that MS would not supply ROM BASIC to any competitor making a cassette interface less desirable even beyond the problems of getting a system with a faster CPU to manage 5150 compatible cassette routines.

The last few years have seen a surprising resurgence in interest in the 5150 cassette port with games that can be loaded off cassette and even a rudimentary program to transfer files from disk to cassette and from cassette to disk.
Simon says:

June 6, 2020 at 9:04 am

@Richard @John

If I disassemble my copy of GWBASIC.EXE 3.23 (size 80608 bytes, MD5 hash a75f8ad162b673cf28df0c49b7f26711), I see a far subroutine to call INT 15,3 (Write Blocks to Cassette) at 0x102C.
At 0x1058 there is a far subroutine to call INT 15,2 (Read Blocks from Cassette)
At 0x1087 there is a far subroutine to call INT 15,0 and INT 15,1 (toggle cassette motor)
There are 3 other INT 15 calls, but those are not cassette API calls:
At 0x1E26 there is a call to INT 15,86 (AT Elapsed Time Wait service)
At 0x93E6 and at 0x95CF there are calls to INT 15,84 (BIOS Joystick API)

Note I am disassembling using ndisasm, which doesn’t understand the MZ executable format, so these offsets are byte offsets relative to the start of the file, not IP offsets.
When I start GWBASIC.EXE using DEBUG.EXE, it starts at IP=0xF85A.
In the ndisasm output, the code there matches that at file offset 0x1292A.

Are these 3 cassette subroutines actually being called? Or are they just dead code?
Well, one interesting observation about GWBASIC.EXE – all the far calls are actually indirect. (i.e. you only find “call far […]”, never “call far 0x…”, although you can find “call 0x…”.) So, it is obvious there is some kind of call table being used, and the question would be whether these routines are referenced in the call table.

Well, I know that the toggle cassette motor routine at 0x1087 is being used.
At 0x1092 is the INT 0x15 call (CD15).
If I change the byte at 0x1092 to be 0x19 instead, and then run the “MOTOR” statement twice, on the second execution I get the “Reboot requested, quitting now” message which DOSBox displays when INT 0x19 is invoked. So this proves that routine is actually being called.

A similar technique can be used to prove the Write Cassette routine at 0x102C is being called.
Change the byte 0x15 at offset 0x1043 to 0x19.
Then execute the BASIC command: SAVE “CAS1:FOO”
You will get the “Reboot requested, quitting now” message displayed in DOSBox

Likewise to prove Read Cassette routine at 0x1058 is called:
Change the byte 0x15 at offset 0x106D to 0x19.
Now execute the BASIC command: LOAD “CAS1:FOO”
Again will get the “Reboot requested, quitting now” message displayed in DOSBox

By contrast, without patching the INT 0x15 calls to INT 0x19, both “SAVE CAS1:” and “LOAD CAS1:”
display “Device I/O Error”.

CAS1: device is not just supported for LOAD/SAVE, but for all GW-BASIC IO commands,
such as “BSAVE”,”BLOAD”,”OPEN”,etc

For example, OPEN “I”,1,”CAS1:FOO” will display “Device I/O Error” before the INT 0x19 hack, and “Reboot requested” afterwards.

Directory listing of cassette tapes appears to be unsupported.
FILES “CAS1:” always prints “Device Unavailable”, regardless of whether INT 0x19 hack is applied or not.
However, FILES does appear to at least understand “CAS1:”, because if you do FILES on a non-existent drive (e.g. D:), or even on the non-existent “CAS2:” device, you get “File not found” error instead of “Device Unavailable”
My guess is, that there is a stub routine to do a cassette tape directory listing, which just displays the “Device Unavailable” error.
GW-BASIC actually understands the cassette filesystem, since if you have multiple files on the same tape, commands like “LOAD” can retrieve individual files. (The tape is sequential, and there was no support in IBM PC hardware for automated tape rewind – the user had to manually rewind the tape at the start. This means tape I/O might be able to retrieve multiple files, but only in the order they were written. You could skip a file, but then you couldn’t come back to it later without asking the user to manually rewind.)
ForOldHack says:

June 9, 2020 at 7:20 am

Of course, I would come here to be greeted by a scary amount of information about some… 30 year old program. BASIC Hmmm…
I had a 5150, but never used the cassette interface, and didn’t need to know much about it, unless the floppy drive did not have a OS on it, then I needed to reboot from Cassette Basic 1.1, ( I got mine in August with DOS 1.1, but had the second batch of patched ROMS ). We helped friends with GW-Basic, and later found a program that copied the BASIC Roms into BASICA that would run on clones. ( Gee, I wonder what the disassembly of that program would reveal. ).

Great work, and very informative.

Here are the fun links I put up on Wikipedia…

http://www.cnd.org/HYPLAN/yawei/freesoft.html
Vikki McDonough says:

May 17, 2021 at 11:42 pm

Is this why my W95 VMs fail to boot (except in safe mode) if hardware virtualisation is enabled (thus making said VMs unusable in VirtualBox 6.1 and newer, which no longer support software virtualisation)?
Michal Necasek says:

May 18, 2021 at 9:29 am

Maybe. There are two separate problems with Win9x. One is those timing loops that just blow up on fast CPUs, that usually causes a “protection error” or something before the OS even boots up. The other issue is that Microsoft violated the rules Intel specified for page table management, and that causes Win9x to randomly crash on AMD CPUs made after 2012 or so.

The latter problem can be worked around by turning off nested paging in the VM settings. The former is a bit harder but there should be patches for more or less everything from back in the day, because the “too fast” machines existed in the late 1990s and early 2000s already.

FWIW, I believe Windows 98 SE had all the speed problems fixed, but Windows 95 definitely did not.
Vikki McDonough says:

June 7, 2021 at 5:06 am

The VMs crash with a “Windows protection error”, so yeah, it’s probably those timing loops. (Especially since the desktop I’m running them on has an Intel Core 2 CPU, so it probably wouldn’t be affected by issues involving specifically AMD CPUs. :-P)

And it looks like at least some of the speed problems were fixed even in 98 FE, since my 98 FE VMs boot up no problemo even with hardware virtualisation on.
Michal Necasek says:

June 7, 2021 at 9:40 am

That sounds very likely. For whatever reason, Microsoft had several instances of those timing calibration loops and they ran with different parameters, so some of them started causing trouble sooner than others. IIRC some problems first popped up with ~350 MHz AMD K6 CPUs, while other problems only turned up when Pentium 4 got past 2 GHz. I don’t know exactly what got fixed when and it’s entirely possible that the first Win98 release already included the fixes, or at least most of them.

Because the problems/fixes were in different components, the behavior also depends on hardware configuration. If you don’t have networking enabled, you won’t run into NDIS crashes, if you have no SCSI drives you won’t run into bugs in SCSI storage drivers, and so on. This is also why safe mode tends to work.