It is well known that Win9x variants prior to Windows 98 SE have a tendency to crash on fast CPUs. The definition of “fast” is of course fuzzy, but the problems were known to occur on AMD K6-2 processors running at 350 MHz or faster as early as 1998. This led to some acrimony when Microsoft attempted to charge $35 for the fix. The crashes were intermittent on the 350 MHz parts but harder to avoid at faster clock speeds.
The problem soon started affecting other CPUs with higher frequencies, but it didn’t affect Intel processors for a while. Were Intel CPUs somehow better? Not exactly, but there was a reason for that; more about it later.
I have long been aware of this problem but never looked into the details. And when I finally did, at first I didn’t realize it was the same problem. An acquaintance mentioned that Windows for Workgroups 3.11 no longer works in a VM.
After some investigation it turned out that the issue is related to the host CPU. An older Intel i7-2600 host exhibited the crash, but rarely. A newer Ryzen 7 3800X crashed every time. Some unexpected Intel/AMD difference? Well, yes and no…
That said, “crashed” describes the cause better than the symptoms. After running ‘win’, WfW 3.11 would show the Windows logo and then quickly drop back to the DOS prompt without giving any hint as to why.
After hunting down the WfW 3.11 debug kernel (not so easy) and engaging WDEB386, the cause became more apparent. A division by zero occurred in protected mode, causing Windows to shut itself down. Earlier experimentation showed that ‘win /n’ runs without problems, so networking was the prime suspect.
Interestingly, with the debug bits in place, Windows did show an error message:
Now, that error message was not really any more enlightening than just silently dropping back to DOS. And it was not clear from WDEB386 where the code was crashing either.
So I resorted to looking for the zero-dividing code sequence in Windows binaries and soon enough found the culprit: NDIS.386. And that explained why the problem only showed up with the protected-mode networking stack in WfW 3.11, but didn’t actually depend on the network driver or protocols or any configuration details.
After a few minutes with IDA Pro, the real cause became apparent. The NDIS module calibrates a delay loop for the NdisStallExecution API. Note that this is 32-bit protected-mode (VxD) code. The core algorithm is as follows:
- Call the Get_System_Time VxD API in a loop until the time changes.
- Run 100000h (1,048,576) iterations of the LOOP instruction.
- Read the current time through Get_System_Time again and calculate the time delta.
- Calculate the number of LOOP iterations per millisecond.
Now, the problem with this approach is that although Get_System_Time reads the time directly using the 8254 PIT (and it could provide microsecond accuracy), it is only accurate to one millisecond. If it takes less than one millisecond to call Get_System_Time twice and run the LOOP instruction about one million times in between, there will be a problem—because the starting and ending millisecond may be the same, causing the delta to be zero, and directly leading to a division by zero (the code is not very careful). As long as Get_System_Time does not return the same value, everything will be fine.
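To make the failure mode concrete, here is a rough Python model of the four calibration steps. It is a sketch under stated assumptions, not a transcription of NDIS.386: `SimulatedCpu` and its methods are hypothetical names, the clock exposes only whole milliseconds (like Get_System_Time), the “spin until the time changes” step is modelled as jumping straight to the next tick, and the cost of the Get_System_Time calls themselves (port I/O included) is ignored. Python raises ZeroDivisionError where the real code takes a divide error.

```python
import math

LOOP_ITERATIONS = 0x100000  # 100000h = 1,048,576


class SimulatedCpu:
    """Hypothetical model of a CPU plus a whole-millisecond system clock."""

    def __init__(self, cpu_hz, cycles_per_loop):
        self.cpu_hz = cpu_hz
        self.cycles_per_loop = cycles_per_loop
        self.now_ms = 0.0  # exact simulated time in milliseconds

    def get_system_time(self):
        # Stand-in for Get_System_Time: only whole milliseconds are visible.
        return math.floor(self.now_ms)

    def wait_for_tick(self):
        # Stand-in for spinning until Get_System_Time changes.
        self.now_ms = math.floor(self.now_ms) + 1.0

    def run_loop(self, iterations):
        # Advance simulated time by the cost of `iterations` LOOP instructions.
        self.now_ms += iterations * self.cycles_per_loop * 1000.0 / self.cpu_hz


def ndis_calibrate(cpu):
    """Return LOOP iterations per millisecond using the four steps above."""
    cpu.wait_for_tick()                      # 1. align with a tick
    start = cpu.get_system_time()
    cpu.run_loop(LOOP_ITERATIONS)            # 2. run 100000h LOOPs
    delta = cpu.get_system_time() - start    # 3. elapsed whole milliseconds
    return LOOP_ITERATIONS // delta          # 4. fails if delta == 0


print(ndis_calibrate(SimulatedCpu(66_000_000, 6)))   # 11037
print(ndis_calibrate(SimulatedCpu(350_000_000, 1)))  # 524288
try:
    # Hypothetical CPU: 2.4 GHz with a 2-cycle LOOP finishes in under 1 ms.
    ndis_calibrate(SimulatedCpu(2_400_000_000, 2))
except ZeroDivisionError:
    print("delta was 0 ms: divide error")
```

Because the start is aligned with a tick, the delta can only be zero once the whole loop runs in under a millisecond; anything slower yields at least 1.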
This is exactly the kind of thing that would have been 100% solid on circa 100 MHz and slower CPUs, but when the CPU clock speed went up several times and the LOOP instruction also took fewer cycles to execute… trouble. Given the WfW 3.11 release date (1993), the code could not have been tested on anything faster than a 66 MHz Pentium, if at all. A 350 MHz K6-2 would be more than five times faster on the basis of clock speed alone, but the performance differential was much bigger in practice.
Note that the calls to Get_System_Time make things even more interesting. As mentioned above, these would access the PIT, which means port I/O. There could well have been differences in how quickly different chipsets handled these reads, with faster access being more likely to trigger failures.
What about Windows 95?
As it turns out, NDIS.VXD in Windows 95 has the exact same code for calibrating NdisStallExecution. That is hardly surprising given the close relationship between Windows 95 and Windows for Workgroups 3.11. And thus Windows 95 users may have seen the following screen:
It should be noted that Windows 95 at least had the decency to point the finger firmly in the direction of the troublemaker.
To make things more interesting, Windows 95 also added similar logic to two other modules, namely ESDI_506.PDR and SCSIPORT.PDR.
Microsoft fixed the problems for Windows 95 OSR2 and provided an update. It was somewhat unfortunate that this was called an “AMD fix” (the file containing the solution was called AMDK6UPD.EXE), even though Microsoft was clear that this was not a problem in AMD CPUs but rather in their own code.
The Intel LOOP
Why weren’t Intel CPUs affected before or around the same time as AMDs? In August 1998, Intel already had a Pentium II running at 450 MHz. Shouldn’t it have had more trouble than a 350 MHz K6-2? It didn’t, and to find out why one needs to look at the optimization manuals.
But first let’s consider the original Pentium, the fastest processor available when the code was written. According to the Pentium Processor Family Developer’s Manual Volume 3: Architecture and Programming Manual (Intel order no. 241430), the absolute best case for the LOOP instruction when the branch is taken is 6 clock cycles. The Intel manual notes that “[t]he unconditional LOOP instruction takes longer to execute than a two-instruction sequence which decrements the count register and jumps if the count does not equal zero”.
1,048,576 iterations will then take at minimum 6,291,456 clock cycles on a Pentium, which at 66 MHz would take a bit over 94 milliseconds to execute. Note that this best case is actually the worst case for our purposes (shortest execution time, most likely to cause a division by zero).
Now consider the AMD K6-2’s contemporary, a 350 MHz Intel Pentium II. The source of information is the Intel Architecture Optimization Manual (Intel order no. 242816-003). Again the manual advises: Avoid using complex instructions (for example, enter, leave, loop). Use sequences of simple instructions instead. Of course for Microsoft’s purposes, the fact that the LOOP instruction was slow on Intel CPUs was, if anything, desirable.
The LOOP instruction decodes to 4 μops on the P6 architecture. Figuring out from the Intel manuals how many clock cycles it takes is… difficult. Agner Fog appears to have had the same difficulty and does not present the LOOP throughput for Pentium II/III in his work. But he does present the Pentium M throughput for LOOP: 6 cycles. Chances are that the Pentium II/III needed the same.
And experimentation on a Pentium II processor indeed shows that that’s the case. A LOOP loop with 1,048,576 iterations takes just under 6 million clock cycles to execute, indicating throughput of more or less exactly 6 cycles. A 350 MHz Pentium II would then take about 17 milliseconds to execute the loop. A Pentium III running at 1 GHz would still take about 6 milliseconds to run the loop.
What about the K6-2 then? Unlike Intel, the AMD-K6 Processor Code Optimization Application Note (AMD publication 21924 Rev. D, January 2000) actually recommends using the LOOP instruction where applicable and on page 89 says: JCXZ takes 2 cycles when taken and 7 cycles when not taken. LOOP takes 1 cycle.
What that means is that the K6-2 executed the LOOP instruction six times faster than contemporary Intel CPUs at the same clock speed. That’s quite a big difference.
In other words, an AMD K6 running at 350 MHz will chew through 350,000,000 LOOP iterations per second, and 1,048,576 iterations will take just under 3 milliseconds.
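The per-iteration cycle counts quoted above make for simple back-of-the-envelope arithmetic. The Python sketch below only multiplies iterations by cycles and divides by clock speed; it assumes the best-case figures (6 cycles for the Pentium and P6, 1 cycle for the K6) and ignores everything else the code does, so the results are rough estimates rather than measurements. The “66 MHz” Pentium is entered at its actual 66.67 MHz clock.

```python
ITERATIONS = 1_048_576  # 100000h LOOP iterations, as in the NDIS calibration


def loop_time_ms(clock_mhz, cycles_per_loop, iterations=ITERATIONS):
    """Nominal milliseconds for `iterations` LOOPs; all other overhead ignored."""
    # cycles / (MHz * 1e6) seconds, times 1000 for milliseconds
    return iterations * cycles_per_loop / (clock_mhz * 1_000.0)


CPUS = [
    # (name, clock in MHz, assumed cycles per taken LOOP)
    ("Pentium, 66 MHz", 66.67, 6),
    ("Pentium II, 350 MHz", 350, 6),
    ("Pentium III, 1 GHz", 1000, 6),
    ("AMD K6-2, 350 MHz", 350, 1),
]

for name, mhz, cycles in CPUS:
    print(f"{name:22s}{loop_time_ms(mhz, cycles):8.2f} ms")
```

At equal clock speed the ratio between the P6 and K6 figures is simply 6, which is the whole story of why the K6-2 got there first.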
Other Win95 Problem Spots
But that does not add up, does it? Even if the NDIS stall calibration loop takes slightly less than three milliseconds to run, there’s no way the measured time delta will be zero, unless Get_System_Time is completely broken. And it’s not. So we have just found out why Windows for Workgroups 3.11 did not crash on 350 MHz CPUs. Yet Win95 is known to have had the same kind of problems… so why?
Because similar code is in other Win95 components, as mentioned above. And the key word here is similar.
For instance, ESDI_506.PDR (the port driver for IDE disks, very commonly used) contains the following logic:
- Call the Get_System_Time VxD API (without waiting until the time changes).
- Run 1,000,000 iterations of the LOOP instruction.
- Read the current time through Get_System_Time again and calculate the time delta.
- Divide (1,000,000 * 10,000) by the time delta.
- Divide the result by 153 (that is 10,000,000 / 65536).
- Store the calculated constant.
The SCSI port driver SCSIPORT.PDR contains identical code to implement the ScsiPortStallExecution API. And a Win95 machine would be almost guaranteed to use either ESDI_506.PDR or SCSIPORT.PDR, except in safe mode.
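The calibration math can be sketched in Python as well. The sketch assumes, as the 2540BE400h dividend and the division overflows discussed below suggest, that the divisions are 64-by-32-bit x86 DIV instructions, which fault when the divisor is zero or when the quotient does not fit in 32 bits. `x86_div64`, `DivideError` and `storage_calibrate` are hypothetical names, and the measured delta is passed in directly instead of being measured.

```python
class DivideError(Exception):
    """Stands in for the x86 #DE (divide error) exception."""


def x86_div64(dividend, divisor):
    """Model of unsigned DIV r32: EDX:EAX / divisor, 32-bit quotient."""
    if divisor == 0:
        raise DivideError("division by zero")
    quotient = dividend // divisor
    if quotient > 0xFFFFFFFF:
        raise DivideError("quotient overflow")
    return quotient


ITERATIONS = 1_000_000


def storage_calibrate(delta_ms):
    # (1,000,000 * 10,000) / delta, i.e. 10,000,000,000 (2540BE400h) / delta
    scaled = x86_div64(ITERATIONS * 10_000, delta_ms)
    # Divide by 153 (about 10,000,000 / 65536) to get LOOP iterations
    # per microsecond in 16.16 fixed point.
    return x86_div64(scaled, 153)


print(storage_calibrate(3))  # 21786492
for delta in (2, 1, 0):
    try:
        storage_calibrate(delta)
    except DivideError as err:
        print(delta, "ms ->", err)
```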
The storage driver algorithm is less generic because ScsiPortStallExecution is specified to support only stalls of less than one millisecond, whereas the NDIS variant can support arbitrarily long stalls (implemented using different approaches). The final division by 153 is done because ScsiPortStallExecution takes the argument (number of microseconds to stall), multiplies it by the calculated constant, and shifts the result right by 16 bits.
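A minimal sketch of how the constant is then consumed, assuming only the multiply-and-shift described above. `stall_constant` and `stall_loop_count` are hypothetical helpers; Python integers are unbounded, so register widths are ignored, and the range check is our own illustration of the under-one-millisecond restriction rather than something taken from SCSIPORT.PDR.

```python
def stall_constant(delta_ms, iterations=1_000_000):
    """Calibration constant as computed above (overflow checks omitted)."""
    return (iterations * 10_000 // delta_ms) // 153


def stall_loop_count(microseconds, constant):
    """LOOP iterations for a stall: (microseconds * constant) >> 16."""
    if not 0 <= microseconds < 1000:
        # Illustrative guard: the API is only specified for stalls under 1 ms.
        raise ValueError("stall must be under one millisecond")
    return (microseconds * constant) >> 16


c = stall_constant(3)             # 21786492, about 332.4 LOOPs/us in 16.16
print(stall_loop_count(100, c))   # 33243
```

With a 3 ms calibration the result comes out slightly below the 33,333 iterations that 1,000,000 LOOPs in 3 ms would imply, because 153 rounds 152.59 up.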
The storage port calibration algorithm is more susceptible to problems. It does not “align” the start of the calibration loop execution with a time tick, which causes the measurement results to be unstable. It uses slightly fewer LOOP iterations, perhaps just enough to make a difference. Most importantly, it divides a very large number (10,000,000,000 or 2540BE400h) by a potentially small count of milliseconds.
The biggest trouble with the storage algorithm then is that it is not only susceptible to division by zero, but it is—unlike the algorithm used in NDIS—also susceptible to division overflows when the divisor is 1 or 2. And that’s exactly what could happen on a 350 MHz AMD K6. Depending on exactly how the LOOP loop aligned with millisecond ticks, the measured result was probably often 3 milliseconds (due to the measurement overhead and inaccuracy) but sometimes only two. That is exactly what would cause a division overflow. Mystery solved.
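The effect of the missing tick alignment can be modelled too. This idealised Python sketch (hypothetical helper names) assumes the loop starts at a uniformly distributed point within a millisecond and ignores the Get_System_Time overhead, which in reality adds time and makes 3 ms readings more common. For a 350 MHz K6-2 running 1,000,000 one-cycle LOOPs, roughly one start in seven still reads as 2 ms in this model, and a 2 ms (or 1 ms) reading makes 10,000,000,000 / delta overflow a 32-bit quotient.

```python
import math


def measured_delta(loop_ms, start_phase):
    """Whole-ms delta when the loop starts `start_phase` (0 <= phase < 1)
    of the way into a millisecond, with no alignment to a tick."""
    return math.floor(start_phase + loop_ms) - math.floor(start_phase)


def fraction_reading(loop_ms, target_delta, steps=10_000):
    """Share of evenly spaced start phases that yield `target_delta`."""
    hits = sum(1 for i in range(steps)
               if measured_delta(loop_ms, i / steps) == target_delta)
    return hits / steps


def div_faults(delta_ms):
    """True if 10,000,000,000 / delta faults as a 64-by-32-bit DIV."""
    return delta_ms == 0 or 10_000_000_000 // delta_ms > 0xFFFFFFFF


k6_loop_ms = 1_000_000 / 350_000.0   # 1,000,000 one-cycle LOOPs at 350 MHz
print(round(k6_loop_ms, 3))                    # 2.857
print(fraction_reading(k6_loop_ms, 2))         # about 1/7 in this model
print([d for d in range(5) if div_faults(d)])  # [0, 1, 2]
```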
Again Intel CPUs were effectively impervious to this problem at the time due to their much slower LOOP instruction execution. Today’s Intel CPUs naturally would also crash because although they still execute the LOOP instruction somewhat slowly, their clock speeds are roughly 10 times faster than those 350 MHz Pentium IIs.
Safe Mode Detour
On a modern (“extremely fast” from the Windows 9x perspective) machine, Windows 95 can boot to safe mode, but not to safe mode with networking. In the latter case, it still reports a “Windows protection error” in NDIS.
In light of the above, it is easy to see why. In the plain safe mode, networking is skipped and therefore NDIS isn’t loaded. The native Windows 95 storage drivers are not used either. That sidesteps the components which cause the divide errors.
Safe mode with networking won’t use the native storage drivers but still uses NDIS, which means it will divide by zero when NDIS initializes.
Later Win9x Versions
Windows 98 (First Edition) appears to have fixed the division overflows in the storage drivers, but not in NDIS. That’s almost certainly because the storage driver crashes could be observed on hardware available in 1998, but the NDIS crashes could not. In fact the “AMD fix” for Windows 95 OSR2 likewise corrected the problems in the storage drivers but left NDIS untouched.
And even the storage driver fixes weren’t great. The calibration algorithm was modified to avoid the possibility of overflow when dividing by anything other than zero, and it was changed to run 10 million LOOP iterations for calibration rather than the original 1 million. If 10 million LOOP iterations completed in under 1 millisecond (no hardware capable of that exists in 2020), the code would still crash with a division by zero.
In 2001, Microsoft issued a fix for the NDIS crash in Windows 98 (but not Windows 95). The fixed calibration algorithm retries once if the first measurement resulted in a zero millisecond delta. If the second try also results in a zero, it’s simply forced to one to avoid the crash. The fixed NDIS calibration therefore won’t crash, regardless of how fast the CPU is.
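The retry-then-clamp behaviour of the 2001 fix is easy to express. In this sketch `measure` is a hypothetical callable that performs one tick-aligned measurement and returns the millisecond delta; only the logic described above is modelled, not the actual NDIS code.

```python
LOOP_ITERATIONS = 0x100000  # 1,048,576


def fixed_delta(measure):
    """Retry once on a zero delta, then force it to 1."""
    delta = measure()
    if delta == 0:
        delta = measure()   # second attempt
        if delta == 0:
            delta = 1       # forced to 1 so the division cannot fault
    return delta


def fixed_ndis_calibrate(measure):
    return LOOP_ITERATIONS // fixed_delta(measure)


# A CPU so fast that every measurement reads 0 ms no longer crashes:
print(fixed_ndis_calibrate(lambda: 0))                # 1048576
readings = iter([0, 5])
print(fixed_ndis_calibrate(lambda: next(readings)))   # 209715
```

The price of the clamp is that on such a CPU the computed iterations-per-millisecond figure is too low, so stalls come out shorter than requested, but nothing divides by zero.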
Windows 98 SE (1999) already came with fixes for the NDIS crashes as well as the storage driver crashes and does not have major speed-related problems on today’s (2020) hardware; it does have other, unrelated problems on recent AMD CPUs.
Update: Intel Pentium 4 (tested on Irwindale Xeon) appears to execute the LOOP instruction in two cycles, noticeably faster than older and newer Intel CPUs. That explains why NDIS crashed on 2.2 GHz P4s in 2001—those were the first Intels capable of executing 1,048,576 LOOP iterations in under one millisecond. Ironically, AMD’s K7 (Athlon) also needed two cycles per LOOP iteration (according to Agner Fog’s tables), and that’s why Intel hit that particular barrier first, since the Pentium 4 ran at significantly higher clock speed, if not higher performance.
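The 2.2 GHz figure follows from simple arithmetic. Under the same idealised, tick-aligned model (overhead ignored), the NDIS measurement can only read zero once 1,048,576 LOOP iterations take less than 1 ms, which gives a threshold clock speed for each cycles-per-iteration figure. `threshold_ghz` is a hypothetical helper name.

```python
ITERATIONS = 1_048_576


def threshold_ghz(cycles_per_loop, iterations=ITERATIONS):
    """Clock speed at which `iterations` LOOPs take exactly 1 ms."""
    cycles_per_ms = iterations * cycles_per_loop
    # cycles/ms * 1000 = Hz; divide by 1e9 for GHz
    return cycles_per_ms / 1_000_000


for label, cycles in [("6-cycle LOOP (P6)", 6),
                      ("2-cycle LOOP (P4, K7)", 2),
                      ("1-cycle LOOP (K6)", 1)]:
    print(f"{label:24s}{threshold_ghz(cycles):.3f} GHz")
```

A 2-cycle LOOP crosses the line at just under 2.1 GHz, so a 2.2 GHz Pentium 4 was already past it, while a 6-cycle LOOP would need well over 6 GHz.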
Lessons Learned?
Once again, this is the sort of problem that no amount of testing could have caught when the code was written. That said, a code review could and should have asked questions like “what happens if the calibration loop executes in under one millisecond?”. Either that didn’t happen or the possibility was considered sufficiently unlikely to ignore.
It’s likewise interesting to compare the NDIS algorithm with the storage port algorithm. Both use the exact same core logic (run the LOOP instruction a number of times to stall a given number of microseconds), but the storage port code is much more susceptible to problems, both because it is less careful when measuring the delay length and because additional input values trigger a division overflow.
The issue also illustrates how seemingly solid assumptions made by software and hardware engineers sometimes aren’t. Software engineers look at the currently available CPUs, see how the fastest ones behave, and assume that CPUs can’t get faster by a factor of 100 anytime soon. Except they can when the clock speed goes up several times and the instructions execute several times quicker.
In this particular case, it took only 5 years to get from a 66 MHz Intel Pentium to a 350 MHz AMD K6-2, dropping the execution time of the calibration loop from almost 100 milliseconds to under 3 milliseconds.
Hardware engineers on the other hand assume that making instructions faster is a good thing. In this case AMD no doubt optimized the LOOP instruction long before the increased clock speed triggered the crashes.
It is not known whether Intel simply did not bother making the LOOP instruction execute fast (and just effectively told everyone not to use it), or whether Intel knew that making LOOP fast could trigger problems in poorly written software. Either or both is possible.
Addendum
Windows for Workgroups 3.11 was not exactly the first piece of software using the LOOP instruction for software timing. The IBM PC LAN Program 1.3 from 1988 used LOOP in a similar way, and similarly crashes (in the NETWORK1.CMD component) with a division by zero on CPUs significantly faster than those available at the time.
A slightly different variation on the theme was used by Sierra On-Line’s Sound Blaster drivers. The drivers used the LOOP instruction to wait for an interrupt to arrive. On some late-1990s machines, the delay was insufficient and the drivers failed to load, thinking that interrupts weren’t working.
Many of these LOOP uses were unsafe and implicitly assumed that the LOOP loop cannot execute faster than some arbitrary limit. Time has proven the assumption-makers wrong one by one.
That reminds me of the well-known “Run-time error 200” that Turbo Pascal programs crash with, when they are run on a reasonably-fast computer. In that case the calibration loop is inside the runtime, and it fails for very much the same reason (division by zero). I don’t remember if it was also based on the LOOP instruction or if it was a simple JNZ-type of loop.
@Necasek: You did it again! s/0f/of/-1
@Darkstar: Yeah. There’s TPPATCH for that. Or should we say TPAMK6UP?
(Imagine Windoze 3 being written in Pascal. Wouldn’t that be a delight?)
Huh, that Windows 98 issue you linked might be why I’ve had crashes with it on Ryzen. (Interestingly, ME works though – go figure.)
I can confirm that Win9x definitely has that TLB trouble on my Ryzen. But not, I believe, all versions. I did not see the issue with Win95 but it definitely shows up in Win98 SE. There are random crashes, which is exactly the symptom one would expect as a result of TLB mismanagement. In VirtualBox, the problem magically goes away when nested paging is turned off.
Indeed, my fonts make the difference a bit too subtle, especially when I “know” what I had written.
That, and a similar crash in Norton Sysinfo, is on my to-be-investigated list.
I love articles like this. It amazes me how you are able to relatively quickly pinpoint specific code areas the way you do and provide the detailed analysis.
I wish I had something to add, other than I remember when this was a problem and it is nice to know why 20 years later.
Will you be making any commentary on the recently released GW-BASIC source code?
The last time I did BASIC was on a Commodore 64, never on a PC (I came to PCs when Pascal/C/assembler was the hobbyist’s weapon of choice, no longer BASIC). The GW-BASIC source release is interesting because it’s so incomplete. There are bits missing and there’s no sign of the portable source template it had been made from. It’s also quite old (1982?).
I briefly looked at the source and realized that I don’t have a lot of binaries of similar vintage. The source code seems to be newer than IBM’s ROM but older than the typical Microsoft GW-BASIC. The closest thing I could quickly find was a BASIC executable from Compaq DOS 1.1. I may have a look at how close the source code is to that binary but can’t promise anything.
AMD slowed the LOOP instruction in K8 and K10 designs. AMD also recommended not using LOOP for those CPUs because the LOOP duration differed in 32-bit and 64-bit mode. Isn’t writing portable code exciting?
GW-BASIC seems fitting as a segue here since GW-BASIC omits what used to be a programmer’s first introduction to the evils of timing loops: program controlled cassette operation.
I wonder about the vfbackup.vxd problem in https://jeffpar.github.io/kbarchive/kb/234/Q234259/
Another notable “bug” with the Windows 9x TCP/IP stack is the brain dead DHCP client. It would hold up booting for like a minute before it timed out if it didn’t acquire an IP address.
Regarding TLB problems, it appears Windows 98SE runs bare metal on Zen+ (Ryzen 2000) machines without a problem. One OS that has seemingly aged well is Windows NT 4.0. I had no problems (compared to 9x!) running a fully patched version on Core2Duo era hardware with the UniATA driver.
If I load your site via HTTPS, I can’t see the images; I get a blocked:mixed-content warning in the console. If I load it via HTTP, it works.
In Microsoft’s GW-BASIC source release, the cassette driver is stubbed out to just return a “device not available” error – https://github.com/microsoft/GW-BASIC/blob/09ad7bc671c90f0eeff4cb7593121ad6f170d903/GIOCAS.ASM – but if you disassemble the 3.23 executable, it isn’t stubbed out, the cassette driver is present and calls the INT 15H BIOS cassette API, you can see those BIOS calls in the disassembly. So that is at least one way in which the binary differs from the available source.
Note that from https://www.theregister.com/1998/11/19/win95_bug_could_spread/ : “We couldn’t find the patch so we called Microsoft and to get the patch they were telling us that we would have to set up a phone support service account which would cost us $US35.”
I believe that PSS typically refunds this kind of support incident.
The OS/2 DHCP client is just as bad or even worse, it holds up boot for about a minute and then requires the user to press a key to continue. What were they thinking…
I have some trouble believing that Win98SE runs on bare metal Ryzen flawlessly but has TLB trouble in a VM. That said, TLB bugs are nasty and may require specific circumstances to trigger.
NT was always much more solid in this regard, even NT 3.1 is stable on fast machines.
Simon (GW-BASIC): But that’s obvious. GW-BASIC wasn’t for the actual IBM-PC, but the generic version for MS-DOS: all the clones. IBM’s was BASICA, and relies on the ROM version. No clone actually built cassette port hardware, and even IBM dropped it after the first model. After that point Microsoft probably updated the source to remove it, since anything built from the source after that point (except the separate PCjr) would never need it.
Does any code in GW-BASIC 3.23 call on the Int 15h routines? The standard PC and exact clones include file might incorporate it even if the hardware doesn’t support it. There were programs that called on entry points within BASIC directly so unused cassette interface code would keep the rest of GW-BASIC at the proper memory locations.
Don’t forget IBM also had the PC JX with cassette support. There were near clones of the XT with a cassette interface made but those were from the Soviet Union. See the Poisk 1 and Elektronika MS1502. These were similar enough to the 5150 that cassette programs from them were successfully loaded on a 5150 though sometimes with the addition of a BASIC loader program to make up for IBM not having a monitor ROM like the Soviet machines did. IIRC, IBM had a contractual requirement that MS would not supply ROM BASIC to any competitor making a cassette interface less desirable even beyond the problems of getting a system with a faster CPU to manage 5150 compatible cassette routines.
The last few years have seen a surprising resurgence in interest in the 5150 cassette port with games that can be loaded off cassette and even a rudimentary program to transfer files from disk to cassette and from cassette to disk.
@Richard @John
If I disassemble my copy of GWBASIC.EXE 3.23 (size 80608 bytes, MD5 hash a75f8ad162b673cf28df0c49b7f26711), I see a far subroutine to call INT 15,3 (Write Blocks to Cassette) at 0x102C.
At 0x1058 there is a far subroutine to call INT 15,2 (Read Blocks from Cassette)
At 0x1087 there is a far subroutine to call INT 15,0 and INT 15,1 (toggle cassette motor)
There are 3 other INT 15 calls, but those are not cassette API calls:
At 0x1E26 there is a call to INT 15,86 (AT Elapsed Time Wait service)
At 0x93E6 and at 0x95CF there are calls to INT 15,84 (BIOS Joystick API)
Note I am disassembling using ndisasm, which doesn’t understand the MZ executable format, so these offsets are byte offsets relative to the start of the file, not IP offsets.
When I start GWBASIC.EXE using DEBUG.EXE, it starts at IP=0xF85A.
In the ndisasm output, the code there matches that at file offset 0x1292A.
Are these 3 cassette subroutines actually being called? Or are they just dead code?
Well, one interesting observation about GWBASIC.EXE – all the far calls are actually indirect. (i.e. you only find “call far […]”, never “call far 0x…”, although you can find “call 0x…”.) So, it is obvious there is some kind of call table being used, and the question would be whether these routines are referenced in the call table.
Well, I know that the toggle cassette motor routine at 0x1087 is being used.
At 0x1092 is the INT 0x15 call (CD15).
If I change the byte at 0x1092 to be 0x19 instead, and then run the “MOTOR” statement twice, on the second execution I get the “Reboot requested, quitting now” message which DOSBox displays when INT 0x19 is invoked. So this proves that routine is actually being called.
A similar technique can be used to prove the Write Cassette routine at 0x102C is being called.
Change the byte 0x15 at offset 0x1043 to 0x19.
Then execute the BASIC command: SAVE “CAS1:FOO”
You will get the “Reboot requested, quitting now” message displayed in DOSBox.
Likewise, to prove the Read Cassette routine at 0x1058 is called:
Change the byte 0x15 at offset 0x106D to 0x19.
Now execute the BASIC command: LOAD “CAS1:FOO”
Again you will get the “Reboot requested, quitting now” message displayed in DOSBox.
By contrast, without patching the INT 0x15 calls to INT 0x19, both “SAVE CAS1:” and “LOAD CAS1:” display “Device I/O Error”.
The CAS1: device is supported not just for LOAD/SAVE, but for all GW-BASIC I/O commands, such as “BSAVE”, “BLOAD”, “OPEN”, etc.
For example, OPEN “I”,1,”CAS1:FOO” will display “Device I/O Error” before the INT 0x19 hack, and “Reboot requested” afterwards.
Directory listing of cassette tapes appears to be unsupported.
FILES “CAS1:” always prints “Device Unavailable”, regardless of whether the INT 0x19 hack is applied or not.
However, FILES does appear to at least understand “CAS1:”, because if you do FILES on a non-existent drive (e.g. D:), or even on the non-existent “CAS2:” device, you get a “File not found” error instead of “Device Unavailable”.
My guess is that there is a stub routine for a cassette tape directory listing, which just displays the “Device Unavailable” error.
GW-BASIC actually understands the cassette filesystem, since if you have multiple files on the same tape, commands like “LOAD” can retrieve individual files. (The tape is sequential, and there was no support in IBM PC hardware for automated tape rewind; the user had to manually rewind the tape at the start.) This means tape I/O might be able to retrieve multiple files, but only in the order they were written. (You could skip a file, but then you couldn’t come back to it later without asking the user to manually rewind.)
Of course, I would come here to be greeted by a scary amount of information about some… 30 year old program. BASIC Hmmm…
I had a 5150, but never used the cassette interface, and didn’t need to know much about it, except when the floppy drive did not have an OS on it; then I needed to reboot from Cassette BASIC 1.1 (I got mine in August with DOS 1.1, but had the second batch of patched ROMs). We helped friends with GW-BASIC, and later found a program that copied the BASIC ROMs into BASICA that would run on clones. (Gee, I wonder what the disassembly of that program would reveal.)
Great work, and very informative.
Here are the fun links I put up on Wikipedia…
http://www.cnd.org/HYPLAN/yawei/freesoft.html
Is this why my W95 VMs fail to boot (except in safe mode) if hardware virtualisation is enabled (thus making said VMs unusable in VirtualBox 6.1 and newer, which no longer support software virtualisation)?
Maybe. There are two separate problems with Win9x. One is those timing loops that just blow up on fast CPUs, that usually causes a “protection error” or something before the OS even boots up. The other issue is that Microsoft violated the rules Intel specified for page table management, and that causes Win9x to randomly crash on AMD CPUs made after 2012 or so.
The latter problem can be worked around by turning off nested paging in the VM settings. The former is a bit harder but there should be patches for more or less everything from back in the day, because the “too fast” machines existed in the late 1990s and early 2000s already.
FWIW, I believe Windows 98 SE had all the speed problem fixed, but Windows 95 definitely did not.
The VMs crash with a “Windows protection error”, so yeah, it’s probably those timing loops. (Especially since the desktop I’m running them on has an Intel Core 2 CPU, so it probably wouldn’t be affected by issues involving specifically AMD CPUs. :-P)
And it looks like at least some of the speed problems were fixed even in 98 FE, since my 98 FE VMs boot up no problemo even with hardware virtualisation on.
That sounds very likely. For whatever reason, Microsoft had several instances of those timing calibration loops and they ran with different parameters, so some of them started causing trouble sooner than others. IIRC some problems first popped up with ~350 MHz AMD K6 CPUs, while other problems only turned up when Pentium 4 got past 2 GHz. I don’t know exactly what got fixed when and it’s entirely possible that the first Win98 release already included the fixes, or at least most of them.
Because the problems/fixes were in different components, the behavior also depends on hardware configuration. If you don’t have networking enabled, you won’t run into NDIS crashes, if you have no SCSI drives you won’t run into bugs in SCSI storage drivers, and so on. This is also why safe mode tends to work.