Same Old Disk Bug

After I Kryofluxed my MS OS/2 SDK 1.01 disks, I once again tried installing the OS in a VM. While the system booted up fine, it stubbornly refused to get past FORMAT. At the end, after going through all the cylinders and heads, it would always hang.

FORMAT hanging in MS OS/2 1.01 SDK

Analyzing the VM I realized that it does not so much hang as crash, because the OS kernel stack gets exhausted. And it gets exhausted because the disk driver gets into a funk, keeps repeatedly doing SET DRIVE PARAMETERS and READ VERIFY SECTORS, then crashes, probably tries to show the trap screen, and just miserably dies without actually showing anything. But why?

So I thought, the source for the disk driver is on the OS/2 1.0 DDK (one of the two drivers provided in source form, the other being the serial port driver), let’s see what it does then. Except I discovered that it’s not that easy… because the DDK release notes say that it’s not really the source code for the OS/2 1.0 disk driver, but rather an updated version. And sure enough, the source code says it’s the OS/2 1.1 disk driver (even though the code is from May 1987!). Why Microsoft would have done that is anyone’s guess.

The DDK source code is still close enough to the OS/2 1.01 SDK disk driver that I was able to make some sense of the disassembly, though I could not find any real problem in the code. But I had a very good idea what the problem might be: OS/2 1.0, like a number of old UNIX PC/AT ports, most likely silently assumes that the disk is very slow and that interrupts take a while to arrive. I was not able to spot the exact bug, but it’s almost certain that the disk driver writes a command to the controller and only then updates some internal state. Sometimes the command completion interrupt arrives before the state update, and then things go badly wrong.
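To make the suspected race concrete, here is a minimal sketch in C. It is a toy simulation, not code from the DDK or the disassembly: every name in it (`struct sim`, `start_io_buggy`, `start_io_fixed`, `run_once`) is hypothetical, and the "controller" is just a function that either fires the completion interrupt immediately (a fast emulated disk) or on the next tick (slow period hardware). It only illustrates the ordering problem I suspect, under those assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical driver state; names are illustrative, not from the DDK source. */
enum drv_state { DRV_IDLE, DRV_WAIT_IRQ };

struct sim {
    enum drv_state state;
    bool delayed;      /* true: interrupt arrives on the next tick (slow disk) */
    bool irq_pending;  /* completion raised but not yet delivered */
    int  completed;    /* completions the driver was expecting */
    int  unexpected;   /* completions that arrived while the driver looked idle */
};

/* Completion interrupt handler: only meaningful while a command is outstanding. */
void irq_handler(struct sim *s)
{
    if (s->state == DRV_WAIT_IRQ) {
        s->state = DRV_IDLE;
        s->completed++;
    } else {
        s->unexpected++;   /* a real driver may go badly wrong at this point */
    }
}

/* Toy controller: completes instantly unless 'delayed' is set. */
void write_command(struct sim *s)
{
    if (s->delayed)
        s->irq_pending = true;
    else
        irq_handler(s);    /* interrupt fires before write_command returns */
}

/* Deliver a pending interrupt, if there is one. */
void tick(struct sim *s)
{
    if (s->irq_pending) {
        s->irq_pending = false;
        irq_handler(s);
    }
}

/* Suspected ordering: command first, state update afterwards. */
void start_io_buggy(struct sim *s)
{
    write_command(s);
    s->state = DRV_WAIT_IRQ;
}

/* Safe ordering: record the outstanding command before issuing it. */
void start_io_fixed(struct sim *s)
{
    s->state = DRV_WAIT_IRQ;
    write_command(s);
}

/* Run one command; returns true if it completed cleanly. */
bool run_once(void (*start)(struct sim *), bool delayed)
{
    struct sim s = { DRV_IDLE, delayed, false, 0, 0 };
    assert(start != NULL);
    start(&s);
    tick(&s);
    return s.completed == 1 && s.unexpected == 0 && s.state == DRV_IDLE;
}
```

In this model, `run_once(start_io_buggy, true)` succeeds because the interrupt shows up after the state update, while `run_once(start_io_buggy, false)` fails because the handler runs while the driver still believes it is idle. `start_io_fixed` succeeds either way.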

So I tried simply delaying interrupts by a millisecond. And sure enough, FORMAT no longer hung! That indirectly confirms that the bug is a timing problem.

Successful FORMAT in MS OS/2 1.01 SDK

Interestingly, the hang seems to be specific to the FORMAT command, probably because FORMAT does something normal file operations don’t. If the disk is formatted with another OS (such as DOS), the MS OS/2 1.01 SDK can use the disk without any apparent trouble. OS/2 1.1 (starting with the MS OS/2 1.03 SDK from March 1988) does not have this problem, whatever it is exactly. Perhaps that’s why I could not spot anything in the provided source code.

As a side note, the OS/2 disk driver source mentions LOADALL by name in a comment: “WARNING!!!! Care must be taken to ensure that DS or ES are not put on the stack after they have been mapped to a buffer since if this buffer is in high memory and we are in real mode, the LOADALL mapping will be lost.” That refers to the mapping obtained by the PhysToVirt DevHlp.

It is amusing that the OS/2 device driver programming documentation says the same thing, but carefully never mentions LOADALL by name. Nor does it explain how the LOADALL equivalent is implemented on a 386. When one understands the implementation, the restrictions are obvious. Without that, it just sounds like voodoo. Which, I suppose, is exactly what it was.

