I recently attempted to install RedHat Linux 3.0.3 (that’s the one from 1996, not RHEL 3.0) in VirtualBox. I thought I’d use the BusLogic SCSI emulation and the newer 1.3.57 Linux kernel. It did not work at all.
The problem was that the BusLogic SCSI driver, version 1.3.1 by the late Leonard N. Zubkoff, wouldn’t load. It failed with the following error message: ‘INQUIRE INSTALLED DEVICES ID 0 TO 7 FAILED – DETACHING’. That in turn caused the kernel to panic because it was unable to mount the root filesystem. The root cause turned out to be a rather interesting collection of bugs in the Linux BusLogic driver.
The timeout handling in the old Zubkoff BusLogic drivers (BusLogic.c and BusLogic.h; older Linux versions used an entirely different driver!) is most charitably described as very naive. The routine BusLogic_Command() calculates a local variable called TimeoutCounter based on the kernel variable loops_per_sec (nothing like completely different coding styles to improve code readability). TimeoutCounter then determines the maximum number of times the hardware status registers will be polled.
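A minimal sketch of this kind of counter-bounded polling loop may help. It is not the driver's code: the fake_device structure, the FAKE_STATUS_READY bit and both function names are hypothetical stand-ins, and the status register is simulated rather than read through port I/O.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a controller: it reports "ready" only after a
   number of status reads, the way a slow command completion would. */
struct fake_device {
    long reads_until_ready;
};

#define FAKE_STATUS_READY 0x04  /* made-up bit, not a real BusLogic flag */

/* Simulates one port I/O read of the status register. */
static uint8_t fake_read_status(struct fake_device *dev)
{
    if (dev->reads_until_ready > 0) {
        dev->reads_until_ready--;
        return 0;
    }
    return FAKE_STATUS_READY;
}

/* Poll until the ready bit is set or the counter runs out.
   Returns true on success, false on timeout. */
static bool poll_until_ready(struct fake_device *dev, long timeout_counter)
{
    while (--timeout_counter >= 0) {
        if (fake_read_status(dev) & FAKE_STATUS_READY)
            return true;
    }
    return false;
}
```

Note that the "timeout" here is purely a count of register reads. With a counter of zero or less, the loop gives up without reading the status register even once.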
This approach usually works reasonably well for short periods of time (at most a few milliseconds), but it is not accurate because the speed of port I/O operations is not fixed. Unfortunately, here it’s used even for timeouts meant to last “approximately 60 seconds”. Why proper OS timing services weren’t used here is a question probably no one alive can answer.
INQUIRE INSTALLED DEVICES is a convenience controller command which sends TEST UNIT READY commands to all SCSI devices attached to the bus, and as such may take many seconds to complete. Hence the desired 60-second timeout.
One problem is that the driver assumes that the speed of port I/O operations is directly proportional to CPU clock speed, which is most certainly not true. On the contrary, the port I/O speed remains more or less constant because it’s tied to the bus (ISA, PCI) frequency and not CPU frequency.
A related but much worse problem is that the logic calculating TimeoutCounter is far too optimistic (i.e. buggy). The loops_per_sec value is either shifted right by 4 or shifted left by 2 to obtain TimeoutCounter. However, not even the slightest attempt is made to ensure that the loops_per_sec value is in the implicitly assumed range, i.e. that TimeoutCounter won’t end up as zero after shifting.
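The two shifts can be sketched as follows. This mirrors the description above rather than the driver source; the function names are hypothetical, and uint32_t stands in for the 32-bit unsigned value that loops_per_sec would be on a 32-bit x86 kernel.

```c
#include <stdint.h>

/* Sketch only: a short timeout derived by shifting right by 4.
   Any loops_per_sec below 16 yields a count of zero. */
static uint32_t short_timeout_count(uint32_t loops_per_sec)
{
    return loops_per_sec >> 4;
}

/* Sketch only: a long timeout derived by shifting left by 2.
   The top two bits are silently discarded, so large inputs wrap. */
static uint32_t long_timeout_count(uint32_t loops_per_sec)
{
    return loops_per_sec << 2;
}
```

Neither function checks its input, so both ends of the range can produce a zero count: short_timeout_count(15) is 0, and so is long_timeout_count(0x40000000).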
To add insult to injury, TimeoutCounter is a signed variable even though loops_per_sec is not. Values like 0x20000000 suddenly turn into negative integers after being shifted left by two. That completely confuses the rest of the code, which checks for TimeoutCounter being greater than or equal to zero using a signed comparison.
And that’s exactly what happens on my system: the CPU is detected as having about 3434 BogoMIPS, which corresponds to a loops_per_sec value 500,000 times larger, or 0x66575740. When shifted left by two, this results in 0x995D5D00… which, sadly, is a negative number, and the BusLogic driver timeout logic completely falls apart.
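The arithmetic is easy to reproduce in a few lines of C. This is a sketch with hypothetical function names, not driver code. It assumes 32-bit values as on a 32-bit x86 kernel, and it relies on the unsigned-to-signed conversion wrapping the way it does with gcc and clang (the C standard leaves that conversion implementation-defined).

```c
#include <stdint.h>

/* loops_per_sec is 500,000 times the BogoMIPS figure. */
static uint32_t loops_from_bogomips(uint32_t bogomips)
{
    return bogomips * 500000u;
}

/* Shift left by 2 in unsigned arithmetic (well defined, wraps modulo 2^32),
   then reinterpret the result as the signed counter the text describes. */
static int32_t signed_timeout(uint32_t loops_per_sec)
{
    uint32_t shifted = loops_per_sec << 2;
    return (int32_t)shifted;  /* assumed to wrap: two's complement, gcc/clang */
}
```

For 3434 BogoMIPS, loops_from_bogomips() gives 1,717,000,000 (0x66575740). Multiplying that by four gives 6,868,000,000, which does not fit in 32 bits; the truncated result is 0x995D5D00, or -1,721,934,592 when read as a signed value. A loop that tests for the counter being greater than or equal to zero therefore exits before its first iteration.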
This is a classic example of a latent logic bug which can’t be discovered during normal testing. At the time the driver was written, the loops_per_sec values were in the expected range on all test systems. But fast forward a few years and poof, suddenly the “tested and proven” code breaks. Only a code review by an experienced engineer might help… perhaps.
It would be fairly easy to make the driver timeout logic more robust, but that’s an exercise left for another day.
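As one possible direction, and not a fix taken from any actual kernel patch, the counter could be computed in a wider type and clamped to a sane range. The function name and the signed shift parameter below are hypothetical, and the sketch assumes the shift amount stays well below 32 in either direction.

```c
#include <stdint.h>

/* Sketch only: derive a loop count from loops_per_sec without letting it
   collapse to zero or wrap into a negative number.
   shift > 0 means shift left, shift < 0 means shift right. */
static int32_t clamped_timeout(uint32_t loops_per_sec, int shift)
{
    uint64_t count = loops_per_sec;  /* widen first so no bits are lost */

    if (shift >= 0)
        count <<= shift;
    else
        count >>= -shift;

    if (count < 1)
        count = 1;                   /* always poll at least once */
    if (count > INT32_MAX)
        count = INT32_MAX;           /* saturate instead of going negative */

    return (int32_t)count;
}
```

With this, clamped_timeout(0x66575740, 2) saturates at INT32_MAX instead of turning negative. It only repairs the arithmetic, though. The loop count still says little about elapsed time, so real OS timing services would remain the better answer for a 60-second timeout.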