Timing In Software Is Too Hard?

I recently attempted to install Red Hat Linux 3.0.3 (that’s the one from 1996, not RHEL 3.0) in VirtualBox. I thought I’d use the BusLogic SCSI emulation and the newer 1.3.57 Linux kernel. It did not work at all.

Red Hat 3.0.3 BusLogic Panic

The problem was that the BusLogic SCSI driver, version 1.3.1 by the late Leonard N. Zubkoff, wouldn’t load. It failed with the following error message: ‘INQUIRE INSTALLED DEVICES ID 0 TO 7 FAILED – DETACHING’. That in turn caused the kernel to panic as it was unable to mount the root filesystem. The real problem turned out to be caused by a rather interesting collection of bugs in the Linux BusLogic driver.

The timeout handling in the old Zubkoff BusLogic driver (BusLogic.c and BusLogic.h; older Linux versions used an entirely different driver!) is most charitably described as very naive. The routine BusLogic_Command() calculates a local variable called TimeoutCounter based on the kernel variable loops_per_sec (nothing like completely different coding styles to improve code readability). TimeoutCounter then determines the maximum number of times the hardware status registers will be polled.

This approach usually works reasonably well for short periods of time (at most a few milliseconds), but it is not accurate because the speed of port I/O operations is not fixed. Unfortunately, here it’s used even for timeouts meant to last “approximately 60 seconds”. Why proper OS timing services weren’t used here is a question that probably no one alive can answer.

INQUIRE INSTALLED DEVICES is a convenience controller command which sends TEST UNIT READY commands to all SCSI devices attached to the bus, and as such may take many seconds to complete. Hence the desired 60-second timeout.
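For comparison, a timeout tied to a clock rather than to an iteration count might look roughly like the sketch below. This is not code from the driver: the names poll_until_ready, ready_fn, never_ready and ready_after_countdown are made up for illustration, and the C11 timespec_get() call merely stands in for whatever timing service a kernel would provide (a real driver would use the kernel's own tick counter and would sleep instead of spinning for anything as long as 60 seconds).

```c
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical callback standing in for a read of a hardware status register. */
typedef bool (*ready_fn)(void *ctx);

/* Current time in milliseconds. TIME_UTC is wall-clock time; a real driver
   would use a monotonic kernel clock instead (assumption of this sketch). */
static int64_t now_ms(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Poll until ready() reports true or timeout_ms has elapsed.
   Returns 0 on success and -1 on timeout. The time limit no longer
   depends on how fast the CPU or the I/O bus happens to be. */
static int poll_until_ready(ready_fn ready, void *ctx, int64_t timeout_ms)
{
    const int64_t deadline = now_ms() + timeout_ms;

    for (;;) {
        if (ready(ctx))
            return 0;
        if (now_ms() >= deadline)
            return -1;
    }
}

/* Demo "device" that never becomes ready. */
static bool never_ready(void *ctx)
{
    (void)ctx;
    return false;
}

/* Demo "device" that becomes ready after *ctx unsuccessful polls. */
static bool ready_after_countdown(void *ctx)
{
    int *remaining = ctx;
    return (*remaining)-- <= 0;
}
```

Calling poll_until_ready(never_ready, NULL, 20) gives up after roughly 20 ms on any machine, which is exactly the property the loop-counting approach lacks.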

One problem is that the driver assumes that the speed of port I/O operations is directly proportional to CPU clock speed, which is most certainly not true. On the contrary, the port I/O speed remains more or less constant because it’s tied to the bus (ISA, PCI) frequency and not CPU frequency.

A related but much worse problem is that the logic calculating TimeoutCounter from loops_per_sec is far too optimistic (i.e. buggy). The loops_per_sec value is either shifted right by 4 or shifted left by 2 to obtain TimeoutCounter. However, not even the slightest attempt is made to ensure that the loops_per_sec value is in the implicitly assumed range, i.e. that TimeoutCounter won’t end up as zero after shifting.

To add insult to injury, TimeoutCounter is a signed variable even though loops_per_sec is not. Obviously values like 0x20000000 will suddenly turn into negative integers after being shifted left by two. That then completely confuses the rest of the code, which checks for TimeoutCounter being greater than or equal to zero using a signed comparison.

And that’s exactly what happens on my system: the CPU is detected as having about 3434 BogoMIPS, which corresponds to a loops_per_sec value 500,000 times larger, i.e. 1,717,000,000 or 0x66575740. When shifted left by two, this results in 0x995D5D00… which, sadly, is a negative number when stored in a signed 32-bit variable, and the BusLogic driver timeout logic completely falls apart.
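The arithmetic is easy to reproduce outside the kernel. The following is a small model, not the driver's code: the function names are invented, and the shape of the polling loop (decrement, then a signed compare against zero) is an assumption based on the description above. The signed value is computed in 64 bits so that the model does not itself rely on implementation-defined conversions.

```c
#include <stdint.h>

/* Two's-complement reading of a 32-bit pattern, computed in 64 bits so the
   result is fully defined in standard C. */
static int64_t as_signed32(uint32_t bits)
{
    return bits > (uint32_t)INT32_MAX ? (int64_t)bits - 4294967296LL
                                      : (int64_t)bits;
}

/* Models storing (loops_per_sec << 2) in a signed 32-bit variable. */
static int64_t timeout_counter_model(uint32_t loops_per_sec)
{
    uint32_t shifted = loops_per_sec << 2;   /* wraps modulo 2^32 */
    return as_signed32(shifted);
}

/* Assumed loop shape: returns how many times the status register would be
   polled, capped at `cap` so that the model finishes quickly. */
static int64_t polls_performed(int64_t counter, int64_t cap)
{
    int64_t polls = 0;
    while (--counter >= 0 && polls < cap)
        polls++;
    return polls;
}
```

With loops_per_sec = 0x66575740, timeout_counter_model() returns -1721934592 (the bit pattern 0x995D5D00), and under the assumed loop shape polls_performed() then reports zero polls: the "60-second" timeout expires before the hardware is checked even once.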

This is a classic example of a latent logic bug which can’t be discovered during normal testing. At the time the driver was written, the loops_per_sec values were in the expected range on all test systems. But fast forward a few years and poof, suddenly the “tested and proven” code breaks. Only a code review by an experienced engineer might help… perhaps.

It would be fairly easy to make the driver timeout logic more robust, but that’s an exercise left for another day.
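As a rough idea of what "more robust" could mean for the arithmetic alone, here is a sketch that does the shift in a wider type and then clamps the result. The helper names (clamp_count, count_shl2, count_shr4) are hypothetical and this is not a patch against the actual driver. It also only fixes the overflow and zero cases; it does nothing about port I/O speed being unrelated to loops_per_sec.

```c
#include <stdint.h>

/* Clamp a widened count into the range 1..INT32_MAX so that it is never
   zero, never negative, and never wrapped. */
static int32_t clamp_count(uint64_t wide)
{
    if (wide > (uint64_t)INT32_MAX)
        return INT32_MAX;          /* saturate instead of wrapping */
    if (wide == 0)
        return 1;                  /* always poll at least once */
    return (int32_t)wide;
}

/* loops_per_sec shifted left by 2, done in 64 bits so it cannot wrap. */
static int32_t count_shl2(uint32_t loops_per_sec)
{
    return clamp_count((uint64_t)loops_per_sec << 2);
}

/* loops_per_sec shifted right by 4, with a floor of 1. */
static int32_t count_shr4(uint32_t loops_per_sec)
{
    return clamp_count((uint64_t)loops_per_sec >> 4);
}
```

With these helpers, count_shl2(0x66575740) saturates at INT32_MAX rather than going negative, so a counting loop would at least err on the side of waiting too long.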


10 Responses to Timing In Software Is Too Hard?

  1. Andreas Kohl says:

    How does it work when using the Adaptec 1542 driver instead? I was trying Caldera 1.0 here without success, but Caldera 1.2’s autoprobing finds the disk on the Adaptec 154x. But I was not able to boot from the CD connected to the BusLogic; is this still a limitation?

  2. Raúl Gutiérrez Sanz says:

    Just to mention more timing issues, remember the Windows 95 problem booting on CPUs over 400 MHz, and all the DOS software compiled with Turbo Pascal which just displays “Runtime error 200” and terminates, even on a Pentium.

  3. Andreas Kohl says:

    So I’m now running Caldera Network Desktop 1.0 here as a guest with BusLogic drivers. This distribution is similar to Red Hat 2.1. During installation the older BusLogic driver (kernel 1.2.13) complains about timeouts, but it seems to work so far.

  4. Michal Necasek says:

    Can’t say anything about Caldera 1.0 without trying it.

    What I can say in general is that the BusLogic is highly compatible with the AHA-154x, but drivers can distinguish between the two. The native BusLogic mode supports bus-master transfers anywhere in 32-bit address space while the 154x is limited to 24 bits (16MB). So nearly every OS has drivers for both AHA-154x and BusLogic HBAs. In some cases, the Adaptec drivers only work with Adaptecs (NT, OS/2), in others either the 154x or the BusLogic driver works with BusLogic HBAs (Solaris, Linux, various BSDs, really just about every driver not written by Adaptec).

    The fun part is that with many of the old OSes, the Adaptec and BusLogic drivers have different sets of bugs. Sometimes one is broken, sometimes the other, in the worst case it’s both. From release to release the drivers change slightly, or sometimes completely, such as the BusLogic driver in Linux 1.3 vs 1.2 or so. The OS needs to be examined on a case by case basis, unfortunately.

    Perhaps a generalization but the drivers for NT and OS/2 do seem to be somewhat better written than the Unix-y ones.

  5. Michal Necasek says:

    Yes, but at least some of those are genuinely tough problems. The BusLogic driver only needs to time about 1 second or 60 seconds, and for such long timeouts it is a) bad to spin, and b) OS timing services can and should be used. The worst problem in old systems is usually timing in the 1 millisecond range, long enough that it matters but short enough that the OS may not offer suitable services. That’s the case of the infamous Runtime error 200 for example.

  6. Yuhong Bao says:

    “The older Adaptec 154x driver works on both Adaptec and BusLogic HBAs. In the NT 3.1 release, there are separate BusLogic (buslogic.sys) and Adaptec 154x (aha154x.sys) drivers and the Adaptec one no longer works on BusLogics. In one beta build, there’s the updated manufacturer-specific Adaptec driver but no BusLogic driver yet.”
    Makes me wonder in what other OSes something similar happened.

  7. Michal Necasek says:

    In a way… OS/2 2.1 (I believe) shipped with AHA154X.ADD written by Adaptec. This driver does not work on BusLogic HBAs. In OS/2 Warp, BTSCSI.ADD was added to support BusLogic HBAs.

    FWIW, Adaptec’s DOS driver (ASPI4DOS.SYS) also won’t work on BusLogics. Of course BusLogic supplied their own drivers so it doesn’t matter much.

  8. Yuhong Bao says:

    Of course, old versions of both probably did work on BusLogic.

  9. This is like the SoundBlaster drivers for OS/2 that needed patches to work with cards from other vendors, like the MediaVision ThunderBoard...

    So annoying!

  10. Michal Necasek says:

    Well, yes and no. Since the BusLogics were generally a superset of the AHA-154x, it was always possible to detect the difference; they were never a 100% clone. Adaptec understandably did not want to support other people’s hardware in their drivers, hard to blame them. And when an OS shipped both drivers (like NT), it was actually a good thing that the Adaptec drivers didn’t try to load on the BusLogics.

    With the OS/2 SB drivers, I’d blame the clones for not being properly compatible… I had one of those things (can’t remember exactly, but it wasn’t a MediaVision) and had all sorts of trouble in DOS games, too. A real shame since the brand name SoundBlasters were crap audio quality wise, unlike the MediaVisions.
