Last week I ran into two wholly unrelated problems while researching the history of BSD-derived Unix systems on PCs. Both are classics in their category and merit a closer look.
Y2K Strikes Again
The first issue is a very typical Y2K bug found in 386BSD 0.0 and 0.1. When the system comes up (if it does—it’s not easy to bring up 386BSD 0.x on anything remotely modern!), it shows the system date as January 1, 1970, i.e. the beginning of the UNIX epoch. This is not merely a cosmetic issue.
For example, when rebuilding the 386BSD kernel, or indeed any software which uses the make utility, the source files will be timestamped 1992 or later, but the object files will be timestamped 1970. As a consequence, the object files will always be out of date and make will be forced to rebuild them. It gets much worse if the system is networked. It is possible to correct the date manually but it will be reset to 1970 every time the system boots, which is rather unsatisfactory.
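The effect on make can be illustrated with a minimal sketch of a timestamp comparison. This is a simplification and not make's actual code: the function name out_of_date is made up for illustration, and a real make also treats a missing target as out of date.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/*
 * Hypothetical helper: returns nonzero if the target is older than
 * at least one of its prerequisites, judging by modification times
 * expressed in seconds since the UNIX epoch.
 */
static int out_of_date(time_t target, const time_t *prereqs, size_t n)
{
    assert(prereqs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (prereqs[i] > target)
            return 1;   /* a prerequisite is newer: rebuild */
    }
    return 0;           /* target is at least as new as everything */
}
```

An object file written ten minutes after a boot into 1970 has a timestamp of about 600, while a source file from January 1, 1992 has a timestamp of 694224000, so the comparison always asks for a rebuild.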
Luckily, fixing the problem is not difficult, especially if one has indexed source code at hand. The 386BSD kernel must read the initial date from the RTC CMOS non-volatile memory, and a search for “CMOS” brings us to /usr/src/sys.386bsd/i386/isa/rtc.h. The RTC_YEAR macro corresponds to the year byte in the RTC and it is only used in a single function, inittodr() in /usr/src/sys.386bsd/i386/isa/clock.c. The function was obviously written with the assumption that the year is roughly in the 70-99 range and corresponds to 1970-1999.
To get a sensible result in the third millennium, simply assume that year values lower than, say, 80 must correspond to the 2000s rather than the 1900s. The following rough patch is intended for 386BSD 0.1:
--- clock.old
+++ clock.c
@@ -138,6 +138,7 @@
 	sec = bcd(rtcin(RTC_YEAR));
+	if (sec < 80) sec += 100;
 	leap = !(sec % 4); sec += ytos(sec); /* year */
 	yd = mtos(bcd(rtcin(RTC_MONTH)),leap); sec += yd; /* month */
With this fix in place, 386BSD boots up with a clock time that’s reasonably close to reality.
The 386BSD clock initialization code obviously has more bugs, such as the leap year detection, or incorrectly adding the ytos() result to sec rather than just assigning it, which results in the clock being off by a few minutes. This (three separate bugs in a tiny section of code) is par for the course in the early 386BSD releases which were really quite buggy where the PC-specific support was concerned.
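For comparison, here is a minimal sketch of what a year-to-seconds conversion without those two problems might look like. It is not the 386BSD code: is_leap and year_to_seconds are hypothetical names, the function takes a full four-digit year, and the result is assigned by the caller rather than added to the raw year value.

```c
#include <assert.h>

/* Full Gregorian leap year rule (hypothetical helper). */
static int is_leap(int year)
{
    return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}

/*
 * Seconds from the start of 1970 to the start of the given year.
 * The caller is expected to assign this to its running total,
 * e.g. "sec = year_to_seconds(year);", so that the year number
 * itself never ends up in the sum as stray seconds.
 */
static long year_to_seconds(int year)
{
    long days = 0;

    assert(year >= 1970);
    for (int y = 1970; y < year; y++)
        days += is_leap(y) ? 366 : 365;
    return days * 86400L;
}
```

With a 32-bit long the result stays in range until 2038, which is the usual limit for a signed 32-bit UNIX time anyway.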
Note that the two-digit to four-digit year conversion cutoff is by necessity somewhat arbitrary. 1970, 1980, or 1990 would all have made sense; the UNIX epoch starts in 1970, the PC wasn’t released before 1980, and 386BSD wasn’t released before 1990. None of them is a real solution, which would require reading the century from the RTC as well.
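The windowing approach, and the fallback it provides when no century is known, can be sketched as follows. All three function names are made up for illustration, and where a particular machine's RTC keeps its century value (if it keeps one at all) is deliberately left to the caller.

```c
#include <assert.h>

/* Decode one packed BCD byte as read from the RTC (hypothetical helper). */
static int bcd_to_int(unsigned char b)
{
    return (b >> 4) * 10 + (b & 0x0f);
}

/*
 * Expand a two-digit year using a fixed pivot: values below the
 * pivot are taken to be 20xx, everything else 19xx.
 */
static int expand_year(int yy, int pivot)
{
    assert(yy >= 0 && yy <= 99);
    assert(pivot >= 0 && pivot <= 99);
    return (yy < pivot ? 2000 : 1900) + yy;
}

/*
 * Prefer a century value if the caller managed to read a plausible
 * one; otherwise fall back to the pivot. Pass 0 for "unknown".
 */
static int full_year(int yy, int century, int pivot)
{
    if (century == 19 || century == 20)
        return century * 100 + yy;
    return expand_year(yy, pivot);
}
```

With a pivot of 80, the year byte 0x92 becomes 1992 and 0x24 becomes 2024. With a pivot of 70, a value of 75 is read as 1975, whereas a pivot of 80 turns it into 2075, which is exactly the arbitrariness described above.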
As with many other Y2K bugs, this one slipped through because it was undetectable in a normal usage scenario. It just couldn’t happen before January 1st, 2000—not unless someone deliberately set their PC’s date into the future.
Just how fast are interrupts?
NetBSD 1.0 shares much code with 386BSD but there’s a difference of about two years between 386BSD 0.1 (1992) and NetBSD 1.0 (1994). NetBSD 1.0 has no trouble guessing the current year correctly (it uses 1970 as the cutoff). But it has another somewhat common problem which is related to hardware interrupt processing.
NetBSD 1.0 for the i386 architecture came with two boot floppies, one with Adaptec AHA-154x SCSI HBA support and the other with support for BusLogic BT-742 and compatibles. One reason for separate floppies was probably the fact that the BusLogic adapters were compatible with the AHA-154x and both drivers might load on a system equipped with a BusLogic HBA, with predictably unpleasant consequences.
If one boots from the BusLogic floppy and simply mounts a disk attached to /dev/sd0a or similar, after about 10 seconds there might be a message on the console along the lines of sd0(bt0:0:0): timed out. After a further two seconds, the system might panic and die. Ironically, before the timeout message, the system had no trouble accessing the disk and all I/O requests had been completed.
The timeout is in fact bogus. This bug is an example of an incorrect assumption which is easily made and often difficult to detect. The assumption is that when a hardware device is asked to perform some action whose completion is signaled by an interrupt, it will always take a certain non-negligible amount of time before the interrupt arrives.
The BusLogic driver in NetBSD 1.0 submits a SCSI controller command and then sets up a 10-second timeout. The timeout is canceled when the command is completed, normally in an interrupt service routine. There is an obvious race: if the device signals an interrupt before the timeout is set up, the timeout won’t be canceled and will incorrectly trigger after the timeout period has elapsed.
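The race can be reproduced in a toy model. The following sketch is not driver code: the struct and every function name are hypothetical, and the "interrupt" is just a function call placed either before or after the point where the timeout is armed.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one outstanding command; all names are hypothetical. */
struct cmd {
    bool done;           /* completion interrupt has been handled */
    bool timeout_armed;  /* a timeout is still scheduled to fire  */
};

static void arm_timeout(struct cmd *c)    { c->timeout_armed = true; }
static void cancel_timeout(struct cmd *c) { c->timeout_armed = false; }

/* Completion handler: marks the command done and cancels the timeout. */
static void completion_intr(struct cmd *c)
{
    c->done = true;
    cancel_timeout(c);   /* has no effect if nothing is armed yet */
}

/* The racy ordering: hand the command over first, arm the timeout after. */
static void submit_then_arm(struct cmd *c, bool intr_is_immediate)
{
    assert(!c->done && !c->timeout_armed);

    /* ...the command would be handed to the adapter here... */
    if (intr_is_immediate)
        completion_intr(c);   /* fast device: interrupt wins the race */

    arm_timeout(c);

    if (!intr_is_immediate)
        completion_intr(c);   /* slow device: interrupt arrives later */
}

/* True if a finished command still has a timeout waiting to fire. */
static bool bogus_timeout_pending(const struct cmd *c)
{
    return c->done && c->timeout_armed;
}
```

With a slow device the handler runs after arm_timeout() and cancels it. With an immediate interrupt the handler has nothing to cancel, and the timeout armed afterwards is left to fire on a command that has already completed.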
Interestingly, the AHA-154x driver shipped with NetBSD 1.0 is extremely similar but does not have this problem. The architecture of the two HBAs is the same; the crucial difference is that the AHA-154x only has a 24-bit (16MB) address space whereas the BusLogic additionally supports 32-bit (4GB) addressing. The similarity of the hardware architecture naturally lends itself to very similar drivers.
The key difference is that the NetBSD 1.0 AHA-154x driver sets up the timeout before submitting a SCSI command, and additionally protects the command submission by raising the priority level via splbio(). That way, the race condition is doubly prevented.
This class of problems is somewhat common with older operating systems and often shows up in virtualized environments. Virtualized devices tend to have extremely fast response time and incorrect assumptions about the time it takes for an interrupt to be processed will be exposed. However, physical systems can also trigger similar problems, only much less frequently. It is possible for the CPU to be held up by some external event—perhaps a higher priority interrupt, perhaps an SMI—with the same end result of an interrupt arriving “too fast”. Such races are then near-impossible to debug and fix.
How to get around this problem? Modifying the /usr/sys/arch/i386/isa/bt742a.c driver module would be easy, but it is difficult to do without installing the system first. If a real or correctly emulated BusLogic HBA is at hand, the AHA-154x boot floppy can simply be used instead. The driver will not be as efficient with more than 16MB RAM in the system, but it will do. As explained above, despite the high degree of similarity the AHA-154x driver does not suffer from the race condition.
NetBSD 1.1 includes a fixed BusLogic driver. The timeout setup is still performed after submitting a SCSI command, but the entire sequence is protected by raising the priority level. Thus the interrupt service routine can no longer be executed before the timeout is set up, even if the hardware signals an interrupt more or less immediately after a command is submitted.
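Why raising the priority level is sufficient can be shown with another toy model, again with entirely hypothetical names. Here mask_intr() and unmask_intr() stand in for raising and restoring the priority level; an interrupt raised while masked is only latched and is delivered once the mask is lifted.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of an adapter plus interrupt masking; names are hypothetical. */
struct hba {
    bool masked;         /* models a raised priority level        */
    bool intr_pending;   /* interrupt latched while masked        */
    bool cmd_done;       /* completion handler has run            */
    bool timeout_armed;  /* a timeout is still scheduled to fire  */
};

static void completion_intr(struct hba *h)
{
    h->cmd_done = true;
    h->timeout_armed = false;    /* cancel the timeout */
}

/* The hardware signals completion; delivery depends on the mask. */
static void raise_intr(struct hba *h)
{
    if (h->masked)
        h->intr_pending = true;  /* held back until unmasked */
    else
        completion_intr(h);
}

static void mask_intr(struct hba *h)
{
    h->masked = true;
}

static void unmask_intr(struct hba *h)
{
    h->masked = false;
    if (h->intr_pending) {
        h->intr_pending = false;
        completion_intr(h);      /* deferred interrupt runs now */
    }
}

/* Submit-then-arm ordering, but with the whole sequence protected. */
static void submit_protected(struct hba *h)
{
    assert(!h->masked);
    mask_intr(h);
    raise_intr(h);               /* adapter completes "instantly" */
    h->timeout_armed = true;     /* timeout is armed after submission */
    unmask_intr(h);              /* handler runs here and cancels it */
}
```

Even with an adapter that completes the command immediately, the handler cannot run until the timeout exists, so it always finds something to cancel.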
Sharing much of the BusLogic-specific driver code, FreeBSD 1.0 and 2.0 suffer from the same bug. With FreeBSD 2.0, the workaround of using the AHA-154x driver is not viable, as the installation kernel is smart enough to probe for BusLogic HBAs first and only then try the Adaptec driver. FreeBSD 3.0 comes with a heavily rewritten and fixed driver.