Solaris 7 Boot Panic

On some systems, Solaris 7 panics while booting from installation media and then reboots. At least Solaris 7 U1 (3/99) and U4 (11/99) are affected. Only “fast” systems (definitely including Sandy Bridge 3+ GHz processors) exhibit this problem, and the exact behavior depends on hardware configuration.

When booting with kadb, the system doesn’t reboot itself and the panic information can be easily read:

Solaris 7 Panic

Clearly a null pointer dereference caused a page fault… but why?

The problem is an “obvious” coding error that lay dormant for some time masked by slow hardware. A stack backtrace gives a better picture of the problem:

Panic backtrace

Solaris maintains a table of kernel symbols which needs to be updated when kernel modules are loaded or unloaded. To avoid unnecessary CPU load when many modules are loaded and unloaded in quick succession (especially during system boot), the OS ensures that at least one second has elapsed between symbol table updates.

Astute readers are probably starting to smell a rat. Yes, Solaris 7 also waits for a second before the first update. Until that update runs, the ksyms_table pointer is NULL and if an attempt is made to access the symbol table, the kernel will panic. Duh!

As mentioned above, exposing this race condition requires a sufficiently fast system, probably considerably faster than what was available when Solaris 7 was released.
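The race can be illustrated with a toy model in C. This is only a sketch of the logic described above, not the actual Solaris source: ksyms_table and ksyms_update_delay are the names mentioned in this post, while the struct, the functions, the tick-based clock and the demo symbol list are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the delayed symbol table update (hypothetical, not Solaris code). */
struct ksyms_model {
    const char **ksyms_table;  /* NULL until the first update has run */
    long ksyms_update_delay;   /* ticks to wait before the first update */
    long first_update_due;     /* tick at which the first update may run */
};

/* Stand-in for the real symbol table contents. */
static const char *demo_symbols[] = { "main", "panic", NULL };

/* Start the model at tick `now`; the table is not built yet. */
void model_init(struct ksyms_model *m, long delay, long now)
{
    assert(delay >= 0);
    m->ksyms_table = NULL;
    m->ksyms_update_delay = delay;
    m->first_update_due = now + delay;
}

/* Advance the clock to `now`; build the table once the delay has elapsed. */
void model_tick(struct ksyms_model *m, long now)
{
    if (m->ksyms_table == NULL && now >= m->first_update_due)
        m->ksyms_table = demo_symbols;
}

/* True if a reader touching the table right now would dereference NULL. */
bool model_reader_would_fault(const struct ksyms_model *m)
{
    return m->ksyms_table == NULL;
}
```

With a delay of 100 ticks, a reader that arrives at tick 10 (the fast machine) finds a NULL table, while one that arrives at tick 100 or later (the slow machine) does not. Setting the delay to 0 closes the window in this model.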

Workaround

Solaris is fortunately flexible enough that the bug can be worked around on a live system. At the boot loader prompt, when the installation CD asks for the installation type, enter:

b kadb -d

That will boot with the kernel debugger (kadb) and stop (-d) once the kernel is loaded. After the kernel is loaded, at the kadb prompt enter:

ksyms_update_delay/W 0
:c

That will remove the 1-second symbol table update delay and the panic will be avoided. Once Solaris is installed, the problem is unlikely to occur.

6 Responses to Solaris 7 Boot Panic

  1. zeurkous says:

    An optimization for installation boot time? That sounds like a bug in itself, at least to me…

  2. Michal Necasek says:

    No, it’s a general optimization. I think the difference is that the kernel behaves differently during installation because it tries to discover all hardware.

  3. zeurkous says:

    Well, I don’t have any experience with Slowaris. I do have plenty of experience with {Net,Open}BSD, which generally try to discover hardware on each boot (see autoconf(9)); whether that’s a feature or a bug is another question…

  4. Michal Necasek says:

    If they also try to discover all non-PnP hardware, that sounds like a bug 🙂 PCI or SCSI devices are of course enumerated every time.

  5. zeurkous says:

    The default configuration is to try and probe everything, with the exception of a) controversial devices (to prevent it from, say, performing a killer poke), which are commented out; and b) stuff that is supposed to be invariable (such as say, the timer or the PC speaker for an IBM PC), which is “hard-wired” into the configuration file.

    The configuration can, of course, be adjusted so as to hard-wire everything, or at least exclude devices that are unlikely to be found at the current machine. In fact, I believe the latter to be more or less standard procedure for many operators (it sure is mine :^).

    The problem with the {Net,Open}BSD kernels is that, while in the former some support for kernel modules is present, neither can really be called modular in the modern sense of the term. This is, most of the time, no problem at all — but it sure is fairly inelegant and somewhat inconvenient at times, with the great variety of devices that are out there these days…

    But aren’t we going a bit off-topic? 🙂

  6. mindbyte the post necromancer says:

    Whenever I hear “crashes on fast machines” it’s *always* either a division by zero when calculating CPU speed from elapsed time, or some race condition based on time.
    But it’s always about time. Makes sense, as a CPU doesn’t even know the concept of time.
