SCO UNIX 3.2.0f, Limping Along

For the purposes of ancient TCP/IP and NFS research, I wanted to run old SCO UNIX in a VM. I was able to run XENIX with TCP/IP earlier, but SCO’s NFS (provided, like the TCP/IP stack, by Lachman Associates Inc.) never supported XENIX.

A search of the basement yielded 5.25″ floppies of SCO UNIX 3.2.0f from June 1989 (30 years old!), which must be very close to the oldest SCO UNIX version. There was just one small problem. SCO UNIX kernels released before 1993 or so do not run on any post-Pentium processor, and also don’t run in emulators/hypervisors.

The gory details are in an older blog post, but the gist is that the SCO UNIX kernel (but not, for whatever reason, 386 XENIX) relies on more or less unspecified micro-architectural behavior of then-current 386/486 CPUs. Through instruction pipelining and/or TLB behavior, it manages to execute an instruction which is not mapped in the page tables, even though paging is enabled.

This misbehavior is not entirely trivial to patch, but I realized that it’s not that difficult to manually work around in the VirtualBox VM debugger.

It’s a somewhat tedious manual process, but needs to be done only once when SCO UNIX boots. Once the OS is brought up, it needs no further help.

The following applies to VirtualBox 6.0. Earlier versions may need a slightly different process.

The first step is just letting SCO UNIX boot and die with a guru meditation (triple fault). Then examine the VBox.log file and look for “Guru Meditation”. There ought to be something like “VCPU0: Guru Meditation 1155 (VINF_EM_TRIPLE_FAULT)”. A few lines down is the guest CPU register dump, preceded by “{cpumguest, verbose}”. The EIP register value is what’s significant. The value is not constant and varies depending on at least the memory size and kernel configuration. In my VM, I currently get ‘eip=00b7601e’. That’s the magic cookie.
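Fishing the EIP out of the log by hand gets tedious if the memory size or kernel configuration changes often. Here is a minimal Python sketch of how the lookup could be scripted. It assumes only the strings quoted above (“Guru Meditation”, “cpumguest” and “eip=…”) appear in the log; the function name and the sample text are made up for illustration, and the real VBox.log layout may differ in detail.

```python
import re
from typing import Optional


def find_crash_eip(log_text: str) -> Optional[int]:
    """Return the guest EIP from the first register dump after a Guru Meditation."""
    guru = log_text.find("Guru Meditation")
    if guru < 0:
        return None  # the VM did not die with a guru meditation
    # Assumption: the guest register dump follows the "cpumguest" marker.
    dump = log_text.find("cpumguest", guru)
    start = dump if dump >= 0 else guru
    match = re.search(r"\beip=([0-9a-fA-F]{8})\b", log_text[start:])
    return int(match.group(1), 16) if match else None


# Hypothetical fragment assembled only from the strings quoted in the text.
sample = (
    "VCPU0: Guru Meditation 1155 (VINF_EM_TRIPLE_FAULT)\n"
    "{cpumguest, verbose}\n"
    "eip=00b7601e\n"
)
eip = find_crash_eip(sample)
print(f"{eip:08x}")
```

For a real run, the contents of VBox.log would be read from disk and passed in instead of the sample string.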

The VM debugger GUI can be enabled by setting the VBOX_GUI_DBG_ENABLED environment variable. Once the VM is started again, the debugger needs to be brought up (via the Debug / Command Line menu item) before the SCO UNIX kernel initializes. Then a breakpoint needs to be set on the problematic address:

ba x 1 0b7601e

That sets up an execution breakpoint at the crash address. Once the kernel starts initializing, the breakpoint should be hit:

dbgf event: Breakpoint 0! (raw)
eax=80000011 ebx=00000000 ecx=d0010020 edx=00000180 esi=00000020 edi=00000004
eip=00b7601e esp=0000fff0 ebp=000001ee iopl=0 nv up di ng nz na po nc
cs=0018 ds=0020 es=0020 fs=0008 gs=0000 ss=0020 eflags=00200086
u: error: DBGCCmdHlpVarToDbgfAddr failed on '0018:00b7601e L 0': VERR_PAGE_TABLE_NOT_PRESENT

Now the CS:EIP points at the problematic code which is not mapped by the page tables, and therefore cannot be disassembled.

It is interesting to consider why the breakpoint gets hit even though the instruction cannot be accessed. The reason is that breakpoints are specified for a linear address, i.e. before any paging translations are applied. The breakpoint therefore fires before any paging is performed and before any page faults can occur.
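The ordering described above can be illustrated with a toy model in Python. This is only a sketch of the idea, not how VirtualBox or a real CPU is implemented; the class, its fields and the empty page table are all invented for the illustration.

```python
PAGE_SHIFT = 12  # 4 KB pages, as on the 386


class PageFault(Exception):
    pass


class ToyCpu:
    """Toy model: breakpoints match linear addresses, before translation."""

    def __init__(self, page_table, breakpoints):
        self.page_table = page_table          # linear page number -> physical page number
        self.breakpoints = set(breakpoints)   # linear addresses

    def fetch(self, linear):
        # Step 1: the execution breakpoint is compared against the linear address.
        if linear in self.breakpoints:
            return "breakpoint"
        # Step 2: only now is the address translated through the page tables.
        page = linear >> PAGE_SHIFT
        if page not in self.page_table:
            raise PageFault(hex(linear))
        offset = linear & ((1 << PAGE_SHIFT) - 1)
        return (self.page_table[page] << PAGE_SHIFT) | offset


CRASH_EIP = 0x00B7601E
cpu = ToyCpu(page_table={}, breakpoints=[CRASH_EIP])  # nothing is mapped at all
print(cpu.fetch(CRASH_EIP))  # the breakpoint fires; no page fault is raised
```

Without the breakpoint, the same fetch would raise PageFault, which is the toy equivalent of the guest's first step towards the triple fault.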

We happen to know that the not-mapped instruction is ‘jmp ecx’. So we help the guest OS a little:

r eip = ecx

This simulates the effect of the troublemaking jmp instruction. After that, we just need to enter ‘g’ (as in go, continue execution). And lo and behold:

The kernel boots up without any further incident. And look at all the fancy TCP and NFS daemons there:

Indeed NFS works and can talk to a modern Synology DSM NAS. The NFS package (SCO NFS 1.1.0o) was finalized in March 1990, though most of the files are from 1989. Essentially this is a 1989 NFS client, probably with a few minor fixes.

On a non-networked note, there was a minor surprise in the SCO UNIX Development System 3.2.0f (August 1989):

This is Microsoft CodeView 2.4, running on top of SCO UNIX. The SCO Development System package is essentially Microsoft C 5.1, and can cross-compile to DOS and even OS/2.

It took a little while to figure out how to actually debug anything with CodeView. This early SCO UNIX is based on the COFF format (unlike XENIX), but while the AT&T sdb debugger delivered with the system can debug COFF executables, CodeView can’t — the compiler must be told to produce a XENIX x.out executable, and only then can CodeView work with it. It’s a bit schizophrenic.


14 Responses to SCO UNIX 3.2.0f, Limping Along

  1. calvin says:

    Similarly weird, to me at least, is the fact that QNX (at least 4.x when I played with it) comes with Watcom tools ported from DOS. It feels bizarre using what clearly feel like culturally DOS tools on what is nominally a Unix.

  2. Michal Necasek says:

    It’s probably not a coincidence that both QSSL and Watcom were Canadian (Ontario). In the end you do ‘cc’ and an executable falls out, so who cares if the compiler first existed on DOS or not. The Watcom tools have at least some mainframe heritage, while Microsoft C was available on XENIX before it was available on DOS. Actually from the Wikipedia article it’s not clear if the Watcom compilers were first available on QNX or on DOS. Watcom also did contract work which was not necessarily offered to the general public, so it can be hard to say what happened when.

  3. calvin says:

    The full-screen debugger (but it runs in pterm anyways) and IIRC the editor are also from Watcom, so still feels quite DOS-like. (That, and OMF objects and IIRC, not very Unixy flags to pass in.)

  4. Julien Oster says:

    By the way, on the old blog post’s comment about that one machine not resetting on a Triple Fault (surprising to me, too): I wonder if that board, somewhere deep within the CMOS bits or as a jumper, has some kind of setting to not reset in this case for development and debugging reasons.

    The CPU is shut down after a Triple Fault, so there is probably not much opportunity to get any state out of that (though I don’t know what a Pentium Pro provides), but with the right equipment sitting on a bus somewhere, OS developers could still dump out memory contents and some hw register states maybe.

  5. Michal Necasek says:

    I think the CPU state might be recoverable to some extent, but it’s difficult without special hardware. Actually a SMI might be able to do something useful. System memory would definitely not be affected by a triple fault and I’m pretty sure there are remote control PCI cards which could access memory and send the contents to a remote system.

    Still, locking up instead of rebooting is unusual, though in the end I don’t know if it’s any less helpful. The OS is dead either way, and if you reboot you’re just hoping it won’t happen in exactly the same way again. And in this particular case it would.

  6. Ian Dunbar says:

    Another one of those things that’s very difficult to replicate, since SCO UNIX is difficult to find online WITH its license key. Even when I’ve gotten it to boot and start the install process (at least one version I tried did actually play nice with 86box, no modifications required), I can’t find a working key. This is one of those that I’ll just have to keep an eye on eBay for (and SCO UNIX almost never comes up).

  7. MiaM says:

    If you could get SCO Unix to run in any kind of real emulator (that actually emulates the x86 CPU) it might be possible to figure out what instruction precedes the one that causes trouble, and somehow patch that one. Or?

  8. Julien Oster says:

    Michal already found out in detail, it’s in the other blog post that has been linked. Because it’s due to a (documented!) change in the microarchitecture on how memory is translated immediately after setting the paging bit, there likely is no single instruction to patch, so the issue is far less trivial to fix.

    The straightforward approach is to copy the few instructions that set the paging bit and jump to the new linear address into a known identity mapped region in low memory, and jump to there. But since this needs to be patched into existing object code, some cleverness in doing so might be beneficial. (Though there might very well be enough unused padding bytes that are loaded in as well near the startup code, which could just hold the most straightforward subroutine necessary.)

  9. Michal Necasek says:

    Check your e-mail.

  10. Michal Necasek says:

    Yes, there is no single instruction to patch; it’s the page tables that need “patching” to cover that one JMP instruction. Not trivial since they’re built dynamically and do not cover the initialization code.

  11. Andreas Kohl says:

    I don’t know if it’s possible to disable the FPU inside VirtualBox. But you can try to use “ignorefpu” at the boot prompt. There were a lot of SLS for 3.2.0 that deal with different issues. Also the latest uod426d should usually be applied.

  12. Julien Oster says:

    Well, I don’t know whether the page tables need patching per se. Assuming that there already is an identity mapped region with a few spare bytes anywhere, copying the few necessary instructions and transferring control should work.

    But it depends, of course. If later code rewrites the page table entry for (as in your example) the page around 0xb7601e anyway, then creating a page table entry that identity maps this page can be simpler.

    I might actually give it a shot this weekend, either way.

  13. Julien Oster says:

    Hi Michal, how can I send you an email?

  14. Michal Necasek says:

    If you provided a valid e-mail address, you should have mail from me 🙂
