I recently found that Solaris 2.6 and 2.5.1 do not work when run in a VM on a modern Intel CPU (Sandy Bridge generation Core i7); to be exact, they fail most of the time (about nine times out of ten) when nested paging is used. The symptom is Solaris hanging or rebooting immediately after the kernel is loaded, even before the kernel banner is printed. When nested paging (or hardware virtualization) isn’t used, there’s no problem.
After managing to boot Solaris 2.6 on a physical Core i7 system (not so easy, since a boot floppy plus CD is required!), it turned out that the exact same thing happens, and this is thus not a virtualization issue. But what’s going on there? And why would nested paging fail when the older, slower virtualization methods work? Thanks to the debugging capabilities built into Solaris, it’s possible to answer most of those questions.
One of the nice things about Solaris is that it has had a usable kernel debugger for a long time, and unlike most operating systems, the kernel debugger is always there and in fact available even on the installation CD. That makes it possible to debug a problem like this without even installing Solaris.
All one needs to do is enter b kadb -d at the boot prompt. That will cause kadb (the kernel debugger, kernel adb) to be loaded; the -d switch causes kadb to stop at the earliest opportunity, so that the user has a chance to set breakpoints etc.
It should be noted that kadb is derived from adb, and thus uses syntax which can only be described as “interesting”.
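To give a flavor of that syntax, here are a few typical adb-style commands (sym stands for an arbitrary kernel symbol; which symbols are worth breakpointing varies by release):

    sym:b     set a breakpoint at sym
    :c        continue execution
    $c        print a C stack backtrace
    $r        display the registers
    sym,5?i   disassemble five instructions starting at sym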
Cause of the crash
The crash occurs very early during system initialization. The first thing the Solaris kernel does is detect the CPU type, and if support for 4MB pages is available (Pentium/Pentium Pro class CPUs), that support is enabled. This is where things go wrong.
The Solaris kernel hits a page fault, which is intercepted and handled by the kernel. Unfortunately, the fault handler triggers another exception, perhaps because the kernel isn’t really set up yet, causing a cascade of faults. The stack eventually overflows and ends up overwriting the GDT, which is stored in memory just below the stack. That is fatal, because the trap handling code reloads segment registers a lot, and with a corrupted GDT, an exception cannot be dispatched. OS death ensues.
The ultimate cause is the first trap in the cycle. To enable 4MB pages, Solaris needs to modify the CR4 register, which must be done with paging disabled. To that end, Solaris creates an identity mapping (physical address equals virtual address) for a single page which holds the routine modifying CR4. The routine turns off paging in CR0, updates CR4, turns paging on again, and returns to the caller. The (indirect) call to the routine causes the crash.
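For the curious, here is a minimal sketch in C with inline assembly of what such a routine has to do; this is the architectural sequence, not the actual Solaris source:

    /*
     * Sketch of the identity-mapped stub. CR4.PSE is changed with
     * paging disabled, so the code turns off CR0.PG, sets CR4.PSE,
     * and turns CR0.PG back on. It has to run at an address whose
     * virtual and physical addresses match, or the first code fetch
     * after clearing CR0.PG would go to the wrong place.
     */
    static void enable_pse(void)
    {
        unsigned long cr0, cr4;

        __asm__ volatile ("mov %%cr0, %0" : "=r" (cr0));
        __asm__ volatile ("mov %0, %%cr0" : : "r" (cr0 & ~0x80000000UL)); /* PG off */
        __asm__ volatile ("mov %%cr4, %0" : "=r" (cr4));
        __asm__ volatile ("mov %0, %%cr4" : : "r" (cr4 | 0x10UL));        /* PSE on */
        __asm__ volatile ("mov %0, %%cr0" : : "r" (cr0));                 /* PG back on */
    }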
A bit of work with kadb and knowledge of the x86 architecture makes it possible to determine that the indirect call instruction causes a page fault because the destination page is not present. Yet using the debugger to examine the supposedly non-present page shows that it’s very much there. It’s also safe to assume that when Solaris 2.6 was released, it did not crash like that. What’s going on there?
Changed semantics and living dangerously
The true cause of the crash is not easy to determine. What is known is that Solaris 2.6 happily works on Pentium II class machines, as well as AMD Phenom CPUs. It may be that only the Intel Core i5/i7 CPUs give it trouble.
It’s fairly clear that one strong contributing factor to the crash is that Solaris updates the page tables but does not invalidate the TLB (Translation Lookaside Buffer) for the updated page. That’s living dangerously, but by itself shouldn’t cause problems. If an existing page mapping were changed without invalidating the TLB, that would be asking for trouble. However, in this case a previously non-present page is made present, and in that case the CPU can’t store anything in the TLB. Therefore when the page is referenced, the CPU should traverse the page tables in memory and find the expected mapping.
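A defensive version of the mapping code might look like the following sketch (the helper and the page table layout are simplified, not taken from Solaris):

    /*
     * Make a previously non-present page present, then explicitly
     * invalidate any translation the CPU might hold for that linear
     * address. The invlpg is exactly what Solaris 2.6 omits; it is
     * redundant if the CPU truly never caches non-present entries.
     */
    #define PTE_PRESENT   0x1UL
    #define PTE_WRITABLE  0x2UL

    static void map_identity_page(unsigned long *pte, unsigned long addr)
    {
        *pte = addr | PTE_PRESENT | PTE_WRITABLE;   /* identity: virt == phys */
        __asm__ volatile ("invlpg (%0)" : : "r" (addr) : "memory");
    }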
That’s clearly how things worked back in 1996-1997 when Solaris versions 2.5.1 and 2.6 were released. But something has changed, and it wasn’t the OS. There are at least two possibilities.
First, Solaris may be falling foul of more aggressive speculative execution. Intel documents that code fetches performed shortly after updating a page table may use the previous value in the absence of a synchronizing instruction between the page table update and the code fetch. That could be happening here, although it would imply truly impressive speculative execution.
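A fix along those lines would execute a serializing instruction between the page table write and the indirect call; cpuid is the classic choice. Again just a sketch of the idea, not what Solaris actually does:

    /* cpuid is architecturally serializing: it forces the CPU to
       complete all prior stores (including the page table update)
       and discard speculatively fetched instructions. */
    static void serialize(void)
    {
        unsigned int eax = 0, ebx, ecx, edx;

        __asm__ volatile ("cpuid"
                          : "+a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
                          : : "memory");
    }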
Second, Solaris could be the victim of an Intel CPU erratum (bug). The problem was observed on two very different systems with Sandy Bridge Core i7 CPUs, which both contain a TLB-related erratum. The processor specification update (largely a bug list) lists erratum BJ88, “An Unexpected Page Fault May Occur Following the Unmapping and Re-mapping of a Page”. The symptoms match, although it’s unclear if that’s what’s truly causing the page fault.
Solaris 2.5 and earlier do not have the problem because 4MB pages aren’t supported, so there’s no need for the special identity-mapped code. Solaris 7 and later, on the other hand, buckled up and added code to read and write CR3 after updating the page tables. That causes a full TLB flush, and any problems with stale TLB entries or processor errata won’t happen.
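The Solaris 7 approach boils down to something like this sketch:

    /*
     * Writing CR3 back unchanged invalidates all (non-global) TLB
     * entries, so no stale translation can survive the page table
     * update. It is a blunt instrument, but a reliable one.
     */
    static void flush_tlb(void)
    {
        unsigned long cr3;

        __asm__ volatile ("mov %%cr3, %0" : "=r" (cr3));
        __asm__ volatile ("mov %0, %%cr3" : : "r" (cr3) : "memory");
    }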
Working around the problem
When running Solaris 2.5.1 or 2.6 in a VM, there are at least two workarounds available. The first is not using nested paging. In that case, the paging behavior in the VM is very different and the TLB size is effectively much smaller. Page faults are handled on the host and only some are forwarded to the guest. However, turning off nested paging causes a performance hit (sometimes very noticeable), so it would be nice to not have to do that.
Another possible approach would be patching the Solaris kernel, but that was not explored.
The other tested workaround is convincing Solaris that it shouldn’t even try to use 4MB pages. There is little benefit from using large pages in the 2.5.1 and 2.6 Solaris releases, so giving up 4MB page support is not particularly painful.
All it takes is removing the PSE bit (Page Size Extensions) from the CPUID information. To be exact, it’s bit 3 in register EDX of CPUID leaf 1. VirtualBox unfortunately doesn’t allow masking out specific bits, so one has to take the CPUID leaf from the host (with VBoxManage list hostcpuids), modify the data, and update the VM configuration with VBoxManage modifyvm --cpuidset.
For example, if EDX in CPUID leaf 1 on the host contains bfebfbff (hexadecimal), the value needs to be modified to bfebfbf7; that is, bit 3 must be cleared. The VM settings would then be modified with a command similar to VBoxManage modifyvm Solaris --cpuidset 1 000206a7 03100800 17bae3ff bfebfbf7.
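To double-check which value to pass, a small program on the host can read CPUID leaf 1 and print the masked EDX; this sketch uses GCC’s cpuid.h and is merely a convenience, not part of VirtualBox:

    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        /* CPUID leaf 1 returns feature flags; PSE is bit 3 of EDX. */
        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return 1;
        printf("host EDX:   %08x (PSE %s)\n", edx,
               (edx & (1u << 3)) ? "set" : "clear");
        printf("masked EDX: %08x\n", edx & ~(1u << 3));
        return 0;
    }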
And with that, Solaris 2.5.1 reliably comes up and greets the user with OpenWindows. Solaris 2.6 no longer sulks either and happily starts CDE.
As mentioned earlier, Solaris 7 does not have this problem. The code which needs to run at an identity-mapped address was made somewhat more complex and could be called more than once. That necessitated explicit TLB flushing, and the problem was thus avoided.