VME Broken on AMD Ryzen

That’s VME as in Virtual-8086 Mode Enhancements, introduced in the Intel Pentium Processor, and initially documented in the infamous NDA-only Appendix H.

Almost immediately after the Ryzen CPUs became available in March 2017, complaints started appearing about problems with Windows XP in a VM and with running 16-bit applications in DOS boxes in Windows VMs. Multiple versions of Windows are affected. Some (but not all) other operating systems are affected as well, for example OS/2 Warp running in a VM on Ryzen when attempting to open a DOS window:

[Screenshot: OS/2 Warp VM DOS session on Ryzen]

After analyzing the problem, it’s now clear what’s happening. As incredible as it is, Ryzen has a buggy VME implementation. Specifically, the INT instruction is known to misbehave in V86 mode with VME enabled when the given vector is redirected (i.e. when it should use the standard real-mode IVT and execute in V86 mode without faulting). The INT instruction simply doesn’t go where it’s supposed to go, which leads to more or less immediate crashes or hangs.
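For readers unfamiliar with VME, the following is a minimal sketch of how a software INT n is supposed to be dispatched in V86 mode, based on the behavior described in Intel's architecture manuals. The function and enum names are made up for illustration, and the sketch models only the dispatch decision, not the stack pushes or the VIF/VIP flag handling. The redirected case (bitmap bit clear) is the one that reportedly misbehaves on Ryzen.

```python
from enum import Enum


class IntAction(Enum):
    """Possible outcomes of a software INT n executed in V86 mode."""
    REAL_MODE_IVT = "redirect via the real-mode IVT, stay in V86 mode"
    PROTECTED_MODE_IDT = "deliver through the protected-mode IDT"
    GENERAL_PROTECTION = "raise #GP(0) for the V86 monitor to handle"


def int_n_action(vector, cr4_vme, iopl, redirection_bitmap):
    """Decide how INT `vector` is dispatched in V86 mode.

    redirection_bitmap is the 32-byte (256-bit) software interrupt
    redirection bitmap from the TSS; a clear bit means "redirect to
    the 8086 program's own handler".
    """
    if not 0 <= vector <= 255:
        raise ValueError("vector must be in 0..255")
    if len(redirection_bitmap) != 32:
        raise ValueError("redirection bitmap must be 32 bytes")

    if cr4_vme:
        bit = (redirection_bitmap[vector // 8] >> (vector % 8)) & 1
        if bit == 0:
            # The redirected case: no fault, no monitor involvement.
            return IntAction.REAL_MODE_IVT

    # Without VME, or with the bitmap bit set, IOPL decides.
    if iopl == 3:
        return IntAction.PROTECTED_MODE_IDT
    return IntAction.GENERAL_PROTECTION


# Hypothetical setup: redirect everything except INT 21h.
bitmap = bytearray(32)
bitmap[0x21 // 8] |= 1 << (0x21 % 8)
print(int_n_action(0x10, True, 0, bitmap).name)   # REAL_MODE_IVT
print(int_n_action(0x21, True, 0, bitmap).name)   # GENERAL_PROTECTION
```

With VME disabled, the first branch is never taken, so every INT either goes through the IDT or faults into the monitor. That is slower, but it avoids the redirected path entirely.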

How did AMD miss it? Because only 32-bit OSes are affected, and only when running 16-bit real-mode code. With Windows XP and Server 2003, however, it’s much worse: these systems may not even boot.

To be clear, the problem is not at all specific to virtualization. It has been confirmed on a Ryzen 5 1500X running FreeDOS—which comes with the JemmEx memory manager, which enables VME by default. Until VME was disabled, any attempt to boot with JemmEx failed with invalid opcode exceptions. After disabling VME, FreeDOS worked normally.

That is not surprising because when the problematic INT instruction is executed inside a VM using AMD-V, it is almost always executed without any intervention from the hypervisor, which means the hypervisor has no opportunity to mess anything up.

Now, back to the XP trouble. Windows NT has supported VME since at least NT 4.0 and enables it automatically. That is the case for NT 4.0, XP, Windows 7, etc. For the most part, it only matters when running a 16-bit DOS or Windows application (such as EDIT.COM, which comes with Windows).

Windows XP and Server 2003 (that is, NT 5.1 and 5.2) are significantly more affected because they were the first Windows versions that shipped with a generic display driver using VBE (VESA BIOS Extensions), and the only Windows family which executed the BIOS code inside NTVDM (with VME on, if available). Starting with Vista, presumably due to increased focus on 64-bit OSes where V86 mode is entirely unavailable, the video BIOS is executed indirectly, likely using pure software emulation.

The upshot is that the problem is visible in Windows versions at least from NT 4.0 and up, but XP and Server 2003 may entirely fail to boot, either hanging or crashing just before bringing up the desktop. Other operating systems which use VME are affected as well (OS/2, DOS with certain memory managers).

The workaround is simple—if possible, mask out the VME CPUID bit (bit 1 in register EDX of leaf 1), which is something hypervisors typically allow. Windows does not require VME and without VME, XP can be booted normally on Ryzen CPUs, at least in a VM.
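As a small illustration of what masking the bit means, here is a sketch that clears the VME flag in a CPUID leaf 1 EDX value, roughly what a hypervisor would do before handing the value to the guest. The helper names and the sample EDX value are made up for illustration and are not read from real hardware.

```python
# CPUID leaf 1, register EDX, bit 1 is the VME feature flag.
VME_BIT = 1 << 1


def has_vme(edx):
    """Return True if the given CPUID.1:EDX value advertises VME."""
    return bool(edx & VME_BIT)


def mask_vme(edx):
    """Return the EDX value with the VME bit cleared, as a 32-bit value."""
    return edx & ~VME_BIT & 0xFFFFFFFF


# Illustrative EDX value only (an assumption, not measured on a Ryzen).
host_edx = 0x178BFBFF
guest_edx = mask_vme(host_edx)

print(f"host : {host_edx:#010x} VME={has_vme(host_edx)}")
print(f"guest: {guest_edx:#010x} VME={has_vme(guest_edx)}")
```

A guest OS that checks CPUID before setting CR4.VME will then simply leave the feature off. Software that sets CR4.VME without checking CPUID would not be helped by this.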


34 Responses to VME Broken on AMD Ryzen

  1. Lazaro Millo says:

    Thanks for pointing this out, Michal.

    I still use Windows NT 5.2 (Server 2003) on 10+ year old hardware and was considering potentially moving that legacy software to new machines. That would then continue until my customer “is ready to fully transition to modern software” … i.e. never.

    I will keep the workaround in mind (as it’d run in a virtual machine, not on the hardware directly) but it seems likely now that Intel will get this sale and not AMD.

  2. Yuhong Bao says:

    There is also a DisableVme registry key.

  3. Morty says:

    As sad as it always is when there’s a bug in a CPU, there’s also some sense of satisfaction for me, as a long-time follower of x86 architectural details, in seeing all these super esoteric legacy things (as the “secret” VME surely must count as) still coming up even in the context of the shiniest new products 😉

    Wondering if this bug will be fixable in microcode.

  4. Yuhong Bao says:

    I think most of it was implemented in microcode in the first place. That was probably how Intel was able to do VME in later 486s.

  5. Nathan Anderson says:

    Do we know if AMD is aware of the problem?

  6. Richard Wells says:

    How much software currently in use will benefit from VME? IBM’s old estimate was that VME made DOS code run about 10% faster. That is such a minor impact that fully disabling VME might be the best solution as well as the easiest.

  7. Michal Necasek says:

    I don’t know, and I don’t know how to tell. There’s been remarkably little documentation specific to Ryzen published so far. Notably no revision guide.

  8. Michal Necasek says:

    Exactly, the INT instruction (and IRET) pretty much has to be microcoded already because it’s hideously complex. So almost certainly fixable in a microcode update.

  9. Michal Necasek says:

    Intel has other problems. For example their current CPUs are not fully compatible with old x87 FPU code because they no longer keep track of floating-point code/data segment information. So old segmented FP exception handlers may crash (depending on how they’re written).

  10. Michal Necasek says:

    How much software currently used? Very, very little. The benefit as always depends on the workload; I would expect it to be anywhere between significant and negligible. But it only affects software written for a completely different class of PCs, so it’s a non-issue. For the affected users it’s not even a question: software that is a tiny bit slower is far better than software that doesn’t work.

    We’ll see if AMD wants to disable VME entirely and admit that they can’t get such a simple thing right 🙂

  11. Pingback: VME Broken on AMD Ryzen (virtual 8086 mode) | Ace Infoway

  12. Morty says:

    From a pure performance perspective I agree it might make sense to disable VME entirely. The affected software under discussion is so old that it should still run much faster than it did on the computers available when it was originally written! Of course there is a possibility that some more modern software might use VME for something useful where it matters more. That seems unlikely, but people have used esoteric CPU features before for things other than what they were intended for.

    There is also another thing: there could be software that doesn’t work at all if VME is disabled. Just because the CPU reports it as being unavailable doesn’t mean the program will politely refrain from using it. Maybe the program will 1) refuse to run at all, 2) not check whether the feature is available and just go ahead and use it (and fail), or 3) detect that VME isn’t available and attempt to use a fallback path that doesn’t require VME, but this fallback path might not be well tested and might hence fail for other reasons that wouldn’t have occurred if VME had been supported. While you can say these cases aren’t the fault of the CPU, it does impact the end user and is a reduction in the “real-world, actual, backwards-compatibility” of the CPU.

  13. Yuhong Bao says:

    Win2000 NTVDM had such a bug I think.

  14. Morty says:

    I agree aspects of INT/IRET are likely microcoded but since they are often used it is likely aspects are hardwired as well. So it depends on the bug whether they can be fixed in a microcode update. Of course, even if the bug lies in a hardwired aspect it might still be fixable in microcode through some other workaround, but this might then be at a cost in performance.

  15. calvin says:

    Yeah, Windows 2000 had issues with presuming all i586 and up could do VME. It affected Centaur (and thus VIA) CPUs. I think VIA provided patches to 2000, but only up to SP3?

    PS: The site seems to be eating my comments if I use Edge. Weird.

  16. Pingback: VME Broken on AMD Ryzen | ExtendTree

  17. Pingback: Ryzen hat den VME-Modus im Ryzen verkackt. Das ist … – sm00th.it

  18. Yuhong Bao says:

    Any response from AMD, one day later?

  19. Michal Necasek says:

    No, but I don’t even know who to contact. Do you happen to know?

    I assume AMD knows already because VMware or Microsoft or someone should have figured this out weeks if not months ago 🙂

  20. Soumyajit Deb says:

    Thanks for your update. Had the same issue in VMware while installing XP 32bit on an R7 1700. The fix was to disable VME by adding the following line to the config of the virtual machine:

    cpuid.1.edx = "----:----:----:----:----:----:----:--0-"

  21. zeurkous says:

    Reading the above, I wonder whether or not we can safely draw the conclusion that properly testing an i86 processor has become near-impossible.

  22. Michal Necasek says:

    I think it’s been that way for a while. The Intel F00F bug is a classic in that category.

  23. Richard Wells says:

    The F00F bug and a lot of other bugs that show up are caused by the interaction of multiple instructions, which is a challenge to test. Bugs that are specific to a single instruction and arrive without microcode corrections are fairly rare. The only time one slips through to affect end users is with a late architectural change that can’t be tested in time, like the Pentium math flaw.

    Compared to the number of bugs that slipped into even very simple chips like the Z-80 when moved to new processes, I think the current i86 designers do a very good job. Ryzen unfortunately seems to have been slightly rushed to market.

  24. zeurkous says:

    @Richard Wells: yeah, I suppose you’re more or less right. But as I
    noted quite a while ago, the margin of error decreases as complexity
    increases. I don’t think that anyone can handle the kind of complexity
    of modern i86 processors gracefully. But, as Michal alluded to, that’s
    a pretty old argument by now 🙂

  25. zeurkous says:

    Okay, wordpress is being clever again. That last post ended with ‘colon’ ‘closing parenthesis’. Sigh.

  26. Michal Necasek says:

    I think for a completely new architecture, the Ryzen problems so far have been typical. There are issues but there’s very little that would affect the majority or even a large number of users (VME definitely does not qualify).

    I agree that single instructions rarely fail, but it’s all the interactions where it gets interesting. See Intel TSX which Intel introduced with great fanfare and then effectively disabled on the early CPUs (again, negligible user impact).

    The x86 is also hideously complex in ways that tend to bite Intel/users in the ass. For example there’s that thing where if you set up a ring 3 #AC handler and misalign the stack, the CPU completely hangs (no response to NMI, SMI, nothing). Intel’s response was “don’t do it” but in the days of cloud, that doesn’t fly anymore (and yes, a VM can hang the CPU equally well).

    The complexity is such that for certain things like task switches, Intel does not even attempt to document what precisely happens. There’s just a lot of hand-waving and threats of implementation-specific behavior.

    And yes, the engineers are doing a good job, but doing a 100% flawless job is probably not possible.

  27. zeurkous says:

    If it were up to me we’d just junk the thing (i86), but that’s also a
    very old sentiment, I’m aware

  28. zeurkous says:

    left angle bracket, literal ‘g’, right angle bracket

  29. Yuhong Bao says:

    “For example there’s that thing where if you set up a ring 3 #AC handler and misalign the stack, the CPU completely hangs (no response to NMI, SMI, nothing). ”
    Does this really date back to the 486?

  30. Lazaro Millo says:

    @zeurkous: It wasn’t even up to Intel. iAPX432 comes, once again, to mind. 🙂

    (The smiley is intentional.)

  31. Fruit says:

    Does the ring 3 #AC handler thing happen on AMD CPUs too?

  32. Michal Necasek says:

    I believe it does but have not verified that myself.

  33. Joey says:

    POP Flags
    POP CodeSegment
    POP InstructionPointer

    AMD seriously fucked up such well-known and well-documented procedures?

  34. Michal Necasek says:

    No. It’s more complicated than that, but it’s not clear to me what the CPU is doing exactly. And it all works fine in real mode, so it’s not like they don’t know how the basic INTn/IRET works.
