As expected, AMD fixed the problem with VME that affected Ryzen processors. The fix is shipped in the form of a microcode patch as part of AGESA 1.0.0.6, currently being rolled out by OEMs as part of BIOS updates. This means that, depending on the OEM and board, a fix may or may not be available for a given system today.
The patch level for the fixed microcode is 8001126 or higher. The older microcode revision 800111C (part of AGESA 1.0.0.4a) is known to have trouble with VME.
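For anyone who wants to check a Linux system, here is a minimal sketch that reads the `microcode` field the kernel reports in `/proc/cpuinfo` on x86 and compares it with the patch level quoted above. The function names are made up for illustration, and treating revisions as plain integers (so that "or higher" is a numeric comparison) is an assumption; the comparison is only meaningful on the Ryzen parts this post is about.

```python
import re

FIXED_REVISION = 0x8001126  # patch level quoted in the post


def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions found in /proc/cpuinfo-style text."""
    found = re.findall(r"^microcode\s*:\s*(0x[0-9a-fA-F]+)", cpuinfo_text, re.MULTILINE)
    return {int(value, 16) for value in found}


def has_vme_fix(cpuinfo_text):
    """True if every reported revision is at least FIXED_REVISION (assumed ordering)."""
    revisions = microcode_revisions(cpuinfo_text)
    return bool(revisions) and min(revisions) >= FIXED_REVISION
```

On a live system this could be called as `has_vme_fix(open("/proc/cpuinfo").read())`.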
Really cool 🙂 Good for AMD that they made it amenable to microcode updates. I am also surprised at how many of Intel’s microcode bugs, including the AVX bugs in Skylake, have turned out to be fixable in microcode. Would really love to have a more inside view on these bugs and how they were solved, but we will probably never know!
Microcode is super duper secret sauce but I think it’s safe to assume that more or less every instruction can be patched in microcode. Possibly a lesson learned from the Pentium FDIV bug which was not cheap for Intel, and almost certainly could have been fixed by a microcode update.
A highly complex instruction like INTn or IRET is almost certainly implemented in microcode to begin with, so fixing it via a microcode update is probably a no-brainer.
Here’s a relevant talk, given by an AMD engineer, regarding the very issue of how they make sure that things just work on x86 CPUs: https://media.ccc.de/v/32c3-7171-when_hardware_must_just_work
Every instruction can be patched in microcode, but there will be a performance penalty when an instruction that is implemented in fixed-function hardware (in silicon) is replaced with microcode (which can be thought of as basically emulating the hardware). Generally the performance penalties are negligible, though, and you definitely want to be running the latest microcode because wrong results in calculations and hard lockups are far worse than a 0.5% performance loss!
One of the worst things that ever happened (beyond the FDIV bug) was the AMD Phenom TLB bug, for which a BIOS workaround was possible but caused a performance penalty of about 10%. Given how far behind AMD was in performance at that time, the bug was really catastrophic.
AMD has patents that describe their methodology for handling microcode patches. They have the usual patent problem of being written up by a lawyer based on engineer descriptions, so they are a trifle confusing, but they give a good sense of the limits of the technique. The patch RAM has 64 entries and there are 8 match registers which transfer control to the patch RAM. Not every instruction can be patched in the patch RAM; there just isn’t room. I am fairly sure it would be impossible to fit complete replacement FPU logic in the patch RAM. Intel’s mechanism seems similar but is minimally documented.
AMD expected switching from internal microcode ROM to the patch RAM would take 2 cycles.
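To make the limits concrete, here is a toy model of the scheme as described in this comment: a ROM, a 64-entry patch RAM, and 8 match registers that redirect a ROM address into the patch RAM. Only the two sizes come from the comment; the class, its methods, and the string "micro-ops" are hypothetical and say nothing about how real hardware encodes any of this.

```python
class PatchableSequencer:
    """Toy model of a microcode ROM with a small patch RAM and match registers."""

    def __init__(self, rom, patch_ram_entries=64, match_registers=8):
        self.rom = dict(rom)        # ROM address -> micro-op (plain strings here)
        self.patch_ram = []         # replacement micro-ops, capped at patch_ram_entries
        self.matches = {}           # ROM address -> (start, length) within patch RAM
        self.patch_ram_entries = patch_ram_entries
        self.match_registers = match_registers

    def install_patch(self, rom_address, micro_ops):
        """Claim one match register and append the replacement ops to patch RAM."""
        if rom_address in self.matches:
            raise ValueError("address already patched")
        if len(self.matches) >= self.match_registers:
            raise RuntimeError("out of match registers")
        if len(self.patch_ram) + len(micro_ops) > self.patch_ram_entries:
            raise RuntimeError("patch RAM full")
        start = len(self.patch_ram)
        self.patch_ram.extend(micro_ops)
        self.matches[rom_address] = (start, len(micro_ops))

    def fetch(self, rom_address):
        """Return the micro-ops to run for a ROM address, honoring any match."""
        if rom_address in self.matches:
            start, length = self.matches[rom_address]
            return self.patch_ram[start:start + length]
        return [self.rom[rom_address]]
```

In this model a ninth patch, or a set of patches totalling more than 64 micro-ops, is rejected, which is the "there just isn’t room" point in miniature.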
Thanks for the link, definitely worth watching.
OK, but I am thinking about the case of IRET. I agree it is hideously complex, so it is very likely to be microcoded to some extent. But it also seems clear it can’t be totally microcoded. If all the checks for various conditions, CPU modes etc. were implemented by sequencing a stream of micro-ops corresponding to those checks, it could take maybe 100s of cycles just to go through them. But maybe it is a combination: it does sequence a stream of micro-ops, but those ops have access to various result flags generated by fixed-function hardware, so that the micro-ops don’t have to work out all the logic manually. Also, I would hope/expect that even at the CPU decoder level it would take CPU mode, bit size etc. into account to select the proper microcode program, rather than having just one “IRET” program that would have to wake up and “discover the world” and cover all cases. So when operating in the more native 64-bit long mode, a microcode program is used that doesn’t even know about VME because that case can’t occur. Maybe there’s a special microcode routine used especially for the situation of a VM subtask.
The cost of the IRET fix will of course depend on exactly what is wrong. If the microcode now has to calculate something that was previously done using fixed functions, then it could slow things down. But hopefully it only affects the version of the IRET microcode used when operating in VME mode.
Still, it must be complex when you have things like a hypervisor running in long mode emulating a CPU set in 32-bit protected mode, executing a 16-bit VME subtask which then does an IRET, and things like that (of course with full two levels of paging active) 😉 Not to speak of the case of nested virtualization, which is also hardware-supported, essentially some of the principles from VME now applied to the CPU virtualization extensions. Of course a lot of it probably comes for free, and a good design/architecture helps immensely, but still I can’t help having a feeling that there must be very complex scenarios, especially error scenarios with potential faults going on at all levels of this virtualization ladder. But maybe it is just the simple common virtualization cases that are fast, and whenever something more complex happens a “real program” (almost like a software emulator) gets invoked that figures it out. This program would also be more amenable to patching.
Given the enormous transistor count of CPUs, it is not unthinkable that programs in the size range of 100s of KB to several MB could fit there and even then use up only a small portion of the transistors.
By the way, I don’t know why match registers would be required at all (limiting the number of patches to some set amount). Why doesn’t the CPU just make a copy of the microcode ROM into a RAM area on the CPU (at least the jump/dispatch table for all the instructions and routines, or such), which can then be patched up by the microcode patches? This way there would be no match limit, and you would also save the power from the match registers having to compare every time some instruction gets sequenced.
Of course there are probably many more details and trade-offs than one might at first think 😉
Take a look at:
New bug on board – SKL150.
“Under complex micro-architectural conditions, short loops of less than 64 instructions that use AH, BH, CH or DH registers as well as their corresponding wider register (e.g. RAX, EAX or AX for AH) may cause unpredictable system behavior. This can only happen when both logical processors on the same physical processor are active.”
(According to the errata sheet that one was also fixable in microcode – amazing since it didn’t even seem very instruction specific and could have been more low-level!)
I won’t even guess how the CPU works internally. I know what the CPU does logically (all the different conditions that INTn or IRET checks and the various modal behaviors) but not how it’s implemented. I imagine there are various building blocks that are optimized to handle common tasks (e.g. IRET has a lot in common with FAR CALL/JMP, things like loading segment registers are common, etc.).
IRET is a serializing instruction which means goodbye performance. Just for reference, on the 386 an IRET to a lesser privilege is documented to take 82 cycles, in the 486 it’s “only” 36 cycles (while a relative CALL takes 3 cycles). Those timings are for a case where everything is in cache/TLB, which probably typically is not the case. I guess I’m saying that IRET is seriously slow already.
As for virtualization, it’s not nearly as complex as you might think. The trick is that when a 64-bit host OS runs a V86 task in a VM, the CPU is in V86 mode for all intents and purposes. The core feature of hardware virtualization is that it exposes LOADALL-like functionality that allows the hypervisor to load a CPU state which is completely different from its own. The intercepts, of which there are not that many, are what makes it work, and a VM exit returns back to the original state (well, unless you have brain-damaged VT-x, then it gets interesting).
As for nested virtualization — it can be implemented purely in software, although recent CPUs have additional features that lower the overhead. At any rate, hardware virtualization makes far fewer changes to the overall architecture than it might seem. If you look at Intel’s initial VT-x implementation from late 2005, that is about the barest minimum required for something which is not completely useless.
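The state swap and intercepts described above can be pictured with a purely conceptual sketch. Everything in it is hypothetical: `CpuState`, `run_guest`, the intercept set, and the string "instructions" are illustration only and do not correspond to any real VT-x or AMD-V structure. It only shows the shape of the idea: load a complete guest state, run until something intercepted comes up, then return to the host state.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CpuState:
    mode: str   # e.g. "long64" for the host, "v86" for the guest
    ip: int     # index of the next instruction in the toy guest program


# Hypothetical set of operations the hypervisor has asked to intercept.
INTERCEPTED = {"hlt", "cpuid", "out"}


def run_guest(host_state, guest_state, guest_code):
    """Swap in the guest state, run until an intercept, then swap the host state back.

    Returns (cpu_state_after_exit, saved_guest_state, exit_reason).
    """
    cpu = guest_state                       # VM entry: the whole state is replaced
    while cpu.ip < len(guest_code):
        op = guest_code[cpu.ip]
        if op in INTERCEPTED:
            return host_state, cpu, op      # VM exit: back to the original state
        cpu = replace(cpu, ip=cpu.ip + 1)   # non-intercepted ops just run in guest mode
    return host_state, cpu, None
```

While the loop runs, the toy CPU is simply "in" the guest mode; the hypervisor only sees what the intercepts hand back to it.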
Re match registers… just because there’s a patent doesn’t necessarily mean the method is used, or used exactly as described. But again, what do I know.
That’s a nice one. Now what exactly are those “complex micro-architectural” conditions… (clearly something that did happen to someone in real life)
@Michal: Thanks for the info. Yes, IRET isn’t that fast to begin with, so it could well be fully microcoded, at least on the 386. Maybe a future optimization avenue would be to implement it all in fixed-function logic 😉
Regarding VME, yes, you are right that it isn’t that complex because the CPU is to a large extent put into the mode it is expected to emulate; however, there are still notable exceptions. For instance, the paging set up by the hypervisor continues to be in effect (at least on the newer implementations that support second-level address translation), and probably other things related to memory and caching (MTRRs and all) etc., with some traps in place as you mention. So it is probably the memory part, and the exception delivery associated with it, that is hardest to get right.
Yes, nested paging complicates things, especially TLBs and various specialized caches. And there were problems with the early implementations in that area. But that’s actually quite orthogonal to instruction execution.
Apparently people using the OCaml compiler have been encountering SKL150: https://lists.debian.org/debian-devel/2017/06/msg00308.html
@Random lurker: Cool – my “discovery” was completely independent. I took a look at the errata sheet after reading about this microcode-fixable AMD VME bug to see how Intel was doing in that department, now I am seeing the bug mentioned on Slashdot 😉
Good stuff. The spec update does not say that “this issue has not been observed in commercially available software” or whatever the formulation du jour is, which strongly implies that someone did encounter the erratum in the wild (and we now know the OCaml people did, whether or not it was their report that Intel acted on). The nasty thing about this bug is that it is probably very difficult to say whether given software might trigger the erratum, and short of disabling hyperthreading there is no workaround. So Skylake/Kaby Lake owners have to hope that their board vendor is not slacking and will issue updated firmware.
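Since the erratum only applies when both logical processors of a core are active, a rough way to see whether hyperthreading is currently on under Linux is to compare the `siblings` and `cpu cores` fields in `/proc/cpuinfo`. This is only a heuristic sketch with a made-up function name; it says nothing about whether a particular CPU model is affected or whether the microcode fix is installed.

```python
import re


def smt_enabled(cpuinfo_text):
    """True if any processor entry reports more logical siblings than physical cores."""
    siblings = re.findall(r"^siblings\s*:\s*(\d+)", cpuinfo_text, re.MULTILINE)
    cores = re.findall(r"^cpu cores\s*:\s*(\d+)", cpuinfo_text, re.MULTILINE)
    # The two fields appear once per processor entry, in the same order, so pair them up.
    return any(int(s) > int(c) for s, c in zip(siblings, cores))
```

On a live system this could be called as `smt_enabled(open("/proc/cpuinfo").read())`.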
@Michal: Yes, agreed, it also sounded quite nasty from the description of the bug! Also, apparently microcode updates have not been issued for all models of Skylake, so at this point it isn’t even possible to update. Which is a bit worrying, because I own one of those models and because it is very hard to predict what software might be affected. On the other hand, I have used this system for almost two years without problems, or at least any problems I noticed. Oh well, now I have an excuse for the embarrassing typos that have crept into the documents I have worked on on that system.