386 Cache Coherency

I’ve been slowly chewing my way through U.S. Patent 5,724,549, titled Cache Coherency without Bus Arbitration Signals, initially filed by Cyrix Corporation in 1992 and published in 1998 (when it was utterly irrelevant, but such is the life of patents).

When Intel designed the 386, it was already well known that an internal (L1) cache is one of the most effective ways to increase processor performance. But given the manufacturing process available at the time, Intel could only squeeze about 512 bytes of cache onto the complex chip, and that was not enough to be effective.

The 386 was designed with external cache (sometimes called L2, although in the case of a 386 there was no L1 cache) in mind and Intel produced its own cache controller for use with a 386, the 82385 (although much like the 80387, the 82385 became available significantly later than the 386 itself).

A classic Cyrix Cx486DLC upgrade processor for the 386 socket, circa 1993.
Cyrix Cx486DLC, 33 MHz

When Cyrix designed the 486DLC upgrade processors for the 386 socket, it was not a problem to put 1K of internal cache on a 386-socket chip, or later even 8K in the case of Texas Instruments 486SXL processors. But keeping the cache coherent was a problem. There were in fact two sources of trouble: External bus masters and our old friend (frenemy?), the A20 gate.

External Bus Masters

The Cyrix 486DLC had 486-style KEN# (Cache Enable), FLUSH# (Cache Flush) and A20M# (A20 Mask) pins, but those were only usable on motherboards specially designed for the 486DLC—and the Cyrix 486DLC had to work without the signals.

A 50 MHz Texas Instruments SXL2-50 processor (left, circa 1994) and a 40 MHz Texas Instruments TX486DLC-40BGA (right, 1993).
Texas Instruments SXL2 (50 MHz) and TX486DLC (40 MHz) processors

The 386 bus was designed with some cache coherency in mind, since an external cache controller still had to ensure cache coherency. The 386 HOLD / HLDA signals can be used for this purpose as follows.

If an external master (such as an 8237 DMA controller or an ISA bus master) takes control of the bus, it is necessary to flush the cache (write out dirty lines in case of write-back cache, and invalidate everything for both write-back and write-through caches) because the external master was probably going to read from and/or write to system memory. In the more complex case of a write-back cache, the contents have to be written out to prevent an external master from reading stale data that had been updated by the CPU, and the cache has to be invalidated to prevent the CPU from reading stale data written by the external bus master.

The CPU knows when an external master controls the bus through the HOLD / HLDA signals, and these signals can be used as triggers for cache flushing. This is not very efficient because almost certainly more data will be flushed than necessary, but it is acceptable in typical systems without huge amounts of DMA traffic (with one exception described later).
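The flush-on-bus-grant behavior can be sketched as a toy model. This is only an illustration under simplifying assumptions: the class and method names (ToyCache, on_hlda) are made up for this example, each "line" holds a single byte, and the cache has unlimited capacity. It is not how any real 386-era cache was implemented.

```python
class ToyCache:
    """Tiny cache model: one byte per 'line', no capacity limit (hypothetical)."""

    def __init__(self, memory, write_back):
        self.memory = memory          # shared system memory (a bytearray)
        self.write_back = write_back  # False means write-through
        self.lines = {}               # address -> cached value
        self.dirty = set()            # addresses modified but not yet written out

    def read(self, addr):
        if addr not in self.lines:            # miss: fill from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        if self.write_back:
            self.dirty.add(addr)              # memory is updated later
        else:
            self.memory[addr] = value         # write-through: update memory now

    def on_hlda(self):
        """Called when the CPU grants the bus to an external master."""
        for addr in self.dirty:               # write out dirty data first
            self.memory[addr] = self.lines[addr]
        self.dirty.clear()
        self.lines.clear()                    # then invalidate everything


memory = bytearray(16)
cache = ToyCache(memory, write_back=True)
cache.write(3, 0xAA)         # CPU write stays in the cache for now
cache.on_hlda()              # bus granted: the dirty line is written out
seen_by_dma = memory[3]      # the external master reads current data
memory[3] = 0x55             # the external master (e.g. DMA) writes memory
seen_by_cpu = cache.read(3)  # cache was invalidated, so this refills from memory
```

Without the call to on_hlda(), the DMA side would have read 0 from memory and the CPU would have kept seeing 0xAA after the DMA write, which are exactly the two stale-data cases described above.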

Unfortunately for Cyrix, it turned out that in some boards the HOLD / HLDA pins weren’t connected in the CPU socket at all. Why? They were connected to an external cache controller, and since an Intel 386 processor had no internal cache, the CPU did not need to worry about those signals. The cache controller took care of everything. But this made upgrading such systems with a 486DLC quite problematic.

What Cyrix patented is a hardware solution to the problem of missing HOLD / HLDA signals. Instead of using those as synchronization triggers, Cyrix discovered that it’s possible to use I/O reads and interrupt acknowledges instead. This assumes a write-through internal cache which ensures external masters will never read stale data, but the CPU cache itself may become stale after writes by an external master.

The patent describes a small PCB that sits between the processor and the socket. It monitors several CPU pins: M/IO# (Memory / IO), R/W# (Read / Write), and D/C# (Data / Control). Thus it can detect I/O reads as well as interrupt acknowledges. When the detection triggers, the PCB signals the processor to invalidate the cache using the FLUSH# pin.
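The detection logic can be sketched in a few lines. The encoding table below follows the bus cycle definitions in the Intel 386 DX datasheet as best recalled (where the read/write pin is named W/R# and high means write), so treat it as an assumption to check against the datasheet. The function names are invented for this example and do not come from the patent.

```python
def decode_bus_cycle(m_io, d_c, w_r):
    """Classify a 386 bus cycle from the three status pins (1 = high)."""
    table = {
        (0, 0, 0): "interrupt acknowledge",
        (0, 1, 0): "I/O read",
        (0, 1, 1): "I/O write",
        (1, 0, 0): "memory code read",
        (1, 0, 1): "halt/shutdown",
        (1, 1, 0): "memory data read",
        (1, 1, 1): "memory data write",
    }
    # Any combination missing from the table is treated as reserved.
    return table.get((m_io, d_c, w_r), "reserved")


def should_assert_flush(m_io, d_c, w_r):
    """True for the two cycle types the interposer uses as flush triggers."""
    cycle = decode_bus_cycle(m_io, d_c, w_r)
    return cycle in ("interrupt acknowledge", "I/O read")


io_read_triggers = should_assert_flush(0, 1, 0)
memory_read_triggers = should_assert_flush(1, 1, 0)
```

Plausibly the reasoning is that software normally learns about a completed DMA transfer either through an interrupt or by reading a status port, so invalidating the cache on those two events catches the cases that matter.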

A20 Gate

There was another, separate problem caused by everyone’s favorite bad boy of the PC compatible world, the A20 gate. Depending on circuitry external to the CPU, accesses to physical memory with bit 20 of the address set would either be passed along unchanged, or the bit could be cleared, wrapping around one megabyte lower.

This was obviously a problem for the internal cache, because if the cache returned the contents of memory above 1MB instead of those at the beginning of memory, or vice versa, bad things were bound to happen. The Intel 486, as well as the Cyrix 486DLC, solves the problem with the A20M# pin, which informs the CPU of the current state of the A20 gate. The processor can then apply the mask internally when accessing its internal cache.
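A small sketch of the aliasing problem, using dictionaries as stand-ins for memory and for the cache. The names (apply_a20, naive_cache, aware_cache) are hypothetical, and the model ignores cache lines and tags entirely.

```python
A20_BIT = 1 << 20  # address bit 20, worth 1MB


def apply_a20(addr, a20_masked):
    """Clear bit 20 when the gate forces A20 low; otherwise pass through."""
    return addr & ~A20_BIT if a20_masked else addr


memory = {0x000000: "low", 0x100000: "high"}

# A cache that ignores the gate tags data with the address the CPU issued.
naive_cache = {}
naive_cache[0x100000] = memory[apply_a20(0x100000, a20_masked=False)]

# The gate is now switched to wrap-around; memory would return "low" here...
from_memory = memory[apply_a20(0x100000, a20_masked=True)]
# ...but the naive cache still hits on its old entry.
from_naive_cache = naive_cache[0x100000]

# With A20M#, the CPU applies the same mask before its own cache lookup.
aware_cache = {0x000000: "low", 0x100000: "high"}
from_aware_cache = aware_cache[apply_a20(0x100000, a20_masked=True)]
```

Here from_memory and from_aware_cache agree, while from_naive_cache returns data from the wrong megabyte.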

But again, the A20M# signal was not present in standard 386 motherboards. Cyrix solved the problem the hard way—by default, the first 64KB of each 1MB region had to be uncached. This was obviously quite inefficient, because it meant that much of the memory most frequently accessed by DOS could not be cached.
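The rule can be written down as a simple address test. This is a sketch of the policy as described here, not of the actual silicon. The function name is made up, and the comment on why 64KB is enough rests on the fact that real-mode code can reach at most FFFF:FFFF, which is a little under 64KB past the 1MB mark.

```python
MB = 1 << 20
KB64 = 64 * 1024


def cacheable_without_a20m(addr):
    """False for addresses in the first 64KB of any 1MB region."""
    return (addr % MB) >= KB64


# Highest linear address reachable in real mode: segment FFFFh, offset FFFFh.
real_mode_top = (0xFFFF << 4) + 0xFFFF

low_memory_cacheable = cacheable_without_a20m(0x000400)
hma_cacheable = cacheable_without_a20m(real_mode_top)
above_64k_cacheable = cacheable_without_a20m(0x010000)
```

The first 64KB holds the interrupt vector table, the BIOS data area, and the start of DOS itself, which is why leaving it uncached hurts.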

Software could help though. Once the system reached a stage where the state of the A20 wasn’t going to change (just about any protected-mode operating system, including memory managers such as EMM386), it was possible—not to mention highly desirable—to enable normal caching for the first 64KB of each megabyte in the CPU.

Memory Refresh

There was yet another wrinkle that users upgrading certain 386 motherboards with a Cyrix 486DLC had to worry about. In the original IBM PC and PC/XT, system DRAM refresh was handled by the DMA controller; in the PC/AT, there was dedicated circuitry but the effect was very similar. Approximately every 15 microseconds, the DMA controller (in the PC and PC/XT) or custom circuitry (PC/AT) took control of the system bus and issued a memory read.

A keen reader can already see the problem: This refresh mechanism involves an external bus master, and hence the HOLD and HLDA signals. Which means that every 15 microseconds, the processor’s cache will need to be flushed (because the CPU has no way to tell that it’s a DRAM refresh cycle that in fact is completely irrelevant to cache coherency). That is not ideal, to put it mildly.
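To put a rough number on it, here is the arithmetic as a tiny sketch. The 15 microsecond period comes from the text above; the 33 MHz clock is just an assumed example, not a figure for any specific board.

```python
REFRESH_PERIOD_US = 15   # roughly one refresh cycle every 15 microseconds
CPU_CLOCK_MHZ = 33       # assumed example clock

# One flush per refresh cycle when every HOLD/HLDA handshake triggers a flush.
flushes_per_second = 1_000_000 / REFRESH_PERIOD_US

# MHz times microseconds gives clock cycles between consecutive flushes.
clocks_between_flushes = REFRESH_PERIOD_US * CPU_CLOCK_MHZ
```

That works out to tens of thousands of flushes per second, with only a few hundred CPU clocks for the cache to warm up before it is emptied again.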

Even without an internal cache, this kind of memory refresh is not great. As long as the DMA/refresh controller holds the bus, the CPU can’t read or write memory and is forced to stall. That is more of a problem with faster processors because they are more likely to starve during the memory refresh period.

Newer systems solve the problem using so-called hidden refresh, which uses different circuitry for refreshing the system DRAM that does not use the HOLD / HLDA signals and is thus hidden from the processor, hence the name. Needless to say, a 386 board with hidden refresh is much preferable for a 486DLC processor.

Note again that this is not a cache coherency problem as such, it is a performance problem triggered by the cache coherency mechanism.

Cyrix Cache Coherency Control

The 486DLC has several cache coherency pins which are not present in standard 386 sockets, but are available on motherboards designed for the Cyrix 486DLC. All these pins are optional and after reset, the 486DLC does not use them—typically the BIOS would program the CPU to use those pins.

The A20M# pin was already described above. The FLUSH# pin (also mentioned previously) is an input triggering a cache flush.

The KEN# (Cache Enable) signal is an input telling the processor whether the currently accessed address should be cached. It is generated by the motherboard, and may be configurable by the BIOS. It is used to inform the processor about non-cacheable memory regions. Note that the 486DLC itself also has internal cache configuration registers with the same functionality; those would be used on boards with no KEN# signal.

Also notable is the BARB (Bus Arbitration) bit in the 486DLC’s internal cache configuration register CCR0. When BARB is set, the HOLD / HLDA signals will trigger cache flushes. The BARB bit will need to be set in standard 386 motherboards, but in boards designed for the 486DLC, the BARB bit should not be set to avoid unnecessary flushes.
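One way to summarize the configuration choices is as a small decision function. This is a hypothetical policy sketch only: the option names are symbolic, the enum values are deliberately NOT real CCR0 bit positions, and choose_options is not taken from any BIOS or Cyrix utility. Real code would need the bit layout from the Cyrix data book.

```python
from enum import Flag, auto


class CacheOption(Flag):
    """Symbolic options; the values are placeholders, not CCR0 bit positions."""
    A20M_PIN = auto()            # honor the A20M# input
    KEN_PIN = auto()             # honor the KEN# input
    FLUSH_PIN = auto()           # honor the FLUSH# input
    BARB = auto()                # flush on HOLD / HLDA
    UNCACHED_FIRST_64K = auto()  # keep first 64KB of each 1MB uncached


def choose_options(board_has_dlc_pins, a20_state_is_fixed):
    """Pick cache options for a board (hypothetical policy)."""
    if board_has_dlc_pins:
        # The board drives the coherency pins, so BARB flushes are unnecessary.
        return CacheOption.A20M_PIN | CacheOption.KEN_PIN | CacheOption.FLUSH_PIN
    # Standard 386 board: fall back to flushing on every bus grant.
    options = CacheOption.BARB
    if not a20_state_is_fixed:
        options |= CacheOption.UNCACHED_FIRST_64K
    return options


plain_386_dos = choose_options(board_has_dlc_pins=False, a20_state_is_fixed=False)
plain_386_pmode = choose_options(board_has_dlc_pins=False, a20_state_is_fixed=True)
dlc_board = choose_options(board_has_dlc_pins=True, a20_state_is_fixed=False)
```

The three results correspond to a stock 386 board running plain DOS, the same board once a protected-mode OS or memory manager has pinned the A20 state, and a board designed for the 486DLC.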

Mystery Gadget

The OS/2 Museum has a curious processor in its collection. It’s a Texas Instruments SXL2-50 (a clock-doubled Cx486DLC with 8KB internal cache) with a bolted-on gadget that looks like this:

Improve Technologies TI SXL2-50

Improve Technologies was a company specializing in CPU upgrades, but the above model does not appear to be documented anywhere. Here’s what the processor looks like from the bottom:

Bottom view of Improve Technologies TI SXL2-50

Examining the traces shows that other than three pins used for power supply (Vcc/Vss), only the HLDA and FLUSH# pins are connected.

What purpose exactly the gadget serves is unclear. But it is obvious that it has something to do with cache coherency, and it was perhaps designed to make the Cyrix/TI upgrade CPUs usable in specific boards. Whatever it is, it’s not the logic described in U.S. Patent 5,724,549.

All in all, managing cache coherency in a board not designed for it is fraught with peril. It can be done, but only at the cost of some (if not a lot of) performance.


15 Responses to 386 Cache Coherency

  1. Stephen Kitt says:

    The Improve Technologies-improved CPU looks like a Make it 486: http://web.archive.org/web/20140911063351/http://sandy55.fc2web.com/ps55/386cpu/makeit486_2.jpg (see http://ps-2.kev009.com/sandy55/Interposer/386_upgrade.html for details).

    On many systems the cache was disabled on boot, and utilities were required to enable it; see the link above for example, or http://ohlandl.ipv7.net/8570_Tim_OConnor/cyrx.htm

  2. Michal Necasek says:

    Yes. Thanks! I have been aware of the existence of the Make-it-486 upgrades, but the Web Archive link is the first one I’ve seen that shows exactly the same CPU, with no interposer board but rather the flexible plastic circuit tacked on. Thanks for digging that up! The clock-doubled TI is one of the best CPU upgrades for a 386 board.

    The cache is always disabled on reset. Either the BIOS enables it (on boards that explicitly support Cyrix CPUs), or some driver or software does it after boot-up.

  3. Michal Necasek says:

    If I get the timeline right, the CPU I have (and that is pictured on the box) is an old version, while most of the newer ones used an interposer PCB. And that’s why most pictures of Make-it 486 found on the web don’t look like mine.

  4. Octocontrabass says:

    It’s safe to say the gadget is intended to flush the cache in response to bus holds, like the BARB bit in CCR0, but the presence of an oscillator and a binary counter suggests that it’s a bit more clever. Any chance for a picture that shows the rest of the circuit?

    It’d be pretty interesting to run some benchmarks to see how the gadget compares to BARB.

  5. random lurker says:

    That folded circuit board looks absolutely hideous. But hey, I suppose whatever works? 😀

    I am also interested in benchmarks vs. the 386 that is being upgraded and also vs. a real 486DX2-50/66. (Why is it labeled SXL2-50 when the magazine article says it goes up to 66 MHz?)

  6. Richard Wells says:

    The interposer on the later Make-It 486 for 386 upgrades looks to have only a few components as well. May be the same rearranged around the side of the board but I can’t find a good picture to compare. Nothing like the complexity of the upgrades aimed at 286 systems.

    Interposer becomes necessary once Cyrix switched to QFP for chips as the goal of sticking into 386 sockets on redesigned motherboards was not going to happen.

  7. Julien Oster says:

    I tried tracing out the pictures visually, but with some of the traces hidden below the components, I quickly lost motivation.

    One of the more boring possibilities is that the quartz and counter may just be there to hold the line down a specified amount of time. Note how the board does not interface with any clock line on the CPU.

  8. Michal Necasek says:

    Correct, some of the traces are not visible, so visually it’s not really possible to figure out how everything is connected. I’m not yet desperate enough to sit down with an ohmmeter.

    I concur that the circuit is not connected to the CPU’s clock at all, and the oscillator’s function might be to simply extend the duration of the FLUSH signal.

    Why they needed this is something I couldn’t figure out. From experience I know that the SXL2-50 works in many (most?) 386 boards as is, but probably not everywhere.

  9. Michal Necasek says:

    Not only does it look hideous but it also looks like it would have been a lot of work to install on the CPU.

    There may or may not have been a 66 MHz version, I’ve only seen a 50 MHz SXL2 chip, and that’s also what’s listed in TI documentation (Cyrix did make 66 MHz variants, but those only had 1 KB cache).

    I don’t have exact numbers but from what I recall the performance greatly depended on how well the board supported Cyrix CPUs, i.e. how much the cache had to be flushed. The 50 MHz SXL2 could perform like a 33 MHz 486 or maybe a bit better. Certainly it would significantly outperform a 25 MHz 386 it replaced.

  10. techfury90 says:

    That Improve Technologies upgrade with the added bodge reminds me of a Buffalo/Melco-made AMD K6-2 upgrade for Socket 5 PC-9821 systems that I have sitting in my Xa12. There’s some kludge in it related to A20- allegedly trying a regular Socket 5/7 interposer with an AMD chip results in the PC-98 BIOS failing the A20 test. I haven’t been able to really dig into this interposer due to its complex multi-layer construction, but there appears to be a Xilinx CPLD connected to at least part of the data bus, address bus, and the A20M and FLUSH lines.

    What I have never been able to ascertain from AMD datasheets is *why* this kludge is necessary. Did AMD change something about A20 handling? Bizarre…

  11. Julien Oster says:

    Total shot into the dark: Maybe AMD’s “Fast A20” implementation was different from what the BIOS expected? What was the original CPU on that board? Should be easy to compare that one at least.

  12. Michal Necasek says:

    The fast A20 implementation is a function of the chipset/keyboard controller, not the CPU. Maybe the test failure was a function of the CPU’s much higher speed? Never seen such an issue in “regular” PC clones. Makes me wonder what it is that the PC-9821 is doing.

  13. techfury90 says:

    Yeah, I suspect AMD did indeed put in an incompatible fast A20 implementation within the CPU, but I have no data to support that, as I don’t recall it being mentioned in any of the K6-2 datasheets I looked over. Original chip was an Intel SK084 Pentium 120.

    As far as how PC-9801/9821 A20 works: you can “unmask” A20 (as they call it in Japanese) with a write (any value) to I/O port F2h, or by writing 10h to port F6h. Interestingly, FreeBSD for PC-98 tries both of these in a row; yet both Japanese sources I consulted while writing this comment claim that you don’t need to use both. Re-masking A20 can be done by writing 00h to F6h. It appears that all 286 and up PC-98 models use this method of handling A20. Note that there is no keyboard controller in a PC-98 (the keyboard is handled by an 8251 USART), so any implementation would have to be “fast”.

    I also should add that the upgrade vendor claims this particular part is only for PC-98 systems. Here’s the (Japanese) page for it: http://buffalo.jp/products/catalog/item/h/hk6-ms400-n2/index.html

    And the corresponding DOS/V (aka regular PC) version: http://buffalo.jp/products/catalog/item/h/hk6-md400-v2/index.html

  14. techfury90 says:

    Oops, just realized I had some things backwards and missed some more information.
    The correct story on PC-98 A20 masking, as they call it (they don’t call it the A20 gate, heh):

    A20 mask enable (allow 1M wraparound) can be enabled by writing 01h to I/O F6h.
    A20 mask disable (allow access to all memory) can be enabled by writing 00h to F6h, /or/ writing *any* value to I/O F2h.

  15. ForOldHack says:

    Oh the days…

    “486DX2-50/66” is a 66 MHz part that was clock doubled in the CPU; what made it run at 50 MHz or 66 MHz was the system bus. At a system bus of 25 MHz, it ran at 50 MHz, and at 33 MHz it ran at 66 MHz. At first there were only 25 or 33, then 25 and 33, then 25, 33 and 40, and finally true 50 MHz motherboards that could use a doubler for 100, or a tripler for 133, or a tripler for 25, which ran at 125, and the quads (IBM Blue Lightning had triplers and quads), and AMD also had a doubled doubler for 25 MHz systems that ran at 100 MHz internally (DX4s), and a DX4/120 for 33 MHz busses.
