I’ve been slowly chewing my way through U.S. Patent 5,724,549, titled Cache Coherency without Bus Arbitration Signals, initially filed by Cyrix Corporation in 1992 and published in 1998 (when it was utterly irrelevant, but such is the life of patents).
When Intel designed the 386, it was already well known that an internal (L1) cache is one of the most effective ways to increase processor performance. But given the manufacturing process available at the time, Intel could only squeeze about 512 bytes of cache onto the complex chip, and that was not enough to be effective.
The 386 was designed with external cache (sometimes called L2, although in the case of a 386 there was no L1 cache) in mind and Intel produced its own cache controller for use with a 386, the 82385 (although much like the 80387, the 82385 became available significantly later than the 386 itself).
When Cyrix designed the 486DLC upgrade processors for the 386 socket, it was not a problem to put 1K of internal cache on a 386-socket chip, or later even 8K in the case of Texas Instruments 486SXL processors. But keeping the cache coherent was a problem. There were in fact two sources of trouble: External bus masters and our old friend (frenemy?), the A20 gate.
External Bus Masters
The Cyrix 486DLC had 486-style KEN# (Cache Enable), FLUSH# (Cache Flush) and A20M# (A20 Mask) pins, but those were only usable on motherboards specially designed for the 486DLC—and the Cyrix 486DLC had to work without the signals.
The 386 bus was designed with some cache coherency in mind, since an external cache controller still had to ensure cache coherency. The 386 HOLD / HLDA signals can be used for this purpose as follows.
If an external master (such as an 8237 DMA controller or an ISA bus master) takes control of the bus, it is necessary to flush the cache (write out dirty lines in the case of a write-back cache, and invalidate everything for both write-back and write-through caches) because the external master is probably going to read from and/or write to system memory. In the more complex case of a write-back cache, the dirty contents have to be written out to prevent the external master from reading stale data from memory that the CPU has updated only in its cache, and the cache has to be invalidated to prevent the CPU from reading stale data after the external bus master writes to memory.
The CPU knows when an external master controls the bus through the HOLD / HLDA signals, and these signals can be used as triggers for cache flushing. This is not very efficient because almost certainly more data will be flushed than necessary, but it is acceptable in typical systems without huge amounts of DMA traffic (with one exception described later).
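The two failure modes can be shown with a toy model. This is only a sketch: `ToyCache` and `run_scenario` are made-up names, the cache is a plain dictionary rather than anything resembling the real hardware, and the flush is simply called at the point where the HOLD / HLDA handshake would occur.

```python
class ToyCache:
    """Tiny cache keyed by address; illustrative only, not the real design."""

    def __init__(self, memory, write_back):
        self.memory = memory          # shared dict: address -> value
        self.write_back = write_back
        self.lines = {}               # address -> (value, dirty)

    def read(self, addr):
        if addr not in self.lines:    # miss: fill the line from memory
            self.lines[addr] = (self.memory[addr], False)
        return self.lines[addr][0]

    def write(self, addr, value):
        if self.write_back:
            self.lines[addr] = (value, True)   # memory is updated later
        else:
            self.lines[addr] = (value, False)
            self.memory[addr] = value          # write-through

    def flush(self):
        """Write out dirty lines, then invalidate everything."""
        for addr, (value, dirty) in self.lines.items():
            if dirty:
                self.memory[addr] = value
        self.lines.clear()


def run_scenario(write_back, flush_on_hold):
    """Return (value the bus master reads, value the CPU reads afterwards)."""
    memory = {0x100: 1, 0x200: 2}
    cache = ToyCache(memory, write_back)
    cache.read(0x200)                 # CPU caches the old contents of 0x200
    cache.write(0x100, 11)            # CPU updates 0x100
    if flush_on_hold:
        cache.flush()                 # triggered by the bus handover
    seen_by_master = memory[0x100]    # external master reads memory directly
    memory[0x200] = 22                # external master writes memory directly
    seen_by_cpu = cache.read(0x200)
    return seen_by_master, seen_by_cpu


print(run_scenario(write_back=True, flush_on_hold=False))   # both sides stale
print(run_scenario(write_back=False, flush_on_hold=False))  # only the CPU is stale
print(run_scenario(write_back=True, flush_on_hold=True))    # coherent
```

Without the flush, the write-back case leaves both sides with stale data, while the write-through case leaves only the CPU's copy stale.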
Unfortunately for Cyrix, it turned out that in some boards the HOLD / HLDA pins weren’t connected in the CPU socket at all. Why? They were connected to an external cache controller, and since an Intel 386 processor had no internal cache, the CPU did not need to worry about those signals. The cache controller took care of everything. But this made upgrading such systems with a 486DLC quite problematic.
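What Cyrix patented is a hardware solution to the problem of missing HOLD / HLDA signals. Instead of using those as synchronization triggers, Cyrix discovered that it’s possible to use I/O reads and interrupt acknowledges instead. The idea, presumably, is that software normally learns that a DMA transfer has completed by taking an interrupt or by reading a status port, so the cache gets invalidated before the CPU consumes the transferred data. This assumes a write-through internal cache, which ensures external masters will never read stale data, but the CPU cache itself may become stale after writes by an external master.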
The patent describes a small PCB that sits between the processor and the socket. It monitors several CPU pins: M/IO# (Memory / IO), W/R# (Write / Read), and D/C# (Data / Control). Thus it can detect I/O reads as well as interrupt acknowledges. When the detection triggers, the PCB signals the processor to invalidate the cache using the FLUSH# pin.
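The detection logic might look roughly like the sketch below. The cycle encoding is the 386DX bus cycle definition as I understand it from Intel's data sheet (where the write/read pin is named W/R#) and should be checked against that document. The function names are made up, and a real circuit would also have to qualify the pins with ADS# and bus timing, which this sketch ignores.

```python
# Pin levels: 1 = high, 0 = low. Encoding assumed from the 386DX data sheet.
BUS_CYCLES = {
    # (M/IO#, D/C#, W/R#): cycle type
    (0, 0, 0): "interrupt acknowledge",
    (0, 1, 0): "I/O read",
    (0, 1, 1): "I/O write",
    (1, 0, 0): "memory code read",
    (1, 0, 1): "halt/shutdown",
    (1, 1, 0): "memory data read",
    (1, 1, 1): "memory data write",
}

FLUSH_TRIGGERS = {"interrupt acknowledge", "I/O read"}


def decode_cycle(m_io, d_c, w_r):
    """Classify a bus cycle from the three monitored pins."""
    return BUS_CYCLES.get((m_io, d_c, w_r), "does not occur")


def should_assert_flush(m_io, d_c, w_r):
    """True when the monitoring logic would activate FLUSH#."""
    return decode_cycle(m_io, d_c, w_r) in FLUSH_TRIGGERS


for pins in sorted(BUS_CYCLES):
    print(pins, decode_cycle(*pins), should_assert_flush(*pins))
```

Under this encoding, both trigger cycles have M/IO# and W/R# low, so memory accesses and I/O writes never cause a flush.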
There was another, separate problem caused by everyone’s favorite bad boy of the PC compatible world, the A20 gate. Depending on circuitry external to the CPU, accesses to physical memory with bit 20 of the address set would either be passed along unchanged, or the bit could be cleared, wrapping around one megabyte lower.
This was obviously a problem for the internal cache, because if the cache supplied the contents of memory above 1MB when the access should have wrapped around to the beginning of memory, or vice versa, bad things were bound to happen. The Intel 486, as well as the Cyrix 486DLC, solves the problem with the A20M# pin, which informs the CPU of the current state of the A20 gate. The processor can then apply the mask internally when accessing its internal cache.
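To make the aliasing concrete, here is a small sketch. The helper names are hypothetical, and the "cache tag" is just the address used to look up the cache, not the real tag layout.

```python
A20_BIT = 1 << 20


def external_address(addr, a20_enabled):
    """Address that reaches memory after the board's A20 gate."""
    return addr if a20_enabled else addr & ~A20_BIT


def cache_lookup_address(addr, cpu_sees_a20m, a20_enabled):
    """Address the internal cache uses for its lookup.

    With A20M# wired up, the CPU applies the same mask as the board.
    Without it, the CPU has no idea the gate exists.
    """
    if cpu_sees_a20m:
        return external_address(addr, a20_enabled)
    return addr


hma = 0x100010   # just above 1MB
low = 0x000010   # the location it wraps to when A20 is masked

# Gate disabled: both addresses reach the same memory location...
print(hex(external_address(hma, False)), hex(external_address(low, False)))
# ...but a CPU without A20M# would cache them as two different lines.
print(cache_lookup_address(hma, False, False) == cache_lookup_address(low, False, False))
# With A20M#, the cache sees one location, matching what memory sees.
print(cache_lookup_address(hma, True, False) == cache_lookup_address(low, True, False))
```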
But again, the A20M# signal was not present in standard 386 motherboards. Cyrix solved the problem the hard way—by default, the first 64KB of each 1MB region had to be uncached. This was obviously quite inefficient, because it meant that much of the memory most frequently accessed by DOS could not be cached.
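The default rule amounts to a simple address test. The following is a sketch with made-up names, based only on the "first 64KB of each 1MB region" description above.

```python
MEGABYTE = 1 << 20
UNCACHED_WINDOW = 64 * 1024


def cacheable_by_default(addr):
    """False for addresses in the first 64KB of any 1MB region."""
    return (addr % MEGABYTE) >= UNCACHED_WINDOW


for addr in (0x000000, 0x00FFFF, 0x010000, 0x100000, 0x10FFFF, 0x110000):
    print(hex(addr), cacheable_by_default(addr))

# Share of the address space that is left uncached by this rule.
print(UNCACHED_WINDOW / MEGABYTE)
```

Only one sixteenth of the address space is affected, but in a DOS system it includes the lowest 64KB of conventional memory, which is what makes the default costly.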
Software could help though. Once the system reached a stage where the state of the A20 gate wasn’t going to change (just about any protected-mode operating system, including memory managers such as EMM386), it was possible, not to mention highly desirable, to enable normal caching for the first 64KB of each megabyte in the CPU.
There was yet another wrinkle that users upgrading certain 386 motherboards with a Cyrix 486DLC had to worry about. In the original IBM PC and PC/XT, system DRAM refresh was handled by the DMA controller; in the PC/AT, there was dedicated circuitry but the effect was very similar. Approximately every 15 microseconds, the DMA controller (in the PC and PC/XT) or custom circuitry (PC/AT) took control of the system bus and issued a memory read.
A keen reader can already see the problem: This refresh mechanism involves an external bus master, and hence the HOLD and HLDA signals. Which means that every 15 microseconds, the processor’s cache will need to be flushed (because the CPU has no way to tell that it’s a DRAM refresh cycle that in fact is completely irrelevant to cache coherency). That is not ideal, to put it mildly.
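A quick back-of-the-envelope calculation shows the scale of the problem. The 15 microsecond interval comes from the text above; the CPU clock speeds are illustrative assumptions, not measurements.

```python
REFRESH_INTERVAL_US = 15  # approximate refresh period from the text


def flushes_per_second(interval_us=REFRESH_INTERVAL_US):
    """Number of refresh-triggered cache flushes per second."""
    return 1_000_000 / interval_us


def clocks_between_flushes(cpu_mhz, interval_us=REFRESH_INTERVAL_US):
    """CPU clock cycles available between two consecutive flushes."""
    return cpu_mhz * interval_us


print(round(flushes_per_second()))        # flushes per second
for mhz in (25, 33, 40):                  # assumed clock speeds
    print(mhz, clocks_between_flushes(mhz))
```

With only a few hundred clocks between flushes, the cache rarely gets a chance to fill with useful data before it is emptied again.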
Even without an internal cache, this kind of memory refresh is not great. As long as the DMA/refresh controller holds the bus, the CPU can’t read or write memory and is forced to stall. That is more of a problem with faster processors because they are more likely to starve during the memory refresh period.
Newer systems solve the problem using so-called hidden refresh, which uses different circuitry for refreshing the system DRAM that does not use the HOLD / HLDA signals and is thus hidden from the processor, hence the name. Needless to say, a 386 board with hidden refresh is much preferable for a 486DLC processor.
Note again that this is not a cache coherency problem as such; it is a performance problem triggered by the cache coherency mechanism.
Cyrix Cache Coherency Control
The 486DLC has several cache coherency pins which are not present in standard 386 sockets, but are available on motherboards designed for the Cyrix 486DLC. All these pins are optional and after reset, the 486DLC does not use them—typically the BIOS would program the CPU to use those pins.
The A20M# pin was already described above. The FLUSH# pin (also mentioned previously) is an input triggering a cache flush.
The KEN# (Cache Enable) signal is an input telling the processor whether the currently accessed address should be cached. It is generated by the motherboard, and may be configurable by the BIOS. It is used to inform the processor about non-cacheable memory regions. Note that the 486DLC itself also has internal cache configuration registers with the same functionality; those would be used on boards with no KEN# pin.
Also notable is the BARB (Bus Arbitration) bit in the 486DLC’s internal cache configuration register CCR0. When BARB is set, the HOLD / HLDA signals will trigger cache flushes. The BARB bit will need to be set in standard 386 motherboards, but in boards designed for the 486DLC, the BARB bit should not be set to avoid unnecessary flushes.
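The choice a BIOS faces can be sketched as follows. Everything here is illustrative: the function name is made up, and the CCR0 bit positions are assumptions that must be checked against the Cx486DLC data sheet before being used for anything real. Only the decision rule (BARB on standard 386 boards, the dedicated pins on boards designed for the 486DLC) comes from the text.

```python
# ASSUMED bit positions within CCR0; verify against the data sheet.
CCR0_A20M = 1 << 2    # enable the A20M# input
CCR0_KEN = 1 << 3     # enable the KEN# input
CCR0_FLUSH = 1 << 4   # enable the FLUSH# input
CCR0_BARB = 1 << 5    # flush the cache on HOLD / HLDA


def choose_coherency_bits(board_has_486dlc_pins):
    """Pick coherency-related CCR0 bits for a given kind of motherboard."""
    if board_has_486dlc_pins:
        # Board drives the dedicated pins; BARB would only add flushes.
        return CCR0_A20M | CCR0_KEN | CCR0_FLUSH
    # Standard 386 board: bus arbitration is the only available trigger.
    return CCR0_BARB


print(bin(choose_coherency_bits(True)))
print(bin(choose_coherency_bits(False)))
```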
The OS/2 Museum has a curious processor in its collection. It’s a Texas Instruments SXL2-50 (a clock-doubled Cx486DLC with 8KB internal cache) with a bolted-on gadget that looks like this:
Improve Technologies was a company specializing in CPU upgrades, but the above model does not appear to be documented anywhere. Here’s what the processor looks like from the bottom:
Examining the traces shows that other than three pins used for power supply (Vcc/Vss), only the HLDA and FLUSH# pins are connected.
What purpose exactly the gadget serves is unclear. But it is obvious that it has something to do with cache coherency, and it was perhaps designed to make the Cyrix/TI upgrade CPUs usable in specific boards. Whatever it is, it’s not the logic described in U.S. Patent 5,724,549.
All in all, managing cache coherency in a board not designed for it is fraught with peril. It can be done, but only at the cost of some (if not a lot of) performance.
- Cyrix Cx486 DLC Data Sheet, May 1992
- Texas Instruments TI486SXLC and TI486SXL Microprocessor Reference Guide, 1994