A few days ago I came across an article about the 8237 DMA controller in an old German computing magazine (DOS Extra, issue 1 ’87/88, page 123, Schnelle Speicherverwaltung mit dem DMA-Controller, or Fast memory management with the DMA controller). While skimming through the article, I began to suspect that although the author did a good job reading the 8237 datasheet, he had only a rather vague idea of how the controller was actually wired up in the IBM PC.
On closer reading of the article, my suspicion was confirmed. While there is some PC-centric information in the article (which I/O ports the 8237 is mapped at, or the fact that the DMA controller is used for memory refresh in the IBM PC), absolutely crucial IBM-specific information is missing.
Sometimes the article contradicts itself slightly, or at least does not address the full implications: It claims that DMA transfers allow the CPU to do other things in the meantime, but elsewhere also points out that DMA transfers block the CPU. Both are, of course, true—the CPU can do other things while DMA transfers are running, but when active DMA transfers own the bus, the CPU can’t access any data or read instructions, which is especially an issue in the original cache-less designs. The faster the DMA runs, the less time there’s left for the CPU to access the bus.
But now to the gaping holes in the article. It correctly notes that the 8237 can address and transfer up to 64KB. It does not appear to have occurred to the article’s author that since the PC’s address space is 1MB (and 16MB in the PC/AT, but that is not truly considered in the article), there’s a bit of a problem there. And thus the article says absolutely nothing about the PC’s DMA page registers, which determine the 64KB region of memory a given DMA channel will address. Naturally there’s also no mention of the fact that in the IBM PC and many (but not all) clones, the DMA page registers contain strictly bits 19:16 (on PC and XT; bits 23:16 on the AT) of the physical address and the 8237 supplies bits 15:0. Which has the rather annoying implication that DMA transfers can’t cross a 64K physical address boundary (in some chipsets, the page register and DMA address are added and the limitation does not exist; in the original IBMs they’re only logically ORed).
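To make the boundary limitation concrete, here is a minimal sketch of how the original IBM PC forms a DMA address by concatenation rather than addition (the function name is just for illustration):

```python
# Sketch of 20-bit DMA address formation in the original IBM PC:
# the page register supplies bits 19:16, the 8237 supplies bits 15:0,
# and the two are simply combined (logically ORed), not added.

def pc_dma_address(page_reg, dma_addr):
    """Physical address for one transfer cycle (PC/XT: 4-bit page register)."""
    return ((page_reg & 0xF) << 16) | (dma_addr & 0xFFFF)

# A transfer reaching the top of a 64K window wraps around instead of
# crossing into the next physical 64K region:
last = pc_dma_address(0x2, 0xFFFF)   # last byte of the window
nxt  = pc_dma_address(0x2, 0x0000)   # 8237 increments 0xFFFF -> 0x0000
assert last == 0x2FFFF
assert nxt == 0x20000                # back to the window start, not 0x30000
```

In chipsets that add the page register and the 8237 address instead, `nxt` would be 0x30000 and the boundary would not exist.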
To be fair, the IBM PC Technical Reference does not do a very good job of explaining the DMA page registers either, but it does mention them, the BIOS listings provide rather solid clues as to how the page registers really work, and the board schematics also clearly show how the DMA page registers are wired. (The DOS Extra article does not provide any references so it’s anyone’s guess what it’s based on besides the 8237 datasheet.) The IBM PC/AT Technical Reference is much more informative on the subject.
The other giant hole in the article has to do with memory-to-memory DMA transfers. The article correctly mentions that the 8237 supports memory-to-memory transfers using DMA channels 0 and 1 (and only those). It then says things like this: “Think for example of moving blocks of text within a larger area of text, which could be significantly accelerated through the 8237.” Sounds cool, right? Not so fast.
Elsewhere the article also mentions that the IBM PC uses the DMA controller for memory refresh. What it omits to mention is that DMA channel 0 is the one used for memory refresh in the IBM PC, which makes memory-to-memory DMA quite problematic, given that channel 0 would be also required for memory-to-memory transfers. You can do memory-to-memory transfers, or you can keep the DRAM refreshed, but not both.
It’s actually even worse than that. Looking at the IBM PC system board schematics (page D-5 in the original August ’81 Tech Ref), it is apparent that DMA channels 0 and 1 effectively share the same DMA page register; the way the page registers are wired is rather non-obvious but it is explained here for the PC/AT. In the original IBM PC, DACK 2 and 3 pins of the 8237 were connected to pins RB and RA of an LS670 latch, respectively. Because the PC uses active-low DACK signals, DMA channel 2 uses the page register accessible on I/O address 81h, DMA channel 3 uses port 82h, but DMA channel 1 uses port 83h—when neither DACK2 nor DACK3 is active, the RA and RB pins on the LS670 are in the high state, selecting register 3. To put it differently, if either DACK0 or DACK1 or no DACK signal is active, register 3 in the LS670 is selected; memory-to-memory transfers would be constrained to a single 64K window, severely limiting their usefulness (memory-to-memory transfers don’t use DACK signals at all, so it would not be easy to select the right page register anyway).
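For reference, the wiring results in the familiar, rather scrambled page register port map. This small sketch uses the conventional PC/AT port assignments (not something from the DOS Extra article), which are a superset of the PC’s 81h/82h/83h arrangement:

```python
# Conventional DMA page register I/O ports on the PC/AT.
# Channel 4 is the cascade channel and has no usable page register;
# port 0x8F is used for the refresh "page".
PAGE_REG_PORT = {
    0: 0x87, 1: 0x83, 2: 0x81, 3: 0x82,   # 8-bit channels (DMA controller 1)
    5: 0x8B, 6: 0x89, 7: 0x8A,            # 16-bit channels (DMA controller 2)
}

# Matches the PC wiring described above: channels 2, 3 and 1 at
# ports 81h, 82h and 83h respectively.
assert PAGE_REG_PORT[2] == 0x81
assert PAGE_REG_PORT[3] == 0x82
assert PAGE_REG_PORT[1] == 0x83
```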
Another issue is that memory-to-memory transfers would have to program the DMA controller for block mode (rather than the typically used single or demand modes). It is questionable how much the CPU could actually do while a block DMA transfer was running and hogging the bus bandwidth.
As an aside, using the DMA controller for memory refresh has an implication for IBM PC initialization. Although the PC has no complicated memory controller and no memory timings to set, immediately after power-up RAM can’t be used because it’s not being refreshed yet. During POST, the BIOS sets up the DMA controller (and timer, which is also involved) to perform the refresh function, but until then, there’s no RAM, and therefore notably also no stack.
The DOS Extra article nicely demonstrates why reading component datasheets is not sufficient for understanding and programming of the IBM PC and compatibles. Numerous ICs are wired in such a way that some of their theoretical capabilities cannot be used. And some of the board wiring is practically “secret sauce” (not truly secret back when the schematics were published) which may be surprisingly difficult to find adequately documented, precisely because the information isn’t covered by any datasheet.
The biggest reason why the PC had a DMA controller may well have been the floppy controller. The FDC does not produce or consume data at any kind of prodigious rate, but it is very sensitive to latency. The CPU should theoretically be able to service the FDC in PIO mode, but it would not be able to do so with interrupts enabled, and that would be a problem especially when transferring several sectors of data at once. The DMA controller can react with low latency, avoiding underruns or overruns.
P.S.: The IBM PC/AT does not use DMA channel 0 for memory refresh. It is probably still impossible and certainly highly impractical to do memory-to-memory DMA. One practical problem is the inability to use separate page registers for the source and destination since memory-to-memory transfers don’t use DACK. A much worse problem is likely the fact that the 8237 would have to hold the bus during the entire transfer, entirely blocking the CPU, and possibly interfering with DRAM refresh. Finally, memory-to-memory DMA is almost guaranteed to be slower than CPU transfers on a PC/AT, since it is limited to fairly slow 8-bit transfers, while the CPU can use faster 16-bit cycles.
Late Update: By happenstance, I came across DOS Extra Nr. 7 1989, which rehashed the hardware article from the first ’87/88 issue. On page 37, an article titled simply Der DMA-Controller 8237 (The 8237 DMA Controller) presents much of the same information that the older article did, but with notable improvements (and some new problems).
The new article clearly states that memory-to-memory DMA is not possible on the IBM PC because DMA channel 0 is used for DRAM refresh. It also states, without explaining why, that on the AT, memory-to-memory DMA is still not practical even though the 8237 is no longer used for memory refresh (“Leider ist beim AT keine sinnvolle Speicher-zu-Speicher-Übertragung programmierbar […]”, or Unfortunately, no useful memory-to-memory transfer can be programmed on the AT). The article also notes that DMA transfers lock out the CPU, which is a problem with longer block transfers.
At times the new article is incomplete, such as when it claims that the DMA controller is “up to ten times faster” than the CPU. That may have been true for the original PC, but certainly isn’t true for any PC/AT or newer system.
At times the new article is misleading, such as when it says that the DMA controller can process up to four transfers concurrently (“Weil der 8237 vier getrennte Kanäle besitzt, ist er zudem in der Lage, vier solche Datenübertragungen gleichzeitig durchzuführen.”, or Because the 8237 has four separate channels, it is also capable of performing four such data transfers simultaneously). That gives a distinct impression that four data streams could be actually transferred at the same time, but of course only one DMA channel can own the bus at any one time. Four DMA transfers can all be “running”, but they still have to transfer data one at a time.
Unlike the original 1987 article, the 1989 update explains the DMA page registers (although it omits their port addresses), and it also briefly describes the cascaded DMA controllers in the PC/AT (without presenting the not-so-obvious details of 16-bit DMA transfers). Overall a clear improvement over the original article.
I’ve got this vague memory that the second DMA controller on the AT was wired up so as to be able to perform transfers on 128K boundaries, and 16 bit transfers.
But that could be my memory playing tricks on me.
It’s not. That is how the second DMA controller (DMA channels 4-7) worked: all the addresses were shifted left by one bit. The controller could only address aligned words, and the page registers ignored the lowest bit.
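The shifted wiring can be sketched like this (a hedged illustration, assuming the conventional PC/AT scheme in which the 8237’s address is a word address for channels 5-7 and bit 0 of the page register is ignored):

```python
# Sketch of physical address formation for 16-bit DMA channels (5-7)
# on the PC/AT: the 8237 address is shifted left by one bit (word
# address) and bit 0 of the page register is ignored, so transfers
# are confined to aligned words within a 128K window.

def at_dma16_address(page_reg, dma_addr):
    return ((page_reg & 0xFE) << 16) | ((dma_addr & 0xFFFF) << 1)

assert at_dma16_address(0x03, 0x0000) == 0x020000  # page bit 0 ignored
assert at_dma16_address(0x02, 0xFFFF) == 0x03FFFE  # top of the 128K window
assert at_dma16_address(0x02, 0x0001) & 1 == 0     # always word-aligned
```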
> The faster the DMA runs, the less time there’s left for the CPU to access the bus.
Did you mean “the longer” (i.e. *slower*)?
No, “fastest” in this context is “as fast as the 8237 can handle”, which would imply the bus being owned completely by DMA. “Slow” would be a short DMA transfer here and there, leaving the CPU relatively undisturbed.
It would have been crazy if a memory-to-memory “blitter” was hiding all this time… Although considering how the AT is so much like a cascaded XT tied onto another XT, it’d be too much to hope for.
Then again, the simplicity of the AT made it cheap and easy to clone.
The problem really was that on a PC/AT, DMA was slower than the CPU (16-bit cycles have much faster timings than 8-bit!). That had not been the case with the 8088 PCs, where the DMA really was faster.
Drivers for the 3Com 3C501 EtherLink often use DMA to transfer packets to/from the card on an 8088/8086 CPU, but on 286+ they just use string I/O instructions because those are faster. The reason was speed: DMA was faster on the PC but not on the AT.
When was the article written? The two year issue date suggests a special issue filled with reprints. It was common in the US in magazines not focused on the IBM PC to discuss hardware done correctly with a sidebar mentioning how the IBM PC got it wrong. I didn’t follow German computer magazines closely while I was living there so I don’t know if German magazines had the same style.
8-bit DMA was slow. The 16-bit DMA chips were a lot faster. IIRC, the Siemens 16-bit DMA controller marketed for use with the 80286 was advertised as doing memory copies through DMA at twice the speed of the 80286 doing memory copies. Just because IBM chose a simple and cheap method that maintained compatibility does not mean IBM chose the fastest.
The PC Jr was the most common 8088 system to lack DMA. The memory expansion sidecars needed a memory controller which added to the cost. There were third party add-ons that dealt with other issues caused by the lack of DMA. On the whole, having DMA would have been much cheaper. Tandy figured that out and placed the DMA chip on the memory expansion card for the Tandy 1000.
Some Eastern Bloc computers inspired by the IBM PC handled the 8088 and DMA issues correctly by having both DMA controller and memory controller leaving DMA channels free to do DMA. Compupro pushed the idea of using static RAM to ensure that DMA worked flawlessly. Compupro’s prices might have been a bit daunting for some customers; even after a major price cut, the 128 kB RAM memory card cost $1695.
There was a German monthly magazine called DOS International. DOS Extra was an irregular supplement. The first issue (late 1987?) and the 7th issue (1989) of DOS Extra both covered the PC hardware in some detail; the articles are not reprints, and at least the DMA article is not just a slight edit, either. It’s very similar and the newer version could be based on the older one, but the differences are significant.
The thing about the 8237 DMA controller which probably confused people is that it legitimately was faster than a ~5 MHz 8088, but the PC/AT used the exact same 8237 running at more or less the same speed and a much faster CPU with 16-bit bus. And with 386s it only got worse. MCA and EISA had faster DMA controllers, but I don’t think it ever amounted to much, as hardware designers went for bus mastering instead.
For some reason I have this PDF downloaded, likely to see what was improved about EISA DMA controllers: https://physics.bgu.ac.il/COURSES/SignalNoise/DMA.pdf
It makes a passing mention of the two channel memory-to-memory transfer mode, but doesn’t mention that it’s quite useless on the PC!
That’s a good appnote. It’s a little inaccurate when it talks about the original PC (e.g.: PC DMA page registers were 4-bit, not 8-bit), but it was clearly written with the PC/AT and later designs in mind.
It nicely explains some things the 8237 datasheet says are possible but actually can’t be done in a PC, like block DMA transfers which would be fast and lock out everything else on the bus, including memory refresh.
I hadn’t thought of that but it’s true that in the PC and XT, the 8237 ran faster (4.77 MHz) than in the PC/AT (4 MHz), which even further disadvantaged 8237 DMA transfers in the PC/AT. The maximum transfer rate they give for the 8237 in the PC/AT with 16-bit transfers is 1.6 MB/s, which is a lot less than a bus-mastering adapter can transfer on 8-MHz AT bus (4 MB/s and very likely 5.33 MB/s or more).
The bottom line is that on a PC/AT, peripherals that needed to transfer data quickly (hard disks, network cards) did not use 8237 DMA because it would be counterproductive. But peripherals with limited bandwidth, tight timing constraints, and minimal buffering (floppies, sound cards) benefited from 8237 DMA.
EISA probably had the best 8237 superset with fast DMA transfers, though I’m not sure how much it was really used. I suspect bus-mastering was preferred since it was much more widely applicable.
One problem with figuring out what uses DMA is that Adaptec used the term DMA to refer to both using the motherboard based DMA controller and what is often referred to as bus-mastering. Which means there is no obvious documentation saying when the motherboard DMA controller is used.
Winn L. Rosch, back in a 1990 issue of PC Magazine, had an in-depth EISA article focused heavily on how DMA worked with it. Little diagrams of how control is shared between bus master devices, DMA, and RAM refresh help clarify things.
They are both called DMA, that’s not something Adaptec made up. Sometimes third-party (8237 style) vs. first-party (bus-mastering) DMA. But yeah it confuses things. What confuses things even more is that those Adaptec and other bus-mastering ISA cards needed to have a DMA channel assigned, even though they did not use the 8237 DMA controller at all, but they still needed that DMA channel for arbitration. If you look at the documentation or drivers for something like Adaptec 1542 or Novell NE2100, there are jumpers and driver settings for a DMA channel, and you probably have to look pretty hard to find out that those cards are not using the system board DMA controller.
Marketing literature tended to emphasize bus-master DMA (e.g. Adaptec 154x) but from an end-user perspective the configuration was identical. With PCI it’s a lot easier.
My experience is that other than sound cards, there were not too many 16-bit ISA cards that used 8237 DMA. I’m actually struggling to think of any, though there must have been something.
The only notable modern example still commonly found on LPC bus systems is the ECP mode parallel port which commonly uses DMA channel 3.
It looks like EISA systems could do 8237 style DMA transfers faster than regular ISA systems and still remain compatible. Everything you wanted to know about EISA system DMA can be found in this book: http://oldcomputers-ddns.org/public/pub/manuals/eisabook.pdf
EISA was pretty neat and well thought out. Too bad it didn’t replace ISA as the “base” PC architecture even though Intel made a few southbridges with EISA support. It even fixed issues that came up on MCA based systems.
Yes, EISA was capable of doing third-party DMA at full bus speed. And I totally agree that EISA was a really solid improvement over ISA while retaining the compatibility that MCA ditched, and even the slots were backward compatible.
It’s ironic that in the end EISA didn’t really do much better than MCA, and it seems like it could and should have. VL Bus may have been a big factor because it took care of the one or two high-speed devices (graphics and storage), and I would guess it was much cheaper and simpler than EISA. Once PCI came out, it was all over for both EISA and MCA, really.
There was an alternate way of doing memory to memory transfers, and it was used by a SCAT UMB driver, as well as by UMBPCI. Basically it involves setting up block transfers for two channels one after another, read and write, which will make the DMA controller transfer the memory block from one channel to the other. The transfer is started by writing to the DRQ register to set the DRQ of both channels.
That’s fascinating. Is that something that actually works on all 8237 DMA controller implementations?
And why would UMB drivers need such a thing? The point of UMBs is that they’re directly addressable in real mode, so what does DMA have to do with it? Or was it only used to check if DMA works at all in a given memory region?
Almost certainly to check if DMA works in a UMB so it could warn the user to avoid loading drivers which use DMA or BUFFERS (since the floppy controller uses DMA), as those things wouldn’t work properly. UMBPCI is probably still around though I don’t know if source code was made available. Certainly it would be easy enough to disassemble it regardless.
Yes, it’s done to check if DMA works in a given UMB block.
Also, I’m not yet sure how this is done in the first place, but I suspect it has to do with the temporary register. The datasheet is not clear on what happens to it during non-memory-to-memory transfers, but I suspect it’s still used to hold the data that would normally be transferred to/from a device. That would explain how this is able to work: initialize the channels and start the transfers in the correct order, and the read transfer will read one byte into the temporary register, after which the write transfer will write the byte from that register. But I need to do more research into what’s actually going on.
But this is certainly supposed to work both on a normal 8237(A?) that would be used in 286 machines and on the 8237 implementation on Intel PIIX and PIIX3, since both that SCAT driver for UMB on 286 and UMBPCI’s check DMA utility for PCI Pentium and later machines do it.
The C&T SCAT includes DMA controllers so what works or doesn’t work there doesn’t necessarily imply the same must be true for all 8237A implementations. The PIIX would presumably include a 100% functional equivalent of an Intel 8237A though.
The 8237 temporary register is clearly documented as only being needed for memory-to-memory transfers, but it’s entirely plausible that it would be loaded for all transfer types. It is not at all clear to me how writing would work though, because I see no indication that data from the temporary register would be driven out onto the bus. I think I’ll need to disassemble the UMBPCI thingy to see what it really does…
I have partially analyzed what DMACHK from the latest UMBPCI does. It does not attempt to do memory-to-memory transfers as far as I can see. What it does is unmasks DMA channel 1, programs it for a 1KB write in block mode at the given UMB address, and sets the DREQ bit in the Request register (block transfers are the only type that can be triggered by software). I’m not sure what this is really supposed to do, but I am pretty certain the 8237 should write something to the memory if it can access it. It’s not obvious what exactly the data will be, but the datasheet seems pretty clear that a block transfer + software request will run regardless of what any device might or might not be doing. And if nothing responds to the DACK signal then there certainly won’t be anything to stop the transfer with an #EOP (although from what I can tell, in the PC/AT the #EOP pin is output only, so devices can’t control it anyway). I believe the expected result is that the memory contents will change, though not necessarily in predictable ways.
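The sequence described above can be sketched roughly as follows. This is a hedged reconstruction, not DMACHK’s actual code: the port numbers are the standard PC addresses for DMA controller 1, and `outb()` is a stand-in that just records the writes (real code would do actual port I/O, typically with interrupts disabled around the flip-flop accesses):

```python
# Sketch of programming 8237 channel 1 for a 1KB block-mode memory
# write triggered by a software request, as DMACHK appears to do.

io_log = []

def outb(port, value):
    """Stand-in for a port write; records (port, value) for inspection."""
    io_log.append((port, value))

def start_block_write(phys_addr, count):
    page   = (phys_addr >> 16) & 0xF
    offset = phys_addr & 0xFFFF
    outb(0x0B, 0x85)                       # mode: block, write, channel 1
    outb(0x0C, 0)                          # clear the address/count flip-flop
    outb(0x02, offset & 0xFF)              # channel 1 address, low byte
    outb(0x02, (offset >> 8) & 0xFF)       # channel 1 address, high byte
    outb(0x03, (count - 1) & 0xFF)         # channel 1 count (N-1), low byte
    outb(0x03, ((count - 1) >> 8) & 0xFF)  # channel 1 count, high byte
    outb(0x83, page)                       # channel 1 page register
    outb(0x0A, 0x01)                       # unmask channel 1
    outb(0x09, 0x05)                       # request register: set DREQ, ch. 1

start_block_write(0xD0000, 1024)           # hypothetical UMB address
```

The mode byte 85h encodes block mode, write transfer, channel 1 per the 8237 datasheet; the request register write (09h, value 05h) is the software trigger that makes block mode usable without any device asserting DREQ.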
Which piece of software exactly do you think attempts to do actual memory-to-memory DMA transfers? I would like to have a closer look.
Update: DMACHK.COM fills a 1KB block with a 55AAh pattern, starts a DMA block transfer as described above, and then checks if the 55AAh pattern is still in place. If it’s not, DMA is considered to be working. That further convinces me that what DMACHK does is useful for checking whether a particular physical memory region is accessible by the 8237 DMA controller, but very little else.
Weird indeed. From the datasheet, I can only conclude that in this situation, the data bus is *not* driven by the temporary register or the DMA controller in general. As you mentioned, the datasheet is pretty insistent on pointing this out, and it would interfere with normal operation: For actual DMA transfers between device and memory, the controller does not drive the data bus itself, but only instructs memory and device to read or write respectively using the IOR/IOW and MEMR/MEMW lines.
I have never messed with ISA DMA, but it appears to me that in block transfer (write) mode, the device on the corresponding channel is supposed to keep driving the next byte of the block every time IOR goes low, as long as its corresponding DACK is active (though I wish the datasheet were more explicit about this).
So, with no device attached to channel 1, I’d expect the bus to not be driven, which memory then latches in as 0xff bytes. With a device attached, it’s indeed pretty unpredictable (depending entirely on the state that device is in), and would certainly interfere if strobing IOR is not entirely side-effect free in that particular moment.
What data did you actually get? Have you confirmed that it’s 0xff with no device attached to the channel?
Afaik the DRAM is refreshed if something (the CPU or DMA controller) performs reads whose least significant address bits cover every combination in a certain range (0-0x7F or 0-0xFF, IIRC) within the period in which refreshes have to be done. So reprogramming the DMA controller to do a memory-to-memory transfer would keep memory refreshed if the transfer size is large enough (128 or 256 bytes). I assume that for smaller transfers it would be faster anyway to let the CPU do a REP MOVS… instead of taking the time to set up the DMA controller.
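The “large enough transfer” argument can be sketched numerically (assuming 256 refresh rows, i.e. the low 8 address bits select the DRAM row):

```python
# Sketch: a sequential transfer touching at least 256 consecutive
# addresses cycles through all 256 row addresses (the low 8 address
# bits), refreshing every DRAM row as a side effect.

def rows_touched(start, length):
    return {(start + i) & 0xFF for i in range(length)}

assert rows_touched(0x12345, 256) == set(range(256))  # every row refreshed
assert len(rows_touched(0x12345, 128)) == 128         # only half the rows
```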
Re the UMB DMA testing software mentioned above: It seems rather useful to check if the DMA controller can access UMBs. Having hardware chipset support for enabling UMBs would kind of imply that the DMA controller can reach them, but using something like EMM386 to map RAM as UMBs would certainly not be compatible with DMA transfers to the UMB area.
My impression re the usage of “DMA” vs. “bus mastering” is that “DMA” got a bad reputation as some slow old technique among general “power users” (those who read specs, compared performance, and had a decent clue how to configure hardware, but didn’t know much about how the hardware worked on a component level like an electronics engineer would), and thus “bus mastering” was the new fancy term when the Intel 430 and 440 chipset ranges were marketed (and whichever chipsets used DMA/bus mastering earlier on in the >=386 era).
Btw I’m not surprised that EISA didn’t catch on. The sockets are more expensive, and like AGP it’s prone to failures if cards aren’t inserted properly. With the high-quality server/workstation class hardware that used EISA that wasn’t a problem, but imagine using EISA with the cheap no-name PC cases that the simpler ISA-based low-cost computers used. Also, before Windows 3.x there wasn’t much demand for a fast bus, as most software wasn’t that graphics intensive, hard disks were still rather slow, and 10 Mbps Ethernet was slower than what the ISA bus could deliver. Sure, a server needed somewhat faster hardware if it were to serve disk-intensive clients, and that’s the reason why EISA sold at all. Soon after Windows 3.1 really took off, VESA provided the needed fast bus performance until PCI took over. Both VESA and PCI are mechanically more robust than EISA or AGP, so VESA and PCI work decently even with low-quality cases.
I haven’t tried that on real hardware yet, so I don’t know what the data would be. Yes, the device is supposed to be driving the data on the bus (definitely not the temporary register) and if that’s not happening, there might be 0FFh bytes (likely on newer hardware) or maybe something else (older hardware). The utility clearly only cares that the memory contents changed.
A bit of background about UMBs and DMA. UMBs are prone to having trouble with DMA, but the issues are very different in the EMM386 vs. UMBPCI case. EMM386 remaps normal physical memory using paging such that linear address != physical address. VDS (Virtual DMA Services) was designed to solve that. Basically with VDS, software performing DMA can ask the memory manager what the physical addresses are for a given memory buffer. The problems here are identical for ISA DMA and for any kind of bus mastering, and the solutions are the same.
UMBPCI does not remap anything but rather enables existing memory that’s already there. The trouble is that the memory is only really meant for ROM shadowing, and on some chipsets it is not accessible by DMA at all. That is not a problem for the intended purpose, but it may make it difficult to load stuff into the UMBs. Conceivably (just a guess) the UMB accessibility by ISA DMA might be different from accessibility by PCI bus mastering, depending on the chipset design.
On a PC/AT or later it’s hard to envision any scenario where ISA DMA could do anything faster than the CPU (on the PC and PC/XT that was different!). If it were possible to perform a transfer “in the background” then it could have some application, but on a 286 it can’t be done–the CPU is locked out, even if memory-to-memory transfers actually worked. And on anything newer ISA DMA is just so hopelessly slow that there’s no point. ISA DMA is still very useful for floppy drives and sound cards which have limited transfer rates but stringent latency requirements. It’s not unlike what USB isochronous transfers were designed for.
“Bus mastering” was considered fancy and a Good Thing even back in the late 1980s when the first bus-mastering HBAs and Ethernet controllers showed up.
I think both EISA and MCA were victims of being too far ahead. VL bus was a short-term and crude but very effective solution, and by the time ISA+VLB became an issue for the majority of users, PCI was already there. Not having to have a reference diskette or an ECU was IMO a huge plus for PCI.
>Conceivably (just a guess) the UMB accessibility by ISA DMA might be different from accessibility by PCI bus mastering, depending on the chipset design.
You are right. For example, on the 443BX all PCI transfers are routed to the UMB area (if the corresponding PAM registers are activated, of course). But for ISA (and thus ISA DMA) transfers, the PIIX4 treats some regions as ISA-only and the traffic never leaves the PIIX4 to reach the 443BX (plus some regions are switchable).
P.S. You reminded me of an old story about the implementation of hardware UMBs on the AMD Athlon (and the EV6 bus). It was a real surprise for apple_rom, me, and Uwe Sieber (in that order of seeing the code).
“Yes, EISA was capable of doing third-party DMA at full bus speed. And I totally agree that EISA was a really solid improvement over ISA while retaining the compatibility that MCA ditched, and even the slots were backward compatible.”
MCA was better off without ISA, due to the ISA edge-triggered interrupts being incompatible with the IBM level-triggered interrupts. EISA was NOT fully backwards compatible with ISA, due to the ISA edge-triggered interrupts. Using all EISA cards allows interrupt sharing and level-triggered interrupts on EISA, but drop ONE ISA card in there and the benefits of EISA superiority disappear.
Surprising what IBM did with MCA, depending on simple [non-volatile] logic to make MCA come to life. Now we have massive amounts of memory, gigabit Ethernet, WiFi….
There is no doubt that MCA was technically far superior to ISA. In the end that wasn’t enough, but I doubt there was anything the people designing MCA could have done about it.
EISA (and PCI) does not work the way you describe. There’s a special register called ELCR (Edge/Level Configuration Register) which switches individual IRQ levels between edge-triggered and level-triggered. In an EISA (or PCI) system, it is thus easy to mix adapters with edge- and level-triggered interrupts.
EISA made the connectors more complex and expensive, but VLB just added an extra connector on top of ISA, which made the cards longer.
“Intel made a few southbridges with EISA support”
The PCEB/ESC was a two-chip solution, while the SIO/SIO.A was a single chip.