A recent post explored the motivation (i.e. backwards compatibility) behind the A20 gate in the IBM PC/AT. To recap, the problem IBM solved was that 1MB address wrap-around was an inherent feature of the Intel 8086/8088 CPUs but not of the 80286 and later models, yet a number of commercial software packages intentionally or unintentionally relied on the wrap-around.
Interestingly, the address wrap-around was much better known and understood in 1981 than it was in the 1990s. For example, in 1994 the usually very well informed Frank van Gilluwe wrote in Undocumented PC (page 269): “A quirk with the 8088 addressing scheme allowed a program to access the lowest 64KB area using any segment:offset pair that exceeded the 1MB limit. […] Although there is no reason for software to ever use this quirk, bugs in a few very old programs used segment:offset pairs that wrap the 1MB boundary. Since these programs seemed to work correctly, no actions were taken to correct the defects.”
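The quirk van Gilluwe describes is plain address arithmetic. A small Python model (purely illustrative; the function name is my own) shows why a pair like FFFF:0010 behaves differently on a CPU with 20 address lines versus one with 24:

```python
def phys_addr(seg, off, addr_bits=20):
    # Real-mode physical address: (segment << 4) + offset,
    # truncated to the width of the CPU's address bus.
    return ((seg << 4) + off) & ((1 << addr_bits) - 1)

# On an 8086/8088 (20 address lines), FFFF:0010 wraps around to zero:
assert phys_addr(0xFFFF, 0x0010, addr_bits=20) == 0x00000

# On an 80286 (24 address lines), the same pair reaches past 1MB:
assert phys_addr(0xFFFF, 0x0010, addr_bits=24) == 0x100000
```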
Yet it is known that Tim Paterson quite intentionally used the wrap-around to implement CALL 5 CP/M compatibility in QDOS around 1980, and Microsoft Pascal intentionally used it in 1981. In both cases there were arguably very good reasons for using the wrap-around.
Intentional or not, software relying on 8086 address wrap-around was out there and important enough that by the end of 1983, IBM had implemented the A20 gate in the upcoming PC/AT. But did they have to do that?
Let’s explore several what-if scenarios which would have preserved a high degree of compatibility with existing software without requiring the A20 gate.
The CALL 5 interface was a bit of a tough nut to crack because it was a clever (too clever?) hack to begin with. One option would have been to place an INT3 instruction at offset 5 (the one documented single-byte interrupt instruction), or with slight restrictions, use a two-byte interrupt instruction. That would have avoided the need for wrap-around. It might not have been nice but it would have been manageable.
The Pascal run-time was a much nastier problem. Existing variants of the start-up code might have been detected and patched, but it was reasonable to assume that modified variants were out there. It was also reasonable to assume that Microsoft Pascal was not the only piece of code guilty of such shenanigans. The bulletproof solution would have been simple but probably unpalatable—force all or some applications to load above 64K. Any remaining free memory below 64K might still have been available for dynamic allocation. If this option was considered at all, it was likely thought too steep a price to pay for the sins of the past (that is, the 1981-1983 era).
A serious and quite possibly fatal shortcoming of software workarounds was that they required modified software. In an era where bootable floppies were the norm, and even third-party software was sometimes delivered on bootable floppies, a solution which did nothing for users’ existing bootable diskettes was probably considered a non-solution.
A Solution and a Problem
The A20 gate was easy to implement in the PC/AT because there were no caches to contend with. Simply forcing the output of the CPU’s address pin A20 to logical zero was all it took to regain the address wrap-around behavior of the 8086. The switch was hooked up to the keyboard controller, which already had a few conveniently available output pins.
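Note that masking the A20 line is not the same as truncating addresses to 20 bits; only bit 20 is forced low. A quick Python sketch (my own model, not a hardware description) captures the distinction:

```python
A20_BIT = 1 << 20

def gate_a20(addr, a20_enabled):
    # With the gate disabled, the A20 line is forced to zero, so bit 20
    # of every physical address is masked off. Addresses with bit 20
    # already clear (such as exactly 2MB) pass through unchanged.
    return addr if a20_enabled else addr & ~A20_BIT

assert gate_a20(0x100000, False) == 0x000000   # 1MB wraps to 0
assert gate_a20(0x200000, False) == 0x200000   # 2MB is unaffected
```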
The implementation was clearly an afterthought; in the PC/AT days, DOS didn’t know or care about the A20 line, and neither did DOS applications. DOS extenders weren’t on the horizon yet. There was no BIOS interface to control the A20 gate, but IBM probably didn’t think there needed to be one—the INT 15h/87h interface to copy to/from extended memory took care of the A20 gate, and the INT 15h/89h interface to switch to protected mode also made sure the A20 gate was enabled. Everyone else was expected to run with the A20 gate disabled.
OEMs like HP or AT&T similarly didn’t think that A20 gate control was something important and devised their own schemes of controlling it, incompatible with IBM’s. There was no agreement across implementations on what the A20 gate even does—some really masked the A20 address line (PC/AT), some only mapped the first megabyte to the second and left the rest of the address space alone (Compaq 386 systems), and others implemented yet other variations on the theme. The A20 gate effects are likewise inconsistent across implementations when paging and/or protected mode is enabled.
The real trouble started around 1987-1988, with two completely unrelated developments. One was the advent of DOS extenders (from Phar Lap, Rational, and others), and the other was Microsoft’s XMS/HIMEM.SYS and its use of the HMA (High Memory Area, the first 64K above the 1MB line). In both cases, there was a requirement to turn the A20 gate on in order to access memory beyond 1MB, and to turn it off again to preserve compatibility with existing software relying on wrap-around (which, thanks to EXEPACK, had only proliferated since the PC/AT was introduced).
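The HMA exists because with the A20 gate enabled, real-mode segment FFFFh can address almost 64K above the 1MB line. The arithmetic, sketched in Python:

```python
# With A20 enabled, segment FFFFh reaches just short of 64KB above 1MB.
# HIMEM.SYS exposes this region as the High Memory Area (HMA).
hma_start = (0xFFFF << 4) + 0x0010   # 0x100000, the 1MB line
hma_end   = (0xFFFF << 4) + 0xFFFF   # 0x10FFEF

assert hma_start == 0x100000
assert hma_end - hma_start + 1 == 65520   # 64KB minus 16 bytes
```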
The big problem which implementors of DOS extenders and HIMEM.SYS faced was that there was no BIOS interface to control the A20 gate alone, and no uniform hardware interface either. A lesser problem (but still a problem) was that switching the A20 gate on or off through the keyboard controller (KBC) wasn’t very fast, so even on systems which did provide a compatible KBC interface, any faster alternative was well worth using.
The solution was to do it the hard way. Some versions of HIMEM.SYS for instance provided no fewer than a dozen A20 handlers for various systems, and provided an interface to install a custom OEM handler. Around 1990, OEMs realized that this wasn’t workable and only hurt them, and stopped inventing new A20 control schemes. Effectively all systems provided PC/AT-style A20 gate control through the KBC, and typically also PS/2-style control through I/O port 92h (much faster than the KBC).
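The port 92h method is simple enough to sketch. The model below is my own simulation (a dict standing in for the I/O port space rather than real IN/OUT instructions), but the bit assignments match the PS/2 scheme: bit 1 controls the A20 gate, while bit 0 triggers a fast system reset and must never be set by accident:

```python
# Simulated I/O port space; real code would use IN/OUT instructions.
ports = {0x92: 0x00}

def set_a20_fast(enable):
    # Read-modify-write port 92h: toggle bit 1 (A20 gate), and make
    # sure bit 0 (fast reset) is never written back as 1.
    val = ports[0x92]
    val = (val | 0x02) if enable else (val & ~0x02)
    ports[0x92] = val & ~0x01

set_a20_fast(True)
assert ports[0x92] & 0x02         # gate enabled
set_a20_fast(False)
assert not (ports[0x92] & 0x02)   # gate disabled again
```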
There were complications for CPU vendors, too. Intel was forced to add the A20M# input pin to the i486—the CPU needed to know what the current state of the A20 gate was, so that it could properly handle look-ups in the internal L1 cache. This mechanism had to be implemented in newer CPUs as well.
Cyrix faced even greater difficulties with the 486DLC/SLC processors designed as 386 upgrades. These processors had a 1K internal cache (8K in the case of the Texas Instruments-made chips) and also implemented the A20M# pin, but they needed to work in old 386 boards which provided no such pin as well. That left only unpleasant options, such as not caching the first 64K of the first and second megabyte.
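One way to picture the Cyrix predicament: without an A20M# signal, the CPU cannot know whether external hardware is masking A20, so any memory that could alias under wrap-around must not be cached. A hedged Python sketch of that cacheability rule (my own reconstruction of the idea, not Cyrix documentation):

```python
def cacheable(addr):
    # Conservative rule for a 386 board with no A20M# pin: the first
    # 64KB of the first megabyte and of the second megabyte can alias
    # each other when external hardware masks A20, so neither region
    # may be cached.
    megabyte = addr >> 20
    offset_in_mb = addr & 0xFFFFF
    return not (megabyte in (0, 1) and offset_in_mb < 0x10000)

assert not cacheable(0x000000)   # low 64KB: uncacheable
assert not cacheable(0x100000)   # HMA region: uncacheable
assert cacheable(0x010000)       # rest of the first megabyte is fine
assert cacheable(0x200000)       # third megabyte is fine
```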
The life of system software was also quite complicated by the A20 gate. For example memory managers like EMM386 needed to run the system with the A20 gate enabled (otherwise on a 2MB system, no extended memory might be available!) but emulate the behavior software expected. EMM386 needed to trap and carefully track A20 gate manipulation through the KBC, as well as through port 92h. When the state of the A20 gate changed, EMM386 had to update the page tables—either to create a mapping for the first 64K also at linear address 100000h (A20 gate disabled), or remove it again (A20 gate enabled).
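The page-table trick described above can be sketched concretely. The model below is simplified (real EMM386 also had to flush the TLB and trap the port accesses themselves); it only builds the linear-to-physical mapping for the 16 pages at the 1MB line:

```python
PAGE = 4096

def a20_page_map(a20_enabled):
    # Build the linear->physical mapping for the 64KB (16 pages)
    # starting at linear 100000h, the way a memory manager running
    # with paging enabled might set it up.
    mapping = {}
    for i in range(16):
        linear = 0x100000 + i * PAGE
        if a20_enabled:
            mapping[linear] = linear        # identity: real HMA memory
        else:
            mapping[linear] = i * PAGE      # alias of the first 64KB
    return mapping

assert a20_page_map(False)[0x100000] == 0x000000   # wrap-around emulated
assert a20_page_map(True)[0x100000] == 0x100000    # HMA visible
```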
Hindsight Is 20/20
From today’s vantage point, it is obvious that IBM should have just sucked it up back in ’84: leave the A20 address line in the PC/AT alone and force software to adapt. That would have saved so much time, effort, and money for software developers and users over the subsequent decades. Complexity is expensive; that is an unavoidable fact of life.
It’s just as obvious that in 1984, the equation was different, and adding the A20 gate to the PC/AT was considered the lesser evil (relative to breaking existing software). Predicting the future is a tricky business, and back then, DOS extenders or the HMA probably weren’t even a gleam in someone’s eye. IBM likely assumed that DOS would be gone in a few years, replaced by protected-mode software which has no need to mess with the A20 gate (such as IBM’s own XENIX, released in late 1984).
By the time the A20 gate started causing trouble around 1987, reliance on the address wrap-around was much more entrenched than it had been in 1984, not least thanks to EXEPACK. At that point, the only practical option was to press on. There was no longer a company which could have said “enough with this nonsense”; hardware had to support existing software, and software had to support existing hardware.
Over time, the A20 gate turned into a security liability and modern CPUs deliberately ignore the A20M# signal in various contexts (SMM, hardware virtualization).
Many years and many millions of dollars later, here we are. DOS has been sufficiently eradicated that some recent Intel processors no longer provide the A20M# pin functionality. Some.
Even the folks at Intel clearly don’t understand what it’s for, and thus a recent Intel SDM contains amusing howlers like a claim that “the A20M# pin is typically provided for compatibility with the Intel 286 processor” (page 8-32, vol. 3A, edition 065). To be fair, other sections in the same SDM correctly state that the A20M# pin provides 8086 compatibility.
It is likely that over the next few years, the A20M# functionality will be removed from CPUs. Since legacy operating systems can no longer be booted on modern machines, it is no longer necessary. In an emulator/hypervisor, the A20 gate functionality can be reasonably easily implemented without hardware support (manipulating page tables, just like EMM386 did decades ago). Goodbye, horrors of the past.
You’re imagining some kind of completely separate 64-bit mode, but there isn’t one. You can still execute 32-bit instructions in 64-bit code, use 32-bit (or smaller) registers, and so on. Not supporting 32-bit code would save some complexity, but not a lot.
I mean really, if people want a clean, neat 64-bit architecture, AMD64 should be the very last thing to consider.
I would focus on getting rid of things like x86 segmentation and real mode first.
> I incorrectly remembered what software changed the A20 on every access. It was the DOS LAN Manager redirector
Most likely, lanman used HIMEM’s local enable/disable A20 functions, so if A20 was enabled before the call to the redirector (e.g. by MS-DOS), it stayed enabled.
I think there is a good chance that real/PM16 mode doesn’t survive the next decade if Intel really drops BIOS support in 2020, but IMO, 32-bit mode will be with us for a long time. On the other hand, Intel announced end of A20M# support in Haswell and it still kinda works in Skylake.
The BIOS code got turned into spaghetti because some application programmers chose to jump directly to the start of a function instead of using INT and a function number. Rod Canion went through a lengthy complaint about seeing those during the preliminary testing of Compaq against the 30 best seller list. IBM left some room between functions for patching, but it was not enough to handle all the needed code, leading to clever assembly trickery to squeeze out bytes, jumps to whatever addresses hadn’t been allocated yet, and the loss of occasionally used functionality.
A 64-bit only AMD/Intel chip would be about 5% smaller; might be even less if areas can’t be shuffled around to extract out the now dead silicon. Not really worth it to me since the transition will likely cause new bugs to be discovered in programs that bent the rules slightly.
I think that while it is well known that legacy BIOS has specific addresses for the INT routines etc that must be supported for compatibility, that is a relatively small problem and not what the other posters are referring to.
That should be the case with DOS 5+ loaded high. Not on DOS 4.01 or 3.x though.
The direct jumps to specific addresses (which I think IBM very explicitly asked programmers not to do) were an annoyance for sure, but not a huge one. It’s more the support for power management, USB, CD-ROM booting, PCI, multiple processors, various chipsets and CPUs, and so on and on that caused the originally neat code to turn into a big mess.
Totally agree about the 64-bit only chips; I suspect 5% smaller is a very optimistic estimate actually. These days CPUs are mostly huge caches, memory controllers, and complex systems-on-a-chip, such that the actual x86 instruction decoding is just a tiny part of the functionality.
Kinda works? Can you elaborate? Haven’t tried anything which manipulates the A20 gate on a Skylake…
What I’ve always wondered is how many non-IBM compatible 286+ machines have the A20 gate. I suspect that things like Sun386is and SGI VW320s probably don’t have it. No need for them, as they only ran 32 bit PM OSes. PC-98s have it (method similar to “fast A20” poke but different), and I’d be damn surprised if FM-R/FM Towns and friends don’t have it, given that they probably had EXEPACKed executables floating around.
I wonder how much faster would the 32-bit 386 implementation of FAT32 in MS-DOS 7.1 be over PC-DOS 7.1’s 16-bit FAT32 implementation.
@Necasek^0: Yes, metoo figures that there’s a good chance it will be
@Necasek^1: It takes a true master to be good at both hardware design
and programming; thus, in general, me’d have to agree.
@Necasek^2: no, medidn’t try to read the `efi’ source. Mewas speaking
from more general experience.
@dosfan: Some might not identify C as quite so high-level 🙂
@Necasek^3: Yeah, a pile of rotting crap, just like the rest of the IBM
@Necasek^4: Menever liked amd64, either. Isn’t it incredible how vendors
keep pushing kludges when the need to move on to something
cleaner has long been obvious?
@Yuhong Bao: Or just make it possible to run native RISC programs, w/o
the i86 emulator shoved in front.
@Vlad Gnatov: If the move to ARM really will work out it’ll prolly be
one of those times where the old stuff (i.e., i86) is
immediately obsolete. But we’ll see.
I really wouldn’t think a Sun386i would need one. PC compatibility was done in V86 mode, and there the A20 gate behavior needs to be emulated through paging anyway.
Only systems that can run DOS software in real mode would need an A20 gate; the PC-98 is firmly in that category I suppose.
You might take the success of the AMD64 architecture as an indication that such kludges are exactly what the market wants. It’s not like Intel didn’t try with IA-64, or Digital with Alpha AXP.
Success? In financial terms, maybe… which, as a technical type,
medoesn’t really care that much about.
> Kinda works? Can you elaborate? Haven’t tried anything which manipulates the A20 gate on a Skylake…
It worked, but I’m not sure if the CPU had A20M# support; perhaps it was some firmware magic, like mapping the 1st MB into the 2nd and flushing the CPU caches.
The PC-9801 has A20 support and the firmware checks it. The FM Towns might, but if so it’s undocumented; the firmware and HIMEM.SYS don’t check for it. The Towns, unlike the PC-9801, only had 386+ CPUs.
techfury90: An A20 gate would be needed on 286+ systems that for some reason can run third party code written for previous 8088/8086 systems (or 80188/80186 systems), where the manufacturer has no way of forcing all third party vendors to do free updates, or had previously forced them to never use negative offsets.
Yuhong Bao: Whether the FAT32 implementation is 32-bit or 16-bit code probably doesn’t matter much, as RAM is generally many times faster than hard disks. It’s only if you use a RAM disk, or if you install a 15-years-newer IDE/PATA hard disk in an old 386 computer, that it might really matter.
The reason vfat.386 in Windows for Workgroups 3.11 is so much faster than MS-DOS is that it can take up more space. Even though SMARTDRV can use a large amount of memory for cached data, the code for figuring out what to cache still has a size limit. Also, vfat.386 can do other things in a smarter way afaik, but that’s not due to technical differences between 16-bit and 32-bit code per se, but a difference in how Windows 3.x and DOS treat different kinds of executable binaries.
I wonder how much faster vfat.386 was because it used “32-bit disk access” and the vendor drivers may well have been much better than the BIOS, supporting things like faster data rates or multi-sector transfers. Better caching also could have made a huge difference.
It’s interesting how much of an obvious evolutionary step WfW 3.11 was between Windows 3.1 and Windows 95.
How about the original, non-AT-compatible Apricot Xen? It used a 286, and supported more than 1M.
I wonder if it avoided the A20 gate, as I believe it had some reserved location (Font memory / Graphics RAM) in low memory locations.
Yes, with the devolution setting in ’round the time of the latter.
YMMV, of course.
I remembered that I’d recently found again the Xen spec / tech docs (codename Candyfloss). According to that, it also had an equivalent of the A20 gate:
CPU CONTROL PORT
This is located at I/O address 0CE0H:
bit   7   6   5   4   3   2   1   0
      X   X   X   X   X   |   |   |
                          |   |   +--- 0 = Boot ROM present
                          |   |        1 = Boot ROM mapped out
                          |   +------- 0 = Hard Reset
                          |            1 = Soft Reset
                          +----------- 0 = Full Memory
                                       1 = 1 Mbyte wraparound
D0-D2 are all writable; D2 cannot be read back but D1 and D0 can.
and later when referring to the CPU reset port it says:
Before performing a soft reset it is advisable that:
1. The boot ROM is enabled.
2. Full memory is switched in, to accommodate the CPU reset
vector at 16 Mbytes.
(1) and (2) are functions of the CPU control port.
The Apricot Xen was a “DOS compatible” machine much like the NEC PC-98. So not surprising it had A20 control as well, to solve the same problem.
Well, Windows 3.1 had “32-bit disk access” and 3.11 added “32-bit file access” (=vfat.386).
You could switch these two on and off separately.
But the big performance gain is probably only there when used together.
My memory says that these never worked well out of the box if your disk was larger than 528MB, at least unless you installed some third party driver.
Windows 3.1 32-bit disk access was just a cute name for the Win3.x VMM’s built-in IDE/ATA disk controller port driver, which is a full 32-bit disk driver VxD and can drive the disk controller directly without DOS/BIOS assistance. In Win95 the storage stack had improvements and was layered better, and the hardware controller bits were separated out and became ESDI_506.PDR. Michael wrote a full article about this, which is available here in an older post.
Of course, there are also vendor-supplied versions of this disk driver besides the built-in Windows one. One is Ontrack’s ONTRACKW.386, provided with the DiskManager utilities from various disk vendors, and the most compatible/least buggy one if you ask me, as Ontrack provided updated versions of the file as late as 2003, so it supports all modern ATA characteristics and proper LBA/CHS translation. The other one is Western Digital’s DiagTools MH32BIT.386, coded by Microhouse. Both drivers can be used without overlays installed, but if you install the overlay, you will have to use the version corresponding to the package you’re using (EZDrive or DiskManager). There are also older and disk-specific drivers, such as WDCDRV.386 and SEG32BIT.386. I don’t recommend any of them, as they are old.
Note that if you have a SCSI adapter, you also need a VxD for it if you want to use 32-bit disk access. In Adaptec’s case there is one main FASTSCSI.386 file, and a smaller per-card AHA* or AIC* 386 VxD driver.
@raijinkai: Yes, I know that.
But those third party vendors weren’t just those typical software houses, but afaik also the hardware vendors (like HP or CMD or whoever were responsible for the drivers for an HP Vectra VL 5/75 with an 8xx MB disk).
It’s really a shame that Microsoft never produced any updated versions that could handle different geometry translations than the old max 528MB one.
Btw I guess the reason behind the separation into ESDI_506.PDR and similar in Win9x was that the non-hardware-specific part actually did something more than some simple API glue. I assume that the non-hardware-specific part handled all kinds of geometry translations in Win9x.
Btw I still find it strange that Win9x dropped the support of creating permanent swap files.
3rd party Windows 9x IDE controller drivers (Promise cards, SATA cards, etc.) all tended to use the SCSI miniport driver model and thus appeared as SCSI controllers in Device Manager. ESDI_506.PDR was way too buggy with newer hardware, even on supported chipsets. Not supporting 48-bit LBA was among its limitations. Plus it couldn’t deal with IDE controllers on “non-standard” ports (anything that wasn’t at 1F0 or 3F0). That’s why Intel had to add compatibility modes to the ICH5 to swap 2 of the 3 onboard controllers in to the legacy primary/secondary IDE ports.
While multi-sector IO and PM drivers certainly didn’t hurt, the main advantage of “32-bit disk access”, IMO, is a good _dynamic_ cache. The ability to scale the cache size up to all available free memory, and down to almost nothing in case of memory pressure, makes it better than any existing DOS cache software.
And as the cache+32-bit PM filesystem and the 32-bit IO drivers can be turned on/off individually, it should be easy to measure performance for all four combinations (and perhaps eight combinations if we add smartdrv or no smartdrv as another factor).
It’s just that someone needs to actually get around to doing it. Seems we are stuck here 🙂
All Microsoft had was WDCTRL.386, which was adequate in the Windows 3.1 days. Big IDE disks came out just a little too late for Microsoft to care about them being used with Windows 3.1.
Permanent swap files made good sense in the Windows 3.1 days because that way Windows could entirely bypass the file system (and hence, DOS). In Win9x days the FS was part of the OS so there wasn’t nearly as much benefit.
PC Magazine did evaluations of the various options of 32-bit disk and 32-bit file access. See, for example, Jul 1995 page 194.
The semi-permanent swap file (with minimum and maximum values set) made the most sense for Win9x since it kept the system from bogging down to create the initial swap region while providing room to grow if necessary.
“better than any existing DOS cache software” … there were dozens, but how many are left? (FreeDOS only has two, of varying degrees of suitability.) If you can’t even find it anymore (legally), it’s less useful overall, IMO.
(Also, NWCACHE from DR-DOS 7.03 was very limited in maximum size, IIRC, so FreeDOS is at least better than that.)
Besides, I doubt anybody’s workload is still based upon Windows just for speed alone. “Well, I would use DOS, it does literally all I need, but it’s so slow!” (Yeah right. I think speed is the least worrisome problem.)
(I know we’re reminiscing about the old days, just saying, even ancient Windows gets too much credit.)
Back in the 1990s I went through Super PC-KWIK (bundled with DR-DOS), SMARTDRV, NWCache (bundled with Novell DOS), Norton Cache, and maybe one or two others. They all basically worked and gave a good performance boost, but there were big differences in configuration, memory usage, and capabilities. For example SMARTDRV worked with Windows well and did a good job, but could not be dynamically unloaded and used relatively large amount of conventional/UMB memory.
As an aside, WDCTRL makes installation checks which may fail on drives implementing ATA-2 or later. WDCTRL checks various command completion register states which are not defined in newer ATA specifications. Obviously WDCTRL pre-dates even the final ATA-1 standard (1994).
Thanks for the PC Mag pointer. Here’s a short summary of their benchmarks: 32-bit disk access alone makes no difference (from what I remember, it might have made a difference when swapping… but then performance is out of the window already). 32-bit file access alone makes a sizable difference. 32-bit file access together with 32-bit disk access is even better.
@zeurkous: technical type or not, it is a success because economies of scale make it possible for you to buy much more advanced equipment than you otherwise would, and because the financial success of things like the amd64 ‘architecture’ funds the development of technology.
Stuff just doesn’t magically appear, and success in the marketplace is simply required for being able to develop additional technology, so as a tech type you should actually care about it at least somewhat, you wouldn’t be able to have your nice tech toys without it.
[didn’t notice Bart’s comment until now…]
That’s exactly what’s wrong w/ the “market”. We’re being forced to deal
with stuff that’s both plain unnecessary and an active impediment to
doing our jobs well.
Merealizes this is turning into a political argument, and thus mewill
shut up =)
Inertia is one of the most powerful forces in the universe.
Exactly the reason why one should be very careful when propagating new
While hindsight is 20/20, I do wonder if there could have been an easier path back then. In the ideal case, in 1987 when the first problems were seen, OEMs should have centered on one interface (as also happened), but at the same time agreed and announced that this would be the last generation of A20 support, and that the 80486 certainly wouldn’t support it (requests to the KBC or PS/2 port would simply be ignored). This would have given end-users and software vendors some years to adapt and update software. Almost all of the hardware problems and knock-on software problems started occurring with the 80486, and ballooned further due to the interaction with future CPU extensions. Thus, almost all the costs and problems of A20 would have been avoided.
I would argue that even in a less ideal case where no one foresaw the 80486 problems in advance, by the time Intel designed the 80486, they should have stepped up and never implemented A20M#, and the same would have been true, with minimal difficulties.
But what about those insisting on running old (e.g. two generations behind) DOS / ancient software on their 80486? They wouldn’t necessarily be left in the dark. The A20 behavior could have been emulated in programs like EMM386 shipped in operating systems made for the new CPU, or perhaps in simpler, dedicated TSRs for legacy DOS. One could imagine a relatively simple TSR “A20FIX.EXE” that could run even on old DOS. It would have no other function than moving the CPU to protected+paging mode with the 8086 memory layout, and could be run before using legacy software. It wouldn’t have to trap or emulate the KBC or PS/2 layout, since the legacy software would by definition not know of those. Likewise, it wouldn’t have to emulate XMS, EMS, etc. And in the ideal case, users and ISVs would have had a whole generation to adapt and ship updates to those stubborn users running two-gen-old software on their shiny 486s. And for those who could neither get updates nor use A20FIX.EXE, there would be the option of using an older machine.
The original IBM A20 hack for the 80286 had its merits, but I think supporting the 8086 wrap-around in hardware for the 80486 and onwards was absurd. I suspect it was more of a cultural/political thing at Intel (NEVER break backwards compatibility) than any practical reason.
You’ve got the chronology slightly wrong. The first problems were seen not in 1987 but in 1983-84, when the IBM PC/AT was being designed. The AT had no V86 mode and no ability to run real mode programs in protected mode (sure, it was done, eventually… several years after the AT came out). I’m sure IBM was aware how many problems not implementing address wraparound would have caused, and it was safe to assume it would have caused some problems they didn’t know about. Needless to say, when designing the PC/AT, IBM wanted an architecture suitable for 1985, not 1995. Adding the A20 gate control was a very simple fix with very minimal hardware cost. The same simple fix also applied to the 386.
When the 486 came out (1989), DOS was far, far from two generations behind — that’s what the vast majority of PCs ran at the time! While you could blame many companies for that, Intel had no control over it. The problem was arguably worse because now more code relied on address wraparound (EXEPACK!).
Sure, EMM386 could have worked around the problem, but Intel was not in a position to hand out EMM386 to users. The choices Intel had with the 486 were a) not supporting A20 wraparound and have the shiny new CPU fail to run a number of existing applications, or b) suck it up, add a few transistors, and support A20 wraparound and avoid compatibility headaches. It’s not hard to see why Intel didn’t go with a).
The real trouble with the A20 wraparound was not the original design but rather all the incompatible hardware implementations, coupled with the fact that there was no BIOS A20 gate control. IBM did not add a separate BIOS function but rather built it into all the relevant routines (INT 15h/87h, INT 15h/89h).
Sure, we can blame IBM in 1983-84 for failing to realize that ten years later, PCs would be still stuck running DOS or DOS-based environments. That was certainly not IBM’s plan or desire, and I doubt most people at the time would have found it believable.
The big A20 mess was the result of numerous people/companies making very reasonable decisions given the constraints they had to work with at the time the decisions were made. Evolution often works that way.