The History of a Security Hole

Posted on August 31, 2018 by Michal Necasek

Warning: If you do not care for the finer points of x86 architecture, please stop reading right now—in the interest of your own sanity.

A while ago I was made aware of a strange problem causing a normal user process running on 32-bit i386 OpenBSD 6.3 to crash the OS (i386 only, not amd64). The problem turned out to be a security hole with a history that goes back more than three decades.

The crashing code looked like it didn’t really have any business crashing, but the CPU was in a very odd state with an inaccessible kernel stack and GDT (that’s extremely unhealthy because any exception or interrupt then causes a triple fault and CPU shutdown).

After much head scratching, I noticed that the (virtual) CPU’s A20 gate was off. That’s a big no-no because when the CPU is in protected mode, turning the A20 gate off has very nasty, unpredictable, and system-specific consequences. It’s one of those Just Don’t Even Try That things. But could a user process really turn off the A20 gate? That makes no sense.

As it turns out, a user process really could do that on i386 OpenBSD 6.3 (again, i386 only, not amd64). A security hole allowed regular user processes to read and write many I/O ports, which is obviously very unhealthy. The chain of events that led to this is long, and probably the biggest player in it is Intel, with important contributions from NetBSD and OpenBSD developers. Thanks to the nature of open source, we can trace back exactly how it came to be, and perhaps even learn a thing or two from the mistakes.

Exposition, Intel Lays a Trap

When the 80286 was released in 1982, it introduced support for hardware task switching, something which, in certain circles, was in vogue in that era. The basic state of a task was held in a Task State Segment, or TSS. The TSS records the register state of an inactive (“switched away”) task, and also specifies the stack to use when switching to a ring with higher privilege (for that reason, every typical protected-mode OS must have a valid TSS).

John Crawford, one of the main 80386 designers, described the 286/386 task switching as “miles of microcode” which “never did work out quite right”, a very realistic assessment of the feature. But it’s baked into the x86 architecture, and TSSs are necessary even when hardware task switching isn’t used (see the AMD64 architecture—no hardware task switching, but TSSs are still a necessity).

When the 80386 first became available in silicon in 1985, the TSS was trivially extended (relative to the 286) to support 32-bit registers and also hold the task’s copy of the CR3 register (which massively complicated task switching, but that’s a different story).

In mid to late 1985, someone—likely Compaq and/or Microsoft—convinced Intel to add a permission bit map for I/O port access, allowing the OS to trap certain port accesses but allowing others to proceed at full speed; it is known that the permission bit map was not part of the original 386 specification, and there is no mention of it in the original 80386 datasheet (October 1985, Intel order no. 231630-001). The added level of granularity was very useful for V86 mode, and the feature was utilized by Compaq’s CEMM as early as 1986. Note that the I/O permission bitmap applies to every protected mode task (with a 386 TSS), not just V86 ones; the caveat is that for V86 tasks, the permission bit map is consulted for every I/O port access, and for non-V86 tasks only if CPL is numerically greater than IOPL (that is, when I/O would be otherwise not permitted).

Intel decided to place the I/O permission bit map (IOPB) in the TSS, providing per-task I/O privileges. But because it was tacked onto an existing design with a bit of chewing gum, some wires were inevitably left sticking out. The last DWORD (32 bits) of a TSS was originally specified to contain just one bit indicating whether a debug breakpoint should trigger when switching to the task/TSS; this was bit zero. Intel redefined the high 16 bits of the last DWORD to contain a 16-bit offset to the IOPB within the Task State Segment.
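The layout of that last DWORD is simple enough to sketch in C. The following is only an illustration of the encoding described above; the names TSS_DEBUG_TRAP, tss_last_dword() and tss_iopb_offset() are made up for this example and do not come from any OS source.

```c
#include <stdint.h>

/* Bit 0 of the last TSS dword: debug trap on task switch. */
#define TSS_DEBUG_TRAP  0x00000001u

/* Build the last dword of a 386 TSS from an IOPB offset and the trap bit. */
static uint32_t tss_last_dword(uint16_t iopb_offset, int debug_trap)
{
    return ((uint32_t)iopb_offset << 16) | (debug_trap ? TSS_DEBUG_TRAP : 0u);
}

/* Extract the IOPB offset (high 16 bits) from the last TSS dword. */
static uint16_t tss_iopb_offset(uint32_t last_dword)
{
    return (uint16_t)(last_dword >> 16);
}
```

An OS that stores the whole DWORD at once has to remember the shift by 16; storing an unshifted offset would leave the IOPB offset at zero.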

The size of the IOPB was not explicitly specified in the TSS, only its starting address was; the IOPB size was implied by the size of the TSS itself (the size of the segment is recorded in the GDT, or Global Descriptor Table). In other words, the IOPB started at the given offset and continued until the end of the TSS, or until all 65,536 possible I/O ports were covered by the bit map. That allowed the OS to place its own data structures between the end of the fixed part of the TSS and the beginning of the IOPB. Any of the 64K I/O ports not covered by the IOPB is automatically considered not accessible; a full IOPB is 8KB in size, which was memory worth saving in systems with limited memory (1-2MB), a handful of I/O ports at the beginning of the range, and potentially lots of Task State Segments.

AMD’s documentation says the IOPB offset must be 68h or more (68h is the size of the fixed TSS portion) to be valid, so that it wouldn’t overlap the fixed TSS portion. While that’s perfectly sensible, Intel’s documentation makes no mention of such a restriction, and in fact Intel CPUs allow the IOPB to start at offset zero in the TSS, leading to “interesting” results if the OS designer is not careful.

That sounds tricky enough, but of course Intel didn’t stop there. I/O port access can be 1, 2, or 4 bytes wide and therefore 1, 2, or 4 bits in the IOPB are considered for each access. Because port accesses may be unaligned, the CPU may need to read 2 or 4 bits crossing a byte boundary when evaluating the IOPB. Likely because the design was an afterthought and there was not enough room for more complex microcode, this fact was exposed to the user in that the CPU always reads two bytes from the IOPB. For that reason, Intel requires the IOPB to end with one padding byte with all bits set (i.e. “access not allowed”). That way there is always valid data to read. It is not documented what exactly happens when this requirement is not satisfied (i.e. the last byte of the IOPB does not have all bits set); as one might perhaps expect, if the padding byte is zero, word or dword port I/O is allowed to cross into otherwise inaccessible area.
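The check described above can be modeled in a few lines of C. This is a simplified sketch of the behavior as the text describes it, not a transcription of Intel’s microcode or pseudo-code; the function name iopb_allows() and its parameters are invented for illustration, and the model ignores port numbers that wrap past 0FFFFh.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * tss      points at the first byte of the TSS
 * limit    is the TSS segment limit (offset of the last valid byte)
 * iopb_off is the 16-bit IOPB offset stored in the TSS
 * width    is the size of the port access in bytes: 1, 2, or 4
 */
static bool iopb_allows(const uint8_t *tss, uint32_t limit,
                        uint16_t iopb_off, uint16_t port, unsigned width)
{
    assert(width == 1 || width == 2 || width == 4);

    uint32_t byte_off = (uint32_t)iopb_off + port / 8;

    /* Two bytes are always fetched; both must lie within the TSS limit. */
    if (byte_off + 1 > limit)
        return false;

    uint16_t bits = (uint16_t)(tss[byte_off] | (tss[byte_off + 1] << 8));
    uint16_t mask = (uint16_t)(((1u << width) - 1u) << (port % 8));

    /* A set bit means "access not allowed"; every tested bit must be clear. */
    return (bits & mask) == 0;
}
```

With a two-byte bitmap followed by an FFh padding byte, a byte access to port 15 is allowed but a word access to port 15 is not, because its second bit comes from the padding byte. If the padding byte is zero instead, the same word access goes through.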

To make things even better, this subtlety (requirement for an extra padding byte with all bits set) was not documented at all in the widely read original 80386 PRM (1986). It was documented in the April 1986 datasheet for the 80386 (Intel order no. 231630-002), with a good level of detail, but software developers clearly didn’t always think of looking there. The padding byte requirement was properly documented in the 386SX PRM (1989) and subsequent Intel programming references.

Note that sandpile.org claims that only the lowest three bits of the last byte need to be set. That makes perfect sense because at most four permission bits need to be checked at once (dword-sized I/O), with the lowest bit being the last bit of the IOPB proper and a worst case of three bits spilling over to the padding byte. In other words, the high five bits of the padding byte are never considered.

The original 386 PRM from 1986 was in fact not just incomplete but flat out wrong: “For example, if TSS limit is equal to I/O map base + 31, the first 256 I/O ports are mapped”—implying that 32 bytes are needed to map 256 ports, when in reality 33 would be needed. The updated 1989 reference read: “For example, if the TSS segment limit is 10 bytes past the bit map base address, the map has 11 bytes and the first 80 I/O ports are mapped.” The updated text is clear in requiring the extra padding byte.
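The arithmetic behind those two quotes is easy to check. The sketch below assumes, as the updated documentation does, that the byte at the TSS limit is the padding byte and carries no permission bits of its own; iopb_ports_mapped() is a hypothetical helper written for this example.

```c
#include <stdint.h>

/*
 * Number of I/O ports covered by real permission bits, given the IOPB
 * base offset and the TSS limit (the offset of the last valid byte).
 * The byte at the limit is the padding byte, so only the bytes before it
 * count.
 */
static uint32_t iopb_ports_mapped(uint32_t iopb_base, uint32_t tss_limit)
{
    if (tss_limit <= iopb_base)
        return 0;                       /* no room for bitmap plus padding */

    uint32_t ports = (tss_limit - iopb_base) * 8;
    return ports > 65536 ? 65536 : ports;
}
```

A limit of base + 31 yields 248 ports, not 256; base + 32 yields the full 256; and base + 10 yields 80, matching the 1989 text.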

Interestingly, an Intel memo from January 20, 1986—which first described the I/O permission bit map—read as follows: “For example, setting the TSS limit to {BitMapBase + 32} will allow bit mapping the first 256 I/O ports”. The text in the public PRM was similar, only wrong because 32 somehow turned into 31. Perhaps Intel documentation writers were confused by the difference between segment sizes and segment limits just like everyone else, or perhaps the documentation was written before the implementation was fully completed and not corrected until several years later.

As an aside, additional chewing gum was applied in the Pentium when implementing V86 mode enhancements. For enhanced V86 tasks, the IOPB offset in the TSS also doubles as the end of the interrupt redirection bitmap, which is located in the 32 bytes immediately preceding the IOPB (32 bytes or 256 bits for 256 software interrupts). There is no user-settable flag to specify whether the interrupt redirection bitmap is present in a TSS or not; it’s always considered present when V86 mode enhancements are enabled in the CR4 register.

The Intel 386 has a documented erratum related to the IOPB offset field in the TSS (Intel order no. 272874-001 from July 1996 and later updates). The processor should refuse switching to any TSS with a limit less than 103 (67h), but in fact only refuses switching to a TSS with a limit less than 101 (65h). When encountering an I/O instruction, the 386 may attempt to read the IOPB offset and trigger a #TS fault if the TSS limit is not big enough. That is just another reminder that the IOPB was a last-minute addition to the 386 design.

Intel’s documentation is somewhat unclear on how to set up a TSS with no IOPB. Intel’s 386 documentation (1986) said: “If I/O map base is greater than or equal to TSS limit, the TSS segment has no I/O permission map, and all I/O instructions in the 80386 program cause exceptions when CPL > IOPL.” AMD’s documentation on the other hand says: “The bitmap can be located anywhere within the first 64 Kbytes of the TSS, as long as it is above byte 103.” In other words, if the IOPB offset is less than 68h on AMD CPUs, there is no IOPB. That is very logical because it keeps backwards compatibility with software written before the IOPB was defined, and avoids pathological cases when the IOPB would overlap the fixed TSS portion. On Intel CPUs, such pathological cases are not prevented and software which does not set the IOPB base may end up with an unexpected IOPB.

It is obviously not trivial to use the IOPB correctly. In addition, an incorrectly set up IOPB is unlikely to cause obvious problems, but may disallow access to desired ports or (much worse, security-wise) allow access to undesired ports.

386BSD Sets the Scene

In the late 1980s, Bill Jolitz started porting BSD UNIX to the ubiquitous 386 architecture. 386BSD 0.0 came out in early 1992. Internal process data were held in struct pcb which was defined in src/usr/src/sys.386bsd/i386/include/pcb.h and looked like this (excerpted):

struct pcb {
    struct i386tss pcb_tss;
#ifdef notyet
    u_char pcb_iomap[NPORT/sizeof(u_char)]; /* i/o port bitmap */
#endif
    struct save87 pcb_savefpu; /* floating point state for 287/387 */
    struct emcsts pcb_saveemc; /* Cyrix EMC state */
/*
* Software pcb (extension)
*/
    int     pcb_flags;
    short   pcb_iml;     /* interrupt mask level */
    caddr_t pcb_onfault; /* copyin/out fault recovery */
    long    pcb_sigc[8]; /* XXX signal code trampoline */
    int     pcb_cmap2;   /* XXX temporary PTE - will prefault instead */
};

This structure definition is useful for understanding the subsequent story. It is notable that pcb_iomap (i.e. the IOPB) was not yet defined but would have been placed right after pcb_tss; that would have made struct pcb unsuitable for placing into a hardware TSS as is, because the IOPB needs to be at the end of a TSS (unless it covers all 64K ports).

It is also notable that the “software pcb” does indeed only contain software-defined items.

Subtle NetBSD Bug and a Landmine

In the mid-1990s, 386BSD turned into NetBSD (among other things).

In 1995, NetBSD developers rewrote the OS’s task management such that each process had its own TSS. One of the objectives was to allow tasks to have custom IOPBs, with selective I/O port access from user processes. Only the first 1024 ports could be opened up in this fashion. There was struct pcb which mapped to a TSS; it contained the fixed TSS portion, custom NetBSD fields, and an IOPB at the end. It looked like this (excerpted from src/sys/arch/i386/include/pcb.h):

struct pcb {
	struct	i386tss pcb_tss;
	int	pcb_tss_sel;
        union	descriptor *pcb_ldt;	/* per process (user) LDT */
        int	pcb_ldt_len;		/*      number of LDT entries */
	int	pcb_cr0;		/* saved image of CR0 */
	struct	save87 pcb_savefpu;	/* floating point state for 287/387 */
	struct	emcsts pcb_saveemc;	/* Cyrix EMC state */
/*
 * Software pcb (extension)
 */
	int	pcb_flags;
	caddr_t	pcb_onfault;		/* copyin/out fault recovery */
	u_long	pcb_iomap[1024/32];	/* I/O bitmap */
};

It is notable that pcb_iomap is now defined, and also that it seemingly moved into the “software pcb”, even though it’s not at all software-defined (this was presumably done to allow the entire structure to be placed into a TSS). That makes the existing “Software pcb (extension)” comment very misleading.

New Task State Segments were set up using the following code in src/sys/arch/i386/i386/gdt.c:

void
tss_alloc(pcb)
	struct pcb *pcb;
{
	int slot;

	slot = gdt_get_slot();
	setsegment(&dynamic_gdt[slot].sd, &pcb->pcb_tss, sizeof(struct pcb) - 1,
	    SDT_SYS386TSS, SEL_KPL, 0, 0);
	pcb->pcb_tss_sel = GSEL(slot, SEL_KPL);
}

The third argument to setsegment() is the new segment limit. The authors planted a well hidden landmine in making an implied connection between the size of struct pcb and the size of the corresponding hardware TSS. That is not even hinted at in the struct pcb definition, practically begging unsuspecting programmers to step on said landmine.

Sharp-eyed readers may have noticed that there’s something missing in struct pcb—the final padding byte required by Intel. The IOPB didn’t really cover 400h ports as the authors intended, but only 3F8h ports.

OpenBSD Bug Fix Creates a Different Bug

The problem with incorrect IOPB size was noticed and fixed in OpenBSD in May 2000. The updated struct pcb now looked like this:

#define	NIOPORTS	1024		/* # of ports we allow to be mapped */

struct pcb {
	struct	i386tss pcb_tss;
	int	pcb_tss_sel;
        union	descriptor *pcb_ldt;	/* per process (user) LDT */
        int	pcb_ldt_len;		/*      number of LDT entries */
	int	pcb_cr0;		/* saved image of CR0 */
	union	fsave87 pcb_savefpu;	/* floating point state for 287/387 */
	struct	emcsts pcb_saveemc;	/* Cyrix EMC state */
/*
 * Software pcb (extension)
 */
	int	pcb_flags;
	caddr_t	pcb_onfault;		/* copyin/out fault recovery */
	int	vm86_eflags;		/* virtual eflags for vm86 mode */
	int	vm86_flagmask;		/* flag mask for vm86 mode */
	void	*vm86_userp;		/* XXX performance hack */
	u_long	pcb_iomap[NIOPORTS/32];	/* I/O bitmap */
	u_char	pcb_iomap_pad;	/* required; must be 0xff, says intel */
};

The commit message read as follows:

Add an extra byte to the end of struct pcb and make sure that it is set to
0xff.  Intel (vol1 section 9.5.2) says that there must be a byte inside the
TSS after the iomap because it always reads two bytes when checking
permissions for io accesses.  before this, bits 1016-1023 were ignored.

This means that the entire pcb_iomap (and i386_*_ioperm) are accurate;
pr#1190 fixed

At first glance, the fix looks perfectly reasonable. Sadly, it’s not, because this is where the landmine planted in 1995 struck. Because struct pcb contains 32-bit members, the structure’s size is rounded up to a multiple of 32 bits by the C compiler, which appends three invisible bytes after the one-byte pad. Instead of fixing the IOPB to cover 400h I/O ports rather than 3F8h ports, the fix expands the IOPB size to cover 418h ports. Not only does it virtually ensure that the last byte of the IOPB will not have all bits set, it also opens up access to ports 408h-418h in an uncontrolled fashion.

That’s a potentially serious hole because there likely are important system ports in that range, and every process can likely access them (“likely” because the unintended final three padding bytes of a TSS are not explicitly initialized but are probably zeros, which would allow access).

This problem could be blamed on the C language and/or compiler for making “invisible” (but well understood) adjustments to structure sizes… or on programmers using the language incorrectly.
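The size adjustment is easy to demonstrate with a cut-down structure. The sketch below is not the OpenBSD definition; struct pcb_model and the two helper functions are stand-ins invented for this example, and the fixed TSS portion is represented by a plain array of 26 dwords (68h bytes).

```c
#include <stddef.h>
#include <stdint.h>

#define NIOPORTS 1024

/* Simplified stand-in for struct pcb; only the tail matters here. */
struct pcb_model {
    uint32_t fixed_tss[26];             /* 68h bytes, the fixed TSS portion */
    uint32_t iomap[NIOPORTS / 32];      /* I/O bitmap */
    uint8_t  iomap_pad;                 /* padding byte, must be 0xff */
};

/* What the 1995 code effectively computed: includes compiler tail padding. */
static size_t limit_from_sizeof(void)
{
    return sizeof(struct pcb_model) - 1;
}

/* A limit tied to the padding byte itself, independent of tail padding. */
static size_t limit_from_offsetof(void)
{
    return offsetof(struct pcb_model, iomap_pad);
}
```

On a typical ABI where uint32_t is 4-byte aligned, the two limits differ by three bytes. Deriving the limit from offsetof() of the padding byte would have kept the TSS limit where the programmer intended, no matter how the compiler rounds the structure size.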

NetBSD Independently Bitten by Same Bug

OpenBSD programmers weren’t the only ones running into the structure padding issue. NetBSD 4.x had the exact same problem, only a tiny bit worse. In version 4.0 (2007), their implementation of struct pcb looked completely sane, but wasn’t:

#define	NIOPORTS	1024		/* # of ports we allow to be mapped */

struct pcb {
	struct	i386tss pcb_tss;
	int	pcb_cr0;		/* saved image of CR0 */
	int	pcb_cr2;		/* page fault address (CR2) */
	union	savefpu pcb_savefpu;	/* floating point state for FPU */

/*
 * Software pcb (extension)
 */
	int	pcb_fsd[2];		/* %fs descriptor */
	int	pcb_gsd[2];		/* %gs descriptor */
	void *	pcb_onfault;		/* copyin/out fault recovery */
	int	vm86_eflags;		/* virtual eflags for vm86 mode */
	int	vm86_flagmask;		/* flag mask for vm86 mode */
	void	*vm86_userp;		/* XXX performance hack */
	struct cpu_info *pcb_fpcpu;	/* cpu holding our fp state. */
	u_long	pcb_iomap[NIOPORTS/32];	/* I/O bitmap */
};

That looks reasonable, except union savefpu contains struct savexmm, which has an __aligned(16) attribute for obvious reasons. Unfortunately, struct pcb just happened to have a natural size that was not even 8-byte aligned, so the compiler added 12 bytes of padding. That expanded the IOPB by 12 “invisible” bytes, and because those bytes are usually zeroed, numerous ports became accessible.

The actual consequence is that I/O ports in the range 400h-458h were open in NetBSD 4.0, accessible to any process. It is entirely possible that no one ever noticed. The problem no longer existed in NetBSD 5.0.

Exactly as in the OpenBSD case, the problem was directly caused by reliance on sizeof(struct pcb) in hardware-specific code that was entirely unprepared to deal with usual C structure padding. Unlike the OpenBSD case, the problem was far less obvious because it was caused by a structure inside a union inside a structure; a nice example of how a perfectly reasonable code change in one place causes problems in another, seemingly completely unrelated place.
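The same effect can be reproduced with a small model. This is a sketch under assumptions, not NetBSD’s code: struct fpu_model and struct pcb_aligned_model are hypothetical, the member list is shortened, and C11 _Alignas stands in for the __aligned(16) attribute. The member sizes were picked so that the tail padding comes out at 12 bytes as in the case described above.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for an FPU save area that demands 16-byte alignment. */
struct fpu_model {
    _Alignas(16) uint8_t area[512];
};

struct pcb_aligned_model {
    uint32_t fixed_tss[26];         /* 68h bytes, the fixed TSS portion */
    uint32_t cr0;
    struct fpu_model fpu;           /* raises the whole struct's alignment to 16 */
    uint32_t onfault;
    uint32_t iomap[1024 / 32];      /* I/O bitmap, meant to end the TSS */
};

/* Offset one past the last byte of the bitmap, i.e. the intended TSS size. */
#define IOMAP_END \
    (offsetof(struct pcb_aligned_model, iomap) + \
     sizeof(((struct pcb_aligned_model *)0)->iomap))

/* Bytes the compiler silently appends after the bitmap. */
#define TAIL_PADDING (sizeof(struct pcb_aligned_model) - IOMAP_END)
```

Nothing in the outer structure mentions alignment, yet its size is forced to a multiple of 16 by a member buried two levels down. A compile-time check such as _Static_assert(TAIL_PADDING == 0, "...") next to the TSS setup code would have turned the silent expansion into a build failure.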

OpenBSD Keeps Digging

In October 2007, the OpenBSD hole grew a little bigger. After further changes, struct pcb now looked like this:

#define	NIOPORTS	1024		/* # of ports we allow to be mapped */

struct pcb {
	struct	i386tss pcb_tss;
	int	pcb_tss_sel;
	union	descriptor *pcb_ldt;	/* per process (user) LDT */
	int	pcb_ldt_len;		/*      number of LDT entries */
	int	pcb_cr0;		/* saved image of CR0 */
	int	pcb_pad[2];		/* savefpu on 16-byte boundary */
	union	savefpu pcb_savefpu;	/* floating point state for FPU */
	struct	emcsts pcb_saveemc;	/* Cyrix EMC state */
/*
 * Software pcb (extension)
 */
	caddr_t	pcb_onfault;		/* copyin/out fault recovery */
	int	vm86_eflags;		/* virtual eflags for vm86 mode */
	int	vm86_flagmask;		/* flag mask for vm86 mode */
	void	*vm86_userp;		/* XXX performance hack */
	struct  pmap *pcb_pmap;         /* back pointer to our pmap */
	struct	cpu_info *pcb_fpcpu;	/* cpu holding our fpu state */
	u_long	pcb_iomap[NIOPORTS/32];	/* I/O bitmap */
	u_char	pcb_iomap_pad;	/* required; must be 0xff, says intel */
	int	pcb_flags;
};

There was another member added after the end of the IOPB, which meant that it inadvertently expanded the IOPB again by another 32 bits/ports. The actual value of pcb_flags determined which ports exactly would be accessible, but this time it was guaranteed some would be (because the value was never -1).

This bug cannot be blamed on the C language; it was clearly a programming error. However, it was greatly aided by the code written in the 1990s. Given the “Software pcb (extension)” comment before the last chunk of the structure, it is quite non-obvious that the end of the so-called “software pcb” is in fact hardware-defined.

Max Payne

By now we have a hazardous design from 1985, incomplete documentation from 1986, fishy code from 1995, subtly broken code from 2000, and less subtly broken code from 2007. Can it get worse? Well…

In March 2016 (for OpenBSD 6.0), the following commit message could be seen:

Delete i386_{get,set}_ioperm(2) APIs and underlying sysarch(2) bits.
They're no longer used by anything and should let us simplify the TSS
handling.

That sounds good, right? The entire IOPB can be dropped, no more open I/O ports. Well… as they say, the road to Hell is paved with good intentions. The updated struct pcb now looked like this:

struct pcb {
	struct	i386tss pcb_tss;
	int	pcb_cr0;		/* saved image of CR0 */
	caddr_t	pcb_onfault;		/* copyin/out fault recovery */
	union	savefpu pcb_savefpu;	/* floating point state for FPU */
	struct	segment_descriptor pcb_threadsegs[2];
					/* per-thread descriptors */
	int	vm86_eflags;		/* virtual eflags for vm86 mode */
	int	vm86_flagmask;		/* flag mask for vm86 mode */
	void	*vm86_userp;		/* XXX performance hack */
	struct  pmap *pcb_pmap;         /* back pointer to our pmap */
	struct	cpu_info *pcb_fpcpu;	/* cpu holding our fpu state */
	int	pcb_flags;
};

The IOPB is now completely gone. That is to say, it is gone from struct pcb, but not from the actual TSS. Because the code setting up the TSS includes the following line:

pcb->pcb_tss.tss_ioopt = sizeof(pcb->pcb_tss) << 16;

In other words, the OS is telling the CPU that there is an IOPB starting right after the fixed TSS portion (at offset 68h). What that means is that all the software-defined fields in struct pcb will be interpreted as an IOPB by the CPU. And there will be some, because the code setting the TSS limit still says

setgdt(slot, &pcb->pcb_tss, sizeof(struct pcb) - 1,
    SDT_SYS386TSS, SEL_KPL, 0, 0);

so the TSS will be plenty big.

Now, Intel made this easy, but the bug was in OpenBSD. The IOPB offset in the TSS should have been greater than the TSS limit (to cover both Intel and AMD CPUs).

The consequence of this bug is that in i386 OpenBSD 6.0, instead of removing the IOPB entirely, a much bigger and uncontrolled IOPB was created, guaranteeing access to many ports in the 0-11B8h range (in that particular OpenBSD version). Again, zero bits mean “access allowed”, and there will be lots of zero bits.

This range covers system ports including the legacy interrupt controller, DMA controller, timer, various system ports, VGA, IDE drives, PCI configuration space, and who knows what else. Every user process can read and write those ports.

That is, to put it mildly, not so great security-wise. If you have I/O port access to the PCI configuration space, you can, say, make sure that IDE or AHCI legacy access ports are accessible, read and write from disk, and perhaps even use DMA to read and write physical memory your user process has no business accessing.

Unrelated Changes

In the Spring of 2018, OpenBSD worked on Meltdown mitigations in the i386 kernels. These changes were unfortunately not ready for OpenBSD 6.3 and were temporarily reverted before the 6.3 release.

Among other things, the Meltdown patches abolished the per-process Task State Segments and used only one TSS per CPU. As a consequence, struct pcb no longer maps to a TSS at all.

Quick Fix

When the problem was reported to OpenBSD developers, it was very quickly fixed for OpenBSD 6.2 and 6.3. The actual fix is so simple that it can be quoted here in full:

Index: sys/arch/i386/i386/gdt.c
===================================================================
RCS file: /cvs/src/sys/arch/i386/i386/gdt.c,v
diff -u -p -u -r1.37 gdt.c
--- sys/arch/i386/i386/gdt.c	7 Mar 2016 05:32:46 -0000	1.37
+++ sys/arch/i386/i386/gdt.c	23 Jul 2018 23:53:28 -0000
@@ -210,7 +210,7 @@ tss_alloc(struct pcb *pcb)
 	int slot;
 
 	slot = gdt_get_slot();
-	setgdt(slot, &pcb->pcb_tss, sizeof(struct pcb) - 1,
+	setgdt(slot, &pcb->pcb_tss, sizeof(struct i386tss) - 1,
 	    SDT_SYS386TSS, SEL_KPL, 0, 0);
 	return GSEL(slot, SEL_KPL);
 }

The TSS limit is simply set to the required minimum, so there is no room for an IOPB and no possibility of incorrect permissions… as long as the IOPB offset is past the TSS limit, which it is. Now there is no IOPB and user mode applications cannot access I/O ports.
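The invariant the fix relies on can be written down explicitly. The following is a minimal sketch with invented names (TSS_FIXED_SIZE, struct tss_setup, tss_without_iopb(), tss_has_no_iopb()); it is not OpenBSD code, only an illustration of a minimal TSS whose IOPB offset lies just past the limit.

```c
#include <stdint.h>

#define TSS_FIXED_SIZE 0x68u    /* size of the fixed 386 TSS portion */

struct tss_setup {
    uint32_t limit;             /* segment limit for the TSS descriptor */
    uint16_t iopb_offset;       /* goes into the high word of the last TSS dword */
};

/* Minimal TSS with no IOPB: the offset is one past the segment limit. */
static struct tss_setup tss_without_iopb(void)
{
    struct tss_setup s;

    s.limit = TSS_FIXED_SIZE - 1;                   /* 67h, the required minimum */
    s.iopb_offset = (uint16_t)TSS_FIXED_SIZE;       /* 68h, just past the limit */
    return s;
}

/* True when no byte of the TSS can be interpreted as part of an IOPB. */
static int tss_has_no_iopb(struct tss_setup s)
{
    return s.iopb_offset > s.limit;
}
```

Checking this condition wherever the limit or the offset is computed would catch the case where one of the two values changes and the other does not.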

How Do Others Do It?

For the sake of completeness, it may be useful to check how other operating systems indicated (or still indicate) that there is no IOPB in a TSS.

In 386 Enhanced Mode Windows 3.1, for example, this is a non-issue because the IOPB covers all 64K I/O ports. The same is also true of Windows 3.0, EMM386 (at least in version 4.50), 386MAX 6.02, or Windows 9x.

Windows NT 3.1 sets the IOPB offset to equal the TSS size, i.e. one greater than the TSS limit. That is also what Windows 7 does (both 32-bit and 64-bit), as well as other NT derivatives like Windows 10.

OS/2 2.0 (and subsequent versions) uses 0DFFFh as the IOPB offset (using a TSS with minimal size of 68h bytes). That matches the note in Intel documentation (e.g. the 1990 i486 PRM) saying that “base address for I/O bit map must not exceed DFFF (hexadecimal)”; the note is still present in the current Intel SDM. It’s obvious that an IOPB covering all 64K ports cannot start beyond offset 0DFFFh and still fit within 64K (because it needs 8K + 1 padding byte), though it’s not at all obvious why that would be relevant if the TSS limit is 0EFFFh or less, for example, or why the IOPB couldn’t cross the 64K boundary.

At any rate, the OS/2 programmers at Microsoft/IBM weren’t the only ones reading the note in Intel’s documentation; for example Solaris 2.4 and Solaris 7 use the same 0DFFFh IOPB base.

In BeOS 5.0 (1999) or NetBSD 5.0 (2009), the IOPB offset is set to 0FFFFh which produces the desired effect (no IOPB), although it perhaps violates the note in the Intel SDM.

Literature Review

As with so many things related to the x86 architecture, there is a wealth of available literature, with numerous books contradicting each other, and often even contradicting themselves. That starts (but by no means ends) with Intel’s official documentation, as shown in the preceding paragraphs.

Hummel

As noted above, the extra IOPB byte with all bits set was not mentioned at all in Intel’s original 386 PRM, and that caused some authors to spin what appears to be utterly unfounded fiction. Robert L. Hummel’s PC Magazine Programmer’s Technical Reference: The Processor and Coprocessor (Ziff-Davis Press, 1992) claims on page 116 that “to improve processor efficiency in the case of unaligned ports, the logic of the 80386SX and 80486 processors was redesigned [relative to the 80386DX] to always fetch two bytes from the I/O permission bit map”, and goes on to say that “[…] the end of the I/O permission bit map must be padded with an additional byte. The byte must have the value FFh to provide compatibility with the 80386DX.” It also claims that the “80386SX and 80486 processors ignore the value of the pad byte and do not include it when calculating the limit of the I/O permission bit map. The 80386DX, however, does consider the byte significant.” It is possible that some heretofore unidentified CPUs do not use the actual padding byte value and consider it to contain FFh. But the claims make no logical sense: if the padding byte is needed so that the CPU could always read two bytes at once, why would its value not be significant? And obviously the claims directly contradict the 80386 (DX) datasheets which always documented the padding byte requirement. The text appears to be to a large extent simply made up.

Crawford & Gelsinger

On the other hand there’s Programming the 80386 (SYBEX, 1987) by John H. Crawford and Patrick P. Gelsinger. The authors were the 386 chief architect and a 386 designer, respectively. Pages 490-495 provide a very detailed treatment of the IOPB, including pseudo-code (significantly more fine-grained than what’s in official documentation).

Even such a book manages to include text that is at best highly questionable. For example it claims (p. 491) that “to access the bitmap as quickly as possible”, two bytes are always read, but does not explain how reading an unaligned word can be faster than reading a single byte (in case the second byte is not required).

Crawford and Gelsinger say that the IOPB “can be stored anywhere within the first 64K bytes of the TSS” and “can start anywhere in the first 56K of the TSS”, statements that are already somewhat contradictory (why not start at 60K and cover half the ports?). The pseudo-code description (p. 493) suggests that no such limitations exist, and the only restriction on the IOPB location comes from the fact that the IOPB offset in the TSS is 16-bit. That is to say, the pseudo-code allows an IOPB starting anywhere within the first 64K of the TSS and potentially extending past 64K.

Either the text or the pseudo-code given in Programming the 80386 must be wrong, and possibly both might be. Even so, the book’s description of the IOPB is much clearer and significantly more detailed than most.

Agarwal

Also relevant is 80x86 Architecture & Programming Volume II: Architecture Reference (Prentice Hall, 1991) by Rakesh K. Agarwal, another Intel engineer involved in 386 design (note that there is no Volume I). Agarwal’s pseudo-code is similarly detailed as Crawford & Gelsinger’s, yet distinctly different.

Agarwal states that the IOPB must “not exceed the maximum TSS limit of 0xFFFF”; there is no explicit explanation why the maximum TSS limit should be restricted in such manner (a TSS descriptor format should allow up to 4GB). However, the book also says that if the I/O permission bit map base is beyond 0DFFFh, “I/O permission checks may succeed when they should fail”. That does not say, but strongly hints, that the IOPB offset calculation may be done using 16-bit arithmetic on the 386 and if the IOPB is too close to 64K, the calculation might overflow and wrap around to the very beginning of the TSS. The pseudo-code on pages 120-121 of the book is unfortunately not clear on this point.

Conclusions

Over the course of years and decades, small errors and inaccuracies can mutate into bigger errors and even serious security vulnerabilities. The process is insidious because it is largely invisible. To summarize:

Incomplete or misleading documentation is dangerous; which also means that
Insufficient or misleading source code comments are dangerous
Complex, difficult to use hardware design is dangerous
Last minute design changes tend to cause unanticipated problems
Using programming languages without understanding their subtleties is dangerous
Whatever you don’t understand will eventually hurt you
Over time, minor errors can turn into major problems without anyone realizing

In the saga examined here, the bugs and security holes remained largely invisible. Correctly written software continued to work correctly, but malicious programs could find the door wide open.


44 Responses to The History of a Security Hole

Julien Oster says:

September 1, 2018 at 12:53 am

Using `sizeof(struct pcb) - 1` as the TSS limit may be unfortunate, but much more unfortunate is that struct pcb, which was obviously meant to have a fixed binary representation at least as soon as it contained the iomap, was not specified as packed (using `__attribute__((packed))` or whatever pragma was available at the time).

A lot of the trouble here could have been avoided by following the simple rule of never trusting the compiler to predictably lay out your data without explicitly telling it so, and the first fix attempt that added the padding byte would have been perfectly reasonable.

My guess is that not packing your structures was common when most C-based operating systems were 32bit only (and the 16bit world was fundamentally different), but fell out of favor once they were ported to widely available 64bit architectures.

But I agree that having the iomap under that “Software pcb” comment was a really bad idea, and perhaps the more consequential blunder. Especially because there isn’t even as much as an empty line separating it from the previous, truly software-only fields, tricking any reader into believing that it’s only the kernel that will look at those fields.

If there was a clear comment separating the iomap, I wouldn’t be too unaccepting of the `sizeof(struct pcb)-1` for the limit, and I think at least for me it might have drastically reduced the chances of both ignoring the implicit alignment and blindly removing the iomap fields altogether, as it would be clearer that the binary representation is relevant here.

Of course, all of that is easy to say as an uninvolved observer long after the fact.
Michal Necasek says:

September 1, 2018 at 1:17 am

Part of the problem must have been that when ‘struct pcb’ was first created, the IOPB was not actually used at all so no one likely gave it much thought. And when the IOPB started being used, no one went back to double check the design. All the more understandable when the “leaky” IOPB was invisible. You’re right though, letting the compiler align the struct when the hardware wants it just so is simply unwise.

I always thought it very annoying that structure packing wasn’t standardized and writing portable code that used packed structs wasn’t trivial. Sure, it’s possible to write serialization/deserialization routines to handle that, but in my opinion that goes directly against the philosophy of the C language, especially when such code is made completely redundant by using packed structs.

My experience is that different “cultures” treated this very differently. DOS and the PC are chock full of horrid unaligned structures… which of course made zero difference on an 8088. UNIX-y and especially RISC-y code tends to be very neatly aligned because unaligned accesses just don’t work very well in those environments. Intel always blurred the line by allowing unaligned accesses, albeit with a penalty.
sandpile.org says:

September 2, 2018 at 9:21 pm

Two more tidbits.

The 3 extra ports (from straggling word or dword I/O) do not wrap around to zero, i.e. bit 16 = 1 really goes out on the bus. You can easily test this thanks to the first DMA controller which resides at ports 0x00…0x1F, i.e. ports 0x00…0x02 won’t respond to straggling I/O.

On P5/P54/P55 (Pentium) processors you can observe unexpected behavior for dword IN if 1 or 2 bytes straggle.
Michal Necasek says:

September 2, 2018 at 11:21 pm

Thanks for that! I was wondering about the wrap-around behavior. Not surprised that it’s something a bit unexpected.

Do you happen to know if the Pentium behavior is a documented erratum?

I’m still exploring the behavior of an IOPB that crosses the 64K boundary in a TSS and it’s not lining up with the documentation, but my testcase could just be buggy.
Yuhong Bao says:

September 3, 2018 at 4:21 am

You will notice that a lot of x86-related documentation states that there are “64K+3” I/O ports.
Andrew Cooper says:

September 5, 2018 at 4:15 pm

There is much more fun to be had with TSS’s.

The observant amongst you might notice that with an IOPB of 0, the vm86 interrupt redirection bitmap starts at TSS.base – 32. It turns out that processors really do read ahead of the TSS base.

Better yet, it appears to be vendor specific as to whether, when TSS.base is close to 0, the read wraps back around to the 4G boundary or not.
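
The address in question can be sketched as follows. The function name is made up, and the 32-bit modular arithmetic is an assumption: as noted, whether the read really wraps appears to vary between vendors.

```c
#include <stdint.h>

/* Linear address of the 32-byte interrupt redirection bitmap, which
 * sits immediately below the I/O permission bitmap.
 * ASSUMPTION: the subtraction wraps modulo 2^32. */
static uint32_t redir_bitmap_linear(uint32_t tss_base, uint16_t iopb_offset)
{
    return tss_base + (uint32_t)iopb_offset - 32u;
}
```

For a TSS at 0x1000 with an IOPB offset of 0, this yields 0x0FE0, which is 32 bytes below the TSS base. For a TSS at 0x10, the result under this assumption is 0xFFFFFFF0, just below the 4G boundary.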
Michal Necasek says:

September 5, 2018 at 4:39 pm

Good stuff. That must be one of the few methods to access memory outside of a segment limit. All because the IOPB was a last minute addition and Intel didn’t have the time and/or microcode space to do it in a sane fashion.

My experience is that when long mode is not on, linear address calculations are done using 32-bit arithmetic and do wrap around 4G. But I’m certainly not going to claim that every x86 compatible CPU behaves the same way (I simply don’t know that). Once upon a time there was an OS (Coherent) which even relied on the wrap-around.
Yuhong Bao says:

September 5, 2018 at 9:09 pm

Though the interrupt redirection bitmap was introduced with VME.
Josh Rodd says:

September 5, 2018 at 11:44 pm

Pretty much an unrelated question, but a 56kB TSS seems pretty big. What is OS/2 doing with a 56kB-sized Task State Segment?
Michal Necasek says:

September 6, 2018 at 1:55 pm

Nothing. OS/2 uses a TSS limit of 67h (the minimum) but sets the IOPB offset to DFFFh. Why they do that I don’t know, but any IOPB offset greater or equal to the TSS limit should behave the same. Perhaps the idea was that the TSS might expand quite a bit and the IOPB offset wouldn’t need to change, always indicating no IOPB.
Joshua Rodd says:

September 7, 2018 at 7:45 pm

“My experience is that when long mode is not on, linear address calculations are done using 32-bit arithmetic and do wrap around 4G. But I’m certainly not going to claim that every x86 compatible CPU behaves the same way (I simply don’t know that). Once upon the time there was an OS (Coherent) which even relied on the wrap-around.”

I’m intrigued by this. What did Coherent do that ended up relying on this?
Michal Necasek says:

September 8, 2018 at 4:25 pm

If I remember correctly… the kernel was mapped at a high virtual address using paging. In order to execute the same code without paging, they set the segment base to some high value so that the base + offset would wrap around 4GB and point to the right place in physical memory.
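
The trick can be sketched roughly like this. The addresses used in the note below are invented for illustration and are not Coherent's actual values, and the helper names are hypothetical.

```c
#include <stdint.h>

/* Segment base that makes virtual offset `virt` resolve to physical
 * address `phys` with paging off, relying on 32-bit wrap-around. */
static uint32_t wrap_base(uint32_t virt, uint32_t phys)
{
    return phys - virt;   /* computed modulo 2^32 */
}

/* Linear address as formed without paging: base + offset, mod 2^32. */
static uint32_t linear_address(uint32_t base, uint32_t offset)
{
    return base + offset;
}
```

Assuming a kernel linked at 0x80000000 and loaded at physical 0x00100000, the base comes out as 0x80100000, and offset 0x80001234 then resolves to 0x00101234 once the sum wraps past 4GB.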
Eric Olson says:

September 8, 2018 at 10:49 pm

A very interesting commentary about not verifying that various IO ports were actually not writable from user processes. When you compare how other operating systems do it, it seems strange not to mention Linux.
Michal Necasek says:

September 10, 2018 at 11:49 am

If there were such a thing as “Linux”, it would be easier to write about it. But there is a bazillion distros and thousands of kernel options, hence it’s very difficult to make generalized statements about Linux. For example i386 Ubuntu 8.04 with Linux 2.6.24 kernel has TSS limit 2073h but the IOPB base is 8000h (i.e. well past limit). Ubuntu 16.04 i386 with Linux 4.15.0 kernel has TSS limit 206Bh but IOPB base is 68h, i.e. there is a full IOPB, with all ports set as inaccessible (that’s more or less the same as Ubuntu 12.04 i386 with Linux 3.2.0 kernel).
Lars Erdmann says:

September 10, 2018 at 5:51 pm

Another good example of why you should program defensively.

The only RELIABLE solution is to specify an interrupt redirection bitmap of 256/8=32 bytes in length and an IO-protection bitmap with the full length of 65536/8+1 = 8193 bytes. If you don’t want to open any ports, then set all these IO protection bytes to 0xFF where the last byte will, per requirement, always need to be set to 0xFF. Set the interrupt redirection bits as appropriate, possibly to all zeros if you want all software interrupts to be handled by the real mode IDT (which I would think is the “default”).
In that case the length of the TSS would be 104 (static minimum length) + 32 (for the interrupt redirection bitmap) + 8193 (for the full IO-protection bitmap + the “termination” byte) = 8329 bytes and the TSS limit therefore 8328 (and the IOPB base would be 104+32 = 136 = 0x88). That’s only 3 memory pages for the TSS which is absolutely irrelevant in today’s systems and was also irrelevant 20 years ago.
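
The arithmetic above can be written out as a small sketch. The constant and function names are made up for illustration, and a 4KB page size is assumed.

```c
#include <stdint.h>

enum {
    TSS_FIXED_BYTES = 104,        /* minimum 32-bit TSS */
    REDIR_MAP_BYTES = 256 / 8,    /* one bit per software interrupt */
    IOPB_BYTES      = 65536 / 8,  /* one bit per I/O port */
    IOPB_TERMINATOR = 1           /* trailing 0xFF byte */
};

/* Offset of the I/O permission bitmap within the TSS. */
static uint32_t full_iopb_base(void)
{
    return TSS_FIXED_BYTES + REDIR_MAP_BYTES;
}

/* Total size in bytes of a TSS with both bitmaps at full length. */
static uint32_t full_tss_size(void)
{
    return full_iopb_base() + IOPB_BYTES + IOPB_TERMINATOR;
}

/* Segment limit: offset of the last valid byte. */
static uint32_t full_tss_limit(void)
{
    return full_tss_size() - 1u;
}

/* Number of 4KB pages needed to hold `bytes` (page size assumed). */
static uint32_t pages_needed(uint32_t bytes)
{
    return (bytes + 4095u) / 4096u;
}
```

This reproduces the figures quoted: a base of 136 (0x88), a size of 8329 bytes, a limit of 8328, and three pages.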

I wonder why this absolutely reliable solution (working on Intel and AMD with no odd side effects) is not used all across the board …
Yuhong Bao says:

September 10, 2018 at 10:13 pm

Well, it is worth mentioning that older Linux kernels did use TSS task switching.
Yuhong Bao says:

September 10, 2018 at 10:29 pm

And AFAIK the IOPB doesn’t even need to be 8193 bytes in that case. 8192 bytes should work fine as the only way to access the extra ports would be an IN or OUT to port 0xFFFF.
Richard Wells says:

September 11, 2018 at 2:45 am

The IOPB looks to have been designed to make porting Unix from the VAX easy though treating the entire V86 memory block as a set of I/O ports leading to a much larger memory allocation ran rather counter to the more limited memory that would have come with the 80386. Thus, having the option of shrunken or no IOPB makes sense.

Trying the VAX style I/O protection with full sized IOPB on every process would have needed too much memory even 20 years ago. I think my current system would wind up using an additional 512+ MB just to store the full set of IOPBs.
Yuhong Bao says:

September 11, 2018 at 2:57 am

Yea, the reason it works well now is that most OSes don’t support TSS switching any more, so they can just copy a smaller IOPB into the single TSS that is used.
Lars Erdmann says:

September 11, 2018 at 10:50 am

The IOPB does have to be 8193 bytes in size (if you use a full length IOPB). If the HW checks the IOPB for the very last eight I/O ports (0xFFF8 to 0xFFFF) it will address 1 byte beyond the IOPM because it does WORD reads. If the TSS does not contain that additional byte, you will experience a trap because the segment limit would be violated.

Maybe you confused that every I/O port is represented by a BIT and not a BYTE?
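
A simplified model of the check may make the point concrete. It is a sketch under stated assumptions: the function name is made up, and it ignores IOPL, CPL and V86 mode entirely, looking only at the bitmap and the segment limit.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if an access of `width` bytes (1, 2 or 4) at `port` is
 * permitted by the bitmap, 0 if it would fault. `tss` points to the
 * TSS image and `limit` is the segment limit (last valid offset). */
static int io_access_allowed(const uint8_t *tss, uint32_t limit,
                             uint16_t iopb_base, uint16_t port,
                             unsigned width)
{
    uint32_t first = (uint32_t)iopb_base + port / 8u;

    /* Two bytes are always fetched; both must lie within the limit. */
    if (first + 1u > limit)
        return 0;

    uint32_t bits = tss[first] | ((uint32_t)tss[first + 1u] << 8);
    uint32_t mask = ((1u << width) - 1u) << (port % 8u);

    /* Every bit covering the access must be clear. */
    return (bits & mask) == 0;
}
```

With a full bitmap at base 136 and a limit of 8328, a byte access to port 0xFFFF passes. Shrinking the limit to 8327 (dropping the extra byte) makes the same access fault, and a word access to 0xFFFF faults either way because it runs into the 0xFF terminator.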
Yuhong Bao says:

September 11, 2018 at 12:11 pm

In practice, nobody used these ports anyway because they were used for the 80387 coprocessor.
Yuhong Bao says:

September 11, 2018 at 12:15 pm

Actually, I think that is wrong. The 80387 actually uses 800000F8h to 800000FFh.
Chad Dougherty says:

September 11, 2018 at 1:19 pm

Outstanding write up. Thank you for sharing!
Lars Erdmann says:

September 11, 2018 at 5:52 pm

If a system has up to 65536 processes managed by separate TSS’es then, even with a minimum TSS eating up at least a mem page, you already need 256 MB just for TSS info. Such a design is completely flawed to begin with, no matter if that value triples or not.

Even OS/2 which did use the HW task switching mechanism only had one common TSS for all protected mode processes but it had one separate TSS per VDM (DOS or Win 3.1 process). The latter was mostly due to the fact that there were quite a few DOS applications that needed to program some HW via I/O ports directly (thinking about DOS games programming the sound HW for example).
Richard Wells says:

September 11, 2018 at 7:26 pm

It is possible to have multiple segments within a single memory page. A pure TSS design with no IOPB could take advantage of that to keep memory consumption reasonable.
MiaM says:

September 11, 2018 at 9:45 pm

Well, if you are going to have so much hardware that you use 65536 I/O ports (or even 65539 if you squeeze in a 32-bit access to I/O address 65535) then you’ll probably also have loads of memory.

Btw, as the map can reside at a variable offset, it should be possible to allocate a chunk of physical memory and have separate TSS:es sharing a (cropped) I/O map.

And re memory usage: It seems reasonable to only have an I/O map at all for the few processes that actually need to use I/O.
Yuhong Bao says:

September 11, 2018 at 11:16 pm

AFAIK 65536 TSSs are not even possible given that the GDT is limited to 8192 segments.
djhayman says:

September 12, 2018 at 2:28 am

@Yuhong – but then add LDTs to the mix… Your GDT could have the following layout:

Null segment, 32-bit kernel code segment, 32-bit kernel data segment, 32-bit user code segment, 32-bit user data segment, and 8,187 local descriptor table segments.

Each of the 8,187 local descriptor tables can then have 8,192 task state segments, giving a total of 67,067,904.

Totally useless, but definitely possible!
Michal Necasek says:

September 13, 2018 at 10:48 am

They don’t necessarily all have to be wired into the GDT at the same time. But anyway, the design considerations of the time are pretty obvious. If 1MB is the typical high-end system’s memory size, then 2KB per process is an awful lot of memory. In addition, a) not every process needs I/O access, and b) some platforms might have considerably fewer I/O ports, like 1024 on old PCs. Then it suddenly makes a lot of sense to make the IOPB optional and variable in size.

Today the IOPB seems like a quaint idea, because no one dares to give any user process I/O port access, but one has to keep in mind that the IOPB was designed for things like CEMM/EMM386 where it made a great deal of sense.
Michal Necasek says:

September 13, 2018 at 10:59 am

OS/2 didn’t really use the task switching mechanism more than it needed to. For VDMs it was used, yes, probably not least because of per-VDM IOPBs. Just so that someone reading what you wrote doesn’t think OS/2 actually used the built-in CPU task switching for regular OS/2 processes, because it didn’t. I can’t think of any mainstream OS that ever used it. In fact even OSes with per-process TSSs like various BSDs didn’t use it, and found out in due time that on the P4, the LTR instruction was awfully slow (naturally it was slow because Microsoft OSes didn’t use it, so why would Intel care).
Yuhong Bao says:

September 13, 2018 at 11:14 am

Linux did use TSS task switching until the late 1990s. They removed it in phases with the LTR removed later:
https://lore.kernel.org/lkml/[email protected]/#t
Michal Necasek says:

September 13, 2018 at 12:00 pm

I know. That’s why I wrote “mainstream OS” 🙂 By the time Linux became one, the 386 hardware task switches were gone.
Lars Erdmann says:

September 13, 2018 at 4:15 pm

1) About LDTs: a TSS can only go into the GDT. And therefore it’s true: it is not possible to create 65536 TSSes. At least not simultaneously. Their descriptors would need to be swapped in and out of the GDT. Which would likely be extremely slow and defeat the purpose of HW task switching.

2) about a shared IOPB: in theory that would be possible because the HW only expects the first 104 bytes of the TSS to be physically contiguous. But from a practical side I don’t think any OS allows allocating a range of virtual memory where one or more of its pages are located at a dedicated physical address. And the virtual addresses of a TSS need to be contiguous.
Richard Wells says:

September 13, 2018 at 9:20 pm

iRMX was designed around the use of TSS. Not exactly mainstream but rather influential within Intel.

The glory days of TSS were before systems had more than 32 MB of RAM so the GDT was more than big enough with 8192 descriptors. Continuing the TSS implementation with much larger amounts of memory would have required increasing the number of descriptors with TSS. Easy enough to calculate that that is not going to be a viable method though.

IOPB on every task was more useful in a multi-user context where each user could have a radically different terminal and blocking non-existent devices is critical to prevent crashes. V86 mode got to use it as well; no reason to have two different mechanisms achieving similar goals.
crazyc says:

September 14, 2018 at 3:46 am

I’m not sure about that. Running iRMX II for the 286 in emulation shows no significant TSS usage. Also from http://bitsavers.trailing-edge.com/pdf/intel/iRMX/iRMX_III/Real-Time_and_Systems_Programming_for_PCs_1993.pdf page 153: “Although iRMX II and III do not use CPU management for multi-tasking, they do maintain similar data structures to hold the information needed for iRMX tasks.”
MiaM says:

September 14, 2018 at 4:58 am

“Today the IOPB seems like a quaint idea, because no one dares to give any user process I/O port access,”

Imho it seems like a good idea to actually have processes only able to access relevant I/O ports for all kinds of slow I/O.

Sure, today there aren’t many such drivers needed as most computers have USB for all slow I/O, but in the 90’s it would make sense to try to protect the OS from bugs in the floppy controller driver or the serial port driver, or rather the drivers for all kinds of other hardware.

It was common for “special hardware” to have their own ISA or in some cases PCI card. You might remember those I/O cards having a bunch of digital and/or analogue ports. Also stuff like an eprom programmer (ALL-03, ALL-07 and similar) or a logic analyzer which actually is usable even nowadays for someone working with hardware.

Btw there are hacks to make Windows (XP and compatible versions) allow DOS programs to access certain I/O ports, and hacks for those DOS programs to activate the I/O access allowance. For example you can easily find such stuff for the ALL-03 programmer.

So at least any PC compatible computer having an ISA bus could use the possibility to have a user process access a limited range of I/O ports.

Btw for some port ranges it’s even totally safe to let a user process access the hardware. The only thing that could cause problems with for example a serial port is stuck interrupt lines, and that could be handled by the driver for the interrupt controller, separate from the user process accessing the serial port hardware.
Michal Necasek says:

September 17, 2018 at 12:51 pm

Yes, the design really made sense. These days for slow hardware it’s not a problem to go through a generic kernel driver which exposes port access, and for fast hardware a specific driver is needed anyway. DOS boxes were a special but very important case in that the VDM process was a black box that could not be modified, so wholesale access to certain port ranges was a good solution. And in the 386 days especially, the saved CPU cycles made a big difference.
KeyJ says:

September 18, 2018 at 2:49 pm

So, in a perfect world, the programmer who originally enabled `pcb_iomap` would have added a remark like “*must* be the last member of the structure” to the comments, and the programmer who added `pcb_iomap_pad` would have read that and come to the conclusion that something like `tss_limit = offsetof(struct pcb, pcb_iomap_pad) - 1;` is necessary?

Another (only slightly related) question: Why do so few OSes use TSS for task switching? Wikipedia mentions portability (which is obvious), but also performance, which is a bit surprising. What makes hardware task switching slower than doing it “by hand”?
Michal Necasek says:

September 18, 2018 at 9:55 pm

Yes, in a perfect world, that’s about how it would go. We don’t live in a perfect world.

386 OSes indeed very rarely use TSSs for process switching, even the ones which are otherwise very x86-heavy, like OS/2. There are several problems, such as performance, reliability, and a limited number of TSSs. A failed task switch is one of the very few things in the x86 architecture that is not restartable, and task switching interacts somewhat badly with paging. A TSS must be in the GDT, which means a hard limit of 8,191 simultaneously available entries, and many of those may be already taken by things like call gates. OS designers may feel that such a limit on the number of processes is unreasonable, probably even more so if hardware tasks correspond to threads.

A properly written 386 OS will use a task gate for NMI and double fault handling, because those can occur when the CPU state is unpredictable or corrupted. But normal process switching is typically done by hand because it’s no slower than TSSs and more flexible.
Fernando says:

September 21, 2018 at 1:49 pm

I was reading an oral history from the Computer History Museum with John Crawford. I think that this paragraph, where he’s talking about the 386, pertains to this topic:
“Now we certainly had some non-RISC things. We had these things called task gates that were miles of microcode you would wind through to transition state– it was from the 286. It was kind of a hardware task switch kind of thing, which never did work out quite right, but all that was just microcode. And operating systems rarely use those, anyway, if at all. So a lot of the CISC stuff got to be- you could ignore it. It was there, we had to do it, but it didn’t really impact the performance much and it was tied up in microcode, which was just bits in a ROM, so…”
Richard Wells says:

September 25, 2018 at 10:31 pm

Microcode polishing takes time. The 386 took about 2 years from initial sketch to shipping while the i960 was closer to 5 years. The iAPX 432 (former 8800) had an even longer development cycle. The problems with the 432 were not buggy microcode. (The 386 was arguably even more complex since it tried for the faster 86 and cheap minicomputer and high performance military embedded markets at the same time.)

Hardware task switching seems to have been a desired feature for military uses so one would need to track down those OSes to see how an implementation should work.
Yuhong Bao says:

October 6, 2018 at 7:58 am

Of course, MS OSes don’t use it partly because of its problems in the first place. (Consider a terminal server with 100 users with an average of 10 processes each per session for example.)