Anyone trying to disassemble the PC DOS 1.1 boot sector soon notices that at offsets 1A3h through 1BEh there is a byte sequence that just does not belong. It appears to be a fragment of code, but it has no purpose in the boot sector and is never executed. So why is the sequence of junk bytes there, and where did it come from?
The immediate answer is “it came from FORMAT.COM”. The junk is copied verbatim from FORMAT.COM to the boot sector. But those junk bytes are not part of FORMAT.COM, either. So the question merely shifts to “why are the junk bytes in FORMAT.COM, and where did they come from?”
It is not known if anyone answered the question in the past, but the answer has been found now, almost 40 years later—twice independently.
The junk bytes are a fragment of Microsoft’s linker, LINK.EXE. There is an almost identical code sequence within LINK.EXE that was shipped on the PC DOS 1.1 disk. The sequence is close enough and unique enough that it’s extremely implausible that it might have come from anywhere else.
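One way to check a claim like this is to slide the junk bytes across the suspected source file and count matching bytes at each position. The sketch below is only an illustration of that idea, not the method either discoverer used. The function name `best_alignment` and the synthetic test data are assumptions; with real files, the needle would be the 28 bytes at boot sector offsets 1A3h through 1BEh and the haystack would be the contents of LINK.EXE.

```python
def best_alignment(needle: bytes, haystack: bytes):
    """Return (matching_byte_count, offset) of the best same-length window."""
    best_score, best_off = 0, -1
    for off in range(len(haystack) - len(needle) + 1):
        window = haystack[off:off + len(needle)]
        score = sum(a == b for a, b in zip(needle, window))
        if score > best_score:
            best_score, best_off = score, off
    return best_score, best_off


# Synthetic stand-ins (assumption): a 28-byte fragment and a file that
# contains an almost identical copy with one byte changed.
JUNK_LEN = 0x1BE - 0x1A3 + 1           # offsets are inclusive, so 28 bytes
needle = bytes(range(JUNK_LEN))
variant = bytearray(needle)
variant[5] = 0xEE                      # simulate a single differing byte
haystack = b"\xff" * 100 + bytes(variant) + b"\xff" * 50

score, offset = best_alignment(needle, haystack)
print(f"{score} of {JUNK_LEN} bytes match at offset {offset:#x}")
```

A near-perfect score at a single offset, with low scores everywhere else, is the kind of evidence that makes a coincidence implausible.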
For one thing, the code fragment looks like something generated by a high-level language compiler, yet the bulk of DOS 1.1 was written in assembly language. Notably, MASM and LINK were not; both were written in Microsoft’s Pascal.
The first person to make this discovery (as far as we can tell) was Daniel B. Sedory aka The Starman, whose illustrated PC DOS 1.1 boot sector page is much nicer than anything I could put together.
I followed about two weeks later. Back when I was reconstructing PC DOS 1.1, I noticed the boot sector junk, but at the time I did not make the next step of trying to identify where it might have come from.
The junk in the PC DOS 1.1 boot sector isn’t the only instance of such junk. For example, IBMBIO.COM also contains a different and larger junk sequence, part of which repeats the contents of IBMBIO.COM itself.
It is virtually certain that the junk bytes came indirectly from development tools used for building PC DOS 1.1, namely Microsoft’s assembler/linker. MASM can define “uninitialized data” and the linker interprets that quite literally, placing uninitialized data into the resulting executable. The bytes are probably somewhat unpredictable memory contents and might contain fragments of the linked program’s data or code, or even fragments of the linker itself.
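The effect can be illustrated with a toy model. This is emphatically not how LINK.EXE is implemented; it is a sketch that assumes a work buffer is reused without being cleared, so any gap that is never written keeps whatever the buffer held before. The names `emit_image` and `work` are hypothetical.

```python
def emit_image(buffer: bytearray, segments, image_size: int) -> bytes:
    """Copy initialized segments into a reused buffer; gaps stay untouched."""
    for offset, data in segments:
        buffer[offset:offset + len(data)] = data
    return bytes(buffer[:image_size])


# Stale contents left over from earlier work (assumption for the demo).
work = bytearray(b"LINKER-INTERNAL-CODE-BYTES!!")

# Two initialized pieces with "uninitialized data" between and after them.
image = emit_image(work, [(0, b"BOOT"), (12, b"MSG")], 20)
print(image)  # the gaps still show the stale buffer contents
```

In this model the output file faithfully records the stale bytes in every gap, which is the same kind of leftover the boot sector junk appears to be.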
The junk somewhat complicates analysis of the resulting executables because it’s not trivial to prove that it has no function; even if the junk bytes never get executed, they might end up being copied somewhere, or become part of the stack contents, and affect program execution indirectly.
There is no reason to believe that the junk bytes are the result of programmer intent. But there are two other oddities that are not random and are much harder to explain.
One is “zero-terminated strings”. When the boot sector checks whether a disk is bootable, it verifies that the first two root directory entries are IBMBIO.COM and IBMDOS.COM. To that end, the strings ‘ibmbio com’ and ‘ibmdos com’ are stored in the boot sector. They are stored in lowercase (and the root directory contents are forced to lowercase, too), which itself may seem odd, since the directory entries should always be uppercase. However, a second look at the in-development PC DOS disk makes the reason clear: at some point, the files were named in lowercase, ibmbio.com and ibmdos.com, as a way to make them difficult to delete or overwrite. Once the system attribute was invented, the file names were uppercased again:
The system files have reverted to upper case letters again, but will not be included in any directory searches because of a new byte (attribute) in the directory entry (they won’t show on a DIR command, and can’t be erased, copied, folded, spindled or mutilated).
Unknown IBMer, from a file dated 06/05/1981
But the real weirdness is that the strings are “terminated” with ASCII character ‘0’ (zero). That is to say, in the boot sector they are stored as “ibmbio com0ibmdos com0”. The zeros have no function since only up to 11 bytes are compared for each of the two file names. This was perhaps an oversight and the strings were meant to be null-terminated, but instead of writing
DB 'ibmbio com', 0
the author might have inadvertently written
DB 'ibmbio com', '0'
Since the terminator has no function, the mistake was never found and corrected. That is just speculation but it makes at least some sort of sense.
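A short sketch shows why the stray byte never matters. It assumes the usual FAT directory layout, in which a name is padded with spaces to 8 characters and the extension to 3 (11 bytes in total), and it assumes the directory name is forced to lowercase by ORing each byte with 20h. The helper names `fat_name` and `matches` are made up for the illustration.

```python
NAME_LEN = 11  # 8-character name plus 3-character extension


def fat_name(base: str, ext: str) -> bytes:
    """Pad a name to the 8+3 layout of a directory entry."""
    return base.ljust(8).encode("ascii") + ext.ljust(3).encode("ascii")


def matches(dir_entry: bytes, stored_name: bytes) -> bool:
    """Force the directory name to lowercase, then compare 11 bytes only."""
    lowered = bytes(b | 0x20 for b in dir_entry[:NAME_LEN])
    return lowered == stored_name[:NAME_LEN]


# Boot sector copy: each name is followed by the stray ASCII '0' (30h).
stored = fat_name("ibmbio", "com") + b"0" + fat_name("ibmdos", "com") + b"0"

first, second = stored[:12], stored[12:]
print(matches(fat_name("IBMBIO", "COM"), first))   # terminator is never read
print(matches(fat_name("IBMDOS", "COM"), second))
print(hex(first[NAME_LEN]))                        # the byte after the name
```

Because the comparison stops after 11 bytes, replacing the ‘0’ with a real null byte, or with anything else, would not change the result.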
There is another strangeness related to string termination, although it was mostly gone from PC DOS 1.1.
In PC DOS 1.0, strings in the boot sector and in IBMBIO.COM are null-terminated, but the last character of text also has the high bit set. The routine which prints the strings strips the high bit from all characters.
Again, it’s very unclear what purpose this might have had. The high bits are simply stripped and thrown away, but it cannot be a coincidence that the last character of each string in the boot sector and in IBMBIO.COM has the high bit set.
In PC DOS 1.1, the high bits are no longer set on the boot sector and IBMBIO.COM strings, but the print string routine in the boot sector still strips them. That was presumably a harmless omission.
It is possible that in some earlier incarnation the strings were high-bit-terminated and were later changed to null-terminated. In PC DOS 1.0 the high bit was still both set and stripped; in PC DOS 1.1 only the stripping remained.
Note that in the PC DOS 1.0 boot sector, the strange zero-termination is combined with high-bit-termination, and the strings ‘ibmbio com0’ and ‘ibmdos com0’ are each stored with the last byte as B0h, which is ASCII ‘0’ (30h) + 80h. In PC DOS 1.1, the high bit is no longer set.
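The behavior described above can be modeled in a few lines. This is a sketch of the logic, not a transcription of the actual boot sector routine; the function name `print_string` and the message text are assumptions made for the example.

```python
def print_string(mem: bytes, start: int = 0) -> str:
    """Return the text up to a null byte, stripping the high bit of each byte."""
    out = []
    i = start
    while mem[i] != 0:                    # the null byte ends the string
        out.append(chr(mem[i] & 0x7F))    # the high bit is thrown away
        i += 1
    return "".join(out)


# PC DOS 1.0 style: last character has the high bit set, then a null byte.
msg_10 = b"Disk erro" + bytes([ord("r") | 0x80]) + b"\x00"
# PC DOS 1.1 style: plain null-terminated text.
msg_11 = b"Disk error\x00"

print(print_string(msg_10) == print_string(msg_11))
print(hex(ord("0") | 0x80))  # ASCII '0' (30h) with the high bit set
```

Both encodings print identically, which is why leaving the stripping in place in PC DOS 1.1 was harmless.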
Other Uses of High-Bit Termination
Terminating strings by setting the high bit of the last character was a somewhat common practice on machines with limited memory and no need to process anything beyond 7-bit ASCII (that is, a great many systems in 1980 and earlier). Microsoft used this technique in BASIC; especially for storing BASIC token tables, using the high bit instead of a length or terminator byte saved hundreds of bytes of precious ROM.
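The saving is easy to see in a small model. The following is a generic sketch of high-bit termination, not a reproduction of any actual BASIC table; the keyword list and the function names `encode_table` and `decode_table` are chosen for illustration only.

```python
def encode_table(words) -> bytes:
    """Pack 7-bit ASCII words back to back, marking each last byte with bit 7."""
    out = bytearray()
    for word in words:
        raw = word.encode("ascii")
        out += raw[:-1] + bytes([raw[-1] | 0x80])
    return bytes(out)


def decode_table(table: bytes):
    """Split the packed table again; a set high bit ends the current word."""
    words, current = [], bytearray()
    for byte in table:
        current.append(byte & 0x7F)
        if byte & 0x80:
            words.append(current.decode("ascii"))
            current = bytearray()
    return words


keywords = ["PRINT", "GOTO", "FOR", "NEXT"]   # sample entries (assumption)
packed = encode_table(keywords)
null_terminated = b"".join(w.encode("ascii") + b"\x00" for w in keywords)

print(decode_table(packed))
print(len(null_terminated) - len(packed), "bytes saved")
```

The saving is exactly one byte per entry, so a table with a few hundred entries saves a few hundred bytes.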
DOS does not generally use this technique (strings are generally dollar-terminated), although for example DEBUG uses high bit termination for the instruction mnemonic table in the disassembler, again saving a byte per mnemonic which does add up.
Terminating strings by setting the high bit was clearly a widespread technique at the time (one most likely independently invented more than once) and presumably known to all Microsoft programmers. It is thus not surprising to find it in the parts of DOS written by Microsoft (boot sector, IBMBIO.COM, SYS.COM, FORMAT.COM), all of which were quite possibly written by a single programmer, Bob O’Rear.
It is possible that Microsoft used some assembler string definition macros which automatically added the high bit terminator (see e.g. ‘Q’ macro in BASIC’s BINTRP.H). That might partially explain the strange double termination seen e.g. in FORMAT.COM from PC DOS 0.9 where strings are terminated with ASCII ‘$’ and a B0h byte (ASCII ‘0’ with high bit set); whatever it was meant to accomplish, the B0h is redundant because DOS won’t get past the ‘$’ when displaying strings. But if a string definition macro automatically set the high bit, an extra byte would have had to be added because DOS would not recognize ASCII ‘$’ with high bit set as the expected terminator.
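The reasoning about the extra byte can be sketched as follows. The macro here is hypothetical: `hibit_macro` simply sets the high bit on the last byte it is given, which is an assumption about what such a macro might have done, and `dos_print_09h` models only the documented behavior of the DOS print string call, which outputs characters until it reaches ‘$’.

```python
def hibit_macro(text: bytes) -> bytes:
    """Hypothetical string macro: set the high bit on the final byte."""
    return text[:-1] + bytes([text[-1] | 0x80])


def dos_print_09h(mem: bytes, start: int = 0) -> str:
    """Model of the DOS print string call: output stops at the first '$'."""
    end = mem.index(b"$", start)   # raises ValueError if no '$' is present
    return mem[start:end].decode("ascii")


bad = hibit_macro(b"Hello$")     # the '$' itself becomes A4h
good = hibit_macro(b"Hello$0")   # '$' survives; the trailing '0' becomes B0h

print(b"$" in bad)               # DOS would run past the end of this string
print(dos_print_09h(good))
print(hex(good[-1]))
```

Under that assumption, appending a throwaway ‘0’ is the simplest way to keep the ‘$’ intact, and it would produce exactly the ‘$’ plus B0h pattern seen in FORMAT.COM.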
All in all, it’s clear that the code Microsoft wrote for PC DOS underwent some evolution and was cleaned up only after the PC DOS 1.0 release, with some vestiges of the earlier iterations remaining even in PC DOS 1.1.