Anyone trying to disassemble the PC DOS 1.1 boot sector soon notices that at offsets 1A3h through 1BEh there is a byte sequence that just does not belong. It appears to be a fragment of code, but it has no purpose in the boot sector and is never executed. So why is the sequence of junk bytes there, and where did it come from?
The immediate answer is “it came from FORMAT.COM”. The junk is copied verbatim from FORMAT.COM to the boot sector. But those junk bytes are not part of FORMAT.COM, either. So the question merely shifts to “why are the junk bytes in FORMAT.COM, and where did they come from?”
It is not known if anyone answered the question in the past, but the answer has been found now, almost 40 years later—twice independently.
The junk bytes are a fragment of Microsoft’s linker, LINK.EXE. There is an almost identical code sequence within LINK.EXE that was shipped on the PC DOS 1.1 disk. The sequence is close enough and unique enough that it’s extremely implausible that it might have come from anywhere else.
For one thing, the code fragment looks like something generated by a high-level language compiler, yet the bulk of DOS 1.1 was written in assembly language. Notably MASM and LINK were not; both were written in Microsoft’s Pascal.
The first person to make this discovery (as far as we can tell) was Daniel B. Sedory aka The Starman, whose illustrated PC DOS 1.1 boot sector page is much nicer than anything I could put together.
I followed about two weeks later. Back when I was reconstructing PC DOS 1.1, I noticed the boot sector junk, but at the time I did not make the next step of trying to identify where it might have come from.
The junk in the PC DOS 1.1 boot sector isn’t the only instance of such junk. For example IBMBIO.COM also contains a different and larger junk sequence which is partly a repetition of the contents of IBMBIO.COM itself.
It is virtually certain that the junk bytes came indirectly from development tools used for building PC DOS 1.1, namely Microsoft’s assembler/linker. MASM can define “uninitialized data” and the linker interprets that quite literally, placing uninitialized data into the resulting executable. The bytes are probably somewhat unpredictable memory contents and might contain fragments of the linked program’s data or code, or even fragments of the linker itself.
The junk somewhat complicates analysis of the resulting executables because it’s not trivial to prove that it has no function; even if the junk bytes never get executed, they might end up being copied somewhere, or become part of the stack contents, and affect program execution indirectly.
There is no reason to believe that the junk bytes are the result of programmer intent. But there are two other oddities that are not random and are much harder to explain.
One is “zero-terminated strings”. When the boot sector checks whether a disk is bootable, it verifies that the first two root directory entries are IBMBIO.COM and IBMDOS.COM. To that end, the strings ‘ibmbio com’ and ‘ibmdos com’ are stored in the boot sector. They are stored in lowercase (and the root directory contents are forced to lowercase, too), which itself may seem odd, since the directory entries should always be uppercase. However, re-reviewing the in-development PC DOS disk, the reason becomes clear: At some point, there was lowercase ibmbio.com and ibmdos.com, as a way to make the files difficult to delete or overwrite. Once the system attribute was invented, the file names were uppercased again:
The system files have reverted to upper case letters again, but will not be included in any directory searches because of a new byte (attribute) in the directory entry (they won’t show on a DIR command, and can’t be erased, copied, folded, spindled or mutilated).
(Unknown IBMer, from a file dated 06/05/1981)
But the real weirdness is that the strings are “terminated” with the ASCII character ‘0’ (zero). That is to say, in the boot sector they are stored as “ibmbio com0ibmdos com0”. The zeros have no function, since only up to 11 bytes are compared for each of the two file names. This was perhaps an oversight and the strings were meant to be null-terminated, but instead of writing
DB 'ibmbio com', 0
the author might have inadvertently written
DB 'ibmbio com', '0'
Since the terminator has no function, the mistake was never found and corrected. That is just speculation but it makes at least some sort of sense.
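The check described above can be modeled with a minimal Python sketch (not the original 8086 code; the file names are space-padded to 11 bytes, and the function name is invented for illustration). It shows why the trailing ‘0’ bytes are never examined:

```python
# The name strings as stored in the PC DOS 1.1 boot sector: two
# 11-byte space-padded names, each followed by a functionless '0'.
STORED = b"ibmbio  com0ibmdos  com0"

def entry_matches(dir_entry_name: bytes, which: int) -> bool:
    """Compare an 11-byte root directory name against stored name 0 or 1."""
    stored_name = STORED[which * 12 : which * 12 + 11]  # skip the '0' byte
    # The boot sector forces the directory name to lowercase first,
    # matching the lowercase strings stored in the sector.
    return dir_entry_name[:11].lower() == stored_name

# Actual root directory entries hold the names in uppercase:
print(entry_matches(b"IBMBIO  COM", 0))  # True
print(entry_matches(b"IBMDOS  COM", 1))  # True
print(entry_matches(b"COMMAND COM", 0))  # False
```

Only 11 bytes per name take part in the comparison, so whatever follows them, ‘0’ or otherwise, is dead weight.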
There is another strangeness related to string termination, although it was mostly gone from PC DOS 1.1.
In PC DOS 1.0, strings in the boot sector and in IBMBIO.COM are null-terminated, but the last character of text also has the high bit set. The routine which prints the strings strips the high bit from all characters.
Again, it’s very unclear what purpose this might have had. The high bits are simply stripped and thrown away, but it cannot be a coincidence that the last character of each string in the boot sector and in IBMBIO.COM has the high bit set.
In PC DOS 1.1, the high bits are no longer set on the boot sector and IBMBIO.COM strings, but the print string routine in the boot sector still strips them. That was presumably a harmless omission.
It is possible that in some earlier incarnation, the strings were high-bit-terminated, then changed to null-terminated, but the high bit still remained set and stripped in PC DOS 1.0, and in PC DOS 1.1 only the stripping remained.
Note that in the PC DOS 1.0 boot sector, the strange zero-termination is combined with high-bit-termination, and the strings ‘ibmbio com0’ and ‘ibmdos com0’ are each stored with the last byte as B0h, which is ASCII ‘0’ (30h) + 80h. In PC DOS 1.1, the high bit is no longer set.
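The arithmetic is easy to verify with a short Python snippet (a model for illustration, not the actual boot sector code):

```python
# ASCII '0' (30h) with the high bit set is exactly the B0h byte
# found at the end of each name string in the PC DOS 1.0 boot sector.
terminator = ord('0') | 0x80
assert terminator == 0xB0

# A model of a print routine that strips the high bit from every
# character before output, as the boot sector routine does:
def print_stripped(s: bytes) -> str:
    return "".join(chr(b & 0x7F) for b in s)

print(print_stripped(b"ibmbio  com\xB0"))  # prints "ibmbio  com0"
```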
Other Uses of High-Bit Termination
Terminating strings by setting the high bit of the last character was a somewhat common practice on machines with limited memory and no need to process anything beyond 7-bit ASCII (that is, great many systems in 1980 and earlier). Microsoft used this technique in BASIC; especially for storing BASIC token tables, using the high bit instead of a length or terminator byte saved hundreds of bytes of precious ROM.
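The space savings come from needing neither a length byte nor a terminator byte per keyword. A Python sketch of the general technique (an illustration, not Microsoft’s actual table format or keyword list):

```python
# Pack a keyword table by setting the high bit of each keyword's
# last character, instead of storing a length or terminator byte.
KEYWORDS = ["PRINT", "GOTO", "INPUT", "LET"]  # illustrative subset

def pack(words):
    out = bytearray()
    for w in words:
        b = w.encode("ascii")
        out += b[:-1] + bytes([b[-1] | 0x80])  # high bit marks the end
    return bytes(out)

def unpack(table):
    words, cur = [], bytearray()
    for byte in table:
        cur.append(byte & 0x7F)          # strip the high bit
        if byte & 0x80:                  # last character of a keyword
            words.append(cur.decode("ascii"))
            cur = bytearray()
    return words

packed = pack(KEYWORDS)
assert len(packed) == sum(len(w) for w in KEYWORDS)  # zero overhead
assert unpack(packed) == KEYWORDS
```

With a length or terminator byte the table would cost one extra byte per keyword; over a table of a hundred-plus BASIC keywords, that adds up quickly in a small ROM.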
DOS does not generally use this technique (strings are generally dollar-terminated), although for example DEBUG uses high bit termination for the instruction mnemonic table in the disassembler, again saving a byte per mnemonic which does add up.
Terminating strings by setting the high bit was clearly a widespread technique at the time (one most likely independently invented more than once) and presumably known to all Microsoft programmers. It is thus not surprising to find it in the parts of DOS written by Microsoft (boot sector, IBMBIO.COM, SYS.COM, FORMAT.COM), quite possibly even written by a single programmer, Bob O’Rear.
It is possible that Microsoft used some assembler string definition macros which automatically added the high-bit terminator (see e.g. the ‘Q’ macro in BASIC’s BINTRP.H). That might partially explain the strange double termination seen e.g. in FORMAT.COM from PC DOS 0.9, where strings are terminated with ASCII ‘$’ followed by a B0h byte (ASCII ‘0’ with the high bit set); whatever it was meant to accomplish, the B0h is redundant, because DOS won’t get past the ‘$’ when displaying strings. But if a string definition macro automatically set the high bit, an extra byte would have had to be added, because DOS would not recognize ASCII ‘$’ with the high bit set as the expected terminator.
All in all, it’s clear that the code Microsoft wrote for PC DOS underwent some evolution and was cleaned up only after the PC DOS 1.0 release, with some vestiges of the earlier iterations remaining even in PC DOS 1.1.
CP/M software routinely strips the high bit when printing filenames, because the high bit is used for file attributes. Not sure if that’s relevant here though.
I have a recollection that some Z80 assemblers had a pseudo-op to generate strings with the high bit set on the last byte. Maybe 8088 assemblers did too.
The DC operator in MS Macro-80 stores a string with the high bit of the last byte set to one. If the Macro-80 source was automatically converted, the same behavior should show up in MASM.
Where you wrote: “MASM can define “uninitialized data” and the linker interprets that quite literally,” could you please explain a bit more how the program itself, not the programmer, defines something, so that it isn’t anything the programmer has any control over? Or maybe what I’m missing here is that the programmer should have been more careful with how many bytes total he actually defined as “uninitialized” (not MASM), which MASM then turned into the data byte values it found in memory?
I’ve not seen a DC operator in released MASM versions. Of course who knows what Microsoft had in late 1980.
Daniel B. Sedory, hello once again.
You didn’t update this page still 😉
I can confirm that the programmers at Apple used the same concept when they wrote the DOS code. In DOS 3.3 (see the book “Beneath Apple DOS” for a serious in-depth explanation) the table of DOS commands has the high bit of the last character set for each command. This is the only version I can comment on; I only know DOS 3.2 from the “conversion tool” but never got my hands on it.
I used this to introduce some more commands by shortening lengthy command names to abbreviated ones (like “CATALOG” -> “CAT”) and using the freed space for new command names.
I was talking about MASM directives such as ‘DB 100 DUP(?)’, which reserves 100 bytes of memory whose contents are “indeterminate” according to old MASM documentation. Modern linkers and run-time environments typically ensure that “uninitialized” data is in fact zero-initialized, but that does not appear to have been the case for the tools used in DOS 1.x days. The programmer has control as far as saying “I don’t care what the memory contents might be”, and explicitly gives up control over what the memory contents will be exactly.
This is not about programmers being careful or not. Memory with “indeterminate” contents can still be used safely, when the programmer can ensure that the contents will be written under program control before being read. Think disk buffers for example, those will be first written with disk contents and only then read — pre-initializing such buffers is pointless (modulo bugs, of course).
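The mechanism can be illustrated with a toy sketch (Python; all names here are invented for the illustration, and this is of course not the real LINK.EXE logic):

```python
# Toy model of a linker that takes "uninitialized" literally: instead
# of zero-filling the reserved region, it emits whatever its own
# scratch memory happened to contain at the time of the link.
def link(code: bytes, uninit_len: int, linker_scratch: bytes) -> bytes:
    # A modern linker would emit b"\x00" * uninit_len, or no file
    # bytes at all (BSS-style); this one leaks its scratch memory.
    return code + linker_scratch[:uninit_len]

scratch = b"...stale heap: fragments of code or data..."
# 5 bytes of real code followed by 16 "uninitialized" bytes:
image = link(b"\xB8\x00\x4C\xCD\x21", 16, scratch)
print(image[5:])  # junk from the linker's memory, not zeros
```

The program still works, because it never reads those bytes before writing them; but anyone disassembling the on-disk image sees mysterious junk, exactly as in the PC DOS 1.1 boot sector.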
The SCP cross assembler uses the high bit to indicate the end of a string via the DM (Define Message) operator. I wouldn’t expect it to be documented by anything that IBM worked on, since that technique breaks down once programs start being written for languages other than English. Either a macro or an undocumented directive to do it likely existed in the 1980–82 time frame, since so much existing code depended on it. Rewriting the code to handle the extended character set correctly could be deferred until the IBM PC proved itself.
It took a lot of time to zero out memory on the slow processors of the time. Losing a few seconds on every file save would cost a lot more sales than having a few users see unexpected text in dead areas of a file.
Within the context of a mature grasp of security, me finds it inexcusable for the leak not to be plugged (i.e., the region filled with decently random data or a standard pattern), though. Even back then, that’d have been easy enough to achieve.

Ideally, of course, it should not be present in the object code until the final link (load), a la BSS…
Thank you for the explanation Michal,
Although the ‘purist’ and OCD part of me cringes a bit when seeing what was left in the distribution diskette contents, I for one am quite happy that the programmers back then apparently didn’t care at all (or had no idea) about what ended up in slack space, nor even about something being added to the end of a program (whatever the actual cause), in some cases making it longer than it should have been (e.g., PC DOS 1.10’s FORMAT.COM being 256 bytes longer than necessary). It provided us, so many decades later, with various “finds” (though most often like odd pieces in a very large puzzle) to help in understanding the history of DOS.
As you can see, I’ve been very busy with other things here… nor will I bother listing the other many personal things that required my attention, and I’m the kind of person who can truly focus on only 1 thing at a time!
However, thank you for your reminder… I’ll try to get my mind back into that later tonight.
I’ve currently been focused on making some basic IDA Pro *.i64 files (using the Free version) for early IBM PC DOS executable files… it uses a horrendous human interface, in my opinion, for which one needs either a full ‘teaching course’ or loads of experience, or actually both; but once an .i64 file has been worked on a whole lot, ‘tweaked’, and the ‘bad parts’ corrected, etc., it can be a nice and easy way to present a disassembly of the code.
Side track re BASIC tokens: somehow the usage of the high bit as a terminator in the keyword table also has the result that the string compare code thinks a plain-text keyword matches the token table even when the typed keyword is too short, as long as its last character has the high bit set. At least this is true for the 6502 version. This results in the BASIC keyword abbreviations available in all Commodore versions of 6502 BASIC, because Commodore’s PETSCII sets the high bit when typing shifted alphabetic characters (although the API to output strings also accepts the regular $40–5F / $60–7F for unshifted/shifted characters). The YouTuber 8-Bit Show And Tell has a video about this.
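A rough Python model of that quirk (the real code is 6502 assembly; the keyword list here is just an illustrative subset and the loop structure is simplified):

```python
# Model of a tokenizer compare loop where an input character with the
# high bit set (a shifted PETSCII letter) is treated as "the keyword
# ends here", so a shifted letter abbreviates any keyword with that
# prefix. Not the actual Microsoft/Commodore code, just the idea.
KEYWORDS = ["PRINT", "GOTO", "GOSUB"]  # illustrative subset

def crunch(typed: bytes):
    """Return the first keyword the typed bytes match, or None."""
    for kw in KEYWORDS:
        i = 0
        for ch in typed:
            if ch & 0x80:                      # shifted character:
                if (ch & 0x7F) == ord(kw[i]):  # compare low 7 bits and
                    return kw                  # accept the abbreviation
                break
            if ch != ord(kw[i]):
                break                          # plain-text mismatch
            i += 1
            if i == len(kw):                   # whole keyword matched
                return kw
    return None

print(crunch(b"PRINT"))                          # PRINT
print(crunch(bytes([ord('P'), ord('R') | 0x80])))  # "pR" -> PRINT
print(crunch(bytes([ord('G'), ord('O') | 0x80])))  # "gO" -> GOTO
```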
Bob O’Rear sounds like a porn star name XD
Interestingly, the PC wasn’t sold in Sweden until 1983, which coincides with the release year of DOS 2.0. I’m not sure about other countries whose languages need 8-bit characters, but I would assume the situation was the same.
zeurkous & Daniel B. Sedory:
Not zeroing out unused space has left some really interesting historic pieces. IIRC, older-than-previously-found versions of some files were found in the large unused space on Amiga “Kickstart” diskettes (which are basically disks containing what should have been in ROM but wasn’t finished when the first Amiga model was released, so they added a small boot loader ROM that can only read these types of disks, copy them to RAM, flip a bit in an I/O port which maps that RAM where the ROM is intended to be, and restart the computer).
Blame it on the (b)linker:
@MiaM: Yeah, for us, now, the effect ain’t so bad 🙂 Still bad practice
Moreover, it is extremely unlikely that the SCP assembler was ever used for the parts of DOS we’re talking about (boot sector, IBMBIO.COM). Evidence indicates that those were built with MASM.
It’s also quite possible, even likely, that even though PC DOS 1.0 was English only, IBM was already planning to support other languages (apparently DOS 1.1 existed in German, French, Spanish etc.). It is entirely plausible that Microsoft started with “let’s be clever and use the high bit as string terminator” and then IBM came and said “that’s not gonna work, you have to change it”. And Microsoft changed it, but only halfway at first (using null terminators but still setting the high bit in PC DOS 1.0).
Totally agree about not zeroing out unused areas. Clearing out junk was more expensive than not doing it, and the junk did not bother anyone at the time.
Security? What’s that? On a machine with zero networking capability? Keep in mind that at the time, security meant that other people never got any data you didn’t give them yourself (on a floppy). Different times.
Yeah, back then, no-one cared, just like no-one cared about long-standing bugs in sendmail(8) (to the point that RTM was criminally convicted while the admins were let off scot-free), no-one cared about intents of hammer murderers leaking to customers, and ANSI art was cool even though it could easily be abused to hose the receiver’s terminal.
The “zero networking” argument doesn’t really fly, as the 5150 *did* have serial ports, and you can bet they were used to uplink to bigger machines. Besides, shipping this stuff to customers, mass-duplicated, certainly counts as sharing. (You also forgot about the cassette port, but me guesses “no-one” used that…)
“It was wrong back then, and it is wrong today.” Glad we’ve matured. Somewhat, at least.
(telnet seems to be a bit of a diff case, as the computational power required for something like ssh appears to have been simply not available at the time. Is me wrong?)
Yeah, the few fortunate(?) had 300-baud or maybe even 1200-baud modems, and they used them to send data over insecure, unencrypted lines. So either security hadn’t been invented yet, or such communication was the equivalent of communicating on a private, closed network. Like I said, different times.
And yes, encryption wasn’t really practical, at least not real encryption that would be genuinely difficult to crack. The computing power just wasn’t there. Even now ssh is expensive without hardware encryption support (which modern CPUs do have), and that’s with many orders of magnitude faster CPUs.
The IBM PC had padding routines to completely fill out a block on the cassette interface. It doesn’t overwrite the gaps between files with blank space, though. While it wasn’t on the IBM PC, someone in New Jersey might have been unhappy to discover that a bootleg tape of a Ramones concert was partially overwritten with a copy of a word processor.
Mainstream use on disks could probably be traced to the Berkeley Fast File System, which had multiple files sharing a block in fragments. A big security risk, but having the system fill the excess fragments with blank data would not have fundamentally altered the file system design. It took a long time for micro systems to have enough memory to track exact file sizes and finish off a cluster with blank data.
Protecting memory would have prevented in memory patching which was common with paper tapes. Many of the other micro techniques could be blamed on paper tape. A lot of loading formats skipped unused memory locations and retained whatever had been in those memory addresses. At 10 bytes per second, any way to reduce the number of bytes to be loaded was vital.
If security was paramount, we would all be using a descendant of the iAPX 432.
DOS (released) always kept track of the exact file size… but I don’t think it ever cleared the unused parts of a cluster.
I believe early versions of 86-DOS did not record the file size in bytes, but in April 1981 86-DOS already kept track of the exact byte count. There are hints in the DOS revision history that directory entries used to be smaller (16 bytes?). That would have been well before PC DOS 1.0 was released.
An undated preliminary 86-DOS Instruction Manual confirms that directory entries used to be 16 bytes (function 17 returns those); unfortunately I don’t see any mention of what was in the remaining 5 bytes (11 bytes were taken up by the file name). The starting cluster probably needed 2 bytes, so there might have been room for a 24-bit file length, but it’s unclear to me if the FCB interface in that version allowed the file size to be set with byte granularity like the later versions. I suspect the file size was only kept in terms of 128-byte records, but that is guesswork.
Apart from security, there are also intellectual property concerns with “junk” from memory getting written into files that get distributed.
I wonder if anyone’s ever had legal problems due to the “junk” being bits of things they don’t have the right to distribute… But then even in modern times there have been instances of things like games containing unused copyright-infringing assets, seemingly without consequences.
If some slack space junk gets distributed without people’s knowledge (and certainly without intent), it might be very difficult to pin the blame on someone specific. Even proving exactly where the junk came from might get quite tricky.
I worked at a mortgage bank (U.S.A.) and learned to program in C on DOS 1.x (I don’t remember which version, since I just used whatever we had). We were in a mad scramble to port our software from an HP 3000 mainframe to IBM PCs. Being a bank, it was critical that all memory and files were zero-initialized. We had only about six months to convert everything, because they had already sold the HP and everything HAD to be done or the bank wouldn’t have had anything to make loans with. Crazy, crazy time.
While I HATED DOS, there were two OSs that I fell in love with within a two-year period: BeOS and OS/2, starting when I saw a presentation of the OS/2 2.0 beta in Seattle at the then IBM building. Sadly, neither replaced Microsoft. But Microsoft WILL be replaced eventually. I just hope I’m alive to see it.
Encryption may not have been common at the time, but RC4 was a fairly simple algorithm designed to run well on 8-bit processors.