There are situations where software is available only in the form of a floppy image. This is especially true of historic hardware drivers and patches, which large OEMs like IBM or Compaq often distributed in no other form.
Initially, floppy images were distributed as data files, with a separate program required to write them onto a physical diskette. Typically, such programs could only write the image to a floppy and had no ability to extract individual files from the file system on the floppy image (more or less universally the FAT file system). IBM’s LOADDSKF utility is one example of such a program.
Around 1990, someone realized that the program to restore an image onto a physical floppy could be small enough (5-20 KB) that self-extracting floppy images were feasible, similar to self-extracting archives. Compared to the size of a high-density floppy, the size of a self-extracting stub was negligible. Especially for distributing software that fit on 1-2 floppies, it was far simpler to publish 1-2 self-extracting floppy images than to provide a separate utility and documentation on how to use it. The self-extracting floppy images also tended to be self-explanatory, and separate documentation was not necessary.
Disk Express
One of the earlier tools used for creating self-extracting images was Disk Express (DXP), a shareware utility by Albert J. Shan. The first public release (version 1.01) was on November 25th, 1991.
The original Disk Express ran on DOS and OS/2, and created self-extracting images with a stub to run on OS/2 or a stub to run on DOS. DXP version 2 could produce a combined stub that would run on both OS/2 and DOS.
The program (including the self-extracting stub) was written in C and was not greatly optimized for size. On the other hand, DXP supported compression which often more than made up for the stub size.
The last known DXP release, version 2.34, was released on March 8th, 1994.
DXP could also create stand-alone floppy images. A licensed version of DXP came with a program called XTRACT; this was functionally very similar to the self-extracting stub, but suitable for distribution with a larger number of “bare” floppy images.
Later DXP versions could also protect the floppy image with a passphrase. This functionality was rarely used.
DXP version 2 could conveniently test the integrity of the image, using the CRC checksums stored in the image, and also display the image description, if present. The verification looked like this:
C:\TEMP>164adap2.exe /t
┌────── Diskette Image Description ──────┐
16/4 Adapter II, p/n 54G2095, diagnostic diskette V2.00. This also contains instructions for ripl on Novell and Lan Server 3.0 in the rpl subdir.
└────────────────────────────────────────┘
1.44M diskette image stored in file.
100% Read
32-bit CRC stored: 8081F838
32-bit CRC computed: 8081F838
IBM was one of the OEM licensees of DXP. The “IBM special” version of DXP creates images that are not compatible with “normal” DXP; more on that later.
DXP is limited to “standard” PC floppy formats (360K, 720K, 1.2M, 1.44M etc.), which may be why it was superseded by other tools.
CopyQM SXD
Another tool commonly used for creating self-extracting images was CopyQM from Sydex, the makers of TeleDisk, AnaDisk, and other floppy-related utilities.
CopyQM was a shareware utility for copying floppies, which was also capable of creating and restoring floppy images. A commercial version called CopyQM Plus came with a MAKESXD utility which could take a floppy image created by CopyQM and convert it into a self-extracting image. Later versions of CopyQM Plus (version 3.1 and above) could create self-extracting images directly, without the need to first create a CopyQM image and then convert it via MAKESXD.
The self-extracting format is referred to as SXD. While this term cannot be found in CopyQM documentation, the MAKESXD name is rather suggestive, and the floppy data inside self-extracting images always starts with an “SXD” signature.
The exact SXD timeline is unclear. A CopyQM manual suggests that CopyQM Plus was certainly able to create self-extracting images in 1994, but how far back this ability went is unknown.
The SXD format is at about the middle of the spectrum in terms of the floppies it can handle. It is not limited to standard DOS formats, but it only handles “uniform” disks with a constant number of sectors per track and sector size. SXD can also store (and restore) non-standard sector interleave and skew, producing floppies that can be read faster. This gave SXD (and CopyQM) the ability to efficiently deal with 1.68M diskettes formatted using Microsoft’s DMF.
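As a quick illustration of what “uniform” means here, the capacity of such a disk is simply the product of its geometry values. The helper name `uniform_capacity` is made up for this sketch, and the DMF geometry (80 cylinders, 2 heads, 21 sectors of 512 bytes) is the commonly cited one rather than something taken from the SXD format itself.

```python
def uniform_capacity(cylinders, heads, sectors_per_track, sector_size=512):
    """Size in bytes of a disk where every track has the same layout."""
    return cylinders * heads * sectors_per_track * sector_size

print(uniform_capacity(80, 2, 18) // 1024)  # standard 1.44M: 1440 KB
print(uniform_capacity(80, 2, 21) // 1024)  # DMF: 1680 KB
```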
Like DXP, the SXD format enabled the creation of password-protected images, a capability that was again very rarely used.
Similar to DXP, the self-extracting images created by CopyQM Plus were fairly self-explanatory:
C:\TEMP>am2win1.exe
Self-extracting diskette image processor (DOS), Version 1.03
Copyright 1995, Sydex, Inc. All Rights Reserved.
This file was created on Jun 14, 1995 12:51:54
Actionmedia Windows device driver Version 1.2.22, Disk 1 of 2.
Please enter a drive letter compatible with a 1.44M 3.5" disk, or press ESC to quit:
Extracting Self-Extractors
DXP and SXD self-extracting images both have a straightforward structure. A DOS stub executable is optionally followed by an OS/2 executable, which is followed by the image data. The DOS MZ header points to the start of the image data.
The SXD format additionally allows a “WB” block to precede the image. This block typically contains the text of a license agreement which users have to accept before extracting the image.
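A minimal sketch of locating the payload, under one assumption that should be stated loudly: it reads “the MZ header points to the start of the image data” as meaning the standard MZ size fields (pages and bytes in the last page), so the payload starts right where the recorded executable ends. Whether that is exactly what the stubs rely on is not confirmed here. The function names are made up, and treating “WB” as a two-byte marker at the start of the license block is likewise an assumption.

```python
import struct

def mz_image_size(header: bytes) -> int:
    """Length of the DOS executable as recorded in its MZ header."""
    if header[:2] not in (b"MZ", b"ZM"):
        raise ValueError("not an MZ executable")
    # e_cblp: bytes used in the last 512-byte page; e_cp: number of pages.
    last_page_bytes, pages = struct.unpack_from("<HH", header, 2)
    if last_page_bytes == 0:
        return pages * 512
    return (pages - 1) * 512 + last_page_bytes

def classify_payload(data: bytes) -> str:
    """Peek at whatever follows the stub (markers are assumptions)."""
    payload = data[mz_image_size(data):]
    if payload.startswith(b"SXD"):
        return "SXD image"
    if payload.startswith(b"WB"):
        return "SXD license block"
    return "unknown (possibly DXP)"
```

The DXP case is deliberately left as “unknown”, since recognizing it properly requires parsing and checksumming the DXP header.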
After that… things get very murky. There is no official documentation for the DXP or SXD format.
For DXP, there is a good description of the image format, although the origin of the information is quite unclear. For the SXD format, there is effectively nothing.
DXP
The DXP format description is all well and good until one gets to this bit: “If the file is compressed, then each track is compressed separately. Each track in this case is preceded by a little-endian word, giving the length of the compressed data.”
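The per-track framing described there can be sketched as follows. This only splits the stream into compressed chunks; it assumes the “word” is 16 bits wide, and that the starting offset and track count come from the image header, which is not modeled here. The function name is hypothetical.

```python
import struct

def split_compressed_tracks(buf: bytes, offset: int, track_count: int):
    """Return the compressed byte string of each track, in order."""
    tracks = []
    for _ in range(track_count):
        (length,) = struct.unpack_from("<H", buf, offset)  # little-endian word
        offset += 2
        chunk = buf[offset:offset + length]
        if len(chunk) != length:
            raise ValueError("image is truncated")
        tracks.append(chunk)
        offset += length
    return tracks
```

Each returned chunk would then be handed to the decompressor on its own, since every track is compressed separately.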
Almost all DXP images are compressed. And there is not a single word about how. The DXP 1.01 documentation states: “Advanced data compression based on a modified Lempel-Ziv-Huffman-RLL algorithm.” Hmm… that could be just about anything.
But wait! There’s more. In the Acknowledgements section of the DXP documentation, there’s the following: “First on the list I must thank Haruyasu Yoshizaki for the conception of the compression code. Some pieces of DSKEXP have been lifted directly from the C version of LH compression code.” That name sounds familiar!
c:\util>lha.exe
LHA version 2.13 Copyright (c) Haruyasu Yoshizaki, 1988‑91
=== <<< A High‑Performance File‑Compression Program >>> ======== 07/20/91 ===
Usage: LHA <command> [/option[‑+012|WDIR]] <archive[.LZH]> [DIR\] [filenames]
-------------------------------------------------------------------------------
Yep, LHA. With something to search for, I went on a wild goose chase, trying to understand the history of the LHA utility and file format. (A whole other fascinating rabbit hole.) I was especially looking for what source code might have been available in 1991, when DXP was first written.
Eventually I found a source file called LZHUF.C. Based on the DXP disassembly, it looked like it might be the right thing. So I plugged it into my DXP decoder and… bingo! It worked perfectly.
My joy was short-lived. I quickly discovered that the decompression only works with DXP 1.x images, but not with version 2.x. The DXP 2.0 manual says little: “New data compression algorithm based on a modified Lempel-Ziv-Huffman method.”
Well… the old DXP also used “a modified Lempel-Ziv-Huffman method”. So that does not help at all. Back down the LHA rabbit hole…
The LZHUF.C algorithm corresponds to the lh1 compression method of LHA. Newer LHA versions also support methods like lzs, lz5, or lh5. Maybe it could be one of them?
Trying to compress a block of floppy data with LHA using different methods soon gave me the answer. DXP 2.x uses the lh5 method. The source was published as part of a LHA-like archiver called ar002 by Haruhiko Okumura.
Plugging the lh5 decompressor into my DXP code let me successfully extract DXP 2.x images.
Then there was the matter of CRCs. To verify that I’m extracting the floppy images correctly, I needed to check the CRC stored in the images. Anyone who has worked with CRCs knows that “32-bit CRC” is a rather vague term. Bits can be processed MSB-first or LSB-first, the initial value can be zero or all-bits-one or something else, and the CRC polynomial isn’t always the same either.
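To make that vagueness concrete, here is a small parameterized CRC-32, written purely for illustration. The default parameters are those of the common zlib/PKZIP CRC-32; they are not claimed to be the ones DXP uses. Changing any single knob yields a completely different checksum for the same data.

```python
import zlib

def reflect(value, width):
    """Reverse the lowest `width` bits of value."""
    out = 0
    for _ in range(width):
        out = (out << 1) | (value & 1)
        value >>= 1
    return out

def crc32_generic(data, poly=0x04C11DB7, init=0xFFFFFFFF,
                  lsb_first=True, xor_out=0xFFFFFFFF):
    """Bitwise CRC-32 with selectable bit order, seed and final XOR."""
    crc = init
    for byte in data:
        if lsb_first:
            byte = reflect(byte, 8)
        crc ^= byte << 24
        for _ in range(8):
            crc = ((crc << 1) ^ poly if crc & 0x80000000 else crc << 1) & 0xFFFFFFFF
    if lsb_first:
        crc = reflect(crc, 32)
    return crc ^ xor_out

data = b"123456789"
print(hex(crc32_generic(data)), hex(zlib.crc32(data)))  # these two agree
print(hex(crc32_generic(data, lsb_first=False)))        # different bit order
print(hex(crc32_generic(data, init=0)))                 # different seed
```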
The CRC is also highly useful for identifying floppy images. The DXP image header is small and has its own CRC; if the DXP header can be read from a file and successfully checksummed, the odds that the file wouldn’t be a DXP format image are astronomically low. And a valid checksum also indicates that the header contents ought to be trustworthy.
For sorting out the CRC, I disassembled the DXP self-extracting stub. That let me identify the exact CRC algorithm, and also how DXP uses it. DXP 2.x is easy enough: it stores the CRC of the uncompressed image data. If the uncompressed image data has a valid CRC, one can be highly confident that it was extracted correctly. Given how DXP 2.x operates, the checksum is also identical whether the image is compressed or not.
DXP 1.x works rather differently. For uncompressed images, DXP checksums the image data which follows the header. That includes track headers, which means that although DXP 1.x and 2.x use the same CRC algorithm, the checksums will be different for the same floppy. For compressed images, DXP is even odder, and checksums all of the compressed track data, not including the track headers (which contain the size of the compressed data).
And then there’s IBM DXP. The CRC checksums of images created by IBM-licensed DXP utility are different. A quick disassembly showed that the algorithm is identical but the initial CRC seed value is different. It is possible that other OEM-licensed variants used yet different seeds. When opening an image, my code has to try the known seed values (currently only two) to figure out which one a given image uses.
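The try-each-seed approach can be sketched like this. Everything specific in it is a placeholder: the seed values in `KNOWN_SEEDS` are invented for the example (the real ones come from the stub disassembly), and the reflected polynomial and the absence of a final XOR are assumptions, not a description of DXP’s actual CRC.

```python
def crc32_seeded(data, seed):
    """Reflected CRC-32 (polynomial 0xEDB88320) starting from `seed`."""
    crc = seed
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xEDB88320 if crc & 1 else crc >> 1
    return crc

# Placeholder seeds, NOT the real DXP values.
KNOWN_SEEDS = {"standard": 0xFFFFFFFF, "IBM": 0x12345678}

def detect_seed(data, stored_crc):
    """Return the name of the seed that reproduces stored_crc, or None."""
    for name, seed in KNOWN_SEEDS.items():
        if crc32_seeded(data, seed) == stored_crc:
            return name
    return None
```

Because the header is small, running the checksum once per known seed costs next to nothing, and a match identifies the DXP variant as a side effect.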
SXD
Decoding SXD images was rather more work since the format is completely undocumented. Naively, one might think that since SXD images are created from CopyQM images, the format would be the same. But no, it’s not the same at all. It is very similar in terms of capabilities (what kind of floppies it can represent), but the way the floppy contents are stored in an SXD image is quite different.
CopyQM only uses simple run-length encoding (RLE) compression. SXD can use the same compression, but more often uses some form of “advanced” compression.
The only way to crack the SXD format was to sit down with an IDA disassembly and try to understand how the self-extracting stub works. After a while, I understood enough of what’s stored in the SXD header; the contents are quite similar to CopyQM, just the layout is very different.
One minor annoyance was that almost all the messages in the SXD self-extracting stub are scrambled. The program itself is not obfuscated in any way, but the messages are all scrambled with a simple XOR cipher. I used the HIEW utility to help with getting the plain text of the messages. Error messages are always a valuable help when trying to understand what a particular piece of disassembled code does.
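For readers who have not met this kind of scrambling before, the sketch below shows the general idea with a single-byte key. That the key is a single byte is an assumption made for the example, and the key value and message text are invented; the actual SXD scheme is not reproduced here.

```python
def xor_bytes(data: bytes, key: int) -> bytes:
    """XOR every byte with the same key; applying it twice restores the input."""
    return bytes(b ^ key for b in data)

def guess_key(scrambled: bytes):
    """Return the keys under which the whole buffer becomes printable ASCII."""
    return [key for key in range(1, 256)
            if all(32 <= b < 127 for b in xor_bytes(scrambled, key))]

secret = xor_bytes(b"Insert diskette in drive", 0x5A)  # 0x5A is an arbitrary demo key
for key in guess_key(secret):
    print(hex(key), xor_bytes(secret, key))
```

Several keys usually survive the printable-ASCII filter, so the final choice is made by eye, which is where a hex viewer earns its keep.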
The decompressor was the real missing piece. Fortunately, after working with the DXP format, I had some idea about how various algorithms worked, and I was able to recognize certain constants used by the decompressor. It didn’t take long to realize that SXD was based on the same LZHUF.C algorithm as DXP 1.x.
But not quite. Just plugging the same code into the SXD decoder didn’t work. Or rather it worked… but only sometimes. After a little bit of head-scratching I realized that the standard LZHUF.C (and LHA) presets the decoding buffer to contain ASCII space characters, which is suitable for text files. SXD presets the buffer to zeros, which is more suitable for floppy images. With this one modification, I was able to extract SXD images.
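Why the preset matters can be shown with a stripped-down model of the LZSS half of the algorithm. This is not LZHUF.C: the adaptive Huffman stage is left out entirely, and the token representation (an int for a literal, a `(distance, length)` tuple for a match) is invented for the sketch. Only the window size `N`, the maximum match length `F`, and the starting position `N - F` follow LZHUF.C.

```python
N = 4096   # ring buffer size in LZHUF.C
F = 60     # maximum match length in LZHUF.C

def lzss_expand(tokens, fill):
    """Expand literals (ints) and (distance, length) matches (tuples)."""
    window = bytearray([fill]) * N   # the preset under discussion
    r = N - F                        # LZHUF.C starts writing here
    out = bytearray()
    for token in tokens:
        if isinstance(token, int):
            window[r] = token
            out.append(token)
            r = (r + 1) & (N - 1)
        else:
            distance, length = token
            src = (r - distance) & (N - 1)
            for k in range(length):
                byte = window[(src + k) & (N - 1)]
                window[r] = byte
                out.append(byte)
                r = (r + 1) & (N - 1)
    return bytes(out)

# A match that reaches back before any real data picks up the preset bytes.
tokens = [(16, 4), ord("A")]
print(lzss_expand(tokens, 0x20))  # b'    A'  (space preset, as in LZHUF.C)
print(lzss_expand(tokens, 0x00))  # b'\x00\x00\x00\x00A'  (zero preset, as in SXD)
```

A compressor that knows the window starts out full of zeros can encode a leading run of zero bytes as a match right away. Decoding that stream with a space-filled window silently produces spaces instead, which would explain why the unmodified code only worked some of the time.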
Once again, I also needed to sort out the CRC. SXD uses CRC-16 and unlike DXP, the uncompressed data of each track is checksummed separately. This adds a slight overhead but if an image is partially corrupted, the CRC clearly says which tracks are good and which are not. As always, different tradeoffs.
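The benefit of per-track checksums is easy to show in a few lines. The CRC-16 used below, `binascii.crc_hqx` from the Python standard library, is only a stand-in: the actual SXD CRC-16 parameters are not given here, and the function name `bad_tracks` is hypothetical.

```python
import binascii

def bad_tracks(tracks, stored_crcs, crc16=lambda d: binascii.crc_hqx(d, 0)):
    """Return indexes of tracks whose CRC-16 does not match the stored value."""
    return [i for i, (data, stored) in enumerate(zip(tracks, stored_crcs))
            if crc16(data) != stored]

tracks = [b"\x00" * 9216, b"\xF6" * 9216, b"boot" * 2304]
stored = [binascii.crc_hqx(t, 0) for t in tracks]
tracks[1] = b"\xF6" * 9215 + b"\x00"   # simulate damage to one track
print(bad_tracks(tracks, stored))      # only the damaged track is reported
```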
As with the DXP format, I used the disassembly of the SXD self-extracting stub to tell me exactly how the SXD CRC algorithm works; there wasn’t anything particularly interesting about it. With the checksumming working, I can again be highly confident that images are getting decompressed correctly.
Unused Tracks
By default, both the DXP and SXD image creation tools analyze the FAT of the source floppy and skip any trailing unused cylinders (SXD) or tracks (DXP). This saves space in the resulting image; the savings can be quite large if the source floppy only contains a small number of files but is mostly filled with junk (which does not compress well).
Such an approach has obvious advantages, but complicates digital archiving. The contents of the omitted tracks are simply indeterminate; there is no way to know what was in them. Yet when re-creating the floppy data (such as converting to raw images), the unused sectors must be filled with something.
Floppies used for software distribution almost always use a uniform byte pattern in unused sectors. Often the filler byte is F6h (standard when formatting floppies), but it can also be 00h. Omitting unused tracks can thus lead to mismatches when comparing floppy data preserved through different means.
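When converting to a raw image, the padding step might look like the sketch below. The function name and the choice of F6h as the default are illustrative; as noted above, neither F6h nor 00h can be known to be “right” for a given disk.

```python
def pad_raw_image(stored: bytes, full_size: int, filler: int = 0xF6) -> bytes:
    """Extend the stored tracks to a full-size raw image using a filler byte."""
    if len(stored) > full_size:
        raise ValueError("stored data is larger than the target format")
    return stored + bytes([filler]) * (full_size - len(stored))

# 30 stored cylinders of a 1.44M disk (2 heads, 18 sectors, 512 bytes each)
partial = b"\x00" * (30 * 2 * 18 * 512)
raw = pad_raw_image(partial, 80 * 2 * 18 * 512)
print(len(raw), hex(raw[-1]))
```

Two archives of the same disk, padded with different filler bytes, will compare as different even though every sector that mattered is identical.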
That said, tracks that were omitted implicitly did not contain any important information relevant for the purpose of the floppy.
Code
The source code dealing with DXP and SXD self-extracting floppy images can be found in the OS/2 Museum repository. It is not necessarily production quality code, but should serve as a good example of how to deal with these images.
Addendum: TeleGet
Both DXP and SXD formats were used by IBM. Shortly after adding support for these formats, I came across another one used by IBM: TeleGet.
As the name suggests, TeleGet is a close relative of Sydex’s TeleDisk. Unlike SXD, it is not a self-extracting format. In the early 1990s, before DXP and SXD, IBM distributed floppy images with a .TG0 extension and made the TeleGet utility available for download.
TeleGet is effectively a stripped-down TeleDisk, only capable of restoring images. TeleGet was licensed to OEMs and according to Chuck Guzis, TeleGet images were deliberately incompatible with TeleDisk and also incompatible across OEM licenses of TeleGet.
Again, the TeleGet format is completely undocumented, but given its lineage I suspected that it’s related to the TeleDisk format, not least because TeleGet is supposed to handle “strange” floppy formats (something TeleDisk is good at).
After a bit of reverse engineering, it turned out that TeleGet images are in fact almost like TeleDisk images. As far as I can tell, the only difference is that the TeleGet header is not the same; the ‘TD’/‘td’ signature is followed by 10 bytes that identify the OEM who licensed that particular copy of TeleGet. In IBM’s case, the string is “IBM NSC” (for IBM National Support Center). Because the header layout is different, TeleDisk won’t work with TeleGet images and vice versa (since the header CRC won’t match).
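Reading the OEM field is about the only part of this that can be sketched without the full header layout. The function below is hypothetical and assumes the 10-byte field is padded with NULs or spaces. It cannot tell a TeleGet file from a genuine TeleDisk one by itself, since both start with the same signature; that distinction needs the header CRC check.

```python
def teleget_oem(header: bytes):
    """Return the 10 bytes after the signature as text, or None if too short
    or unsigned. Only meaningful if the file really is a TeleGet image."""
    if len(header) < 12 or header[:2] not in (b"TD", b"td"):
        return None
    oem = header[2:12]
    return oem.rstrip(b"\x00 ").decode("ascii", errors="replace")

sample = b"TD" + b"IBM NSC".ljust(10, b"\x00") + b"rest of header"
print(teleget_oem(sample))
```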
Within a short time, I was able to modify my existing TeleDisk code to account for the different TeleGet header and “convert” it to the normal TeleDisk header. Once that is done, the TeleDisk decoding logic can handle TeleGet images with no trouble, or at least the few IBM TeleGet images I could find.