The following article was printed in Computer Shopper, June 1992 issue (page 152). Commentary follows.
The Big Squeeze
Compression Scheme Shatters Storage Logjam
Todd Daniel believes he has found a way to revolutionize data storage as we know it.
DataFiles/16, a zip-style data-compression product released by Wider Electronic Bandwidth Technologies (WEB), allows users to compress data at a 16-to-1 ratio. That’s a major advance when you compare it with Stacker 2.0’s 2-to-1 ratio.
DataFiles/16 relies purely on mathematical algorithms; it works with almost any binary file. Because of the math involved, the ratio of compression is directly proportional to the size of the uncompressed file. In order to get the full effect of 16-to-1 compression, the original file must be at least 64K.
During a demonstration at our offices, Daniel, the company’s vice president of research and development, compressed a 2.5Mb file down to about 500 bytes using four levels of DataFiles/16. Because successive levels compress at a lower ratio as the volume of the file decreases, DataFiles/16 directly zips and unzips files to the chosen level.
After compressing a file, users can compress the data another eight times with DataFiles/16. This gives DataFiles/16 the potential to compress most files to under 1,024 bytes, whether the original file is 64K or 2.6 gigabytes. By comparison, SuperStor 2.0’s new compression technique can be performed only once.
By June, WEB plans to release its first hardware packages utilizing the same method. The two new device-driver cards will operate impeccably, compressing and decompressing data on the fly at ratios of 8-to-1 and 16-to-1, respectively.
A standard defragmentation program will optimize data arrangement, while an optional disk cache will speed access time. Both cards will come in DOS, Macintosh, and Unix versions. The DOS version is scheduled for a July release, and the company says the others will follow shortly.
The implications of WEB’s data-compression technique in the communications field have yet to be calculated, but Daniel says a 16-to-1 ratio could save certain companies up to 5 percent of their storage costs. If DataFiles/16 lives up to its early promise, data compression will have taken a quantum leap forward. — Jim O’Brien
So much for Computer Shopper. Why have you (most likely) never heard of DataFiles/16? Because it was a scam, of course. And since it wasn’t published in the April issue, it was presumably not a hoax by Computer Shopper itself but rather by the company behind it.
The article perhaps highlights the terrible fate of journalists: writing about things they don’t understand. A computer scientist, or really anyone with a passing familiarity with information theory, would immediately recognize the claims as impossible and preposterous, if enticing. The question about the article isn’t whether the whole thing was a scam, only how many people were in on it.
DataFiles/16, like other similar scams, was most likely an attempt to defraud investors rather than scam end users. Such compression could never work, so the question is only whether the software failed to achieve anything like the claimed compression ratios, or if it did… and could never decompress to anything resembling the original files.
These days it may be more difficult to set up compression scams, but the hucksters certainly didn’t disappear, they just moved elsewhere.
Alright, it could’ve been like that, medoesn’t recall *exactly*,
but fact remains that mewas never ‘disappointed’ by the drives’
To unwind the stack a little, PS pointed out above that formatting
a disk is not the same as formatting a file system. Me’d like to add
that at least in the UNIX world, ‘formatting a disk’ is a now very
uncommon operation (especially w/ the demise of floppies), while the
very diff ‘creating a filesystem’ has survived and is not even limited
to disks (creating an fs on tape or in a regular file usually works).
Because mess-dos was originally designed to work w/ floppies only, with
no expected compatibility w/ other systems, the operation called
‘formatting’ there includes creating a file system, the latter
of which has become the primary meaning of the term on windoze, which
inherited the naming convention from mess-dos (me’s not sure how VMS did
it, which is relevant for NT). Hence the term ‘low-level format’ when
referring to hard drives, which aren’t usually formatted as often as
floppies were, ’cause hdds became the more complex and varied sort of
device (in most cases).
The command to create a filesystem on a disk in VMS is INITIALIZE. It only does a LLF on floppies. LLFing hard disks was done using special diagnostic utilities.
Same as PCs. Low-level formatting of hard disks often required vendor-specific utilities (anything beyond dumb ST-506 drives, anyway). DOS FORMAT always discovered and marked bad floppy sectors during formatting; I think that was possible for hard disks too: FORMAT read the entire hard-disk partition, though it didn’t low-level format the drive. Utilities like Norton Disk Doctor had the ability to check for bad sectors and mark them in the FAT (on both floppies and hard disks).
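For the curious, the FAT-side half of that bad-sector handling is simple in principle: the cluster covering a bad sector gets a reserved “bad” value in the allocation table so the filesystem never hands it out again. A minimal sketch, assuming FAT16 (where the reserved value is 0xFFF7) and a toy in-memory table; this illustrates the idea, not any particular utility’s code:

```python
FAT16_FREE = 0x0000
FAT16_BAD = 0xFFF7   # reserved FAT16 entry value: "bad cluster, never allocate"

def mark_bad_clusters(fat, bad_clusters):
    """Mark the given clusters as bad in a FAT16 table (a list of 16-bit entries)."""
    for c in bad_clusters:
        if fat[c] == FAT16_FREE:   # only retire clusters not currently used by a file
            fat[c] = FAT16_BAD

# Toy 16-entry FAT, all free; pretend a surface scan found clusters 5 and 9 bad.
fat = [FAT16_FREE] * 16
mark_bad_clusters(fat, [5, 9])
print([hex(e) for e in fat[4:10]])   # clusters 5 and 9 now read 0xfff7
```

A real tool would of course also have to relocate any data already in those clusters before retiring them.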
Oh, the pain. On ST-506 drives there were usually two or three ways to format the drive. On 8-bit ISA cards you usually fired up debug and jumped to an entry point a few bytes into the eprom, for example g=c800:5 or similar. On 16-bit ISA cards that were 100% compatible with the card IBM supplied with their AT, you used one of the several tools for this class of cards. Otherwise you usually had to use a vendor-specific program, or maybe in rare cases you could press some key during boot, or maybe use debug.
I don’t miss this a single bit. Perhaps the only good thing about this is that you had a chance to save data from bad disks using SpinRite from Gibson Research.
Compared to earlier systems, I liked the IBM PC system for formatting drives. Widely documented and lots of third parties devising easy alternate software. Having ROM capacity surge while prices dropped was very freeing for system design.
Several examples of how bad non-IBM PC drives got follow.
DEC didn’t let some systems do an LLF on floppy disks; the disks had to be purchased preformatted. At first this was to work around a bug with early drives, but it later became a clear money grab. Some people tracked down programs that ran on an IBM AT to do the SSQD initialization needed for the RX50.
With disk packs, Diablo shipped them unformatted, and Control Data would format and test them in house; no user methods were provided. Waiting 6 hours for a technician to install and format a drive was untenable. An exception was the Xerox Alto, which had tools to format packs, install file systems, and transfer files (erasing them from the first pack), but lacked any function to test for bad sectors. Oops.
The TRS-80 Model 2 had one of the more full-featured format programs, which did the low-level format, tested the disk, laid out the file system, and installed system software if needed. Requiring 10 consecutive flawless tracks for the system software rather makes it necessary to do all the steps with one program.
That sounds like Xerox alright: bad things don’t exist, they can’t.
Though that attitude has spread quite a bit since.
At least back then it was possible to do a low-level format from software. Nowadays the tracks are so small and so close together that low-level formatting a hard drive requires the read-write head to be positioned at least an order of magnitude more accurately than the head-positioning motor is physically capable of managing without the preexisting servo marks on the disk to guide it; low-level formatting consequently has to be done before the drive is sealed, by a big machine called a servowriter.
As regards the original topic of the article:
>Such compression could never work, so the question is only whether the software failed to achieve anything like the claimed compression ratios, or if it did… and could never decompress to anything resembling the original files.
There’s a third possibility: they could have demonstrated it reaching a 16:1 compression ratio and then successfully decompressing the files… using only files that could easily be compressed that much by any compression program (such as large, very sparse database files or large, single-colour bitmaps). Even legitimate compression programs will occasionally be fed something containing humongous amounts of redundant information, and consequently achieve compression ratios that seem impossibly high; I’ve run across 7-Zip archives with compression ratios better than 10:1 myself.
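That third possibility is easy to demonstrate with any stock compressor: a sufficiently redundant input blows well past 16:1. A quick sketch using Python’s zlib, with an all-zeros buffer standing in for the sparse database file:

```python
import zlib

# A large, highly redundant input -- the kind of file that makes any
# compressor look miraculous (e.g. a mostly-empty database or bitmap).
data = bytes(1_000_000)          # one million zero bytes
packed = zlib.compress(data, 9)
ratio = len(data) / len(packed)
assert ratio > 16                # far beyond DataFiles/16's claimed ratio
print(f"{ratio:.0f}:1")
```

Feed the same compressor already-compressed or random data, of course, and the ratio collapses to roughly 1:1, which is exactly why the universal 16:1 claim is impossible.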
And keep in mind that this is all lossless compression; in a situation where it’s possible to use lossy compression, far higher compression ratios are possible. MP3 files frequently achieve compression ratios of 20:1 compared to the original WAV file, and the encoding used in DVD-Video often has compression ratios approaching 100:1 (a four-hour movie at a fairly low – 1024×768 – resolution, which would take up
3 bytes [24 bits] per pixel
x1024 pixels per row
x768 rows per frame
x24 frames per second
x60 seconds per minute
x60 minutes per hour
x4 hours per movie
=815,372,697,600 bytes [815.373 gigabytes, or 759.375 gibibytes]
uncompressed, fits on an 8.5-gigabyte DVD in compressed form).
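The arithmetic above can be checked mechanically (the 8.5-gigabyte figure is the dual-layer DVD capacity the comment assumes):

```python
bytes_per_pixel = 3                        # 24-bit colour
frame = bytes_per_pixel * 1024 * 768       # one 1024x768 frame
uncompressed = frame * 24 * 60 * 60 * 4    # 24 fps, 4-hour movie
print(uncompressed)                        # 815372697600
print(round(uncompressed / 8.5e9, 1))      # 95.9, i.e. approaching 100:1
```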
…oops, didn’t see the page full of older comments until just now! :-S
>Or the ongoing hdd capacity scam.
What scam? “Giga” is, and has always been, a decimal prefix. Giga=1000^3. The hard drive makers are using it correctly here. If you want to talk about 1024^3, the correct prefix is “gibi”.
Just because the computer pioneers wrongly used decimal prefixes to refer to binary quantities doesn’t mean that we need to keep misusing them today. It doesn’t even make it a good idea to keep using decimal prefixes instead of binary.
Linux already uses the binary prefixes; hopefully, sometime soon, Microsoft and the RAM makers will see the writing on the wall and switch to using decimal prefixes solely for decimal quantities (such as drive size) and binary prefixes solely for binary quantities (such as RAM size).
Although, to be fair, even that’s not as bad as how the floppy disk makers abused the “mega” prefix (a floppy-manufacturer’s megabyte is neither a megabyte [1,000,000 bytes] nor a mebibyte [1,048,576 bytes], but rather the geometric mean of the two [1,024,000 bytes]! A prime example of why the golden-mean fallacy is a fallacy…).
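The geometric-mean observation checks out exactly, since 1000 × 1024 is the square root of 1000² × 1024². A quick sketch (the 1,474,560-byte figure is the usual raw capacity of a “1.44 MB” floppy):

```python
import math

MB_decimal = 1000 ** 2        # 1,000,000  -- the SI megabyte
MiB_binary = 1024 ** 2        # 1,048,576  -- the mebibyte
MB_floppy = 1000 * 1024       # 1,024,000  -- the floppy-maker's "megabyte"

# Exact geometric mean: sqrt(1000^2 * 1024^2) = 1000 * 1024
assert MB_floppy == math.isqrt(MB_decimal * MiB_binary)
print(1_474_560 / MB_floppy)  # 1.44 -- hence the "1.44 MB" label
```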
“Giga is and always has been a decimal prefix” only works if one is willing to pretend that decades of common usage in the field of computing never happened. You can go read, for example, Intel’s current manuals while pretending that “GB” means 1,000,000,000 bytes, but things just won’t add up. You can argue that it was always wrong (which, lacking a time machine, is utterly pointless), but you can’t argue that 1GB of RAM is 1,000,000,000 bytes.
Gibi and the other prefixes indicating that the value is a power of 1024 did not exist until 1998 and probably took a few years for anyone to notice. My first programs were in the 70s; I am a bit too set in my ways to bother with the new definitions. Not that it mattered back then since paper tape was unwieldy and prone to rip long before the kilobyte capacity could be reached. Even in the promised land of disks, no one would care about decimal versus binary since the storage was only reported as blocks or records not bytes or words.
It is a good thing that none of the standards people who are concerned with accurate and exact numeric labels were around in the cassette storage era. At least four different systems called their transfer rate 1500 bps; none gave the exact same transfer rate as the others and none would give the exact same transfer rate on all data.
>You can argue that it was always wrong (which, lacking a time machine, is utterly pointless)
Pointless as regards back then, sure, but not at all pointless now; it’s (usually) easier to convince people to switch over if they can be convinced that what they were previously using was wrong or erroneous in some way (such as decimal prefixes being used to refer to binary quantities), rather than it just being someone wanting to change an old standard to a new one.
>but you can’t argue that 1GB RAM is 1,000,000,000 bytes.
Which is exactly why the RAM makers should start labelling their products in GiB, rather than GB.
I can understand the desire to remove ambiguity but to some extent it’s a solution in search of a problem. No one is confused by a 4GB memory stick, but I have a feeling that more than a few potential buyers will wonder what the heck 4GiB is, and if it’s more or less than 4GB.
For editing etc, using uncompressed audio is beneficial, and on typical modern systems used for editing, storage space isn’t a concern when compared to the size of audio files.
But one of the very significant uses of digital audio files is portable audio, and the ability to carry your music collection with you wherever you go. As storage space on such devices is limited, compression provides very significant benefits there.
Yes, you could solve that by only keeping part of your collection on a mobile device and changing which part on a regular basis, but that’s not the same as having everything on the device, and why would anyone go for such a complicated workaround when it’s simply not needed at all?
As to your ‘text’ comments for storing audio and video data, take a look at what the IFF format for the Amiga did and why. Having a container format in which you can combine different kinds of data, and which has metadata to identify what kind of data is in each tagged block, may seem more complicated than needed, but it provides very significant benefits. The idea of just using the first few bytes of a file to identify the data in it is used a lot, but it’s fraught with problems, as it’s fairly easy to end up with some raw dump of binary data that gets misidentified as some specific format. It’s the way things are in the unix world, and it’s certainly better than using ‘filename extensions’, but it’s not actually a very good solution. Any real solution should at the very least contain an identifier denoting the container format, some metadata concerning the contained data, and of course the contained data itself. This metadata should contain enough information to read and check the validity of the contained data, which for audio includes things like bitrate, sample size, and number of channels.
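The chunk layout being praised here is easy to sketch: every IFF-85 chunk is a 4-byte tag, a big-endian 32-bit length, the data, and a pad byte to even alignment. A minimal reader, with the ‘TEST’ form type and ‘NAME’ chunk invented purely for illustration:

```python
import struct

def iff_chunks(buf):
    """Yield (id, data) for each IFF-85 chunk in buf:
    4-byte tag, big-endian 32-bit length, data, pad byte to even alignment."""
    offset = 0
    while offset + 8 <= len(buf):
        cid = buf[offset:offset + 4]
        (length,) = struct.unpack(">I", buf[offset + 4:offset + 8])
        yield cid, buf[offset + 8:offset + 8 + length]
        offset += 8 + length + (length & 1)   # skip the pad byte on odd lengths

# Toy FORM whose body holds a made-up form type and one "NAME" chunk.
body = b"TEST" + b"NAME" + struct.pack(">I", 5) + b"hello\x00"
form = b"FORM" + struct.pack(">I", len(body)) + body

top_id, top_data = next(iff_chunks(form))
form_type = top_data[:4]                        # b"TEST"
inner_id, inner_data = next(iff_chunks(top_data[4:]))
print(top_id, form_type, inner_id, inner_data)  # b'FORM' b'TEST' b'NAME' b'hello'
```

Note how a reader can skip any chunk whose tag it doesn’t recognize, because the length is always there; that is the property that makes the format extensible.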
[again, only noticed Bart’s response just now]
Me’s essentially proposing the same thing, but w/ character granularity,
as opposed to the chunk granularity of IFF. (Mepondered IFF thoroughly
before coming up w/ me approach, so no surprises there :). Combined with
a terminal pipe that multiplexes on the character level, this solves all
the problems of integrating modern media into the traditional stream of
text (which me’d very much argue should be kept, for the sake of keeping
simple things simple), resulting in a massive simplification of the
system. A ton of special interfaces can be effectively thrown out.
Don’t tell me that’s not an improvement.