How Not To Release Historic Source Code

This is how to not do it:

GitHub

Don’t get me wrong, it’s absolutely brilliant that Microsoft was able to release a fairly complete (minus DOSSHELL) source code for MS-DOS 4.00 or 4.01 (see below). As much as it was hated, DOS 4.0 was an important milestone and DOS 5.0 was much more similar to DOS 4.0 than not. This source code will be an excellent reference of modern-ish DOS until Microsoft officially releases the long ago leaked MS-DOS 6.0 source code. The source code includes all required build tools, which makes building it (compared to many other source releases) extremely easy.

But please please don’t mutilate historic source code by shoving it into (stupid) git.

First of all, git does not preserve timestamps, which causes irreversible damage. Knowing when a source file was last modified is valuable information.

Second of all, the people releasing the source code clearly thought, hey, it’s source code, let’s shove it into git, what could possibly go wrong. Well, this is what could go wrong:

Nope, not building

For practical purposes, old source files are not text files. They are binary files, and must be preserved without modification. It is not OK to take an old source file and convert it to UTF-8. For one thing, UTF-8 didn’t even exist in the times of MASM 5.10 and Microsoft C 5.1, so of course old tools can’t deal with it!

The above problem was most likely caused by taking a source line containing codepage 437 characters and badly converting it to UTF-8. That made the source line too long, past the circa 512 byte line length limit of MASM.
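To illustrate how such a conversion can push a line past the limit, here is a minimal sketch. The comment line is made up for the example, and 512 is only the approximate limit mentioned above, not an exact MASM figure.

```python
# Hypothetical comment line: a semicolon followed by 200 CP437
# horizontal line-drawing bytes (0xC4), as it would sit on a DOS disk.
MASM_LINE_LIMIT = 512  # approximate limit from the text, an assumption

original = b";" + b"\xc4" * 200        # 201 bytes
text = original.decode("cp437")        # 201 characters, 0xC4 -> U+2500
converted = text.encode("utf-8")       # U+2500 takes 3 bytes in UTF-8


def fits(line: bytes, limit: int = MASM_LINE_LIMIT) -> bool:
    """Return True if the line is no longer than the assumed limit."""
    return len(line) <= limit


print(len(original), fits(original))    # 201 True
print(len(converted), fits(converted))  # 601 False
```

Every CP437 line-drawing byte becomes three bytes in UTF-8, so a comment box that was comfortably short in the original can end up roughly three times as long.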

In the case of getmsg.asm it’s easy enough to manually delete the too long line in a comment. But it’s much worse with the src\SELECT\USA.INF file. Here, the misguided use of git not only made some comment lines too long for MASM, but it also actively destroyed the original source code. The byte arrays defined near labels PANEL36 and PANEL37 got turned into junk, or more accurately into a sequence of Unicode replacement characters.
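Why the damage cannot be undone can be shown with a small sketch. The byte values are hypothetical stand-ins, not the actual PANEL36 or PANEL37 data, and the sketch assumes the bytes were at some point decoded as UTF-8 with invalid sequences replaced.

```python
# Two different hypothetical byte arrays, neither of which is valid UTF-8.
panel_a = bytes([0xB0, 0xB1, 0xB2, 0xFE, 0xFF])
panel_b = bytes([0x80, 0x81, 0x82, 0x83, 0x84])

# Decoding as UTF-8 with replacement turns each invalid byte into U+FFFD.
text_a = panel_a.decode("utf-8", errors="replace")
text_b = panel_b.decode("utf-8", errors="replace")

# Writing the result back out as UTF-8 yields EF BF BD per lost byte.
saved_a = text_a.encode("utf-8")

print(text_a == "\ufffd" * 5)   # True
print(text_a == text_b)         # True: two different inputs, same output
print(saved_a == panel_a)       # False: the original bytes are gone
```

Because different inputs collapse into the same run of replacement characters, no tool can recover the original bytes from the converted file; they have to come from another copy.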

This blunder is all the more regrettable because similar problems affected the previous GW-BASIC source release (very old MASM versions cannot deal with UNIX style line endings).

The timestamp destruction makes it harder to pin down what the source code actually is. The DOS 4.0 release was very confused because IBM first released PC DOS 4.0 in June 1988 (files dated 06/17/1988), but soon followed with a quiet update (files dated 08/03/1988) where the disks were labeled 4.01 but the software still reported itself as 4.00.

The just released source code almost certainly corresponds to this quiet 4.01 update. At least one source comment implies 8/5/88 modification, i.e. August 1988.

At least the core files (IO.SYS, MSDOS.SYS, COMMAND.COM, FORMAT.COM, FDISK.EXE, SYS.COM) built from the source release are a perfect match for the files on “MS-DOS 4.00” disk images that can be found on winworldpc.

Said files are dated 10/06/1988 and DOS reports itself as 4.00. However, the released source code, in the file SETENV.BAT, includes the following line:

echo setting up system to build the MS-DOS 4.01 SOURCE BAK...

This further suggests that the source code in fact corresponds to the quiet update of DOS 4.01 and not to the original IBM DOS 4.00 from June 1988, which to the best of my knowledge was never available from Microsoft. After a few months, perhaps in late 1988, Microsoft changed DOS to report itself as 4.01 because, unsurprisingly, the 4.00 version number was confusing customers.

As a historic footnote, BAK stood for Binary Adaptation Kit. MS-DOS OEMs would receive the BAK to adapt to their hardware. However, most OEMs did not receive the full source code, only the code to components that likely needed modification, such as IO.SYS.

But the fact that the “Source BAK” was something that Microsoft shipped to (select lucky) customers is actually great—since it’s supposed to be built by 3rd parties, it includes all of the required tools and is in fact quite easy to build.

Executive Summary

It’s terrific that the source code for DOS 4.00/4.01 was released! But don’t expect to build the source code mutilated by git without problems.

Historic source code should be released simply as an archive of files, ZIP or tar or 7z or whatever, with all timestamps preserved and every single byte kept the way it was. Git is simply not a suitable tool for this.
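As a rough sketch of what "every byte and every timestamp preserved" means in practice, the following round-trips one file through a tar archive using Python's standard `tarfile` module. The file name, contents, and date are chosen for illustration only, and the helper `archive_and_restore` is hypothetical.

```python
import calendar
import os
import tarfile
import tempfile


def archive_and_restore(data: bytes, mtime: int) -> tuple[bytes, int]:
    """Store one file in a tar archive, extract it, return (bytes, mtime)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src")
        out = os.path.join(tmp, "out")
        os.makedirs(src)
        os.makedirs(out)
        path = os.path.join(src, "GETMSG.ASM")
        with open(path, "wb") as f:       # binary mode: no newline translation
            f.write(data)
        os.utime(path, (mtime, mtime))    # give the file its historic mtime
        tar_path = os.path.join(tmp, "dos.tar")
        with tarfile.open(tar_path, "w") as tar:
            tar.add(path, arcname="GETMSG.ASM")
        with tarfile.open(tar_path, "r") as tar:
            if hasattr(tarfile, "data_filter"):   # newer Python versions
                tar.extractall(out, filter="data")
            else:
                tar.extractall(out)
        restored = os.path.join(out, "GETMSG.ASM")
        with open(restored, "rb") as f:
            return f.read(), int(os.stat(restored).st_mtime)


stamp = calendar.timegm((1988, 8, 3, 12, 0, 0))   # illustrative date
original = b"; \xc4\xc4\xc4 box\r\n"              # CP437 bytes and CR/LF
restored, restored_mtime = archive_and_restore(original, stamp)
print(restored == original, restored_mtime == stamp)
```

The archive treats the file as opaque bytes, so CR/LF line endings and codepage 437 characters come back unchanged, and the modification time is restored on extraction.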


79 Responses to How Not To Release Historic Source Code

  1. Thalia Archibald says:

    Disk images for Seattle Computer Products 86-DOS 0.1 and 0.34 were uploaded to the Internet Archive in 2023 and the one for 0.34 says it contains some source code. Have you taken a look at those?
    https://archive.org/details/@f15sim?query=dos&and%5B%5D=mediatype%3A%22software%22

  2. ForOldHack says:

    Besides the printer spooler, you did not need stacks until you loaded a network layer, in which case, they added the config.sys stacks command in PC-DOS 3.1. Remember that 4.0 is derived from DOS 2.0.

    I also seem to remember a shareware program that made command.com a TSR, but I cannot seem to find it in umich or elsewhere.

  3. llm says:

    Maybe a highlight to note

    the tools folder contains a nearly full MSC 5.1 installation with compiler, assembler, stdlib, headers, etc.

    I would love to help in any form with releasing that in source too; some of the games I’m reversing are based on this compiler

  4. Michal Necasek says:

    No, IBM/MS-DOS 4.0 is derived from DOS 3.2/3.3. You may be thinking of the Multitasking DOS 4.0.

    The extra small 64-byte stack in the DOS 4.0 loader is only used early during boot, effectively it needs to deal with INT 13h and whatever timer or possibly other hardware interrupts might occur.

  5. Git is not to blame for the various archival problems, except perhaps to the degree that Git added that horrible “CR/LF conversion” feature and in some cases has documentation that encourages users to use it.

    The number one rule with Git is to set `core.autocrlf = false`, and then you know it will never modify the data in your files. (If you don’t want CR-LF in the text files in your repo, don’t commit those; any mistake with this is glaringly obvious in a `git diff` or `git log --patch`.) If you have tools that can’t handle LF, or can’t handle CR-LF, fix the tools.

    The (attempted) conversion from whatever character set and encoding that the original used to UTF-8 was nothing to do with Git; Git can’t even do this kind of thing for you. Yes, the files should have been committed just as they originally came and, again, if for some reason you need UTF-8 versions of these, build tools to do the conversion from the committed original files.

    No, Git doesn’t keep timestamps, and for good reason. (Do you really want your build system to link in that new `.o` without recompiling it when you pulled out an old version of the `.c` file to test how your program worked with that?) But even things that do attempt to keep timestamps don’t do a reliable job of it. ZIP file timestamps, for example, are traditionally stored in local time but there is no time zone information with them. So who knows what the real timestamp is? Depends on the time zone not only of the filesystem when the file was created, but the time zone of the person making the ZIP file.

    The correct solution to dealing with timestamps is to record them explicitly, in a file that you commit along with everything else. I’ve got a few repos where timestamps are important, and I keep and commit a file called TIMESTAMPS listing all the files and their timestamps as of the time of commit, though obviously you’d want to do this slightly differently for historical files. (And I’ve a tool for re-stamping the files to their original timestamps when I happen to need that.)

    (If you want suggestions on/help with timestamping stuff like this, you can find my e-mail address on my GitHub user page for “0cjs”.)
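A minimal sketch of the explicit-manifest idea described in this comment might look like the following. Only the file name TIMESTAMPS comes from the comment; the one-line-per-file format (epoch seconds, a tab, a relative path) and both function names are assumptions made for illustration, not the commenter's actual tool.

```python
import os

MANIFEST = "TIMESTAMPS"  # name from the comment; the format is an assumption


def write_manifest(root: str) -> None:
    """Record the mtime of every file under root as '<epoch>\\t<path>'."""
    lines = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip the repository metadata and keep a stable ordering.
        dirnames[:] = sorted(d for d in dirnames if d != ".git")
        for name in sorted(filenames):
            if dirpath == root and name == MANIFEST:
                continue
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            lines.append(f"{int(os.stat(path).st_mtime)}\t{rel}\n")
    manifest_path = os.path.join(root, MANIFEST)
    with open(manifest_path, "w", encoding="utf-8", newline="\n") as f:
        f.writelines(lines)


def restore_timestamps(root: str) -> None:
    """Re-stamp files under root from the manifest written above."""
    with open(os.path.join(root, MANIFEST), encoding="utf-8") as f:
        for line in f:
            stamp, rel = line.rstrip("\n").split("\t", 1)
            target = os.path.join(root, *rel.split("/"))
            os.utime(target, (int(stamp), int(stamp)))
```

In this sketch, `write_manifest` would be run before committing and `restore_timestamps` after a checkout. The manifest is ordinary text, so changes to recorded timestamps show up in a diff.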

  6. ecm says:

    I restored the SELECT files to exactly match the release found at “Microsoft MS-DOS 4.00 (10-6-1988) (5.25-360k).7z” from https://archive.org/details/ms-dos-4.00-and-4.01

    The identicalise work is done on a branch of my hg repo: https://hg.pushbx.org/ecm/msdos4/shortlog/identicalise (On the default branch I started adding new features.)

    The following files are only found in the archive.org images:

    ~/test/20240430/400/content$ ls
    AUTOEXEC.BAT GWBASIC.EXE PCIBMDRV.MOS SHELLC.EXE SHELL.MEU
    CONFIG.SYS HIMEM.SYS README.TXT SHELL.CLR
    DOSUTIL.MEU LINK.EXE SHELLB.COM SHELL.HLP

    (The link.exe here is not an exact match for the one in the src/tools directory of the free software release.)

    The file xmaem.sys is only found in the free software release.

  7. ecm says:

    Not sure whether my prior comment made it through, I don’t see it anywhere. I fixed all the encoding bugs in select. I wrote about that some more in the BTTR Software forum, DOS Ain’t Dead. (I copied the comment I submitted here and posted most of it verbatim.)

  8. ecm says:

    As for the small stack it may be of note that the stack may underflow, and this may or may not cause problems. Due to the near jump and lack of alignment directives, the msload stack is actually on an odd address. On a 386 pushing to offset 0FFFFh in Real 86 Mode (where limit is 0FFFFh) would fault I believe. Not sure about the 286.

  9. Michal Necasek says:

    Of course git doesn’t keep timestamps, it can’t. Which is why it’s not really suitable for archiving old source code.

    Indeed, who knows what the original timestamp is? The files were created and managed on systems that did not keep track of timezones.

    The whole point is that there already exists a tool eminently suitable for the job of preserving historic source code including contents and timestamps, and it’s not git.

  10. Michal Necasek says:

    The XMAEM.SYS driver was distributed as part of IBM DOS 4.0. Microsoft distributed EMM386.SYS instead.

  11. Michal Necasek says:

    I think you got that backwards? If SP=FFFFh and you push a word on it, SP will be FFFDh. If you pop, there’s a problem.

    And if SP=1 then yes, pushing will fault. It’s the one good way to trigger a triple fault in real mode that I know of.

  12. felsqualle says:

    @ecm, thank you so much for your work – this was the last bit I was missing to get a build that perfectly matches the original release for _all_ binaries that are included.

  13. ecm says:

    Yes, the stack will underflow when you try to push with sp = 0001h. By “pushing to offset FFFFh” I was referring to the push that happens with an input of 0001h, which will try to subtract 2 and then (try to) write to the segment end boundary address.
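The two cases discussed in this exchange can be sketched as plain 16-bit arithmetic. This only models the stack pointer and the offsets a word push would touch; it does not model which exception a given CPU raises, and the function name is made up for the example.

```python
def push_word(sp: int) -> tuple[int, bool]:
    """Model a 16-bit PUSH: return (new SP, whether the word write
    would run past offset 0xFFFF, the end of a 64 KiB segment)."""
    new_sp = (sp - 2) & 0xFFFF
    # A word written at offset 0xFFFF needs bytes at 0xFFFF and 0x10000.
    crosses_limit = new_sp == 0xFFFF
    return new_sp, crosses_limit


print([hex(v) if i == 0 else v for i, v in enumerate(push_word(0xFFFF))])
print([hex(v) if i == 0 else v for i, v in enumerate(push_word(0x0001))])
```

With SP=FFFFh the push lands at FFFDh and stays inside the segment; with SP=0001h the stack pointer wraps to FFFFh and the word write straddles the segment end.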

  14. Michal Necasek says:

    I’d think that pushing can only overflow and popping can only underflow… but whatever. A misaligned stack is definitely a potential problem.

    I doubt we’ll ever know why they used a 64-byte stack — but it clearly worked at the time. I could not find any guideline for stack sizes when calling into the BIOS, so someone had to pull a number out of thin air. And they were a little too optimistic on the saving memory side.

    It’s a good reminder that stack overflows are really nasty. They trigger highly unpredictable behavior where the true cause is far from obvious.

  15. ecm says:

    It was actually a 64-word stack: https://hg.pushbx.org/ecm/msdos4/rev/3757ddd142b0#l1.7

    As for overflow or underflow you’re right, I just get confused about the direction that the stack “grows” (using push).

  16. I wonder if that DOS 4 stack crash that in turn crashes out VMware is some potential zero day exploit? Although if someone has console on your VMware hosts trying to install DOS 4, you’re already in deep trouble.

    I did see that there is a lot of EAEAEA Extended Attributes stuff in backup/restore. I need to see if it can be re-enabled for an OS/2-enabled backup/restore from DOS; that’d sure be handy! Maybe reading backups from multiple directories would also be nice, so I could load up more than one backup on a target disk…

  17. Yuhong Bao says:

    Which kind of “crash”?

  18. Yuhong Bao says:

    Keep in mind that DOS itself used a 192 word/384 byte internal stack.

  19. Michal Necasek says:

    There’s also extended attribute support in COMMAND.COM, e.g. COPY seems to be set up to also copy over EAs. But the EA (they abbreviated it as XA) support in the DOS kernel itself is stubbed out, with a good chunk of code written and then commented out.

    My guess is that the plan was to bring EAs and HPFS to DOS, but for whatever reason it ended up not happening.

  20. Michal Necasek says:

    Thank you! 128 bytes, not 64. I have a vague memory that INT 13h is entered with about 100 bytes of stack space available, which makes sense with a 128-byte stack but not a 64-byte one.

  21. Michal Necasek says:

    Exactly! That’s three times as big as a 128-byte stack.

  22. Michal Necasek says:

    > The whole point is that there already exists a tool eminently suitable for the job of preserving historic source code including contents and timestamps, and it’s not git.

    And which tool would that be? Surely not ZIP, which not only does not support time zones, but lets you re-make new copies of the ZIP file with different timestamps, or even different contents, with little chance of being detected.

  23. Michal Necasek says:

    I don’t know how the MS-DOS 4.0 source code was actually preserved. But I know that the MS-DOS 1.x and 2.x source code was stored on standard FAT-formatted floppies. Which obviously record no information whatsoever about the time zone. How do you convert such timestamps to UTC? Do you just invent something?

    As for changes being detected… we already know that the MS-DOS 4.0 source code on GitHub was modified after it was published (to obscure Tim Paterson’s name in an unflattering comment), with no record of the change in the commit history. This is only known because some people managed to clone the originally published version.

    We also know for a fact that the files on GitHub were modified and are 100% not originals (failed UTF-8 conversion). So… I guess I don’t follow your arguments?
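To make the point about FAT timestamps concrete, here is a small sketch that decodes the packed 16-bit date and time words of a FAT directory entry. The bit layout is the standard one; the function name and the sample values are made up for the example.

```python
from datetime import datetime


def decode_fat_timestamp(date_word: int, time_word: int) -> datetime:
    """Decode FAT directory entry date/time words into a naive datetime."""
    year = 1980 + (date_word >> 9)         # bits 15-9: years since 1980
    month = (date_word >> 5) & 0x0F        # bits 8-5
    day = date_word & 0x1F                 # bits 4-0
    hour = time_word >> 11                 # bits 15-11
    minute = (time_word >> 5) & 0x3F       # bits 10-5
    second = (time_word & 0x1F) * 2        # bits 4-0, in 2-second units
    return datetime(year, month, day, hour, minute, second)


stamp = decode_fat_timestamp(0x1103, 0x6000)
print(stamp, stamp.tzinfo)                  # 1988-08-03 12:00:00 None
print(decode_fat_timestamp(0x0021, 0x0000)) # 1980-01-01 00:00:00
```

The result is a naive local time with 2-second resolution. There is simply no field for a time zone, so any conversion to UTC has to assume one.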

  24. Richard Wells says:

    UTC considerations were rather irrelevant with DOS 1 or 2 or even any later DOS. Before the RTC, the time and date was whatever entered. No guarantee that would be even close to accurate. RTCs frequently drifted considerably from exact time. The filestamp will always be incorrect. It took the introduction of networking to force time to be somewhat consistent across systems.

    The only way to keep track of the development of software was to hope that the comments noted when and how changes were done.

  25. Michal Necasek says:

    By 1988 (MS-DOS 4.0 times), Microsoft was already networked and time was likely synchronized over the network, so it probably wasn’t too far off. Probably. Of course that’s assuming the files came from Redmond, which is not a given.

    There definitely are lots of files with January 1980 timestamps, which is impossible for anything created on DOS.

    Once developers started using make utilities, the timestamps had to be sane. Not necessarily accurate, but the easiest way to keep them sane probably was to make sure that the PCs’ clocks were more or less accurate.

  26. Retron says:

    “There definitely are lots of files with January 1980 timestamps, which is impossible for anything created on DOS.”

    A sure sign of something created on an original PC or XT, in that case – as they defaulted to midnight on 01/01/80, not having an RTC/BIOS battery (although you could get one as an aftermarket add-on).

  27. MiaM says:

    Side track: did PC users that didn’t have a RTC ever use some software that just incremented the date one day for each boot (saving the “current” date on disk) just to make sure things like make utilities work, and time stamps not being that far off?

  28. Richard Wells says:

    @MiaM: I haven’t seen any such utility. It wouldn’t make sense for most users since they might have multiple programs with different OS boot disks. Skip a few days because of using another program or a holiday or sprint through days if playing a booter game with frequent reboots and the clock is off considerably. Simpler to enter the date and time at boot since the time would need to be entered as few show up for work at midnight.

    The change in clock availability was one of the hidden alterations to computer usage. Line clocks were an expensive add-on in the early 70s but practically free by the time of the IBM PC, which is why the PC includes one. RTCs went from several hundred dollars to only about $20 in just a few years.

  29. MiaM says:

    @Richard Wells: True that it was a hidden or at least overlooked change to computers. In turn I would say that CMOS chips were the key that allowed an RTC that would actually be reasonable to run off batteries. As a comparison, some of the early digital clock radios from the mid-1970s had battery backup for the time and the alarm, but at least some of them consumed a full 9V battery in just a few hours (and that was to drive the logic in the clock IC; the display was switched off except for flashing a single dot every second to indicate that the batteries were still alive). If you lived somewhere where the grid wasn’t reliable you would likely only have the battery connected when you actually needed the alarm to wake you up.

    Re timers and whatnot: I find it weird that Apple chose not to have any timers in the Apple II. Given that they sold it for a rather high price, I would think that they could have afforded to include a 6522 or similar chip.
    Side track: One of the differences between the 6522 used in many 6502 systems, and the 6526 used in the Commodore 64 and whatnot, is that the 6526 has a full time-of-day clock with a separate mains frequency input. However, since the Commodore 64 is mostly based on the earlier VIC 20, the ROM code still uses one of the system clock derived timers to generate a 60Hz interrupt that in turn runs a software counter to count hours, minutes, seconds and 1/60sec “tics”, which was the only method available with the earlier 6522.

    Re that type of software: I wrote something similar but a bit more manual. Each boot it would read a file and set the time/date according to what was stored in that file, and I would manually run a command that would update the file before powering off or rebooting, at least if I remembered to do so. That was later on in the 1990s on my Amiga 1200 though, but I would think that something similar would have been useful with an XT class computer with a hard disk but without an RTC. I also did some testing with a feature that some Amigas had where you could run code when a user did the three finger salute. Experimentally I had code that would write the time/date to disk, but I deemed it too dangerous to have it in place as there was a hardware timer that ran for a few seconds and if the software hadn’t finished what it was doing it would do a hard reset. Sure, there was a “dirty” bit that would ensure that the disk wouldn’t get corrupted, but still. I wouldn’t say that it worked well, but way better than not having it at all. I wouldn’t have wanted to enter date/time at every boot, so the alternative would have been loads of files with the default date/time. Soon after this I acquired a combined memory expansion and RTC.
