How Not To Release Historic Source Code

This is how to not do it:

GitHub

Don’t get me wrong, it’s absolutely brilliant that Microsoft was able to release a fairly complete (minus DOSSHELL) source code for MS-DOS 4.00 or 4.01 (see below). As much as it was hated, DOS 4.0 was an important milestone and DOS 5.0 was much more similar to DOS 4.0 than not. This source code will be an excellent reference of modern-ish DOS until Microsoft officially releases the long ago leaked MS-DOS 6.0 source code. The source code includes all required build tools, which makes building it (compared to many other source releases) extremely easy.

But please please don’t mutilate historic source code by shoving it into (stupid) git.

First of all, git does not preserve timestamps, which causes irreversible damage. Knowing when a source file was last modified is valuable information.

Second of all, the people releasing the source code clearly thought, hey, it’s source code, let’s shove it into git, what could possibly go wrong. Well, this is what could go wrong:

Nope, not building

For practical purposes, old source files are not text files. They are binary files, and must be preserved without modification. It is not OK to take an old source file and convert it to UTF-8. For one thing, UTF-8 didn’t even exist in the times of MASM 5.10 and Microsoft C 5.1, of course old tools can’t deal with it!

The above problem was most likely caused by taking a source line using codepage 437 characters and badly converting them to UTF-8. That made the source line too long, past the circa 512 byte line length limit of MASM.
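The size blow-up is easy to reproduce. Below is a minimal sketch in Python. It assumes the offending line is a comment ruler built from CP 437 box-drawing characters; the 200-character width and the exact 512-byte limit are illustrative assumptions, not values taken from the real getmsg.asm.

```python
# Illustrative only: the ruler width and the exact limit are assumptions.
MASM_LINE_LIMIT = 512  # approximate line length limit mentioned in the text


def utf8_length_of_cp437_line(raw: bytes) -> int:
    """Return the byte length of a CP 437 line after re-encoding it as UTF-8."""
    return len(raw.decode("cp437").encode("utf-8"))


# A comment line: ';' followed by 200 double-horizontal box characters (0xCD).
original = b";" + b"\xcd" * 200
converted_len = utf8_length_of_cp437_line(original)

print(len(original), "bytes as CP 437")
print(converted_len, "bytes as UTF-8")
print("too long for MASM:", converted_len > MASM_LINE_LIMIT)
```

Each box-drawing character occupies one byte in CP 437 but three bytes in UTF-8, so a line that fit comfortably before conversion can end up well past the limit.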

In the case of getmsg.asm it’s easy enough to manually delete the too long line in a comment. But it’s much worse with the src\SELECT\USA.INF file. Here, the misguided use of git not only made some comment lines too long for MASM, but it also actively destroyed the original source code. The byte arrays defined near labels PANEL36 and PANEL37 got turned into junk, or more accurately into a sequence of Unicode replacement characters.
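Why this kind of damage cannot be undone can be shown in a few lines. The following is a sketch in Python, assuming the conversion step decoded the file as UTF-8 and substituted U+FFFD for every invalid byte; the sample bytes are made up and are not the real PANEL36 or PANEL37 data.

```python
REPLACEMENT = "\ufffd".encode("utf-8")  # b'\xef\xbf\xbd'


def lossy_utf8_roundtrip(raw: bytes) -> bytes:
    """Decode as UTF-8, substituting U+FFFD for invalid bytes, then re-encode."""
    return raw.decode("utf-8", errors="replace").encode("utf-8")


def count_replacement_chars(data: bytes) -> int:
    """Count UTF-8 encoded U+FFFD sequences, a tell-tale sign of damage."""
    return data.count(REPLACEMENT)


# Two different made-up byte strings containing CP 437 graphics characters.
a = b"DB " + bytes([0xB3, 0xB0, 0xB3])
b = b"DB " + bytes([0xBA, 0xB1, 0xBA])

damaged_a = lossy_utf8_roundtrip(a)
damaged_b = lossy_utf8_roundtrip(b)

print("inputs differ:", a != b)
print("outputs identical:", damaged_a == damaged_b)
print("replacement characters:", count_replacement_chars(damaged_a))
```

Two different inputs collapse into the same output, so the original bytes cannot be recovered from the converted file alone. Counting the `EF BF BD` sequence is also a quick way to find affected files in a tree.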

This blunder is all the more regrettable because similar problems affected the previous GW-BASIC source release (very old MASM versions cannot deal with UNIX style line endings).

The timestamp destruction makes it harder to pin down what the source code actually is. The DOS 4.0 release was very confused because IBM first released PC DOS 4.0 in June 1988 (files dated 06/17/1988), but soon followed with a quiet update (files dated 08/03/1988) where the disks were labeled 4.01 but the software still reported itself as 4.00.

The just released source code almost certainly corresponds to this quiet 4.01 update. At least one source comment implies 8/5/88 modification, i.e. August 1988.

At least the core files (IO.SYS, MSDOS.SYS, COMMAND.COM, FORMAT.COM, FDISK.EXE, SYS.COM) built from the source release are a perfect match for the files on “MS-DOS 4.00” disk images that can be found on winworldpc.

Said files are dated 10/06/1988 and DOS reports itself as 4.00. However, the released source code, in the file SETENV.BAT, includes the following line:

echo setting up system to build the MS-DOS 4.01 SOURCE BAK...

This further suggests that the source code in fact corresponds to the quiet update of DOS 4.01 and not to the original IBM DOS 4.00 from June 1988, which to the best of my knowledge was never available from Microsoft. After a few months, perhaps in late 1988 Microsoft changed DOS to report itself as 4.01 because—unsurprisingly—the 4.00 version number was confusing customers.

As a historic footnote, BAK stood for Binary Adaptation Kit. MS-DOS OEMs would receive the BAK to adapt to their hardware. However, most OEMs did not receive the full source code, only the code to components that likely needed modification, such as IO.SYS.

But the fact that the “Source BAK” was something that Microsoft shipped to (select lucky) customers is actually great—since it’s supposed to be built by 3rd parties, it includes all of the required tools and is in fact quite easy to build.

Executive Summary

It’s terrific that the source code for DOS 4.00/4.01 was released! But don’t expect to build the source code mutilated by git without problems.

Historic source code should be released simply as an archive of files, ZIP or tar or 7z or whatever, with all timestamps preserved and every single byte kept the way it was. Git is simply not a suitable tool for this.
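As a rough illustration of what a plain archive preserves, here is a sketch in Python using the standard zipfile module. The file name SAMPLE.ASM, its contents, and the chosen date are hypothetical. Note that ZIP stores local time with two-second resolution.

```python
import os
import tempfile
import time
import zipfile


def archive_and_inspect(payload: bytes, mtime_tuple):
    """Zip one file with a given mtime; return (stored date_time, stored bytes)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "SAMPLE.ASM")  # hypothetical file name
        with open(src, "wb") as f:
            f.write(payload)
        # Give the file an old modification time, interpreted as local time.
        stamp = time.mktime(mtime_tuple + (0, 0, -1))
        os.utime(src, (stamp, stamp))

        zpath = os.path.join(tmp, "release.zip")
        with zipfile.ZipFile(zpath, "w") as zf:
            zf.write(src, arcname="SAMPLE.ASM")
        with zipfile.ZipFile(zpath) as zf:
            info = zf.getinfo("SAMPLE.ASM")
            return info.date_time, zf.read("SAMPLE.ASM")


# Made-up content with CP 437 bytes and CRLF line endings.
payload = b"; \xcd\xcd\xcd box characters\r\nMOV AX,1\r\n"
date_time, data = archive_and_inspect(payload, (1988, 8, 3, 12, 0, 0))

print(date_time)
print("bytes unchanged:", data == payload)
```

The archive returns both the modification time and the exact bytes, with no knowledge of encodings or line endings required from whoever creates it.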


81 Responses to How Not To Release Historic Source Code

  1. starfrost says:

    I worked with MS to get this released. If you really want I can probably get the original ZIP (I don’t know about timestamps, but in ASCII).

  2. starfrost says:

    The reason I cannot do timestamps is because data protection law mandates anonymisation of source files, at least that is the policy.

    Note that this is far better than the CHM “preservation”.

  3. Michal Necasek says:

    Do you know how the files survived? On floppies or on some hard disk backup? If on disks, then releasing the disk images would be terrific. But really anything that builds without having to be edited would be good.

    There’s probably some way to fix the files in git as well? As I said they’re not plain ASCII (most are, but not all of them) and they must not be converted to UTF-8.

  4. Michal Necasek says:

    Let’s hear it for idiotic corporate policies! Obviously not your fault.

    But I have to point out that in the Multitasking DOS release, there are source files on a floppy image, with original timestamps…

    Yes, the CHM release of DOS source code was… not ideal.

  5. Michal Necasek says:

    I should say that except for the git mutilation, the DOS 4.0 source code seems to be complete (no DOS shell, but that never seems to have been part of it) and easy to build. In comparison, building the DOS 2.x source code was hell because I first had to find just the right tools. Here it’s all included, and the resulting binaries match what Microsoft released in 1988!

  6. starfrost says:

    Yeah, I’m going to look into it. Sorry, I got a bit annoyed. I would prefer to email you about this if that’s okay? Just email me (surely you can view emails?).

    I don’t know how they survived.

  7. Random says:

    I’m glad it’s not just me! I’ve been trying to chase down errors building things with the creeping feeling that I was going to need to do some sort of unix2dos magic on the entire source tree.

    Not. Fun.

    What you wrote confirmed my fears. I am going to do a hail mary and install visual studio on a VM and see if I can import the git tree with windows line endings (I vaguely remember that being an option but I’m not sure and could easily be wrong).

    Of course, based on what you’re saying there’s deeper problems than that…

    I’ve been able to build (but not test) a lot of the utilities; I’m not sure if emm386.sys is going to run or not, for instance; but it builds. Command.com builds and runs. So far the problems I’m having are with the kernel files and fdisk…but I’m not done going through the tree and trying to build things manually, either.

  8. Random says:

    Additional note; I got it to build in dosbox-x on windows 11, playing around with mounting folders as “c” and then editing setenv.bat as needed. I didn’t have a lot of luck building it in native MS-DOS (4, natch) in 86box and haven’t tried much of anything else yet.

  9. felsqualle says:

    Thank you for covering this! I’d love to see a release with sources that work without modifications – a clean ZIP file with proper encoding would be so amazing…

  10. Morty says:

    Cool! Just wondering about MS-DOS 6 as referenced here. This was my first DOS (family PC, xmas 1993!). Does anyone know if the source leak included the source for the newer more advanced utils like memmaker, dblspace, defrag etc? I seem to recall someone mentioning they were not.

  11. Derek says:

    The only loss which should be attributed to git is the loss of per-file timestamps.

    Any ‘corruption’ of the contents is down to the user of git.

    I’ve got various old DOS based source, with CP-437 chars etc, and CRLF line endings in git repos without any issue; also some associated binaries. So if there is bad UTF-8 encoding of the files, that is down to the person driving git.

  12. Michal Necasek says:

    The git defaults are what they are, so the “person driving git” needs to know exactly which files need what treatment. In this case line endings are not an issue I believe, but the garbled files where the source files were (probably) CP 437 encoded are.

  13. Michal Necasek says:

    It did not, and I’m not entirely clear on how much source code Microsoft had, since those were all licensed from 3rd parties.

  14. Michal Necasek says:

    The line endings aren’t an issue, MASM 5.10/MS C 5.1 can deal with those. Non-ASCII characters in the original source files are the problem.

    I was able to build all of it, just copied it over to a VM running PC DOS 2000 and ran the build there. I did not bother properly fixing the SELECT source file.

  15. Jeff Wilcox says:

    Thanks for the feedback. We’re learning a lot here… and some of our other releases such as the 3D Movie Maker we sort of knew would not be buildable without some work, but didn’t even think about this in the hurry to go publish once it was ready.

    We liked GitHub here to make browsing it all so accessible on the web, and there’s definitely some conflict between pure software preservation, what redaction is important or not, and how much time to invest in providing a great buildable experience as-is or making it more of … a project, sorry.

    If anyone finds a straightforward set of fixes, or some scripts or patches, happy to revisit in a few weeks and see what we could do. Feel free to ping jwilcox at microsoft.

    Thanks for the post and the feedback, we’re learning with each project.

  16. Morty says:

    Yes it seems a lot of those advanced utils were licensed: Dblspace from vertisoft, msav from Central point, defrag seemed to me at the time to be a cut-down version of speeddisk but I haven’t seen this confirmed. Memmaker didn’t look like other utils I have seen, but could be as well. However, I don’t know if this means Microsoft didn’t have the source code and couldn’t independently build the software. After all, e.g. Doublespace was deeply integrated. But there could be IP reasons why they can’t release the source even now. On the other hand, if the current MS-DOS 6 out there is a leak, this wouldn’t necessarily apply.

  17. digital archivist says:

    Thank you Mr. Michal Necasek for saying out loud what a lot of us are thinking inside.

  18. r34jinkai says:

    @Morty.
    Leaked DOS6 source is from the v6.21, which has all the 3rd party components removed for audit review as part of the Stac Electronics vs Microsoft lawsuit (defrag, scandisk and dblspace are tied as all of them need to know the internal structures of the on-disk compression format). Also, DEFRAG is licensed from Symantec and shipped as OBJ files so not much to see there.

    Memmaker, Undelete, MSAV and Backup were licensed from Central Point and also shipped as OBJs. And since this is DOS 6.21, not .22, DriveSpace isn’t there yet (As an interesting bit of trivia, leaked NT4 source actually includes an NT FS driver for Doublespace CVFs, but it is opted out and never compiled, hidden in the NTOS tree as source detritus).

  19. Tuomas Tynkkynen says:

    According to HN comments, some of the source was even censored a bit as a hot-fix (original contained a not-so-nice comment about Tim Paterson): https://news.ycombinator.com/item?id=40163766

  20. Derek says:

    The defaults for git (the CLI tools) have always been “store a binary blob” – uninterpreted. See below for CLI proof. Now it may be that some other tool was used to generate the git repo, in which case that tool (or the user driving it) is at fault.

    $ mkdir GG-test
    $ cd GG-test
    $ printf "Hello\xC1\r\nWor\xC2ld\r\n" > test.txt
    $ hexdump -C test.txt
    00000000  48 65 6c 6c 6f c1 0d 0a  57 6f 72 c2 6c 64 0d 0a  |Hello...Wor.ld..|
    00000010
    $ git init .
    Initialised empty Git repository in /home/derek/GG-test/.git/
    $ git add *
    $ git commit -m 'Initial'
    [master (root-commit) bb3afe3] Initial
     1 file changed, 2 insertions(+)
     create mode 100644 test.txt
    $ git show HEAD:test.txt | hexdump -C
    00000000  48 65 6c 6c 6f c1 0d 0a  57 6f 72 c2 6c 64 0d 0a  |Hello...Wor.ld..|
    00000010
    $ rm test.txt
    $ ls
    $ git reset --hard HEAD
    HEAD is now at bb3afe3 Initial
    $ ls
    test.txt
    $ hexdump -C test.txt
    00000000  48 65 6c 6c 6f c1 0d 0a  57 6f 72 c2 6c 64 0d 0a  |Hello...Wor.ld..|
    00000010

  21. ecm says:

    The line endings actually were problematic for nosrvbld.exe and for exe2bin which has its stdin redirected to supply a default answer to a prompt for a relocation segment (done using int 21h service 0Ah).

  22. Michal Necasek says:

    I haven’t noticed those problems (probably wasn’t looking hard enough) but… yes, CRLF line endings are the safe option. Many DOS-based tools can work with UNIX line endings, but not all. I’m sure the original files all used CRLF.

  23. Michal Necasek says:

    Memmaker was from Helix (Netroom).

  24. OBattler says:

    1. To respond to a comment above, no, the leaked MS-DOS 6.0 source code is not from 6.21, but from 6.0 beta build 0204. There is even a compiled COMMAND.COM in it with that build number.

    2. There is a version confusion here. MS-DOS 4.00 from October 1988 is not PC DOS 4.00 from June 1988. It’s actually based on PC DOS 4.01 but for some reason, Microsoft decided to reset the version number back to 4.00 and release it as that before releasing their own MS-DOS 4.01.

  25. willem says:

    you can preserve the timestamps in git, by setting ‘GIT_COMMITTER_DATE’ before calling ‘git commit’.

  26. Thalia Archibald says:

    > Note that this is far better than the CHM “preservation”.

    FYI the release via the Computer History Museum is a zip, which has original modification times and DOS line endings. Compared to the versions in the repo, it’s got the best metadata.

  27. Morty says:

    Thanks for the info! Interesting that almost all the tools that had anything resembling a UI or an ‘interactive’ element were licensed 😉 Why couldn’t MS make advanced utils themselves? I also suspected the undelete etc. was licensed so nice to see that confirmed.

    BTW, has the source from any of those famous utilities (speeddisk, qemm, helix, stacker etc.) ever come out (either as leaks, or voluntarily)? I would love to see the source and there can’t be that many legal concerns from those companies in releasing it. The problem is maybe more practical – where is the source even stored and who has the rights now?

  28. Michal Necasek says:

    Yes, the problem was that the CHM mixed up several different file sets with absolutely no explanation of what they’d done.

  29. Michal Necasek says:

    The leaked source code is actually MS-DOS 6.0. I don’t know why everyone thinks that the included COMMAND.COM binary corresponds to the source code.

    Microsoft didn’t exactly reset the version number… because the fixed PC DOS 4.01 from August 1988 also showed its version as 4.00 (but the floppy labels said 4.01). So all Microsoft did was nothing. Which of course brings the question why IBM kept the displayed version as 4.00… and I don’t know the answer.

  30. Michal Necasek says:

    Sure. Or you could put the files into a SQL database. All much more clumsy and cumbersome than a simple file archive.

  31. Michal Necasek says:

    MS could and did make their own UIs (QBASIC, DOS Shell). They licensed the utilities from 3rd parties because it was cheaper and especially faster than developing their own. DRI and IBM did the same thing, and Microsoft could not afford to stay behind in terms of features.

    The only such tool that was released in source form that I know of is 386MAX.

  32. Morty says:

    Makes sense! I think I have to make a day out of installing MS-DOS 6 and then all those cool built-in and third party utils in a VM that I played with as a kid. Pure nostalgia 🙂

  33. I was easily able to get this thing up and building in no time, thankfully zip/unzip do a pretty good job of sorting out most of the crlf fun!

    I put a 7z on internet archive as I wanted to cross this from windows 10:

    https://archive.org/details/build-dos4-win32_try4

    of course the real issue is why doesn’t it boot from many hard disks? it triple faults vmware! I tried formatting a disk from dos6 & manually copying io.sys/msdos.sys/command.com and that boots from floppy. I tried the same from hard disk, and it hangs.. so it’s not the bootsector. I guess it’s in how io.sys loads the rest of itself?!

    I have no idea how to debug anything at boot time. 🙁

  34. Michal Necasek says:

    I recall that IBM DOS 4.0 (and it must apply to MS-DOS 4.0 as well) calls INT 13h with extremely small stack somewhere in its loader. Older and newer DOS versions don’t have this problem. Maybe that’s what you’re running into.

    See ‘Mystacks’ in msload.asm. I’m 98% sure that’s the problem.

  35. I have to dig but you’d said it was something about the BIOS stack being greater than 100 bytes which is what trashed many machines.

    I’ll have to start modifying magic numbers I guess, at least I can build it outside of DOS so it’s not so painful

  36. See ‘Mystacks’ in msload.asm. I’m 98% sure that’s the problem.

    it was!

    double it to 128, and I booted from the HD in Qemu! going to test all the things!

  37. Michal Necasek says:

    Good to know! Honestly I don’t know what they were thinking, 64 bytes is a crazy small stack. It probably didn’t cause more problems because it happens very early in the boot, before any drivers are loaded, so only ROMs are involved.

    You could also have used VirtualBox… which has no trouble booting unmodified PC/MS DOS 4.0 from hard disk.

  38. I haven’t tried it on my PS/2 yet, but dos4 didn’t work on my model 80 with a spock scsi card. I’m assuming it’s been stopping 4.00 on so many things. I tried vmware and it went from totally crashing vmware, to now being able to boot from disk, partition, format, and boot from HD. Kind of funny to think that MS-DOS 4.00 was a poison pill to vmware!

    I’ll have to try eltoredio?! boot cd-roms, usb etc etc to see if it works on more machines.

    select is like the worst part about MS-DOS 4, so I’m not too bummed out by not having it working in any sane manner. The more interesting stuff is the family API! and some early doscalls.h … so many things!

  39. raijinkai says:

    @JasonStevens
    Also don’t forget this is the only DOS version which includes the redirector as a separate component. IFSFUNC is full of interesting and historical stuff which connects with what is available in the DOS6 source. It’s a shame it ended up being too memory consuming at the time.

  40. felsqualle says:

    Someone on the freedos-devel mailing list came up with some unix2dos and sed magic that fixes the bits that were malformed when importing the code to git:

    https://sourceforge.net/p/freedos/mailman/message/58765259/

    So far I can confirm it works, I wasn’t able to fully test SELECT though.

  41. Howard says:

    See my attempt at restoring the CP-437 line drawing characters for SELECT and in the code comments:

    https://github.com/hharte/MS-DOS/commits/dos-4.00/

    There is one byte difference in the resulting SELECT.DAT that I have not tracked down yet. Everything else is a match for the official MS-DOS 4.00 release distribution.

  42. ForOldHack says:

    How not to release DOS ( DOS-4.0 ), How not to release DOS Source (DOS 4.0). For the most part, it is just embarrassing. How to do it? 1. Fix your persistent bugs. 2. Index and cross reference everything. Clean up your language, and add some contextual documentation. Look at how Netscape did it. The public release was great. They took the time to finish loose ends, make it seemingly well managed: Anyone who loves law, sausages and software should never ever have to witness them being made. Now for the nitty gritty: Pg 4: “Ideally, the stronger condition, ‘compatibility with 286 mode!’ should be met. XENIX for example runs on the 286 only in 286 mode.”
    Pg 2: “Every programmer doing work in assembly language should obtain and study the iAPX 286 Programmer’s Reference Manual.”

  43. ecm says:

    @felsqualle

    That’s me. The commands don’t really “fix” things, especially not select. They just make it so the build can finish. The sed avoids the too long assembly source lines. The unix2dos fixes the files that come out of git with LF line endings (at least on Linux). When I made these adjustments I didn’t even know how broken select is. Probably someone will have to identicalise select using a known good build to restore the correct text.

  44. I have Apricot DOS 4, copies of IBM DOS 4, and some other OEM’s and none of them boot on my PS/2, I’m sure it’s the Spock SCSI card. Anyways I verified that yes with the msload fix, it runs fine!

    I even rebuilt DOS on the PS/2, it took 70 minutes on the 16Mhz 80386! Nothing like self hosting!

  45. ForOldHack says:

    The best historical context for this is to compare HOW the different co-operative multi-tasking environments work. i.e. compare Windows 1.0, to Switcher to DOS shell 5, but we only have Switcher to compare to other open source projects.

    Thank you to the archivists at Lotus and Microsoft for the MIT license:

    “Permission is hereby granted, free of charge, to any person obtaining a copy
    of this software and associated documentation files (the “Software”), to deal
    in the Software without restriction, including without limitation the rights
    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
    copies of the Software, and to permit persons to whom the Software is
    furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all
    copies or substantial portions of the Software.”

    We can sublicense and sell it.

  46. John Elliott says:

    At some point I’m going to have to rewrite the OEM ID check in MSINIT.ASM so it doesn’t ignore the BPB if the OEM ID doesn’t start “MSDOS”, “IBM␠” or “OS2␠” (let alone the buggy version number check which doesn’t work if the last character but one isn’t a dot). JDeBP’s suggestion of rejecting only known-bad OEM IDs (ie, “IBM␠␠3.0”) sounds like a much better bet.

    (If you want to experiment with this, use DRDOS 7 FDISK to format a 126Mb partition – it gives it 4k clusters and an OEM ID of “DRDOS␠␠7”, but MSDOS, including 4.00, tries to mount it with 2k clusters because it doesn’t recognise the OEM ID. The same happens if the OEM ID happens to be “MSWIN4.1”.)

  47. Josh Rodd says:

    Jason,

    Your passion for operating old compilers and even older linkers is something to behold!

    I’m trying to remember how we booted PC DOS 4.00 (not 4.01) back in the day on a PS/2 Model 65 SX. These all had a Tribble (non-cached SCSI adapter) standard. PC DOS 4.00 definitely booted from the hard disk.

    Does anyone know why the original IO.SYS/IBMBIO.COM used such a tiny 64 byte stack?

  48. Michal Necasek says:

    Thanks! Quite a few files affected. All in comments except for SELECT.

  49. Michal Necasek says:

    I was wondering about that too — IBM DOS 4.0 *had* to work on contemporary PS/2 machines.

    There’s nothing in the source comments that indicates why such a tiny stack was used. Even stranger, the comment in routine Setup_stack says “Move the stack to just under the boot record and relocation area (0:7C00h)” but that’s not what the code is doing. So the code was clearly changed, but no hint as to why.

    I cannot find any stack size guidelines in the PC/AT Tech Ref., or even in the PS/2 and PC BIOS Tech Ref. All I found is that for the POST, the stack size is 256 bytes on the PC/AT (300-3FF).

    My guess is that approximately this happened: For some reason, the stack below 7C00 could not be used, so they put a stack in the loader itself. They wanted it to be as small as reasonably possible, and 64 bytes happened to work reliably on the machines available at the time. Later on, when DOS 4.0 was no longer relevant, disk BIOSes started needing more than 64 bytes, but OEMs didn’t care because DOS 5/6 did not have this problem.

  50. LightElf says:

    It would be nice to see an official release of the OS/2 Warp source code 🙂 It can be done with collaboration between Microsoft and IBM.
