Retro-Porting to OS/2 1.0

A few weeks ago I embarked on a somewhat crazy side project: Make the Open Watcom debugger work on OS/2 1.0. This project was not entirely successful, but I learned a couple of things along the way.

The Open Watcom debugger components for OS/2 are 32-bit modules, but that was not always the case. At least up to and including Watcom C/C++ 11.0, the OS/2 debugger components were almost entirely 16-bit. That allowed them to work not only on 32-bit OS/2 but also on 16-bit OS/2 1.x.

Needless to say, debugging 32-bit programs with a 16-bit debugger required a bit of extra work and there was an additional interface module (os2v2hlp.exe) which called the 32-bit DosDebug API on 32-bit OS/2 and interfaced with the 16-bit debugger ‘trap file’ std32.dll.

Building the 16-bit components again was not particularly difficult. And they even mostly worked on OS/2 1.2 and 1.1. But I wanted to get the code working on OS/2 1.0 as well. Because why not.

Running the Open Watcom debugger on OS/2 1.1

The OS/2 debug trap file has code in it that’s supposed to refuse running unless the host is at least OS/2 1.2. It looks like this:

    DosGetVersion( &os2ver );
    if( os2ver < 0x114 ) {
        strcpy( err, TRP_OS2_not_1p2 );
        return( ver );
    }

The OS/2 major version is in the high byte and minor version in the low byte. This was clearly meant to check for versions below 1.20. But… the major version returned by OS/2 1.x is actually 10, not 1 — this was done so that the OS/2 major version would be higher than the DOS versions of the time (3.x, 4.0). So the version check actually never worked because the reported OS/2 version can’t be lower than 0x114! As a result, the debug trap file didn’t refuse to load on OS/2 1.0 when it should have. But it certainly did not work.
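
For reference, a check that matches the actually reported numbers would presumably look something like this (my own sketch, not the actual Open Watcom fix; 0x0A14 encodes major version 10, minor version 20, which is what OS/2 1.2 really returns):

    DosGetVersion( &os2ver );
    if( os2ver < 0x0A14 ) {     /* anything below OS/2 1.20 as actually reported */
        strcpy( err, TRP_OS2_not_1p2 );
        return( ver );
    }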

How does one debug a debugger? Either with a different, functioning debugger, or the old-fashioned way, with debug print statements and such. I went the second route, and given the difficulties I ran into, I’m not at all sure the first option is even workable on OS/2 1.0.

Before going into details, I will say that this kind of project may now be easier than ever before. Certainly much easier than it would have been 20 or so years ago. The reason is that antique programming documentation from IBM and Microsoft is now much easier to find than it used to be.

And for reasons that no one ever really understood, Microsoft’s and IBM’s programming documentation for OS/2 1.x was completely different, and there was always some information that could be found in one documentation set but not the other.

The first problem I ran into was that the trap file was crashing OS/2 1.0 hard. The cause turned out to be calls to the DosPTrace API executed before a program was loaded. OS/2 1.0 does not appear to detect this situation and dies. Newer OS/2 versions just return an error and don’t do anything harmful. Easy to fix.
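
The shape of the fix is simple enough: never issue DosPTrace before a debuggee exists. The following is purely my own illustration (the guard variable and the returned error code are assumptions, and the real trap file code is organized differently):

    #define INCL_DOS
    #define INCL_DOSERRORS
    #include <os2.h>

    static PID  debuggeePid;    /* 0 until DosStartSession has created the debuggee */

    /* Wrap DosPTrace so it is never called before a program is loaded.
     * On OS/2 1.1 and later such a call merely fails; on 1.0 it takes the
     * whole system down, so refuse to make it at all. */
    USHORT SafePTrace( PVOID ptraceBuf )
    {
        if( debuggeePid == 0 )
            return( ERROR_INVALID_PROCID );     /* roughly what newer versions return */
        return( DosPTrace( ptraceBuf ) );
    }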

The second problem I identified was that the program to be debugged could not be started. This was because the DosStartSession API (which a debugger must use) takes a parameter structure, but said structure must be smaller on OS/2 1.0, and the API fails when handed the newer, larger version of the structure. Again, easy enough to fix.
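
A sketch of the idea (again my own illustration, not the Open Watcom code): only hand OS/2 1.0 the part of STARTDATA it knows about. The 24-byte figure (everything up to and including the TermQ field) and the assumption that OS/2 1.0 reports its version as 10.0 are mine; the old toolkit headers should be consulted for the exact numbers.

    #define INCL_DOSSESMGR
    #include <os2.h>

    /* Start the debuggee session, trimming STARTDATA for OS/2 1.0. */
    USHORT StartDebuggee( STARTDATA *sd, USHORT *sessId, PID *pid, USHORT os2ver )
    {
        if( os2ver < 0x0A0A )               /* OS/2 1.0 reports 10.0 */
            sd->Length = 24;                /* old, short structure ending at TermQ */
        else
            sd->Length = sizeof( *sd );     /* full structure on 1.1 and later */
        return( DosStartSession( sd, sessId, pid ) );
    }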

Another problem had to do with some rather interesting logic that the Watcom debugger uses. It injects a small piece of code into the debugged process and runs it. The code does not do much; it only invokes the DosLoadModule API to load a copy of the trap file. The debugger can later use this to perform various magic tricks, like writing to the debugged program’s console or redirecting files within the debugged process.

The code, quite sensibly, queried the full path of the trap DLL and used that to load a copy of itself in the debugged process. This failed on OS/2 1.0. After briefly wondering why, I reviewed the documentation and found that in OS/2 1.0, DosLoadModule does not accept a full path.

The only thing DosLoadModule can do on OS/2 1.0 is take a name of up to 8 characters as input. The .DLL extension is tacked onto this name and the operating system searches for the file along LIBPATH. This is obviously quite restrictive, but it’s all OS/2 1.0 can do.
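
In practice the injected code can simply branch on the OS version, along these lines (my sketch only; the hard-coded "STD16" module name and the exact version check are assumptions for illustration):

    #define INCL_DOSMODULEMGR
    #include <os2.h>

    /* Load the trap DLL into the debuggee.  On OS/2 1.1 and later a full
     * path works; on 1.0 only a bare module name does, which the system
     * resolves along LIBPATH with ".DLL" appended. */
    USHORT LoadTrapDll( PSZ fullPath, HMODULE *phmod, USHORT os2ver )
    {
        CHAR    failName[16];   /* receives the name of a missing module, if any */

        if( os2ver >= 0x0A0A )  /* OS/2 1.1 (reported as 10.10) and later */
            return( DosLoadModule( failName, sizeof( failName ), fullPath, phmod ) );
        /* OS/2 1.0: bare name, at most 8 characters, no path, no extension */
        return( DosLoadModule( failName, sizeof( failName ), (PSZ)"STD16", phmod ) );
    }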

The next problem caused me quite a bit of grief. I used remote debugging (over an emulated serial port) to debug the GUI (really text UI) version of the OS/2 debugger. The debugger stubbornly kept failing to load a support file (cpp.prs). But if I ran the debugger directly (not trying to debug it), there was no problem loading the file!

Remote debugging to OS/2 1.0 works. Mostly.

I thought the problem was perhaps due to case sensitivity, or because OS/2 1.0 didn’t accept some specific combination of DosOpen flags. To narrow down the problem, I used the remote debugger to modify the DosOpen flags and re-run DosOpen repeatedly.

Well, I tried. Whenever I tried modifying the flags in the debugged process, the memory contents changed… to junk. After some head scratching, I found that the memory-writing functionality in the std16.dll trap file had a not very obvious bug in it: it used the size of the wrong structure when calculating internal offsets. This bug had likely been present for a very, very long time.

With the remote debugger improved, I could attack DosOpen again. I could not find any combination of flags that would work.

Eventually I had a flash of inspiration and realized that the problem was something completely different: When the debugged program started, it was running in the wrong directory! And that’s why it couldn’t find a file that was expected to be in the current directory. I believe the problem is that in OS/2 1.0, a new session always inherits open files etc. from the shell; in newer versions, the debugger can (and does) request that the new session inherits from the debugger instead. This is another area where OS/2 1.0 is different and more limited than the later versions.
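
On OS/2 1.1 and later the cure is a single extra field in the STARTDATA sketch shown earlier; the constant name below is my recollection of the later toolkits, and on OS/2 1.0 the field simply does not exist:

    sd->InheritOpt = SSF_INHERTOPT_PARENT;  /* inherit from the debugger, not the shell */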

In the end, I managed to get the remote debugging somewhat working on OS/2 1.0… but not particularly well. I can load a simple hello world program, run it, set breakpoints, step through it, inspect and modify its memory. I also verified that the debugger can properly catch #GP faults in a simple program.

As an aside, the debug interface in OS/2 1.0 has odd limitations. For example, a debugger can intercept general protection faults, but it cannot intercept divide overflow faults. An integer divide by zero always terminates the program and the debugger can’t do anything about it.

But my attempts to debug a more complex program (the GUI debugger) miserably failed. The debugger, when executed directly (not in a debugger), always crashed—no doubt because of yet another subtle difference between OS/2 1.0 and later versions.

But when trying to run the debugger under remote debugging, all I could achieve was crashing OS/2. I could still load the program, set breakpoints, and step through it, but when it crashed, it took the whole OS with it.

All in all, it’s apparent that OS/2 1.0 was a work in progress. OS/2 1.1 was notably improved, clearly building on users’ experience with version 1.0. But it was only OS/2 1.2 that finally included all the major features OS/2 1.x was supposed to have, including installable file systems (IFSs) and an improved Presentation Manager. It’s no surprise that OS/2 1.0 was rough around the edges.


10 Responses to Retro-Porting to OS/2 1.0

  1. random lurker says:

    “It injects a small piece of code into the debugger process and runs it.”

    Did you mean to say it injects code into the *debugged* process? (I don’t understand why the debugger would need to inject code into itself)

    thx

  2. Michal Necasek says:

    Yes, that was a typo, fixed now. Thanks.

  3. Bill Olson says:

“And for reasons that no one ever really understood, Microsoft’s and IBM’s programming documentation for OS/2 1.x was completely different, and there was always some information that could be found in one documentation set but not the other.”

    They had very different ideas of what they wanted OS/2 to be therefore the very different documentation. Also, from what I understand, they each set about creating documentation separately in the beginning and then realized all of the problems they would run into if the documentation wasn’t the same.

Microsoft gets WAY too much credit for their work on OS/2. If they had been the star programmers for the OS, Windows 95 and the versions of Windows after it would have been far better, and they absolutely weren’t.

    OS/2 was one of only two OSs that I ever LOVED using. There was OS/2 and there was BeOS.

The version that I loved was the OS/2 2.0 beta that I saw at a presentation of the new OS/2 at IBM’s building in downtown Seattle. I was blown away by what it could do. It was literally a much better DOS than DOS and a much better Windows than Windows. But it was also its own OS, and I was able to go from using two DOS and two Windows computers to just one OS/2 computer, and was probably 400% more productive than I had been before with DOS and Windows.

While DOS and Windows were not dependable, OS/2 2.0, even in the beta version, was leagues better than either of them, with uptimes measured in months when DOS and Windows could barely make it a couple of weeks… if that.

I use Mac OS now. It’s >> significantly << […] >> could << have gone and been…

  4. This is the kind of content I come here for regularly.
    I really admire your persistence in debugging such issues and reviving old software in a way no one previously imagined!

  5. r34jinkai says:

“If they had been the star programmers for the OS, Windows 95 and the versions of Windows after it would have been far better, and they absolutely weren’t.”

    They never needed to be the best. They only needed to fulfill the requirements and needs of their customers. Quoting the Silicon Valley SteveJ vs. BillG scene: “SteveJ: We are better… our product is better. BillG: You don’t get it. That doesn’t MATTER” (while showing a Taiwanese PC with Windows 1.02). With Windows 95:
    – They offered a desktop OS which had lower memory requirements than OS/2, but worked well enough on the hardware of the time, while offering a revamped interface.
    – Technologies like Dragon (the layered storage stack), the VMM, and leveraging part of the work already done on DOS allowed such a small memory footprint.
    – Windows 95 worked even on the worst Taiwanese PCChips mobo trash. Sure, you would not get the most stable experience that way, but it allowed people in poorer countries to have a relatively working, modern OS experience with the resources they had.

    As for IBM:
    – Many software add-ons, which IBM and others charged money for, were offered built-in and free with the OS.
    – Their SDKs, DDKs, compilers and programming tools certainly cost money back then (no “community” editions), but they were certainly cheaper than IBM’s (it has been mentioned here how expensive the OS/2 programming kits were) and the UNIX vendors’ offerings. They also offered free extras on their BBS first, and later on their FTP site.
    – In the last stage… they literally threw away all the work done in the Taligent venture. OS/2 for PPC, which would have been literally IBM’s own “NT”, is no more than a curious note in OS tech history. OS/2 4.0, Merlin and Aurora for x86 kept using the assembly-crafted kernel based on ancient CP/DOS technology. They have never opened the sources of the OS/2 PPC kernel.


    I don’t know, but the failure of OS/2 rests solely on IBM. MS just played their cards better and managed the risks. At some point they even tolerated piracy to a certain extent, with the premise of putting Windows on every PC in the world and forcing programmers to offer their software for Windows as the first-class citizen. They catered to the programmer community with more accessible kits. And IBM… was just being IBM. It doesn’t matter that OS/2 was an excellent OS at some point in time. What mattered was IBM understanding the trend and adapting to the market. They chose not to. It’s ironic, as they are doing the same thing again… but this time with Red Hat.

  6. Victor Khimenko says:

    No, what’s really ironic is not that IBM is making the same mistakes again, but the fact that Microsoft, somehow, managed to forget everything they did to win the desktop and repeated all the mistakes that IBM made with OS/2 when they tried to conquer smartphones.

    They had a decent market share with the “DOS of mobile phones”, Windows CE.
    But then they broke compatibility with Windows Phone 7, again with Windows Phone 8, and yet again with Windows 10 Phone. Like IBM did with OS/2 1.x and then OS/2 2.x.
    Instead of supporting existing developers’ tools, they pushed people to use new ones. Like IBM did.
    Instead of releasing crap which would work on everything, they released an OS which was very picky about what it would support.
    And the most amusing part: they did all that under the guidance of a guy who saw and participated, first hand, in the desktop wars and observed, personally, how they were won!

    But even that is not the cherry on top. The cherry on top is the behavior of Google: a few years after they did everything RIGHT (released a crappy product ASAP, then updated it over time, added support for all the legacy technologies they could support, brought that same crappy OS to the desktop to ensure ChromeOS would have the same apps), they managed to turn around and repeat the same mistakes IBM did, AGAIN, this time with Fuchsia. Thankfully, unlike Microsoft, they haven’t killed their legacy OSes, and thus the end result was just a DOA Fuchsia, but still…

    Bill Gates was, most decisively, right, but apparently it’s incredibly hard to make software developers develop crap when the market demands it (which is almost always, because the market prefers a crappy product NOW over a great product TOMORROW approximately 10 times out of 10).

  7. Michal Necasek says:

    Don’t forget the Itanium… after x86 more or less killed all RISC competitors, Intel thought that they were special and could make a RISC CPU better than everyone else could. Boy were they wrong.

  8. Richard Wells says:

    Itanium was not a RISC CPU; it was about as close to the exact opposite of a RISC design as possible. Lots of complicated instructions at a variety of widths, with many requirements on how they can be ordered, does not match the underlying simplicity of RISC. Intel’s plan was to move all the work to the compiler writers, but turning simple code into the correct complicated instructions was a very difficult problem, especially before there was a working CPU. Couple that with Itanium being incredibly effective at solving only a limited set of problems that few cared about, and the result was a chip with limited appeal.

    Intel did have an effective RISC design in house while the Itanium was being developed: the Alpha, which was purchased from Compaq after DEC became part of Compaq. The final Alpha designs were competitive with the released Merced.

    Good enough and cheap and available tended to be the winning strategy. Companies mess with that at their peril.

  9. random lurker says:

    Of course, the great irony is that the unintentional RISC design* ended up beating the pants off of everyone else (except ARM in the low power space).

    * That being modern x64 CPUs which are internally essentially just RISC+SIMD.

  10. Michal Necasek says:

    NexGen and AMD K5 CPUs were already RISC cores with x86 front end. That was around 1995.
