Tracking Down a Bug

When I first encountered the Unix vi editor many years ago, I recoiled in horror. It was nothing like the editors I was used to—Borland IDEs, DOS 5.0/6.x EDIT, or OS/2 and Windows editors. But over the years, I learned to use vi.

It turns out that on many exotic *nix systems, vi is the only game in town. Sure, there might be some way to install pico or nano or whatever, but even if that is possible, it might be very difficult to get there without editing a couple of files… with vi. And when the system in question is some ancient XENIX or PC/IX or Microport UNIX or whatever, the effort of installing some other editor vastly outweighs the effort of learning vi—and that’s if the system in question even comes with a compiler.

Some time later I started using the vi editor that comes with the (Open) Watcom compilers. It’s a decent vi clone, not great but workable. It has the nice property that there are console versions for DOS, OS/2 (both 32-bit and 16-bit), and NT available (and *nix, too). The DOS version that ships with the compiler is a 32-bit DOS-extended version, but it’s also possible to build a 16-bit version. Which will run on a 286 or even an 8086, and that’s occasionally useful.

Open Watcom vi/286, a 16-bit DOS version with EMS/XMS support
Open Watcom vi/os2, a 16-bit OS/2 version running in OS/2 1.1

I’ve been using the 16-bit DOS version of Open Watcom vi for some time, and it works well… except sometimes it hangs the system. But only sometimes and not very consistently. When it does hang, it usually happens when quitting the editor, but rarely also when starting it up.

I’ve never been able to track down the problem because it doesn’t happen frequently enough and when it does, all I can see that the system ends up in a corrupted state. Until yesterday.

Somehow I ended up in a situation where running vi immediately after starting a certain DOS VM would trigger the hang, maybe 80% of the time. I decided to pounce. Once again, I couldn’t tell much from the hung state. So I decided to roll out the big gun. I ran the VirtualBox VM in a debug build with software instruction emulation and partial instruction logging—that is, logging every instruction, partially decoded, without logging the register state. This is a compromise that is already slow and produces huge log files, but not as horribly slow verbose as full register state logging. Since the problem was runaway code, I hoped the log would tell me something.

The log looks about like this:

IEMExecLots: cs:rip=c9d8:000016bf ss:rsp=8e66:0000054a EFL=027246
decode - c9d8:00000000000016bf mov rAX,Ov [#2761123]
IEMExecLots: cs:rip=c9d8:000016c2 ss:rsp=8e66:0000054a EFL=027246
decode - c9d8:00000000000016c2 add rAX,Iz [#2761124]
IEMExecLots: cs:rip=c9d8:000016c5 ss:rsp=8e66:0000054a EFL=027206
decode - c9d8:00000000000016c5 cmp Gv,Ev [#2761125]
IEMExecLots: cs:rip=c9d8:000016c8 ss:rsp=8e66:0000054a EFL=027246
decode - c9d8:00000000000016c8 jnc/jnb Jb [#2761126]
IEMExecLots: cs:rip=c9d8:000016bf ss:rsp=8e66:0000054a EFL=027246
decode - c9d8:00000000000016bf mov rAX,Ov [#2761127]
IEMExecLots: cs:rip=c9d8:000016c2 ss:rsp=8e66:0000054a EFL=027246
decode - c9d8:00000000000016c2 add rAX,Iz [#2761128]

The first thing I noticed was that when things were going wrong, the VM was executing single-stepping breakpoints. That was certainly unexpected, but quickly explained by a POPF instruction popping junk off the stack.

Now I knew where things had definitely gone off the rails. The challenge was finding where it started. Not entirely trivial in a log file about 1.5 GB in size with hundreds of thousands of lines.

There were multiple red herrings. Executing code with CS=FFFFh turned out to be normal, just DOS in HMA. Similarly a stack with SS=00F5h looked suspicious but was just DOS doing its thing.

After more poking around, I narrowed down the region where things probably started going wrong. Searching through the log I noticed a curious thing: There was a code location which executed a RETN instruction a couple of times, but then that RETN suddenly turned into a JMP and things started going wrong.

In other words, there was a routine which finished with a RETN instruction, but somehow that instruction changed into a JMP and then nothing much made sense anymore. Self-modifying code was a possibility, but the jump did look like it ended up in an implausible location.

I knew that I couldn’t find from the log file where the RETN instruction was overwritten. But I could just put a write breakpoint on that memory location and re-run the testcase. That’s the beauty of deterministic environments that don’t use ASLR—things are in the same place every time. The breakpoint was quickly hit, and things started looking more promising. Interestingly, the instruction doing the overwriting was using a SS: segment override; that was significant.

The reason for hope was that the code segment of the overwriting instruction made me (as it turned out, correctly) suspect that it was code in the vi editor itself doing the nasty work. The next challenge was to find what the code was. With a map file in hand, I was able to narrow down the location to two or three object files. Running those through the wdis disassembler quickly allowed me to determine that the code doing the overwriting was in a module called int.obj. Then it was easy to look at the corresponding source code.

The code installs a timer interrupt handler (called via INT 1Ch from the BIOS timer tick hardware interrupt routine) which is supposed to update the data for a clock displayed in the upper right corner of the editor. I could immediately see the problem: One of the routines was assuming that SS==DS and using SS: segment overrides to access the data segment (DGROUP).

That is a big no-no in hardware interrupt handlers. And of course it explained the observed behavior. Almost all the time, when the timer interrupt was handled, vi was running on its own stack with SS==DS. But sometimes it wasn’t, and then things went wrong, and some innocent memory got corrupted. But the corruption might not affect vi itself and only became apparent after quitting the program. Or it might not become apparent at all.

The problem was easily solved by making sure the int.obj module gets compiled with the -zu flag. That tells the compiler that SS!=DS and the compiler will know it can’t use SS: segment overrides to access DGROUP. And the saddest part? The -zu flag used to be there, but then someone completely reorganized the vi project makefiles, and the flag got lost. The perils of open source development.

This bug was very much in the nasty category. It was only triggered relatively rarely. And even when it was triggered, it corrupted some not entirely predictable memory location which might never be used. Even when it was used, it might be used long, long after the actual corruption occurred. In such cases, an element of luck is often involved—coming up with a reproduction scenario which makes the bug more or less reliably occur.

Even then it’s not smooth sailing, because when the detective arrives at the murder scene, the culprit had long left the room. It requires careful analysis and tracking the footprints of the bug to its origin. That can also be quite difficult and may require various tools to accomplish. In this case, I was able to do that and the fix turned to be trivial. Which is also not unusual—in many cases, finding the bug is orders of magnitude more difficult than fixing the bug.

This entry was posted in Debugging, Development, DOS, Watcom. Bookmark the permalink.

14 Responses to Tracking Down a Bug

  1. zeurkous says:

    And then discovering that the fix introduced another bug…
    (Perhaps not here, but all too often.)

    Even though me was zeroth exposed to mess-dos — and had gotten used to
    EDIT –, by the time me introduced meself to vi(1), me just accepted it
    as a diff way of working; me certainly did not recoil in horror.

    Leaves it to say vi has its problems — mainly unfortunate early choices
    that are now set in stone due to people’s muscle memory –, yet these
    days me does recoil when faced with… captive editors like EDIT.

    What would inspecteur Clouseau make of all this, eh?

  2. Yuhong Bao says:

    Trivia: I believe using SS to access DOSDATA is common practice in DOS/

  3. Stu says:

    Reminds me of a bug I found (in my own code) recently… Occasionally, the program would crash on start-up. While I was doing some unrelated work, “occasionally” turned into “always” and I was able to investigate.

    Turned out, the code was trying to dereference a pointer that had been overwritten in memory. After much examination of the pointer value and surrounding memory, I finally realised the significance of every-other-byte of the corruption being the value “07h”… Turned out some code that was converting a string for display (using the standard PC text mode; 07h meaning the attributes for grey-on-black, the default colour scheme) had a bug and was “converting” twice as much data as it should. Clearly this had previously just happened to hit nothing important in memory (usually) and the unrelated changes I made affected the memory layout making it hit something more critical…

  4. Michal Necasek says:

    I will say that vi is a lot easier to use on a terminal that properly maps cursor keys and such. The most basic vi without a manual is pretty much impossible to use.

  5. zeurkous says:

    Yeah, and the manuals tend not to be of very high quality.

  6. Richard Wells says:

    A very short primer was all I needed to get enough information on vi to do all my programming. Fortunately, no one at the site had modified vi’s behavior in any way. Doing programming where any bug could cost the client millions was nerve-wracking enough; I certainly didn’t want the editor to add to the problems.

    The bug itself shows the importance of having the safest defaults for settings. Skip a step and the preferable result is slightly slower code, not hidden data corruption.

  7. MrMartin says:

    Very impressive! And nice description. Bug hunting (of course especially indeterministic bugs, race conditions, “one-offs” etc.) is hard – you truly require the mind of a detective. What are the things most valuable to investigate now, what to include in logs, how to even log, what data can you even trust, how to make sure you don’t allow wrong assumptions in etc. Sometimes it’s about ruling out every possible avenue, so there’s only the “impossible” left!

  8. MrMartin says:

    BTW, it’s also scary that this bug could not only lead to crashes, but possibly also changes to the files being edited 🙂

  9. Michal Necasek says:

    It’s theoretically possible but I think quite unlikely. If the editor’s DGROUP is loaded at a lower address and the memory used to store file contents is above that in RAM, the file storage should be quite safe. The editor code might be corrupted if the stack is stored further below it, or something else will be corrupted if the stack is located above the memory the editor is using. But I think the file memory is pretty safe in practice.

  10. Michal Necasek says:

    I know from experience that non-deterministic bugs can be scarily on/off — on 999 machines it practically doesn’t happen ever, and then the next one can reproduce it 80% of the time because everything is just right.

  11. MiaM says:

    Interesting, nice find!

    I would say that this kind of bug would likely almost only happen in an environment like DOS. No multi tasking and whatnot makes it possible for the bug to rarely show up, but the DOS environment is still both complicated enough and also simple enough that this could happen at app. By complicated enough I refer to things like a somewhat dynamic memory map where DOS, drivers and TSRs take up different amounts of memory on different installations, but still simple enough to not have some standardized API for some sort of callback to run timer interrupt code.

    If this was on a simpler system like a Commodore 64, there wouldn’t be any need to keep track of varying positions and thus data used by interrupt handlers can just use fixed memory addresses. On the other hand, if it was on a more complicated system but still without memory protection, like AmigaOS, the bug would be triggered every time the application runs as even without any other actual “program” started there are still a few tasks/threads running, at least one task/thread per mounted file system and at minimum whichever shell started the application would also be running. (This ignores the fact that you wouldn’t run an interrupt handler in this case, you would ask the OS to send you a signal/message at the right interval and use that to update a clock display. An actual hardware interrupt handler would be reasonable for something that handles special hardware though, mostly a thing for device drivers but could be used directly for applications that run some application specific hardware).

  12. Roy says:

    lol they closed issue without doing anything. https://github.com/open-watcom/open-watcom-v2/issues/1075

  13. Chris M. says:

    File a pull request fixing the compiler flag.

    There seems to be a TON of refactoring going on with that code. I wouldn’t be surprised if more stuff is broken.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.