When I first encountered the Unix
vi editor many years ago, I recoiled in horror. It was nothing like the editors I was used to—Borland IDEs, DOS 5.0/6.x EDIT, or OS/2 and Windows editors. But over the years, I learned to use
It turns out that on many exotic *nix systems,
vi is the only game in town. Sure, there might be some way to install
nano or whatever, but even if that is possible, it might be very difficult to get there without editing a couple of files… with
vi. And when the system in question is some ancient XENIX or PC/IX or Microport UNIX or whatever, the effort of installing some other editor vastly outweighs the effort of learning
vi—and that’s if the system in question even comes with a compiler.
Some time later I started using the
vi editor that comes with the (Open) Watcom compilers. It’s a decent
vi clone, not great but workable. It has the nice property that there are console versions for DOS, OS/2 (both 32-bit and 16-bit), and NT available (and *nix, too). The DOS version that ships with the compiler is a 32-bit DOS-extended version, but it’s also possible to build a 16-bit version. Which will run on a 286 or even an 8086, and that’s occasionally useful.
I’ve been using the 16-bit DOS version of Open Watcom
vi for some time, and it works well… except sometimes it hangs the system. But only sometimes and not very consistently. When it does hang, it usually happens when quitting the editor, but rarely also when starting it up.
I’ve never been able to track down the problem because it doesn’t happen frequently enough and when it does, all I can see that the system ends up in a corrupted state. Until yesterday.
Somehow I ended up in a situation where running
vi immediately after starting a certain DOS VM would trigger the hang, maybe 80% of the time. I decided to pounce. Once again, I couldn’t tell much from the hung state. So I decided to roll out the big gun. I ran the VirtualBox VM in a debug build with software instruction emulation and partial instruction logging—that is, logging every instruction, partially decoded, without logging the register state. This is a compromise that is already slow and produces huge log files, but not as horribly slow verbose as full register state logging. Since the problem was runaway code, I hoped the log would tell me something.
The log looks about like this:
IEMExecLots: cs:rip=c9d8:000016bf ss:rsp=8e66:0000054a EFL=027246 decode - c9d8:00000000000016bf mov rAX,Ov [#2761123] IEMExecLots: cs:rip=c9d8:000016c2 ss:rsp=8e66:0000054a EFL=027246 decode - c9d8:00000000000016c2 add rAX,Iz [#2761124] IEMExecLots: cs:rip=c9d8:000016c5 ss:rsp=8e66:0000054a EFL=027206 decode - c9d8:00000000000016c5 cmp Gv,Ev [#2761125] IEMExecLots: cs:rip=c9d8:000016c8 ss:rsp=8e66:0000054a EFL=027246 decode - c9d8:00000000000016c8 jnc/jnb Jb [#2761126] IEMExecLots: cs:rip=c9d8:000016bf ss:rsp=8e66:0000054a EFL=027246 decode - c9d8:00000000000016bf mov rAX,Ov [#2761127] IEMExecLots: cs:rip=c9d8:000016c2 ss:rsp=8e66:0000054a EFL=027246 decode - c9d8:00000000000016c2 add rAX,Iz [#2761128]
The first thing I noticed was that when things were going wrong, the VM was executing single-stepping breakpoints. That was certainly unexpected, but quickly explained by a POPF instruction popping junk off the stack.
Now I knew where things had definitely gone off the rails. The challenge was finding where it started. Not entirely trivial in a log file about 1.5 GB in size with hundreds of thousands of lines.
There were multiple red herrings. Executing code with CS=FFFFh turned out to be normal, just DOS in HMA. Similarly a stack with SS=00F5h looked suspicious but was just DOS doing its thing.
After more poking around, I narrowed down the region where things probably started going wrong. Searching through the log I noticed a curious thing: There was a code location which executed a RETN instruction a couple of times, but then that RETN suddenly turned into a JMP and things started going wrong.
In other words, there was a routine which finished with a RETN instruction, but somehow that instruction changed into a JMP and then nothing much made sense anymore. Self-modifying code was a possibility, but the jump did look like it ended up in an implausible location.
I knew that I couldn’t find from the log file where the RETN instruction was overwritten. But I could just put a write breakpoint on that memory location and re-run the testcase. That’s the beauty of deterministic environments that don’t use ASLR—things are in the same place every time. The breakpoint was quickly hit, and things started looking more promising. Interestingly, the instruction doing the overwriting was using a
SS: segment override; that was significant.
The reason for hope was that the code segment of the overwriting instruction made me (as it turned out, correctly) suspect that it was code in the
vi editor itself doing the nasty work. The next challenge was to find what the code was. With a map file in hand, I was able to narrow down the location to two or three object files. Running those through the
wdis disassembler quickly allowed me to determine that the code doing the overwriting was in a module called
int.obj. Then it was easy to look at the corresponding source code.
The code installs a timer interrupt handler (called via INT 1Ch from the BIOS timer tick hardware interrupt routine) which is supposed to update the data for a clock displayed in the upper right corner of the editor. I could immediately see the problem: One of the routines was assuming that
SS==DS and using
SS: segment overrides to access the data segment (
That is a big no-no in hardware interrupt handlers. And of course it explained the observed behavior. Almost all the time, when the timer interrupt was handled,
vi was running on its own stack with
SS==DS. But sometimes it wasn’t, and then things went wrong, and some innocent memory got corrupted. But the corruption might not affect
vi itself and only became apparent after quitting the program. Or it might not become apparent at all.
The problem was easily solved by making sure the
int.obj module gets compiled with the
-zu flag. That tells the compiler that
SS!=DS and the compiler will know it can’t use
SS: segment overrides to access
DGROUP. And the saddest part? The
-zu flag used to be there, but then someone completely reorganized the
vi project makefiles, and the flag got lost. The perils of open source development.
This bug was very much in the nasty category. It was only triggered relatively rarely. And even when it was triggered, it corrupted some not entirely predictable memory location which might never be used. Even when it was used, it might be used long, long after the actual corruption occurred. In such cases, an element of luck is often involved—coming up with a reproduction scenario which makes the bug more or less reliably occur.
Even then it’s not smooth sailing, because when the detective arrives at the murder scene, the culprit had long left the room. It requires careful analysis and tracking the footprints of the bug to its origin. That can also be quite difficult and may require various tools to accomplish. In this case, I was able to do that and the fix turned to be trivial. Which is also not unusual—in many cases, finding the bug is orders of magnitude more difficult than fixing the bug.