When I learned that Microsoft released the GW-BASIC source code, I was mildly curious to find out what is or isn’t there. The short answer is that there’s a whole lot, but a lot is also missing. Spelling note: Both “GW-BASIC” and “GW BASIC” can be found in the source code. The hyphenated spelling will be used here for consistency.
The first question is: When is the source code from? Microsoft marked the source files February 10, 1983, but that’s almost guaranteed to be wrong. The date comes from comments in the code: “This translation created 10-Feb-83 by Version 4.3”. That reflects running some sort of master BASIC source code through a translator generating 8086 code. The source code was almost certainly modified after that date.
My current best guess is that the source code is roughly from mid-1983. But that’s only a guess.
Assembling the Source
The next order of business was figuring out how to assemble the source code. The Microsoft source release provides absolutely no clues on this front. There is no makefile (although perhaps it’s too old for one), no batch file, no build notes, nothing.
The GW-BASIC source code makes several mentions of Intel’s ASM86, but the source uses far too many MASM specifics. It is likely that some older version used ASM86, but not the released source.
Armed with a collection of MASM versions, I tried assembling the source. It did not go well. Nothing could be assembled. MASM 5.1 seemed to get the furthest, which was odd because it’s really far too new (1988); moreover, MASM 5.1 has a built-in INSTR operator which clashes with an INSTR symbol in the GW-BASIC source code.
It turned out that MASM 5.1 was merely more tolerant of UNIX line endings. Old MASM versions require DOS-style (CR/LF) line endings and get very upset otherwise, spitting out confusing errors.
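The line-ending conversion can be sketched in a few lines of Python. This is only an illustration of the idea, not the tool actually used here; the function name `to_dos_line_endings` is made up for the example.

```python
def to_dos_line_endings(data: bytes) -> bytes:
    """Return data with every LF or CR/LF line ending normalized to CR/LF."""
    # Collapse existing CR/LF pairs first so they are not doubled below.
    unix = data.replace(b"\r\n", b"\n")
    return unix.replace(b"\n", b"\r\n")


# Small self-check on an in-memory sample with UNIX line endings.
sample = b"CALL SYNCHR\nRET\n"
converted = to_dos_line_endings(sample)
```

When applying this to real files, they should be read and written in binary mode so that Python does not translate line endings on its own.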
After massaging the source files to make them more palatable to MASM, things got more interesting. Long story short, almost all the files can be assembled with Microsoft MASM 1.00 or 1.10, as well as IBM MASM 1.0. There are known problems with the very old MASM versions that can be avoided by reducing conventional memory size to 512 KB.
Most files cannot be assembled with Microsoft MASM 1.12 or later, or IBM MASM 2.0. The cause is generally the stricter diagnostics in newer MASM versions, which reject questionable constructs in the GW-BASIC source code.
These are the kinds of statements that MASM 1.12 and later refuse:
MOV DX, OFFSET 256*100+OPCNT
MOVS ?CSLAB,WORD PTR ?CSLAB
The exception is the GWMAIN module. MASM 1.x versions fail to assemble it because they run out of memory. The module can be successfully assembled with IBM MASM 2.0 or Microsoft MASM 3.0. No amount of pleading convinced MASM 1.x to work.
This raises some questions. IBM MASM 2.0/MS MASM 3.0 are really too new (1984) for the GW-BASIC source code. It is possible that Microsoft used development versions of MASM; it is known (see page 337) that Microsoft shipped the bulk of GW-BASIC to OEMs in object code form and OEMs needed to supply glue code required for GW-BASIC to interface with their platform. It is thus possible that the code could not actually be assembled with a generally available off-the-shelf tool.
There is also some possibility that Microsoft did use MASM 1.0 or 1.1 but not hosted on DOS. At any rate, IBM MASM 1.0 plus IBM MASM 2.0 can be used to assemble the source code, and so can Microsoft MASM 1.10 plus MASM 3.0.
There was also an easily resolved mystery related to the GW-BASIC math package. There are two source files, MATH1.ASM and MATH2.ASM. Neither can be assembled on its own. But if they are merged together, e.g. by including both from a master source file, assembly succeeds. The MATH module may have been split because the source code is almost 180 KB and certainly would not fit on a 160 KB floppy.
Update: Shortly after writing the above, I hit paydirt. MASM 1.06, ostensibly from 1982, can cleanly assemble all of the GW-BASIC source files, with no syntax errors and no running out of memory. A copy can be found here (as MACRO86.EXE) and here; the two executables have different date stamps but are in fact bit for bit identical. Why both older and newer MASM versions run out of memory on GWMAIN.ASM remains a mystery for now, but we now know that there was at least one MASM version that could assemble everything on a PC.
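Checking that two copies are bit-for-bit identical despite different date stamps is simple to do; the following Python sketch shows one way, comparing SHA-256 digests of the file contents. The helper names are made up for the example, and this is not necessarily how the comparison above was done.

```python
import hashlib
from pathlib import Path


def sha256_of(path) -> str:
    """Return the SHA-256 digest of a file's contents as a hex string."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def identical(path_a, path_b) -> bool:
    """True if both files hold the same bytes; time stamps are ignored."""
    return sha256_of(path_a) == sha256_of(path_b)
```

A call such as `identical("MACRO86.EXE", "MASM.EXE")` would then compare two local copies; the second file name is a placeholder.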
Comparing with a Binary
The next todo item was finding a GW-BASIC binary that’s close to the released source code. It quickly turned out that most GW-BASIC binaries are either older or newer. The right ones show
(C) Copyright Microsoft 1982
but may display various version numbers. They may or may not mention GW-BASIC. In the end I zeroed in on two binaries. One was GWBASIC.EXE dated Nov 11, 1983, file size 56,832 bytes, showing the following:
EAGLE GWBASIC Version 1.20 11/11/83
(C) Copyright Microsoft 1982
The other was BASICA.EXE dated May 13, 1983, file size 54,272 bytes. The sign-on message was:
The COMPAQ Personal Computer BASIC
(C) Copyright COMPAQ Computer Corp. 1983
(C) Copyright Microsoft 1982
Both of these are a very good but not perfect match for the released source code. I am almost certain that the Compaq version is slightly older than the source code (because there are a few bits missing), while the Eagle version is slightly newer (because there are a few extra bits). That implies the released source code is older than November 1983 but possibly newer than May ’83.
Mapping Out the Binary
I concentrated on the Eagle Computers GWBASIC.EXE since it seemed to be a slightly better match for the source code. I was able to match all of the source code with the binary and arrived at the following sequence of source modules (note that BI stands for BASIC Interpreter):
GWDATA.ASM
GWMAIN.ASM
OEM.ASM
GWEVAL.ASM
GWLIST.ASM
IBMRES.ASM
BIMISC.ASM
DSKCOM.ASM
BIPTRG.ASM
BIPRTU.ASM
BISTRS.ASM
FIVEO.ASM
GENGRP.ASM
ADVGRP.ASM
MACLNG.ASM
GWSTS.ASM
GIO86.ASM
GIODSK.ASM
GIOKYB.ASM
GIOSCN.ASM
GIOLPT.ASM
GIOCOM.ASM
GIOCON.ASM
GIOTBL.ASM
SCNEDT.ASM
SCNDRV.ASM
CALL86.ASM
NEXT86.ASM
MATH.ASM (MATH1.ASM + MATH2.ASM)
KANJ86.ASM
GIOCAS.ASM
ITSA86.ASM
GWRAM.ASM
GWINIT.ASM
BIBOOT.ASM
OEM.ASM is a hypothesized OEM-supplied module which is not part of the GW-BASIC source code distribution. It is not a trivial piece of code and accounts for over 6,000 bytes of object code in the Eagle GWBASIC.EXE (more than 10% of the total).
It is likely that other GW-BASIC implementations order the modules differently, although the order of some of the modules at the beginning and end may be fixed (for example GWDATA.ASM needs to be first).
Reading the source code is fascinating. The code clearly has a long history:
--------- ---- -- ---- ----- --- ---- -----
COPYRIGHT 1975 BY BILL GATES AND PAUL ALLEN
--------- ---- -- ---- ----- --- ---- -----
ORIGINALLY WRITTEN ON THE PDP-10 FROM FEBRUARY 9 TO APRIL 9 1975
BILL GATES WROTE A LOT OF STUFF.
PAUL ALLEN WROTE A LOT OF OTHER STUFF AND FAST CODE.
MONTE DAVIDOFF WROTE THE MATH PACKAGE (F4I.MAC).
Paul Allen was clearly involved for a while:
FIVEO 5.0 Features -WHILE/WEND, CALL, CHAIN, WRITE /P. Allen
There is no indication that Bill Gates or Paul Allen were involved by the time the product became GW-BASIC.
The source code is written, as was then common, in ALL CAPS (although not completely).
One of the most jarring things is that, as was also common in the bad old days, identifiers are limited to six characters. That leads to ugly, cramped, and hard-to-decipher identifiers like FRMQNT or SKPMRF or LEVFRE or XCESDS. The 6-character limitation is also applied to file names.
The code is generally quite unstructured and very hard to follow. The PROC keyword is not used at all. Procedures are used, but rather loosely. Code very frequently jumps into the middle of another routine or returns from a routine by using a JMP rather than RET. As a consequence, there are only minimal attempts to keep values in registers and almost all data is kept in memory. The jumpy programming style also makes it impossible to use local variables on the stack. No doubt the code is written that way because it was originally targeting the Intel 8080.
The code contains a nice collection of “what not to do” Intel recommendations. To be fair, those recommendations don’t really apply to the 8086. The style violations include mixing of code and data and jumping into the middle of an instruction.
For example, calls to the SYNCHR routine are followed by one byte of data (excerpt from FIVEO.ASM):
CALL SYNCHR
DB OFFSET 54O ;Must be comma
CMP AL,LOW 54O ;Ommit line # (Use ALL for instance)
The byte is not code, it is data. SYNCHR pops the return address off the stack, processes the data and increments the address, then pushes it back.
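The return-address manipulation can be modeled in a few lines of Python. This is a toy sketch, not a transcription of SYNCHR: the memory contents, the offsets, and the assumption that "processing" means comparing the inline byte against the current input character are all illustrative.

```python
COMMA = 0o54  # 54 octal is 44 decimal, the ASCII code for ","

# Hypothetical code bytes: a 3-byte near CALL, the inline data byte, then a
# NOP standing in for the next real instruction.
MEMORY = bytes([0xE8, 0x00, 0x00, COMMA, 0x90])


def synchr(memory, stack, current_char):
    """Toy model: consume the data byte that follows the CALL instruction."""
    return_address = stack.pop()       # points at the inline byte, not at code
    expected = memory[return_address]  # read the data byte
    if current_char != expected:       # assumed meaning of "processes the data"
        raise ValueError("syntax error: expected %r" % chr(expected))
    stack.append(return_address + 1)   # a later RET resumes after the data byte


stack = [3]  # the CALL at offset 0 pushed the address of the byte after it
synchr(MEMORY, stack, ord(","))
```

After the call, the top of the stack holds offset 4, so execution would resume at the instruction following the data byte instead of trying to execute the byte itself.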
The other type of abuse is even more interesting (excerpt from GWMAIN.ASM):
PUBLIC SNERR
SNERR: MOV DL,LOW OFFSET ERRSN ;"SYNTAX ERROR"
DB 271O ; SKIP ;"LXI B," OVER THE NEXT 2
PUBLIC DV0ERR
DV0ERR: MOV DL,LOW OFFSET ERRDV0 ;DIVISION BY ZERO
DB 271O ; SKIP ;"LXI B," OVER THE NEXT 2
PUBLIC NFERR
NFERR: MOV DL,LOW OFFSET ERRNF ;"NEXT WITHOUT FOR" ERROR
DB 271O ; SKIP ;"LXI B," OVER THE NEXT TWO BYTES
Note that LXI is an 8080 instruction, clearly revealing where the idea came from. When the caller jumps to one of the labels, it will execute a MOV DL followed by a sequence of MOV CX instructions. The CX value is ignored and only the contents of DL are used.
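To see why the trick works, it helps to decode the bytes by hand. The Python sketch below models just the two opcodes involved: 0xB2 (MOV DL, imm8) and 271 octal, i.e. 0xB9 (MOV CX, imm16). The error numbers, the offsets, and the HLT byte used as an end marker are assumptions made for the illustration; unlike the excerpt, the last entry here is not followed by another skip byte.

```python
MOV_DL_IMM8 = 0xB2    # 8086 opcode: MOV DL, imm8
MOV_CX_IMM16 = 0o271  # 8086 opcode: MOV CX, imm16 (0xB9), the "skip" byte
STOP = 0xF4           # HLT, used here only as an end marker for the sketch

# Hypothetical error numbers standing in for ERRSN, ERRDV0 and ERRNF.
ERRSN, ERRDV0, ERRNF = 2, 11, 1

CODE = bytes([
    MOV_DL_IMM8, ERRSN, MOV_CX_IMM16,   # SNERR  at offset 0
    MOV_DL_IMM8, ERRDV0, MOV_CX_IMM16,  # DV0ERR at offset 3
    MOV_DL_IMM8, ERRNF,                 # NFERR  at offset 6
    STOP,
])
ENTRY = {"SNERR": 0, "DV0ERR": 3, "NFERR": 6}


def run(code, ip):
    """Decode from offset ip until STOP; return the final (DL, CX) values."""
    dl = cx = None
    while code[ip] != STOP:
        opcode = code[ip]
        if opcode == MOV_DL_IMM8:
            dl = code[ip + 1]
            ip += 2
        elif opcode == MOV_CX_IMM16:
            # The 16-bit operand is really the next entry's MOV DL instruction.
            cx = code[ip + 1] | (code[ip + 2] << 8)
            ip += 3
        else:
            raise ValueError("unexpected opcode 0x%02X at offset %d" % (opcode, ip))
    return dl, cx


results = {name: run(CODE, offset) for name, offset in ENTRY.items()}
```

Entering at any label leaves DL holding that label's own error number, because every later MOV DL has been swallowed as the operand of a MOV CX. Only CX ends up with junk, and CX is ignored.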
Both of these techniques make disassembly somewhat difficult and confusing, although only very slightly so when one is armed with the source code.
Understanding how GW-BASIC manages memory takes a bit of effort. As was common and necessary in the old days, GW-BASIC discards initialization code and uses the recovered memory for other purposes. The label CSEND indicates the end of resident code with the following comment: “All code loaded after this label is resident only until routine MAPINI initializes the new memory map.”
It should be noted that GW-BASIC effectively uses the small memory model. The CS segment register points to code and DS/ES/SS all have the same value pointing to the data segment. The data segment size is variable and depends on the available memory (but can’t be more than 64K). There is no attempt at exploiting the segmented nature of the 8086 architecture; that makes sense given the 8-bit heritage and the fact that early PCs did not have all that much RAM in the first place.
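The 64K ceiling follows directly from how the 8086 forms addresses: a 16-bit segment value is shifted left by four bits and added to a 16-bit offset. The Python sketch below illustrates the arithmetic; the segment value is an arbitrary example, not something taken from GW-BASIC.

```python
def physical_address(segment: int, offset: int) -> int:
    """8086 real-mode address: segment * 16 + offset, wrapped to 20 bits."""
    return (((segment & 0xFFFF) << 4) + (offset & 0xFFFF)) & 0xFFFFF


DATA_SEGMENT = 0x1000  # arbitrary example value shared by DS, ES and SS

first = physical_address(DATA_SEGMENT, 0x0000)
last = physical_address(DATA_SEGMENT, 0xFFFF)
reachable = last - first + 1  # bytes addressable without changing the segment
```

With DS, ES, and SS all holding the same value, every data pointer is a plain 16-bit offset, so `reachable` comes out to 65,536 bytes no matter how much RAM the machine has.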
Within the BASIC data segment, memory is subdivided into several areas. The basic layout is documented in the file GWINIT.ASM (see comment “Memory map for GW-BASIC”). There is stack overflow checking which is invoked for all larger memory allocations; as mentioned above, GW-BASIC does not use local stack variables, which means its stack usage is otherwise minimal.
It would be handy to find an existing GW-BASIC executable which is an exact match for the released source code. So far I’ve not been successful and in fact the vast majority of Microsoft BASIC interpreters are either older (BASIC 5.x) or newer (GW-BASIC 3.x) versions.
It should also be possible to reverse engineer/disassemble/reconstruct the missing OEM source module (or modules) required to produce a complete GW-BASIC executable. That is likely to be a fair amount of work.