From the Annals of Preprocessor Hackery

Posted on October 22, 2023 by Michal Necasek

Over the last few days I’ve been slowly attacking the source code for 386MAX, trying to build the entire product. One of the many problems I ran into turned out to be quite interesting.

There are several (16-bit) Windows components in 386MAX, and many have some sort of GUI. As is common, dialogs etc. are built from templates stored in resource (.rc) files. But… trying to process many of the resource files with the standard Windows 3.1 resource compiler fails:

The resource compiler does not like something

The problem is dialog text labels composed of multiple strings. Something like this:

LTEXT "Qualitas 386MAX\nVersion " "8" "." "03",-1, 12,7,112,17

For obvious reasons, the authors wanted to automatically update the strings with the current version numbers, and used macros to build those strings. Only that doesn’t quite work in the resource compiler.

But Why Not?

In C and C++, this is not a problem. String literals are merged together, thus the following are equivalent:

"Hello" " " "World"
"Hel" "lo World"
"Hello World"

But that does not happen in the resource compiler. We may wonder why not, but the fact is that it doesn’t.

The catch is that although the resource compiler (RC.EXE) uses a preprocessor (RCPP.EXE) which is essentially identical to a C compiler preprocessor (in fact almost certainly built from the same source code), a C preprocessor does not perform string literal merging. Again we may wonder why not, but the fact is that it doesn’t.

The upshot is that if the resource compiler expects a string, it must be supplied with a single string literal, because multiple consecutive string literals will not be merged.

A Quality Hack

For building 386MAX, Qualitas solved the problem in a manner that is as clever as it is dirty. Qualitas wrote a tool called RCPPP, described as an “RCPP postprocessor”. It was used as follows: the original resource compiler preprocessor (RCPP.EXE) had to be renamed to _RCPP.EXE, and the Qualitas RCPPP.EXE needed to be copied to RCPP.EXE.

When the resource compiler (RC.EXE) was run, the Qualitas wrapper would pass its arguments to the original preprocessor; after the preprocessor was finished, the wrapper would re-open the temporary file holding the preprocessor output, rewrite it to merge string literals, and then return control to the resource compiler. Voila, problem solved.

This was a very clever but nasty solution, because it required modifying the vendor tools (a big no-no). It was likely done that way because the Microsoft resource compiler does not offer any way to only run the preprocessor.

A Less Hacky Approach

There would have been a less hacky approach available, at the cost of makefile complication. The RCPP.EXE preprocessor can of course be executed as a standalone tool, and the preprocessed output could then be further rewritten to merge string literals.

Or, one might take advantage of the fact that the resource compiler preprocessor is a C preprocessor, and just use the C compiler to do the preprocessing.

Either approach requires multiple steps (preprocess, postprocess, run resource compiler) but does not require modifying the tools. Using the C compiler to preprocess additionally avoids relying on the internals of the resource compiler.

What Is Even RCPP.EXE?

Raymond Chen says: The Resource Compiler’s preprocessor is not the same as the C preprocessor, even though they superficially resemble each other.

I would rate that claim as misleading. In reality, the resource compiler’s preprocessor is very much a C preprocessor, with minor differences. I should add that the following applies to the Windows 3.1 Resource Compiler, which may not be quite like the newer NT based resource compilers.

A quick look at RCPP.EXE reveals that not only is it very much like a C preprocessor, it is more or less identical to the first phase of a Microsoft C compiler, which, based on the strings included in it, does a lot more than just preprocessing.

Here is a screenshot of a handful of error messages from the Microsoft Windows 3.1 Resource Compiler preprocessor (RCPP.EXE):

For comparison, here’s a screenshot of error messages from the C1.ERR file corresponding to the first phase (C1.EXE) of the Microsoft C 5.1 compiler:

The similarity is not coincidental, and it is far more than superficial (even though the strings are not identical). Also note that most of the error messages apply to a C language compiler, not just a preprocessor.

I would guess that the Windows 3.1 RCPP.EXE is built from the source code for the first pass of the Microsoft C compiler, circa version 5.1 (other versions have noticeably different error messages, while version 5.1 is a close match). The similarities go far enough that, for example, the command line of the preprocessor child process (C1.EXE/RCPP.EXE) is in both cases passed in an environment variable called MSC_CMD_FLAGS (bypassing the DOS 128 character command line length limit).

It should therefore not be surprising that RCPP.EXE and the C preprocessor behave almost identically. Consider the following:

rcpp -DRC_INVOKED -Ic:\msvc\include -E -g foo.i -f my.rc
cl -DRC_INVOKED -Ic:\msvc\include -E my.rc > foo.i

Both produce nearly identical output, the only difference being slashes and backslashes in #file directives.

As an aside, the RCPP.EXE shipped with the Windows 1.x and Windows 2.x SDKs seems to be a very close relative of the first phase of the Microsoft C 3.0 compiler (P1.EXE); RCPP.EXE is identical between the Windows 1.x and 2.x SDK versions. For Windows 3.0, the preprocessor was upgraded with the one from Microsoft C 5.1 (or something quite close), and stayed unchanged for Windows 3.1.

A Different Solution

Or… instead of massaging the preprocessor output, perhaps there is a way to avoid the problem entirely?

The stringize operator (#) of the C preprocessor can be used to turn preprocessing tokens into a single character string literal. This approach requires separate machinery because the preprocessing tokens must not be string literals—otherwise extraneous double quotes end up in the output.

Using the C (or RC) preprocessor in this manner is not exactly intuitive, and understanding how and why it works requires a fairly deep understanding of the mechanics of the preprocessor. For all intents and purposes, the C preprocessor is a completely different language from C.

Suppose we want to produce a string like “386MAX Version 8.03”. In the original resource file, it was achieved as follows:

#define VER_MAJOR_STR    "8"
#define VER_MINOR_STR    "03"
#define VERSION          VER_MAJOR_STR "." VER_MINOR_STR
...
LTEXT   "386MAX Version " VERSION, ...

It’s simple enough, except (as explained above), it doesn’t work because the resource compiler does not concatenate string literals.

The resource compiler compatible version is rather more involved, and the first attempt might look something like this:

#define VER_MAJOR       8
#define VER_MINOR       03

#define VER_MKSTR(s)    #s
#define VER_STR(s)      VER_MKSTR(s)

#define VER_PV_RC(m, n) VER_PRODVER_RC m.n
#define VER_PVSTR_RC    VER_STR(VER_PV_RC(VER_MAJOR, VER_MINOR))
...
#define VER_PRODVER_RC  386MAX Version
...
LTEXT   VER_PVSTR_RC, ...

Note that instead of VER_MAJOR_STR, we must use VER_MAJOR. Also note that if we want the version to be displayed as 8.03 rather than 8.3, VER_MINOR must be defined as 03 rather than 3.

Therein lies the first pitfall. If we wanted to set the version to 8.08, we might define VER_MINOR as 08. That would work nicely for the resource compiler, but not in C language arithmetic… because 08 is not a valid octal constant, and neither is 09. (If that does not make sense, you just do not know the C language well enough.) It is simple enough to define separate macros (say VER_MINOR_RC) for the preprocessor, should the need arise.

There are other pitfalls. Suppose we want to use the company name in the string:

#define VER_PRODVER_RC    Qualitas, Inc. 386MAX Version

Now, it just so happens that the above works as users likely expect in the Windows 3.1 Resource Compiler preprocessor, but only because the Microsoft preprocessor is strange.

In more or less any other C preprocessor, the above doesn’t work. The comma causes too many arguments to be passed to the VER_STR macro when processing the VER_PVSTR_RC macro. Some compilers (e.g. Watcom, IBM) warn and throw away the comma and everything past it. Other compilers (Borland) error out and do not accept the input. Still others (gcc) behave differently again, not expanding a function-like macro with too many arguments at all.

The C90 standard (relevant for the Windows 3.1 Resource Compiler) is clear:

The number of arguments in an invocation of a function-like macro shall agree with the number of parameters in the macro definition, and there shall exist a ) preprocessing token that terminates the invocation.
C90 Standard, section 6.8.3

It is an error to use a comma in this context. Fortunately there is an easy workaround:

#define VER_PRODVER_RC  Qualitas\x2C Inc. 386MAX Version

Instead of a comma character, we can use a hexadecimal escape sequence with the ASCII code for a comma (2Ch) to achieve the desired result.

This workaround takes advantage of the fact that to the preprocessor, \x2C is just a random sequence of four characters (backslash, x, 2, C). Only in a later phase of translation does the escape code get converted into a single character, together with classics like \n or \0.

Conclusion

At any rate, it is possible to use the preprocessor to produce a single string literal acceptable to the resource compiler. It is not exactly straightforward, primarily because the preprocessor is a rather different beast from the C language proper, but it is doable.

The bottom line is that it is no longer necessary to massage the preprocessor output, and it is certainly not necessary to hack the resource compiler itself to insert an extra processing stage. The unmodified resource compiler now produces the desired output, at the cost of a bit of extra baggage largely hidden away inside one header file.


25 Responses to From the Annals of Preprocessor Hackery

zeurkous says:

October 22, 2023 at 5:52 pm

Another workaround would be to put the numbers in hexadecimal 🙂

But yeah, “0 means octal” is one of C’s warts.

The end result looks suspiciously like… what was it called…
ah, SYSEDIT (a program which me’s never found very useful).
Michal Necasek says:

October 23, 2023 at 11:44 am

No, hex won’t help, because if you do ‘#define VER_MINOR 0x08’ then you will end up with "386MAX Version 8.0x08".

The editor is more or less a regular plain text editor, but I guess it was meant for quick editing of configuration files, at least the way it was shipped.

As an aside, I have seen more than one bug report complaining that this or that C compiler does not handle a number like 0123 correctly. Little do they know…
zeurkous says:

October 23, 2023 at 1:56 pm

The hex comment was a joke, sorry.

Even SYSEDIT is basically an MDI version of Notepad, isn’t it?
zeurkous says:

October 23, 2023 at 5:43 pm

While the octal convention likely came from the PDP-7 — the 36-bit word
of which was commonly used to store six 6-bit characters –, with two
octal digits representing a character much like two hexadecimal digits
representing an 8-bit character (byte) now, me has no idea why “0” was
chosen as a prefix. (Fast-forward a century or two, and people will
start tripping over the fact that “x” is a valid digit in higher
bases…)
Richard Wells says:

October 24, 2023 at 10:15 am

In some implementations of K&R C, octal 8 (08) and octal 9 (09) were accepted as their decimal values. ANSI cleaned that up but broke code in the process. I remember that some compilers allowed keeping the looser K&R syntax.
Michal Necasek says:

October 24, 2023 at 12:35 pm

Yes, octal numbers made sense back in the day, so I can see why they wanted them in C.

Using a leading zero to designate octal numbers was simply a bad choice though, to anyone who knows basic arithmetic it makes no sense that 10 and 010 are different numbers.

Now I almost wonder if they meant to use O instead of zero and someone got it mixed up 🙂
Michal Necasek says:

October 24, 2023 at 12:36 pm

I don’t think ANSI had much choice really, the K&R implementations were divergent enough that not breaking any existing code wasn’t an option. At least in this case the newly unacceptable constants would have been flagged and people could easily fix it.
zeurkous says:

October 25, 2023 at 1:36 pm

You could be on to something there: at [0], it is mentioned that a
phone dictation service at Bell Labs returned “i-node” as “eye node”;
something similar may well have happened with “oh” versus “zero”.

[0] https://arstechnica.com./gadgets/2019/08/unix-at-50-it-starts-with-a-mainframe-a-gator-and-three-dedicated-researchers/3/ [1]

[1] Me’d actually swear that meread this elsewhere before, but now
mecan’t find it.
Fernando says:

October 30, 2023 at 8:29 pm

@zeurkous
It is in a paper from Dennis Ritchie, you can find it in the preserved site of Dennis Ritchie at Nokia (Nokia bought Bell Labs).
The system was that you called an extension inside Bell Labs and recorded your voice message; later a secretary transcribed it and the next day it was on your desk. The secretary in this instance didn’t know the technical terms so wrote what she(he) understood.
zeurkous says:

October 31, 2023 at 11:33 am

@Fernando: that’s the zeroth place me looked, since me’d also swear it
was there, but even if it is, me’s unable to find the specific paper.
Fernando says:

November 2, 2023 at 5:02 am

@zeurkous
I misremembered; I was sure it was from those papers, but no, it’s from an interview with Thompson, you can find it here:
https://www.tuhs.org/Archive/Documentation/OralHistory/expotape.htm
zeurkous says:

November 2, 2023 at 9:04 am

Ah yeah, me probably read it there, thanks.
Stu says:

November 8, 2023 at 5:39 pm

If they’d used “o”/”O” as the octal prefix, we’d just be annoyed that we can’t have variables named “o123” while “a123”, “b123”, etc. is fine.

Ideally it would be consistent with the other base specifiers, i.e. 0x and 0b (currently C++ only). “0o” is a bit difficult to read, but “0c” would work.
Richard Wells says:

November 9, 2023 at 7:07 am

The restrictions on variable names in line number BASIC were attributable to limits on memory. C which started on machines with huge amounts of memory should have variables that involve actual words. Variable names comprised of a letter followed by a number is just a recipe for confusion.

It always did seem odd that an alleged transcription error was the reason for having a number instead of a letter for the octal prefix. Enough other errors were corrected over time in the language specification.
zeurkous says:

November 9, 2023 at 12:09 pm

According to [0], the memory limits were on the compiler, so there
likely wasn’t room for something like full syntactic separation of
{variable name,literal}s (or at least, without doing it in a clumsy
manner), which would be a more final solution [1].

[0] https://bell-labs.co./who/dmr/primevalC.html

[1] Of course, even now, n years later, that development still hasn’t
happened. “Good enough” is one thing, but as the decades go by, that
position is starting to look a *little* extreme…
zeurkous says:

November 9, 2023 at 12:14 pm

As for lack of correction of the “oh” “error”: it likely bit so few
people that it wasn’t deemed worth the effort to fix. Me’ll speculate
that this was to be fixed in a direct successor to C (which never quite
appeared; just like a direct UNIX successor never quite appeared).
Stu says:

November 9, 2023 at 6:34 pm

@Richard Wells

“C which started on machines with huge amounts of memory should have variables that involve actual words.”

Don’t forget that for many early C compilers, only the first 6 characters of an identifier were “significant”. Even the ANSI (C89) standard doesn’t require more than that, at least when referring to identifiers defined in external translation units.

There are also, of course, mathematical and scientific formulae for which it’s natural to use variable names like “O1” (e.g. where the mathematical notation might be θ₁; that’s theta-subscript-1 in case of any Unicode issues). It’s generally a good idea to keep the source code close to the mathematical notation for ease of verification.
Victor Khimenko says:

November 13, 2023 at 4:39 am

I wouldn’t call PDP-11 (where C originated) “a system with huge amount of memory”. You would get 64KB for one user if you are lucky.

And six characters limitation was all too real for years after C90 standard was finalised.

It’s a bit harder to explain why C++ retained so many limitations of C, but I guess the idea was to slowly adapt the language and change it, I don’t think Bjarne expected ossification to happen so quickly.
Michal Necasek says:

November 13, 2023 at 4:36 pm

At least early on, there were no C++ compilers, there was Cfront which generated C code. So the limitations were hard to sidestep, because everything had to go through a regular C compiler and a linker. Even much later there were C++ compilers that translated C++ into C.

Agree that the early C machines were quite resource limited, certainly not as much as a PDP-7 but C didn’t assume mainframe class resources.
Richard Wells says:

November 13, 2023 at 10:06 pm

Most PDP-11 models* had 128K for each user; 64KB for code and 64KB for data. That is much more than the 4K or 16K available to many of the micro line-number BASICs that used variable names with a maximum of 2 characters.

* Okay, there was the PDP-11/03 using the LSI-11 chipset that was limited to a total of 64KB which arrived in 1976, long after C was established. With mini-Unix, that meant 4 KW for I/O addressing, 12 KW for OS, and 16 KW for a user program. C compilers weren’t fitting in that space for long.
Michal Necasek says:

November 15, 2023 at 6:35 pm

According to this classic, you might be overstating the resources available to the early C compilers. The first PDP-11 used in the time when B turned into C only had 24K bytes memory total according to the paper. The PDP-11 models available before circa 1975 had quite limited memory; the late 1970s models were certainly much bigger and could handle megabytes of RAM.
Richard Wells says:

November 16, 2023 at 4:05 pm

Recognizable Unix and C show up with Fourth Edition Unix which only ran on the 18-bit addressing PDP-11/45. Replacing core memory with MOS or bipolar memory did increase the memory that could be placed in a system and Unix quickly occupied all of that. Unix required the memory manager and the memory manager meant the system had at least 128K.
Jeffrey H. Johnson says:

December 27, 2023 at 10:53 am

Are your changes to ease building 386MAX being made public?
Michal Necasek says:

December 27, 2023 at 12:56 pm

They will be when I have something to publish. Unfortunately not yet.
MiaM says:

March 7, 2024 at 7:59 pm

Perhaps a stupid question, but:
Is it even necessary to run the rc preprocessor rather than running the c preprocessor on the rc files?

Stu: Having 0o to indicate octal numbers would be great for object oriented code. (this is a really bad joke) 🙂

Richard Wells: I know that I’m really in the minority, but my opinion is that local variables should preferably have meaningless names, like either “a1” or nonsense names like “carrot” or “saxophone”. My reasoning is that it makes it easier to identify that a variable is a local temporary variable rather than something “larger”. Also there is no need to understand what such variables do unless you read the code, and if you read the code you might as well read it well enough that you understand what the variables do. Kind of.

Also, a super hot take is that loads of code written during the last decade(s) does absolutely nothing. There are tons of glue functions that call glue functions, and functions that combine calling a glue function with adding a parameter to another parameter or something trivial that could have been done at the place that function gets called.

I remember a time when it was actually possible to read code and understand what it does without opening the same source code file in half a dozen windows scrolled to different places to follow the glue function steps… I may exaggerate a bit, but still.