Learn Something Old Every Day, Part XVII: DHCP and ARP Don’t Mix in WSA SMP

I just spent an inordinate amount of time debugging a VM running OS/2 Warp Server Advanced SMP (WSA SMP). The VM was working fine (except for sometimes hanging very early during boot, a known issue with the SMP kernel), but TCP/IP networking just would not work.

It’s not that networking did not work at all–using LAN Server with NETBEUI (that is, the NetBIOS Frames protocol) worked fine. So I started digging deeper. I soon realized that Warp Server was unable to resolve hardware addresses using ARP. It sent out ARP queries, but never received any response. That is, the other end did send an ARP response, but the response seemed to vanish somewhere in the ether before it could get processed by the TCP/IP stack.

I tried to find out what debugging tools were available. I checked the packet statistics–nothing received. OS/2 comes with a pair of handy debugging tools, IPTRACE and IPFORMAT. These are sort of like an old-timey Wireshark, working on the IBM TCP/IP protocol stack. Sure enough, IPTRACE/IPFORMAT showed that ARP packets were going out, but the replies weren’t coming in.

If I manually added the requisite ARP entries (using arp -s) then TCP/IP happily sprang into action. But without that, no luck. Nothing worked because the system could not even figure out how to talk to the gateway. The only thing that worked was DHCP, because by definition the DHCP client does not need to know the server’s address (either IP or hardware).
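To illustrate why broadcast traffic and static entries both sidestep the problem, here is a minimal Python sketch of next-hop resolution. It is a toy model and says nothing about how the OS/2 stack is actually implemented; the function name resolve_next_hop and the dictionary-based table are made up for illustration, and the addresses are only examples.

```python
BROADCAST_IP = "255.255.255.255"
BROADCAST_MAC = "FF:FF:FF:FF:FF:FF"


def resolve_next_hop(dest_ip, arp_table):
    """Return the destination MAC, or None if an ARP exchange is needed first."""
    if dest_ip == BROADCAST_IP:
        # The limited broadcast address maps straight to the Ethernet
        # broadcast address, so no ARP lookup is involved.
        return BROADCAST_MAC
    return arp_table.get(dest_ip)


arp_table = {}

# A DHCP client broadcasts its first request, so an empty table is enough.
dhcp_discover_mac = resolve_next_hop(BROADCAST_IP, arp_table)

# Unicast to another host is stuck until ARP (or a static entry) fills the table.
before_static_entry = resolve_next_hop("10.0.2.1", arp_table)

# Rough equivalent of adding a static entry by hand (example addresses).
arp_table["10.0.2.1"] = "52:55:0A:00:02:01"
after_static_entry = resolve_next_hop("10.0.2.1", arp_table)
```

In this model the broadcast always resolves, while the unicast lookup returns None until the entry is added manually, which matches the behavior described above.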

For reference, here’s what the IPTRACE/IPFORMAT output looks like when ARP is working properly:

-------------------------- #:3 --------------------------
 Delta Time:  8.813sec   Packet Length: 42 bytes (2A hex)
 DIX:   Dest: FF:FF:FF:FF:FF:FF   Source: 00:00:C0:2B:FB:D2
-------------------------- ARP --------------------------
 ARP:  Hardware Type:1     (Ethernet 10Mb)
 ARP:  Protocol Type:0800 (IP Address)
 ARP:  Hardware Len:6 
 ARP:  Protocol Len:4 
 ARP:  Operation:1  (ARP Request)
 ARP:  Sender HW address: 0000C02BFBD2
 ARP:  Sender PA: 010.000.002.010.
 ARP:  Target HW address: 000000000000
 ARP:  Target PA: 010.000.002.001.

-------------------------- #:4 --------------------------
 Delta Time:  0.000sec   Packet Length: 64 bytes (40 hex)
 DIX:   Dest: 00:00:C0:2B:FB:D2   Source: 52:55:0A:00:02:01
-------------------------- ARP --------------------------
 ARP:  Hardware Type:1     (Ethernet 10Mb)
 ARP:  Protocol Type:0800 (IP Address)
 ARP:  Hardware Len:6 
 ARP:  Protocol Len:4 
 ARP:  Operation:2  (ARP Response)
 ARP:  Sender HW address: 52550A000201
 ARP:  Sender PA: 010.000.002.001.
 ARP:  Target HW address: 0000C02BFBD2
 ARP:  Target PA: 010.000.002.010.

In the problem case, the ARP response never showed up.
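For readers who want to map the IPFORMAT fields onto the raw bytes, the following Python sketch rebuilds the 42-byte ARP request from packet #3 above and decodes it again. It assumes the standard Ethernet II plus IPv4-over-Ethernet ARP layout; the helper names (mac_str, ip_str, parse_arp_frame) are invented for this example and have nothing to do with the OS/2 tools.

```python
import struct


def mac_str(raw):
    """Format 6 raw bytes as a colon-separated MAC address."""
    return ":".join(f"{b:02X}" for b in raw)


def ip_str(raw):
    """Format 4 raw bytes as a dotted-decimal IPv4 address."""
    return ".".join(str(b) for b in raw)


def parse_arp_frame(frame):
    """Decode an Ethernet II (DIX) frame carrying an IPv4-over-Ethernet ARP packet."""
    dest, source, ethertype = struct.unpack("!6s6sH", frame[:14])
    if ethertype != 0x0806:
        raise ValueError("not an ARP frame")
    # 28-byte ARP body; anything after byte 42 (padding) is ignored.
    htype, ptype, hlen, plen, oper, sha, spa, tha, tpa = struct.unpack(
        "!HHBBH6s4s6s4s", frame[14:42]
    )
    return {
        "dest": mac_str(dest),
        "source": mac_str(source),
        "hardware_type": htype,
        "protocol_type": ptype,
        "hardware_len": hlen,
        "protocol_len": plen,
        "operation": oper,  # 1 = request, 2 = response
        "sender_hw": mac_str(sha),
        "sender_ip": ip_str(spa),
        "target_hw": mac_str(tha),
        "target_ip": ip_str(tpa),
    }


# Packet #3 from the trace, rebuilt by hand from the decoded fields.
request = (
    bytes.fromhex("FFFFFFFFFFFF")            # Ethernet destination (broadcast)
    + bytes.fromhex("0000C02BFBD2")          # Ethernet source
    + bytes.fromhex("0806")                  # EtherType: ARP
    + bytes.fromhex("0001080006040001")      # htype, ptype, hlen, plen, operation
    + bytes.fromhex("0000C02BFBD2") + bytes([10, 0, 2, 10])  # sender HW / PA
    + bytes(6) + bytes([10, 0, 2, 1])        # target HW (unknown) / PA
)

parsed = parse_arp_frame(request)
```

The response in packet #4 is reported as 64 bytes, presumably because of padding up to the Ethernet minimum frame size; the parser above only looks at the first 42 bytes, so such padding does not matter.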

I tried to find out if there was some lower-level tracing available. The answer seems to be “not really”, or at least not from IBM. OS/2 does have built-in “trace points” for networking (and many, many other things). The tracing is enabled through TRACE ON 164 (where 164 is the “major code” for networking), assuming that the system has tracing enabled in CONFIG.SYS, for example by adding the line TRACEBUF=63.

The OS/2 Debugging Handbook includes extensive documentation of system trace points… but absolutely nothing about the trace points related to networking, because networking was done by a different group within IBM. After I spent some time scratching my head, it turned out that the networking trace points are defined in the IBM LAN Technical Reference, of all places.

Warp Server actually comes with a tool, semi-hidden among the “MPTS applets” which must be manually installed (see MPTSAPLT.ZIP on the installation CD-ROM). The tool is called DTF5 and it gives rough hints about how to enable tracing. Not only must tracing be enabled in CONFIG.SYS (using TRACE or TRACEBUF statements), but it must also be separately enabled in PROTOCOL.INI using the mysterious OS2TRACEMASK keyword (some useful hints can be found here).

Long story short, it did not help: nothing TCP/IP related was traced at all. I suspect the problem is that the tracing happens inside the NETBEUI or TCPBEUI protocol drivers, but the ARP traffic is handled by the IFNDIS.SYS/AFINET.SYS drivers. To be exact, traffic from NETBEUI was traced, but not anything TCP/IP related (neither outgoing nor incoming). So that didn’t tell me anything new. And I could not find anything whatsoever about debugging IFNDIS.SYS–or rather the recommendation was “use IPTRACE/IPFORMAT”, which I had already done.

There are 3rd-party network tracing tools that rely on a special “wedge” NDIS driver which intercepts network traffic on the NDIS level. For whatever reason, the OS/2 NDIS stack does not have any tracing capability, even though the ODI-based stack appears to have some tracing built in. In the end I didn’t try this and I am 99.9% certain it would not have helped–it would have shown that the NDIS driver was receiving the ARP replies, but that TCP/IP somehow wasn’t.

I considered how the observed behavior might be possible. Other OS/2 versions did not have this problem. Not Warp Connect, not Warp 4, not Warp Server for e-Business or newer. Even Warp Server Advanced (non-SMP) worked fine. This is quite possibly because WSA SMP comes with a special SMP-enabled TCP/IP stack that was not used in any of the other releases.

But how would such a fundamental problem go undetected in released software? I had a few theories. Not everyone used TCP/IP back in 1996. And for those who did, Warp Server may have been the server, handling DHCP etc. The TCP/IP stack is smart enough to update its ARP table from incoming IP traffic, so it does not necessarily need ARP replies. It might also work with gratuitous ARP (an untested theory).
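The idea that a stack can get by without ARP replies, as long as the other side talks first, can be sketched with a toy ARP cache in Python. This is only a model of the theory above, not of IBM’s code: the class ArpCache, its method names, and the client address are all hypothetical, and the model only covers hosts on the same link.

```python
class ArpCache:
    """Toy ARP table mapping IPv4 address strings to MAC address strings."""

    def __init__(self):
        self.entries = {}

    def on_arp_reply(self, sender_ip, sender_mac, delivered=True):
        # delivered=False models the failure described in this post: the
        # reply is on the wire but never reaches the TCP/IP stack.
        if delivered:
            self.entries[sender_ip] = sender_mac

    def on_ip_packet(self, source_ip, source_mac):
        # Learn the sender's hardware address from any incoming IP packet.
        self.entries[source_ip] = source_mac

    def can_reach(self, ip):
        return ip in self.entries


cache = ArpCache()

# The server asks for 10.0.2.1, but the reply is lost before the stack sees it.
cache.on_arp_reply("10.0.2.1", "52:55:0A:00:02:01", delivered=False)
gateway_known = cache.can_reach("10.0.2.1")

# A client (made-up addresses) contacts the server first, so the server
# learns the client's MAC without ever needing an ARP reply.
cache.on_ip_packet("10.0.2.20", "00:00:C0:11:22:33")
client_known = cache.can_reach("10.0.2.20")
```

In this model a machine that only ever answers requests from clients keeps working, while anything it tries to initiate itself fails, which would make the bug easy to miss on a pure server.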

Eventually I realized that ARP is only broken when Warp Server Advanced SMP is configured to use DHCP. The installer actually prevents that configuration, claiming that DHCP can’t be used if TCP/IP over NetBIOS is also used (I can’t really imagine why… unless that was IBM’s workaround for the broken ARP). No DHCP, no problem. As soon as DHCP is run (dhcpstrt -i lan0), ARP stops working. Curious!

On a hunch, I tried applying a fix for the MPTS component of Warp Server. I happened to notice that IBM suggested several fixes were necessary when installing Warp Server SMP on IBM hardware. One of these was MPTS FixPak WR08503.

Which I absolutely could not find anywhere. But I could find the next one, WR08504. Installed, rebooted… problem gone! With the updated MPTS, ARP now works whether or not DHCP is used. Amazing!

In retrospect, I should have tried the MPTS FixPak sooner. Then again, I just didn’t think the TCP/IP stack would be that broken. Evidently the breakage did not show in practice, or at least not before WSA SMP was released.


2 Responses to Learn Something Old Every Day, Part XVII: DHCP and ARP Don’t Mix in WSA SMP

  1. MiaM says:

    Interesting!

    I would think that no one with any sense would use DHCP for production servers at the time. And that kind of leaves DHCP for computers used for testing, and at the time few if any would have had an SMP computer as a test rig, perhaps?

    Well, thinking about it, at a somewhat large company there might have been some HP Vectra XU 5/90 computers that could have ended up as lab servers, originally bought with a single processor but fitted with an additional processor for SMP. I have no idea what price and whatnot the XU 5/90 had when new and why a company would buy them for running DOS+Win 3.x, but that happened at a place I used to work at. Later on I ran NT4 on some of these and it was great with two processors.

    But the chance that someone had a leftover older SMP computer and used that as a lab server with OS/2 Warp Server Advanced SMP in particular, and also used DHCP, seems slim. So perhaps it took ages before the problem arose.

  2. Michal Necasek says:

    That completely depends on the network setup. It is quite common to have effectively static IP address assignments but utilize DHCP to distribute these settings to the machines, based on MAC addresses.
