A while ago I griped about a strangely ill-behaved Intel DX79SR Stormville board. To recap, the board simply refused to take any memory in the 4th memory channel. Since then, there have been some very interesting new developments in the story.
I found another complaint about these boards that exactly matches my symptoms. I missed it at first because it doesn’t mention the DX79SR board (it’s about DX79SI/TO). To recap, in 2011, Intel came out with two boards based on the X79 chipset, DX79SI (Siler) and a slightly stripped down and cheaper DX79TO (Toler). In 2012, the DX79SI was replaced by DX79SR with more USB 3.0 ports and more 6Gbps SATA ports. The three boards all use the same PCB, same BIOS, and are for most intents and purposes identical. It is therefore unsurprising that they’d also have the same problems.
Worth noting is that near the end of the discussion, a user claimed to have fixed just such a problem by straightening a bent pin in the CPU socket. Since the memory controller is on the CPU, that is not implausible, although I could not find any sign of a bent pin on my Stormville board.
In the meantime, I obtained a relatively cheap DX79TO board. The board was in excellent condition and looked like new. And it had no trouble using four memory channels, with the exact same memory and exact same CPU that just would not run in the 4th channel on the DX79SR board.
That proved there was nothing wrong with the memory (which was more or less a foregone conclusion anyway) and also that there was nothing wrong with the CPU (the chance that three or four random LGA2011 CPUs would be broken in exactly the same way was, to put it mildly, not high). So the Stormville board must be at fault somehow. But how?
Because I can’t leave well enough alone, I also bought a DX79SI (Siler) board on the cheap side. The board was sold as broken. After it arrived, I didn’t find any major damage on the board but it did have several pins bent in the CPU socket. I straightened the pins as best as I could, plugged in an i7-3820 CPU, powered up the board, and it booted up just fine.
Except the Siler board had exactly the same problem as the Stormville board! No memory worked in the 4th memory channel and no amount of pleading would help. Having read about the bent pins supposedly causing the 4th memory channel to fail, I of course double checked and triple checked the CPU socket but could not find any further problems.
So at this point I had two near identical boards with exactly the same problem and a third board with no problem whatsoever. I’ve had two boards broken in exactly the same way before… could I be so lucky that the same thing happened to me again? If so, I should probably start buying lottery tickets.
For unrelated reasons, I decided to acquire several low-power LGA2011 CPUs. When a Xeon E5-2637 (a mere 80W TDP processor) arrived, I plugged it into the Siler board. It worked fine. It is a funny CPU with 3.0 GHz base frequency, 3.5 GHz turbo, but only two cores.
On a lark, I tried plugging the “spare” DIMM into the fourth memory socket. The machine booted up just fine. It suddenly had no trouble with four memory channels (4x8GB RAM). How is that possible? All I did was to replace the CPU.
But wait! It gets even weirder. Next I put back the i7-3820 CPU, leaving the memory in place. And it still worked. To recap: i7-3820, 4th memory channel no go; E5-2637 CPU, 4th memory channel works; same i7-3820 CPU back, 4th memory channel still works.
In fact I tried doing various things (including a CMOS memory reset) in an attempt to restore the DX79SI board to its original cranky state and failed. It is unlikely that inserting a random CPU would magically heal some sort of hardware defect on the board; but if there is some kind of state managed by the board’s firmware, where is it hiding?
If certain CPUs fail to work with the 4th memory channel under some conditions, that might explain the old tales of woe from users who tried all kinds of different memory modules and even replaced the board once or twice to no avail—because the CPU was usually the only thing they didn’t change.
Back to the DX79SR
A few days ago I returned to my original Stormville board. I attempted to repeat what I had done to the Siler board—install a Xeon E5-2637, put memory in the fourth channel, and see what happened.
Which was nothing. That is to say, the machine still behaved exactly the same; nothing worked in the 4th channel (including the exact same four memory modules that worked in the DX79TO and DX79SI boards).
That was when I started looking at the socket on the Stormville board again. And I noticed that one pin on the edge of the socket was bent after all. While bent pins in the middle of the array are surprisingly easy to spot because they disrupt the regular pattern, damaged pins on the edge do not stand out at all (the edges aren’t straight lines). That was something I learned when fixing the DX79SI socket, which had several bent pins both in the middle and around the edges.
The bent pin was almost certainly A37 on the LGA2011 socket, which corresponds to the DDR3_DQ land on the CPU. Note that DDR3 here does not stand for the DDR3 technology but rather for the fourth DDR memory channel (the others being DDR0, DDR1, and DDR2).
The bent A37 pin may have touched pin B36, which is labeled as VSS (ground) on the CPU. If the two pins really touched, that could perhaps explain why the board reacted angrily to anything in the 4th memory channel. Even if the pins did not touch, the fourth memory channel could not work with one data bit effectively missing.
With fine tweezers, I straightened out the pin in the Stormville board’s socket, and re-installed the CPU to see if it still worked; it did. Then I added a fourth memory module… and lo and behold, the board booted up with all four! So I added another bank of four modules for a total of 64 GB (the maximum supported), and the DX79SR board still worked. Mystery solved!
What Really Happened?
I am 99% certain the problem on the DX79SR Stormville board was caused by a bent pin. What I’m less sure about is what caused the problem on the DX79SI Siler board. Did simply swapping out the CPU by sheer luck fix a bent pin? In light of the DX79SR experience, I consider it more plausible than the CPU change triggering some firmware setting side effect.
Or to be more precise, it’s entirely believable that changing the CPU would change the BIOS settings (because it certainly does), but it’s unlikely that completely clearing the BIOS settings wouldn’t bring the original behavior back.
In addition, randomly finding two near-identical boards showing the exact same symptoms yet with completely different root causes seems more than a little unlikely. It’s more believable that they both had pins bent in the same area, and that those pins just happen to be connected to the fourth memory channel.
Now, were all or most of the original 2012-2013 Intel DX79 memory complaints caused by bent pins? At least one reportedly was. And if so then the LGA2011 sockets must be more susceptible to this problem than one would think. It is quite likely that some people really had issues with incompatible memory, but I found several cases where users had the exact same trouble I did, and it wasn’t the memory (or CPU) but rather specific board slots causing trouble.
At any rate, I am quite certain that my DX79SR Stormville problem was caused by a bent pin, which “only” caused the 4th memory channel to become unusable but did not otherwise prevent the board from working and in fact running quite fast and stable, with weeks of uptime.
Minor damage to LGA sockets is really sneaky. I have not had it happen (at least not knowingly) with LGA775 sockets, quite possibly because LGA775 is physically much smaller and is therefore less likely to be damaged. The LGA2011 sockets are much bigger and damaging them is therefore easier even when some care is exercised.
I’m also more used to pin-based CPUs which simply do not have this problem. The CPU pins are fairly easy to inspect visually and if they are more than very slightly bent, the CPU simply cannot be installed. If the CPU can be plugged into the socket, the pins are by definition good. In the LGA case, bent pins do not prevent the CPU from fitting in the socket, and may not prevent the system from booting up and mostly working.
It is also sneaky that non-functioning memory slots can be the result of board damage in the CPU socket, rather than the more obvious memory slots. But I’m glad my boards are now working properly!