Every Bit Matters

A couple of months ago the OS/2 Museum got hold of a 13.6 GB Fujitsu MPE3136AT IDE drive from 1999. The drive was working… more or less. It behaved quite strangely; the drive was detected and readable, but seemed oddly slow. It should have been capable of Ultra DMA transfers but delivered data at just under 2 MB/sec.

Looking through Linux dmesg output, it was apparent that the system was trying to communicate with the drive at Ultra DMA speeds, but kept falling back to slower PIO speeds due to CRC errors. Oddly, the drive vendor was also shown as FUBITSU, rather than FUJITSU as one would expect.

And looking at the data the drive returned, it was clear that it was somehow corrupted. For example, messages in the boot sector clearly had some letters wrong, but only some.

What could possibly cause such a problem?

The thing to remember is that when a drive reads a sector off the medium, the data is validated through CRC or other methods. If the CRC does not match, the drive reports an error, but that was not happening here. Either the data recorded on the drive was very strange, or it was getting corrupted somewhere between the drive medium and the host system.

On closer inspection, there was a pattern to the corruption. Looking at the easily recognizable ASCII text in the drive’s boot sector and elsewhere, it was clear that every even byte was fine, and some but not all odd bytes were corrupted. Knowing that IDE has a 16-bit wide data path, that makes some sort of sense.
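One way to make that kind of pattern jump out is to compare a known-good string against what came back and tally the mismatches by byte offset. The Python sketch below is only an illustration: the function name is made up, and the sample text is a stand-in, not the actual contents of this drive's boot sector.

```python
def mismatches_by_parity(expected: bytes, observed: bytes) -> dict:
    """Count corrupted bytes at even and odd offsets."""
    if len(expected) != len(observed):
        raise ValueError("buffers must be the same length")
    counts = {"even": 0, "odd": 0}
    for offset, (good, bad) in enumerate(zip(expected, observed)):
        if good != bad:
            counts["odd" if offset % 2 else "even"] += 1
    return counts


# Hypothetical sample, not data read from the actual drive.
expected = b"Missing operating system"
observed = b"Massifg operatifg sqstee"

counts = mismatches_by_parity(expected, observed)
odd_total = len(expected) // 2
print(f"even bytes corrupted: {counts['even']}")
print(f"odd bytes corrupted:  {counts['odd']} of {odd_total}")
```

For this sample every mismatch lands on an odd offset, and only some of the odd bytes are affected, which is the same shape of damage described above.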

The buffers (DRAM) on the drive could be bad, but the drive firmware should have noticed that. The IDE cable could be bad, but the same cable/controller/host was used with other drives with zero problems, so that was highly unlikely to be the source of the trouble.

Remember the CRC errors mentioned earlier? The drive was calculating a different CRC than the host. That means the drive sent something other than what the host received. Assuming that the data path from the host CPU up to the end of the drive cable is good, and everything on the drive side from the medium at least up to and including the drive buffers is also good, that does not leave many places where the data can get corrupted.
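To illustrate why a mismatch means the two ends saw different data, here is a rough Python sketch in which each side runs the same CRC over the words it saw. This is a generic CRC-16 using the polynomial x^16 + x^12 + x^5 + 1 (0x1021), not a bit-exact implementation of the Ultra DMA CRC. The seed 0x4ABA is what I recall the ATA specification using, so treat it as an assumption, and the sample words are invented for the example.

```python
def crc16(words, seed=0x4ABA, poly=0x1021):
    """Generic MSB-first CRC-16 over 16-bit words (illustrative only)."""
    crc = seed
    for word in words:
        crc ^= word & 0xFFFF
        for _ in range(16):
            if crc & 0x8000:
                crc = ((crc << 1) ^ poly) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


# Hypothetical burst: what the drive put on the bus vs. what the host latched.
sent = [0x4655, 0x4A49, 0x5453, 0x5520]
received = [0x4655, 0x4249, 0x5453, 0x5520]

drive_crc = crc16(sent)
host_crc = crc16(received)
print(f"drive CRC {drive_crc:04X}, host CRC {host_crc:04X}")
print("transfer OK" if drive_crc == host_crc else "CRC error")
```

The two word lists differ by a single bit, and a CRC of this kind always detects a single-bit difference, so the comparison reports an error.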

Let’s think about the corruption a bit. How can FUJITSU turn into FUBITSU? Why is only one letter wrong? Let’s compare the good and bad hex ASCII codes:

F  U  J  I  T  S  U
46 55 4A 49 54 53 55  <-- good
--------------------
46 55 42 49 54 53 55  <-- bad
F  U  B  I  T  S  U

To go from ASCII J to B, it just takes bit 3 to flip from one to zero. And look: F, T, and U (also in odd bytes) did not change because they have bit 3 clear already. If bit 3 in every odd byte, i.e. bit 11 in a 16-bit data path, was consistently forced to zero, we’d get exactly this kind of corruption. Even bytes are untouched, and odd bytes change if and only if they initially have bit 3 set.
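The stuck-bit theory is easy to check with a few lines of Python. The sketch below assumes the usual ATA convention for identify strings, where the first character of each pair sits in the high byte of the 16-bit word (which is why F, J, T and the final U count as odd bytes here). The helper names are made up for this example.

```python
def ata_string_to_words(text: str) -> list:
    """Pack ASCII text into 16-bit words, first character in the high byte."""
    raw = text.encode("ascii")
    if len(raw) % 2:
        raw += b" "  # pad to a whole number of words
    return [(raw[i] << 8) | raw[i + 1] for i in range(0, len(raw), 2)]


def words_to_ata_string(words) -> str:
    """Unpack 16-bit words back into text, high byte first."""
    return bytes(b for w in words for b in (w >> 8, w & 0xFF)).decode("ascii")


def force_bit_low(words, bit: int = 11) -> list:
    """Model a data line that always reads as zero."""
    mask = 0xFFFF & ~(1 << bit)
    return [w & mask for w in words]


good = ata_string_to_words("FUJITSU")
bad = force_bit_low(good, bit=11)
print(words_to_ata_string(good).rstrip(), "->", words_to_ata_string(bad).rstrip())
```

Only the word holding J changes (0x4A49 becomes 0x4249). The I in the same word has bit 3 set too, but it travels in the low byte and is left alone.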

But why would the bit be forced to zero? Is there some obvious damage visible on the drive that I missed when initially plugging it in? Why yes, there is! That bent pin on the back of the IDE connector does not look right at all:

Bent pin on the rear side of an IDE connector

Here’s what it looks like from the connector side:

It’s not entirely clear from the photo, but the pin was bent at 90 degrees and lying flat against the rear side of the connector.

Now, what does that pin do? Let’s take a look at the pinout on Wikipedia; note that the image shows the cable pinout, so we have to mentally flip it left to right. The bent pin is the fifth from the end in the bottom row, i.e. pin 10, which happens to be data bit 11. That is entirely consistent with the data corruption we’ve seen—bit 11 is not connected and always reads as zero! The corruption happens effectively on the cable between the drive and host, and the Ultra DMA CRC checks are designed to catch exactly that kind of problem. And they did catch the problem… only the host “cleverly” scaled down the transfer speed to a mode which performs no CRC checks, and happily delivered corrupted data.
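For anyone who doesn't want to squint at the pinout diagram, the data lines on the 40-pin connector follow a regular pattern: pins 3 through 18 carry DD7, DD8, DD6, DD9 and so on, with odd pins counting down from bit 7 and even pins counting up from bit 8. A small Python sketch of that mapping follows; the function name is just for illustration.

```python
def data_bit_for_pin(pin: int):
    """Return the data bit (0-15) carried by a 40-pin ATA connector pin.

    Returns None for pins that are not data lines.
    """
    if not 3 <= pin <= 18:
        return None
    step = (pin - 3) // 2
    # Odd pins carry DD7 down to DD0, even pins carry DD8 up to DD15.
    return 7 - step if pin % 2 else 8 + step


bent_pin = 10
bit = data_bit_for_pin(bent_pin)
print(f"pin {bent_pin} carries DD{bit}, mask 0x{1 << bit:04X}")
```

Pin 10 maps to DD11, i.e. mask 0x0800 in each 16-bit word, or bit 3 of the odd byte.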

Now we understand exactly why the data was getting corrupted, but how did the pin get bent that way? I’m honestly not sure—it was bent already when I got the drive, and because it was completely out of the way, I didn’t notice anything unusual when plugging in the cable.

It is unusual for a single pin in the middle to be bent; normally it’s several pins on either end that get bent when plugging in and especially unplugging the cable. I can only guess that the pin got somewhat bent initially and then someone forced the data cable in really hard, pushing the pin partially back out of the connector and bending the part that was still within the connector completely flat.

Needless to say, fixing the drive was not very hard. With careful needle-nose pliers action, I straightened the pin out and pulled it forward. Then I plugged in the IDE cable while making sure the pin couldn’t be pushed back again.

After completing the surgery, the drive started working normally. It was able to operate at Ultra DMA speeds with no CRC errors, and the corruption was gone. Problem solved!

6 Responses to Every Bit Matters

rasz_pl says:

December 8, 2020 at 11:51 pm

Mmmm Fujitsu, the sweet sweet smell of conifer forest. The flux they used on those drives smelled amazing!
I worked at European Fujitsu distributor in late nineties. Absolutely the best non IBM* hard drives you could buy at the time. Cheap, fast, dead silent, super reliable. Then this happened: https://www.theregister.com/2002/11/05/fujitsu_admits_4_9_million/
https://www.dataclinic.co.uk/fujitsu-hard-disk-recovery/

“blame was laid on the supplier of epoxy mould compound used in the manufacture of Cirrus’ Himalaya 2.0 and Numbur chips”

Symptoms were drive not being detected, reporting as garbage corrupted strings, not spinning up or even clicking. The irony is they are 100% perfect mechanically. Btw I read somewhere swapping pcbs without making sure they have same firmware rev on board might result in service area corruption requiring actual specialist knowledge to recover (or running premade script in PC3000).

* and we all know how IBM Deskstars turned out ;-(

Michal Necasek says:

December 9, 2020 at 12:50 pm

I had never really come across Fujitsu drives in the 1990s, in hindsight I’m not sure why. In the early 2000s I was a happy user of two 3.5″ 10,000 RPM Fujitsu drives. They weren’t silent but 10k-RPMers just weren’t, and they were definitely cheap, fast, and reliable 🙂

Chips randomly failing due to manufacturing problems is sadly nothing new. I guess that’s the exact opposite of the Quantum sticky actuator problem where the electronics are just fine but the heads can’t move. Or the Deskstars scraping the coating clean off the platters.

Richard Cranium says:

December 10, 2020 at 2:20 pm

As soon as I saw “FUBITSU”, I had a pretty good idea of what was wrong. I would have bet on a bad IDE cable rather than the drive connector being damaged, though – that’s how this sort of thing has happened to me in the past.

Michal Necasek says:

December 11, 2020 at 11:00 am

Cool. I had never seen that before, I’m sure I had bad IDE cables but they just didn’t work at all. I think it needs some amount of (bad?) luck for the drive to be detected and usable when the cable is damaged.

I still wonder who managed to so completely bend just one pin in the middle. It’s very unusual drive damage.

Jason Stevens says:

December 13, 2020 at 6:32 pm

This kind of thing is straight out of the CCIE practical exam, where they would bend a single pin on a router backplane to give some kind of off by one error….. Always strip down and build back from hardware up…

I just went through this with some stupid system where it turns out the keyboard was broken, and it was always sending a F2, which oddly enough didn’t trigger the keyboard error (I guess since it’s all UEFI now?) but it was impossible to do anything other than going into setup.

Good catch on the pin/bit flip though that’d have driven me crazy for a while!

Michal Necasek says:

December 14, 2020 at 11:15 am

You mean the keyboard kept sending F2 keystrokes even if you weren’t doing anything? Yeah old BIOSes would detect that and complain about stuck keys.
