[CentOS] PATA Hard Drive woes

Thu Nov 4 13:48:44 UTC 2010
Lamar Owen <lowen at pari.edu>

>From: Keith Roberts <keith at karsites.net>

>On Wed, 3 Nov 2010, Lamar Owen wrote:
>> Might want to check the power supply as well.  Bad/flakey 
>> power can indeed cause damage to the drive surface; been 
>> there, done that, have two Maxtor 250GB drives with 
>> scribbled servo data to prove it.

>OK.

> I'm running the server from an APC UPS Back-UPS 650, so 
> there should not be any glitches in the power supply, should 
> there?

Probably not on the AC side, although the Back-UPS 650 isn't a full online UPS but a switching standby UPS.  (Full online units, like the APC Symmetra 16KVA units I have here, rectify to DC, float the batteries at all times, and run the output from the inverter all of the time unless they're switched to bypass.)  The SmartUPS 1400RM I had in front of the PC that suffered the glitchy power is, unless I'm mistaken, also a full online pure sinewave UPS like the Symmetra, and is still in service (I checked its output on my oscilloscope first, though).

No, I was referring to the output DC voltages (+12V, +5V, +3.3V, -5V, and -12V) from the power supply inside the system.

In addition to my own personal RAID1 of 250GB drives, I also, at a different time, lost a RAID5 array of 15K 36GB SCSI drives in a Dell 1600SC server; testing the power supply showed lots of noise and complete dropouts of a few milliseconds' duration on the drive connectors' 5V supply pins.  That completely and thoroughly scrambled the servo data on the Hitachi drives, meaning they didn't just start showing bad sectors; they started getting seek errors.  The 5V line on the drive connectors was reading an AC RMS of 4V superimposed on the +5V, yielding an effective DC voltage of 4V.  This happened over a period of three weeks, during which time I had a number of mysterious failures (the Hitachi drives were error-correcting so well that by the time they started reporting errors it was way past too late, and eventually the drives could not even power up).  Upon investigation, I found that the power supply in question provided the motherboard (where the DC power sensors on that box are) with clean 5V, while the drives were powered from a separate 5V rail, meaning the Dell management system wasn't seeing the power problems.

A simple power supply tester with a built-in meter can be bought for less than $20; a more thorough power analyzer will run more than that.  But even the simple one caught the failing Dell 1600SC supply.  It took an oscilloscope to test the Antec in my personal box; it turned out to be a cold solder joint in the Antec.  A new power supply is less expensive than the equivalent labor it took to fix the Antec.  I keep a known good 500W ATX 12V server-grade supply (8-pin 12V plug with adapters, and 24-pin ATX plug with 20-pin adapter) around for testing; the power supply is one of the very first things I check when a flaky PC is brought in.  (The very first thing is the dust accumulation, and the second thing is the heatsink compound.)
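
If you are reading the rails with a plain meter, something like the following could flag a rail that has drifted out of range.  This is only a sketch: the tolerance figures are the commonly cited ATX ones (+/-5% on the positive rails, +/-10% on the negative ones) and are an assumption here, so check them against the datasheet for your supply.  The names NOMINAL_TOLERANCE and out_of_tolerance are made up for illustration.

```python
# Commonly cited ATX DC tolerances (assumption: +/-5% on positive rails,
# +/-10% on negative rails); verify against your supply's datasheet.
NOMINAL_TOLERANCE = {
    "+12V": (12.0, 0.05),
    "+5V": (5.0, 0.05),
    "+3.3V": (3.3, 0.05),
    "-5V": (-5.0, 0.10),
    "-12V": (-12.0, 0.10),
}


def out_of_tolerance(readings):
    """Return {rail: measured volts} for each rail outside its allowed band."""
    bad = {}
    for rail, measured in readings.items():
        nominal, tolerance = NOMINAL_TOLERANCE[rail]
        # Allowed deviation is a fraction of the nominal voltage's magnitude.
        if abs(measured - nominal) > abs(nominal) * tolerance:
            bad[rail] = measured
    return bad
```

A steady DC reading like this will not show millisecond dropouts or ripple; for those you still need a scope on the drive connector.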

One of the first things I do on any CentOS system I put together is install lm_sensors and gkrellm (gkrellm from a third-party repo).  I then enable all the motherboard sensors that are available in the gkrellm plugins, and run it (either local GUI or through ssh X forwarding to my central monitoring PC).  On Supermicro boards I install SuperODoctor for Linux, available on the Supermicro site.  The GUI runs well (there are some odd dependencies, however) and will e-mail you on alarm conditions that you can set.  These include fan RPM, temperatures, and voltages.  The CLI program isn't quite so sophisticated, but it can be run periodically and the result sent by e-mail for health checks.
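
For the periodic e-mailed health check, a minimal sketch in Python might look like this.  It uses the `sensors` command from lm_sensors rather than the Supermicro CLI (whose options I won't guess at here), and the addresses, host name, and function names are all placeholders.  It assumes Python 3.7 or later and a mail server listening on localhost.

```python
import smtplib
import subprocess
from email.message import EmailMessage


def build_report(body, sender, recipient, host):
    """Wrap raw sensor output in a plain-text e-mail message."""
    msg = EmailMessage()
    msg["Subject"] = f"Sensor report for {host}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(body)
    return msg


def send_sensor_report(sender, recipient, host, smtp_host="localhost"):
    """Run `sensors` once and mail its output (assumes a local SMTP relay)."""
    result = subprocess.run(
        ["sensors"], capture_output=True, text=True, check=True
    )
    msg = build_report(result.stdout, sender, recipient, host)
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)
```

Calling send_sensor_report() from a small script run out of cron would give a daily or hourly report.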

Drives that are having trouble will show up with high iowaits; run iostat (from the sysstat package) and look at the await result.  Long awaits mean the drive is having trouble (or it has firmware issues like WD's EARS and EADS drives have in RAID configurations).
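
To pick the slow devices out of `iostat -x` output automatically, something along these lines could work.  It is a sketch, not a tested monitoring tool: the column names differ between sysstat versions (older ones print a single await column, newer ones split it into r_await and w_await), so it looks for any header ending in "await".  The 100 ms threshold and the function name are arbitrary choices for illustration.

```python
def slow_devices(iostat_text, threshold_ms=100.0):
    """Return {device: worst await in ms} for devices above threshold_ms."""
    slow = {}
    columns = None
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("avg-cpu"):
            # A CPU block follows; forget the device columns until the
            # next device header shows up.
            columns = None
            continue
        if fields[0].startswith("Device"):
            # Header row: remember which columns hold await figures.
            columns = [i for i, name in enumerate(fields)
                       if name.endswith("await")]
            continue
        if columns and len(fields) > max(columns):
            try:
                worst = max(float(fields[i]) for i in columns)
            except ValueError:
                continue  # not a numeric device row
            if worst > threshold_ms:
                slow[fields[0]] = worst
    return slow
```

Feed it the text captured from `iostat -x`; any device it returns is worth a closer look with smartctl or a power check.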