[CentOS] hardware issues? driver issues?

Wed Mar 7 18:16:18 UTC 2012
m.roth at 5-cent.us <m.roth at 5-cent.us>

Peter Kjellström wrote:
> On Wednesday 07 March 2012 11.17.15 m.roth at 5-cent.us wrote:
>> Got a bunch of servers from Penguin. Supermicro m/b's H8QG6. We put a
>> 3tb drive in for additional workspace for the users, and some of them
>> won't read, others will go for weeks, then spit out DRDY errors. lshw
>> shows the controller as an ATI SB7x0/SB8x0/SB9x0 SATA.
> ...
>> Now, I've been working on one with Penguin. I noticed one thing, that it
>> was set to native IDE. After googling, I saw that the most recent spec,
>> which included EIDE, should be good to petabytes... but I tried
>> resetting it to AHCI anyway.
>> The user ran one job, ok... then another last night, and it's spitting
>> the same errors.
> ...
>> Mar  7 00:53:28 <server> kernel: ata2.00: failed command: WRITE FPDMA
> ...
>> 40/00:04:20:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
> ...
>> Mar  7 00:53:28 <server> kernel: ata2: hard resetting link
> While writing the drive timed out and the link to it was then subjected to
> a hard reset. This is not normal and usually points to bad drive or buggy
> firmware.
> Have you had a look at smartdata for the drive(s)? (you may want to run
> the smart selftests)
> Also, I'd suggest you test it in a controlled environment. For example,
> can any of your drives survive a full surface write? (dd if=/dev/zero
> bs=1M of=..)
> Full surface read? Do the tests against /dev/sdX to be sure (excludes
> partitioning, filesystems, volume management, etc.)
> Do note that writing your drive full of zeros _will_ destroy your data (I
> really hope that's stating the obvious...).

Of course. Nahhh... I've run bonnie++ against it, but couldn't provoke the
errors. It only hits when this one user runs *large* jobs with big output.
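
For reference, the full surface read suggested above could also be scripted
so that failing offsets get recorded instead of just aborting. This is only
a rough sketch: read_surface is a made-up name, /dev/sdX is a placeholder,
and the reads go through the page cache rather than using direct I/O, so it
is not a substitute for dd or badblocks. It only reads, so it does not touch
the data.

```python
import os


def read_surface(path, block_size=1024 * 1024):
    """Read `path` front to back; return (bytes_read, bad_offsets).

    `path` may be a regular file or a block device such as /dev/sdX
    (placeholder name). Nothing is written.
    """
    bytes_read = 0
    bad_offsets = []
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size for files and Linux block devices.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            want = min(block_size, size - offset)
            try:
                chunk = os.pread(fd, want, offset)
            except OSError:
                # Record the failing block and skip over it.
                bad_offsets.append(offset)
                offset += want
                continue
            if not chunk:
                break  # unexpected end of device or file
            bytes_read += len(chunk)
            offset += len(chunk)
    finally:
        os.close(fd)
    return bytes_read, bad_offsets
```

Calling read_surface("/dev/sdX") as root and watching the kernel log at the
same time should show whether plain sequential reads alone trigger the
timeouts.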

smartctl: I ran the short self-test just before lunch. It completed without
errors, and smartctl -H reports that the drive passed.
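
The short self-test only exercises a small part of the drive, so an extended
test may be worth queuing too. Below is a minimal sketch of scripting that
across several servers. The function names are hypothetical, /dev/sdX is a
placeholder, and the parser assumes the usual "overall-health
self-assessment test result" line that smartctl -H prints for ATA drives.

```python
import re
import subprocess

# Assumed wording of the health line in `smartctl -H` output for ATA drives.
HEALTH_RE = re.compile(r"overall-health self-assessment test result:\s*(\S+)")


def parse_health(output):
    """Return the health verdict (e.g. 'PASSED') or None if it is absent."""
    match = HEALTH_RE.search(output)
    return match.group(1) if match else None


def smart_health(device):
    """Run `smartctl -H` on `device` and return the parsed verdict."""
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True, text=True, check=False,
    )
    return parse_health(result.stdout)


def start_long_test(device):
    """Queue an extended self-test; results show up later in the drive's log."""
    return subprocess.run(
        ["smartctl", "-t", "long", device],
        capture_output=True, text=True, check=False,
    ).returncode
```

Once the extended test has finished, smartctl -l selftest /dev/sdX lists the
outcome. A PASSED verdict from -H does not by itself rule out link or
controller trouble.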

I saw that it timed out. One of the reasons I included some of the log
lines above was this message:
kernel: ata2.00: device reported invalid CHS sector 0
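
To compare servers, it might help to tally these messages per ATA port from
the syslog. This is a rough sketch, not a complete libata log parser: the
function name and labels are invented for illustration, and the patterns
cover only the messages quoted in this thread.

```python
import re
from collections import Counter

# Matches e.g. "kernel: ata2.00: ..." or "kernel: ata2: ..." and captures
# the port name plus the rest of the message.
ATA_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")

# Substrings taken from the messages in this thread, mapped to short labels.
PATTERNS = {
    "failed command": "failed_command",
    "hard resetting link": "hard_reset",
    "invalid CHS sector": "invalid_chs",
    "(timeout)": "timeout",
}


def summarize_ata_events(lines):
    """Count known ATA error messages per (port, label)."""
    counts = Counter()
    for line in lines:
        match = ATA_RE.search(line)
        if not match:
            continue  # not an ATA kernel message
        port, message = match.groups()
        for needle, label in PATTERNS.items():
            if needle in message:
                counts[(port, label)] += 1
    return counts
```

Feeding it the lines of /var/log/messages from each box, for example
summarize_ata_events(open("/var/log/messages")), would show whether the
resets always land on the same port.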

Also, I noticed that lshw showed the ATI controller as having a width of 32
bits and a clock of 66MHz, and wondered if this could be some sort of
slip-through-the-cracks case that the driver doesn't handle correctly.