On Wednesday 07 March 2012 11.17.15 m.roth@5-cent.us wrote:
> Got a bunch of servers from Penguin. Supermicro m/b's H8QG6. We put a 3tb drive in for additional workspace for the users, and some of them won't read, others will go for weeks, then spit out DRDY errors. lshw shows the controller as an ATI SB7x0/SB8x0/SB9x0 SATA.
> ...
> Now, I've been working on one with Penguin. I noticed one thing, that it was set to native IDE. After googling, I saw that the most recent spec, which included EIDE, should be good to petabytes... but I tried resetting it to AHCI anyway.
> The user ran one job, ok... then another last night, and it's spitting the same errors.
> ...
> Mar 7 00:53:28 <server> kernel: ata2.00: failed command: WRITE FPDMA QUEUED
> ...
> 40/00:04:20:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
> ...
> Mar 7 00:53:28 <server> kernel: ata2: hard resetting link
The drive timed out during a write, and the link to it was then hard reset. This is not normal and usually points to a bad drive or buggy firmware.
Have you looked at the SMART data for the drive(s)? You may also want to run the SMART self-tests.
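For the SMART checks, something along these lines might do. This is only a rough sketch and assumes smartctl from smartmontools is installed. The names run, smart_check and DRY_RUN are made up for illustration, and /dev/sdX is a placeholder for the real device. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: SMART checks via smartctl (assumes smartmontools is installed).
# DRY_RUN, run and smart_check are hypothetical names used for illustration.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands (default), 0 = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "+ $*"
    else
        "$@"
    fi
}

smart_check() {
    dev="$1"
    # Identity, health status, attributes and the drive's error log.
    run smartctl -a "$dev"
    # Start the extended (long) self-test; the drive runs it in the background.
    run smartctl -t long "$dev"
    # Show the self-test log; look at it again once the test has finished.
    run smartctl -l selftest "$dev"
}
```

To actually run it, set DRY_RUN=0 and call smart_check /dev/sdX as root. On a 3 TB drive the long self-test will likely take several hours.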
Also, I'd suggest you test the drives in a controlled environment. For example, can any of them survive a full surface write? (dd if=/dev/zero bs=1M of=..) A full surface read? Run the tests directly against /dev/sdX to be sure; that takes partitioning, filesystems, volume management, etc. out of the picture.
Do note that writing your drive full of zeros _will_ destroy your data (I really hope that's stating the obvious...).
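Spelled out a bit more, the two surface tests could look roughly like this. Again just a sketch: run, surface_write, surface_read and DRY_RUN are names I made up, /dev/sdX stands for the real device, and the block-device check is an extra safety net of mine, not something dd does for you. It defaults to a dry run that only prints the dd commands.

```shell
#!/bin/sh
# Sketch only: full-surface write and read tests with dd.
# WARNING: surface_write DESTROYS all data on the target device.
# DRY_RUN, run, surface_write and surface_read are hypothetical names.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands (default), 0 = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "+ $*"
    else
        "$@"
    fi
}

# Overwrite the whole device with zeros. When run for real, dd is expected
# to stop with a "No space left on device" error once it reaches the end.
surface_write() {
    dev="$1"
    if [ "$DRY_RUN" != 1 ] && [ ! -b "$dev" ]; then
        echo "refusing: $dev is not a block device" >&2
        return 1
    fi
    run dd if=/dev/zero of="$dev" bs=1M
}

# Read the whole device and discard the data.
surface_read() {
    dev="$1"
    run dd if="$dev" of=/dev/null bs=1M
}
```

With DRY_RUN=0, call surface_write /dev/sdX and then surface_read /dev/sdX as root, and keep an eye on the kernel log while they run. If the same "failed command" and "hard resetting link" messages show up against the raw device, the filesystem and everything above it are out of the picture.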
/Peter