Peter Kjellström wrote:
On Wednesday 07 March 2012 11.17.15 m.roth@5-cent.us wrote:
Got a bunch of servers from Penguin. Supermicro m/b's H8QG6. We put a 3TB drive in each for additional workspace for the users; some of the drives won't read, others will go for weeks, then spit out DRDY errors. lshw shows the controller as an ATI SB7x0/SB8x0/SB9x0 SATA.
...
Now, I've been working on one with Penguin. I noticed one thing, that it was set to native IDE. After googling, I saw that the most recent spec, which included EIDE, should be good to petabytes... but I tried resetting it to AHCI anyway.
The user ran one job, ok... then another last night, and it's spitting the same errors.
...
Mar 7 00:53:28 <server> kernel: ata2.00: failed command: WRITE FPDMA QUEUED
...
40/00:04:20:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
...
Mar 7 00:53:28 <server> kernel: ata2: hard resetting link
While writing, the drive timed out and the link to it was then hard reset. This is not normal and usually points to a bad drive or buggy firmware.
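To see whether the timeouts and resets cluster on one port, something like the following could tally them from the syslog. This is only a sketch: `ata_summary` is a made-up helper name, and it assumes the log lines have the same shape as the excerpts quoted above.

```shell
# Hypothetical helper: count libata "failed command" and "hard resetting link"
# events per ATA port in a syslog file.
# Assumes lines shaped like: "... kernel: ata2.00: failed command: ..."
ata_summary() {
    logfile="$1"
    grep -E 'kernel: ata[0-9]+(\.[0-9]+)?: (failed command|hard resetting link)' "$logfile" |
        # Keep only the port name (ataN) and the event type.
        sed -E 's/.*kernel: (ata[0-9]+)(\.[0-9]+)?: (failed command|hard resetting link).*/\1 \3/' |
        sort | uniq -c
}
```

Run it as `ata_summary /var/log/messages` (the log path is an assumption; use wherever your kernel messages end up). If every event lands on the same ataN, that points at one drive, cable or port rather than the controller as a whole.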
Have you had a look at the SMART data for the drive(s)? (You may want to run the SMART self-tests.)
Also, I'd suggest you test it in a controlled environment. For example, can any of your drives survive a full surface write? (dd if=/dev/zero bs=1M of=..) Full surface read? Do the tests against /dev/sdX to be sure (excludes partitioning, filesystems, volume management, etc.)
Do note that writing your drive full of zeros _will_ destroy your data (I really hope that's stating the obvious...).
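A minimal sketch of the non-destructive half of that test, assuming the drive shows up as /dev/sdX (a placeholder, substitute the real device). `surface_read` is just an illustrative wrapper name around plain dd.

```shell
# Read-only surface pass: read every block of the target and discard it.
# Point it at the whole device (e.g. /dev/sdX), not a partition or filesystem.
# dd exits non-zero if a read fails, so the exit status tells you pass/fail.
surface_read() {
    target="$1"
    dd if="$target" of=/dev/null bs=1M
}

# Destructive counterpart, left commented out on purpose. It overwrites
# everything on the target; dd will stop with "No space left on device"
# when it reaches the end of the disk, which is expected.
# dd if=/dev/zero of=/dev/sdX bs=1M
```

Usage would be `surface_read /dev/sdX` as root, with a `tail -f` on the kernel log in another terminal to catch any ata timeouts or resets while it runs.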
<g> Of course. Nahhh... I've run bonnie++ against it, but couldn't provoke it. It's this one user, who runs *large* jobs, with big o/p, when it hits.
smartctl - I ran the short test just before lunch, and smartctl -H reports it passed, completed without errors.
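The short test and -H only cover a small part of the drive, so the extended self-test and the attribute table may show more. A rough sketch, assuming smartmontools is installed and the drive is /dev/sdX (placeholder); the function names and the SMARTCTL override variable are made up for illustration.

```shell
# SMARTCTL can be overridden (e.g. for a dry run); defaults to the real binary.
SMARTCTL="${SMARTCTL:-smartctl}"

# Kick off the extended (long) self-test. It runs inside the drive and can
# take many hours on a 3TB disk; this command returns immediately.
smart_start_long() {
    "$SMARTCTL" -t long "$1"
}

# Once the test has finished: show the self-test log and the vendor
# attribute table (reallocated/pending sector counts and the like).
smart_show_results() {
    "$SMARTCTL" -l selftest "$1"
    "$SMARTCTL" -A "$1"
}
```

Typical use is `smart_start_long /dev/sdX`, wait for the estimated completion time smartctl prints, then `smart_show_results /dev/sdX`.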
I saw that it timed out. One of the reasons I included some of the stuff above was this line: kernel: ata2.00: device reported invalid CHS sector 0
Also, I noticed that lshw showed the ATI controller having a width of 32 bits, and a clock of 66MHz, and wondered if there could be some sort of slip-through-the-cracks where the driver didn't handle this correctly.
mark