CentOS 4.3 occasionally locking up accessing IDE drive

Bart Schaefer

1 Apr 2006

9:16 p.m.

For those who haven't seen my several previous postings about problems with this (now not quite so) new PC, I have an ASUS P5N32-SLI Deluxe motherboard. The boot drive and primary filesystems are on an SATA disk and I'm having no problem with that. However, I recently plugged in a couple of IDE drives salvaged from my old PCs and I'm running into trouble with one of those.

The drive in question is a 20GB Maxtor 92049U6. It had an old RH5.2 ext2 filesystem on it when I first plugged it in, from which I tried to recover some data to back up to CD. Mostly this worked, but I started encountering read errors accessing some files so I unmounted the partition and started a fsck on it. At some point during the fsck (I was off doing something else on another workspace at the time), the system locked up hard, leaving the disk activity LED lit. I had to reset the PC.

So at that point I booted single-user and ran the fsck from there. It completed successfully after fixing a number of problems. I continued into multi-user mode, finished doing my backups, repartitioned the drive, and started "mkfs -t ext3 -c" on the larger partition, to check for bad blocks. Again at some point part way through the mkfs, the system locked up.

Back to single user mode, run the "mkfs", everything finishes fine. Back to multi-user mode, start to copy some large files onto the drive. MD5 sums fail to match for some of the copied files. Unmounted and started up "fsck -y". This succeeded, after fixing a number of errors, so (at this point just as a test case) I re-copied the files with bad MD5s. Some of these came through OK this time, others still did not. I decided perhaps this meant there were still bad blocks on the drive that a read-only test was not finding.
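The copy-and-compare test described above can be scripted so it is easy to repeat. This is only a minimal sketch, not the exact procedure used here: the function name `copy_and_verify` and any paths passed to it are hypothetical, and it assumes the usual `cp`, `md5sum`, and `cut` utilities are available.

```shell
#!/bin/sh
# copy_and_verify SRC DEST_DIR   (hypothetical helper, for illustration only)
# Copies SRC into DEST_DIR, then compares the MD5 sums of source and copy.
# Prints "OK <copy>" or "MISMATCH <copy>"; returns non-zero on failure.
copy_and_verify() {
    src=$1
    dest_dir=$2
    dest="$dest_dir/$(basename "$src")"

    cp -- "$src" "$dest" || return 2

    # md5sum prints "<hash>  <name>"; keep only the hash field.
    src_sum=$(md5sum < "$src" | cut -d' ' -f1)
    dest_sum=$(md5sum < "$dest" | cut -d' ' -f1)

    if [ "$src_sum" = "$dest_sum" ]; then
        echo "OK $dest"
    else
        echo "MISMATCH $dest"
        return 1
    fi
}
```

Something like `copy_and_verify bigfile.iso /mnt/suspect` would then report one file per call. Note that an immediate read-back may be served from the page cache rather than from the disk, so unmounting and remounting the partition before comparing is likely a more honest test.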

You'd think I'd have learned, but encouraged by the success of the previous fsck I optimistically started up another "fsck -c -c -y" on the suspect partition, and this time I waited around to watch it. About 1.6GB into the 16GB partition, the system locked up again.

This time I booted into a hard disk diagnostic program instead of into CentOS. After running overnight last night, a non-destructive read-write surface-scan reported no problems with the drive. This leads me to suspect that the problem is with linux, but I don't know how to proceed with diagnosing it. Suggestions would be appreciated.

William L. Maltby

1 Apr

9:47 p.m.

On Sat, 2006-04-01 at 11:16 -0800, Bart Schaefer wrote:

<snip>

Re the overnight diag, are environmental conditions similar to when you encounter problems? Temp, power "brown out", etc? If *not*, try the diag when conditions are similar to when you have the problem. Long shot, but you've obviously gotten to the point of needing a long rifle.

Secondly, are your current HD configurations consistent with what is actually on the drive?

"sfdisk -l /dev/hdXXX"

and then look at your BIOS settings for it. If the BIOS where the disk was originally set up assigned different params than the current BIOS (assuming you did auto-detect rather than set up manually) could this be involved? I don't *think* BIOS settings are actually used in current Linuxes, but I could be wrong. If sfdisk and/or the BIOS show different params than the old BIOS, maybe a manual setup of the HD params will fix your problem?

Since you have an (apparent) inconsistent behavior, how about cable integrity, connectors, power etc.? Poorly seated or worn connector could be very sensitive to temp changes and vibration. Have you visually inspected all this, especially the power? Have you put volt meters on the +5/+12 and their respective grounds when the system is under load to see if maybe your PS is too weak?

HTH Bill

P.S. Tried different IDE cable/power connectors?

Bart Schaefer

2 Apr

7:56 a.m.

On 4/1/06, William L. Maltby BillsCentOS@triad.rr.com wrote:

...

Re the overnight diag, are environmental conditions similar to when you encounter problems? Temp, power "brown out", etc?

Yep. The PC was custom-built less than 12 weeks ago, it's on a UPS only a couple of weeks older than that, the case power supply is supposed to handle up to twice as many drives as I have in there, and the IDE cables are brand new and firmly seated.

...

Secondly, are your current HD configurations consistent with what is actually on the drive?

AFAICT, yes. And the other drive on the same IDE cable, also a Maxtor, is working fine.

William L. Maltby

4:30 p.m.

On Sat, 2006-04-01 at 21:56 -0800, Bart Schaefer wrote:

...

On 4/1/06, William L. Maltby BillsCentOS@triad.rr.com wrote:

...
Re the overnight diag, are environmental conditions similar to when you encounter problems? Temp, power "brown out", etc?

Yep. The PC was custom-built less than 12 weeks ago, it's on a UPS only a couple of weeks older than that, the case power supply is supposed to handle up to twice as many drives as I have in there, and the IDE cables are brand new and firmly seated.

...
Secondly, are your current HD configurations consistent with what is actually on the drive?

AFAICT, yes. And the other drive on the same IDE cable, also a Maxtor, is working fine.

Well, then it sounds like it's isolated on that drive. Did you do the sfdisk -l to see if the geometry stored on-disk matches the BIOS? I don't know that it would affect it, but leave no tern un-stoned (old joke).

My next suggestion is to see if the problem can be reproduced... HOLD THE PHONE! One other thing to consider/try! Have you put this drive on another IDE port and tried all the same things? Since we are past the "easy" answers, we must include the "obvious" too. Is the jumpering correct for how it is installed? Is the MB one that uses "cable select"? Is the jumper appropriately CS/MASTER/SLAVE? Is it on a cable with another drive or CD? Are they jumpered appropriately CS/MASTER/SLAVE? Are they active when the problems are detected (since the diags show no problem and I assume nothing else runs then)?

Resist the urge to arbitrarily reject any potential problem cause due to "I know it can't be because...". Those are the ones that get you.

Also, don't assume that new=good. New eqpt has a "shakedown cruise" because it is inherently less reliable than tried and true stuff. Is this the first time this cable/port has been used, for example? Is this the first time (or not) that the power connectors from the PS have been used? Any "in-line connectors" added in (like to power other fans, etc.)?

Last ditch effort to prove the drive one way or the other: put it into another unit and try similar operations. See if the diags act the same. Can you try the copy operation and get errors again? If so, it could be that something is now flaky in the drive, like a connector loosened or gone bad during handling as it was moved.

HTH and GL Bill

Bart Schaefer

3 Apr

7:01 p.m.

On 4/2/06, William L. Maltby BillsCentOS@triad.rr.com wrote:

...

Well, then it sounds like it's isolated on that drive.

I suppose it could actually be a problem elsewhere in the system that just happens to be triggered by IDE traffic load, but I have no idea how to diagnose that -- the system locks up so hard that I can't examine console output, and there's nothing in dmesg or any of the logs after the reboot.

...

Did you do the sfdisk -l to see if the geometry stored on-disk matches BIOS

Yeah, that all looks OK.

...

My next suggestion is to see if the problem can be reproduced...

I haven't had a chance to try anything since the other night, but it was consistently happening whenever I ran the equivalent of "badblocks" on it with the system booted in multi-user mode. Hmm ... IRQ conflict?
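One quick way to check the IRQ-conflict guess is to look for interrupt lines that more than one device is attached to. The following is a minimal sketch under stated assumptions: the function name `shared_irqs` is hypothetical, and it assumes the usual `/proc/interrupts` layout in which devices sharing a line appear as a comma-separated list at the end of that line.

```shell
#!/bin/sh
# shared_irqs [FILE]   (hypothetical helper, for illustration only)
# Prints the interrupt lines that list more than one device.
# FILE defaults to /proc/interrupts.
shared_irqs() {
    file=${1:-/proc/interrupts}
    # Keep numbered IRQ lines ("14:", "15:", ...) that contain a comma,
    # i.e. lines where several device names are listed together.
    awk '/^ *[0-9]+:/ && /,/ { print }' "$file"
}
```

Sharing an interrupt is not a fault in itself, so any line this prints is only a hint about where to look next, for example if ide0 or ide1 turns up alongside another busy device.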

...

HOLD THE PHONE! One other thing to consider/try! Have you put this drive on another IDE port and tried all the same things?

...

Since we are past the "easy" answers, we must include the "obvious" too. Is the jumpering correct for how it is installed?

Yes, I checked that.

...

Is the MB one that uses "cable select"?

Almost certainly.

...

Is the jumper appropriately CS/MASTER/SLAVE? Is it on a cable with another drive or CD? Are they jumpered appropriately CS/MASTER/SLAVE?

It's the slave on the same cable with the other Maxtor which is working correctly.

...

Are they active when the problems are detected (since the diags show no problem and I assume nothing else runs then)?

The problem has occurred both with and without the other drive active. In fact my first thought was that the problem was related to having both drives active, but that proved not to be the case.

...

Also, don't assume that new=good. New eqpt has a "shakedown cruise" because it is inherently less reliable than tried and true stuff.

The system had been operating properly for several weeks before I hooked up these drives, and was [supposedly] thoroughly tested by the manufacturer before it was shipped to me.

...

Last ditch effort to prove drive one way or the other: put it into another unit and try similar operations

Yeah, next on the list when I have a chance ...

William L. Maltby

7:40 p.m.

On Mon, 2006-04-03 at 10:01 -0700, Bart Schaefer wrote:

...

On 4/2/06, William L. Maltby BillsCentOS@triad.rr.com wrote:

...
Well, then it sounds like it's isolated on that drive.

<snip>

Did you catch Leo's post on Sunday that Maxtor had a reported problem and fix that might apply to you? It's in the archives under this message ID:

007101c65679$f477c7c0$6501a8c0@medionp4

Synopsis: Maxtor answer ID 2685 covers Maxtor SATA II drives on the nForce4 chipset showing anomalous behavior such as data corruption; a firmware upgrade is available to resolve these issues.

Not sure if it applies to you, but it might be worth checking Leo's post and the Maxtor support site.

HTH

Bart Schaefer

4 Apr

3:36 a.m.

On 4/3/06, William L. Maltby BillsCentOS@triad.rr.com wrote:

...

Did you catch Leo's post on Sunday that Maxtor had a reported problem and fix that might apply to you?

Yes. I also saw Dave's post about smartctl ... I was interrupted this morning before I had a chance to reply to them.

The short of it is: (Leo) I'm not having problems with the SATA drive on its own, but it is a Maxtor so I'll check that out; (Dave) I've had smartd chkconfig'd off since the day after I installed CentOS 4.2 because it always gave a FAILED message at boot (said it was unable to work with the SATA drive).

Leo Arnts

2 Apr

11:54 a.m.

Hi Bart,

I have the same problems here with an Asus P5ND2 SLI motherboard and 2 ATA Maxtor 6L200P0 200GB hard drives in RAID 0+1.

The disks also check OK, even with the Maxtor tools. I'm guessing a Linux problem, maybe one of the ATA drivers? I have seen the problem with several kernels; just now I'm running 2.6.9-34.ELsmp-CUSTOM.

-----Original message-----
From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On behalf of Bart Schaefer
Sent: Saturday, 1 April 2006 21:17
To: CentOS mailing list
Subject: [CentOS] CentOS 4.3 occasionally locking up accessing IDE drive

Dave Hatton

12:36 p.m.

I've just had a similar problem but with SATA drives.

I discovered some reports of problems when smartctl scans were issued against running drives.

For me the fix was to turn off SMART monitoring (service smartd stop && chkconfig smartd off).

Daveh

William L. Maltby

4:37 p.m.

On Sun, 2006-04-02 at 11:36 +0100, Dave Hatton wrote:

...

I've just had a similar problem but with SATA drives.

I discovered some reports of problems when smartctl scans were issued against running drives.

For me the fix was to turn off SMART monitoring (service smartd stop && chkconfig smartd off).

I put this here just to have your answer and this suggestion in one place. Has anyone checked the Mfg's (HD, controller, MoBo) web sites to see if probs reported/solved? Has anyone tried hdparm to "adjust" tunables to see if it makes a difference? I suspect not, based on what Dave is saying, but you never know.
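For the hdparm idea, one commonly tried experiment on an IDE drive is to turn DMA off, repeat the failing test, and turn it back on afterwards. The sketch below only prints the commands rather than running them, since changing settings on a suspect drive deserves a second look first. The function name `ide_dma_cmds` and the device name are hypothetical; the hdparm options used (`-i`, `-d`, `-d0`, `-d1`) are the standard ones for identification data and the using_dma flag.

```shell
#!/bin/sh
# ide_dma_cmds DEVICE   (hypothetical helper, for illustration only)
# Prints, without running them, hdparm commands for a DMA on/off experiment.
ide_dma_cmds() {
    dev=$1
    echo "hdparm -i $dev"     # show identification data and supported modes
    echo "hdparm -d $dev"     # report the current using_dma setting
    echo "hdparm -d0 $dev"    # turn DMA off, then repeat the failing test
    echo "hdparm -d1 $dev"    # turn DMA back on afterwards
}
```

Running `ide_dma_cmds /dev/hdX` (with the real device substituted) gives a short checklist to work through as root. Expect the drive to be much slower with DMA off; the point is only to see whether the lockups change.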

Also, no reason to believe y'all are the absolute first. Now sounds like Google might be productive?

...

Daveh

Bill

Leo Arnts

7:22 p.m.

Hmm,

Seems there is a problem:

I found answer ID 2685 on the Maxtor web site, which describes my problem:

Note: This solution is only to be applied to the listed products and only when used with the nForce4 chipset. If you are unsure as to which chipset you have, contact your motherboard manufacturer or visit your motherboard manufacturer's website for clarification.

Maxtor SATA II Drives and nVidia nForce4 Compatibility

Question

The following issues have been identified with Maxtor SATA II drives and the nVidia nForce4 controller.

No detection

Anomalous behavior such as data corruption

Answer

The drives affected are those with model numbers starting with:

6V- (DiamondMax 10)

6H- (DiamondMax 11)

7V- (Maxline III)

7H- (MaxLine Pro 500)

A firmware upgrade to resolve these issues for Maxtor SATA II drives is available by contacting Maxtor Support. Please have the drive serial number and the Code number available, which will be listed on the top label of the drive itself.

Contact Maxtor Online: North, Central and South America; Europe, Middle East, and Africa; Asia, Australia, and New Zealand

Please see our Contact Pages for telephone support options: North/Central/South America; Europe, Middle East and Africa; Asia, Australia and New Zealand

-----Original message-----
From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On behalf of William L. Maltby
Sent: Sunday, 2 April 2006 16:38
To: CentOS General List
Subject: RE: [CentOS] CentOS 4.3 occasionally locking up accessing IDE drive

On Sun, 2006-04-02 at 11:36 +0100, Dave Hatton wrote:

...

I've just had a similar problem but with SATA drives.

I discovered some reports of problems when smartctl scans were issued against running drives.

For me the fix was to turn off SMART monitoring (service smartd stop && chkconfig smartd off).

Also, no reason to believe y'all are the absolute first. Now sounds like Google might be productive?

...