I've heard now from more than one source about problems with CentOS (and RH) at least up through 5.2 w.r.t. SATA drive handling, and I've even reported on this myself in this list before.
My question is, do we have any idea if 5.3 has any improvements in this area?
One of my cohorts here, who happens to be a Fedora fan, says that these problems are fixed in F9, but I have grave concerns about putting an enterprise lifeline main application on any Fedora release. If 5.3 solves these issues, I'd much rather go with that.
Any ideas? Any places I might look to see for myself?
Thanks.
mhr
MHR wrote:
I've heard now from more than one source about problems with CentOS (and RH) at least up through 5.2 w.r.t. SATA drive handling, and I've even reported on this myself in this list before.
My question is, do we have any idea if 5.3 has any improvements in this area?
do you have any bug report numbers for these issues ?
On Wed, Oct 29, 2008 at 5:12 PM, Karanbir Singh mail-lists@karan.org wrote:
do you have any bug report numbers for these issues ?
No, and from what I saw on the RH bugzilla list of SATA disk related bugs, none of them seem to be that serious except w.r.t. specific controllers.
I will go back and dig deeper.
Thanks.
mhr
On Wed, Oct 29, 2008 at 8:01 PM, MHR mhullrich@gmail.com wrote:
I've heard now from more than one source about problems with CentOS (and RH) at least up through 5.2 w.r.t. SATA drive handling, and I've even reported on this myself in this list before.
My question is, do we have any idea if 5.3 has any improvements in this area?
One of my cohorts here, who happens to be a Fedora fan, says that these problems are fixed in F9, but I have grave concerns about putting an enterprise lifeline main application on any Fedora release. If 5.3 solves these issues, I'd much rather go with that.
Any ideas? Any places I might look to see for myself?
The only issue I've ever seen has been with the onboard fakeraid stuff more and more vendors seem to be adding. I've been using SATA disks with centos since the early 4.x days without issue, so you have me at a bit of a loss here. I'd say if anything it's due to controller support, and much of that can be chalked up to what hardware vendors are pawning off as 'controllers' these days.
On Wed, Oct 29, 2008 at 5:25 PM, Jim Perrin jperrin@gmail.com wrote:
The only issue I've ever seen has been with the onboard fakeraid stuff more and more vendors seem to be adding. I've been using SATA disks with centos since the early 4.x days without issue, so you have me at a bit of a loss here. I'd say if anything it's due to controller support, and much of that can be chalked up to what hardware vendors are pawning off as 'controllers' these days.
The one problem I've seen and posted here was w.r.t. smartd error reports showing 2^32 - 1 errors on one of the disks (probably my system disk) every few minutes. I thought this was more than just a bit suspicious, since there are only about 585,937,500 512-byte sectors on a 300GB disk, so a count of 4,294,967,295 is more than seven times the number of sectors the disk actually has. It's a Seagate 300GB, so I ran Seagate's SeaTools on it in lightweight mode, and no problems were reported, which is good because the disk is only about a year and a half old and has my CentOS root, swap, boot and home partitions on it.
I'll dig deeper on this one - sounds fishy to me, too, now....
mhr
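The plausibility check in the message above can be sketched with a few lines of arithmetic. This is only an illustration: it assumes a decimal 300 GB capacity and 512-byte logical sectors, neither of which is confirmed for the drive in question.

```python
# Assumptions (not confirmed in the thread): decimal gigabytes, 512-byte sectors.
CAPACITY_BYTES = 300 * 10**9
SECTOR_BYTES = 512

total_sectors = CAPACITY_BYTES // SECTOR_BYTES
reported = 4294967295  # the count that smartd keeps reporting

print(total_sectors)             # 585937500
print(reported == 2**32 - 1)     # True: the largest 32-bit unsigned value
print(reported > total_sectors)  # True: more "bad sectors" than sectors
```

Under those assumptions the reported figure cannot be a real sector count, which fits the suspicion that something other than the disk surface is at fault.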
On Wed, 2008-10-29 at 17:59 -0700, MHR wrote:
On Wed, Oct 29, 2008 at 5:25 PM, Jim Perrin jperrin@gmail.com wrote:
The only issue I've ever seen has been with the on-board fakeraid stuff more and more vendors seem to be adding. I've been using SATA disks with centos since the early 4.x days without issue, so you have me at a bit of a loss here. I'd say if anything it's due to controller support, and much of that can be chalked up to what hardware vendors are pawning off as 'controllers' these days.
The one problem I've seen and posted here was w.r.t. smartd error reports showing 2^32 - 1 errors on one of the disks (probably my system disk) every few minutes. I thought this was more than just a bit suspicious, since there are only about 585,937,500 512-byte sectors on a 300GB disk, so a count of 4,294,967,295 is more than seven times the number of sectors the disk actually has. It's a Seagate 300GB, so I ran Seagate's SeaTools on it in lightweight mode, and no problems were reported, which is good because the disk is only about a year and a half old and has my CentOS root, swap, boot and home partitions on it.
I'll dig deeper on this one - sounds fishy to me, too, now....
With my usual jaundiced eye, my first thought is that the fault is not the obvious one. So I suggest temporarily abandoning "The Usual Suspects" (TM) - what a *great* movie.
Is it a consistent or sporadic issue? Is the controller on-board or after-market? If on-board, is the BIOS the latest? Have you checked the power and data cable connections? The number you mention makes me think of a bad cable (or connections). Any pattern if it's recurring? Temperature steady in the area? If you had a temporary rise/fall in temperature it could have exposed weak connections, micro-fractures in various cables, poor seating of memory, add-in cards, etc.
Any other messages, that might be related, in the log file when it happens? I'm wondering if some spurious interrupt might be involved.
Have you memtested recently? ISTM that a memory error could "fool" the system. Re-seated the memory?
How about the kernel version? On the latest kernel, 2.6.18-92.1.13.el5, I recently got this.
----------------------------------------------------------
Oct 29 07:09:41 centos501 kernel: Uhhuh. NMI received for unknown reason 2c on CPU 0.
Oct 29 07:09:41 centos501 kernel: Do you have a strange power saving mode enabled?
Oct 29 07:09:41 centos501 kernel: Dazed and confused, but trying to continue
-----------------------------------------------------------
Never seen before. Only once, so far. I've not yet investigated this. No recent changes to the system since 5.0 but normal yum updates to current 5.2 status. The case cover is off right now though, so it could be some EMI (heh, or an EMP from the recent trash on this list) :-)
That's all I can think of ATM, apart from power from the utility company or a marginal power supply in the unit.
mhr
<snip sig stuff>
HTH
MHR wrote:
The one problem I've seen and posted here was w.r.t. smartd error reports showing 2^32 - 1 errors on one of the disks (probably my system disk) every few minutes. I thought this was more than just a bit suspicious, since there are only about 585,937,500 512-byte sectors on a 300GB disk, so a count of 4,294,967,295 is more than seven times the number of sectors the disk actually has. It's a Seagate 300GB, so I ran Seagate's SeaTools on it in lightweight mode, and no problems were reported, which is good because the disk is only about a year and a half old and has my CentOS root, swap, boot and home partitions on it.
Precisely what error counters are alarming you? If these are the raw numbers for Raw_Read_Error_Rate, Hardware_ECC_Recovered, and Seek_Error_Rate, it is normal for Seagate drives. Look at the normalized values for these attributes. As long as they are not approaching their failure thresholds, the drive is OK. For further reassurance you can run the SMART long offline tests ("smartctl -t long /dev/whatever" -- see smartctl manpage for details) on the drive.
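As a rough illustration of "look at the normalized values, not the raw ones", here is a minimal sketch that reads an attribute table in the column layout printed by "smartctl -A" and lists any attribute whose normalized VALUE has fallen to or below its THRESH. The sample rows are made up for the example and do not come from the drive discussed here; the function names are hypothetical.

```python
# Illustrative table only: the numbers below are invented, not real drive data.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   114   099   006    Pre-fail  Always       -       71382156
  7 Seek_Error_Rate         0x000f   083   060   030    Pre-fail  Always       -       208715339
195 Hardware_ECC_Recovered  0x001a   062   055   000    Old_age   Always       -       71382156
"""


def parse_attributes(text):
    """Return one dict per attribute row; header and other lines are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        rows.append({
            "name": fields[1],
            "value": int(fields[3]),   # normalized current value
            "worst": int(fields[4]),   # lowest normalized value seen so far
            "thresh": int(fields[5]),  # failure threshold
            "raw": fields[9],          # vendor-specific raw number
        })
    return rows


def at_or_below_threshold(rows):
    """Names of attributes whose normalized value has reached the threshold."""
    # A threshold of 0 is treated here as "never fails".
    return [r["name"] for r in rows if r["thresh"] > 0 and r["value"] <= r["thresh"]]


rows = parse_attributes(SAMPLE)
print(at_or_below_threshold(rows))  # [] for the sample table
```

In the sample the raw numbers are huge, yet nothing is flagged, because only the normalized values are compared against the thresholds.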
You need to understand something about modern drives. In the past, drives achieved the first level of redundancy by recording each bit in a large enough area to include many magnetic domains. If some percentage of the domains failed to hold the data (a highly likely situation), that was OK because the read head would get enough signal from the rest of the domains so that the bit would be detected correctly. Fast forward to today. That multi-domain redundancy is all but gone, having been replaced by more advanced error correcting codes implemented in hardware. Seagate has elected to have the raw number for Raw_Read_Error_Rate report each instance of sectors needing this level of correction and let the normalized values reflect whether these corrections are occurring at a rate higher than expected.
A similar situation exists for Seek_Error_Rate. When a drive performs a seek, there is a trade-off between speed and accuracy. You can make it more likely that the heads go directly to the right track by moving them more slowly and allowing more settling time. Performance can be improved significantly by moving the heads more abruptly and accepting that some percentage of the time a subsequent small adjustment will be needed to get to the right track. Again, it is the normalized value for Seek_Error_Rate that reports whether these adjustments are becoming necessary more often than expected.
Mhr wrote on Wed, 29 Oct 2008 17:59:40 -0700:
The one problem I've seen and posted here was w.r.t. smartd error reports showing 2^32 - 1 errors on one of the disks (probably my system disk) every few minutes.
How has this anything to do with "SATA problems/drive handling"? And could you please use a decent subject next time?
Regarding your problem: Have you done a smartctl selftest since then, did you go to smartmontools.sf.net since then and read up on smartmon? This may just be a problem with smartd not being able to handle the error codes/number of errors from that disk. If you look at smartmontools.sf.net and read the man page you'll see that vendors are quite inconsistent in what and how they report and a reversal of byte ordering every now and then seems to be common. Not to mention that the smartmon shipping with CentOS naturally doesn't include the latest code.
Kai
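To show why a byte-order mix-up can turn a small counter into an absurd one, here is a small sketch. It assumes a 6-byte raw field and uses made-up bytes; it is not a claim about how any particular drive or smartmontools version encodes its data.

```python
# Hypothetical 6-byte raw field holding the number 5, least significant byte first.
raw = bytes([0x05, 0x00, 0x00, 0x00, 0x00, 0x00])

as_little_endian = int.from_bytes(raw, "little")
as_big_endian = int.from_bytes(raw, "big")

print(as_little_endian)  # 5
print(as_big_endian)     # 5497558138880
```

The same six bytes read in the wrong order give a value about a trillion times larger, so a tool that guesses the wrong layout for a vendor can report nonsense while the drive itself is fine.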
On Sat, Nov 1, 2008 at 6:31 AM, Kai Schaetzl maillists@conactive.com wrote:
Mhr wrote on Wed, 29 Oct 2008 17:59:40 -0700:
The one problem I've seen and posted here was w.r.t. smartd error reports showing 2^32 - 1 errors on one of the disks (probably my system disk) every few minutes.
How has this anything to do with "SATA problems/drive handling"?
Possibly because my system drive is a SATA disk? (FTR, the drive does not appear to be the slightest bit unstable and it runs just fine. In fact, I recently modified the system so that it now runs on three SATA-2 drives exclusively. For whatever reason, the WD drives do not report any errors - see also below.)
And could you please use a decent subject next time?
When I select the subject, I usually do. This was a reply to a thread, so I didn't pick the subject. There's no need to be testy....
Regarding your problem: Have you done a smartctl selftest since then, did you go to smartmontools.sf.net since then and read up on smartmon?
Yes and not until now, in that order. The smartctl selftest has the same problem, IIRC, but the seatools test showed nothing wrong.
This may just be a problem with smartd not being able to handle the error codes/number of errors from that disk. If you look at smartmontools.sf.net and read the man page you'll see that vendors are quite inconsistent in what and how they report and a reversal of byte ordering every now and then seems to be common. Not to mention that the smartmon shipping with CentOS naturally doesn't include the latest code.
All good information, thank you. I did not see anything specific to the issue I am seeing, which is that every half hour, smartd reports the following:
Nov 2 01:56:11 mhrichter smartd[3121]: Device: /dev/sda, 4294967295 Currently unreadable (pending) sectors
Nov 2 01:56:11 mhrichter smartd[3121]: Device: /dev/sda, 4294967295 Offline uncorrectable sectors
In each case, it also sends a warning email to root, which is kind of annoying since these do not appear to be legitimate error conditions.
Someone mentioned that this is a recurring problem with Seagate drives - more info, please?
Thanks.
mhr
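For anyone wanting to sift a log for this pattern, the following is a minimal sketch that picks out smartd messages whose sector count equals 2^32 - 1. Treating that value as a placeholder or misread rather than a real count is an assumption, and the function name is hypothetical; the two sample lines are the ones quoted in the message above.

```python
import re

# 2^32 - 1 is the largest value a 32-bit unsigned field can hold. Treating it
# as "not a real count" is an assumption made for this sketch.
ALL_ONES_32 = 2**32 - 1

PATTERN = re.compile(
    r"Device: (?P<dev>\S+), (?P<count>\d+) "
    r"(?P<kind>Currently unreadable \(pending\)|Offline uncorrectable) sectors"
)


def suspicious_reports(log_text):
    """Return (device, kind) pairs whose reported count equals 2^32 - 1."""
    hits = []
    for match in PATTERN.finditer(log_text):
        if int(match.group("count")) == ALL_ONES_32:
            hits.append((match.group("dev"), match.group("kind")))
    return hits


LOG = (
    "Nov 2 01:56:11 mhrichter smartd[3121]: Device: /dev/sda, "
    "4294967295 Currently unreadable (pending) sectors\n"
    "Nov 2 01:56:11 mhrichter smartd[3121]: Device: /dev/sda, "
    "4294967295 Offline uncorrectable sectors\n"
)

for device, kind in suspicious_reports(LOG):
    print(device, kind)
```

A report with a small, slowly growing count would not be flagged by this check and deserves a closer look at the drive itself.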
On Sun, Nov 2, 2008 at 1:03 AM, MHR mhullrich@gmail.com wrote:
Nov 2 01:56:11 mhrichter smartd[3121]: Device: /dev/sda, 4294967295 Currently unreadable (pending) sectors
Nov 2 01:56:11 mhrichter smartd[3121]: Device: /dev/sda, 4294967295 Offline uncorrectable sectors
In each case, it also sends a warning email to root, which is kind of annoying since these do not appear to be legitimate error conditions.
Someone mentioned that this is a recurring problem with Seagate drives - more info, please?
You might want to check out the following CentOS forum thread. I, too, had the same problem (see comment #3).
http://www.centos.org/modules/newbb/viewtopic.php?viewmode=flat&topic_id...
Akemi / toracat
Jim Perrin wrote:
On Wed, Oct 29, 2008 at 8:01 PM, MHR mhullrich@gmail.com wrote:
I've heard now from more than one source about problems with CentOS (and RH) at least up through 5.2 w.r.t. SATA drive handling, and I've even reported on this myself in this list before.
My question is, do we have any idea if 5.3 has any improvements in this area?
One of my cohorts here, who happens to be a Fedora fan, says that these problems are fixed in F9, but I have grave concerns about putting an enterprise lifeline main application on any Fedora release. If 5.3 solves these issues, I'd much rather go with that.
Any ideas? Any places I might look to see for myself?
The only issue I've ever seen has been with the onboard fakeraid stuff more and more vendors seem to be adding. I've been using SATA disks with centos since the early 4.x days without issue, so you have me at a bit of a loss here. I'd say if anything it's due to controller support, and much of that can be chalked up to what hardware vendors are pawning off as 'controllers' these days.
I recently set up a CentOS 5.2 server with RAID 1 (software) and 2 sata drives. During burn-in I see no problems. I'm using a Supermicro PDSBM series system board.
Ben
MHR wrote:
I've heard now from more than one source about problems with CentOS (and RH) at least up through 5.2 w.r.t. SATA drive handling, and I've even reported on this myself in this list before.
My question is, do we have any idea if 5.3 has any improvements in this area?
One of my cohorts here, who happens to be a Fedora fan, says that these problems are fixed in F9, but I have grave concerns about putting an enterprise lifeline main application on any Fedora release. If 5.3 solves these issues, I'd much rather go with that.
You seem to be relying on hearsay and rumor-mongering. Any bug reports you have filed on the CentOS bug tracker?
Any ideas? Any places I might look to see for myself?
You have to install 5.3 beta and test it yourself. Then again if your "alleged problems" only surface in a conversation with your "fedora buddy" you have to have more than your word for it.
Comprende?
Spike.