[CentOS] Question re RHEL 5.3

On Wed, 2008-10-29 at 17:59 -0700, MHR wrote:
> On Wed, Oct 29, 2008 at 5:25 PM, Jim Perrin <jperrin at gmail.com> wrote:
> >
> > The only issue I've ever seen has been with the on-board fakeraid stuff
> > more and more vendors seem to be adding. I've been using SATA disks
> > with centos since the early 4.x days without issue, so you have me at
> > a bit of a loss here. I'd say if anything it's due to controller
> > support, and much of that can be chalked up to what hardware vendors
> > are pawning off as 'controllers' these days.
> >
> 
> The one problem I've seen and posted here was w.r.t. smartd error
> reports showing 2^32 - 1 errors on one of the disks (probably my
> system disk) every few minutes.  I thought this was more than just a
> bit suspicious, since there are only 585,937,500 (512-byte) sectors
> on a 300GB disk, and the likelihood of racking up 4,294,967,295
> errors (over 7 per sector) is rather slim unless the whole system is
> crashing a lot (it's not).  It's a Seagate 300GB, so I ran Seagate's
> SeaTools on it in
> lightweight mode, and no problems were reported, which is good because
> the disk is only about a year and a half old and has my CentOS root,
> swap, boot and home partitions on it.
> 
> I'll dig deeper on this one - sounds fishy to me, too, now....

With my usual jaundiced eye, my first thought is that the fault is not
the obvious one. So I suggest temporarily abandoning "The Usual
Suspects" (TM) - what a *great* movie.

Is it a consistent or sporadic issue? Is the controller on-board or
after-market? If on-board, is the BIOS the latest? Have you checked the
power and data cable connections? The number you mention makes me think
of a bad cable (or connections). Any pattern if it's recurring?
Temperature steady in the area? If you had a temporary rise/fall in
temperature it could have exposed weak connections, micro-fractures in
various cables, poor seating of memory, add-in cards, etc.
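
For what it's worth, the number itself can be sanity-checked with a few
lines of Python. This is only a sketch: the 512-byte sector size and the
decimal 300 GB capacity are my assumptions, not something stated in the
thread. It shows that 2^32 - 1 is the all-ones 32-bit value (what -1
looks like when read as an unsigned 32-bit integer), which would fit an
overflowed or invalid counter rather than a real count, though nothing
here proves that is what smartd is actually reporting.

```python
# Assumptions (not from the thread): 512-byte sectors, decimal gigabytes.
SECTOR_BYTES = 512
DISK_BYTES = 300 * 10**9


def sector_count(disk_bytes=DISK_BYTES, sector_bytes=SECTOR_BYTES):
    """Number of whole sectors on a disk of the given size."""
    return disk_bytes // sector_bytes


reported = 2**32 - 1          # the error count quoted from smartd
sectors = sector_count()

print("sectors on disk :", sectors)
print("reported errors :", reported, hex(reported))
# True if the reported count is larger than the number of sectors.
print("exceeds sectors :", reported > sectors)
# True if the value equals -1 reinterpreted as an unsigned 32-bit integer.
print("is uint32(-1)   :", reported == (-1 & 0xFFFFFFFF))
```

If the value really is just -1 in disguise, chasing individual bad
sectors is probably the wrong direction.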

Any other messages, that might be related, in the log file when it
happens? I'm wondering if some spurious interrupt might be involved.
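
One way to look for related messages is to pull every syslog line that
falls within a minute or so of the event. The sketch below is only an
illustration: the helper names (parse, near), the 60-second window and
the assumed year are mine, and the first sample line is a made-up
placeholder. The other sample lines are the kernel messages quoted
further down in this mail. On a real box you would feed it the lines
of your log file (I'm assuming /var/log/messages) instead of SAMPLE.

```python
import re
from datetime import datetime, timedelta

# Classic syslog layout: "Mon DD HH:MM:SS host tag[pid]: message"
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s[\d:]{8})\s(\S+)\s([^:\[\s]+)(?:\[\d+\])?:\s(.*)$"
)

SAMPLE = [
    # Made-up placeholder line, only here to show filtering.
    "Oct 29 06:00:00 centos501 example: placeholder line",
    "Oct 29 07:09:41 centos501 kernel: Uhhuh. NMI received for unknown reason 2c on CPU 0.",
    "Oct 29 07:09:41 centos501 kernel: Do you have a strange power saving mode enabled?",
    "Oct 29 07:09:41 centos501 kernel: Dazed and confused, but trying to continue",
]


def parse(line, year=2008):
    """Split one syslog line into (datetime, host, tag, message), or None."""
    m = LINE_RE.match(line)
    if not m:
        return None
    stamp, host, tag, msg = m.groups()
    # Syslog stamps carry no year, so the caller has to supply one.
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return when, host, tag, msg


def near(lines, needle, window_s=60, year=2008):
    """Return parsed lines within window_s seconds of any line containing needle."""
    parsed = [p for p in (parse(l, year) for l in lines) if p]
    hits = [p[0] for p in parsed if needle in p[3]]
    delta = timedelta(seconds=window_s)
    return [p for p in parsed if any(abs(p[0] - h) <= delta for h in hits)]


for when, host, tag, msg in near(SAMPLE, "NMI"):
    print(when, tag, msg)
```

Swap "NMI" for whatever string shows up in the smartd report and see
what else the kernel logged around the same time.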

Have you memtested recently? ISTM that a memory error could "fool" the
system. Re-seated the memory?

How about the kernel version? On the latest kernel, 2.6.18-92.1.13.el5,
I recently got this.

----------------------------------------------------------
Oct 29 07:09:41 centos501 kernel: Uhhuh. NMI received for unknown reason 2c on CPU 0.
Oct 29 07:09:41 centos501 kernel: Do you have a strange power saving mode enabled?
Oct 29 07:09:41 centos501 kernel: Dazed and confused, but trying to continue
-----------------------------------------------------------

I've never seen it before, and only once so far. I've not yet
investigated it. No recent changes to the system since 5.0 other than
normal yum updates to the current 5.2 status. The case cover is off
right now though, so it could be some EMI (heh, or an EMP from the
recent trash on this list)  :-)

That's all I can think of ATM, apart from power from the utility company
or a marginal power supply in the unit.

> 
> mhr
> <snip sig stuff>

HTH
-- 
Bill