[CentOS] Re: BUG in fs/bio.c:99

Tue Oct 24 16:23:31 UTC 2006
J.J. Garcia <stigmatedbrain at gmail.com>

On Tue, 24-10-2006 at 08:19 -0700, Scott Silva wrote:
> J.J. Garcia spake the following on 10/24/2006 6:00 AM:
> > On Mon, 23-10-2006 at 16:56 +0200, J.J. Garcia wrote:
> >> On Mon, 23-10-2006 at 17:50 +0400, Kirill Korotaev wrote:
> >>> J.J. Garcia,
> >>>
> >>> the bug you are facing looks exactly like ours.
> >>> I thought it was memory corruption, since %eax is 8 while it should be 0.
> >>> (BTW, can you run memtest to make sure your memory is really OK?
> >>> http://wiki.openvz.org/Hardware_testing ),
> >>> but the fact that it is always 8 in both your case and ours makes me believe
> >>> it is something else...
> >>>
> >>> If I provide some debugging patch for you, will you be able to apply it to your
> >>> kernel, rebuild it and test the issue?
> >>>
> >>> Your help is very much appreciated.
> >>>
> >>> Thanks,
> >>> Kirill
> >>>
> >> Sure, I'll do my best. If you provide me the patch I can check it on the
> >> current host. It's not a very critical host on the network, and I think
> >> the bug is relevant enough to stop it for a while.
> >>
> >> I've started by installing memtest86+ on the affected host with the
> >> following steps, for your info:
> >>
> >> <...>
> >>
> >> =============================================================================
> >>  Package                 Arch       Version          Repository        Size
> >> =============================================================================
> >> Installing:
> >>  memtest86+              i386       1.26-2           base              53 k
> >>
> >> Transaction Summary
> >> =============================================================================
> >> Install      1 Package(s)
> >> Update       0 Package(s)
> >> Remove       0 Package(s)
> >> Total download size: 53 k
> >> Is this ok [y/N]: y
> >> Downloading Packages:
> >> (1/1): memtest86+-1.26-2. 100% |=========================|  53 kB    00:00
> >> Running Transaction Test
> >> Finished Transaction Test
> >> Transaction Test Succeeded
> >> Running Transaction
> >>   Installing: memtest86+                   ######################### [1/1]
> >>
> >> Installed: memtest86+.i386 0:1.26-2
> >> Complete!
> >> [root at fattybox ~]# rpm -ql memtest86+
> >> /boot/memtest86+-1.26
> >> /sbin/new-memtest-pkg
> >> /usr/sbin/memtest-setup
> >> /usr/share/doc/memtest86+-1.26
> >> /usr/share/doc/memtest86+-1.26/README
> >>
> >> [root at fattybox ~]# rpm -qi memtest86+
> >> Name        : memtest86+                   Relocations: (not relocatable)
> >> Version     : 1.26                              Vendor: CentOS
> >> Release     : 2                             Build Date: lun 21 feb 2005 20:35:44 CET
> >> Install Date: lun 23 oct 2006 16:25:57 CEST      Build Host: bhrama.build.karan.org
> >> Group       : System Environment/Base       Source RPM: memtest86+-1.26-2.src.rpm
> >> Size        : 123633                           License: GPL
> >> Signature   : DSA/SHA1, sáb 26 feb 2005 21:59:06 CET, Key ID a53d0bab443e1821
> >> Packager    : Karanbir Singh <kbsingh-IFYaIzF+flcdnm+yROfE0A at public.gmane.org>
> >> URL         : http://www.memtest.org
> >> Summary     : Stand-alone memory tester for x86 and x86-64 computers
> >> Description :
> >> Memtest86+ is a thorough stand-alone memory test for x86 and x86-64
> >> architecture computers. BIOS based memory tests are only a quick
> >> check and often miss many of the failures that are detected by
> >> Memtest86+.
> >>
> >> Run 'memtest-setup' to add to your GRUB or lilo boot menu.
> >> [root at fattybox ~]#
> >>
> >> Proceeding to install it in the boot menu:
> >>
> >> [root at fattybox ~]# memtest-setup
> >> Setup complete.
> >>
> >> This added the following entry to /etc/grub.conf, which I'll use to launch
> >> the tests:
> >>
> >> title Memtest86+ (1.26)
> >>         root (hd0,0)
> >>         kernel /memtest86+-1.26 ro root=/dev/VolGroup00/LogVol00 ACPI=off vga=0x307 selinux=0
> >>
> >>
> >> From this point memtest is running with the default config. Feel free to
> >> tell me to change the default parameters if you are looking for
> >> something in particular. I'll leave it running for 48 hours to look for
> >> anything strange in memory.
> >>
> >> I should note that this host uses shared memory for the graphics, i.e.
> >> there's no graphics card, only the one embedded on the motherboard. It's a
> >> DFI CM33T3-100 board (CM33-TL) with an up-to-date BIOS according to DFI,
> >> running an Intel Celeron. I can't vouch for the Kingston memory modules, but
> >> 22.0.2 previously worked fine on this hardware for a long time (months,
> >> and a year of uptime, under heavy load)...
> >>
> >> We'll keep in touch,
> >>
> >> Jose.
> >>
> >>
> >>
> > 
> > Hi again,
> > 
> > After almost 24 hours running memtest86+ on the affected host, I think it
> > has found a memory corruption issue, as you mentioned. It can be
> > seen at http://img206.imageshack.us/my.php?image=dscn2284xj4.jpg
> > 
> > I'm trying to solve it with a new PC133 memory module. At the same
> > time, maybe I can use an old video card to avoid sharing memory with the
> > motherboard's embedded one, to simplify things.
> > 
> > I'll then check ASAP whether the EAX register still holds the noted
> > value after a panic, if I can reproduce it again.
> > 
> > Sorry about the inconvenience. What is strange is that there was no sign of
> > memory corruption while 22.0.2 was in use for months; I was really
> > surprised this morning!
> > 
> > Jose. 
> Memory can fail over time. Also look for any swollen or leaky tops on
> motherboard capacitors. If this is an older board, which I assume from the
> PLE133 chipset, there were a lot of issues with bad capacitors in the 2000 to
> 2003 timeframe. This can be a symptom of drying electrolyte in the filter caps.

Scott

Sometimes I forget that "nothing is/lasts forever"... :) My fault for being
in my maniac mood sometimes... yes, I have to admit it, and I must
write it on the blackboard 1000 times! :)

By the way, thanks for the hint. The capacitors seem to be OK on that
board, with no drying at the moment, and the host is well cooled. The board
has been in service for no more than 2 years, and it was "new/0 hours" at the start.

I hope to find the bad RAM module ASAP so we can get further.

J.