On 7/7/05, Farkas Levente lfarkas@bppiac.hu wrote:
hi, after we switched our servers from centos-3 to centos-4 (aka. rhel-4), one of our servers crashes once a week without any oops. this happens with both the normal kernel-2.6.9-11.EL and kernel-2.6.9-11.106.unsupported. even after we changed the motherboard, the raid controller and the cables, we still got it. finally we started netdump, and yesterday we got a crash log and a core file. it seems there is a bug in the raid5 code of the kernel. this is our backup server with 8 x 200GB hdd in a raid5 (for the data) plus 2 x 40GB hdd in raid1 (for the system) with 3ware 8xxx raid controller, running. i attached the netdump log of the last crash. how can i fix it? yours.
Hi,
I have seen similar (but not quite the same) crashes in the raid code on RHEL 3 kernels. They have typically occurred due to a race condition between something updating the linked lists of raid devices and something trying to read them. For RHEL 3, my co-workers and I found where one particular race condition had been fixed in the 2.6 kernel and backported the fix to the RHEL 3 kernel. Ultimately this patch was included in one of the updates for the RHEL 3 kernel.
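To make that class of bug concrete, here is a minimal userspace sketch in Python. It is only an analogy and not the kernel's md/raid code: the names `Node`, `DeviceList` and `run_demo` are hypothetical, and the kernel would use its own locking primitives rather than `threading.Lock`. One thread keeps unlinking and re-adding a device while another walks the list; holding the same lock on both sides is what keeps the reader from ever following a half-updated link.

```python
import threading


class Node:
    """One entry in a singly linked list of devices (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.next = None


class DeviceList:
    """Hypothetical stand-in for a shared list of raid devices."""

    def __init__(self):
        self.head = None
        self.lock = threading.Lock()  # guards head and every next pointer

    def add(self, name):
        node = Node(name)
        with self.lock:
            node.next = self.head
            self.head = node

    def remove(self, name):
        with self.lock:
            prev, cur = None, self.head
            while cur is not None:
                if cur.name == name:
                    if prev is None:
                        self.head = cur.next
                    else:
                        prev.next = cur.next
                    return True
                prev, cur = cur, cur.next
        return False

    def names(self):
        # The reader takes the same lock as the writers, so it can never
        # observe the list in the middle of an update.
        with self.lock:
            out = []
            cur = self.head
            while cur is not None:
                out.append(cur.name)
                cur = cur.next
            return out


def run_demo(rounds=1000):
    devices = DeviceList()
    for i in range(8):
        devices.add(f"disk{i}")
    snapshots = []

    def writer():
        # Repeatedly unlink and re-add one device.
        for _ in range(rounds):
            devices.remove("disk3")
            devices.add("disk3")

    def reader():
        # Repeatedly walk the list while the writer is busy.
        for _ in range(rounds):
            snapshots.append(len(devices.names()))

    threads = [threading.Thread(target=writer), threading.Thread(target=reader)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return devices, snapshots
```

In this sketch every snapshot the reader takes holds either 7 or 8 devices, never a broken list. In kernel C the unguarded version of the same pattern is far less forgiving, since the reader can end up dereferencing a pointer to memory that has already been freed, which fits the kind of crash described above.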
Anyway, it is likely your problem is yet another race condition. What I would suggest is to get a box configured with true RHEL 4 and reproduce the crash there. Once it is reproduced, file a bugzilla report with Red Hat. We have had very good success with this approach for a number of kernel bugs we found in the CentOS 3/RHEL 3 kernels. Fixes have not always come quickly, but they generally do come.
Good Luck...james
--