I've got a pair of HA servers I'm trying to get into production. Here are some specs:
Xeon X3210 Quad Core (aka Core 2 Quad) 2.13GHz (four logical processors, no Hyper-Threading)
4GB memory
Hardware (3ware) RAID 1 mirror, 2 x Seagate 750GB SATA2
650GB DRBD device running on top of an LVM2 logical volume

CentOS 5.2, kernel 2.6.18-92.1.6.el5.centos.plus
DRBD 8.2 (drbd82-8.2.6-1.el5.centos)
Kernel module: kmod-drbd82-8.2.6-1.2.6.18_92.1.6.el5.centos.plus
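For anyone not familiar with a DRBD-on-LVM arrangement, the resource definition for a layout like this looks roughly as follows. The resource name, replication addresses, port, and protocol below are generic examples; the device, backing LV, and sync rate correspond to the setup described in this mail:

global { usage-count no; }

resource r0 {
  protocol C;                              # assuming the usual synchronous protocol

  syncer {
    rate 110M;                             # full-speed sync over the GbE link
  }

  on haws1 {
    device    /dev/drbd0;
    disk      /dev/VolGroup00/LogVol02;    # the ~658GB LV backing /home
    address   192.168.10.1:7788;           # example replication IP/port
    meta-disk internal;
  }

  on haws2 {
    device    /dev/drbd0;
    disk      /dev/VolGroup00/LogVol02;
    address   192.168.10.2:7788;           # example replication IP/port
    meta-disk internal;
  }
}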
I've been trying to rsync data over from a remote server, and the box has crashed a couple of times now. It doesn't happen immediately, but over time. I connected a serial console and captured the panic message below. The last file copied was ~1GB in size, but files of up to 4GB had been copied previously. I don't have kernel core dumping enabled, but that's a possibility if needed. I'm not sure whether this is a bug or something I've done. This isn't my first DRBD install (although it is my first on top of LVM), and I believe I've got everything set up correctly. I did have a full sync rate (110M) enabled over GbE, if that's relevant. Thoughts?
Regards, Chris
[root@haws2 ~]# pvscan
  PV /dev/sda2   VG VolGroup00   lvm2 [698.28 GB / 0    free]
  Total: 1 [698.28 GB] / in use: 1 [698.28 GB] / in no VG: 0 [0   ]

[root@haws2 ~]# lvscan
  ACTIVE            '/dev/VolGroup00/LogVol00' [39.06 GB] inherit
  ACTIVE            '/dev/VolGroup00/LogVol02' [658.72 GB] inherit
  ACTIVE            '/dev/VolGroup00/LogVol01' [512.00 MB] inherit

[root@haws2 ~]# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                      39676508   1938880  35689628   6% /
/dev/sda1               194442     23650    160753  13% /boot
tmpfs                  1684156         0   1684156   0% /dev/shm
/dev/drbd0           679824572 113321224 531970212  18% /home
[root@haws1 ~]# BUG: unable to handle kernel paging request at virtual address c
 printing eip:
c04e9291
*pde = 00000000
Oops: 0000 [#1]
SMP
last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
Modules linked in: softdog drbd(U) autofs4 hidp rfcomm l2cap bluetooth sunrpc id
CPU:    0
EIP:    0060:[<c04e9291>]    Tainted: G      VLI
EFLAGS: 00010046   (2.6.18-92.1.6.el5.centos.plus #1)
EIP is at list_del+0x25/0x5c
eax: fe187128   ebx: f04a6ab8   ecx: f04a6a8c   edx: f04a6a8c
esi: fe187128   edi: f4e355a0   ebp: f426c800   esp: f385df3c
ds: 007b   es: 007b   ss: 0068
Process drbd0_asender (pid: 2900, ti=f385d000 task=f4932000 task.ti=f385d000)
Stack: 000000e6 f8d1953b 00000000 f04a6a8c 000000e6 00000001 ee187b14 00000046
       f49e7bc0 f04a6ab8 f04a6a8c f426c800 fe187128 f4e355a0 0000349f f8d24805
       00000800 f426c800 f426c800 00000008 f426c9f4 f8d14d47 f385dfbc f8d15fbc
Call Trace:
 [<f8d1953b>] _req_may_be_done+0x4ea/0x710 [drbd]
 [<f8d24805>] tl_release+0x35/0x172 [drbd]
 [<f8d14d47>] got_BarrierAck+0x10/0x6b [drbd]
 [<f8d15fbc>] drbd_asender+0x3b1/0x4e7 [drbd]
 [<f8d24a53>] drbd_thread_setup+0x0/0x14e [drbd]
 [<f8d24adb>] drbd_thread_setup+0x88/0x14e [drbd]
 [<f8d24a53>] drbd_thread_setup+0x0/0x14e [drbd]
 [<c0405c3b>] kernel_thread_helper+0x7/0x10
 =======================
Code: 89 c3 eb eb 90 90 53 89 c3 8b 40 04 8b 00 39 d8 74 17 50 53 68 9b 9a 63 c
EIP: [<c04e9291>] list_del+0x25/0x5c SS:ESP 0068:f385df3c
 <0>Kernel panic - not syncing: Fatal exception
BUG: warning at arch/i386/kernel/smp.c:550/smp_call_function() (Tainted: G     )
 [<c0417ae0>] stop_this_cpu+0x0/0x33
 [<c04178cf>] smp_call_function+0x57/0xc3
 [<c0426682>] printk+0x18/0x8e
 [<c041794e>] smp_send_stop+0x13/0x1c
 [<c0425c53>] panic+0x4c/0x16d
 [<c04064dd>] die+0x25d/0x291
 [<c060c48b>] do_page_fault+0x3ea/0x4b8
 [<c060c0a1>] do_page_fault+0x0/0x4b8
 [<c0405a71>] error_code+0x39/0x40
 [<c04e9291>] list_del+0x25/0x5c
 [<f8d1953b>] _req_may_be_done+0x4ea/0x710 [drbd]
 [<f8d24805>] tl_release+0x35/0x172 [drbd]
 [<f8d14d47>] got_BarrierAck+0x10/0x6b [drbd]
 [<f8d15fbc>] drbd_asender+0x3b1/0x4e7 [drbd]
 [<f8d24a53>] drbd_thread_setup+0x0/0x14e [drbd]
 [<f8d24adb>] drbd_thread_setup+0x88/0x14e [drbd]
 [<f8d24a53>] drbd_thread_setup+0x0/0x14e [drbd]
 [<c0405c3b>] kernel_thread_helper+0x7/0x10
 =======================
nate wrote:
Chris Miller wrote:
I've got a pair of HA servers I'm trying to get into production. Here are some specs:
[..]
[root@haws1 ~]# BUG: unable to handle kernel paging request at virtual address c
This typically means bad RAM
While I won't rule this out, my local hardware vendor runs a 48-hour burn-in with a full gamut of tests (memory included) before handing over the servers. These servers are less than two weeks old...
This seems to be a fairly common type of error in some situations. I tried to boot in kexec/kdump mode (the CentOS 5 replacement for diskdumputils), but the e1000 driver isn't seeing the NICs after a reboot via the "capture kernel", so I can't replicate the (rsync-induced) problem and perform kernel debugging. I'll explore this more tomorrow.
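For anyone wanting to try the same thing, the basic local-disk kdump setup on CentOS 5 is roughly the following; the crashkernel=128M@16M reservation is just the commonly suggested value for RHEL/CentOS 5 on i386, so adjust as appropriate:

# reserve memory for the capture kernel: append to the kernel line in /boot/grub/grub.conf
#   crashkernel=128M@16M
yum install kexec-tools       # provides kexec and the kdump init script
chkconfig kdump on            # dumps go under /var/crash by default (see /etc/kdump.conf)
shutdown -r now               # reboot so the crashkernel reservation takes effect
# after the next panic the capture kernel should write a vmcore under /var/crash/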
Chris
on 8-14-2008 12:55 AM Chris Miller spake the following:
nate wrote:
Chris Miller wrote:
I've got a pair of HA servers I'm trying to get into production. Here are some specs:
[..]
[root@haws1 ~]# BUG: unable to handle kernel paging request at virtual address c
This typically means bad RAM
While I won't rule this out, my local hardware vendor runs a 48-hour burn-in with a full gamut of tests (memory included) before handing over the servers. These servers are less than two weeks old...
This seems to be a fairly common type of error in some situations. I tried to boot in kexec/kdump mode (the CentOS 5 replacement for diskdumputils), but the e1000 driver isn't seeing the NICs after a reboot via the "capture kernel", so I can't replicate the (rsync-induced) problem and perform kernel debugging. I'll explore this more tomorrow.
Chris
When the servers are shipped to you, do you open them up and make sure all the modules are seated completely and haven't been dislodged during shipping?
Scott Silva wrote:
on 8-14-2008 12:55 AM Chris Miller spake the following:
nate wrote:
Chris Miller wrote:
I've got a pair of HA servers I'm trying to get into production. Here are some specs:
[..]
[root@haws1 ~]# BUG: unable to handle kernel paging request at virtual address c
This typically means bad RAM
While I won't rule this out, my local hardware vendor runs a 48-hour burn-in
When the servers are shipped to you, do you open them up and make sure all the modules are seated completely and haven't been dislodged during shipping?
+1 on hardware issues... I won't name names, but I recently ordered two identical systems and had to send one of them back FOUR times: two bad RAID controllers, bad RAM, and a bad motherboard. This all started about 4 weeks into production. I don't know if the vendor was actually doing burn-in, but I've seen plenty of damage from shipping. Do your own memory testing and line up another (nearly identical) server to verify the problem.
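If it helps, memtest86+ is easy to get going without any extra hardware; something like the following (the package and helper names here are from memory, so double-check them):

# option 1: boot the CentOS install CD/DVD and type "memtest86" at the boot: prompt
# option 2: install memtest86+ on the box itself and boot into it from GRUB
yum install memtest86+
memtest-setup                 # adds a Memtest86+ stanza to /boot/grub/grub.conf
# reboot, pick the Memtest86+ entry, and let it run several complete passes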
Good luck,
Jed