[CentOS] how to debug hardware lockups?

Sat Nov 15 08:16:39 UTC 2008
Rudi Ahlers <rudiahlers at gmail.com>

Hi,

We have a server that has locked up about once a week for the past 3
weeks, without any warning, and the only way to recover it is to
reset the server. This causes unwanted downtime, and often software
loss as well.

How do I debug the server, which runs CentOS 5.2, to see why it locks
up? The CPU is an Intel Q9300 Core 2 Quad, with 8 GB RAM, on an Intel
motherboard.

The last few log entries before the server froze, followed by the
first entries after the reset, are:


Nov 15 07:15:20 saturn snmpd[2527]: Connection from UDP: [127.0.0.1]:59008
Nov 15 07:15:20 saturn snmpd[2527]: Received SNMP packet(s) from UDP: [127.0.0.1]:59008
Nov 15 07:15:20 saturn snmpd[2527]: Connection from UDP: [127.0.0.1]:47729
Nov 15 07:15:20 saturn snmpd[2527]: Received SNMP packet(s) from UDP: [127.0.0.1]:47729
Nov 15 07:15:20 saturn snmpd[2527]: Connection from UDP: [127.0.0.1]:47890
Nov 15 07:15:20 saturn snmpd[2527]: Received SNMP packet(s) from UDP: [127.0.0.1]:47890
Nov 15 07:15:20 saturn snmpd[2527]: Connection from UDP: [127.0.0.1]:50023
Nov 15 07:15:20 saturn snmpd[2527]: Received SNMP packet(s) from UDP: [127.0.0.1]:50023
Nov 15 07:15:20 saturn snmpd[2527]: Connection from UDP: [127.0.0.1]:58459
Nov 15 07:15:20 saturn snmpd[2527]: Received SNMP packet(s) from UDP: [127.0.0.1]:58459
Nov 15 10:10:10 saturn syslogd 1.4.1: restart.
Nov 15 10:10:11 saturn kernel: klogd 1.4.1, log source = /proc/kmsg started.
Nov 15 10:10:11 saturn kernel: Bootdata ok (command line is ro root=/dev/System/root)
Nov 15 10:10:11 saturn kernel: Linux version 2.6.18-92.1.17.el5xen (mockbuild at builder10.centos.org) (gcc version 4.1.2 20071124 (Red Hat 4.1.2-42)) #1 SMP Tue Nov 4 14:13:09 EST 2008
Nov 15 10:10:11 saturn kernel: BIOS-provided physical RAM map:
Nov 15 10:10:11 saturn kernel:  Xen: 0000000000000000 - 00000001ef958000 (usable)
Nov 15 10:10:11 saturn kernel: DMI 2.4 present.
Nov 15 10:10:11 saturn kernel: ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Nov 15 10:10:11 saturn kernel: ACPI: LAPIC (acpi_id[0x03] lapic_id[0x02] enabled)
Nov 15 10:10:11 saturn kernel: ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
Nov 15 10:10:11 saturn kernel: ACPI: LAPIC (acpi_id[0x04] lapic_id[0x03] enabled)
Nov 15 10:10:11 saturn kernel: ACPI: LAPIC_NMI (acpi_id[0x01] dfl dfl lint[0x1])
Nov 15 10:10:11 saturn kernel: ACPI: LAPIC_NMI (acpi_id[0x02] dfl dfl lint[0x1])
Nov 15 10:10:11 saturn kernel: ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
Nov 15 10:10:11 saturn kernel: IOAPIC[0]: apic_id 2, version 32, address 0xfec00000, GSI 0-23
Nov 15 10:10:11 saturn kernel: ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
Nov 15 10:10:11 saturn kernel: ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
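
To narrow down when each lockup happened, one option is to measure the silent
window between the last logged line and the following "syslogd ... restart."
entry. The following is only a rough sketch: the function name
find_restart_gaps, the min_gap threshold and the year argument are
hypothetical choices made for illustration (syslog timestamps carry no year,
so it has to be supplied). syslogd may also log "restart." at other times,
for example when it is signalled during log rotation, so only restarts that
follow a long silence are reported.

```python
from datetime import datetime, timedelta


def find_restart_gaps(lines, year=2008, min_gap=timedelta(minutes=5)):
    """Return (last_seen, restart, gap) for each syslogd restart that
    follows at least min_gap of silence in the log."""
    gaps = []
    previous = None
    for line in lines:
        try:
            # Classic syslog timestamps look like "Nov 15 07:15:20" (15 chars).
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # wrapped continuation line or other text without a timestamp
        is_restart = "syslogd" in line and "restart" in line
        if is_restart and previous is not None and stamp - previous >= min_gap:
            gaps.append((previous, stamp, stamp - previous))
        previous = stamp
    return gaps


# Lines taken from the log excerpt above, including one wrapped line.
sample = [
    "Nov 15 07:15:20 saturn snmpd[2527]: Connection from UDP: [127.0.0.1]:58459",
    "Nov 15 07:15:20 saturn snmpd[2527]: Received SNMP packet(s) from UDP:",
    "[127.0.0.1]:58459",
    "Nov 15 10:10:10 saturn syslogd 1.4.1: restart.",
    "Nov 15 10:10:11 saturn kernel: klogd 1.4.1, log source = /proc/kmsg started.",
]

for last_seen, restart, gap in find_restart_gaps(sample):
    print(f"{last_seen} -> {restart} (silent for {gap})")
```

For the excerpt above this reports that nothing was logged between 07:15:20
and the restart at 10:10:10, so the freeze happened somewhere in that window
of just under three hours. It does not say why the machine froze, only when
to look.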




-- 

Kind Regards
Rudi Ahlers