Hello,
every few hours I get the following message in /var/log/messages:
Jul 5 20:23:28 hXXX kernel: Machine check events logged
Jul 5 20:53:28 hXXX kernel: Machine check events logged
Jul 5 22:13:28 hXXX kernel: Machine check events logged
Jul 5 23:53:28 hXXX kernel: Machine check events logged
Jul 5 23:58:27 hXXX kernel: Machine check events logged
Jul 6 01:38:27 hXXX kernel: Machine check events logged
Jul 6 04:48:27 hXXX kernel: Machine check events logged
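A small sketch for anyone who wants to look at the spacing of these entries. It only parses the seven timestamps shown above and prints the gaps in minutes. The year 2010 is an assumption taken from the dates in the thread (syslog lines carry no year), and the helper names are made up for illustration.

```python
from datetime import datetime

# Timestamps copied from the /var/log/messages excerpt above.
STAMPS = [
    "Jul 5 20:23:28",
    "Jul 5 20:53:28",
    "Jul 5 22:13:28",
    "Jul 5 23:53:28",
    "Jul 5 23:58:27",
    "Jul 6 01:38:27",
    "Jul 6 04:48:27",
]


def parse(stamp, year=2010):
    """Parse a syslog-style timestamp; the year is assumed, not logged."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def gaps_in_minutes(stamps):
    """Return the gap between consecutive entries, rounded to whole minutes."""
    times = [parse(s) for s in stamps]
    return [round((b - a).total_seconds() / 60) for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    print(gaps_in_minutes(STAMPS))
```

All gaps come out as multiples of five minutes. That would fit errors being picked up by a periodic poll rather than at the moment they occur, but this is only a guess; nothing in the log confirms the polling interval.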
And in the /var/log/mcelog I see:
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge TSC 111a60c5584d4 [at 2500 Mhz 1 days 9:25:51 uptime (unreliable)]
MISC c008000001000000 ADDR 1148f5940
  Northbridge NB Array Error
       bit35 = err cpu3
       bit42 = L3 subcache in error bit 0
       bit43 = L3 subcache in error bit 1
       bit46 = corrected ecc error
       bit59 = misc error valid
  memory/cache error 'generic read mem transaction, generic transaction, level generic'
STATUS 9c1f4cf8001c011b MCGSTATUS 0
No DIMM found for 1148f5940 in SMBIOS
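For readers who want to check the dump by hand, here is a minimal sketch that splits the STATUS value into the generic x86 machine-check flag bits. The bit positions are the architectural ones for MCi_STATUS registers. It is an assumption that they apply unchanged to this northbridge bank, and the function and table names are illustrative only.

```python
# STATUS value copied from the mcelog output above.
STATUS = 0x9C1F4CF8001C011B

# Generic x86 machine-check status flags (bit positions in MCi_STATUS).
FLAG_BITS = {
    "VAL": 63,    # register holds a valid error record
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was NOT corrected by hardware
    "EN": 60,     # reporting was enabled for this error
    "MISCV": 59,  # MISC register contents are valid
    "ADDRV": 58,  # ADDR register contents are valid
    "PCC": 57,    # processor context may be corrupt
}


def decode_flags(status):
    """Return a name -> bool mapping for the generic status flags."""
    return {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}


def bits_set(status, bits):
    """Check that every listed bit position is set in the status word."""
    return all((status >> b) & 1 for b in bits)


if __name__ == "__main__":
    for name, value in decode_flags(STATUS).items():
        print(f"{name:5s} = {int(value)}")
    # The bits mcelog itself called out in the dump.
    print("bits 35/42/43/46/59 set:", bits_set(STATUS, (35, 42, 43, 46, 59)))
```

On this reading UC and PCC are clear, which would agree with the "corrected ecc error" line in the dump. It does not say which component is producing the errors.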
My machine (a CentOS 5.5/64bit server rented at German hoster strato.de) seems to run ok as a LAMP server though...
What do these messages actually mean? Is the RAM defective, and how critical is it? (I have an important event this Friday and would prefer not to take the machine offline.)
Thank you and I'm attaching my dmesg output below
Regards Alex
Linux version 2.6.18-194.8.1.el5 (mockbuild@builder10.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-48)) #1 SMP Thu Jul 1 19:04:48 EDT 2010 Command line: ro root=LABEL=/ console=tty0 console=ttyS0,57600 BIOS-provided physical RAM map: BIOS-e820: 0000000000010000 - 000000000009f000 (usable) BIOS-e820: 000000000009f000 - 00000000000a0000 (reserved) BIOS-e820: 00000000000e4000 - 0000000000100000 (reserved) BIOS-e820: 0000000000100000 - 00000000ddfb0000 (usable) BIOS-e820: 00000000ddfb0000 - 00000000ddfbe000 (ACPI data) BIOS-e820: 00000000ddfbe000 - 00000000ddfe0000 (ACPI NVS) BIOS-e820: 00000000ddfe0000 - 00000000ddfee000 (reserved) BIOS-e820: 00000000ddff0000 - 00000000de000000 (reserved) BIOS-e820: 00000000ff700000 - 0000000100000000 (reserved) BIOS-e820: 0000000100000000 - 0000000120000000 (usable) DMI present. ACPI: RSDP (v000 ACPIAM ) @ 0x00000000000faf80 ACPI: RSDT (v001 032510 RSDT1503 0x20100325 MSFT 0x00000097) @ 0x00000000ddfb0000 ACPI: FADT (v002 032510 FACP1503 0x20100325 MSFT 0x00000097) @ 0x00000000ddfb0200 ACPI: MADT (v001 032510 APIC1503 0x20100325 MSFT 0x00000097) @ 0x00000000ddfb0390 ACPI: MCFG (v001 032510 OEMMCFG 0x20100325 MSFT 0x00000097) @ 0x00000000ddfb0400 ACPI: OEMB (v001 032510 OEMB1503 0x20100325 MSFT 0x00000097) @ 0x00000000ddfbe040 ACPI: HPET (v001 032510 OEMHPET 0x20100325 MSFT 0x00000097) @ 0x00000000ddfb48c0 ACPI: SSDT (v001 A M I POWERNOW 0x00000001 AMD 0x00000001) @ 0x00000000ddfb4900 ACPI: DSDT (v001 A96B3 A96B3210 0x00000210 INTL 0x20051117) @ 0x0000000000000000 No NUMA configuration found Faking a node at 0000000000000000-0000000120000000 Bootmem setup node 0 0000000000000000-0000000120000000 Memory for crash kernel (0x0 to 0x0) notwithin permissible range disabling kdump On node 0 totalpages: 1022763 DMA zone: 2627 pages, LIFO batch:0 DMA32 zone: 890856 pages, LIFO batch:31 Normal zone: 129280 pages, LIFO batch:31 ACPI: PM-Timer IO Port: 0x808 ACPI: Local APIC address 0xfee00000 ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] 
enabled) Processor #0 0:4 APIC version 16 ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled) Processor #1 0:4 APIC version 16 ACPI: LAPIC (acpi_id[0x03] lapic_id[0x02] enabled) Processor #2 0:4 APIC version 16 ACPI: LAPIC (acpi_id[0x04] lapic_id[0x03] enabled) Processor #3 0:4 APIC version 16 ACPI: IOAPIC (id[0x04] address[0xfec00000] gsi_base[0]) IOAPIC[0]: apic_id 4, version 33, address 0xfec00000, GSI 0-23 ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl) ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 low level) ACPI: IRQ0 used by override. ACPI: IRQ2 used by override. ACPI: IRQ9 used by override. Setting APIC routing to physical flat ACPI: HPET id: 0x8300 base: 0xfed00000 Using ACPI (MADT) for SMP configuration information Nosave address range: 000000000009f000 - 00000000000a0000 Nosave address range: 00000000000a0000 - 00000000000e4000 Nosave address range: 00000000000e4000 - 0000000000100000 Nosave address range: 00000000ddfb0000 - 00000000ddfbe000 Nosave address range: 00000000ddfbe000 - 00000000ddfe0000 Nosave address range: 00000000ddfe0000 - 00000000ddfee000 Nosave address range: 00000000ddfee000 - 00000000ddff0000 Nosave address range: 00000000ddff0000 - 00000000de000000 Nosave address range: 00000000de000000 - 00000000ff700000 Nosave address range: 00000000ff700000 - 0000000100000000 Allocating PCI resources starting at e0000000 (gap: de000000:21700000) SMP: Allowing 4 CPUs, 0 hotplug CPUs Built 1 zonelists. Total pages: 1022763 Kernel command line: ro root=LABEL=/ console=tty0 console=ttyS0,57600 Initializing CPU#0 PID hash table entries: 4096 (order: 12, 32768 bytes) Console: colour VGA+ 80x25 Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes) Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes) Checking aperture... 
CPU 0: aperture @ 4000000 size 32 MB Aperture too small (32 MB) No AGP bridge found Your BIOS doesn't leave a aperture memory hole Please enable the IOMMU option in the BIOS setup This costs you 64 MB of RAM Mapping aperture over 65536 KB of RAM @ 4000000 Nosave address range: 0000000004000000 - 0000000008000000 ACPI: DMAR not present Memory: 4016200k/4718592k available (2575k kernel code, 144564k reserved, 1303k data, 212k init) Calibrating delay loop (skipped), value calculated using timer frequency.. 5000.21 BogoMIPS (lpj=2500105) Security Framework v1.0.0 initialized SELinux: Initializing. SELinux: Starting in permissive mode selinux_register_security: Registering secondary module capability Capability LSM initialized as secondary Mount-cache hash table entries: 256 CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line) CPU: L2 Cache: 512K (64 bytes/line) CPU 0/0 -> Node 0 CPU: Physical Processor ID: 0 CPU: Processor Core ID: 0 SMP alternatives: switching to UP code ACPI: Core revision 20060707 Using local APIC timer interrupts. Detected 12.500 MHz APIC timer. SMP alternatives: switching to SMP code Booting processor 1/4 APIC 0x1 Initializing CPU#1 Calibrating delay using timer specific routine.. 5000.36 BogoMIPS (lpj=2500184) CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line) CPU: L2 Cache: 512K (64 bytes/line) CPU 1/1 -> Node 0 CPU: Physical Processor ID: 0 CPU: Processor Core ID: 1 Quad-Core AMD Opteron(tm) Processor 1381 stepping 02 SMP alternatives: switching to SMP code Booting processor 2/4 APIC 0x2 Initializing CPU#2 Calibrating delay using timer specific routine.. 
4999.71 BogoMIPS (lpj=2499855) CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line) CPU: L2 Cache: 512K (64 bytes/line) CPU 2/2 -> Node 0 CPU: Physical Processor ID: 0 CPU: Processor Core ID: 2 Quad-Core AMD Opteron(tm) Processor 1381 stepping 02 SMP alternatives: switching to SMP code Booting processor 3/4 APIC 0x3 Initializing CPU#3 Calibrating delay using timer specific routine.. 5000.16 BogoMIPS (lpj=2500084) CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line) CPU: L2 Cache: 512K (64 bytes/line) CPU 3/3 -> Node 0 CPU: Physical Processor ID: 0 CPU: Processor Core ID: 3 Quad-Core AMD Opteron(tm) Processor 1381 stepping 02 Brought up 4 CPUs testing NMI watchdog ... OK. time.c: Using 14.318180 MHz WALL HPET GTOD HPET/TSC timer. time.c: Detected 2500.108 MHz processor. sizeof(vma)=176 bytes sizeof(page)=56 bytes sizeof(inode)=560 bytes sizeof(dentry)=216 bytes sizeof(ext3inode)=760 bytes sizeof(buffer_head)=96 bytes sizeof(skbuff)=248 bytes migration_cost=230 checking if image is initramfs... it is Freeing initrd memory: 2614k freed NET: Registered protocol family 16 ACPI: bus type pci registered PCI: BIOS Bug: MCFG area at e0000000 is not E820-reserved PCI: Not using MMCONFIG. PCI: Using configuration type 1 PCI: Using configuration type 1 for extended access ACPI: Interpreter enabled ACPI: Using IOAPIC for interrupt routing ACPI: No dock devices found. 
ACPI: PCI Root Bridge [PCI0] (0000:00) PCI: set SATA to AHCI mode PCI: Ignoring BAR0-3 of IDE controller 0000:00:14.1 PCI: Transparent bridge - 0000:00:14.4 ACPI: PCI Interrupt Routing Table [_SB_.PCI0._PRT] ACPI: PCI Interrupt Routing Table [_SB_.PCI0.P0P1._PRT] ACPI: PCI Interrupt Routing Table [_SB_.PCI0.PCE4._PRT] ACPI: PCI Interrupt Routing Table [_SB_.PCI0.PCE5._PRT] ACPI: PCI Interrupt Routing Table [_SB_.PCI0.P0PC._PRT] ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 *5 7 10 11 12 14 15) ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 7 10 11 12 14 *15) ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 5 7 *10 11 12 14 15) ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 7 *10 11 12 14 15) ACPI: PCI Interrupt Link [LNKE] (IRQs 3 4 5 7 10 11 12 14 15) *0, disabled. ACPI: PCI Interrupt Link [LNKF] (IRQs 9) *0, disabled. ACPI: PCI Interrupt Link [LNKG] (IRQs 3 4 5 7 10 *11 12 14 15) ACPI: PCI Interrupt Link [LNKH] (IRQs 3 4 5 7 10 11 12 14 15) *0, disabled. Linux Plug and Play Support v0.97 (c) Adam Belay pnp: PnP ACPI init pnp: PnP ACPI: found 14 devices usbcore: registered new driver usbfs usbcore: registered new driver hub PCI: Using ACPI for IRQ routing PCI: If a device doesn't work, try "pci=routeirq". If it helps, post a report PCI: Cannot allocate resource region 1 of device 0000:00:14.0 NetLabel: Initializing NetLabel: domain hash size = 128 NetLabel: protocols = UNLABELED CIPSOv4 NetLabel: unlabeled traffic allowed by default hpet0: at MMIO 0xfed00000 (virtual 0xffffffffff5fe000), IRQs 2, 8, 0, 0 hpet0: 4 32-bit timers, 14318180 Hz ACPI: DMAR not present PCI-DMA: Disabling AGP. PCI-DMA: aperture base @ 4000000 size 65536 KB PCI-DMA: using GART IOMMU. 
PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture pnp: 00:0b: ioport range 0xa00-0xa0f has been reserved pnp: 00:0b: ioport range 0xa10-0xa1f has been reserved PCI: Error while updating region 0000:00:14.0/1 (e0000004 != 8000a014) PCI: Bridge: 0000:00:01.0 IO window: c000-cfff MEM window: fe800000-fe9fffff PREFETCH window 0x00000000fc000000-0x00000000fdffffff PCI: Bridge: 0000:00:04.0 IO window: d000-dfff MEM window: fea00000-feafffff PREFETCH window: disabled. PCI: Bridge: 0000:00:05.0 IO window: e000-efff MEM window: feb00000-febfffff PREFETCH window: disabled. PCI: Bridge: 0000:00:14.4 IO window: disabled. MEM window: disabled. PREFETCH window: disabled. PCI: Setting latency timer of device 0000:00:04.0 to 64 PCI: Setting latency timer of device 0000:00:05.0 to 64 NET: Registered protocol family 2 IP route cache hash table entries: 131072 (order: 8, 1048576 bytes) TCP established hash table entries: 262144 (order: 10, 4194304 bytes) TCP bind hash table entries: 65536 (order: 8, 1048576 bytes) TCP: Hash tables configured (established 262144 bind 65536) TCP reno registered audit: initializing netlink socket (disabled) type=2000 audit(1278326009.398:1): initialized Total HugeTLB memory allocated, 0 VFS: Disk quotas dquot_6.5.1 Dquot-cache hash table entries: 512 (order 0, 4096 bytes) SELinux: Registering netfilter hooks Initializing Cryptographic API alg: No test for crc32c (crc32c-generic) ksign: Installing public key data Loading keyring - Added public key 71959A475B93578 - User ID: CentOS (Kernel Module GPG key) io scheduler noop registered io scheduler anticipatory registered io scheduler deadline registered io scheduler cfq registered (default) Boot video device is 0000:01:05.0 PCI: Setting latency timer of device 0000:00:04.0 to 64 PCI: Setting latency timer of device 0000:00:05.0 to 64 pci_hotplug: PCI Hot Plug PCI Core version: 0.5 ACPI: duty_cycle spans bit 4 ACPI: CPU0 (power states: C1[C1] C2[C2] C3[C3]) Real Time Clock Driver v1.12ac 
hpet_resources: 0xfed00000 is busy Non-volatile memory driver v1.2 Linux agpgart interface v0.101 (c) Dave Jones Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A serial8250: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A 00:05: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A 00:06: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A brd: module loaded Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2 ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx SB600_PATA: IDE controller at PCI slot 0000:00:14.1 GSI 16 sharing vector 0xC1 and IRQ 16 ACPI: PCI Interrupt 0000:00:14.1[A] -> GSI 16 (level, low) -> IRQ 193 SB600_PATA: chipset revision 0 SB600_PATA: not 100% native mode: will probe irqs later ide0: BM-DMA at 0xff00-0xff07, BIOS settings: hda:pio, hdb:pio Probing IDE interface ide0... Probing IDE interface ide0... Probing IDE interface ide1... ide-floppy driver 0.99.newide usbcore: registered new driver hiddev usbcore: registered new driver usbhid drivers/usb/input/hid-core.c: v2.6:USB HID core driver PNP: PS/2 Controller [PNP0303:PS2K] at 0x60,0x64 irq 1 PNP: PS/2 controller doesn't have AUX irq; using default 12 serio: i8042 KBD port at 0x60,0x64 irq 1 serio: i8042 AUX port at 0x60,0x64 irq 12 mice: PS/2 mouse device common for all mice md: md driver 0.90.3 MAX_MD_DEVS=256, MD_SB_DISKS=27 md: bitmap version 4.39 TCP bic registered Initializing IPsec netlink socket NET: Registered protocol family 1 NET: Registered protocol family 17 ACPI: (supports S0 S1 S3 S4 S5) Initalizing network drop monitor service Freeing unused kernel memory: 212k freed Write protecting the kernel read-only data: 504k GSI 17 sharing vector 0xC9 and IRQ 17 ACPI: PCI Interrupt 0000:00:13.5[D] -> GSI 19 (level, low) -> IRQ 201 ehci_hcd 0000:00:13.5: EHCI Host Controller ehci_hcd 0000:00:13.5: new USB bus registered, assigned bus number 1 ehci_hcd 0000:00:13.5: applying AMD SB600/SB700 USB freeze workaround 
ehci_hcd 0000:00:13.5: debug port 1 ehci_hcd 0000:00:13.5: irq 201, io mem 0xfe7ff000 ehci_hcd 0000:00:13.5: USB 2.0 started, EHCI 1.00, driver 10 Dec 2004 usb usb1: configuration #1 chosen from 1 choice hub 1-0:1.0: USB hub found hub 1-0:1.0: 10 ports detected ohci_hcd: 2005 April 22 USB 1.1 'Open' Host Controller (OHCI) Driver (PCI) ACPI: PCI Interrupt 0000:00:13.0[A] -> GSI 16 (level, low) -> IRQ 193 ohci_hcd 0000:00:13.0: OHCI Host Controller ohci_hcd 0000:00:13.0: new USB bus registered, assigned bus number 2 ohci_hcd 0000:00:13.0: irq 193, io mem 0xfe7fe000 usb usb2: configuration #1 chosen from 1 choice hub 2-0:1.0: USB hub found hub 2-0:1.0: 2 ports detected GSI 18 sharing vector 0xD1 and IRQ 18 ACPI: PCI Interrupt 0000:00:13.1[B] -> GSI 17 (level, low) -> IRQ 209 ohci_hcd 0000:00:13.1: OHCI Host Controller ohci_hcd 0000:00:13.1: new USB bus registered, assigned bus number 3 ohci_hcd 0000:00:13.1: irq 209, io mem 0xfe7fd000 usb usb3: configuration #1 chosen from 1 choice hub 3-0:1.0: USB hub found hub 3-0:1.0: 2 ports detected GSI 19 sharing vector 0xD9 and IRQ 19 ACPI: PCI Interrupt 0000:00:13.2[C] -> GSI 18 (level, low) -> IRQ 217 ohci_hcd 0000:00:13.2: OHCI Host Controller ohci_hcd 0000:00:13.2: new USB bus registered, assigned bus number 4 ohci_hcd 0000:00:13.2: irq 217, io mem 0xfe7fc000 usb usb4: configuration #1 chosen from 1 choice hub 4-0:1.0: USB hub found hub 4-0:1.0: 2 ports detected ACPI: PCI Interrupt 0000:00:13.3[B] -> GSI 17 (level, low) -> IRQ 209 ohci_hcd 0000:00:13.3: OHCI Host Controller ohci_hcd 0000:00:13.3: new USB bus registered, assigned bus number 5 ohci_hcd 0000:00:13.3: irq 209, io mem 0xfe7fb000 usb usb5: configuration #1 chosen from 1 choice hub 5-0:1.0: USB hub found hub 5-0:1.0: 2 ports detected ACPI: PCI Interrupt 0000:00:13.4[C] -> GSI 18 (level, low) -> IRQ 217 ohci_hcd 0000:00:13.4: OHCI Host Controller ohci_hcd 0000:00:13.4: new USB bus registered, assigned bus number 6 ohci_hcd 0000:00:13.4: irq 217, io mem 0xfe7fa000 
usb usb6: configuration #1 chosen from 1 choice hub 6-0:1.0: USB hub found hub 6-0:1.0: 2 ports detected USB Universal Host Controller Interface driver v3.0 md: raid1 personality registered for level 1 SCSI subsystem initialized libata version 3.00 loaded. ahci 0000:00:12.0: version 3.0 GSI 20 sharing vector 0xE1 and IRQ 20 ACPI: PCI Interrupt 0000:00:12.0[A] -> GSI 22 (level, low) -> IRQ 225 ahci 0000:00:12.0: controller can't do 64bit DMA, forcing 32bit ahci 0000:00:12.0: AHCI 0001.0100 32 slots 4 ports 3 Gbps 0xf impl SATA mode ahci 0000:00:12.0: flags: ncq sntf ilck pm led clo pmp pio slum part scsi0 : ahci scsi1 : ahci scsi2 : ahci scsi3 : ahci ata1: SATA max UDMA/133 abar m1024@0xfe7ff800 port 0xfe7ff900 irq 225 ata2: SATA max UDMA/133 abar m1024@0xfe7ff800 port 0xfe7ff980 irq 225 ata3: SATA max UDMA/133 abar m1024@0xfe7ff800 port 0xfe7ffa00 irq 225 ata4: SATA max UDMA/133 abar m1024@0xfe7ff800 port 0xfe7ffa80 irq 225 ata1: softreset failed (device not ready) ata1: failed due to HW bug, retry pmp=0 ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300) ata1.00: ATA-8: Hitachi HDS721050CLA362, JP2OA39C, max UDMA/133 ata1.00: 976773168 sectors, multi 16: LBA48 NCQ (depth 31/32) ata1.00: SB600 AHCI: limiting to 255 sectors per cmd ata1.00: SB600 AHCI: limiting to 255 sectors per cmd ata1.00: configured for UDMA/133 ata2: softreset failed (device not ready) ata2: failed due to HW bug, retry pmp=0 ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300) ata2.00: ATA-8: Hitachi HDS721050CLA362, JP2OA39C, max UDMA/133 ata2.00: 976773168 sectors, multi 16: LBA48 NCQ (depth 31/32) ata2.00: SB600 AHCI: limiting to 255 sectors per cmd ata2.00: SB600 AHCI: limiting to 255 sectors per cmd ata2.00: configured for UDMA/133 ata3: SATA link down (SStatus 0 SControl 300) ata4: SATA link down (SStatus 0 SControl 300) Vendor: ATA Model: Hitachi HDS72105 Rev: JP2O Type: Direct-Access ANSI SCSI revision: 05 SCSI device sda: 976773168 512-byte hdwr sectors (500108 MB) sda: Write 
Protect is off sda: Mode Sense: 00 3a 00 00 SCSI device sda: drive cache: write back SCSI device sda: 976773168 512-byte hdwr sectors (500108 MB) sda: Write Protect is off sda: Mode Sense: 00 3a 00 00 SCSI device sda: drive cache: write back sda: sda1 sda2 sda3 sd 0:0:0:0: Attached scsi disk sda Vendor: ATA Model: Hitachi HDS72105 Rev: JP2O Type: Direct-Access ANSI SCSI revision: 05 SCSI device sdb: 976773168 512-byte hdwr sectors (500108 MB) sdb: Write Protect is off sdb: Mode Sense: 00 3a 00 00 SCSI device sdb: drive cache: write back SCSI device sdb: 976773168 512-byte hdwr sectors (500108 MB) sdb: Write Protect is off sdb: Mode Sense: 00 3a 00 00 SCSI device sdb: drive cache: write back sdb: sdb1 sdb2 sdb3 sd 1:0:0:0: Attached scsi disk sdb device-mapper: uevent: version 1.0.3 device-mapper: ioctl: 4.11.5-ioctl (2007-12-12) initialised: dm-devel@redhat.com device-mapper: dm-raid45: initialized v0.2594l md: Autodetecting RAID arrays. md: autorun ... md: considering sdb3 ... md: adding sdb3 ... md: sdb1 has different UUID to sdb3 md: adding sda3 ... md: sda1 has different UUID to sdb3 md: created md1 md: bind<sda3> md: bind<sdb3> md: running: <sdb3><sda3> raid1: raid set md1 active with 2 out of 2 mirrors md: considering sdb1 ... md: adding sdb1 ... md: adding sda1 ... md: created md0 md: bind<sda1> md: bind<sdb1> md: running: <sdb1><sda1> raid1: raid set md0 active with 2 out of 2 mirrors md: ... autorun DONE. kjournald starting. Commit interval 5 seconds EXT3-fs: mounted filesystem with ordered data mode. SELinux: Disabled at runtime. SELinux: Unregistering netfilter hooks type=1404 audit(1278326037.818:2): selinux=0 auid=4294967295 ses=4294967295 input: PC Speaker as /class/input/input0 e1000e: Intel(R) PRO/1000 Network Driver - 1.0.2-k3.1 e1000e: Copyright (c) 1999-2008 Intel Corporation. 
ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 16 (level, low) -> IRQ 193 PCI: Setting latency timer of device 0000:02:00.0 to 64 EDAC MC: Ver: 2.0.1 Jul 1 2010 EDAC amd64_edac: Ver: 3.2.0 Jul 1 2010 Floppy drive(s): fd0 is 1.44M e1000e 0000:02:00.0: Warning: detected ASPM enabled in EEPROM sd 0:0:0:0: Attached scsi generic sg0 type 0 sd 1:0:0:0: Attached scsi generic sg1 type 0 eth0: (PCI Express:2.5GB/s:Width x1) 40:61:86:ec:c0:45 eth0: Intel(R) PRO/1000 Network Connection eth0: MAC: 2, PHY: 2, PBA No: ffffff-0ff ACPI: PCI Interrupt 0000:03:00.0[A] -> GSI 17 (level, low) -> IRQ 209 PCI: Setting latency timer of device 0000:03:00.0 to 64 e1000e 0000:03:00.0: Warning: detected ASPM enabled in EEPROM eth1: (PCI Express:2.5GB/s:Width x1) 40:61:86:ec:c0:46 eth1: Intel(R) PRO/1000 Network Connection eth1: MAC: 2, PHY: 2, PBA No: ffffff-0ff piix4_smbus 0000:00:14.0: Found 0000:00:14.0 device EDAC amd64: This node reports that Memory ECC is currently disabled, set F3x44[22] (0000:00:18.3). EDAC amd64: WARNING: ECC is disabled by BIOS. Module will NOT be loaded. Either Enable ECC in the BIOS, or set 'ecc_enable_override'. Also, use of the override can cause unknown side effects. amd64_edac: probe of 0000:00:18.2 failed with error -22 shpchp: Standard Hot Plug PCI Controller Driver version: 0.4 floppy0: no floppy controllers found Floppy drive(s): fd0 is 1.44M floppy0: no floppy controllers found lp: driver loaded but no devices found ACPI: Power Button (FF) [PWRF] ACPI: Power Button (CM) [PWRB] ACPI: Mapper loaded dell-wmi: No known WMI GUID found md: Autodetecting RAID arrays. md: autorun ... md: ... autorun DONE. device-mapper: multipath: version 1.0.5 loaded EXT3 FS on md1, internal journal Adding 1052248k swap on /dev/sda2. Priority:-1 extents:1 across:1052248k Adding 1052248k swap on /dev/sdb2. 
Priority:-2 extents:1 across:1052248k powernow-k8: Found 1 Quad-Core AMD Opteron(tm) Processor 1381 processors (4 cpu cores) (version 2.20.00) powernow-k8: 0 : fid 0x0 gid 0x0 (2500 MHz) powernow-k8: 1 : fid 0x0 gid 0x0 (1800 MHz) powernow-k8: 2 : fid 0x0 gid 0x0 (1300 MHz) powernow-k8: 3 : fid 0x0 gid 0x0 (800 MHz) ip_tables: (C) 2000-2006 Netfilter Core Team Netfilter messages via NETLINK v0.30. ip_conntrack version 2.4 (8192 buckets, 65536 max) - 304 bytes per conntrack NET: Registered protocol family 10 lo: Disabled Privacy Extensions IPv6 over IPv4 tunneling driver ADDRCONF(NETDEV_UP): eth0: link is not ready e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: None eth0: 10/100 speed: disabling TSO ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready ADDRCONF(NETDEV_UP): eth1: link is not ready Machine check events logged Machine check events logged Machine check events logged Machine check events logged Machine check events logged Machine check events logged Machine check events logged Machine check events logged Machine check events logged Machine check events logged Machine check events logged Machine check events logged Machine check events logged Machine check events logged Machine check events logged Machine check events logged Machine check events logged Machine check events logged Machine check events logged Machine check events logged Machine check events logged Machine check events logged Machine check events logged Machine check events logged Machine check events logged Machine check events logged Machine check events logged Machine check events logged Machine check events logged Machine check events logged Machine check events logged
Alexander Farber wrote:
> Hello,
> every few hours I get the following message in /var/log/message:
> Jul 5 20:23:28 hXXX kernel: Machine check events logged
<snip>
> And in the /var/log/mcelog I see:
> MCE 0
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> CPU 0 4 northbridge TSC 111a60c5584d4 [at 2500 Mhz 1 days 9:25:51 uptime (unreliable)]
> MISC c008000001000000 ADDR 1148f5940
>   Northbridge NB Array Error
>        bit35 = err cpu3
>        bit42 = L3 subcache in error bit 0
>        bit43 = L3 subcache in error bit 1
>        bit46 = corrected ecc error
>        bit59 = misc error valid
>   memory/cache error 'generic read mem transaction, generic transaction, level generic'
> STATUS 9c1f4cf8001c011b MCGSTATUS 0
> No DIMM found for 1148f5940 in SMBIOS
> My machine (a CentOS 5.5/64bit server rented at German hoster strato.de) seems to run ok as a LAMP server though...
> What do these messages actually mean, is RAM defect and how critical is it (because I have an important event this Friday and would prefer not to take the machine offline)
<snip>
First, this is *very* bad - I'm not good enough on this to tell you if it's the CPU, or the motherboard, but it's one of the two, *not* just memory. Second, if you're paying for hosting, and it's *their* server, you need to get on the phone with them *now*, and tell them that they need to fix it, yesterday would be preferable. They *should* have seen the logs.
Dunno if you have a physical machine hosted there, or a VM; if the latter, they can move it without you seeing any downtime at all. If the former, they can just hot swap the drives into another server.
But call them *NOW*. You're paying for the service.
mark
Hello Mark,
On Wed, Jul 7, 2010 at 2:51 PM, m.roth@5-cent.us wrote:
> First, this is *very* bad - I'm not good enough on this to tell you if it's the CPU, or the motherboard, but it's one of the two, *not* just memory. Second, if you're paying for hosting, and it's *their* server, you need to get on the phone with them *now*, and tell them that they need to fix it, yesterday would be preferable. They *should* have seen the logs.
yes, thanks for confirming this.
I called them a few hours ago and they are currently running "hardware tests" on my dedicated server.
Stupidly they (Strato.de) have refused to move my HDDs to another machine and then just test the old machine "offline" :-( (Not the best service, but I'm locked by an 18-month contract...)
Regards Alex
Alexander Farber wrote:
> Hello Mark,
> On Wed, Jul 7, 2010 at 2:51 PM, m.roth@5-cent.us wrote:
>> First, this is *very* bad - I'm not good enough on this to tell you if it's the CPU, or the motherboard, but it's one of the two, *not* just memory. Second, if you're paying for hosting, and it's *their* server, you need to get on the phone with them *now*, and tell them that they need to fix it, yesterday would be preferable. They *should* have seen the logs.
> yes, thanks for confirming this.
> I've called them few hours ago and they are currently performing "hardware tests" with my dedicated server now.
> Stupidly they (Strato.de) have refused to move my HDDs to another machine and then just test the old machine "offline" :-( (Not the best service, but I'm locked by an 18-month contract...)
Really? And what's the SLA they have in the contract (and there *better* be one)?
mark
On Wednesday 07 July 2010, m.roth@5-cent.us wrote:
> Alexander Farber wrote:
>> every few hours I get the following message in /var/log/message:
>> Jul 5 20:23:28 hXXX kernel: Machine check events logged
...
>> MCE 0
>> HARDWARE ERROR. This is *NOT* a software problem!
>> Please contact your hardware vendor
>> CPU 0 4 northbridge TSC 111a60c5584d4 [at 2500 Mhz 1 days 9:25:51 uptime (unreliable)]
>> MISC c008000001000000 ADDR 1148f5940
>>   Northbridge NB Array Error
>>        bit35 = err cpu3
>>        bit42 = L3 subcache in error bit 0
>>        bit43 = L3 subcache in error bit 1
>>        bit46 = corrected ecc error
>>        bit59 = misc error valid
>>   memory/cache error 'generic read mem transaction, generic transaction, level generic'
>> STATUS 9c1f4cf8001c011b MCGSTATUS 0
>> No DIMM found for 1148f5940 in SMBIOS
...
> First, this is *very* bad
That's a bit harsh. Depending on what the actual error is that triggers this MCE, it may actually be just an annoyance (even though, yes, it is a hardware problem). Also, the OP did mention that the server runs without any obvious problems.
> - I'm not good enough on this to tell you if it's the CPU, or the motherboard, but it's one of the two, *not* just memory.
What do you base that on? I've seen a lot of different MCE errors being resolved by finding and replacing flaky DIMMs.
> Second, if you're paying for hosting, and it's *their* server, you need to get on the phone with them *now*, and tell them that they need to fix it, yesterday would be preferable. They *should* have seen the logs.
> Dunno if you have a physical machine hosted there, or a VM'
I'm quite sure you can't get that kind of MCE-dump inside a VM.
/Peter
> if the latter, they can move it without you seeing any downtime at all. If the former, they can just hot swap the drives into another server.
> But call them *NOW*. You're paying for the service.
> mark
Peter Kjellstrom wrote:
> On Wednesday 07 July 2010, m.roth@5-cent.us wrote:
>> Alexander Farber wrote:
>>> every few hours I get the following message in /var/log/message:
>>> Jul 5 20:23:28 hXXX kernel: Machine check events logged
...
>>> MCE 0
>>> HARDWARE ERROR. This is *NOT* a software problem!
>>> Please contact your hardware vendor
>>> CPU 0 4 northbridge TSC 111a60c5584d4 [at 2500 Mhz 1 days 9:25:51 uptime (unreliable)]
>>> MISC c008000001000000 ADDR 1148f5940
>>>   Northbridge NB Array Error
>>>        bit35 = err cpu3
>>>        bit42 = L3 subcache in error bit 0
>>>        bit43 = L3 subcache in error bit 1
>>>        bit46 = corrected ecc error
>>>        bit59 = misc error valid
>>>   memory/cache error 'generic read mem transaction, generic transaction, level generic'
>>> STATUS 9c1f4cf8001c011b MCGSTATUS 0
>>> No DIMM found for 1148f5940 in SMBIOS
...
<snip>
>> - I'm not good enough on this to tell you if it's the CPU, or the motherboard, but it's one of the two, *not* just memory.
> What do you base that on? I've seen a lot of different MCE-errors being resolved by finding and replacing flaky dimms.
Because it says NB Array error, and errors in the L3 subcache. I've seen enough memory errors, and not seen an NB array & subcache error.
I do just note that there's "No DIMM found for ... in SMBIOS", but I assume that's just a bank that's not filled.
mark
On Wednesday 07 July 2010, m.roth@5-cent.us wrote:
> Peter Kjellstrom wrote:
>> On Wednesday 07 July 2010, m.roth@5-cent.us wrote:
>>> Alexander Farber wrote:
...
>>>> MISC c008000001000000 ADDR 1148f5940
>>>>   Northbridge NB Array Error
>>>>        bit35 = err cpu3
>>>>        bit42 = L3 subcache in error bit 0
>>>>        bit43 = L3 subcache in error bit 1
>>>>        bit46 = corrected ecc error
>>>>        bit59 = misc error valid
>>>>   memory/cache error 'generic read mem transaction, generic transaction, level generic'
>>>> STATUS 9c1f4cf8001c011b MCGSTATUS 0
>>>> No DIMM found for 1148f5940 in SMBIOS
...
>>> - I'm not good enough on this to tell you if it's the CPU, or the motherboard, but it's one of the two, *not* just memory.
>> What do you base that on? I've seen a lot of different MCE-errors being resolved by finding and replacing flaky dimms.
> Because it says NB Array error, and errors in the L3 subcache. I've seen enough memory errors, and not seen an NB array & subcache error.
That does sound like a reasonable guess. However, you presented it as absolute truth. The MCE could just as easily be read as: NB means not IC/DC/BU => actual RAM.
Given that real-world figures show bad RAM to be a lot more likely than a bad CPU, I'd start by looking at the DIMMs (or at the very least not exclude them...).
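Whichever reading turns out to be right, the low 16 bits of STATUS can be checked against the generic x86 "memory hierarchy" error-code layout (0000 0001 RRRR TTLL). The sketch below decodes them and reproduces the "generic read mem transaction, generic transaction, level generic" line from the dump. Treating bits 16-20 as the model-specific extended error code is an assumption about this processor family, and the names in the code are illustrative, not taken from mcelog.

```python
# STATUS value copied from the mcelog output quoted in this thread.
STATUS = 0x9C1F4CF8001C011B

# Field values of the generic memory-hierarchy error code 0000 0001 RRRR TTLL.
REQUEST = {
    0: "generic error", 1: "generic read", 2: "generic write",
    3: "data read", 4: "data write", 5: "instruction fetch",
    6: "prefetch", 7: "evict", 8: "snoop",
}
TRANSACTION = {0: "instruction", 1: "data", 2: "generic"}
LEVEL = {0: "level 0", 1: "level 1", 2: "level 2", 3: "generic"}


def decode_memory_error(status):
    """Decode the low 16 bits if they use the memory-hierarchy format."""
    code = status & 0xFFFF
    if code & 0xFF00 != 0x0100:
        return None  # some other error-code format (bus, TLB, ...)
    return {
        "request": REQUEST.get((code >> 4) & 0xF, "reserved"),
        "transaction": TRANSACTION.get((code >> 2) & 0x3, "reserved"),
        "level": LEVEL[code & 0x3],
    }


def extended_code(status):
    """Bits 16-20, assumed here to hold the model-specific extended code."""
    return (status >> 16) & 0x1F


if __name__ == "__main__":
    print(decode_memory_error(STATUS))
    print(hex(extended_code(STATUS)))
```

All three fields decode to "generic", so the error code alone is not specific enough to separate a DIMM problem from a CPU or cache problem.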
> I do just note that there's "No DIMM found for ... in SMBIOS", but I assume that's just a bank that's not filled.
or the SMBIOS data is borked, wouldn't be the first time...
/Peter
I've only found this Solaris blog, but don't understand it well enough: http://blogs.sun.com/gavinm/entry/amd_opteron_athlon64_turion64_fault
I can't provide more details, because my dedicated server has been under the hoster's "hardware tests" for 5 hours now :-( (and I guess everyone will run home for the Germany-Spain game soon)
Regards Alex
Alexander Farber wrote:
> I've only found this Solaris blog, but don't understand it well enough: http://blogs.sun.com/gavinm/entry/amd_opteron_athlon64_turion64_fault
> Can't provide you more details, because my dedicated server is under hoster's "hardware tests" since 5 hours :-( (and I guess everyone will run home for the Germany-Spain game soon)
First, that's Solaris (or OpenSolaris), so it's not the same. Second, you'll notice that the diagram and the table do *not* mention L3 caches, so the architecture's a bit different.
Finally, note where the article says,
"If an error is recoverable then it does not raise a Machine Check Exception (MCE or mc#) when detected. The recoverable errors, broadly speaking, are single-bit ECC errors from ECC-protected arrays and parity errors on clean parity-
<snip>
If an error is irrecoverable then detection of that error will raise a machine check exception (if the bit that controls mc# for that error type is set; if not you'll either never know or you pick it up by polling). The mc# handler can extract information about the error from the machine check architecture registers as before, but has the additional responsibility of deciding what further actions (which may include panic and reboot) are required. A machine check exception is a form of interrupt which allows immediate notification of an error condition - you can't afford to wait to poll for the error since that could result in the use of bad data and associated data corruption.
--- end excerpt ---
So, it is, in fact, serious, and non-recoverable, so they have a problem with their hardware, and you've paid for a service that they provide, including hardware that's supposed to be up 99.<whatever you paid for>% of the time. If they don't get it up, there should be penalties against them, or at least money rebates to *you*.
There may also be limits that would mean they've broken the contract, and are liable.
mark
CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos
Anyway my hoster has finished the "hardware tests" (probably just kept running memtest86 or some vendor CD?) on my CentOS 5.5/64bit machine with quad Opteron 1381 and said that they haven't found any issues.
I'll post a short note here if I experience any issues on my LAPP server (preferans.de - I run phpBB3 + PostgreSQL + my backend for a Facebook card game there).
Thank you Alex