Hi,
We have a server which has locked up about once a week for the past 3 weeks, without any warning, and the only way to recover it is to reset the server. This causes unwanted downtime, and often data loss as well.
How do I debug the server, which runs CentOS 5.2, to see why it locks up? The CPU is an Intel Q9300 Core 2 Quad with 8 GB RAM, on an Intel motherboard.
The last few entries before the server froze are:
Nov 15 07:15:20 saturn snmpd[2527]: Connection from UDP: [127.0.0.1]:59008
Nov 15 07:15:20 saturn snmpd[2527]: Received SNMP packet(s) from UDP: [127.0.0.1]:59008
Nov 15 07:15:20 saturn snmpd[2527]: Connection from UDP: [127.0.0.1]:47729
Nov 15 07:15:20 saturn snmpd[2527]: Received SNMP packet(s) from UDP: [127.0.0.1]:47729
Nov 15 07:15:20 saturn snmpd[2527]: Connection from UDP: [127.0.0.1]:47890
Nov 15 07:15:20 saturn snmpd[2527]: Received SNMP packet(s) from UDP: [127.0.0.1]:47890
Nov 15 07:15:20 saturn snmpd[2527]: Connection from UDP: [127.0.0.1]:50023
Nov 15 07:15:20 saturn snmpd[2527]: Received SNMP packet(s) from UDP: [127.0.0.1]:50023
Nov 15 07:15:20 saturn snmpd[2527]: Connection from UDP: [127.0.0.1]:58459
Nov 15 07:15:20 saturn snmpd[2527]: Received SNMP packet(s) from UDP: [127.0.0.1]:58459
Nov 15 10:10:10 saturn syslogd 1.4.1: restart.
Nov 15 10:10:11 saturn kernel: klogd 1.4.1, log source = /proc/kmsg started.
Nov 15 10:10:11 saturn kernel: Bootdata ok (command line is ro root=/dev/System/root)
Nov 15 10:10:11 saturn kernel: Linux version 2.6.18-92.1.17.el5xen (mockbuild@builder10.centos.org) (gcc version 4.1.2 20071124 (Red Hat 4.1.2-42)) #1 SMP Tue Nov 4 14:13:09 EST 2008
Nov 15 10:10:11 saturn kernel: BIOS-provided physical RAM map:
Nov 15 10:10:11 saturn kernel: Xen: 0000000000000000 - 00000001ef958000 (usable)
Nov 15 10:10:11 saturn kernel: DMI 2.4 present.
Nov 15 10:10:11 saturn kernel: ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Nov 15 10:10:11 saturn kernel: ACPI: LAPIC (acpi_id[0x03] lapic_id[0x02] enabled)
Nov 15 10:10:11 saturn kernel: ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
Nov 15 10:10:11 saturn kernel: ACPI: LAPIC (acpi_id[0x04] lapic_id[0x03] enabled)
Nov 15 10:10:11 saturn kernel: ACPI: LAPIC_NMI (acpi_id[0x01] dfl dfl lint[0x1])
Nov 15 10:10:11 saturn kernel: ACPI: LAPIC_NMI (acpi_id[0x02] dfl dfl lint[0x1])
Nov 15 10:10:11 saturn kernel: ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
Nov 15 10:10:11 saturn kernel: IOAPIC[0]: apic_id 2, version 32, address 0xfec00000, GSI 0-23
Nov 15 10:10:11 saturn kernel: ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
Nov 15 10:10:11 saturn kernel: ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
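The jump from the 07:15:20 snmpd entries straight to the 10:10:10 syslogd restart means nothing was logged at the moment of the hang. A quick way to check each incident is to pull the lines that precede every restart marker; a sketch, using a made-up sample file in place of /var/log/messages:

```shell
# Build a tiny sample log (hypothetical file name) and show the lines
# preceding each "syslogd ... restart." marker -- i.e. the last activity
# logged before each reboot.
cat > messages.sample <<'EOF'
Nov 15 07:15:20 saturn snmpd[2527]: Connection from UDP: [127.0.0.1]:59008
Nov 15 07:15:20 saturn snmpd[2527]: Received SNMP packet(s) from UDP: [127.0.0.1]:59008
Nov 15 10:10:10 saturn syslogd 1.4.1: restart.
Nov 15 10:10:11 saturn kernel: klogd 1.4.1, log source = /proc/kmsg started.
EOF
grep -B2 'syslogd.*restart' messages.sample
```

On the real box, the same grep against /var/log/messages shows what, if anything, was captured before each reboot.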
On Sat, Nov 15, 2008 at 3:16 AM, Rudi Ahlers rudiahlers@gmail.com wrote:
How do I debug the server, which runs CentOS 5.2, to see why it locks up? The CPU is an Intel Q9300 Core 2 Quad with 8 GB RAM, on an Intel motherboard.
Attach a local console to the video port and let us know what it says -- that will probably be very insightful (e.g., kernel panic, MCE, ...).
Next, run memtest86+ -- at least overnight. [Note: I've had less-than-stellar results with memtest86 recently; if it shows errors, you've got a problem big time, but if it doesn't show errors, you're still not 100% sure the memory is good. :-)] Is it ECC memory? If not, why not -- particularly given it is a critical server?
Are all the fans spinning -- particularly the CPU fan? Do you have lm-sensors enabled? Either create a script or use something like Munin to track things, and see whether fans, temperatures, and voltages are all stable and within range right up to the point of death.
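The "create a script" suggestion can be as small as a cron-driven logger; a minimal sketch, assuming lm_sensors works on the board (the script name, log path, and cron schedule are all arbitrary choices):

```shell
# Write a tiny logger script; run it from cron, e.g.:
#   */5 * * * * root /usr/local/sbin/log-sensors.sh
# Script and log paths here are examples, not a convention.
cat > log-sensors.sh <<'EOF'
#!/bin/sh
LOG=${1:-/var/log/sensors.log}
{
  date '+%b %d %H:%M:%S'
  sensors   # temperatures, fan speeds, voltages from lm_sensors
} >> "$LOG" 2>&1
EOF
chmod +x log-sensors.sh
```

After a lockup, the tail of the log shows whether temperatures or fan speeds were drifting before the hang.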
Can you easily swap power supplies? (Is the unit dual-powered or just a single supply?)
Clearly, just a start, but you get the idea of elementary, 101-level problem solving.
-rak-
On Sat, Nov 15, 2008 at 4:47 PM, Richard Karhuse rkarhuse@gmail.com wrote:
Attach a local console to the video port and let us know what it says -- that will probably be very insightful (e.g., kernel panic, MCE, ...).
Unfortunately, I can't leave a monitor attached to the server all the time. The server is in a shared cabinet at a 3rd-party ISP, and they lock the cabinets once we're done working on it. The last lockup was about 6 days ago, and the previous one about 8 days ago. There's no consistency.
How can I redirect all console output to a file instead?
I have lm-sensors installed, but it doesn't pick up the motherboard's sensors. All the fans were working when I last checked, but it's a 1U chassis, so it has limited airflow. I don't know whether it gets too hot or not. When I rebooted it, the temperature was about 45 degrees Celsius, but the lockup only happened about 6 days later. So I can't even sit there 24/7 to see what happens.
Rudi Ahlers wrote:
How can I redirect all console output to a file instead?
Configure a serial console, connect it to another system, and use something like minicom to log the console to a file. You can't really log to the local system in this situation, as you likely won't capture the event (if you did, you would have seen the error in the system logs).
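For reference, a CentOS 5 serial console takes only a few config lines; a sketch assuming ttyS0 at 115200 (the device and speed are assumptions, and a Xen dom0 additionally wants com1=115200,8n1 console=com1 on the xen.gz line):

```
# /boot/grub/grub.conf additions
serial --unit=0 --speed=115200
terminal --timeout=5 serial console

# append to the existing kernel line:
#   console=tty0 console=ttyS0,115200

# /etc/inittab -- optional login prompt on the serial port
S0:2345:respawn:/sbin/agetty ttyS0 115200 vt100
```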
In my experience most of these kinds of problems are related to bad ram.
If you're running CentOS 4.x, configure netdump to send the kernel dumps to another server; if you're using CentOS 5.x, configure kdump to store the dump to local disk.
Run memtest86 on the system for a few days; replace the system with a known-working one so you can take the broken system off-site from the ISP for diagnostics.
I like running cerberus (http://sourceforge.net/projects/va-ctcs/) as a burn-in tool; if the system can survive that running for a couple of days, it should be good. Running it against a hundred or so systems, I don't recall it taking longer than a few hours to crash a system that had a problem.
nate
On Sat, Nov 15, 2008 at 8:17 PM, nate centos@linuxpowered.net wrote:
CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos
That machine doesn't have a serial port (why do vendors think serial ports are obsolete?!), so is there any other way to send the logs to a different machine?
on 11-15-2008 11:59 AM Rudi Ahlers spake the following:
Does it have any out-of-band management, like Dell's DRAC or HP's iLO?
Scott Silva wrote:
Does it have any out-of-band management, like Dell's DRAC or HP's iLO?
in the original post he said...
The CPU is an Intel Q9300 Core 2 Quad, with 8 GB RAM, on an Intel Motherboard
and upon further questioning...
The motherboard is an Intel DG35EC - http://www.intel.com/products/desktop/motherboards/DG35EC/DG35EC-overview.ht...
which is purely a desktop board (onboard Intel graphics, etc).
on 11-17-2008 4:54 PM John R Pierce spake the following:
Sometimes you only have an older part of the thread to reply to... That all came in a different branch of the thread, so I didn't see it until I hit send.
SSSOOORRRRYY!.
On Sat, 2008-11-15 at 21:59 +0200, Rudi Ahlers wrote:
That machine doesn't have a serial port (why do vendors think serial ports are obsolete?!), so is there any other way to send the logs to a different machine?
You can send it to another machine's syslogd with netconsole. Check out:

initscripts: /etc/rc.d/init.d/netconsole
initscripts: /etc/sysconfig/netconsole
kernel-doc: /usr/share/doc/kernel-doc-2.6.18/Documentation/networking/netconsole.txt
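As a sketch, the netconsole setup boils down to two ends; the addresses below are placeholders, and note that netconsole only carries what the kernel manages to printk, so a hard hang may still log nothing:

```
# /etc/sysconfig/netconsole on the crashing server -- placeholder values
SYSLOGADDR=192.168.0.10    # machine that will receive the messages
SYSLOGPORT=514

# then: service netconsole start ; chkconfig netconsole on

# On the receiving machine, let syslogd accept remote messages
# (SYSLOGD_OPTIONS="-r" in /etc/sysconfig/syslog), or simply:
#   nc -u -l 514 > netconsole.log
```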
Good luck!
Rudi Ahlers wrote:
We have a server which has locked up about once a week for the past 3 weeks, without any warning, and the only way to recover it is to reset the server. This causes unwanted downtime, and often data loss as well.
How do I debug the server, which runs CentOS 5.2, to see why it locks up?
Are those the only logs you've got? Normally Linux is very chatty, and you get WARNING, PANIC, etc. messages. What kernel are you using? Does a previous kernel or the CentOS Plus kernel stop the problem?
Regards, Vandaman.
On Sat, Nov 15, 2008 at 7:26 PM, Vandaman vandaman2002-sk@yahoo.co.uk wrote:
Well, on a standard CentOS 5.2, /var/log/messages would be the place where problems like this get logged; where else can I get more info?
I've upgraded the kernel to xen.gz-2.6.18-92.1.18.el5, but I can only reboot the server tomorrow, during a planned maintenance window, and then see what it does.
Rudi Ahlers wrote:
Well, on a standard CentOS 5.2, /var/log/messages would be the place where problems like this get logged; where else can I get more info?
Tough to write to the disk when the kernel is crashing; ditto the network. That leaves VGA and serial ports, which can be written to by self-contained emergency crash routines...
IIRC, you said this was a Q9-something quad core... that's a desktop processor... does this server have ECC memory? (I ask because few desktop platforms do, while ECC is fairly standard on servers.) Without ECC, the system has no way of knowing it read bad data from the RAM, and if the bad data happens to be code, and that code happens to be in the kernel -- ka-RASH -- without any detection or warning, it leaps off into never-land, and you get a kernel fault, almost always resulting in...
kernel panic
system halted
...with no additional useful information available. With ECC memory, single-bit errors get corrected on the fly and log an ECC error event, while double-bit errors result in a system halt with a message indicating such.
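On hardware that does have ECC, the corrected/uncorrected counts described above are readable from Linux once an EDAC driver is loaded; a sketch (on a non-ECC board it simply reports that nothing was found):

```shell
# Read corrected-error counters from any loaded EDAC memory controller.
# The /sys path is where the 2.6.18 EDAC core exposes them.
counts=$(cat /sys/devices/system/edac/mc/mc*/ce_count 2>/dev/null)
if [ -n "$counts" ]; then
  msg="corrected ECC errors per controller: $counts"
else
  msg="no EDAC counters found (no ECC RAM or no EDAC driver loaded)"
fi
echo "$msg"
```

A steadily climbing ce_count is the early warning a non-ECC desktop board can never give you.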
On Sun, Nov 16, 2008 at 1:14 AM, John R Pierce pierce@hogranch.com wrote:
No, the motherboard doesn't support ECC RAM. The motherboard is an Intel DG35EC - http://www.intel.com/products/desktop/motherboards/DG35EC/DG35EC-overview.ht...
Rudi Ahlers wrote:
No, the motherboard doesn't support ECC RAM. The motherboard is an Intel DG35EC - http://www.intel.com/products/desktop/motherboards/DG35EC/DG35EC-overview.ht...
That's a midrange business desktop board. I use a DG33TL as my desktop -- same thing.
on 11-15-2008 3:32 PM John R Pierce spake the following:
It just doesn't pay to run critical systems on desktop hardware. Companies think they are saving money, until the downtime eats away any initial savings.
On Tue, Nov 18, 2008 at 2:52 AM, Scott Silva ssilva@sgvwater.com wrote:
Sure, but it also doesn't pay to purchase a 10-ton truck to move a 1-ton load :) The bottom line is, you purchase the hardware for the needs you have. Not every situation warrants a quad Xeon on a blade system. The problem is, I have another server with a slower CPU, half the RAM, and a Gigabyte motherboard, yet it can handle the same load.
This server runs 4 Xen VPSes, which I moved to the slower machine, and the slower machine handles the load very well. So where does the problem lie? With the "cheap desktop hardware"? I don't think so. Rather, I believe there's a hardware problem -- i.e. CPU / RAM / motherboard / PSU.
I have reinstalled the OS (CentOS 5.2) and swapped out the HDDs as well -- so those aren't causing the problem.
On Tue, Nov 18, 2008 at 9:20 AM, Rudi Ahlers rudiahlers@gmail.com wrote:
This comes down to the old question of "what is a server"?
Is a server (a) the most powerful, reliable, expensive computer equipment on the planet, or (b) a machine that serves something to other machines, i.e. a web / database / email / backup / print / etc. server?
And does it mean that if the motherboard is not a Tyan / Supermicro board, it's not a server? Come on, that is BS! My 15-year-old Pentium Pro (Socket 8 CPU) still serves very well as a firewall, and could double as a file server at any given moment. In fact, I think it's far more stable than many Dell servers I have worked on. Just because a company like Dell or Supermicro builds expensive components and offers a 4-hour support structure does not make them superior to other computer components.
On Tue, Nov 18, 2008 at 09:32:05AM +0200, Rudi Ahlers wrote:
This comes down to the old question of "what is a server"?
<rant deleted, mail trimmed> A server just "works", and provides a usable way to debug the OS whenever it's needed (mostly never). Cheap servers have at least a serial port, because that's the minimal device for interacting with the BIOS/OS. More expensive servers have some out-of-band management capabilities.
Most of the time they are not used, but when we **need** them, these "pluses" save time, which is what we value most (isn't it?).
But your server, your problems, and your choices.
Just my .2 cents
Tru
On Tue, Nov 18, 2008 at 10:14 AM, Tru Huynh tru@centos.org wrote:
Sure, I understand that. But then again, on my Dell servers, when I have problems, I sit with the same issues. And those expensive motherboards don't give me anything more than the cheaper ones. In fact, when the RAM failed on the Dells, they were unusable until I could get new RAM from a different supplier. With the cheaper board, I can drive down to the first PC shop and get new RAM.
on 11-18-2008 1:29 AM Rudi Ahlers spake the following:
That is one reason I stopped using Dell. The other reason had something to do with our Accounting department and Dell's insistence on using a "Dell Card" instead of a plain "net 30" account.
With my HP servers, if something goes south, HP will send a tech to fix it within 4 hours. The server gets a 48-hour burn-in before I even take delivery. Then I burn it in again to make sure something didn't come loose in shipment. Sure, servers cost more. If it runs a critical service that can't be down for even a 5-minute reboot, you just need to spend some money. Sure, things have failed: one server has had every hard drive replaced over a few years, but all under warranty, and since I have spares, there was no interruption to service. My T1 lines go down more often than the servers do.
My home firewall runs on an old re-used piece of equipment. If it goes down, big deal. The kids just can't play World of Warcraft until I fix it.
If the e-mail server at work goes down, I have the guy that signs my paycheck calling my cellphone at 2 AM to fix it.
Reliability is not cheap. And cheap isn't usually as reliable.
Rudi Ahlers wrote:
Sure, I understand that. But then again, on my Dell servers, when I have problems, I sit with the same issues. And those expensive motherboards don't give me anything more than the cheaper ones. In fact, when the RAM failed on the Dells, they were unusable until I could get new RAM from a different supplier. With the cheaper board, I can drive down to the first PC shop and get new RAM.
I suppose it depends on which Dells you have. The latest 1950 III systems we have come with moderately good diagnostics, similar to HP systems. The system log tells me which DIMM module is spitting out errors, so I don't need to go through the trouble of narrowing down which one(s) are bad.
I only started using Dell recently, since I started my new job in March; before that it was mostly HP and Supermicro. HP certainly has great-quality stuff, though you do generally pay quite a bit more for it. What the server is doing determines whether I'd really push for that level of quality; certainly anything that is a single point of failure I would want on a higher-quality system. I'm not sure whether Dell's motherboards go so far as to have diagnostic LEDs on them to point out which part is faulty. HP has been doing that for a long time now.
The latest HP G5s bring the LEDs to the front of the chassis, so you don't even have to open it up or load any software; you can just look at the front and see whether a DIMM is going bad, or a voltage regulator, or a PSU, or a CPU, etc. Earlier systems just had a generic health LED, which would say good/degraded/bad but couldn't give any information as to what was causing the problem.
Granted, not as useful for a remote location if nobody is on site to look at the LEDs, though for many smaller places that actually do have people on site on a regular basis it's really handy.
nate
Rudi Ahlers wrote:
No, the motherboard doesn't support ECC RAM. The motherboard is an Intel DG35EC - http://www.intel.com/products/desktop/motherboards/DG35EC/DG35EC-overview.ht...
I had a machine that would crash about once every week or two in normal operation. Memtest86+ found an error on the 2nd day of running. The worst part was that it left the RAID mirrors in a strange state that caused occasional problems for months, even after replacing the RAM.
Did you leave memtest86+ running for 2 days? I thought 1 or 2 cycles would be good enough?
I'm hoping to pick up the server in the next 2 hours; then I can see what happens when I run memtest86+ or other tests.
Rudi Ahlers wrote:
Yes, apparently RAM errors can be subtle and only appear when certain adjacent bit patterns are stored - or when the moon is in a certain phase or something.
On Tue, Nov 18, 2008 at 9:47 AM, Les Mikesell lesmikesell@gmail.com wrote:
When we burn in machines to try to find errors, we also go with the day-or-two run. The one fun thing we found was that many times it was temperature-related: a machine would crash in the rack, but when it was removed to a test bench it would not exhibit the issue. This was especially true when the machine under load would have both the CPU and the memory taxed, while during testing we could only really tax one or the other with the existing tools. So blocking a bit of the airflow in the lab to heat up the case, or being able to test in the same data-center environment, helped a lot.
Most errors show up either in the first 2 minutes of running a memory test, or when using one of the prime-number calculations; otherwise it takes a day or a few for them to appear.
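Taxing CPU and RAM together, as described above, can be scripted; a sketch assuming the userland memtester tool and a CPU burner such as burnP6 are installed (the tool names, the 1 GB size, and the four-core count are all assumptions to adjust for your box):

```shell
# Write a burn-in script: one memtester locking 1 GB, plus one CPU
# burner per core, all running together to generate heat and load.
cat > burnin.sh <<'EOF'
#!/bin/sh
memtester 1024M 10 > memtester.log 2>&1 &
for core in 1 2 3 4; do
  burnP6 &
done
wait
EOF
chmod +x burnin.sh
```

memtester can't reach memory the kernel or other processes hold, so it complements rather than replaces an offline memtest86+ run.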
Rob
Les Mikesell wrote:
Yes, apparently RAM errors can be subtle and only appear when certain adjacent bit patterns are stored - or when the moon is in a certain phase or something.
Don't forget cosmic rays
http://adsabs.harvard.edu/abs/1978ITNS...25.1166P
nate
nate wrote:
Don't forget cosmic rays
Yeah, but those don't stop when you replace the faulty RAM... Mine did, but the errors committed to disk kept randomly re-appearing mysteriously as the reads from the RAID1 alternated afterwards.
On Nov 18, 2008, at 6:05 PM, Les Mikesell lesmikesell@gmail.com wrote:
Ah, memory-mapped files -- another very good reason to use ECC on large-memory machines.
Also, if you identify bad memory and use software RAID1, it's better to break the mirror, fsck and fix, then rebuild the mirror, as there is no data-integrity check on RAID1.
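With Linux software RAID, that break/check/rebuild cycle looks roughly like this with mdadm; an outline only -- /dev/md0 built from sda1 and sdb1 is a made-up example, and the filesystem should be unmounted before the fsck:

```
mdadm /dev/md0 --fail /dev/sdb1     # drop one half of the mirror
mdadm /dev/md0 --remove /dev/sdb1
fsck -f /dev/md0                    # check/repair on the surviving half
mdadm /dev/md0 --add /dev/sdb1      # rebuild copies the repaired data back
```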
-Ross
Ross Walker wrote:
Ah, memory-mapped files -- another very good reason to use ECC on large-memory machines.
Normal ECC doesn't seem to be all that great, IMO, though I have been very impressed with HP's Advanced ECC; it seems much more resilient to memory errors. Bad RAM has been my #1 source of system failures over the past few years.
http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00256943/c00256943....
Though I think it's HP-specific; I haven't seen that technology anywhere else yet.
Of course there are memory mirroring and memory sparing technologies as well, though I've yet to run into any machines that actually used them.
nate
On Sat, Nov 15, 2008 at 08:13:24PM +0200, Rudi Ahlers wrote:
On Sat, Nov 15, 2008 at 7:26 PM, Vandaman vandaman2002-sk@yahoo.co.uk wrote:
Rudi Ahlers wrote:
We have a server which locks up about once a week (for the past 3
......
How do I debug the server, which runs CentOS 5.2 to see why it locks up?
Jumping into the middle of a long list of good ideas. Other things to try -- change the run level: if 5, switch to 3; if 3, switch to 5.
Reinstall the processor: remove the processor, clean the heat sink and processor of old thermal compound, correctly apply the best thermal grease you can get (I like Arctic Silver), reinstall the heat sink, and consider upgrading the heat sink if the chassis permits (more Cu is good).
Add thermal spreaders to your RAM. You want all the chips on a RAM stick at the same temp.
Run "chkconfig cpuspeed off" if it is on (powersaved on some distros); if it is off, toggle it on.
Turn off any special system-monitoring software tools. Things like I2C serial buses do not isolate simple read-only activity from things that might modify (shut down) the system. I have seen sites install the bluesmoke tools while the kernel already had EDAC installed; the two tools had overlapping, uncoordinated interactions with the hardware and would randomly shut down the system. Very new boards are almost never supported well, so consider going without. Read the EDAC info on the CentOS and RH sites.
Inspect and then tidy all cables; they can mess up air flow and cause thermal issues.
Reset the BIOS and check all the BIOS options. Check for a BIOS update from the vendor; when updating the BIOS, do an NVRAM reset, as the data structures of the old BIOS and the new one may differ. The keyboard sequence to reset a BIOS to all defaults may require a call to tech support. Call the vendor; you have a warranty on a new board.
Since a hardware tty is not possible, log in (ssh) and run a "while /bin/true" script that lets you see memory, processes, and the exact time things fail, or just run "top". It is also possible to have syslog log to the pty of an ssh session. When you return to the cage, plug in a terminal; if there is no screen saver or screen blanking, the graphics card may still display the last key bits of info, as long as X is not running.
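The "while /bin/true" watcher described above might look something like this; the one-second interval and log path are arbitrary choices, not part of the original suggestion:

```shell
#!/bin/sh
# Run inside an ssh session; timestamps memory usage and the top processes
# every second, so the tail of the log shows the state right before a hang.
LOG=/root/watch.log
while /bin/true; do
    date >> "$LOG"
    free -m >> "$LOG"
    # Ten busiest processes by CPU.
    ps axo pid,pcpu,pmem,comm --sort=-pcpu | head -10 >> "$LOG"
    sync                  # push the log to disk before a possible freeze
    sleep 1
done
```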
On Thu, Nov 20, 2008 at 10:09 AM, Nifty Cluster Mitch niftycluster@niftyegg.com wrote:
-- T o m M i t c h e l l Found me a new hat, now what?
Thanx Tom,
You gave some good ideas, and I've been through all of them. As a general rule of thumb, I only purchase RAM with factory-fitted heatsinks. The chassis is a 1U, so space is limited, and only the necessary cables are installed and already tidied up.
After spending another 2 days in the datacentre trying to figure this one out, I thought I'd take the machine to the office instead. It's just so much nicer working in the office :)
Top didn't help much, since I couldn't see what was wrong. But sitting at my desk and running some tests, I noticed that the fan was running so loud at times that I couldn't even talk to someone on the phone. This is when I realized that the Q9300 could be too big a processor for the fan that I have installed.
The fan that I have, is: http://www.dynatron-corp.com/products/cpucooler/cpucooler_model.asp?id=165
So, it looks like it's not really made for a Q9300 CPU, although their specs say it is.
As an interesting side note: with all the other servers and cabinets in the datacentre, the dB level is so high that it's difficult to pick out a fan that's blowing at full force the whole time. Only at the office could I hear it. My own PC is totally fan- and noise-free, so I could easily hear when the server's fan was running fine and when it was running at full speed; and that happened only once I started the VPSes on the server and could no longer ping / SSH it over the network. Top reported a load of 12 - 15, which is normally still workable, but with the overheating CPU I couldn't do a thing.
Rudi Ahlers wrote on Thu, 20 Nov 2008 10:30:53 +0200:
Top reported load to be 12 - 15, which is normally still workable, but with the overheating CPU, I couldn't do a thing.
If it's overheating there should be two things telling you this:
- sensors
- throttled CPU speed
Both are things you can check without ears or being in the datacenter. Deducing from fan noise that the CPU is overheating, or even that the fan is running at full power, isn't really convincing.
Kai
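On a CentOS 5-era box, Kai's two checks might look like the following; this assumes lm_sensors is installed and configured, and that the board exposes an ACPI thermal zone:

```shell
#!/bin/sh
# Check temperatures/fan speeds and look for a throttled CPU, over ssh.

# Current temperatures and fan speeds, if lm_sensors is set up:
sensors

# Current core clocks; a throttled Q9300 would show well below its rated 2.5 GHz:
grep MHz /proc/cpuinfo

# ACPI's view of the thermal zone, where the board exposes one:
cat /proc/acpi/thermal_zone/*/temperature
```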
On Thu, Nov 20, 2008 at 1:31 PM, Kai Schaetzl maillists@conactive.com wrote:
-- Kai Schätzl, Berlin, Germany Get your web at Conactive Internet Services: http://www.conactive.com
Hi Kai,
How do I check the sensors & throttling?
Rudi Ahlers wrote:
This is when I realized that the Q9300 CPU could be too big a processor for the fan that I have installed.
The fan that I have, is: http://www.dynatron-corp.com/products/cpucooler/cpucooler_model.asp?id=165
So, it looks like it's not really made for a Q9300 CPU, although their specs say it is
That fan says it's for CPUs up to 135 watts, and I don't think a Q9300 is anywhere near that, so unless that heatsink/fan grossly underperforms its spec'd capability, I don't think that's the problem. Yeah, the Q9300 is 95 W max, and that's with all 4 cores running heavy math...
HOWEVER: Intel desktop boards generally have a passive heatsink on the northbridge and expect the downdraft from a conventional CPU fan to cool it. Your 1U configuration might not be moving enough air past that northbridge. I know on my DG33TL in a desktop minitower the G33 northbridge runs pretty hot, and I had to arrange for some extra airflow past it, since I used a 'tower cooler' that blew the air sideways rather than down.
I still think running four instances of mprime (from www.mersenne.org), each bound to a different CPU affinity (-a0, -a1, -a2, -a3), and running the 'torture test' overnight will tell you a lot. Do this with Xen disabled, just the base system running at init 3. Any sort of computational or memory-timing-related glitch will show up as a numeric error and be logged by the program.
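John's overnight torture run could be launched roughly like this; taskset is used here for the per-core pinning instead of mprime's own affinity flags, to keep the sketch to options I'm sure of:

```shell
#!/bin/sh
# Four mprime torture tests, one pinned to each core of the Q9300.
# Run at init 3 with Xen disabled, as suggested above; check the logs
# in the morning for any reported numeric errors.
for cpu in 0 1 2 3; do
    taskset -c "$cpu" mprime -t > "/root/mprime-cpu$cpu.log" 2>&1 &
done
wait
```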
On Thu, Nov 20, 2008 at 10:38 AM, John R Pierce pierce@hogranch.com wrote:
Well, as you can imagine, I didn't think the CPU would or could be a problem when I installed it, given its lower wattage rating.
Looking at the motherboard layout, I wonder what will happen if I turn the CPU fan 90 degrees "down" so that it blows through the very large northbridge heatsink. I was even thinking of looking for a NB heatsink with a fan, but I don't know if those exist. I'll give that a try and see what happens.
The 2 butterfly fans are pretty useless, though, since they blow directly onto the memory modules and don't even come close to the CPU / NB.
On a side note, I'm able to touch the NB heatsink and it doesn't really feel that hot. Unfortunately I can't get lm_sensors to work on the MB, so I can't even tell the temps / fan speeds.
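For completeness, the usual lm_sensors bring-up on CentOS 5 goes roughly as follows; whether the board's sensor chip is supported by any driver at all is the open question here:

```shell
#!/bin/sh
# Hedged sketch of an lm_sensors setup on CentOS 5 (run as root).
yum install lm_sensors      # package name on CentOS 5
sensors-detect              # probe the I2C/ISA buses; answer the prompts
service lm_sensors start    # load the modules sensors-detect selected
sensors                     # should now print temperatures and fan speeds
```

If sensors-detect finds nothing, the chip is likely unsupported by that kernel's drivers, and BIOS hardware-monitor screens are the fallback.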