Hi --
I'm running a server which is usually stable, but every once in a while it hangs. The server is used as a file store using NFS and to run VMware machines.
I don't see anything in /var/log/messages or elsewhere to indicate any problem or offer any clue why the system was hung.
Any suggestions where I might look for a clue?
On 3/8/2011 11:24 AM, Michael Eager wrote:
Hi --
I'm running a server which is usually stable, but every once in a while it hangs. The server is used as a file store using NFS and to run VMware machines.
I don't see anything in /var/log/messages or elsewhere to indicate any problem or offer any clue why the system was hung.
Any suggestions where I might look for a clue?
Probably something hardware related. Bad memory, overheating, power supply, etc. I've even seen some rare cases where a bios update would fix it although it didn't make much sense for a machine to run for years, then need a firmware change.
Les Mikesell wrote:
On 3/8/2011 11:24 AM, Michael Eager wrote:
Hi --
I'm running a server which is usually stable, but every once in a while it hangs. The server is used as a file store using NFS and to run VMware machines.
I don't see anything in /var/log/messages or elsewhere to indicate any problem or offer any clue why the system was hung.
Any suggestions where I might look for a clue?
Probably something hardware related. Bad memory, overheating, power supply, etc. I've even seen some rare cases where a bios update would fix it although it didn't make much sense for a machine to run for years, then need a firmware change.
The system is on a UPS and temps seem reasonable. Locating a transient memory problem is time consuming. Identifying a power supply which sometimes spikes is even more difficult. I'd like to have a clue about the likely problem before shutting down the server for an extended period.
I'll set up sar and sensord to periodically log system status and see if this gives me a clue for the next time this happens.
On 3/8/2011 12:31 PM, Michael Eager wrote:
Any suggestions where I might look for a clue?
Probably something hardware related. Bad memory, overheating, power supply, etc. I've even seen some rare cases where a bios update would fix it although it didn't make much sense for a machine to run for years, then need a firmware change.
The system is on a UPS and temps seem reasonable. Locating a transient memory problem is time consuming. Identifying a power supply which sometimes spikes is even more difficult. I'd like to have a clue about the likely problem before shutting down the server for an extended period.
I'll set up sar and sensord to periodically log system status and see if this gives me a clue for the next time this happens.
The times I've seen things like that it would happen too quickly to log anything. One other possibility is an individual bad CPU fan, but then you might have to shut down completely for a while to wake it up.
on 10:31 Tue 08 Mar, Michael Eager (eager@eagerm.com) wrote:
Les Mikesell wrote:
On 3/8/2011 11:24 AM, Michael Eager wrote:
Hi --
I'm running a server which is usually stable, but every once in a while it hangs. The server is used as a file store using NFS and to run VMware machines.
I don't see anything in /var/log/messages or elsewhere to indicate any problem or offer any clue why the system was hung.
Any suggestions where I might look for a clue?
Probably something hardware related. Bad memory, overheating, power supply, etc. I've even seen some rare cases where a bios update would fix it although it didn't make much sense for a machine to run for years, then need a firmware change.
The system is on a UPS and temps seem reasonable. Locating a transient memory problem is time consuming.
Disable or remove half your RAM. If the problem persists, replace that RAM and remove the other half. If the problem resolves, the issue is likely in the half of the RAM you've removed. You can binary search through it, or RMA the lot if it's still under warranty.
Identifying a power supply which sometimes spikes is even more difficult.
Same drill. Replace the power supply, or on a dual-PS system, disable one, then the other. Follow procedure as for RAM.
I'd like to have a clue about the likely problem before shutting down the server for an extended period.
If the server is critical, get a vendor loaner and bench-test the equipment until the fault can be identified.
I'll set up sar and sensord to periodically log system status and see if this gives me a clue for the next time this happens.
At best, sar will tell you whether or not you're experiencing resource exhaustion. It's a valuable tool, but fairly coarse-grained. Cacti will give you better resolution and visualization than sar, particularly on CentOS (some distros now include sar graphing utilities; CentOS, to the best of my recollection, does not).
compdoc wrote:
I'm running a server which is usually stable, but every once in a while it hangs.
There can be many reasons for that. One thing I'm curious about - try looking at the reallocated sector count, and current pending sector count for your drives with smartctl.
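For example, something along these lines (assuming the first drive is /dev/sda; adjust the device names for your system):

  smartctl -H /dev/sda
  smartctl -A /dev/sda | grep -i -E 'Reallocated_Sector|Current_Pending'

The first command prints the drive's overall health assessment, the second just the two attributes mentioned above; repeat for each drive.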
Thanks for the suggestions. All disks show zero realloc sectors and pending sectors. Smartctl says no failures. Also, max temp was 48 C or less.
On Tue, Mar 8, 2011 at 12:24 PM, Michael Eager eager@eagerm.com wrote:
Hi --
I'm running a server which is usually stable, but every once in a while it hangs. The server is used as a file store using NFS and to run VMware machines.
I don't see anything in /var/log/messages or elsewhere to indicate any problem or offer any clue why the system was hung.
Any suggestions where I might look for a clue?
Please be more specific when you say it "hangs". Does it just pause for a minute and then continue working, or does it freeze completely until you reboot it? Does it respond to a "soft" reboot like Ctrl-Alt-Del, or do you need to hard power it off?
Since this is an NFS server I'm going to guess there might be a lot of IO. Maybe there is some large IO load going on, like maybe all your VMs are running anti-virus scan at the same time, or something like that.
To troubleshoot, I recommend installing the 'sar' utilities (yum install sysstat) and then reviewing the collected data using the 'ksar' utility (http://sourceforge.net/projects/ksar/). sar/ksar are good for tracking down acute problems.
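In case it helps, a rough sketch of the setup on CentOS 5 (the exact cron path is from memory):

  yum install sysstat
  sar -u                         # CPU utilization / iowait for today
  sar -q -f /var/log/sa/sa08     # load average and run queue for the 8th

The sysstat package drops a cron job (typically /etc/cron.d/sysstat) that samples every 10 minutes, so the data files under /var/log/sa/ accumulate on their own.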
Brian Mathis wrote:
On Tue, Mar 8, 2011 at 12:24 PM, Michael Eager eager@eagerm.com wrote:
Hi --
I'm running a server which is usually stable, but every once in a while it hangs. The server is used as a file store using NFS and to run VMware machines.
I don't see anything in /var/log/messages or elsewhere to indicate any problem or offer any clue why the system was hung.
Any suggestions where I might look for a clue?
Please be more specific when you say it "hangs". Does it just pause for a minute and then continue working, or does it freeze completely until you reboot it? Does it respond to a "soft" reboot like Ctrl-Alt-Del, or do you need to hard power it off?
System is unresponsive. Monitor blank, no response to keyboard, no response to remote ssh. Hit reset to reboot.
The only indication that I had that there was a problem (other than that attached systems were not accessing files) was that the fan(s) on the server were louder than normal.
Since this is an NFS server I'm going to guess there might be a lot of IO. Maybe there is some large IO load going on, like maybe all your VMs are running anti-virus scan at the same time, or something like that.
At the time, should be very low NFS load.
To troubleshoot, I recommend installing the 'sar' utilities (yum install sysstat) and then reviewing the collected data using the 'ksar' utility (http://sourceforge.net/projects/ksar/). sar/ksar are good for tracking down acute problems.
Thanks for the suggestion. I'll look into sar.
The only indication that I had that there was a problem (other than that attached systems were not accessing files) was that the fan(s) on the server were louder than normal.
Are you saying the fans were running faster than normal while it was hung? Or are they louder than usual even while its running?
Fans making noise can mean the fan isn't spinning as fast as it should because the bearing is failing. Be a good time to open the case to check to see that all fans are working...
compdoc wrote:
The only indication that I had that there was a problem (other than that attached systems were not accessing files) was that the fan(s) on the server were louder than normal.
Are you saying the fans were running faster than normal while it was hung? Or are they louder than usual even while its running?
They were louder than normal when hung, but returned to being quiet after the reboot.
Fans making noise can mean the fan isn't spinning as fast as it should because the bearing is failing. Be a good time to open the case to check to see that all fans are working...
Good idea.
Michael Eager wrote:
Brian Mathis wrote:
On Tue, Mar 8, 2011 at 12:24 PM, Michael Eager eager@eagerm.com wrote:
Hi --
I'm running a server which is usually stable, but every once in a while it hangs. The server is used as a file store using NFS and to run VMware machines.
I don't see anything in /var/log/messages or elsewhere to indicate any problem or offer any clue why the system was hung.
Any suggestions where I might look for a clue?
<snip>
System is unresponsive. Monitor blank, no response to keyboard, no response to remote ssh. Hit reset to reboot.
Suggestion 1: ->from the console<-, run setterm --powersave off. That way, even if you connect a monitor (in our, uh, "computer labs", we have a monitor-on-a-stick), you'll still see what's on the screen at the end, not the power-save blanking.
The only indication that I had that there was a problem (other than that attached systems were not accessing files) was that the fan(s) on the server were louder than normal.
Um. Um. What make is the server? We had that on some new Suns, where after working on them, the fans would spin up and *not* spin down to normal. The answer to that was, after powering them down, pull all the plugs, and leave them out for 20 sec or so....
mark
m.roth@5-cent.us wrote:
Michael Eager wrote:
Brian Mathis wrote:
On Tue, Mar 8, 2011 at 12:24 PM, Michael Eager eager@eagerm.com wrote:
Hi --
I'm running a server which is usually stable, but every once in a while it hangs. The server is used as a file store using NFS and to run VMware machines.
I don't see anything in /var/log/messages or elsewhere to indicate any problem or offer any clue why the system was hung.
Any suggestions where I might look for a clue?
<snip> System is unresponsive. Monitor blank, no response to keyboard, no response to remote ssh. Hit reset to reboot.
Suggestion 1: ->from the console<-, run setterm --powersave off. That way, even if you connect a monitor (in our, uh, "computer labs", we have a monitor-on-a-stick), you'll still see what's on the screen at the end, not the power-save blanking.
I get a message "cannot (un)set powersave mode".
I'll add this to .xinitrc.
The only indication that I had that there was a problem (other than that attached systems were not accessing files) was that the fan(s) on the server were louder than normal.
Um. Um. What make is the server? We had that on some new Suns, where after working on them, the fans would spin up and *not* spin down to normal. The answer to that was, after powering them down, pull all the plugs, and leave them out for 20 sec or so....
House-built, Gigabyte MB, AMD Phenom II X6, 6GB RAM.
Michael Eager wrote:
m.roth@5-cent.us wrote:
Suggestion 1: ->from the console<-, run setterm --powersave off. That way, even if you connect a monitor (in our, uh, "computer labs", we have a monitor-on-a-stick), you'll still see what's on the screen at the end, not the power-save blanking.
I get a message "cannot (un)set powersave mode".
I'll add this to .xinitrc.
Or better, CTRL-ALT-F1 to switch to a text console and run "setterm -powersave off".
Michael Eager wrote:
m.roth@5-cent.us wrote:
Michael Eager wrote:
Brian Mathis wrote:
On Tue, Mar 8, 2011 at 12:24 PM, Michael Eager eager@eagerm.com wrote:
I'm running a server which is usually stable, but every once in a while it hangs. The server is used as a file store using NFS and to run VMware machines.
I don't see anything in /var/log/messages or elsewhere to indicate any problem or offer any clue why the system was hung.
Any suggestions where I might look for a clue?
<snip> System is unresponsive. Monitor blank, no response to keyboard, no response to remote ssh. Hit reset to reboot.
Suggestion 1: ->from the console<-, run setterm --powersave off. That way, even if you connect a monitor (in our, uh, "computer labs", we have a monitor-on-a-stick), you'll still see what's on the screen at the end, not the power-save blanking.
I get a message "cannot (un)set powersave mode".
Did you do it from the console? It won't work (or at least neither my manager nor I have figured out how to do it) remotely.
I'll add this to .xinitrc.
Um. This isn't X, it's below that.
The only indication that I had that there was a problem (other than that attached systems were not accessing files) was that the fan(s) on the server were louder than normal.
Um. Um. What make is the server? We had that on some new Suns, where after working on them, the fans would spin up and *not* spin down to normal. The answer to that was, after powering them down, pull all the plugs, and leave them out for 20 sec or so....
House-built, Gigabyte MB, AMD Phenom II X6, 6GB RAM.
Any chance the problem's with the video card?
mark
m.roth@5-cent.us wrote:
Michael Eager wrote:
House-built, Gigabyte MB, AMD Phenom II X6, 6GB RAM.
Any chance the problem's with the video card?
Video is on the MB. It doesn't seem likely that it's the video, since the system doesn't respond to network when it crashes.
It could be anything. That's why I'm looking for something that would give me a bit of a hint what to look at. With an infrequent failure, it's not practical to replace components piecemeal.
Video is on the MB. It doesn't seem likely that it's the video, since the system doesn't respond to network when it crashes.
Bad video hardware or drivers can easily crash the system.
If it's running an X display of any sort, I'd suggest trying it in text-only mode. In /etc/inittab, set the default runlevel to 3 instead of 5. This leaves the video in plain VGA text mode, which is far less likely to crash the system.
id:3:initdefault:
Bonus: if this is a server and that's a shared-memory video system, disabling the graphics modes reduces memory bus contention, speeding up the whole system by some percentage.
John R Pierce wrote:
Video is on the MB. It doesn't seem likely that it's the video, since the system doesn't respond to network when it crashes.
Bad video hardware or drivers can easily crash the system.
If it's running an X display of any sort, I'd suggest trying it in text-only mode. In /etc/inittab, set the default runlevel to 3 instead of 5. This leaves the video in plain VGA text mode, which is far less likely to crash the system.
id:3:initdefault:
Seconded. If it's a server, it doesn't really need X running anyway.
mark
m.roth@5-cent.us wrote:
Michael Eager wrote:
House-built, Gigabyte MB, AMD Phenom II X6, 6GB RAM.
Any chance the problem's with the video card?
Video is on the MB. It doesn't seem likely that it's the video, since the system doesn't respond to network when it crashes.
It could be anything. That's why I'm looking for something that would give me a bit of a hint what to look at. With an infrequent failure, it's not practical to replace components piecemeal.
While you open the case, check for the bulging capacitor problem. It will have the effect you describe, freezing up the system so that even BIOS routines don't work (your fans). If that's the case, replace the mainboard.
On Wed, Mar 9, 2011 at 10:24 AM, Leen de Braal ldb@braha.nl wrote:
m.roth@5-cent.us wrote:
Michael Eager wrote:
House-built, Gigabyte MB, AMD Phenom II X6, 6GB RAM.
Any chance the problem's with the video card?
Video is on the MB. It doesn't seem likely that it's the video, since the system doesn't respond to network when it crashes.
It could be anything. That's why I'm looking for something that would give me a bit of a hint what to look at. With an infrequent failure, it's not practical to replace components piecemeal.
While you open the case, check for the bulging capacitor problem. It will have the effect you describe, freezing up the system so that even BIOS routines don't work (your fans). If that's the case, replace the mainboard.
Or replace the CAPS if you're not afraid of a soldering iron :)
On Wednesday, March 09, 2011 03:24:48 am Leen de Braal wrote:
While you open the case, check for the bulging capacitor problem. It will have the effect you describe, freezing up the system so that even BIOS routines don't work (your fans). If that's the case, replace the mainboard.
I've seen capacitor problems in the past, and they can be rather interesting.
What the caps do is open up, electrically speaking, meaning they can no longer smooth out the ripple in the output of the switching regulator; this ripple is very high frequency due to the switching regulator's design. As the CPU draws more current (which happens when it's loaded, of course, since MOS gates by design consume the most power while switching, thanks to the capacitor charging time constants on the gates of the transistors themselves), the switching regulator has to supply more current, and if the caps are open they can't smooth out the deeper ripple.
I actually had one motherboard blow two caps; one of the cases of one of the blown capacitors was violently ejected off of the 'guts' of the cap, hard enough that it dented the PC's case from the inside.
The PC kept running, until it was put under load, then it would lock up. When the second cap blew, about an hour later, the PC hung; it would power up and run POST, and even run the BIOS setup's memory check and health check, but as soon as the CPU was shifted into protect mode as the OS booted it would hard hang due to the CPU's increased current draw overwhelming the ripple absorbing capacity of the remaining good capacitors on the CPU's switching regulator.
There's really only one way to determine this, and that's by putting an oscilloscope on the CPU's power supply output rails and looking for ripple while running a CPU burnin program. The hard part of that is actually finding a good place to measure the output, thanks to the typical motherboard's multilayer design.
And while with the proper desoldering equipment and training/experience one can re-cap a motherboard, I would not recommend doing so for a critical server, unless you want and can assume personal liability for that server's operation. Better to get a new motherboard with a warranty. For a personal server that isn't going to open you up to personal liability if it breaks, sure, you can re-cap if you'd like and have the patience, time, equipment, and experience necessary to work on 6- to 8-layer PC boards, which may be soldered with RoHS lead-free solder, which requires special techniques. Otherwise, as you said, you can damage the 'vias' (that is, the plated through-holes the capacitor leads solder to, which may be used to connect to internal layers that you can't resolder) very easily.
on 3/8/2011 10:20 AM Michael Eager spake the following:
Brian Mathis wrote:
On Tue, Mar 8, 2011 at 12:24 PM, Michael Eager eager@eagerm.com wrote:
Hi --
I'm running a server which is usually stable, but every once in a while it hangs. The server is used as a file store using NFS and to run VMware machines.
I don't see anything in /var/log/messages or elsewhere to indicate any problem or offer any clue why the system was hung.
Any suggestions where I might look for a clue?
Please be more specific when you say it "hangs". Does it just pause for a minute and then continue working, or does it freeze completely until you reboot it? Does it respond to a "soft" reboot like Ctrl-Alt-Del, or do you need to hard power it off?
System is unresponsive. Monitor blank, no response to keyboard, no response to remote ssh. Hit reset to reboot.
The only indication that I had that there was a problem (other than that attached systems were not accessing files) was that the fan(s) on the server were louder than normal.
Since this is an NFS server I'm going to guess there might be a lot of IO. Maybe there is some large IO load going on, like maybe all your VMs are running anti-virus scan at the same time, or something like that.
At the time, should be very low NFS load.
To troubleshoot, I recommend installing the 'sar' utilities (yum install sysstat) and then reviewing the collected data using the 'ksar' utility (http://sourceforge.net/projects/ksar/). sar/ksar are good for tracking down acute problems.
Thanks for the suggestion. I'll look into sar.
Did you try the obvious stuff for older equipment? Remove and reseat ALL cards and memory, several times, to clean off any oxidation from contacts. Blow out any dust and collected lint. Reseat drive cables.
Scott Silva wrote:
Did you try the obvious stuff for older equipment? Remove and reseat ALL cards and memory, several times, to clean off any oxidation from contacts. Blow out any dust and collected lint. Reseat drive cables.
Not yet, but that's always a good idea.
on 09:24 Tue 08 Mar, Michael Eager (eager@eagerm.com) wrote:
Hi --
I'm running a server which is usually stable, but every once in a while it hangs. The server is used as a file store using NFS and to run VMware machines.
I don't see anything in /var/log/messages or elsewhere to indicate any problem or offer any clue why the system was hung.
Any suggestions where I might look for a clue?
I'd very strongly recommend you configure netconsole. Though not entirely clear from the name, it's actually an in-kernel network logging module, which is very useful for kicking out kernel panics which otherwise aren't logged to disk and can't be seen on a (nonresponsive) monitor.
Alternately, a serial console which actually retains all output sent to it (some remote access systems support this, some don't) may help.
Barring that, I'd start looking at individual HW components, starting with RAM.
The trick is in passing the appropriate parameters to the module at load time. I found it helpful to have an @reboot cron job to do this.
You'll need to pass the local port, local system IP, local network device, remote syslog UDP port, remote syslog IP, and the /gateway/ MAC address, where gateway is the syslogd (if on a contiguous ethernet segment), or your network gateway host, if not. Some parsing magic can determine these values for you.
Good article describing configuration:
http://www.cyberciti.biz/tips/linux-netconsole-log-management-tutorial.html
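To make that concrete, the end result is a single modprobe roughly like this (the addresses, interface, and MAC below are made up; substitute your own values):

  modprobe netconsole netconsole=6666@192.168.1.50/eth0,514@192.168.1.10/00:11:22:aa:bb:cc

That is, local-port@local-IP/interface on the left of the comma, and remote-syslog-port@remote-IP/gateway-MAC on the right.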
If you're not already remote-logging all other activity, I'd do that as well. You might catch the start of the hang, if not all of it.
On Tuesday, March 08, 2011 04:44:54 pm Dr. Ed Morbius wrote:
I'd very strongly recommend you configure netconsole.
Ok, now this is useful indeed. Thanks for the information, even though I'm not the OP.... While I suspected the facility might be there, I hadn't really dug for it, but if this will catch things after filesystems go r/o (ext3 journal things, ya know) it could be worth its weight in gold for catching kernel errors from VMware guests (serial console not really an option with the hosts I have, although I'm sure some enterprising soul has figured out how to redirect the VM guest serial port to something else....).
on 10:05 Wed 09 Mar, Lamar Owen (lowen@pari.edu) wrote:
On Tuesday, March 08, 2011 04:44:54 pm Dr. Ed Morbius wrote:
I'd very strongly recommend you configure netconsole.
Ok, now this is useful indeed. Thanks for the information, even though I'm not the OP.... While I suspected the facility might be there, I hadn't really dug for it, but if this will catch things after filesystems go r/o (ext3 journal things, ya know) it could be worth its weight in gold for catching kernel errors from VMware guests (serial console not really an option with the hosts I have,
Yep, it is.
Netconsole made me fall in love with Linux all over again.
although I'm sure some enterprising soul has figured out how to redirect the VM guest serial port to something else....).
Dr. Ed Morbius wrote:
on 09:24 Tue 08 Mar, Michael Eager (eager@eagerm.com) wrote:
Hi --
I'm running a server which is usually stable, but every once in a while it hangs. The server is used as a file store using NFS and to run VMware machines.
I don't see anything in /var/log/messages or elsewhere to indicate any problem or offer any clue why the system was hung.
Any suggestions where I might look for a clue?
I'd very strongly recommend you configure netconsole. Though not entirely clear from the name, it's actually an in-kernel network logging module, which is very useful for kicking out kernel panics which otherwise aren't logged to disk and can't be seen on a (nonresponsive) monitor.
I'll take a look at netconsole.
Alternately, a serial console which actually retains all output sent to it (some remote access systems support this, some don't) may help.
Barring that, I'd start looking at individual HW components, starting with RAM.
The problem with randomly replacing various components, other than the downtime and nuisance, is that there's no way to know that the change actually fixed any problem. When the base rate is one unknown system hang every few weeks, how many weeks should I wait without a failure to conclude that the replaced component was the cause? A failure which happens infrequently isn't really amenable to a random diagnostic approach.
On Wed, 9 Mar 2011, Michael Eager wrote:
The problem with randomly replacing various components, other than the downtime and nuisance, is that there's no way to know that the change actually fixed any problem. When the base rate is one unknown system hang every few weeks, how many weeks should I wait without a failure to conclude that the replaced component was the cause? A failure which happens infrequently isn't really amenable to a random diagnostic approach.
So you pitch the whole thing over to being a test rig, and buy all new hardware?
jh
centos-bounces@centos.org wrote:
On Wed, 9 Mar 2011, Michael Eager wrote:
The problem with randomly replacing various components, other than the downtime and nuisance, is that there's no way to know that the change actually fixed any problem. When the base rate is one unknown system hang every few weeks, how many weeks should I wait without a failure to conclude that the replaced component was the cause? A failure which happens infrequently isn't really amenable to a random diagnostic approach.
So you pitch the whole thing over to being a test rig, and buy all new hardware?
This would be far cheaper than the time spent troubleshooting the running (sometimes hanging) system. Starting with RAM and Power Supply is not random ... They're "The Usual Suspects".
Insert spiffy .sig here: Life is complex: it has both real and imaginary parts.
On Wednesday, March 09, 2011 10:16:34 am Brunner, Brian T. wrote:
This would be far cheaper than the time spent troubleshooting the running (sometimes hanging) system.
Let me interject here, that from a budgeting standpoint 'cheaper' has to be interpreted in the context of which budget the costs are coming out of. New hardware is capex, and thus would come out of the capital budget, and admin time is opex, and thus would come out of the operating budget. There may be sufficient funds in the operating budget to pay an admin $x,000 but the funds in the capital budget may be insufficient to buy a server costing $y,000, where y=x. And if this is an educational institution, and there are grants involved, it may be the reverse situation. So 'cheaper' only has meaning when the costs are coming out of the same budget. So, yes, while it's easy for a single-budget entity to make this decision, it's not so easy when you have multiple budgets involved with different spending parameters and different funding entities.
Starting with RAM and Power Supply is not random ... They're "The Usual Suspects".
This is a very true statement.
Heat and airflow are two others.
Lamar Owen wrote:
On Wednesday, March 09, 2011 10:16:34 am Brunner, Brian T. wrote:
<snip>
Starting with RAM and Power Supply is not random ... They're "The Usual Suspects".
This is a very true statement.
Heat and airflow are two others.
Hmmm... has the a/c been changed lately? Or maybe stuff outside the rack been moved, and so obstructed the airflow?
mark
On Wednesday, March 09, 2011 10:48:29 am m.roth@5-cent.us wrote:
Lamar Owen wrote:
Heat and airflow are two others.
Hmmm... has the a/c been changed lately? Or maybe stuff outside the rack been moved, and so obstructed the airflow?
To followup a little, I had a motherboard one time, with a factory-installed CPU, heatsink, and fan, that would not run for more than four or five hours before hanging. This motherboard was in a system that was donated to us as being 'flaky' so I don't know the warranty status or what the original owner had or had not done, but it did have a factory seal sticker strip between the heatsink and the CPU and the motherboard socket, and that sticker was tamper-evident type, and there had been no tampering.
I decided I would refresh the heatsink compound; even if it were still covered by the warranty, that would only have been valid for the original purchaser anyway. So I pulled the sticker strip, which left little 'voids' on things, and pulled the heatsink. At that point I laughed so hard I cried, as the heatsink still had the clear plastic protector film between the CPU and the heatsink compound. From the factory. I pulled the film, reinstalled the heatsink, and that system is and has been rock-solid stable for several years.
The issue of dust buildup follows from the heat and airflow.
There is another potential culprit, though, especially if this system has been in a raised floor environment, that some might find odd. That culprit, or, rather, those culprits, are zinc whiskers. Also, the metal components in the electronics themselves can exude whiskers; see the wikipedia article on the subject for more information ( https://secure.wikimedia.org/wikipedia/en/wiki/Whisker_%28metallurgy%29 )
centos-bounces@centos.org wrote:
On Wednesday, March 09, 2011 10:16:34 am Brunner, Brian T. wrote:
This would be far cheaper than the time spent troubleshooting the running (sometimes hanging) system.
Let me interject here, that from a budgeting standpoint 'cheaper' has to be interpreted in the context of which budget the costs are coming out of.
This degenerates into "Your dollars are cheaper than my dollars".
New hardware is capex, and thus would come out of the capital budget, and admin time is opex, and thus would come out of the operating budget.
This is where mental ossification amongst bean-counters can kill a company. "Economic Opportunity Cost" should raise its head here: What would we do with the $capex if we paid $opex vs what would we do with the $opex if we paid $capex. "The Time Value of Money vs The Money Value of Time" is another phrasing of this point-of-view. Unfortunately this is no longer a CentOS topic.
Starting with RAM and Power Supply is not random ... They're "The Usual Suspects".
This is a very true statement.
Heat and airflow are two others.
RAM and power supply are easy starting points: swap RAM between two systems and see (in the next 3 months) if the problem moved; swapping power supplies is a bit trickier but doable if the systems are similar enough. Again, several months of watching to see where the problem manifests is a test of patience and diligence. It's possible that doing this will make the problem stop arising (RAM and PS are both good enough, they just don't play well together).
Heat & airflow are harder to swap (says the guy who opened an office desktop, and vacuumed out enough hair, lint, dust, dander, and ashes to knit a grey angora hamster (with lung cancer)).
Insert spiffy .sig here: Life is complex: it has both real and imaginary parts.
On 3/9/2011 9:55 AM, Brunner, Brian T. wrote:
This is where mental ossification amongst bean-counters can kill a company. "Economic Opportunity Cost" should raise its head here: What would we do with the $capex if we paid $opex vs what would we do with the $opex if we paid $capex. "The Time Value of Money vs The Money Value of Time" is another phrasing of this point-of-view. Unfortunately this is no longer a CentOS topic.
The admin/operator's time is usually seen as a fixed cost and keeping a machine working is not supposed to take unplanned time. So, if you want to keep something running you really need to buy 3 of them in the first place. One as primary in production, one as a backup, and one to be developing/testing the next version on. In some cases you can replace the third one with a virtual setup, and you might be able to have one backup as a spare for more than one live server but you can't skimp much more than that. Everything breaks, so if one thing breaking causes a big problem, it wasn't planned realistically. This should be a 'swap in the backup' while you run extensive diagnostics or get a warranty repair on the broken thing. And if you are running Centos the one thing you don't need is to pay for extra licenses to cover the backup/development instances.
On Wednesday, March 09, 2011 11:45:06 am Les Mikesell wrote:
And if you are running Centos the one thing you don't need is to pay for extra licenses to cover the backup/development instances.
And this is significant, and really highlights the reasoning of the CentOS team in 'bug-for-bug' binary compatibility with the upstream EL.
That is, in your hypothetical 'three of everything' approach you'd run a fully entitled copy of the upstream on the production unit, and save costs by running CentOS on the backup and the backup backup.
This is another fine financial point, and I'll not use the semi-derogatory 'bean counters' thing, because some money really is cheaper than other money, and I'm not making that up, it is reality. In particular, capital can be donated, but rarely will opex be donation-driven. I have quite a bit of donated capital here, capital that I don't have replacement capex budget for. Also, many grants are awarded with 'capex-only' stipulations in the awards; it is a violation of the grant agreement to use that grant money on opex. Likewise, there are some grants that have exactly the opposite stipulation, and there are a few that have both, and have further direct versus indirect opex stipulations.
The point is that CentOS saves on opex; not personnel opex, but subscription opex. Support subscriptions are opex, not capex. And while that fine of a point might be lost to some, it is a point I deal with on virtually a daily basis. I literally have to think about that distinction, and the various grant stipulations for monies that fund my salary, when filling out my biweekly timesheet; though salaried I am, that salary is funded between several grants, and most of those have different direct versus indirect cost budgets.
And helping keep things simpler is something that CentOS has helped me in significant ways.
on 10:37 Wed 09 Mar, Lamar Owen (lowen@pari.edu) wrote:
On Wednesday, March 09, 2011 10:16:34 am Brunner, Brian T. wrote:
This would be far cheaper than the time spent troubleshooting the running (sometimes hanging) system.
Let me interject here, that from a budgeting standpoint 'cheaper' has to be interpreted in the context of which budget the costs are coming out of. New hardware is capex, and thus would come out of the capital budget, and admin time is opex, and thus would come out of the operating budget. There may be sufficient funds in the operating budget to pay an admin $x,000 but the funds in the capital budget may be insufficient to buy a server costing $y,000, where y=x.
That represents an accounting failure, as opex is now subsidizing capex. Troubleshooting of known bad equipment should be an opex chargeback against capex or some capital reserve.
This requires clueful beancounters. Recent economic/business/finance history suggests a significant shortage of same. Cue supply/demand and incentives off-topic digression.
The answer is still to communicate the issue upstream. Estimating replacement costs and likelihood will help in the relevant business / organizational decision.
On 3/9/2011 12:47 PM, Dr. Ed Morbius wrote:
That represents an accounting failure, as opex is now subsidizing capex. Troubleshooting of known bad equipment should be an opex chargeback against capex or some capital reserve.
This requires clueful beancounters. Recent economic/business/finance history suggests a significant shortage of same. Cue supply/demand and incentives off-topic digression.
Statistical stuff doesn't play out well in one-off situations. If you have a large number of boxes you'll know about the right amount of spare parts and on-hand spares you need. But individual units are like light bulbs in that they break at random, and if the only one you have breaks today it won't matter that their average life is in years.
John Hodrien wrote:
On Wed, 9 Mar 2011, Michael Eager wrote:
The problem with randomly replacing various components, other than the downtime and nuisance, is that there's no way to know that the change actually fixed any problem. When the base rate is one unknown system hang every few weeks, how many weeks should I wait without a failure to conclude that the replaced component was the cause? A failure which happens infrequently isn't really amenable to a random diagnostic approach.
So you pitch the whole thing over to being a test rig, and buy all new hardware?
I'll repeat from my original post:
I don't see anything in /var/log/messages or elsewhere to indicate any problem or offer any clue why the system was hung.
Any suggestions where I might look for a clue?
I'm looking for diagnostics to focus on the cause of the crash. My thanks for the several suggestions in this area.
I'm not particularly interested in a listing of the myriad hypothetical causes that lack observable evidence, some of which are contradicted by evidence (such as overheating).
I've encountered my share of bad power supplies, bad RAM, poorly seated cards, etc. I've replaced failing capacitors in monitors (never on a motherboard). I've replaced video cards, hard drives, bad cables. And so forth. Each of these had characteristics which pointed to the problem: kernel oops, POST failures, flickering screens, etc. The problem I have is that there is a lack of diagnostic information to focus on the cause of the server failure.
I don't mean to appear unappreciative, but suggestions which amount to spending many hours making a series of unfocused modifications to the server, hoping that one of these random alterations fixes an infrequent problem, don't strike me as useful. At the other extreme, the suggestions that I not look for the cause of the system failure and instead replace the server with one or three servers don't seem to be a useful diagnostic approach either.
During the next server downtime, I'll re-seat RAM and cables, check for excess dust, and do normal maintenance as folks have suggested. I might also run a memory diag. I'll also look at the several excellent and appreciated suggestions (some of which I've already installed) on how to get a better picture on the state of the server when/if there is a future failure.
Thanks all!
Michael Eager wrote:
John Hodrien wrote:
On Wed, 9 Mar 2011, Michael Eager wrote:
The problem with randomly replacing various components, other than the downtime and nuisance, is that there's no way to know that the change actually fixed any problem. When the base rate is one unknown system hang every few weeks, how many weeks should I wait without a failure to conclude that the replaced component was the cause? A failure which happens infrequently isn't really amenable to a random diagnostic approach.
So you pitch the whole thing over to being a test rig, and buy all new hardware?
I'll repeat from my original post:
I don't see anything in /var/log/messages or elsewhere to indicate any problem or offer any clue why the system was hung. Any suggestions where I might look for a clue?
I'm looking for diagnostics to focus on the cause of the crash. My thanks for the several suggestions in this area.
I'm not particularly interested in a listing of the myriad hypothetical causes that lack observable evidence, some of which are contradicted by evidence (such as overheating).
<snip> Here's one more, off-the-wall thought: do the setterm --powersave off, and find some way to make it work, so that you can see what's on the screen when it dies. What may be very important here is I recently had a problem with a honkin' big server crashing... and it turned out that a user was running a parallel processing job that kicked off three? four? dozen threads, and towards the end of the job, every single thread wanted 10G... on a system with 256G RAM (which size still boggles my mind). The OOM-Killer didn't even have a chance to do its thing.... Yes, he's limited what his job requests, and the system hasn't crashed since.
mark
m.roth@5-cent.us wrote:
Michael Eager wrote:
John Hodrien wrote:
On Wed, 9 Mar 2011, Michael Eager wrote:
The problem with randomly replacing various components, other than the downtime and nuisance, is that there's no way to know that the change actually fixed any problem. When the base rate is one unknown system hang every few weeks, how many weeks should I wait without a failure to conclude that the replaced component was the cause? A failure which happens infrequently isn't really amenable to a random diagnostic approach.
So you pitch the whole thing over to being a test rig, and buy all new hardware?
I'll repeat from my original post:
I don't see anything in /var/log/messages or elsewhere to indicate any problem or offer any clue why the system was hung. Any suggestions where I might look for a clue?
I'm looking for diagnostics to focus on the cause of the crash. My thanks for the several suggestions in this area.
I'm not particularly interested in a listing of the myriad hypothetical causes that lack observable evidence, some of which are contradicted by evidence (such as overheating).
<snip> Here's one more, off-the-wall thought: do the setterm --powersave off, and find some way to make it work, so that you can see what's on the screen when it dies.
Yes, I did this. Switched to console screen. The correct command is "setterm -powersave off -blank off", otherwise the screen gets blanked. Turned the monitor off. I hope it shows something useful on the next fault.
What may be very important here is I recently had a problem with a honkin' big server crashing... and it turned out that a user was running a parallel processing job that kicked off three? four? dozen threads, and towards the end of the job, every single thread wanted 10G... on a system with 256G RAM (which size still boggles my mind). The OOM-Killer didn't even have a chance to do its thing.... Yes, he's limited what his job requests, and the system hasn't crashed since.
Strange. OOM-Killer should get priority. That's what it's for. Although it usually seems to kill the innocent bystanders before it gets around to killing the offenders.
Michael Eager wrote:
m.roth@5-cent.us wrote:
Michael Eager wrote:
John Hodrien wrote:
On Wed, 9 Mar 2011, Michael Eager wrote:
<snip> Here's one more, off-the-wall thought: do the setterm --powersave off, and find some way to make it work, so that you can see what's on the
screen
when it dies.
Yes, I did this. Switched to console screen. The correct command is "setterm -powersave off -blank off", otherwise the screen gets blanked. Turned the monitor off. I hope it shows something useful on the next fault.
Best of luck. And thanks, I may try that.
What may be very important here is I recently had a problem with a honkin' big server crashing... and it turned out that a user was running a parallel processing job that kicked off three? four? dozen threads, and towards the end of the job, every single thread wanted 10G... on a system with 256G RAM (which size still boggles my mind). The OOM-Killer didn't even have a chance to do its thing.... Yes, he's limited what his job requests, and the system hasn't crashed since.
Strange. OOM-Killer should get priority. That's what it's for. Although it usually seems to kill the innocent bystanders before it gets around to killing the offenders.
Yeah, but apparently too many of them hit too quickly - that's all I can think.
mark
On 3/9/2011 11:32 AM, Michael Eager wrote:
I'm not particularly interested in a listing of the myriad hypothetical causes that lack observable evidence, some of which are contradicted by evidence (such as overheating).
Note that overheating can be localized, e.g. a bad heat-sink mounting or fan on a CPU.
I've encountered my share of bad power supplies, bad RAM, poorly seated cards, etc. I've replaced failing capacitors in monitors (never on a motherboard). I've replaced video cards, hard drives, bad cables. And so forth. Each of these had characteristics which pointed to the problem: kernel oops, POST failures, flickering screens, etc. The problem I have is that there is a lack of diagnostic information to focus on the cause of the server failure.
Anything that happens quickly isn't going to show up in a log.
I don't mean to appear unappreciative, but suggestions which amount to spending many hours making a series of unfocused modifications to the server, hoping that one of these random alterations fixes an infrequent problem, don't strike me as useful. At the other extreme, the suggestions that I not look for the cause of the system failure and instead replace the server with one or three servers don't seem to be a useful diagnostic approach either.
There's not really a good way to approach intermittent failures. It may only break when you aren't looking. Major component swaps or taking it offline for extended diagnostics hoping to catch a glimpse of the cause when it fails is about all you can do.
During the next server downtime, I'll re-seat RAM and cables, check for excess dust, and do normal maintenance as folks have suggested. I might also run a memory diag. I'll also look at the several excellent and appreciated suggestions (some of which I've already installed) on how to get a better picture on the state of the server when/if there is a future failure.
Memory diagnostics may take days to catch a problem. Did you check for a newer bios for your MB? I mentioned before that it seemed strange, but I've seen that fix mysterious problems even after the machines had previously been reliable for a long time (and even more oddly, all the machines in the lot weren't affected).
Les Mikesell wrote:
Note that overheating can be localized, e.g. a bad heat-sink mounting or fan on a CPU.
I'll re-seat the CPU, heatsink, and fan on the next downtime.
Heat-related problems usually present as a system which fails and will not reboot immediately, but will after it sits for a while to cool down. This system doesn't do that.
I'll install sensord to log CPU temps in case this is a problem.
There's not really a good way to approach intermittent failures. It may only break when you aren't looking. Major component swaps or taking it offline for extended diagnostics hoping to catch a glimpse of the cause when it fails is about all you can do.
During the next server downtime, I'll re-seat RAM and cables, check for excess dust, and do normal maintenance as folks have suggested. I might also run a memory diag. I'll also look at the several excellent and appreciated suggestions (some of which I've already installed) on how to get a better picture on the state of the server when/if there is a future failure.
Memory diagnostics may take days to catch a problem. Did you check for a newer bios for your MB? I mentioned before that it seemed strange, but I've seen that fix mysterious problems even after the machines had previously been reliable for a long time (and even more oddly, all the machines in the lot weren't affected).
Yes, most memory diagnostics are not very effective.
I'll have to stop the server to find out what the installed bios version is and see whether there is an update. Most bios updates appear to only change supported CPUs. Something else for the next downtime.
m.roth@5-cent.us wrote:
Michael Eager wrote:
<snip> I'll have to stop the server to find out what the installed bios version is and see whether there is an update. Most bios updates appear to only change supported CPUs. Something else for the next downtime.
Nope: dmidecode, or lshw, is your friend.
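For example (no downtime needed, since dmidecode just reads the SMBIOS/DMI tables from the running system):

  dmidecode -s bios-vendor
  dmidecode -s bios-version
  dmidecode -s bios-release-date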
Thanks. Looks like there might be a newer bios available, although the vendor identifies it as 'beta'.
on 10:29 Wed 09 Mar, Michael Eager (eager@eagerm.com) wrote:
Les Mikesell wrote:
Note that overheating can be localized or a bad heat sink mounting or fan on a CPU.
I'll re-seat the CPU, heatsink, and fan on the next downtime.
Very strongly advised. It's a simple and very cheap approach. I'd check /all/ cables (power, disk) as well.
Visually scan for bad caps while you're doing this. The pandemic of the mid 2000s seems to have abated, but they can still ruin your whole day.
Heat related problems usually present as a system which fails and will not reboot immediately, but will after they sit for a while to cool down. This system doesn't do that.
Maybe, maybe not.
I'll install sensord to log CPU temps in case this is a problem.
Good call.
There's not really a good way to approach intermittent failures. It may only break when you aren't looking. Major component swaps or taking it offline for extended diagnostics hoping to catch a glimpse of the cause when it fails is about all you can do.
I disagree with this statement: you start with the bleeding obvious and easy to do (the cheap diagnostics), same as any garage mechanic or doctor. You instrument and increase log scrutiny. You make damned sure you're logging remotely as one of the first things a hosed system does is stop writing to disk.
Yes, most memory diagnostics are not very effective.
I'll have to stop the server to find out what the installed bios version is and see whether there is an update. Most bios updates appear to only change supported CPUs. Something else for the next downtime.
You haven't stated who's built this system, but many LOM / OMC systems will provide basic information such as this. dmidecode and lshw are also very helpful here.
compdoc wrote:
I'll re-seat the CPU, heatsink, and fan on the next downtime.
Is the CPU overheating? Pointless to reseat the cpu or even remove the heatsink, if not.
No evidence to suggest that it is.
Michael Eager wrote:
compdoc wrote:
I'll re-seat the CPU, heatsink, and fan on the next downtime.
Is the CPU overheating? Pointless to reseat the cpu or even remove the heatsink, if not.
No evidence to suggest that it is.
Have you used ipmitool to see what the temperatures are?
mark
m.roth@5-cent.us wrote:
Michael Eager wrote:
compdoc wrote:
I'll re-seat the CPU, heatsink, and fan on the next downtime.
Is the CPU overheating? Pointless to reseat the cpu or even remove the heatsink, if not.
No evidence to suggest that it is.
Have you used ipmitool to see what the temperatures are?
No, I'm not familiar with ipmitool. I just installed it and the man page will take some time to read. It looks like it does everything and then more.
According to the man page, it apparently needs a kernel driver named OpenIPMI, which it claims is installed in standard distributions. I don't find it on my system. Running "ipmitool sdr type Temperature" results in an error message saying that it could not open /dev/ipmi0, etc.
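(For reference, if the board actually has a BMC, loading the driver modules by hand usually creates the device node; a consumer desktop board often has no BMC at all, in which case there's nothing for the driver to talk to:)

  modprobe ipmi_si
  modprobe ipmi_devintf
  ipmitool sdr type Temperature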
According to the man page, it apparently needs a kernel driver named OpenIPMI, which it claims is installed in standard distributions. I don't find it on my system.
lm_sensors is another, and I think installs ready to use from the repos.
Failing that, you should reboot and look in the motherboard's bios/cmos. It should display all that good stuff: fan speeds, voltage levels, temps.
compdoc wrote:
According to the man page, it apparently needs a kernel driver named OpenIPMI, which it claims is installed in standard distributions. I don't find it on my system.
lm_sensors is another, and I think installs ready to use from the repos.
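Roughly like this (the service name is from memory, so treat it as a sketch):

  yum install lm_sensors
  sensors-detect          # answer the prompts; it works out which sensor modules to load
  service lm_sensors start
  sensors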
sensors says that the three temp sensors read +36C, +39C, and +87C. These appear to be AMD K10 temp sensors, although I might be misreading sensors-detect. Low/highs are (+127/+127, +127/+90, +127/+127) respectively. (I'm not sure if these are alarm set points or something else.)
One fan is listed as 0 rpm. Something to look into.
+36C and +39C are likely your cpu and motherboard temps. You have to look at the temps in the cmos and match them.
The +87C is likely just a misreading by lm_sensors. Anything running that hot won't be stable.
I use AMD as well, and lm_sensors tells me something is 128°C.
heh
Err, that should read 128C
+36C and +39C are likely your cpu and motherboard temps. You have to look at the temps in the cmos and match them.
The +87C is likely just a misreading by lm_sensors. Anything running that hot won't be stable.
I use AMD as well, and lm_sensors tells me something is 1280C.
heh
compdoc wrote:
Err, that should read 128C
+36C and +39C are likely your cpu and motherboard temps. You have to look at the temps in the cmos and match them.
The +87C is likely just a misreading by lm_sensors. Anything running that hot won't be stable.
I use AMD as well, and lm_sensors tells me something is 1280C.
I'll compare the values from lm_sensors with the bios temps to see if they are in line.
1280C is about the melting point of iron. Wow!
On 03/09/11 4:06 PM, Michael Eager wrote:
I'll compare the values from lm_sensors with the bios temps to see if they are in line.
I find lm_sensors tends to be pretty useless on server-grade hardware, as opposed to desktop. Server hardware tends to have an IPMI management processor, which is accessed over the network (after you configure it) and can be centrally managed; this includes temp+fan+power monitoring as well as remote power and console.
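For what it's worth, on hardware that does have a BMC the same readings can be pulled over the network, e.g. (hostname and credentials here are placeholders):

  ipmitool -I lanplus -H bmc.example.com -U admin -P secret sdr type Temperature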
On Wed, 9 Mar 2011, compdoc wrote:
+36C and +39C are likely your cpu and motherboard temps. You have to look at the temps in the cmos and match them.
The +87C is likely just a misreading by lm_sensors. Anything running that hot won't be stable.
In testing nVidia graphics cards to destruction (not entirely deliberately) we found that anything up to about 110C was likely to work fine, anything past that was likely to cause visual corruption. Anything past 125C was pretty much guaranteed to cause permanent damage.
But you're right, I doubt that's correct, and lm_sensors is prone to reporting duff information. AMD list 70C as the max recommended for that chip. In the past it'd also depend a lot on where the temperature probe was (so it varied a lot motherboard by motherboard), but they're on-package now, aren't they?
jh
compdoc wrote:
According to the man page, it apparently needs a kernel driver named OpenIPMI, which it claims is installed in standard distributions. I don't find it on my system.
lm_sensors is another, and I think installs ready to use from the repos.
sensors says that the three temp sensors read +36C, +39C, and +87C. These appear to be AMD K10 temp sensors, although I might be misreading sensors-detect. Low/highs are (+127/+127, +127/+90, +127/+127) respectively. (I'm not sure if these are alarm set points or something else.)
One fan is listed as 0 rpm. Something to look into.
Hmm, much has been said now in this thread and I know how difficult it can be to find such an issue. However, I suggest not throwing in too many new tools in parallel. And be careful how you interpret any information gathered by tools like lm_sensors. They can only report as well as the mainboard and its sensors were designed and built, and both can be suboptimal. I've seen all kinds of things, like temp sensors not mounted where they should be. Of course, built-in sensors like those of a CPU should be taken very seriously.
So, may I give some more tips how I'd try to find what is wrong:
- Take a vacuum cleaner and *carefully* clean the whole box. Dust can really do bad things because it is not a perfect insulator.
- If you feel you have to remove any device like CPU, make sure you up everything, have a good quality heat sink paste at hand and make sure everything is seated well after mounting it again.
- For the memory part, do you have ECC? If not, that is really a problem: if the box is used as a server, ECC is a must. If yes, then most errors will be corrected by ECC, but more importantly, memory errors are usually logged. You should be able to find a list of those errors in the BIOS and see how many times errors occur and where. Does something like that exist?
- For the temperatures, 87C is not so uncommon, but yes, it looks a little bit high. Someone else posted 80C to be the max for your CPU; that seems correct. At least our 12-core Opterons have "Caution: 75C; Critical: 80C", but they usually run at 45C-55C under normal load. So if 87C is really correct under normal load, that may already be too much, and then consider what happens at peak times.
- When you look at the lm_sensors values, do they correspond with what is shown in the BIOS (if it has this kind of diagnostics)?
Simon
On Thu, Mar 10, 2011 at 12:10 PM, John Hodrien J.H.Hodrien@leeds.ac.uk wrote:
On Thu, 10 Mar 2011, Simon Matter wrote:
- Take a vacuum cleaner and *carefully* clean the whole box. Dust can
really do bad things because it is not a perfect insulator.
Take the wrong vacuum cleaner and static your machine to death.
jh
I prefer to use a dust blower instead. It doesn't risk pulling loose components that have "dry" or loose solder joints.
On Thursday, March 10, 2011 05:35:29 am Rudi Ahlers wrote:
I prefer to use a dust blower instead. It doesn't risk pulling loose components that have "dry" or loose solder joints.
I use both: antistatic canned air to blow the dust and a metal-tubed vacuum rested on a part of the case away from any boards to grab the dust that's being blown. Works great, and you don't 'recycle' the dust.....
On 03/10/2011 11:04 AM, Simon Matter wrote:
- Take a vacuum cleaner and *carefully* clean the whole box. Dust can
really do bad things because it is not a perfect insulator.
Never ever do that. Especially not inside the machine. There is a real risk of simply vacuuming smaller components like SMD resistors off the board. And, as already mentioned, you also have the chance of killing components by electrostatic discharge. Always use compressed air, even if just using canned air. Vacuuming is pretty bad advice.
Alexander Arlt wrote:
On 03/10/2011 11:04 AM, Simon Matter wrote:
- Take a vacuum cleaner and *carefully* clean the whole box. Dust can
really do bad things because it is not a perfect insulator.
Never ever do that. Especially not inside the machine. There is a real risk of simply vacuuming smaller components like SMD resistors off the board. And, as already mentioned, you also have the chance of killing components by electrostatic discharge. Always use compressed air, even if just using canned air. Vacuuming is pretty bad advice.
Previous cleanings have been with canned compressed air. Thanks for the caution about vacuums and static. I may use the vacuum on the case fans from the outside. The case should provide an adequate static shield.
On Thu, Mar 10, 2011 at 11:13 AM, Michael Eager eager@eagerm.com wrote:
Previous cleanings have been with canned compressed air. Thanks for the caution about vacuums and static. I may use the vacuum on the case fans from the outside. The case should provide an adequate static shield.
I've had good results with a damp, soft cloth or Q-tip with distilled water for awkward bits and filters, and that cloth for the case itself. It also looks noticeably newer, which helps with walking investors through a small machine room.
B.J. McClure keepertoad@bellsouth.net
Sent from MacBook-Air
On Mar 10, 2011, at 5:28 PM, Nico Kadel-Garcia wrote:
On Thu, Mar 10, 2011 at 11:13 AM, Michael Eager eager@eagerm.com wrote:
Previous cleanings have been with canned compressed air. Thanks for the caution about vacuums and static. I may use the vacuum on the case fans from the outside. The case should provide an adequate static shield.
I've had good results with a damp, soft cloth or Q-tip with distilled water for awkward bits and filters, and that cloth for the case itself. It also looks noticeably newer, which helps with walking investors through a small machine room.
I must respectfully disagree with any application of water, distilled or otherwise to things electronic. I was taught in the Navy, and my engineering career has confirmed, that cleaning of electronic components should be done with low pressure, dried, compressed air. 50 psi max. If some solvent must be used, try alcohol. Evaporates quickly, leaves no residue and has an affinity for water.
Just my $0.02.
Cheers, B.J.
On Thu, Mar 10, 2011 at 6:49 PM, B.J. McClure keepertoad@bellsouth.net wrote:
B.J. McClure keepertoad@bellsouth.net
Sent from MacBook-Air
On Mar 10, 2011, at 5:28 PM, Nico Kadel-Garcia wrote:
On Thu, Mar 10, 2011 at 11:13 AM, Michael Eager eager@eagerm.com wrote:
Previous cleanings have been with canned compressed air. Thanks for the caution about vacuums and static. I may use the vacuum on the case fans from the outside. The case should provide an adequate static shield.
I've had good results with a damp, soft cloth or Q-tip with distilled water for awkward bits and filters, and that cloth for the case itself. It also looks noticeably newer, which helps with walking investors through a small machine room.
I must respectfully disagree with any application of water, distilled or otherwise to things electronic. I was taught in the Navy, and my engineering career has confirmed, that cleaning of electronic components should be done with low pressure, dried, compressed air. 50 psi max. If some solvent must be used, try alcohol. Evaporates quickly, leaves no residue and has an affinity for water.
Typical drug-store alcohol is "rubbing alcohol", and is 30% water.
I designed medical electronics for a dozen years. Alcohol has its uses, but water is much cheaper and safer, and you don't have fumes to deal with. Shall we discuss the effectiveness of surface etch resist and cladding in protecting circuit boards from damage, and the effects of alcohol on low-cost electronic sockets?
On 03/11/2011 03:03 AM, Nico Kadel-Garcia wrote:
On Thu, Mar 10, 2011 at 6:49 PM, B.J. McClurekeepertoad@bellsouth.net wrote:
B.J. McClure keepertoad@bellsouth.net
Sent from MacBook-Air
On Mar 10, 2011, at 5:28 PM, Nico Kadel-Garcia wrote:
On Thu, Mar 10, 2011 at 11:13 AM, Michael Eagereager@eagerm.com wrote:
Previous cleanings have been with canned compressed air. Thanks for the caution about vacuums and static. I may use the vacuum on the case fans from the outside. The case should provide an adequate static shield.
I've had good results with a damp, soft cloth or Q-tip with distilled water for awkward bits and filters, and that cloth for the case itself. It also looks noticeably newer, which helps with walking investors through a small machine room.
I must respectfully disagree with any application of water, distilled or otherwise to things electronic. I was taught in the Navy, and my engineering career has confirmed, that cleaning of electronic components should be done with low pressure, dried, compressed air. 50 psi max. If some solvent must be used, try alcohol. Evaporates quickly, leaves no residue and has an affinity for water.
Typical drug-store alcohol is "rubbing alcohol", and is 30% water.
I designed medical electronics for a dozen years. Alcohol has its uses, but water is much cheaper and safer, and you don't have fumes to deal with. Shall we discuss the effectiveness of surface etch resist and cladding in protecting circuit boards from damage, and the effects of alcohol on low-cost electronic sockets?
I agree with Nico. I have been working for a large PC manufacturer in Europe for many years, and alcohol was never a good idea for cleaning PCBs, neither in production nor in the field.
Either we used trichloroethane or trichlorotrifluoroethane for washing and cleaning of mainboards (which became a bit unpopular due to its effects on the ozone layer...) or we used water-based cleaning fluids (aka 'water'). But that was only in the production process of the PCBs. Almost never in the field, except when real repairs on the mainboard had to be done on site (soldering).
Yes, it can be true with 'navy-strength' electronics that you actually can use alcohol for the purpose of cleaning electronic boards, but in low-cost electronics it's a total no-go, because it dissolves the coating of the PCBs and most often harms - as Nico wrote - the sockets and chip packages. We're talking about low-cost electronics here...
Though, when cleaning machines in the field, I very rarely ever used something other than compressed air. Actually, I would suggest to everyone not to clean the inside of a box with any kind of fluid, since it actually won't do anything positive besides changing the looks.
On 03/11/2011 03:03 AM, Nico Kadel-Garcia wrote:
On Thu, Mar 10, 2011 at 6:49 PM, B.J. McClurekeepertoad@bellsouth.net wrote:
B.J. McClure keepertoad@bellsouth.net
Sent from MacBook-Air
On Mar 10, 2011, at 5:28 PM, Nico Kadel-Garcia wrote:
On Thu, Mar 10, 2011 at 11:13 AM, Michael Eagereager@eagerm.com wrote:
Previous cleanings have been with canned compressed air. Thanks for the caution about vacuums and static. I may use the vacuum on the case fans from the outside. The case should provide an adequate static shield.
I've had good results with a damp, soft cloth or Q-tip with distilled water for awkward bits and filters, and that cloth for the case itself. It also looks noticeably newer, which helps with walking investors through a small machine room.
I must respectfully disagree with any application of water, distilled or otherwise to things electronic. I was taught in the Navy, and my engineering career has confirmed, that cleaning of electronic components should be done with low pressure, dried, compressed air. 50 psi max. If some solvent must be used, try alcohol. Evaporates quickly, leaves no residue and has an affinity for water.
Typical drug-store alcohol is "rubbing alcohol", and is 30% water.
I designed medical electronics for a dozen years. Alcohol has its uses, but water is much cheaper and safer, and you don't have fumes to deal with. Shall we discuss the effectiveness of surface etch resist and cladding in protecting circuit boards from damage, and the effects of alcohol on low-cost electronic sockets?
I agree with Nico. I have been working for a large PC manufacturer in Europe for many years, and alcohol was never a good idea for cleaning PCBs, neither in production nor in the field.
Either we used trichloroethane or trichlorotrifluoroethane for washing and cleaning of mainboards (which became a bit unpopular due to its effects on the ozone layer...) or we used water-based cleaning fluids (aka 'water'). But that was only in the production process of the PCBs. Almost never in the field, except when real repairs on the mainboard had to be done on site (soldering).
Yes, it can be true with 'navy-strength' electronics that you actually can use alcohol for the purpose of cleaning electronic boards, but in low-cost electronics it's a total no-go, because it dissolves the coating of the PCBs and most often harms - as Nico wrote - the sockets and chip packages. We're talking about low-cost electronics here...
Though, when cleaning machines in the field, I very rarely ever used something other than compressed air. Actually, I would suggest to everyone not to clean the inside of a box with any kind of fluid, since it actually won't do anything positive besides changing the looks.
After decades in the high precision and electronics industry, I can tell you for sure that compressed air is not seen as a good choice. It blows the dust where it doesn't belong. That may not be a big problem with a cheap PC, but it's not professional at all.
If you want to do it the professional way, go to an ESD protected room, take an ESD vac and an ESD brush, wear your ESD shoes and wrist strap, and clean *carefully*. Compressed air may additionally be used in certain places, but not more.
Simon
On Fri, Mar 11, 2011 at 8:51 AM, Simon Matter simon.matter@invoca.ch wrote:
After decades in the high precision and electronics industry, I can tell you for sure that compressed air is not seen as a good choice. It blows the dust where it doesn't belong. That may not be a big problem with a cheap PC, but it's not professional at all.
If you want to do it the professional way, go to an ESD protected room, take an ESD vac and an ESD brush, wear your ESD shoes and wrist strap, and clean *carefully*. Compressed air may additionally be used in certain places, but not more.
Simon
Is it worth a discussion here of overall PC manufacturing safety tips? It's not really CentOS specific, but it is interesting.
Simon Matter wrote:
One fan is listed as 0 rpm. Something to look into.
Hmm, much has been said now in this thread and I know how difficult it can be to find such an issue. However, I suggest not throwing in too many new tools in parallel. And be careful how you interpret any information gathered by tools like lm_sensors. They can only report as well as the mainboard and its sensors were designed and built, and both can be suboptimal. I've seen all kinds of things, like temp sensors not mounted where they should be. Of course, built-in sensors like those of a CPU should be taken very seriously.
Thanks for the suggestions.
So, may I give some more tips how I'd try to find what is wrong:
- Take a vacuum cleaner and *carefully* clean the whole box. Dust can
really do bad things because it is not a perfect insulator.
- If you feel you have to remove any device like CPU, make sure you up
everything, have a good quality heat sink paste at hand and make sure everything is seated well after mounting it again.
- For the memory part, do you have ECC? If not, that is really a problem: if the box is used as a server, ECC is a must. If yes, then most errors will be corrected by ECC, but more importantly, memory errors are usually logged. You should be able to find a list of those errors in the BIOS and see how many times errors occur and where. Does something like that exist?
The MB docs/website don't mention ECC support, but I presume it is supported as part of the DDR2 spec. I'll check whether the memory has ECC. If not, this is a reasonable upgrade.
- For the temperatures, 87C is not so uncommon, but yes, it looks a little bit high. Someone else posted 80C to be the max for your CPU; that seems correct. At least our 12-core Opterons have "Caution: 75C; Critical: 80C", but they usually run at 45C-55C under normal load. So if 87C is really correct under normal load, that may already be too much, and then consider what happens at peak times.
The most recent crash was overnight and not discovered until morning. Probably not related to load. But if it really is running over temp, then almost anything can happen.
- When you look at the lm_sensors values, do they correspond with what is shown in the BIOS (if it has this kind of diagnostics)?
Something I'll check when the system is taken down.
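On the ECC question above, a couple of quick checks that don't need downtime. A sketch; the EDAC counter only exists if a driver for your memory controller is available and loaded, which varies by chipset and kernel:

    dmidecode -t memory | grep -i 'error correction'    # "Single-bit ECC" vs "None"
    ls /lib/modules/$(uname -r)/kernel/drivers/edac/    # EDAC drivers shipped with this kernel
    cat /sys/devices/system/edac/mc/mc0/ce_count        # corrected-error count, if a driver is loaded

A steadily climbing ce_count would point straight at a flaky DIMM.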
centos-bounces@centos.org wrote:
Simon Matter wrote:
The MB docs/website don't mention ECC support, but I presume it is supported as part of the DDR2 spec. I'll check whether the memory has ECC. If not, this is a reasonable upgrade.
Your board does not support DDR2. See http://service.msicomputer.com/index.php?func=proddesc&maincat_no=1&... _no=&cat3_no=&prod_no=273 "Support 2.5v DDR200/266/333 DDR SDRAM DIMM "
That's straight old DDR. 3 slots of up to 3GB. No ECC.
BIOS listed is A6380VMS.570
So many "instrumentation" suggestions have been made, that I think to note: The CPU bandwidth is rather modest, and might not support all that instrumentation *and* its previous job load. Also, some instrumentation packages suggested might not support socket A (pre-Barton) motherboards, verify VIA(r) KT333 (552 BGA) Chipset and VIA(r) VT8233A (376 BGA) Chipset are "comprehended"
Insert spiffy .sig here: Life is complex: it has both real and imaginary parts.
//me
centos-bounces@centos.org wrote:
Your board does not support DDR2. (url for MSI KT3 Ultra) "Support 2.5v DDR200/266/333 DDR SDRAM DIMM
The OP says this:
House-built, Gigabyte MB, AMD Phenom II X6, 6Gb RAM.
Somehow, info has gotten crossed...
Possibility... Please excuse...
Insert spiffy .sig here: Life is complex: it has both real and imaginary parts.
//me
compdoc wrote:
I'll re-seat the CPU, heatsink, and fan on the next downtime.
Is the CPU overheating? Pointless to reseat the cpu or even remove the heatsink, if not.
No evidence to suggest that it is.
As much as I love telling anecdotes, I have none to tell you concerning cpu reseating. I've never seen it fix a problem.
Maybe that was something they needed to do back in 1998, but cpu and ram sockets are a reliable technology these days.
Removing and then reinserting is likely to do more damage than it will fix.
I think you're on the right track - use diagnostic tools and see what you can find. The more poking around you do the better.
I do agree about bad caps - even one with a bulging top can cause crashing/rebooting. They need to be checked both on the motherboard and inside the PSU.
However, if the motherboard is two years old or less, capacitor problems become less likely; the newer it is, the better. They've been making some excellent low-cost boards with solid caps for a while.
The older boards with that problem are still around but most have died by now. Cheaper PSUs have a cap problem even these days, though.
Oh, and both the motherboard and PSU circuit board should be examined for burned components. We have some hellacious lightning strikes here in Denver, and stuff blows up.
Hey, I did manage an anecdote after all!
On Mar 9, 2011, at 3:06 PM, compdoc wrote:
compdoc wrote:
I'll re-seat the CPU, heatsink, and fan on the next downtime.
Is the CPU overheating? Pointless to reseat the cpu or even remove the heatsink, if not.
No evidence to suggest that it is.
As much as I love telling anecdotes, I have none to tell you concerning cpu reseating. I've never seen it fix a problem.
Funny, we actually had a whole stack of HP 4600s that needed the cpus reinstalled in order to function.
When we removed the heatsinks, the cpus came up with them, even though the socket lever was down in the lock position.
We had to "twist" the CPU off the bottom of the heatsink, reinstall it in the socket, reinstall the heatsink, and the machines were fine.
-- Don Krause
On Mar 9, 2011, at 3:26 PM, compdoc wrote:
When we removed the heatsinks, the cpus came up with them, even though the socket lever was down in the lock position.
I've seen that in HP desktops too - the thermal paste became a hardened glue and the cpu gets pulled right out.
Another reason to leave the heat sink on.
Umm, actually, that was a great reason to take the heatsink off. The machines wouldn't boot in that condition; reseating the cpus fixed them all. Yes, we could have shipped them back (they were brand new, broken out of the box), but didn't have the time to deal with that.
-- Don Krause
On 03/09/11 10:29 AM, Michael Eager wrote:
I'll re-seat the CPU, heatsink, and fan on the next downtime.
Do have on hand the supplies to clean off the old heatsink goo (I use alcohol pads for this), and some fresh heatsink goop.
Check, while it's powered off, that all fans spin easily. I've seen fans that were still spinning but felt a little stiff, and they failed not long thereafter. And of course, clean out most of the dust that tends to collect everywhere.
on 11:52 Wed 09 Mar, Les Mikesell (lesmikesell@gmail.com) wrote:
On 3/9/2011 11:32 AM, Michael Eager wrote:
Memory diagnostics may take days to catch a problem. Did you check for a newer BIOS for your MB? I mentioned before that it seemed strange, but I've seen that fix mysterious problems even after the machines had previously been reliable for a long time (and even more oddly, not all the machines in the lot were affected).
BIOS issues would tend to present similar symptoms on numerous systems, especially if they're similarly configured.
Mind: we've encountered a DSTATE bug with recent Dell PowerEdge systems (r610, r410, r310), which has resulted in several BIOS revisions, the latest of which simply disables the option entirely. It's one of the first things Dell techs mention when you call them these days (much to our amusement).
If it's a single system (and assuming there are others similarly configured), I'm leaning toward hardware or build-quality issues: bad RAM, other componentry, poor cable seating, etc.
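On the BIOS-revision angle, you can see what you're currently running without a reboot; a small sketch:

    dmidecode -s bios-vendor
    dmidecode -s bios-version
    dmidecode -s bios-release-date

Compare that against the board vendor's download page before deciding whether a flash is worth the downtime.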
During the next server downtime, I'll re-seat RAM
If the ram is passing memtest86+, I think reseating only serves to introduce dust and dirt into an area where a tight connection was previously keeping it out.
Gently press them down to make sure they're seated, sure. But pulling them out only allows dirt to fall into the cavity, and increases chances of damage from insertion or static electricity, etc.
Not to mention causing wear on the memory socket itself...
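For completeness, since memtest86+ keeps coming up: it is cheap to set up on a CentOS box. A sketch, assuming the memtest86+ package is available in the repos you use (the helper name below matches what the Fedora/EL package ships):

    yum install memtest86+
    memtest-setup        # adds a Memtest86+ entry to grub.conf
    # reboot, pick the Memtest86+ entry, and let it run overnight

Several clean passes are meaningful; a single quick pass much less so.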
on 07:06 Wed 09 Mar, Michael Eager (eager@eagerm.com) wrote:
Dr. Ed Morbius wrote:
on 09:24 Tue 08 Mar, Michael Eager (eager@eagerm.com) wrote:
Hi --
I'm running a server which is usually stable, but every once in a while it hangs. The server is used as a file store using NFS and to run VMware machines.
I don't see anything in /var/log/messages or elsewhere to indicate any problem or offer any clue why the system was hung.
Any suggestions where I might look for a clue?
I'd very strongly recommend you configure netconsole. Though not entirely clear from the name, it's actually an in-kernel network logging module, which is very useful for kicking out kernel panics which otherwise aren't logged to disk and can't be seen on a (nonresponsive) monitor.
I'll take a look at netconsole.
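A minimal sketch of wiring netconsole up on CentOS 5; the log host 192.0.2.20 and the port are placeholders:

    # one-off test on the flaky server:
    modprobe netconsole netconsole=@/,6666@192.0.2.20/
    # to make it persistent, set SYSLOGADDR=192.0.2.20 in /etc/sysconfig/netconsole, then:
    chkconfig netconsole on && service netconsole start

    # on the log host, catch the UDP stream:
    nc -u -l 6666 | tee netconsole.log

It only captures whatever the kernel manages to print before it dies, but that's often the only clue a hard hang leaves behind.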
Alternately, a serial console which actually retains all output sent to it (some remote access systems support this, some don't) may help.
Barring that, I'd start looking at individual HW components, starting with RAM.
The problem with randomly replacing various components, other than the downtime and nuisance, is that there's no way to know that the change actually fixed the problem. When the base rate is one unknown system hang every few weeks, how many weeks should I wait without a failure to conclude that the replaced component was the cause? A failure which happens infrequently isn't really amenable to a random diagnostic approach.
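As a rough back-of-envelope on the waiting question (assuming, purely for illustration, that hangs arrive like a Poisson process at about one every three weeks):

    P(no hang in t weeks | nothing actually fixed) = e^(-t/3)
    e^(-t/3) <= 0.05  =>  t >= 3 * ln(20), roughly 9 weeks

So a swapped part needs on the order of two months of quiet running before it can be called vindicated with ~95% confidence, which is why shotgun replacement is so unattractive here.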
This is where vendor management/relations starts coming into the picture.
Your architecture should also support single-point failures.
If the issue is repeated but rare system failures on one of a set of similarly configured hosts, I'd RMA the box and get a replacement. End of story.
If that's not the case, well, then, I suppose YOUR problem is to figure out when you've resolved the issue. I've outlined the steps I'd take. If this means weeks of uncertainty, then I'd communicate this fact, in no uncertain terms, to my manager, along with the financial implications of downtime.
If downtime is more expensive than system replacement costs, the decision is pretty obvious, even if painful.
Note that most system problems /are/ single-source. If you'd post details of the host, more logging information, netconsole panic logs, etc., it might be possible to narrow down possible causes.
With what you've posted to date, it's not.
Dr. Ed Morbius wrote:
If the issue is repeated but rare system failures on one of a set of similarly configured hosts, I'd RMA the box and get a replacement. End of story.
I'll repeat: this is a house-made system. There's no vendor to RMA to. It seems obvious to me: RMA is not a diagnostic tool.
If you'd post details of the host, more logging information, netconsole panic logs, etc., it might be possible to narrow down possible causes.
The problem is that there are NO DIAGNOSTICS generated when the system hangs. There's no panic and nothing in the logs which indicates any problem. This is what I indicated from the get go.
With what you've posted to date, it's not.
I could waste my time posting logs for you to tell me that they don't point to any problem. I'd rather skip that step.
On Thu, Mar 10, 2011 at 12:31 AM, Michael Eager eager@eagerm.com wrote:
Dr. Ed Morbius wrote:
If the issue is repeated but rare system failures on one of a set of similarly configured hosts, I'd RMA the box and get a replacement. End of story.
I'll repeat: this is a house-made system. There's no vendor to RMA to.
I don't know where you are, but in our country we can RMA anything and everything, apart from CPUs. So even a cheap desktop mobo could be RMA'd, as long as I can prove to the supplier it's faulty and it's within the warranty period.
centos-bounces@centos.org wrote:
On Thu, Mar 10, 2011 at 12:31 AM, Michael Eager eager@eagerm.com wrote:
Dr. Ed Morbius wrote:
If the issue is repeated but rare system failures on one of a set of similarly configured hosts, I'd RMA the box and get a replacement. End of story.
I'll repeat: this is a house-made system. There's no vendor to RMA to.
I don't know where you are,
His signature lists CA/USA.
but in our country we can RMA anything and everything, apart from CPUs. So even a cheap desktop mobo could be RMA'd, as long as I can prove to the supplier it's faulty and it's within the warranty period.
Here in the USA we can RMA stuff if we can show it is dysfunctional. Michael's position is that he has no evidence of a dysfunctional part, which could be RMA'd. He has evidence of a dysfunctional gestalt, comprising hardware, software, environment, and data stream.
Insert spiffy .sig here: Life is complex: it has both real and imaginary parts.
//me
on 14:31 Wed 09 Mar, Michael Eager (eager@eagerm.com) wrote:
Dr. Ed Morbius wrote:
If the issue is repeated but rare system failures on one of a set of similarly configured hosts, I'd RMA the box and get a replacement. End of story.
I'll repeat: this is a house-made system. There's no vendor to RMA to. It seems obvious to me: RMA is not a diagnostic tool.
You fab your own silicon?
I saw your reference to a homebrew machine after I'd posted. You'd neglected to provide this information initially.
Knowing some basic stuff (CPU architecture, memory allocation, disk subsystem, kernel modules, etc.) would help.
If you'd post details of the host, more logging information, netconsole panic logs, etc., it might be possible to narrow down possible causes.
The problem is that there are NO DIAGNOSTICS generated when the system hangs. There's no panic and nothing in the logs which indicates any problem. This is what I indicated from the get go.
uname -a
/proc/cpuinfo
/proc/meminfo
lspci
lsmod
/proc/mounts
/proc/scsi/scsi
/proc/partitions
dmidecode
... would be useful for starters.
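A quick way to bundle all of that up in one go (a sketch; the output file name is arbitrary):

    ( uname -a
      cat /proc/cpuinfo /proc/meminfo /proc/mounts /proc/partitions /proc/scsi/scsi
      lspci
      lsmod
      dmidecode
    ) > sysinfo.txt 2>&1

Attach or pastebin the result and most of the "what hardware is this" round trips go away.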
If you've built your own kernel, your config options would help too (if you're running stock, we can get that from the package itself).
As would wiring up netconsole as I initially suggested.
If I can clarify: YOU are the person with the problem. WE are the people you're turning to for assistance. YOU are getting pissy. YOU should be focusing on providing relevant information, or noting that it's not available.
You're NOT obliged to repeat information you've already posted (e.g.: home-brew system), but it's helpful to front-load data rather than have us tease it out of you.
With what you've posted to date, it's not.
I could waste my time posting logs for you to tell me that they don't point to any problem. I'd rather skip that step.
Krell forfend you should post relevant information which might be useful in actually diagnosing your problem (or pointing to likely candidates and/or further tests).
Dr. Ed Morbius wrote:
You're NOT obliged to repeat information you've already posted (e.g.: home-brew system), but it's helpful to front-load data rather than have us tease it out of you.
No intention to have anyone tease information out of me.
The subject line says that the system is CentOS 5.5. The other info has been forthcoming, as much as I have been able to provide. Sorry it wasn't all at the same time -- I didn't think that saying the server was not a Dell or HP box was important.
With what you've posted to date, it's not.
I could waste my time posting logs for you to tell me that they don't point to any problem. I'd rather skip that step.
Krell forfend you should post relevant and useful information which might be useful in actually diagnosing your problem (or pointing to likely candidates and/or further tests).
The logs are uninformative. No messages for hours before the crash.
Thanks for the help.
On 03/09/11 2:31 PM, Michael Eager wrote:
I'll repeat: this is a house-made system. There's no vendor to RMA to. It seems obvious to me: RMA is not a diagnostic tool.
You built it, you get to fix it. Sometimes the initial savings in capital can come back and bite you in time wasted.