I have a simple Tyan server with RAID1 on a 3ware 7002 that has been running an updated CentOS 5.1 for almost 3 weeks with no problems. It runs low-traffic Apache/PHP/MySQL and vsftpd sites exclusively. Last night it crashed. The console showed something about a process named 'php' and gave a process ID with the panic message, but that's all I can remember. I hard-rebooted and all was well...
Then about 55 minutes later I was logged in with ssh and it froze. Another fun holiday trip back to the datacenter. This time I brought a camera, but found a totally different message. This one seems more like an ext3/disk problem?
http://www.flickr.com/photos/sbeam/2525251590/sizes/l/
The logs contain nothing unusual. I have set up a script to do a vmstat/ps dump every 10 minutes. Does this kernel dump suggest anything else, or are there other places I should look?
Thanks much- Sam
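For readers who want to set up the same kind of periodic vmstat/ps dump mentioned above, here is a minimal sketch in Python. It is not the script used in this thread. It assumes Python 3.5 or later (newer than what CentOS 5.1 ships), and the log path, the command flags, and the function names are all hypothetical choices made for illustration.

```python
import os
import subprocess
import time
from datetime import datetime

# Commands to capture on each pass. The exact flags are an assumption.
COMMANDS = [
    ["vmstat", "1", "3"],   # three one-second samples of memory/CPU/IO
    ["ps", "auxww"],        # full process list with untruncated command lines
]


def snapshot(commands=COMMANDS, run=subprocess.run):
    """Return one timestamped text block holding the output of each command."""
    parts = ["==== %s ====" % datetime.now().isoformat()]
    for cmd in commands:
        parts.append("$ " + " ".join(cmd))
        try:
            result = run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                         universal_newlines=True, timeout=60)
            parts.append(result.stdout)
        except (OSError, subprocess.SubprocessError) as exc:
            # A missing or hung tool should not stop the logging loop.
            parts.append("failed: %s" % exc)
    return "\n".join(parts) + "\n"


def append_snapshot(path, text):
    """Append text to path and ask the OS to flush it to disk right away."""
    with open(path, "a") as fh:
        fh.write(text)
        fh.flush()
        os.fsync(fh.fileno())


def log_forever(path="/var/tmp/crashwatch.log", interval=600):
    """Write a snapshot every `interval` seconds (600 s = 10 minutes).

    The default path is a hypothetical example, not a standard location.
    """
    while True:
        append_snapshot(path, snapshot())
        time.sleep(interval)
```

Calling `log_forever()` from a small wrapper (or running `append_snapshot(path, snapshot())` from cron every 10 minutes) gives a trail of the last known state before a freeze. The `fsync` call is there because a machine that panics shortly after a write may otherwise lose the most recent snapshot from the page cache.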
On Mon, 26 May 2008 14:13:27 -0400 sbeam sbeam@onsetcorps.net wrote:
This time I brought a camera, but found a message totally different
If it keeps crashing and every time you get a new and excitingly different error message, I would look at the hardware and the immediate environment. Overheat? Power supply dying or stuck fans? Bad ram or dirty connectors? Etc.
On Monday 26 May 2008 14:19, Frank Cox wrote:
I would look at the hardware and the immediate environment. Overheat? Power supply dying or stuck fans? Bad ram or dirty connectors? Etc.
The unit is in a climate-controlled datacenter, and it is practically brand-new (4 weeks). I'll check the fans and power supply voltage. Meanwhile, it crashed again last night.
Any chance the RAID controller (3ware 7006) is to blame for this? Or should I swap out the RAM as a first step?
on 5-27-2008 4:02 AM sbeam spake the following:
Unit is in a climate-controlled datacenter, and it is practically brand-new. (4 weeks). I'll check the fans and ps voltage. Meantime it crashed again last night.
Any chance the RAID controller is to blame for this (3ware 7006)? or should I swap out the RAM as a first step?
Running memtest for 24 hours should be enough to test the RAM. A 3ware 7006 is a fairly old card. Does it have the latest BIOS available from 3ware? You could always eliminate the 3ware controller as a suspect by installing a drive on whatever built-in controller the board has.
On Tuesday 27 May 2008 11:39, Scott Silva wrote:
Running memtest for 24 hours should be enough to test the ram. A 3ware 7006 is a fairly old card. Does it have the latest bios available from 3ware? You could always eliminate the 3ware controller by installing a drive on whatever built in controller it has.
This is a production server, so running an extended memtest is not going to happen. But I can swap the RAM out and put it in a backup system to do the test. It's beginning to look a lot like a RAM issue, as I have now seen a couple of segfaults from programs that have always run fine. Every kernel panic message is different (it crashed again 1 hour ago). Fans and case temp are nominal.
The 3ware card was just purchased last month, and it has the latest firmware and BIOS installed.
The memory is from PQI, which is supposed to be an OK brand, right? It has a lifetime warranty... heh
next steps... HA and fault-tolerant clustering, per the adjacent thread... this is the cautionary tale come to life.
fun fun fun Sam
sbeam wrote:
this is a production server, so running an extended memtest not going to happen. But I can swap it out and put it in a backup system to do the test. It's beginning to look a lot like a RAM issue as I have now seen a couple segfaults from programs that have always run fine. Every kernel panic message is different (crashed again 1 hour ago). Fans and case temp are nominal.
the 3ware card was just purchased last month, it has the latest firmware and bios installed.
the memory is from PQI - supposed to be an OK brand right? it has a lifetime warranty... heh
next steps... HA and fault-tolerant clustering, per the adjacent thread... this is the cautionary tale come to life.
It would be great if there were a simple machine that you could plug a bunch of dimms of varying types into and it will perform high-speed tests on them continuously and flag ones that show an error.
Then you could test all memory modules thoroughly before putting them into production servers (or any server for that matter).
-Ross
Ross S. W. Walker wrote:
It would be great if there were a simple machine that you could plug a bunch of dimms of varying types into and it will perform high-speed tests on them continuously and flag ones that show an error.
Then you could test all memory modules thoroughly before putting them into production servers (or any server for that matter).
And DIMMs could pass that hardware tester and still fail in a production server due to differences in timing, capacitive/inductive signal load, etc. Systems that use dual-bank interleaving on a single memory bus are especially sensitive.
Actually, such machines DO exist, known as ATE (Automatic Test Equipment), but they cost millions of dollars.
this is a production server, so running an extended memtest not going to happen. But I can swap it out and put it in a backup system to do the test. It's beginning to look a lot like a RAM issue as I have now seen a couple segfaults from programs that have always run fine. Every kernel panic message is different (crashed again 1 hour ago). Fans and case temp are nominal
Some Tyan boards are known to be very picky about RAM. If you want to avoid problems, you should really stick to the types listed in Tyan's memory compatibility list on Tyan's web site. Perhaps the same can be said about any board today. It seems that the times of "generic" RAM are gone.
I have a Tyan board that several times destroyed a DIMM or two upon reboot, and only upon reboot. It did not happen during operation or when doing a hard reboot.
On Tuesday 27 May 2008 13:16, Miguel Medalha wrote:
Some Tyan boards are known as being very picky with RAM. If you want to avoid problems, you should really stick to the types listed in Tyan's memory compatibility list.
Hmm. Well, the spec sheet just says "unbuffered DDR 266/200" and that is what we got. I never noticed there was a list of "recommended" memory, but your comment made me look and I found one. PQI is not on the list :(
But... we had a similar system with the same mobo and the same kind of RAM stick (only one 512M, not 2x1G like these) that ran for years with nary a hiccup until the disks died. The one we are having problems with was its upgrade/replacement. Maybe this is the problem though.
thanks Sam
on 5-27-2008 11:24 AM sbeam spake the following:
hmm. well the spec sheet just says "unbuffered DDR 266/200" and that is what we got. I never noticed there was a list of "recommended" memory, but your comment made me look and I found one. PQI is not on the list :(
But... we had a similar system with the same mobo and ram stick - only one 512M, not 2x1G like these - that was running for years with nary a hiccup until the disks died. The one we are having problems with was its upgrade/replacement. Maybe this is a problem though.
thanks Sam
Just because memory has a lifetime warranty doesn't mean that it is guaranteed not to be bad on arrival, or not to fail quickly. It just means they will send you new modules if yours break. It does happen, and more often than you might think.
On Tue, 2008-05-27 at 14:24 -0400, sbeam wrote:
hmm. well the spec sheet just says "unbuffered DDR 266/200" and that is what we got. I never noticed there was a list of "recommended" memory, but your comment made me look and I found one. PQI is not on the list :(
But... we had a similar system with the same mobo and ram stick - only one 512M, not 2x1G like these - that was running for years with nary a hiccup until the disks died. The one we are having problems with was its upgrade/replacement. Maybe this is a problem though.
I recently had a problem upgrading to 2x1GB PC3200 (DDR 400) in one of my home units. Ultimate solution: tweaked the DRAM voltage to 2.7 volts. It took a long time because I didn't want to fry the DIMMs and there were no specs anywhere that specified the voltage.
My *deduced* explanation is that the greater capacity at max frequency needed a higher voltage to drive the increased speed/capacity (the previous setup was 1GB of PC2700).
Might work for you.
And of course, YMMV, you keep the pieces, etc.
HTH
William L. Maltby wrote:
I recently had a problem upgrading to 2x1GB PC3200 (> DDR 400) in one of my home units. Ultimate solution: tweaked the DRAM voltage to 2.7 volts. Took a long time because I didn't want to fry the DIMMS and there were no specs anywhere that specified the voltage.
We faced a similar situation lately (on a Fedora system, however). After swapping each and every part of the hardware, we found some USB temperature sensors and their kernel drivers to be the source of all the problems. Since we unplugged the USB devices and unloaded the kernel modules, the machine has run without any problems.
Regards,
Peter