My Dell PowerEdge T105 running Centos-5.2 has started crashing fairly often (3 times in the last 2 hours). The message on the screen tells me to look at the System Event Log. Is this just /var/log/messages ?
I didn't see anything helpful there. As far as I could see, the last messages before the crash were about samba, which is running on the machine. Has anyone had such problems with Samba? I've stopped the smb service to see if this improves matters.
I'm running a standard updated system. I looked at smartctl but this did not suggest that there was anything wrong with the 2 SATA disks.
Incidentally, wouldn't it be a good idea for /var/log/messages* to start a new file when booting?
The message on the screen tells me to look at the System Event Log.
What application is doing that? System Event Log sounds like Windows terminology.
Incidentally, wouldn't it be a good idea for /var/log/messages* to start a new file when booting?
that's what /var/log/dmesg is for.
messages files are typically rolled over daily or when they hit a certain size.
Spiro Harvey wrote:
The message on the screen tells me to look at the System Event Log.
What application is doing that? System Event Log sounds like Windows terminology.
Centos-5.2 halted, and the message on the screen told me to look at the System Event Log.
In fact I found that on pressing F2 (Enter Setup) on boot I was given a list of options, which includes System Event Log. Pressing this allows me to read entries going back several days.
In my case I found the error "Uncorrectable ECC Error DIMM 1,1".
On swapping the two 2GB memory modules, the message changed to "Uncorrectable ECC Error DIMM 2,2".
This seemed to me to be pretty strong evidence that the fault was indeed with one of the memory modules. Unfortunately it was difficult to convince Dell of this. The technical support man I spoke to (in Scotland) claimed that it might be a software error caused by installing Windows XP Pro as a second OS on the machine. He asked me to run a diagnostic test which in fact did not run because I did not have the "Utility Partition" that it required.
In the end he agreed to send a new module in place of the old one.
Incidentally, wouldn't it be a good idea for /var/log/messages* to start a new file when booting?
that's what /var/log/dmesg is for.
As far as I can see, this only contains messages about the boot sequence? In any case, it did not contain any messages relevant to the crash.
on 2-10-2009 5:44 AM Timothy Murphy spake the following:
Spiro Harvey wrote:
The message on the screen tells me to look at the System Event Log.
What application is doing that? System Event Log sounds like Windows terminology.
Centos-5.2 halted, and the message on the screen told me to look at the System Event Log.
In fact I found that on pressing F2 (Enter Setup) on boot I was given a list of options, which includes System Event Log. Pressing this allows me to read entries going back several days.
In my case I found the error "Uncorrectable ECC Error DIMM 1,1".
On swapping the two 2GB memory modules, the message changed to "Uncorrectable ECC Error DIMM 2,2".
This seemed to me to be pretty strong evidence that the fault was indeed with one of the memory modules. Unfortunately it was difficult to convince Dell of this. The technical support man I spoke to (in Scotland) claimed that it might be a software error caused by installing Windows XP Pro as a second OS on the machine.
Many 1st level tech support people have a book of scripts they have to follow, and anything not in those scripts throws them off. You almost always have to escalate to the next level to actually talk to someone who is allowed to "think for themselves".
On Tue, Feb 10, 2009 at 12:37 PM, Scott Silva ssilva@sgvwater.com wrote:
on 2-10-2009 5:44 AM Timothy Murphy spake the following:
Spiro Harvey wrote:
<snip>
On swapping the two 2GB memory modules, the message changed to "Uncorrectable ECC Error DIMM 2,2".
This seemed to me to be pretty strong evidence that the fault was indeed with one of the memory modules. Unfortunately it was difficult to convince Dell of this. The technical support man I spoke to (in Scotland) claimed that it might be a software error caused by installing Windows XP Pro as a second OS on the machine.
We have four Dell Dimension boxes in our house. Dell has the right to require you to restore the system (HW & SW) to its original configuration before you get Tech Support. Usually they are not that hard-nosed about it, but it requires some patience.
Obviously, in this case, you could have booted from the CentOS LiveCD and run memtest86 on the memory, or, you could have run Dell's Diagnostics on it, which probably would have been more satisfying to the Dell guy in Scotland.
IMHO, you are lucky you were not talking with someone in India or some other country.
Many 1st level tech support people have a book of scripts they have to follow, and anything not in those scripts throws them off. You almost always have to escalate to the next level to actually talk to someone who is allowed to "think for themselves".
That can also apply to getting parts, for those of us who do not live in the USA. Some years ago, I needed a new Bezel for one of our boxes. The Spare Parts person said special permission was needed to ship it to me from the USA. Since they had shipped the box down here, I could not believe it. I contacted Dell Management and they sent me the Bezel.
Our impression is that Dell Latin America has *much* better Tech Support than they provide in the USA. We always recommend Dell to people here who are contemplating buying a new PC, because of their Support. They have really stood behind our purchases. :-)
Lanny Marcus wrote:
On swapping the two 2GB memory modules, the message changed to "Uncorrectable ECC Error DIMM 2,2". ...
We have four Dell Dimension boxes in our house. Dell has the right to require you to restore the system (HW & SW) to its original configuration before you get Tech Support. Usually they are not that hard-nosed about it, but it requires some patience.
He said 'ECC memory', which to me indicates this is a PowerEdge server, not a Dimension consumer desktop. You get a whole different level of support from the Server folks than you do from the consumer desktop folks.
On Tue, Feb 10, 2009 at 3:25 PM, John R Pierce pierce@hogranch.com wrote:
Lanny Marcus wrote:
On swapping the two 2GB memory modules, the message changed to "Uncorrectable ECC Error DIMM 2,2".
<snip>
He said 'ECC memory', which to me indicates this is a PowerEdge server, not a Dimension consumer desktop. You get a whole different level of support from the Server folks than you do from the consumer desktop folks.
I hope so. Servers should be supported better than Consumer Desktop boxes.
Timothy Murphy wrote:
My Dell PowerEdge T105 running Centos-5.2 has started crashing fairly often (3 times in the last 2 hours). The message on the screen tells me to look at the System Event Log. Is this just /var/log/messages ?
The only System Event Log I know of is part of the BIOS. If you enter your BIOS setup you might find something recorded, or you can try running "dmidecode -t system" (as root) and see if anything meaningful shows up. There's a manpage for dmidecode, but I doubt you'll find much of interest there.
Robert Nichols wrote:
My Dell PowerEdge T105 running Centos-5.2 has started crashing fairly often (3 times in the last 2 hours). The message on the screen tells me to look at the System Event Log. Is this just /var/log/messages ?
The only System Event Log I know of is part of the BIOS. If you enter your BIOS setup you might find something recorded, or you can try running "dmidecode -t system" (as root) and see if anything meaningful shows up. There's a manpage for dmidecode, but I doubt you'll find much of interest there.
Thanks. I found as I said in another post that I could read the System Event Log by pressing F2 on boot.
I see that "dmidecode -t system" does indeed give me information about the error:
--------------------------------------------------
Descriptor 2: Single-bit ECC memory error
Data Format 2: Multiple-event
Descriptor 3: Multi-bit ECC memory error
Data Format 3: Multiple-event
--------------------------------------------------
Not as explicit as the System Event Log, though. E.g., can I tell from this which module is at fault? (I don't have the faulty module in now.)
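A minimal Python sketch for pulling the descriptor/format pairs out of output like the above. It assumes only the "Descriptor N:" / "Data Format N:" layout quoted here, and parse_log_descriptors is a hypothetical helper name. As far as I understand the SMBIOS System Event Log record (type 15), these descriptors list the event types the log can hold rather than individual logged events, so on their own they probably cannot identify the failing DIMM.

```python
import re


def parse_log_descriptors(text):
    """Return (descriptor, data_format) pairs from dmidecode-style output."""
    descriptors = {}
    formats = {}
    for line in text.splitlines():
        # Real dmidecode output is tab-indented, so allow leading whitespace.
        m = re.match(r"\s*Descriptor (\d+):\s*(.+?)\s*$", line)
        if m:
            descriptors[int(m.group(1))] = m.group(2)
            continue
        m = re.match(r"\s*Data Format (\d+):\s*(.+?)\s*$", line)
        if m:
            formats[int(m.group(1))] = m.group(2)
    # Pair each descriptor with the data format that has the same index.
    return [(descriptors[i], formats.get(i)) for i in sorted(descriptors)]


# The lines quoted in the message above.
sample = """\
Descriptor 2: Single-bit ECC memory error
Data Format 2: Multiple-event
Descriptor 3: Multi-bit ECC memory error
Data Format 3: Multiple-event
"""

for desc, fmt in parse_log_descriptors(sample):
    print(f"{desc} ({fmt})")
```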
Incidentally, does anyone know what the many "Intrusion" entries in the System Event Log mean?
On Feb 10, 2009, at 9:13 AM, Timothy Murphy wrote:
Incidentally, does anyone know what the many "Intrusion" entries in the System Event Log mean?
someone opened the chassis?
http://support.dell.com/support/edocs/software/svradmin/1.8.1/en/messages/ms...
-steve
-----Original Message----- From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On Behalf Of Timothy Murphy Sent: Sunday, February 08, 2009 7:47 PM To: centos@centos.org Subject: [CentOS] What is the System Event Log?
My Dell PowerEdge T105 running Centos-5.2 has started crashing fairly often (3 times in the last 2 hours). The message on the screen tells me to look at the System Event Log. Is this just /var/log/messages ?
If you have OMSA installed, you can check those logs. You can also use IPMI and dmidecode, as mentioned, and check the linux.dell.com wiki for support and diagnostic tools. Depending on what you have installed on that particular server, there can be many logs to check under /var/log/*
I didn't see anything helpful there. As far as I could see, the last messages before the crash were about samba, which is running on the machine. Has anyone had such problems with Samba? I've stopped the smb service to see if this improves matters.
Post the Samba messages and the log. I know of no problems with Samba unless you're trying to run version 4.
I'm running a standard updated system. I looked at smartctl but this did not suggest that there was anything wrong with the 2 SATA disks.
Incidentally, wouldn't it be a good idea for /var/log/messages* to start a new file when booting?
When it gets to a certain size limit, it will roll over to a new log. Lastly, you can send mail to "linux-poweredge@dell.com".
JohnStanley
John wrote:
My Dell PowerEdge T105 running Centos-5.2 has started crashing fairly often (3 times in the last 2 hours). The message on the screen tells me to look at the System Event Log. Is this just /var/log/messages ?
If you have OMSA installed, you can check those logs. You can also use IPMI and dmidecode, as mentioned, and check the linux.dell.com wiki for support and diagnostic tools.
How do I install OMSA? I looked for OMSA* and omsa* packages (with yum) but did not find anything.
Incidentally, wouldn't it be a good idea for /var/log/messages* to start a new file when booting?
When it gets to a certain size limit, it will roll over to a new log.
I realize that. I'm just suggesting that it would be easier to read /var/log/messages if it started a new page on re-booting.
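Until syslog does this itself, a per-boot view can be approximated after the fact. A minimal Python sketch, assuming each boot is marked in /var/log/messages by the kernel banner line (containing "kernel: Linux version"), which klogd normally logs at startup; split_by_boot and tails_before_reboot are hypothetical helper names and the sample lines are made up.

```python
# Assumed marker: klogd normally logs the kernel banner at each boot.
BOOT_MARKER = "kernel: Linux version"


def split_by_boot(lines, marker=BOOT_MARKER):
    """Group syslog lines into per-boot chunks, starting a new chunk at each marker line."""
    boots = [[]]
    for line in lines:
        # Only start a new chunk if the current one already has content.
        if marker in line and boots[-1]:
            boots.append([])
        boots[-1].append(line)
    return boots


def tails_before_reboot(boots, n=20):
    """Return the last n lines logged before each reboot (every chunk except the current one)."""
    return [chunk[-n:] for chunk in boots[:-1]]


# Hypothetical lines for illustration only, not real log output.
sample = [
    "host kernel: Linux version A",
    "host app: message 1",
    "host app: message 2",
    "host kernel: Linux version B",
    "host app: message 3",
]
boots = split_by_boot(sample)
for i, tail in enumerate(tails_before_reboot(boots, n=2), 1):
    print(f"--- last lines of boot {i} ---")
    print("\n".join(tail))
```

On a real system you would read /var/log/messages (plus any rotated messages.N files, oldest first) into a list of lines and pass that to split_by_boot; the tail of each chunk then shows what was logged just before each reboot.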
Lastly you can send a mail to "linux-poweredge@dell.com"
I'm not quite sure what you mean. Is this a Dell technical information site?
-----Original Message----- From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On Behalf Of Timothy Murphy Sent: Tuesday, February 10, 2009 9:07 AM To: centos@centos.org Subject: Re: [CentOS] What is the System Event Log?
John wrote:
My Dell PowerEdge T105 running Centos-5.2 has started crashing fairly often (3 times in the last 2 hours). The message on the screen tells me to look at the System Event Log. Is this just /var/log/messages ?
If you have OMSA installed, you can check those logs. You can also use IPMI and dmidecode, as mentioned, and check the linux.dell.com wiki for support and diagnostic tools.
How do I install OMSA? I looked for OMSA* and omsa* packages (with yum) but did not find anything.
You have to install the Dell Yum Repo http://linux.dell.com/repo/software/ http://linux.dell.com/monitoring.shtml http://linux.dell.com/projects.shtml
Incidentally, wouldn't it be a good idea for /var/log/messages* to start a new file when booting?
When it gets to a certain size limit, it will roll over to a new log.
I realize that. I'm just suggesting that it would be easier to read /var/log/messages if it started a new page on re-booting.
Lastly you can send a mail to "linux-poweredge@dell.com"
I'm not quite sure what you mean. Is this a Dell technical information site?
It is a Dell mailing list, like this one, for technical help with Linux on Dell hardware only.
Timothy Murphy wrote on Tue, 10 Feb 2009 14:06:43 +0000:
How do I install OMSA?
There are repos for it: http://linux.dell.com/wiki/index.php/Repository
Kai
On Monday 09 February 2009, Timothy Murphy wrote:
My Dell PowerEdge T105 running Centos-5.2 has started crashing fairly often (3 times in the last 2 hours). The message on the screen tells me to look at the System Event Log. Is this just /var/log/messages ?
No, the SEL is maintained on the BMC/IPMI-controller. In Linux you can (assuming you have /etc/init.d/ipmi running) view it with: ipmitool sel list
You will need OpenIPMI and OpenIPMI-tools (from base) for the above to work.
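Once the SEL can be listed, the interesting records can be filtered out. A minimal Python sketch, assuming ipmitool prints one pipe-separated record per line with the columns id, date, time, sensor, event and direction (an assumption about the layout; check it against your own output). The sample lines are made up, and read_sel, parse_sel_line and filter_events are hypothetical helper names. The default keywords pick out the ECC and chassis-intrusion entries discussed in this thread.

```python
import subprocess

# Column names assumed for `ipmitool sel list`; verify against your own output.
SEL_COLUMNS = ("id", "date", "time", "sensor", "event", "direction")


def read_sel():
    """Run `ipmitool sel list` locally (needs root and a loaded ipmi driver)."""
    result = subprocess.run(["ipmitool", "sel", "list"],
                            capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


def parse_sel_line(line):
    """Split one pipe-separated SEL line into a dict, or return None if it doesn't look like one."""
    fields = [f.strip() for f in line.split("|")]
    if len(fields) < 5:
        return None
    return dict(zip(SEL_COLUMNS, fields))


def filter_events(lines, keywords=("ECC", "Intrusion")):
    """Keep records whose sensor or event text mentions any keyword (case-insensitive)."""
    hits = []
    for line in lines:
        rec = parse_sel_line(line)
        if rec is None:
            continue
        text = (rec["sensor"] + " " + rec["event"]).lower()
        if any(k.lower() in text for k in keywords):
            hits.append(rec)
    return hits


# Hypothetical lines in the assumed format, not output from a real machine.
sample = [
    "   1 | 02/08/2009 | 19:45:12 | Memory #0x01 | Uncorrectable ECC | Asserted",
    "   2 | 02/08/2009 | 19:50:03 | Physical Security #0x73 | General Chassis intrusion | Asserted",
    "   3 | 02/08/2009 | 19:55:40 | Power Unit #0x01 | Power off/down | Asserted",
]
for rec in filter_events(sample):
    print(rec["date"], rec["time"], rec["sensor"], "-", rec["event"])
```

On a machine where the ipmi service starts, filter_events(read_sel()) would apply the same filter to the live log.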
/Peter
I didn't see anything helpful there. As far as I could see, the last messages before the crash were about samba, which is running on the machine. Has anyone had such problems with Samba? I've stopped the smb service to see if this improves matters.
I'm running a standard updated system. I looked at smartctl but this did not suggest that there was anything wrong with the 2 SATA disks.
Incidentally, wouldn't it be a good idea for /var/log/messages* to start a new file when booting?
Peter Kjellstrom wrote:
My Dell PowerEdge T105 running Centos-5.2 has started crashing fairly often (3 times in the last 2 hours). The message on the screen tells me to look at the System Event Log. Is this just /var/log/messages ?
No, the SEL is maintained on the BMC/IPMI-controller. In Linux you can (assuming you have /etc/init.d/ipmi running) view it with: ipmitool sel list
You will need OpenIPMI and OpenIPMI-tools (from base) for the above to work.
Thanks for the info. As I mentioned in another posting, I found that I could read the System Event Log after pressing F2 (Enter Setup) on boot.
I tried the command above, but it did not work. I found I already had OpenIPMI (which I have never heard of) installed, and I yum-installed OpenIPMI-tools. But when I ran the command I got:
-------------------------------------------------
[tim@helen ~]$ sudo service ipmi restart
Stopping all ipmi drivers: [ OK ]
Starting ipmi drivers: [FAILED]
[tim@helen ~]$ sudo ipmitool sel list
Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0: No such file or directory
Get SEL Info command failed
-------------------------------------------------
On Tuesday 10 February 2009, Timothy Murphy wrote:
Peter Kjellstrom wrote:
...
No, the SEL is maintained on the BMC/IPMI-controller. In Linux you can (assuming you have /etc/init.d/ipmi running) view it with: ipmitool sel list
You will need OpenIPMI and OpenIPMI-tools (from base) for the above to work.
...
I tried the command above, but it did not work. I found I already had OpenIPMI (which I have never heard of) installed, and I yum-installed OpenIPMI-tools. But when I ran the command I got:
[tim@helen ~]$ sudo service ipmi restart
Stopping all ipmi drivers: [ OK ]
Starting ipmi drivers: [FAILED]
This should not happen but essentially means that the kernel you are running does not have a driver compatible with the server you are running it on.
[tim@helen ~]$ sudo ipmitool sel list
Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0: No such file or directory
Get SEL Info command failed
This device is created by the step that failed above. Without a properly loaded ipmi driver you cannot access the BMC via a local device.
However, with or without /dev/ipmi0 you can access the BMC remotely with (assuming you have an IP configured etc.).
/Peter
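Before running ipmitool locally, it can help to check whether the driver created any of the device nodes listed in the error message above. A minimal Python sketch; the three paths are copied from that error message, and find_ipmi_device is a hypothetical helper name.

```python
import os

# The device nodes ipmitool reported trying, copied from the error message above.
IPMI_DEVICE_PATHS = ("/dev/ipmi0", "/dev/ipmi/0", "/dev/ipmidev/0")


def find_ipmi_device(paths=IPMI_DEVICE_PATHS):
    """Return the first IPMI device node that exists, or None if none was created."""
    for path in paths:
        if os.path.exists(path):
            return path
    return None


if __name__ == "__main__":
    dev = find_ipmi_device()
    if dev is None:
        print("No IPMI device node: the ipmi driver did not load, or there is no BMC.")
    else:
        print("IPMI device node found:", dev)
```

A None result matches the situation described here: either the kernel has no suitable driver or, as turns out to be the case for the T105, the board has no BMC to talk to.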
Peter Kjellstrom wrote:
However, with or without /dev/ipmi0 you can access the BMC remotely with (assuming you have an IP configured etc.).
I'd never heard of BMC (I am not an expert in this area, to put it mildly) but on googling for "dell bmc" I found http://www.dell.com/downloads/global/power/ps1q05-20040219-Brumley.pdf where I learnt that "The on-board baseboard management controller (BMC) is a powerful and flexible device that can be used to effectively manage eighth-generation Dell servers such as the PowerEdge 1850, PowerEdge 2800, and PowerEdge 2850."
This suggests to me that my modest PowerEdge T105 probably does not support this service.
How would I access it if it were available? I see a large number of BMC-related ports in /etc/services , but none of them seem to be active on my server.
Timothy Murphy wrote:
This suggests to me that my modest PowerEdge T105 probably does not support this service.
From my brief look, no, it does not; it is very rare for desktop or even workstation systems to have management chips on them. And at least the T105 has no DRAC options either (not surprised).
How would I access it if it were available?
You have to configure it first, how you do that depends, sometimes you can configure it via openipmi.
I see a large number of BMC-related ports in /etc/services , but none of them seem to be active on my server.
Not related, I don't believe.
nate
On Tue, Feb 10, 2009, nate wrote:
Timothy Murphy wrote:
...
How would I access it if it were available?
You have to configure it first, how you do that depends, sometimes you can configure it via openipmi.
I just installed the OpenIPMI-tools package with yum yesterday on a Supermicro box that seems to be having overheating or fan problems. This is my first time looking into IPMI monitoring, but looking at the contents of the OpenIPMI-tools rpm package, it seems to be missing some of the configuration files necessary to run things like the ipmievd daemon (e.g. no /etc/init.d/ipmievd script, only /usr/share/ipmitool/ipmievd.init.redhat, and no sample of the /etc/sysconfig/ipmievd file).
Any suggestions on documentation covering configuration on CentOS systems? I have no problems with RTFM, if only I knew where to find TFM.
Bill
On Wednesday 11 February 2009, Bill Campbell wrote:
On Tue, Feb 10, 2009, nate wrote:
Timothy Murphy wrote:
...
How would I access it if it were available?
You have to configure it first, how you do that depends, sometimes you can configure it via openipmi.
I just installed the OpenIPMI-tools package with yum yesterday on a Supermicro box that seems to be having overheating or fan problems. This is my first time looking into IPMI monitoring, but looking at the contents of the OpenIPMI-tools rpm package, it seems to be missing some of the configuration files necessary to run things like the ipmievd daemon (e.g. no /etc/init.d/ipmievd script, only /usr/share/ipmitool/ipmievd.init.redhat
Yes, ipmievd is lacking a proper init.d script and if you want to run it the file you found above is probably the way to go (copy it to /etc/init.d/..., chkconfig, etc.).
But evd is not needed to talk to the IPMI-controller/BMC/service-processor/whatever. To talk locally the only thing needed is to start the ipmi service (init.d-file from the OpenIPMI package) and then, for example, "ipmitool sel list".
/Peter
, and no sample of the /etc/sysconfig/ipmievd file).
Any suggestions on documentation covering configuration on CentOS systems? I have no problems with RTFM, if only I knew where to find TFM.
Bill
On Tuesday 10 February 2009, Timothy Murphy wrote:
Peter Kjellstrom wrote:
However, with or without /dev/ipmi0 you can access the BMC remotely with (assuming you have an IP configured etc.).
I'd never heard of BMC (I am not an expert in this area, to put it mildly)
BMC, on-board baseboard management controller, IPMI-controller, service processor, ...
The piece of hardware that runs independently from the main part of the server and that typically does things like:
* keep the SEL
* perform power control
* monitor temperatures and fan speeds
* provide serial port over LAN functionality
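On a board that has a BMC, the temperature and fan side of this (relevant to Bill's overheating question) can be read with `ipmitool sdr list`. A minimal Python sketch that flags anything not reported as ok, assuming three pipe-separated columns (name, reading, status) and the usual ok/ns/nc/cr/nr status codes; both the layout and the codes are assumptions to check against your own output, the sample readings are invented, and flag_sensors is a hypothetical name.

```python
def flag_sensors(lines):
    """Return (name, reading, status) for sensors whose status is not 'ok'.

    Assumes the three-column 'name | reading | status' layout of
    `ipmitool sdr list`; sensors reported as 'ns' (no reading) are skipped.
    """
    flagged = []
    for line in lines:
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 3:
            continue
        name, reading, status = parts
        if status not in ("ok", "ns"):
            flagged.append((name, reading, status))
    return flagged


sample = [  # hypothetical readings, not from a real machine
    "FAN 1            | 5400 RPM          | ok",
    "FAN 2            | 0 RPM             | cr",
    "CPU Temp         | 78 degrees C      | nc",
    "PS Status        | 0x00              | ns",
]
for name, reading, status in flag_sensors(sample):
    print(f"{name}: {reading} [{status}]")
```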
but on googling for "dell bmc" I found http://www.dell.com/downloads/global/power/ps1q05-20040219-Brumley.pdf where I learnt that "The on-board baseboard management controller (BMC) is a powerful and flexible device that can be used to effectively manage eighth-generation Dell servers such as the PowerEdge 1850, PowerEdge 2800, and PowerEdge 2850."
This suggests to me that my modest PowerEdge T105 probably does not support this service.
Googling a bit it does seem that the T105 lacks a BMC :-(
Earlier cheap servers from Dell (the SC1435, for example) did have it...
How would I access it if it were available? I see a large number of BMC-related ports in /etc/services , but none of them seem to be active on my server.
The BMC is independent from the OS, think a small separate server inside your real server.
If a server has one you access it in one of the following ways:
* service ipmi start, local access (requires OpenIPMI kernel driver support)
* via LAN to a dedicated management ethernet port on the server
* via LAN to a shared ethernet port on the server
LAN access depends on the BMC having a connection to an ethernet port and a working TCP/IP configuration.
/Peter
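For the LAN case, the ipmitool invocation differs from the local one mainly in the interface options. A minimal Python sketch that builds the command line; the host address and user name are placeholders, remote_sel_command is a hypothetical helper, and it assumes the BMC's LAN channel has already been configured with an IP address and a user.

```python
import shlex


def remote_sel_command(host, user, interface="lanplus"):
    """Build an ipmitool argument list that reads a BMC's SEL over the network.

    interface: "lanplus" for IPMI 2.0 BMCs, "lan" for IPMI 1.5 ones.
    "-E" tells ipmitool to read the password from the environment
    (IPMI_PASSWORD) instead of putting it on the command line.
    """
    if interface not in ("lan", "lanplus"):
        raise ValueError("interface must be 'lan' or 'lanplus'")
    return ["ipmitool", "-I", interface, "-H", host, "-U", user, "-E", "sel", "list"]


# 192.0.2.10 and "admin" are placeholders (192.0.2.0/24 is a documentation range).
cmd = remote_sel_command("192.0.2.10", "admin")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run with IPMI_PASSWORD set in the environment. Per the discussion above, this does not help on a T105, which appears to have no BMC at all.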