Kernel Panic on HP/Compaq ProLiant G7

Windsor Dave L. (AdP/TEF7.1)

24 Mar 2011

3:03 p.m.

Hello Everyone,

I recently installed CentOS 5.5 x86_64 on a brand new ProLiant DL380 G7. I have identical OS software running rock-solid on two other DL380 ProLiant servers, but they are G6 models, not G7. On the G7, the installation went perfectly and the machine ran great for about 2 weeks, when it just seemed to "stop". The system stopped responding on the network, and there was no video on the console (or remote console via iLO). It would not reboot or cold boot through iLO; I actually had to hold the power button to turn it off and then press it again to power up.

This happened several times within a few days of each other. Each time, there was no evidence in any logs of a problem - the system just seemed to stop or lock up. We did have a CPU problem light appear on the front, so HP came in and replaced the one 4-core CPU. Since then, it has run as long as two weeks, but still crashes randomly. After the last reboot, I left the console in text mode on vt1, and when it crashed again this morning this was displayed on the screen:

CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff8100dc435cf0 CR3: 000000008a6ca000 CR4: 00000000000006e0
Process smbd (pid: 18970, threadinfo ffff81001529e000, task ffff81011f5347a0)
Stack: ffff81011e4e71c0 0000000000000000 ffff8100cf12a015 ffffffff80009c41
 ffff81011e4e71c0 0000000100000000 000000030027ea9d ffff8100cf12a011
 ffff81011e4e71c0 ffff81010d9cf300 ffff81011e4e71c0 ffff8101044099c0
Call Trace:
 [<ffffffff80009c41>] __link_path_walk+0x3a6/0xf5b
 [<ffffffff8000ea4b>] link_path_walk+0x42/0xb2
 [<ffffffff8000cd72>] do_path_lookup+0x275/0x2f1
 [<ffffffff80012851>] getname+0x15b/0x1c2
 [<ffffffff800239d1>] __user_walk_fd+0x37/0x4c
 [<ffffffff80028905>] vfs_stat_fd+0x1b/0x4a
 [<ffffffff80039fa2>] fcntl_setlk+0x243/0x273
 [<ffffffff80023703>] sys_newstat+0x19/0x31
 [<ffffffff8005d229>] tracesys+0x71/0xe0
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

Code: 00 00 00 00 00 00 00 00 70 4d 4f 9d 00 81 ff ff 98 e4 4b dc
RIP [<ffff8100dc435cf0>]
 RSP <ffff81001529fd18>
CR2: ffff8100dc435cf0
 <0>Kernel panic - not syncing: Fatal exception

This suggests that something happened in a Samba process. I have the Samba3x packages installed since we are beginning to introduce Win7 clients into our environment.

Googling "Kernel panic - not syncing: Fatal exception" and "CentOS" produced many hits, but nothing that seemed to exactly match my problem. Since this is the only G7 server I have here right now, I can't reproduce the problem on another machine. The G6s I have running the identical version of CentOS have no problems.

I am trying to determine if this is pointing to a hardware or software issue. Some of the Google results suggested using a Centosplus kernel - is this a good idea?

The server is an HP DL380 G7 Server with 4 GB RAM (1 DIMM 1333 MHz), one 4-core CPU (2133 MHz), 4 built-in Broadcom "NetXtreme II BCM5709 II Gigabit Ethernet" NICs, and a P410 Smart Array Controller. The P410 and the system BIOS have both been updated to the latest levels to see if that fixes the crashes, with no change.

Any idea where I should look next?

Thanks for any help anyone can provide!

Best Regards,

Dave Windsor

Robert Bosch LLC Team Leader, MES Database Infrastructure Group (AdP/TEF7.1) 4421 Highway 81 North Anderson, SC 29621 USA www.bosch.us

Tel: 1 (864) 260-8459 Fax: 1 (864) 260-8422 Dave.Windsor@us.bosch.com

Rob Kampen

24 Mar

3:07 p.m.

Windsor Dave L. (AdP/TEF7.1) wrote:

...

Any idea where I should look next?

Run memtest for 48 hours - also check temperature of system - I have seen errors like these from overheating. HTH

CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos

Alain Péan

4:37 p.m.

On 24/03/2011 16:03, Windsor Dave L. (AdP/TEF7.1) wrote:

...

The fact that it appears after two weeks or so reminds me of a bug I saw on the Linux PowerEdge mailing list, the "blocked for more than 120 seconds" timeout bug. I don't know if your problem is related, but if it is, you should see the message in your logs.

Do you have any high I/O load, at least at some moments, on your server?

See: http://lists.us.dell.com/pipermail/linux-poweredge/2011-March/044515.html

In that case, using a newer kernel would indeed seem to be a good idea.

See if it can help...

Alain

-- ========================================================== Alain Péan - LPP/CNRS Administrateur Système/Réseau Laboratoire de Physique des Plasmas - UMR 7648 Observatoire de Saint-Maur 4, av de Neptune, Bat. A 94100 Saint-Maur des Fossés Tel : 01-45-11-42-39 - Fax : 01-48-89-44-33 ==========================================================

Dave Windsor

5:30 p.m.

On 3/24/2011 12:37 PM, Alain Péan wrote:

...

Alain,

Today, there are not high I/O loads. This server was intended to replace two older HP-UX servers. I had just begun to migrate the workload to the new server when the crashes began to occur. There are some minor, sporadic I/O loads but nothing that I would think could trigger the bug discussed in your link. However, I haven't measured the workload closely yet, so there could be spikes.

Best Regards,

Dave Windsor

Alain Péan

5:44 p.m.

On 24/03/2011 18:30, Dave Windsor wrote:

...

Your error message, "Kernel panic - not syncing: Fatal exception", is too generic to give any clue. Do you see other error messages in your logs?

Did you run any hardware tests to see if some hardware is failing, for example RAM? (With Dell you have such utilities on DVD; I think they also exist for HP.)

Alain

Windsor Dave L. (AdP/TEF7)

6:01 p.m.

On 3/24/2011 1:44 PM, Alain Péan wrote:

...

There are no error messages in any logs. For example, in /var/log/messages, everything looks normal until you see the kernel restart messages after the reboot, although there seems to be a long gap in time between the last entry and the time when the system actually stopped and was restarted. Whatever is happening, the system doesn't seem to be in a state where the problem can be recorded.

By the way, I forgot to list my kernel version. uname -rmi gives: 2.6.18-194.32.1.el5 x86_64 x86_64
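
The long silence in /var/log/messages before each crash can be located mechanically rather than by eye. Below is a minimal Python sketch, assuming the classic syslog line format with a leading "Mon DD HH:MM:SS" stamp and no year (so the year has to be supplied). The function names, the 30-minute threshold, and the sample lines are made up for illustration; they are not taken from the affected server.

```python
from datetime import datetime, timedelta


def parse_syslog_time(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a classic syslog line.

    Classic syslog omits the year, so the caller supplies it.
    Raises ValueError if the line does not start with such a stamp.
    """
    stamp = line[:15]  # e.g. "Mar 24 10:15:01" or "Mar  4 10:15:01"
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def find_gaps(lines, year, threshold=timedelta(minutes=30)):
    """Return (previous, current) timestamp pairs further apart than threshold."""
    gaps = []
    previous = None
    for line in lines:
        try:
            current = parse_syslog_time(line, year)
        except ValueError:
            continue  # skip lines that do not begin with a timestamp
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current))
        previous = current
    return gaps


if __name__ == "__main__":
    # Hypothetical sample lines, only to show the call shape.
    sample = [
        "Mar 24 08:00:01 host1 crond[123]: job started",
        "Mar 24 08:10:01 host1 crond[124]: job started",
        "Mar 24 10:45:12 host1 kernel: Linux version 2.6.18-194.32.1.el5",
    ]
    for start, end in find_gaps(sample, 2011):
        print(f"no log entries for {end - start} (from {start} to {end})")
```

Against the real file, one would pass `open("/var/log/messages")` as `lines`. The last stamp before each gap is a rough lower bound on when the machine stopped logging, which can then be compared with iLO event times.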

-- Best Regards,

Dave Windsor

m.roth＠5-cent.us

6:27 p.m.

Windsor Dave L. (AdP/TEF7) wrote:

...

<snip> Weird question that I've brought up once recently here with someone else: what kinds of programs are running? Anything that might want a *ridiculous* amount of memory, all at once?

Another question: what's the a/c/airflow around the server like?

mark

Windsor Dave L. (AdP/TEF7)

8:45 p.m.

On 3/24/2011 2:27 PM, m.roth@5-cent.us wrote:

...

Nothing of which I'm aware would need much memory, unless a system service like Samba has a serious memory leak. I will watch that more closely over the next couple of days and look for trends.

According to the iLO system info, all the temps are in normal range, and the ambient temp is 66 degrees F (it's racked in a server room with dedicated air handlers).

Best Regards,

Dave Windsor

rainer＠ultra-secure.de

6:29 p.m.

...

Can you install RHEL and see if it goes away? Otherwise, open a ticket with RedHat...

Rainer

John Doe

25 Mar

11:51 a.m.

From: Dave Windsor Dave.Windsor@us.bosch.com

...

Tried hpdiags? Checked the BIOS logs (hplog -v)? If you can, boot from the SmartStart CD and run the diagnostic tests.

Windsor Dave L. (AdP/TEF7)

3:59 p.m.

On 3/25/2011 7:51 AM, John Doe wrote:

...

I hadn't thought about the diagnostics on the SmartStart CD - that's a good idea. I'll have to move the workload off to another server first, though.

Best Regards,

Dave Windsor

m.roth＠5-cent.us

4:08 p.m.

Windsor Dave L. (AdP/TEF7) wrote:

...

Before you do that, as I suggested before, have you tried ipmitool? Look for the OEM-specific commands, if there are any, or just the alarms or errors. And you can run ipmitool while the system is still running and available.

mark

Rajagopal Swaminathan

24 Mar

7:02 p.m.

Greetings,

On 3/24/11, Windsor Dave L. (AdP/TEF7.1) Dave.Windsor@us.bosch.com wrote:

...

[<ffffffff80028905>] vfs_stat_fd+0x1b/0x4a

My gut feeling is that it's hardware (I/O channel).

imho

Regards,

Rajagopal

Dr. Ed Morbius

8:38 p.m.

Dave:

on 16:03 Thu 24 Mar, Windsor Dave L. (AdP/TEF7.1) (Dave.Windsor@us.bosch.com) wrote:

...

This suggests that something happened in a Samba process.

Correct.

If this is regularly happening in Samba, that would point to a problem with your Samba config (either on that host, something remotely stuffing bad packets at you, or likely in that case, both, as bad data shouldn't crash the host).

If this is happening in different programs over time, then the problem is likely /not/ software, but hardware/firmware.
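
One way to apply that test is to keep every captured panic screen and tally which process each one names. Here is a rough Python sketch, assuming each capture contains a "Process NAME (pid: N, ..." line and "[<addr>] function+0xoff/0xlen" frames in the layout shown in the original post. The helper names are hypothetical, and the only capture used is the one from this thread.

```python
import re
from collections import Counter

# "Process smbd (pid: 18970, ..." as printed by this 2.6.18 kernel.
PROCESS_RE = re.compile(r"Process\s+(\S+)\s+\(pid:\s*\d+")
# "[<ffffffff80009c41>] __link_path_walk+0x3a6/0xf5b" call-trace frames.
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+([\w.]+)\+0x")


def crashing_process(oops_text):
    """Return the process name from an oops capture, or None if absent."""
    match = PROCESS_RE.search(oops_text)
    return match.group(1) if match else None


def trace_functions(oops_text):
    """Return the function names of the call-trace frames, in printed order."""
    return FRAME_RE.findall(oops_text)


def tally(captures):
    """Count how often each process name appears across several captures."""
    names = (crashing_process(text) for text in captures)
    return Counter(name for name in names if name)


if __name__ == "__main__":
    # The single capture posted so far; append later ones as they occur.
    captures = [
        "Process smbd (pid: 18970, threadinfo ffff81001529e000, "
        "task ffff81011f5347a0) Call Trace: "
        "[<ffffffff80009c41>] __link_path_walk+0x3a6/0xf5b "
        "[<ffffffff8000ea4b>] link_path_walk+0x42/0xb2",
    ]
    print(tally(captures))
    print(trace_functions(captures[0]))
```

If the tally stays on smbd over several crashes, Samba is the place to look; if the names scatter, that supports the hardware/firmware reading. A single capture, as here, cannot distinguish the two.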

The LKML may be able to help you on your panic; please read their bug posting guidelines /BEFORE/ posting.

...

I have the Samba3x packages installed since we are beginning to introduce Win7 clients into our environment.

What happens if you take the Win7 clients away?

...

Googling "Kernel panic - not syncing: Fatal exception" and "CentOS"

That is the generic kernel panic message. It's going to be spectacularly unspecific.

...

produced many hits, but nothing that seemed to exactly match my problem. Since this is the only G7 server I have here right now, I can't reproduce the problem on another machine. The G6s I have running the identical version of CentOS have no problems.

I am trying to determine if this is pointing to a hardware or software issue. Some of the Google results suggested using a Centosplus kernel - is this a good idea?

Dell have had numerous issues with recent server editions; it's possible HP have as well:

- If you haven't, configure the netconsole kernel module for kernel-enabled network logging of panics.

- Call HP and find out what the latest recommended BIOS and firmware upgrades for your system are. C-STATE has been a particular issue with Dell, and it's been disabled entirely in recent BIOS versions. I see below you've updated BIOS.

- Scan logs for other messages, particularly panics and/or ECC issues.

- If you can stand the downtime, run memtest86+ at least overnight on your RAM. A reboot indicates a failed test.

- Otherwise: try running with half your RAM swapped.

- Check/reseat all DIMMs, sockets, and cables. Some folks caution against this on the basis of connector wear, but if you've got a problem, this may help resolve it, and I've seen boxes shipped with components poorly or even un-cabled.

- Does a similarly equipped system exhibit the same problems?
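
For the netconsole item in the list above, the module takes one parameter string which, as far as I recall from the kernel's netconsole documentation, has the form netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]; check the documentation shipped with your kernel before relying on it. Here is a small Python sketch that assembles such a string. The function name is hypothetical, and the addresses used are documentation-range placeholders, not values from this thread.

```python
import ipaddress


def netconsole_param(tgt_ip, src_ip="", dev="", src_port=None,
                     tgt_port=None, tgt_mac=""):
    """Build a 'netconsole=...' module parameter string.

    Only the target IP is required; fields left empty are omitted from
    the string so the kernel can fall back to its defaults.
    Raises ValueError if an IP address is not syntactically valid.
    """
    ipaddress.ip_address(tgt_ip)  # validate the mandatory target address
    if src_ip:
        ipaddress.ip_address(src_ip)
    src_port_text = "" if src_port is None else str(src_port)
    tgt_port_text = "" if tgt_port is None else str(tgt_port)
    source = f"{src_port_text}@{src_ip}/{dev}"
    target = f"{tgt_port_text}@{tgt_ip}/{tgt_mac}"
    return f"netconsole={source},{target}"


if __name__ == "__main__":
    # Placeholder addresses: 192.0.2.0/24 is reserved for documentation.
    print(netconsole_param("192.0.2.10", src_ip="192.0.2.20", dev="eth0",
                           src_port=6665, tgt_port=6666))
```

The resulting string would be passed when loading the module (for example, as the argument to modprobe netconsole), with a syslog daemon or a plain UDP listener on the target host collecting the messages. The target should be on the same switch segment if possible, since a panicking kernel gets few chances to retransmit.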

...

The server is an HP DL380 G7 Server with 4 GB RAM (1 DIMM 1333 MHz), one 4-core CPU (2133 MHz), 4 built-in Broadcom "NetXtreme II BCM5709 II Gigabit Ethernet" NICs, and a P410 Smart Array Controller. The P410 and the system BIOS have both been updated to the latest levels to see if that fixes the crashes, with no change.

Ugh. Broadcom's gotten better but I prefer Intel NICs. Can't speak to the others. And OK, you've updated BIOS.

--
Dr. Ed Morbius, Chief Scientist /            |
  Robot Wrangler / Staff Psychologist        | When you seek unlimited power
Krell Power Systems Unlimited                |                  Go to Krell!

Windsor Dave L. (AdP/TEF7)

8:56 p.m.

On 3/24/2011 4:38 PM, Dr. Ed Morbius wrote:

...

Dave:

on 16:03 Thu 24 Mar, Windsor Dave L. (AdP/TEF7.1) (Dave.Windsor@us.bosch.com) wrote:

...
Hello Everyone,

Code: 00 00 00 00 00 00 00 00 70 4d 4f 9d 00 81 ff ff 98 e4 4b dc RIP [<ffff8100dc435cf0>] RSP<ffff81001529fd18> CR2: ffff8100dc435cf0 <0>Kernel panic - not syncing: Fatal exception

This suggests that something happened in a Samba process.

Correct.

If this is regularly happening in Samba, that would point to a problem with your Samba config (either on that host, something remotely stuffing bad packets at you, or likely in that case, both, as bad data shouldn't crash the host).

I can have a network analyst monitor the ports for unusual bursts of traffic, although that might not catch small amounts of strange data.

...

If this is happening in different programs over time, then the problem is likely /not/ software, but hardware/firmware.

The LKML may be able to help you on your panic; please read their bug posting guidelines /BEFORE/ posting.

...
I have the Samba3x packages installed since we are beginning to introduce Win7 clients into our environment.

What happens if you take the Win7 clients away?

...
Googling "Kernel panic - not syncing: Fatal exception" and "CentOS"

That is the generic kernel panic message. It's going to be spectacularly unspecific.

...
produced many hits, but nothing that seemed to exactly match my problem. Since this is the only G7 server I have here right now, I can't reproduce the problem on another machine. The G6s I have running the identical version of CentOS have no problems.

I am trying to determine if this is pointing to a hardware or software issue. Some of the Google results suggested using a CentOSPlus kernel. Is this a good idea?

Dell have had numerous issues with recent server editions; it's possible HP have as well:

If you haven't, configure the netconsole kernel module for kernel-enabled network logging of panics.

This is a great idea. I will work on that soonest.

...

Call HP and find out what the latest recommended BIOS and firmware upgrades for your system are. C-STATE has been a particular issue with Dell, and it's been disabled entirely in recent BIOS versions. I see below you've updated BIOS.

Scan logs for other messages, particularly panics and/or ECC issues.

I haven't seen anything ominous, although I have noticed a long time gap between the last entry in /var/log/messages and the actual crash. Such a gap in entries is very unusual.

...

If you can stand the downtime, run memtest86+ at least overnight on your RAM. A reboot indicates a failed test.

Otherwise: try running with half your RAM swapped.

Check/reseat all DIMMs, sockets, and cables. Some folks caution against this on the basis of connector wear, but if you've got a problem, this may help resolve it, and I've seen boxes shipped with components poorly or even un-cabled.

We have one DIMM of 4 GB RAM, so I can't swap it out or run with half. I have reseated it and inspected the contacts, and it looks OK. I will look at anything else with connectors.

...

Does a similarly equipped system exhibit the same problems?

...
The server is an HP DL380 G7 Server with 4 GB RAM (1 DIMM 1333 MHz), one 4-core CPU (2133 MHz), 4 built-in Broadcom "NetXtreme II BCM5709 II Gigabit Ethernet" NICs, and a P410 Smart Array Controller. The P410 and the system BIOS have both been updated to the latest levels to see if that fixes the crashes, with no change.

Ugh. Broadcom's gotten better but I prefer Intel NICs. Can't speak to the others. And OK, you've updated BIOS.

Thanks for your help!

Best Regards,

Dave Windsor

Robert Bosch LLC Team Leader, MES Database Infrastructure Group (AdP/TEF7.1) 4421 Highway 81 North Anderson, SC 29621 USA www.bosch.us

Tel: 1 (864) 260-8459 Fax: 1 (864) 260-8422 Dave.Windsor@us.bosch.com

Dr. Ed Morbius

9:05 p.m.

on 16:56 Thu 24 Mar, Windsor Dave L. (AdP/TEF7) (Dave.Windsor@us.bosch.com) wrote:

...

On 3/24/2011 4:38 PM, Dr. Ed Morbius wrote:

...
Dave:

on 16:03 Thu 24 Mar, Windsor Dave L. (AdP/TEF7.1) (Dave.Windsor@us.bosch.com) wrote:

...
Hello Everyone,

Code: 00 00 00 00 00 00 00 00 70 4d 4f 9d 00 81 ff ff 98 e4 4b dc
RIP [<ffff8100dc435cf0>]
RSP <ffff81001529fd18>
CR2: ffff8100dc435cf0
<0>Kernel panic - not syncing: Fatal exception

This suggests that something happened in a Samba process.

<...>

...

...

If you haven't, configure the netconsole kernel module for kernel-enabled network logging of panics.

This is a great idea. I will work on that soonest.

It really is about four times as cool as it sounds. Getting the actual panic is hugely useful.
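
For the receiving end, any host that can accept UDP will do. netconsole sends kernel messages as plain-text UDP datagrams, and the kernel's netconsole documentation gives 6666 as the default target port. The following is only a minimal sketch of a receiver, assuming that default port; the function names `open_listener` and `receive_messages` are made up for illustration and are not part of any tool mentioned in this thread.

```python
import socket


def open_listener(bind_addr="0.0.0.0", port=6666):
    """Open a UDP socket for netconsole datagrams (6666 is the documented default target port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, port))
    return sock


def receive_messages(sock, count):
    """Block until `count` datagrams arrive; return a list of (sender_ip, text) tuples."""
    messages = []
    for _ in range(count):
        data, (host, _port) = sock.recvfrom(4096)
        # Panic output can contain stray bytes, so decode leniently.
        messages.append((host, data.decode("utf-8", errors="replace")))
    return messages


# Hypothetical usage on the log host:
#   sock = open_listener()
#   while True:
#       for host, text in receive_messages(sock, 1):
#           print(host, text, end="")
```

In practice you would append the received text to a file, so the last lines before a panic survive even when the crashing box never gets them to disk.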

...

...

Call HP and find out what the latest recommended BIOS and firmware upgrades for your system are. C-STATE has been a particular issue with Dell, and it's been disabled entirely in recent BIOS versions. I see below you've updated BIOS.

Scan logs for other messages, particularly panics and/or ECC issues.

I haven't seen anything ominous, although I have noticed a long time gap between the last entry in /var/log/messages and the actual crash. Such a gap in entries is very unusual.

You can create a "timestamp" cron job. Just a

*/10 * * * * root logger "--- TIMESTAMP ---"

... entry. At least you'll see any long dry periods.

sar is also a useful utility to look at. It should be recording and reporting system state and resource utilization levels prior to the crash.
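
Once regular timestamp entries are in the log, the dry periods can be found mechanically rather than by eye. The following is a rough sketch, assuming the classic syslog line prefix ("Mar 24 10:15:01 ..."); the function name `find_gaps` and the 20-minute threshold are illustrative choices, and the year has to be passed in because that timestamp format does not carry one.

```python
from datetime import datetime, timedelta


def find_gaps(lines, year, threshold=timedelta(minutes=20)):
    """Return (previous, current) timestamp pairs that are further apart than `threshold`."""
    gaps = []
    prev = None
    for line in lines:
        try:
            # The first 15 characters hold the timestamp, e.g. "Mar 24 10:15:01".
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a syslog timestamp
        if prev is not None and stamp - prev > threshold:
            gaps.append((prev, stamp))
        prev = stamp
    return gaps


# Hypothetical usage:
#   with open("/var/log/messages") as log:
#       for before, after in find_gaps(log, 2011):
#           print(f"no entries between {before} and {after}")
```

With a marker every 10 minutes, a threshold of two intervals flags a stall without tripping on a single late entry. The last timestamp before a gap gives a rough upper bound on when the box stopped logging.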

...

...

If you can stand the downtime, run memtest86+ at least overnight on your RAM. A reboot indicates a failed test.

Otherwise: try running with half your RAM swapped.

Check/reseat all DIMMs, sockets, and cables. Some folks caution against this on the basis of connector wear, but if you've got a problem, this may help resolve it, and I've seen boxes shipped with components poorly or even un-cabled.

We have one DIMM of 4 GB RAM, so I can't swap it out or run with half. I have reseated it and inspected the contacts, and it looks OK. I will look at anything else with connectors.

Actually, you can. Setting 'mem=2G' at your boot prompt will cue the kernel to use only half the RAM. Now, you can't specify an offset to use the high half, unfortunately. You could also swap the DIMM with another system if you've got it and see if you still have the problems in this one (or start seeing them in the other).

-- Dr. Ed Morbius, Chief Scientist / | Robot Wrangler / Staff Psychologist | When you seek unlimited power Krell Power Systems Unlimited | Go to Krell!

Mogens Kjaer

25 Mar

12:18 p.m.

On 03/24/2011 10:05 PM, Dr. Ed Morbius wrote:

...

You can create a "timestamp" cron job. Just a
 */10 * * * * root logger "--- TIMESTAMP ---"

syslogd already has this built in. It's normally disabled by the "-m 0" in /etc/sysconfig/syslog. Change the zero to 10, restart syslogd and you get the same result.

Mogens

-- Mogens Kjaer, mk@lemo.dk http://www.lemo.dk

Dr. Ed Morbius

8:42 p.m.

on 13:18 Fri 25 Mar, Mogens Kjaer (mk@lemo.dk) wrote:

...

On 03/24/2011 10:05 PM, Dr. Ed Morbius wrote:

...
You can create a "timestamp" cron job. Just a
 */10 * * * * root logger "--- TIMESTAMP ---"

syslogd already has this built in. It's normally disabled by the "-m 0" in /etc/sysconfig/syslog. Change the zero to 10, restart syslogd and you get the same result.

Thanks, yeah, I wasn't sure if it was or wasn't, too many system variants over the years.

-- Dr. Ed Morbius, Chief Scientist / | Robot Wrangler / Staff Psychologist | When you seek unlimited power Krell Power Systems Unlimited | Go to Krell!

m.roth＠5-cent.us

24 Mar

9:06 p.m.

Dave,

Here's a thought: have you tried ipmitool?

mark

Windsor Dave L. (AdP/TEF7)

1 Apr

5:44 p.m.

On 3/24/2011 11:03 AM, Windsor Dave L. (AdP/TEF7.1) wrote:

...

Hello Everyone,

I recently installed CentOS 5.5 x86_64 on a brand new ProLiant DL380 G7. I have identical OS software running rock-solid on two other DL380 ProLiant servers, but they are G6 models, not G7. On the G7, the installation went perfectly and the machine ran great for about 2 weeks, when it just seemed to "stop". The system stopped responding on the network, and there was no video on the console (or remote console via iLO). It would not reboot or cold boot through iLO; I actually had to hold the power to turn it off and then hit it again to power up.

This happened several times within a few days of each other. Each time, there was no evidence in any logs of a problem - the system just seemed to stop or lock up. We did have a CPU problem light appear on the front, so HP came in and replaced the one 4-core CPU. Since then, it has run as long as two weeks, but still crashes randomly. After the last reboot, I left the console in text mode on vt1, and when it crashed again this morning this was displayed on the screen:

CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff8100dc435cf0 CR3: 000000008a6ca000 CR4: 00000000000006e0
Process smbd (pid: 18970, threadinfo ffff81001529e000, task ffff81011f5347a0)

...

<0>Kernel panic - not syncing: Fatal exception

OK everyone, here is an update:

The server crashed again overnight. This time, the following error messages were on the console:

HARDWARE ERROR
CPU 3: Machine Check Exception: 4 Bank 5: ba00000000400405
TSC 5172b45d44f0a MISC 80
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor

HARDWARE ERROR
CPU 7: Machine Check Exception: 4 Bank 5: ba00000000400405
TSC 5172b45d45bba MISC 80
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor

HARDWARE ERROR
CPU 5: Machine Check Exception: 4 Bank 8: 0000000000000000
TSC 0
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
Kernel panic - not syncing: Uncorrected machine check

After reboot, running the first error through mcelog --ascii gives

CPU 3: Machine Check Exception: 4 Bank 5: ba00000000400405
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
mcelog: Unknown Intel CPU type family 6 model 2c

CPU 3 BANK 5
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
Processor context corrupt
MCA: Internal unclassified error: 405
STATUS ba00000000400405 MCGSTATUS 4

The second error gives

CPU 7: Machine Check Exception: 4 Bank 5: ba00000000400405
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
mcelog: Unknown Intel CPU type family 6 model 2c

CPU 7 BANK 5
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
Processor context corrupt
MCA: Internal unclassified error: 405
STATUS ba00000000400405 MCGSTATUS 4

And the third gives

CPU 3: Machine Check Exception: 4 Bank 5: ba00000000400405
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
mcelog: Unknown Intel CPU type family 6 model 2c

CPU 3 BANK 5
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
Processor context corrupt
MCA: Internal unclassified error: 405
STATUS ba00000000400405 MCGSTATUS 4
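
The STATUS value in the mcelog output can also be cross-checked by hand, which is handy when mcelog does not recognize the CPU model, as it did not here. The following is a small sketch that splits the 64-bit bank status into its architectural flag bits and error codes. The bit positions are taken from the machine-check architecture chapter of the Intel Software Developer's Manual, not from this thread, and the function name `decode_mci_status` is made up for illustration.

```python
# Architectural flag bits of the IA32_MCi_STATUS register (bit number, name, meaning).
FLAGS = [
    (63, "VAL", "register contents valid"),
    (62, "OVER", "error overflow"),
    (61, "UC", "uncorrected error"),
    (60, "EN", "error reporting enabled"),
    (59, "MISCV", "MCi_MISC register valid"),
    (58, "ADDRV", "MCi_ADDR register valid"),
    (57, "PCC", "processor context corrupt"),
]


def decode_mci_status(status):
    """Split a 64-bit machine-check bank status into flag names and error codes."""
    return {
        "flags": [name for bit, name, _meaning in FLAGS if (status >> bit) & 1],
        "mca_error_code": status & 0xFFFF,               # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


decoded = decode_mci_status(0xBA00000000400405)
print(decoded["flags"])                # ['VAL', 'UC', 'EN', 'MISCV', 'PCC']
print(hex(decoded["mca_error_code"]))  # 0x405
```

The set flags line up with the mcelog text above: uncorrected error, error enabled, MCi_MISC register valid, and processor context corrupt, with MCA error code 405 (hex). ADDRV is clear, so the bank recorded no address for the error.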

I have been able to move all workloads onto other servers. As at least two people suggested, I booted from the HP SmartStart CD and ran 100 loops of systems diagnostics and tests, especially for the memory and CPU. No problems were found. I think I will run memtest86 over the weekend.

We have placed a hardware support call in to HP.

Best Regards,

Dave Windsor

Robert Bosch LLC Team Leader, MES Database Infrastructure Group (AdP/TEF7.1) 4421 Highway 81 North Anderson, SC 29621 USA www.bosch.us

Windsor Dave L. (AdP/TEF7)

4 Apr

3:37 p.m.

On 4/1/2011 1:44 PM, Windsor Dave L. (AdP/TEF7) wrote:

...

On 3/24/2011 11:03 AM, Windsor Dave L. (AdP/TEF7.1) wrote:

...
Hello Everyone,

I recently installed CentOS 5.5 x86_64 on a brand new ProLiant DL380 G7. I have identical OS software running rock-solid on two other DL380 ProLiant servers, but they are G6 models, not G7. On the G7, the installation went perfectly and the machine ran great for about 2 weeks, when it just seemed to "stop". The system stopped responding on the network, and there was no video on the console (or remote console via iLO). It would not reboot or cold boot through iLO; I actually had to hold the power to turn it off and then hit it again to power up.

...

OK everyone, here is an update:

The server crashed again overnight. This time, the following error messages were on the console:

HARDWARE ERROR
CPU 3: Machine Check Exception: 4 Bank 5: ba00000000400405
TSC 5172b45d44f0a MISC 80
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor

...

I have been able to move all workloads onto other servers. As at least two people suggested, I booted from the HP SmartStart CD and ran 100 loops of systems diagnostics and tests, especially for the memory and CPU. No problems were found. I think I will run memtest86 over the weekend.

Best Regards,

Dave Windsor

This is interesting... I tried to load memtest86 from the CentOS 5.5 Install DVD, and the system immediately rebooted. I eventually loaded memtest86 from an OpenSUSE 11.4 install DVD I had lying around, and that ran OK.

I ran memtest86+ starting Friday about 6 pm and stopping Monday morning at 10:45 am. Almost 70 full passes were completed, and no errors were found.

Best Regards,

Dave Windsor

Robert Bosch LLC Team Leader, MES Database Infrastructure Group (AdP/TEF7.1) 4421 Highway 81 North Anderson, SC 29621 USA

Dave Windsor

7 Apr

9:13 p.m.

Two days since the system board was replaced, and all is well.

Of course, I probably just jinxed it.... :-)

Best Regards,

*Dave Windsor*

Robert Bosch LLC Team Leader, MES Database Infrastructure Group (AdP/TEF7.1) 4421 Highway 81 North Anderson, SC 29621 USA

discuss@lists.centos.org
