Server spontaneously rebooting under RHEL-4

Benjamin J. Weiss

28 Mar 2006

11:15 a.m.

Hey, y'all! :)

I've got an RHEL-4 server (yep, I know it's not CentOS, but hey, we gotta send some money RH's way to keep CentOS up and going!) that's running Oracle 10g. This same hardware worked just fine for over a year running RHEL-AS-2.1 and Oracle 9i. Now we're getting spontaneous reboots when running Oracle processes that eat up a bunch of resources. I don't know where to go from here.

It's got dual hyper-threading processors set to hyperthreading mode, and I understand that the 2.6 kernel used to have HT issues, but I thought that'd been solved. The kernel we're running is: 2.6.9-22.0.2.ELsmp (yeah, not the latest, I haven't had a chance lately to test and update the patches).

I think the kernel settings are correct, what with 4 GB of RAM:

[root@sibrsdbs etc]# cat sysctl.conf
# Kernel sysctl configuration file for Red Hat Linux
#
# For binary values, 0 is disabled, 1 is enabled. See sysctl(8) and
# sysctl.conf(5) for more details.

# Controls IP packet forwarding
net.ipv4.ip_forward = 0

# Controls source route verification
net.ipv4.conf.default.rp_filter = 1

# Do not accept source routing
net.ipv4.conf.default.accept_source_route = 0

# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 0

# Controls whether core dumps will append the PID to the core filename.
# Useful for debugging multi-threaded applications.
kernel.core_uses_pid = 1

# oracle settings
kernel.shmall = 2097152
kernel.shmmax = 2147483648
kernel.shmmni = 4096
kernel.sem = 250 32000 100 128
#fs.file-max = 65536
net.ipv4.ip_local_port_range = 1024 65000
net.core.rmem_default=262144
net.core.wmem_default=262144
net.core.rmem_max=262144
net.core.wmem_max=262144
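
As a sanity check on the Oracle block above, the shared-memory values can be converted to sizes and compared with the stated 4 GB of RAM. This is only an illustrative sketch: it assumes a 4 KiB page size (typical on x86, but not stated in the post), and parse_sysctl and shm_summary are hypothetical helper names, not part of any existing tool. kernel.shmmax is a per-segment limit in bytes, while kernel.shmall is a system-wide limit counted in pages.

```python
# Values copied from the "oracle settings" block of the posted sysctl.conf.
SYSCTL_TEXT = """
# oracle settings
kernel.shmall = 2097152
kernel.shmmax = 2147483648
kernel.shmmni = 4096
"""

PAGE_SIZE = 4096           # assumption: 4 KiB pages (not stated in the post)
RAM_BYTES = 4 * 1024 ** 3  # the poster's stated 4 GB of RAM


def parse_sysctl(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def shm_summary(settings, page_size=PAGE_SIZE, ram_bytes=RAM_BYTES):
    """Convert the shared-memory limits to GiB and compare shmmax with RAM."""
    shmmax = int(settings["kernel.shmmax"])                     # bytes
    shmall_bytes = int(settings["kernel.shmall"]) * page_size   # pages -> bytes
    return {
        "shmmax_gib": shmmax / 1024 ** 3,
        "shmall_gib": shmall_bytes / 1024 ** 3,
        "shmmax_fraction_of_ram": shmmax / ram_bytes,
    }


summary = shm_summary(parse_sysctl(SYSCTL_TEXT))
for name, value in summary.items():
    print(f"{name}: {value:g}")
```

Under those assumptions, shmmax works out to 2 GiB (half of the stated RAM) and shmall to 8 GiB. That suggests the values are at least self-consistent; it says nothing about what is causing the reboots.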

I don't know how to look for the core dump, if there was one. I don't see anything named 'core' in the /root directory.

I'm sucking wind, any suggestions?

Thanks!

Ben
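
On the core dump question: because the posted sysctl.conf sets kernel.core_uses_pid = 1, a process core file would normally be named core.<pid> rather than plain core, and by default it is written to the crashing process's working directory rather than to /root. The sketch below shows one way to search a directory tree for either name. It is an illustration under those assumptions, and find_core_files is a hypothetical helper name. Note also that a core file comes from a crashing user process; a machine that resets itself may leave none at all.

```python
import os
import re
import tempfile

# Matches "core" and "core.<pid>", for example "core.12345".
CORE_NAME = re.compile(r"^core(\.\d+)?$")


def find_core_files(root):
    """Return sorted paths under root whose file name looks like a core file."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if CORE_NAME.match(name):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


# Demonstration on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as demo_root:
    for name in ("core", "core.12345", "corefile.txt", "notes.core"):
        open(os.path.join(demo_root, name), "w").close()
    found = [os.path.basename(path) for path in find_core_files(demo_root)]
    print(found)
```

To search a real system, the same function could be pointed at a directory such as the database owner's home directory, which is a guess here rather than something the thread specifies.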

Leonard Isham

28 Mar

4:08 p.m.

On 3/28/06, Benjamin J. Weiss benjamin@birdvet.org wrote:

Hey, y'all! :)

I've got an RHEL-4 server (yep, I know it's not CentOS, but hey we gotta send some money RH's way to keep CentOS up and going! ) that's running Oracle 10g. This same hardware worked just fine for over a year running RHEL-AS-2.1 and Oracle 9i. Now we're getting spontaneous reboots when running oracle processes that eat up a bunch of resources. I don't know where to go from here.

I didn't see a mention of the hardware type, but some systems have a BIOS setting to reboot if the hardware doesn't detect any "activity" for a period of time. Check for that setting and disable that feature. This may solve the issue. If not, it will at least let you see the crash, if there is one.

_______________________________________________
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos

--
Leonard Isham, CISSP
Ostendo non ostento.

Benjamin J. Weiss

7:46 p.m.

Sorry, it's an HP/Compaq ML-530. It didn't do this until I changed the OS, so I doubt that it's a BIOS issue.

Thanks!

Jay Lee

7:50 p.m.

Benjamin J. Weiss wrote:

Sorry, it's an HP/Compaq ML-530. It didn't do this until I changed the OS, so I doubt that it's a BIOS issue.

I wouldn't be so sure. In addition to other BIOS interaction changes, RHEL4 is going to use ACPI where RHEL2.1 did not. I would recommend you flash up to the latest BIOS. Are you caught up to Update 3 also?

Jay

Craig White

7:51 p.m.

You are aware of the support resources available for RHEL and Oracle, right? There is also nahant-list, which is Red Hat's mailing list for RHEL-4.

I just want to point out the perhaps more logical places to seek assistance, though I am sure that the CentOS team is flattered by your asking for help on this list.

Craig

Benjamin J. Weiss

7:57 p.m.

Sorry, yeah, I guess I could call Red Hat support, since we're paying them. :) It's just that I've always been so impressed with the knowledge and helpfulness of the folks here that I thought I'd give it a shot before turning somewhere else.

I'll see if I can get my boss to allow me to update to the latest OS patches and BIOS to see if this will fix stuff.

Ben

Leonard Isham

8:43 p.m.

On 3/28/06, Benjamin J. Weiss benjamin@birdvet.org wrote:

Sorry, it's an HP/Compaq ML-530. It didn't do this until I changed the OS, so I doubt that it's a BIOS issue.

[snip]

I didn't see a mention of the hardware type, but some systems have a BIOS setting to reboot if the hardware doesn't detect any "activity" for a period of time. Check for that setting and disable that feature. This may solve the issue. If not, it will at least let you see the crash, if there is one.

Unless they changed it since the acquisition, the system has the BIOS setting I mentioned.

You specified that this happens under a heavy load. With the newer OS and Oracle, the hardware may think that the server is locked up. Or you may actually be experiencing a lock-up, and the BIOS setting causes a reboot.

P.S. Please don't top-post.

--
Leonard Isham, CISSP
Ostendo non ostento.

BRUCE STANLEY

8:53 p.m.

Leonard Isham leonard.isham@gmail.com wrote: On 3/28/06, Benjamin J. Weiss wrote:

Sorry, it's an HP/Compaq ML-530. It didn't do this until I changed the OS, so I doubt that it's a BIOS issue.

[snip]

I didn't see a mention of the hardware type, but some systems have a BIOS setting to reboot if the hardware doesn't detect any "activity" for a period of time. Check for that setting and disable that feature. This may solve the issue. If not, it will at least let you see the crash, if there is one.

Unless they changed it since the acquisition, the system has the BIOS setting I mentioned.

There could even be a simpler reason for this problem. We had a server do this very thing under RHEL-3 and it turned out to be hardware related.

The service technicians came in and reset the memory and CPU, replaced the CPU fan, and reset the BIOS.

Our problem may have been due to overheating after the system was up and running for a few hours.

No upgrade to the bios was needed in our situation.

This system is only about 2 years old and is rack mounted. Be sure you have enough air flow to and around the system.

James Olin Oden

10:08 p.m.

On 3/28/06, BRUCE STANLEY bruce.stanley@prodigy.net wrote: <snip>

There could even be a simpler reason for this problem. We had a server do this very thing under RHEL-3 and it turned out to be hardware related.

The service technicians came in and reset the memory and CPU, replaced the CPU fan, and reset the BIOS.

One thing that I have seen occur more often with 2.6 kernels is the catching of MCEs (Machine Check Exceptions). An MCE is the processor's way of saying that it has detected something extremely wrong. This will typically cause a panic, though not a reboot. OTOH, if your hardware also has support for a watchdog, then a reboot would occur shortly after the panic.

I'm not saying that this is what is actually happening, but just that along the lines of what has been said thus far, this would make sense. If indeed this is the case, maybe the panic output is in /var/log/messages.

Cheers...james
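
Following the suggestion above to look for panic output in /var/log/messages, here is a minimal sketch of scanning log lines for machine-check or panic messages. The search patterns are assumptions chosen for illustration, since the exact wording of kernel messages varies between versions, and the sample lines are invented for the demonstration rather than taken from a real log. find_suspicious_lines is a hypothetical helper name.

```python
import re

# Assumption: illustrative patterns only; real kernel wording varies by version.
PATTERNS = [
    re.compile(r"machine check", re.IGNORECASE),
    re.compile(r"kernel panic", re.IGNORECASE),
    re.compile(r"\boops\b", re.IGNORECASE),
]


def find_suspicious_lines(lines):
    """Yield (line_number, text) for each line matching any pattern."""
    for number, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in PATTERNS):
            yield number, line.rstrip("\n")


# Hypothetical sample lines, invented for this demonstration (not real logs).
SAMPLE_LOG = [
    "Mar 28 10:01:02 host kernel: sample: network link is up\n",
    "Mar 28 10:05:10 host kernel: sample: Machine Check Exception reported\n",
    "Mar 28 10:05:10 host kernel: sample: Kernel panic\n",
    "Mar 28 10:09:44 host daemon: sample: session opened\n",
]

hits = list(find_suspicious_lines(SAMPLE_LOG))
for number, text in hits:
    print(f"{number}: {text}")

# To scan a real file instead (path as suggested in the thread):
# with open("/var/log/messages", errors="replace") as log:
#     for number, text in find_suspicious_lines(log):
#         print(f"{number}: {text}")
```

If the machine resets before the message is written to disk, nothing may show up in the file, so an empty result does not rule out an MCE or a panic.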

Benjamin J. Weiss

29 Mar

5:38 p.m.

Well, so far it looks like something is wrong with our memory subsystem. I updated all the BIOSes and ran the Smart Disk diagnostics. I'm getting an ECC error on module 4, whether or not I have RAM in the slot!

We're calling HP support, I'm sure we'll be able to get it fixed.

Thanks, all!

Ben
