Hi,
I have a CentOS 5.0 running as a web server.
# uname -a
Linux hostnamehidden.net 2.6.18-53.1.14.el5 #1 SMP Wed Mar 5 11:37:38 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
Every 59 minutes (maybe every hour) it reboots without leaving any logs or traces, and unfortunately each reboot breaks the software RAID. After a reboot, dmesg does not show any strange entries.
I double-checked the cron jobs and looked for strange services; nothing suspicious.
I did a "yum update" recently.
I went to the datacenter and waited in front of the monitor, but I did not see anything strange during the reboot. I guess it is a cold reboot.
I replaced all the system and CPU fans, upgraded the power supply to a more powerful one, and placed the server in front of the air conditioner.
Do you have any ideas?
Thanks.
A little birdy told me that Linux said:
] I have a CentOS 5.0 running as a web server.
]
] # uname -a
] Linux hostnamehidden.net 2.6.18-53.1.14.el5 #1 SMP Wed Mar 5 11:37:38
] EST 2008 x86_64 x86_64 x86_64 GNU/Linux
]
] Every 59 minutes (maybe every hour) it reboots without any logs,
] without any traces and unfortunately with breaking software raid.
] After reboot dmesg does not have any strange entries.
]
] I double-checked crons, any strange services, nothing suspicious.
]
] I did "yum update" recently.
]
] I went to Datacenter and waited before the monitor but during reboot I
] did not see anything strange. I guess reboot is cold reboot.
]
] I changed all system and cpu fans. Upgraded system powersupply with a
] more powerful one. Placed server infront of air-conditioner.
]
] Do you have any idea?
i have only recently started to believe CentOS 5 (not CentOS' fault at all, but really RHEL 5) is stable on a large enough scope of hardware to begin moving from CentOS 4 (which has been rock solid for my job's organization and my home use for years)...
the issue you describe was one of the many symptoms that would manifest on some systems running 5... especially early on... as a suggestion, try disabling (really not installing) the X-server and see if the problem doesn't vanish... although i wouldn't consider that an acceptable "solution" for my own long term use (and thus the hesitance to move from 4 to 5) that WAS often a culprit in "periodic spontaneous crash/reboots"...
i've found the simplest way to test this without massive software removal or reinstallation is to change the initdefault in /etc/inittab to "3"... and then to remove (or rename) /etc/X11/xorg.conf (to prevent the X-server from running during the boot notification sequence and possibly hanging at exit, thus preventing even console logins)
this really only helps you if you have X11 installed/enabled to begin with...
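If you want to try the runlevel change described above, something like the following sketch may help. It only edits a scratch copy, so nothing on the system changes. The function name set_runlevel and the sample file are made up for illustration; the id:N:initdefault: line is the usual SysV inittab entry.

```shell
#!/bin/sh
# Sketch only: works on a scratch copy, never on the live /etc/inittab.
# set_runlevel FILE LEVEL rewrites the initdefault entry in FILE.
# (set_runlevel is a hypothetical helper name, not a system tool.)
set_runlevel() {
    file=$1
    level=$2
    # The entry has the form "id:<runlevel>:initdefault:".
    sed "s/^id:[0-9]:initdefault:/id:${level}:initdefault:/" "$file" > "$file.new" &&
        mv "$file.new" "$file"
}

# Demonstrate on a throwaway file that mimics a graphical-boot inittab.
scratch=$(mktemp)
printf '%s\n' '# sample inittab' 'id:5:initdefault:' > "$scratch"
set_runlevel "$scratch" 3
grep initdefault "$scratch"
rm -f "$scratch"
```

On the real box you would back up /etc/inittab first, run the same substitution against it as root, and rename the X config with something like "mv /etc/X11/xorg.conf /etc/X11/xorg.conf.disabled" (the .disabled suffix is arbitrary).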
B. Karhan simon@pop.psu.edu PRI/SSRI Unix Administrator
On Fri, Apr 11, 2008 at 1:44 AM, Benjamin Karhan simon@pop.psu.edu wrote:
] Every 59 minutes (maybe every hour) it reboots without any logs,
] without any traces and unfortunately with breaking software raid.
] After reboot dmesg does not have any strange entries.
this really only helps you if you have X11 installed/enabled to begin with...
Well, X11 is not installed (just as would be expected on a production server). I also tried removing unneeded things (IPv6 etc.); no luck yet...
And it is really annoying that I have only 59 minutes to work on it....
On Fri, 2008-04-11 at 02:02 +0300, Linux wrote:
] Well, X11 is not installed (just as would be expected from a production server) Also I tried removing unneeded things ipv6 etc... no luck yet...
]
] And it is really annoying that I have only 59 minutes to work on it....
I can't help, but if you post your hardware configuration, grub kernel boot lines, OS status, etc., I bet there is someone that has your config running that may have something useful to say. Maybe need something on the kernel line like "lapic" or whatnot.
<snip sig stuff>
On Fri, Apr 11, 2008 at 2:43 AM, William L. Maltby CentOS4Bill@triad.rr.com wrote:
] I can't help, but if you post your hardware configuration, grub kernel boot lines, OS status, etc., I bet there is someone that has your config running that may have something useful to say. Maybe need something on the kernel line like "lapic" or whatnot.
Note that the reboot has no relation to wall-clock time; it depends only on a timer that starts at boot. I feel like someone is playing a joke on me: planted a time bomb in my boot process, and at the 60th minute it explodes.
hardware:
Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
8 GB RAM
4 x 300 GB SATA Disk (Soft RAID-1)
Intel Board (no idea about model/rev)

grub.conf line:
kernel /boot/vmlinuz-2.6.18-53.1.14.el5 ro root=/dev/md0 pci=nommconf mem=8318M

# uname -a
Linux hostnamehidden.net 2.6.18-53.1.14.el5 #1 SMP Wed Mar 5 11:37:38 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
# ps ax
  PID TTY      STAT   TIME COMMAND
    1 ?        Ss     0:01 init [3]
    2 ?        S      0:00 [migration/0]
    3 ?        SN     0:00 [ksoftirqd/0]
    4 ?        S      0:00 [watchdog/0]
    5 ?        S      0:00 [migration/1]
    6 ?        SN     0:00 [ksoftirqd/1]
    7 ?        S      0:00 [watchdog/1]
    8 ?        S      0:00 [migration/2]
    9 ?        SN     0:00 [ksoftirqd/2]
   10 ?        S      0:00 [watchdog/2]
   11 ?        S      0:00 [migration/3]
   12 ?        SN     0:00 [ksoftirqd/3]
   13 ?        S      0:00 [watchdog/3]
   14 ?        S<     0:00 [events/0]
   15 ?        S<     0:00 [events/1]
   16 ?        S<     0:00 [events/2]
   17 ?        S<     0:00 [events/3]
   18 ?        S<     0:00 [khelper]
   84 ?        S<     0:00 [kthread]
   91 ?        S<     0:00 [kblockd/0]
   92 ?        S<     0:00 [kblockd/1]
   93 ?        S<     0:00 [kblockd/2]
   94 ?        S<     0:00 [kblockd/3]
   95 ?        S<     0:00 [kacpid]
  190 ?        S<     0:00 [cqueue/0]
  191 ?        S<     0:00 [cqueue/1]
  192 ?        S<     0:00 [cqueue/2]
  193 ?        S<     0:00 [cqueue/3]
  196 ?        S<     0:00 [khubd]
  198 ?        S<     0:00 [kseriod]
  284 ?        S      0:00 [pdflush]
  285 ?        S      0:00 [pdflush]
  286 ?        S<     0:00 [kswapd0]
  287 ?        S<     0:00 [aio/0]
  288 ?        S<     0:00 [aio/1]
  289 ?        S<     0:00 [aio/2]
  290 ?        S<     0:00 [aio/3]
  436 ?        S<     0:00 [kpsmoused]
  493 ?        S<     0:00 [ata/0]
  494 ?        S<     0:00 [ata/1]
  495 ?        S<     0:00 [ata/2]
  496 ?        S<     0:00 [ata/3]
  497 ?        S<     0:00 [ata_aux]
  503 ?        S<     0:00 [scsi_eh_0]
  504 ?        S<     0:00 [scsi_eh_1]
  505 ?        S<     0:00 [scsi_eh_2]
  506 ?        S<     0:00 [scsi_eh_3]
  507 ?        S<     0:00 [scsi_eh_4]
  508 ?        S<     0:00 [scsi_eh_5]
  511 ?        S<     0:00 [md2_raid1]
  514 ?        S<     0:00 [md1_raid1]
  517 ?        S<     0:00 [md0_raid1]
  520 ?        S<     0:00 [md3_raid1]
  521 ?        S<     0:00 [kjournald]
  553 ?        S<     0:00 [kauditd]
  587 ?        S<s    0:00 /sbin/udevd -d
 1285 ?        S<     0:00 [scsi_eh_6]
 1286 ?        S<     0:00 [scsi_eh_7]
 1753 ?        S<     0:00 [kmpathd/0]
 1754 ?        S<     0:00 [kmpathd/1]
 1755 ?        S<     0:00 [kmpathd/2]
 1756 ?        S<     0:00 [kmpathd/3]
 1794 ?        S<     0:00 [xfslogd/0]
 1795 ?        S<     0:00 [xfslogd/1]
 1796 ?        S<     0:00 [xfslogd/2]
 1797 ?        S<     0:00 [xfslogd/3]
 1798 ?        S<     0:00 [xfsdatad/0]
 1799 ?        S<     0:00 [xfsdatad/1]
 1800 ?        S<     0:00 [xfsdatad/2]
 1801 ?        S<     0:00 [xfsdatad/3]
 1803 ?        S<     0:00 [xfsbufd]
 1804 ?        S<     0:00 [xfssyncd]
 1806 ?        S<     0:00 [xfsbufd]
 1807 ?        S<     0:00 [xfssyncd]
 1809 ?        S<     0:00 [kjournald]
 2076 ?        S<     0:00 [kondemand/0]
 2077 ?        S<     0:00 [kondemand/1]
 2078 ?        S<     0:00 [kondemand/2]
 2079 ?        S<     0:00 [kondemand/3]
 2528 ?        Ss     0:00 /usr/sbin/restorecond
 2543 ?        S<sl   0:00 auditd
 2545 ?        S<s    0:00 python /sbin/audispd
 2590 ?        Ss     0:00 syslogd -m 0
 2593 ?        Ss     0:00 klogd -x
 2609 ?        Ss     0:00 irqbalance
 2629 ?        Ss     0:00 mcstransd
 2714 ?        S      0:00 /usr/sbin/courierlogger -pid=/var/spool/authdaemon/pid -facility=mail -start /usr/libexec/courier-authlib/authdaemond
 2715 ?        S      0:00 /usr/libexec/courier-authlib/authdaemond
 2730 ?        Ss     0:00 mdadm --monitor --scan -f --pid-file=/var/run/mdadm/mdadm.pid
 2758 ?        Ssl    0:00 dbus-daemon --system
 2781 ?        Ssl    0:00 automount
 2784 ?        S      0:00 /usr/libexec/courier-authlib/authdaemond
 2785 ?        S      0:00 /usr/libexec/courier-authlib/authdaemond
 2786 ?        S      0:00 /usr/libexec/courier-authlib/authdaemond
 2787 ?        S      0:00 /usr/libexec/courier-authlib/authdaemond
 2788 ?        S      0:00 /usr/libexec/courier-authlib/authdaemond
 2811 ?        Ss     0:00 proftpd: (accepting connections)
 2826 ?        Ss     0:00 /usr/sbin/acpid
 2849 ?        Ss     0:00 /usr/sbin/sshd
 2871 ?        S      0:00 /bin/sh /usr/bin/mysqld_safe --datadir=/var/lib/mysql --pid-file=/var/lib/mysql/cpanelz.sbys.net.pid
 2994 ?        Ss     0:00 sshd: root@pts/2
 3011 ?        S      0:00 chkservd
 3160 ?        Ss     0:00 /usr/sbin/exim -bd -q60m
 3164 ?        Ss     0:00 /usr/sbin/exim -tls-on-connect -bd -oX 465
 3171 ?        S      0:00 antirelayd
 3192 ?        Ss     0:00 /usr/bin/spamd -d --allowed-ips=127.0.0.1 --pidfile=/var/run/spamd.pid --max-children=5
 3228 ?        S      0:00 lfd - sleeping
 3244 ?        S      0:14 spamd child
 3245 ?        S      0:10 spamd child
 3431 ?        Ss     0:00 crond
 3456 ?        Ss     0:00 xfs -droppriv -daemon
 3471 ?        SNs    0:00 anacron -s
 3564 ?        S      0:00 eximstats
 3584 ?        Ss     0:00 /usr/local/apache/bin/httpd -k start -DSSL
 3606 ?        Ss     0:00 cPhulkd - processor
 3624 ?        S      0:00 /usr/local/apache/bin/httpd -k start -DSSL
 3632 ?        S      0:00 /usr/local/apache/bin/httpd -k start -DSSL
 3634 ?        S      0:00 /usr/local/apache/bin/httpd -k start -DSSL
 3643 ?        S      0:00 cpdavd - accepting connections on 2077 and 2078
 3651 ?        S      0:00 cpbandwd
 3652 ?        SN     0:00 cpanellogd - sleeping for logs
 3664 ?        S      0:00 entropychat
 3690 ?        Ss     0:00 /usr/bin/python2.4 /usr/local/cpanel/3rdparty/mailman/bin/mailmanctl -s start
 3704 ?        S      0:00 /usr/bin/python2.4 /usr/local/cpanel/3rdparty/mailman/bin/qrunner --runner=ArchRunner:0:1 -s
 3705 ?        S      0:00 /usr/bin/python2.4 /usr/local/cpanel/3rdparty/mailman/bin/qrunner --runner=BounceRunner:0:1 -s
 3706 ?        S      0:00 /usr/bin/python2.4 /usr/local/cpanel/3rdparty/mailman/bin/qrunner --runner=CommandRunner:0:1 -s
 3707 ?        S      0:00 /usr/bin/python2.4 /usr/local/cpanel/3rdparty/mailman/bin/qrunner --runner=IncomingRunner:0:1 -s
 3708 ?        S      0:00 /usr/bin/python2.4 /usr/local/cpanel/3rdparty/mailman/bin/qrunner --runner=NewsRunner:0:1 -s
 3709 ?        S      0:00 /usr/bin/python2.4 /usr/local/cpanel/3rdparty/mailman/bin/qrunner --runner=OutgoingRunner:0:1 -s
 3710 ?        S      0:00 /usr/bin/python2.4 /usr/local/cpanel/3rdparty/mailman/bin/qrunner --runner=VirginRunner:0:1 -s
 3711 ?        S      0:00 /usr/bin/python2.4 /usr/local/cpanel/3rdparty/mailman/bin/qrunner --runner=RetryRunner:0:1 -s
 3712 ?        S      0:00 /usr/bin/python /usr/sbin/yum-updatesd
 3727 ?        Ss     0:00 hald
 3728 ?        S      0:00 hald-runner
 3735 ?        S      0:00 hald-addon-acpi: listening on acpid socket /var/run/acpid.socket
 3736 ?        S      0:00 /usr/libexec/hald-addon-cpufreq
 3737 ?        S      0:00 hald-addon-keyboard: listening on /dev/input/event0
 3749 ?        S      0:00 hald-addon-storage: polling /dev/scd0
 3800 ?        Ss     0:00 /usr/sbin/portsentry -tcp
 3838 ?        S      0:00 /usr/local/apache/bin/httpd -k start -DSSL
 3857 ?        S      0:00 cpsrvd - waiting for connections
 3888 ?        S      0:00 /usr/sbin/smartd -q never
 3892 tty1     Ss+    0:00 /sbin/mingetty tty1
 3893 tty2     Ss+    0:00 /sbin/mingetty tty2
 3894 tty3     Ss+    0:00 /sbin/mingetty tty3
 3895 tty4     Ss+    0:00 /sbin/mingetty tty4
 3896 tty5     Ss+    0:00 /sbin/mingetty tty5
 3897 tty6     Ss+    0:00 /sbin/mingetty tty6
 3952 ?        Ss     0:00 interchange
 3981 ?        S      0:00 /usr/local/apache/bin/httpd -k start -DSSL
 4028 pts/2    Ss+    0:00 -bash
 4218 ?        S      0:09 /usr/local/apache/bin/httpd -k start -DSSL
 4870 ?        Ssl    0:00 /usr/sbin/named -u named -t /var/named/chroot
 4884 ?        S      0:00 /etc/authlib/authProg
 4896 ?        S      0:00 /etc/authlib/authProg
 5033 ?        S      0:00 /etc/authlib/authProg
 5197 ?        S      0:00 /usr/local/apache/bin/httpd -k start -DSSL
 5603 ?        Ss     0:00 sshd: root@pts/0
 5625 pts/0    Ss     0:00 -bash
 5733 ?        S      0:00 /usr/local/apache/bin/httpd -k start -DSSL
 5962 ?        S      0:00 /usr/local/apache/bin/httpd -k start -DSSL
 5966 ?        S      0:00 /usr/local/apache/bin/httpd -k start -DSSL
 5967 ?        S      0:00 /usr/local/apache/bin/httpd -k start -DSSL
 6547 ?        Ss     0:00 /usr/sbin/exim -Mc 1Jk6sS-0001hZ-9e
 6548 ?        S      0:00 /usr/sbin/exim -Mc 1Jk6sS-0001hZ-9e
 6585 ?        S      0:00 /usr/sbin/exim -bd -q60m
 6926 ?        SL     0:00 proftpd: barboros - 88.229.223.214: IDLE
 7094 ?        S      0:00 /usr/local/apache/bin/httpd -k start -DSSL
 7177 ?        Ss     0:00 /usr/sbin/exim -Mc 1Jk6uT-0001rk-MF
 7178 ?        S      0:00 /usr/sbin/exim -Mc 1Jk6uT-0001rk-MF
 7278 ?        S      0:00 /usr/sbin/exim -bd -q60m
 7282 ?        S      0:00 /usr/local/apache/bin/httpd -k start -DSSL
 7291 ?        S      0:00 cPhulkd - processor
 7299 ?        S      0:00 /usr/local/apache/bin/httpd -k start -DSSL
 7310 ?        S      0:00 /usr/sbin/exim -bd -q60m
 7311 ?        S      0:00 /usr/sbin/exim -bd -q60m
 7327 ?        Z      0:00 [exim] <defunct>
 7331 ?        S      0:00 /usr/sbin/exim -bd -q60m
 7333 ?        S      0:00 /usr/sbin/exim -bd -q60m
 7334 ?        S      0:00 /usr/sbin/exim -bd -q60m
 7341 ?        S      0:00 crond
 7343 ?        Ss     0:00 /bin/sh -c /usr/local/cpanel/bin/dcpumon >/dev/null 2>&1
 7345 ?        S      0:00 /usr/local/cpanel/bin/dcpumon
 7358 pts/0    R+     0:00 ps ax
Linux wrote:
# ps ax
  PID TTY      STAT   TIME COMMAND
<snip>
 2994 ?        Ss     0:00 sshd: root@pts/2
<snip>
 4028 pts/2    Ss+    0:00 -bash
<snip>
 5603 ?        Ss     0:00 sshd: root@pts/0
 5625 pts/0    Ss     0:00 -bash
Two root logins via ssh - are these both you? The first looks early in the boot process.
I'm sure I don't need to say you shouldn't really be logging in directly as root. Better to disable root logins by ssh - login as a regular user and su to root.
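For checking how root logins are currently configured, a rough sketch along these lines might do. The helper name root_login_setting and the sample file are made up for illustration; PermitRootLogin is the real sshd_config keyword, and sshd uses the first value it finds for a keyword. The sketch ignores Match blocks, so treat its answer as a hint rather than the final word.

```shell
#!/bin/sh
# root_login_setting FILE: print the first uncommented PermitRootLogin
# value found in an sshd_config-style FILE, or "unset" if there is none.
# (Hypothetical helper; does not interpret Match blocks.)
root_login_setting() {
    awk 'tolower($1) == "permitrootlogin" { print tolower($2); found = 1; exit }
         END { if (!found) print "unset" }' "$1"
}

# Demonstrate on a made-up sample instead of the live config.
sample=$(mktemp)
printf '%s\n' '#PermitRootLogin yes' 'PermitRootLogin no' > "$sample"
root_login_setting "$sample"
rm -f "$sample"
```

On the server itself the file to inspect would normally be /etc/ssh/sshd_config, and sshd has to be reloaded after changing it.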
On Fri, Apr 11, 2008 at 3:18 AM, Linux linuxlist@gmail.com wrote:
For the records...
Thanks for your responses; here is some feedback:
- SSH root login is limited to certain IPs only.
- The reboot is definitely a "cold reboot"; I just worded it badly because I am not a native English speaker :)
- A power surge or temporary power failure is not the cause; the datacenter has expensive power equipment.
- CPU heat is not the cause either. I placed an extra thermometer and moved the server to the front of the cold air inlet.
- Neither lm_sensors nor the sensors command exists on the system.
- I tested with and without cron before. I am pretty sure it is not related to cron.
- I expected external and internal services to leave some marks. I ran "ps ax" from cron and checked the processes running just before the restart; nothing suspicious.
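For anyone wanting to repeat the cron'd "ps ax" check, here is a minimal sketch of how it could be done so that the last snapshot survives a cold reboot. The function name snapshot, the log path, and the script path in the comment are all made up for illustration; this is not the exact job used above.

```shell
#!/bin/sh
# snapshot LOG CMD...: append a timestamped record of CMD's output to LOG,
# then call sync so the record is flushed to disk before a sudden power-off.
# (Hypothetical helper name.)
snapshot() {
    log=$1
    shift
    {
        printf '=== %s ===\n' "$(date '+%Y-%m-%d %H:%M:%S')"
        # Record a note instead of aborting if the command cannot be run.
        "$@" 2>&1 || printf 'command failed: %s\n' "$*"
    } >> "$log"
    sync
}

# Example crontab entry (hypothetical script path), once a minute:
#   * * * * * /usr/local/sbin/ps-snapshot.sh
# where that script would end with:
#   snapshot /var/log/ps-snapshots.log ps ax

# Demonstrate against a throwaway log file.
demo_log=$(mktemp)
snapshot "$demo_log" ps ax
head -n 3 "$demo_log"
rm -f "$demo_log"
```

After the next reboot, the tail of the log shows what was running in the final minute before the box went down.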
In conclusion: I have four 2 GB DDR ECC RAM modules, in two pairs from two different brands. Each pair works fine on its own (4 GB total), but when all four are installed the nightmare begins. I guess (this time a real guess) that I have a problem with my RAM settings. 8 GB may be causing this problem, maybe a serious page fault etc... Without the grub parameter "mem=" the server boots very slowly and also responds very slowly.
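To compare what the kernel actually sees against the "mem=" cap, a sketch like this could be used. The helper names mem_total_mib and mem_cap are invented here; the sample command line is the grub line posted earlier in the thread, while the MemTotal figure is a made-up number for the demonstration.

```shell
#!/bin/sh
# mem_total_mib FILE: MemTotal from a /proc/meminfo-style FILE, in MiB.
mem_total_mib() {
    awk '$1 == "MemTotal:" { printf "%d\n", $2 / 1024 }' "$1"
}

# mem_cap FILE: value of the mem= option in a /proc/cmdline-style FILE
# (prints nothing when the option is absent).
mem_cap() {
    tr ' ' '\n' < "$1" | sed -n 's/^mem=//p'
}

# Demonstrate on sample files; the MemTotal value is invented.
meminfo=$(mktemp)
cmdline=$(mktemp)
printf '%s\n' 'MemTotal:      8174592 kB' 'MemFree:       1024000 kB' > "$meminfo"
printf '%s\n' 'ro root=/dev/md0 pci=nommconf mem=8318M' > "$cmdline"
echo "kernel sees $(mem_total_mib "$meminfo") MiB, mem= cap is $(mem_cap "$cmdline")"
rm -f "$meminfo" "$cmdline"
```

On the live system the two inputs would be /proc/meminfo and /proc/cmdline. Running it once with each pair of modules and once with all four installed would show whether the reported total changes as expected.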
Maybe I need a new thread title "Server too slow or too unstable with 8 GB RAM" :)
Thanks...
Linux wrote on Fri, 11 Apr 2008 00:06:40 +0300:
Every 59 minutes (maybe every hour) it reboots without any logs, without any traces and unfortunately with breaking software raid. After reboot dmesg does not have any strange entries.
I double-checked crons, any strange services, nothing suspicious.
Disable cron and at completely for two hours or so and see what happens.
guess reboot is cold reboot.
Guess? You would see that if you sit at the console. You do not see it shut down, just suddenly the BIOS screen? Then it's cold ...
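The cold-reboot guess can also be checked after the fact: "last -x" lists shutdown records as well as boots, and a clean shutdown writes one before the next boot record. A rough sketch, with a hypothetical helper name (count_boots) and made-up sample records:

```shell
#!/bin/sh
# count_boots: read "last -x"-style output (newest entry first) on stdin and
# count boots preceded by a shutdown record versus boots that were not.
# The oldest boot record is ignored because nothing before it is known.
# (Hypothetical helper name.)
count_boots() {
    awk '
        $1 == "reboot" {
            if (open) unclean++     # the newer boot had no shutdown before it
            open = 1
        }
        $1 == "shutdown" && open { clean++; open = 0 }
        END { printf "clean=%d unclean=%d\n", clean, unclean }
    '
}

# Made-up sample: the newest boot has no shutdown record before it.
count_boots <<'EOF'
reboot   system boot  2.6.18-example   Fri Apr 11 03:00
reboot   system boot  2.6.18-example   Fri Apr 11 02:01
shutdown system down  2.6.18-example   Fri Apr 11 01:00
reboot   system boot  2.6.18-example   Fri Apr 11 00:10
EOF
```

On the server the input would come from "last -x | count_boots". A growing "unclean" count with no matching shutdown records supports the cold-reboot theory.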
What do you need that mem= parameter on the kernel line for? Doesn't the kernel recognize the RAM without it?
Kai
On Fri, Apr 11, 2008 at 12:31:15PM +0200, Kai Schaetzl wrote:
Linux wrote on Fri, 11 Apr 2008 00:06:40 +0300:
Every 59 minutes (maybe every hour) it reboots without any logs,
Disable cron and at completely for two hours or so and see what happens.
[watch physical console and/or serial one]
also: double check any watch dog functions / hw / services in and OUTSIDE the machine.
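For the in-machine half of the watchdog check, a sketch like the following may help spot loaded watchdog drivers. The helper name watchdog_modules is made up, the name patterns are only a guess at common driver names (for example iTCO_wdt, softdog, ipmi_watchdog), and the sample module list is invented. Anything outside the machine (a BMC, a managed power strip) has to be checked by hand.

```shell
#!/bin/sh
# watchdog_modules: read lsmod-style output on stdin and print the names of
# modules that look like watchdog drivers. The pattern is a heuristic only.
# (Hypothetical helper name.)
watchdog_modules() {
    awk 'tolower($1) ~ /wdt|watchdog|softdog/ { print $1 }'
}

# Made-up sample module list for the demonstration.
watchdog_modules <<'EOF'
Module                  Size  Used by
iTCO_wdt               47369  0
ext3                  168913  2
EOF
```

On the server this would be fed from "lsmod | watchdog_modules"; it is also worth looking for a /dev/watchdog device and a running watchdog daemon.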
Also check that no regular and huge change in power usage is occurring on the mains ... - if a classical light bulb flickers very noticeably, the surge may be enough to get through power supplies and condis and trigger a reboot. Suffered a partial-second brown-out yesterday that rebooted several computers - an old LaserJet 4050 didn't even notice :).
On Thu, Apr 10, 2008 at 5:06 PM, Linux linuxlist@gmail.com wrote:
Every 59 minutes (maybe every hour) it reboots without any logs, without any traces and unfortunately with breaking software raid. After reboot dmesg does not have any strange entries.
I have several CentOS 4.1 and 5.1 servers in production with no unintended reboot issues. I generally install with as few packages as I can and add back what I need, and carefully review which daemons/services are running and run as few as possible.
Might try reviewing what services you have running:
chkconfig --list | grep 3:on
and try disabling a few at a time to see if shutting down any particular service changes behavior which would help narrow the issue.
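To get just the service names out of that listing (handy for working through them a few at a time), a small sketch follows. The helper name services_on and the sample lines are made up; the column layout is the usual "name 0:off 1:off ... 6:off" form that chkconfig --list prints.

```shell
#!/bin/sh
# services_on LEVEL: read "chkconfig --list"-style output on stdin and print
# the names of services switched on in runlevel LEVEL.
# (Hypothetical helper name.)
services_on() {
    awk -v want="$1:on" '{ for (i = 2; i <= NF; i++) if ($i == want) print $1 }'
}

# Made-up sample listing for the demonstration.
services_on 3 <<'EOF'
crond           0:off   1:off   2:on    3:on    4:on    5:on    6:off
nfs             0:off   1:off   2:off   3:off   4:off   5:off   6:off
sshd            0:off   1:off   2:on    3:on    4:on    5:on    6:off
EOF
```

On the server this would be "chkconfig --list | services_on 3". A candidate can then be stopped with "service NAME stop" and, if that helps, kept off across boots with "chkconfig NAME off".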
I did have an issue with NFS under CentOS 4.x that under some conditions locked the NFS host and it had to be manually rebooted.
Brett
On Thu, Apr 10, 2008 at 6:06 PM, Linux linuxlist@gmail.com wrote:
Every 59 minutes (maybe every hour) it reboots without any logs, without any traces and unfortunately with breaking software raid. After reboot dmesg does not have any strange entries.
I have^Whad exactly the same problem with cold reboots every hour. Trying to watch more closely, I installed lm_sensors, ran sensors-detect and rebooted. "sensors -s" now tells me "No sensors found!" BUT now I have 19 hours of uptime!!
Maybe someone can tell where to look for a reason to this *unexpected* solution...