I have a machine here that resets itself every hour (not intentionally, of course):
# cat /var/log/messages | grep "sith kernel: Linux version 2.6.18-128.1.16.el5"
Jul 14 22:29:41 sith kernel: Linux version 2.6.18-128.1.16.el5 (mockbuild@builder16.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Tue Jun 30 06:10:28 EDT 2009
Jul 14 23:30:09 sith kernel: Linux version 2.6.18-128.1.16.el5 (mockbuild@builder16.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Tue Jun 30 06:10:28 EDT 2009
Jul 15 00:30:36 sith kernel: Linux version 2.6.18-128.1.16.el5 (mockbuild@builder16.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Tue Jun 30 06:10:28 EDT 2009
Jul 15 01:31:04 sith kernel: Linux version 2.6.18-128.1.16.el5 (mockbuild@builder16.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Tue Jun 30 06:10:28 EDT 2009
Jul 15 02:31:31 sith kernel: Linux version 2.6.18-128.1.16.el5 (mockbuild@builder16.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Tue Jun 30 06:10:28 EDT 2009
Jul 15 03:32:01 sith kernel: Linux version 2.6.18-128.1.16.el5 (mockbuild@builder16.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Tue Jun 30 06:10:28 EDT 2009
Jul 15 04:32:30 sith kernel: Linux version 2.6.18-128.1.16.el5 (mockbuild@builder16.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Tue Jun 30 06:10:28 EDT 2009
Jul 15 05:32:58 sith kernel: Linux version 2.6.18-128.1.16.el5 (mockbuild@builder16.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Tue Jun 30 06:10:28 EDT 2009
Jul 15 06:33:26 sith kernel: Linux version 2.6.18-128.1.16.el5 (mockbuild@builder16.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Tue Jun 30 06:10:28 EDT 2009
Jul 15 07:33:56 sith kernel: Linux version 2.6.18-128.1.16.el5 (mockbuild@builder16.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Tue Jun 30 06:10:28 EDT 2009
Jul 15 08:34:21 sith kernel: Linux version 2.6.18-128.1.16.el5 (mockbuild@builder16.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Tue Jun 30 06:10:28 EDT 2009
Jul 15 09:34:52 sith kernel: Linux version 2.6.18-128.1.16.el5 (mockbuild@builder16.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Tue Jun 30 06:10:28 EDT 2009
Jul 15 10:38:48 sith kernel: Linux version 2.6.18-128.1.16.el5 (mockbuild@builder16.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Tue Jun 30 06:10:28 EDT 2009
Jul 15 11:35:47 sith kernel: Linux version 2.6.18-128.1.16.el5 (mockbuild@builder16.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Tue Jun 30 06:10:28 EDT 2009
Jul 15 12:36:17 sith kernel: Linux version 2.6.18-128.1.16.el5 (mockbuild@builder16.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Tue Jun 30 06:10:28 EDT 2009
Jul 15 13:36:46 sith kernel: Linux version 2.6.18-128.1.16.el5 (mockbuild@builder16.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Tue Jun 30 06:10:28 EDT 2009
The machine is supposed to be up 24/7. After each reset the system boots again and works normally, only to reset again one hour later. The only unusual thing I can find in the logs is the winbind daemon complaining about something I don't understand:
# tail -f /var/log/messages
Jul 15 13:47:39 sith winbindd[3353]: [2009/07/15 13:47:39, 0] nsswitch/idmap.c:idmap_alloc_init(820)
Jul 15 13:47:39 sith winbindd[3353]: ERROR: Initialization failed for alloc backend, deferred!
Jul 15 13:47:39 sith smbd[3373]: [2009/07/15 13:47:39, 0] auth/auth_util.c:create_builtin_users(810)
Jul 15 13:47:39 sith smbd[3373]: create_builtin_users: Failed to create Users
Jul 15 13:47:39 sith winbindd[2927]: [2009/07/15 13:47:39, 0] nsswitch/winbindd_passdb.c:sid_to_name(126)
Jul 15 13:47:39 sith winbindd[2927]: Possible deadlock: Trying to lookup SID S-1-22-1-99 with passdb backend
Jul 15 13:47:39 sith winbindd[2927]: [2009/07/15 13:47:39, 0] nsswitch/winbindd_passdb.c:sid_to_name(126)
Jul 15 13:47:39 sith winbindd[2927]: Possible deadlock: Trying to lookup SID S-1-1-0 with passdb backend
Jul 15 13:47:39 sith winbindd[2927]: [2009/07/15 13:47:39, 0] nsswitch/winbindd_passdb.c:sid_to_name(126)
Jul 15 13:47:39 sith winbindd[2927]: Possible deadlock: Trying to lookup SID S-1-5-2 with passdb backend
Nevertheless, the Samba server seems to be running OK, so I am not sure this is related to the reboots, and there is nothing else suspicious in the logs AFAICT.
How do I troubleshoot these restarts? I first suspected a hardware failure (power supply, cooling fans, etc.), but the restarts happen far too regularly for that, so I now suspect software as well. This is CentOS 5.3, fully updated.
Btw, I don't have physical access to the machine until September, so diagnosing hardware is pretty limited atm. Only remote ssh available.
I would really appreciate any advice on this!
Best, :-) Marko
On Wed, Jul 15, 2009 at 8:16 AM, Marko Vojinovic vvmarko@gmail.com wrote:
I have a machine here that resets itself every one hour (without my intention, of course):
# cat /var/log/messages | grep "sith kernel: Linux version 2.6.18-128.1.16.el5" Jul 14 22:29:41 sith kernel: Linux version 2.6.18-128.1.16.el5 (mockbuild@builder16.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Tue Jun 30 06:10:28 EDT 2009 Jul 14 23:30:09 sith kernel: Linux version 2.6.18-128.1.16.el5
Check the root crontab and the cron.hourly directory for a scheduled job?
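A quick way to widen that check beyond /etc/crontab, as a sketch only: the helper below greps cron locations for commands that are able to restart a machine. The function name find_reboot_jobs and the keyword list are illustrative assumptions, not something from this thread, and a match is only a hint that a job deserves a closer look.

```shell
#!/bin/sh
# find_reboot_jobs: print every line in the given files or directories that
# mentions a command able to restart or stop the machine.
# The keyword list is a heuristic chosen for illustration; extend as needed.
find_reboot_jobs() {
    # -r recurse into directories, -H always show the file name,
    # -n show the line number, -E use extended regular expressions.
    # Errors for paths that do not exist are discarded.
    grep -rHnE '(^|[^[:alnum:]_])(reboot|shutdown|halt|poweroff|init[[:space:]]+6)([^[:alnum:]_]|$)' "$@" 2>/dev/null
}
```

Typical use as root would be: find_reboot_jobs /etc/crontab /etc/cron.d /etc/cron.hourly /var/spool/cron. The last path is where per-user crontabs live on CentOS. No output means none of the keywords were found in those places.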
On Wed, Jul 15, 2009 at 1:23 PM, Kwan Lowe kwan.lowe@gmail.com wrote:
Check the root crontab and the cron.hourly directory for a scheduled job??
Nothing suspicious there:
# cat /etc/crontab
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
HOME=/
# run-parts
01 * * * * root run-parts /etc/cron.hourly
02 4 * * * root run-parts /etc/cron.daily
22 4 * * 0 root run-parts /etc/cron.weekly
42 4 1 * * root run-parts /etc/cron.monthly

# cd /etc/cron.hourly/
# ll -a
total 40
drwxr-xr-x 2 root root 4096 Apr 1 14:04 .
drwxr-xr-x 115 root root 12288 Jul 15 14:04 ..
-rwxr-xr-x 1 root root 118 Feb 26 23:01 inn-cron-nntpsend
-rwxr-xr-x 1 root root 118 Feb 26 23:01 inn-cron-rnews

# chkconfig innd --list
innd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
I should also note that the resets are abrupt; the system doesn't seem to go through a shutdown phase. Thanks for the suggestion, though!
Best, :-) Marko
2009/7/15 Marko Vojinovic vvmarko@gmail.com
I should also note that resets are abrupt, the system doesn't seem to go through shutdown phase. Thanks for the suggestion, though!
BIOS? A process running in the background?
Best, :-) Marko
_______________________________________________
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos
From: Marko Vojinovic vvmarko@gmail.com
I should also note that resets are abrupt, the system doesn't seem to go through shutdown phase. Thanks for the suggestion, though!
Anything in the system logs (bios/ipmi)? Some kind of watchdog?
JD
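The watchdog question can be checked over ssh. A minimal sketch, assuming that watchdog drivers can be recognised by their module names: the helper below filters lsmod output for names containing wdt, softdog or watchdog. The function name watchdog_modules and the name patterns are illustrative assumptions; a driver with an unusual name would be missed.

```shell
#!/bin/sh
# watchdog_modules: read `lsmod` output on stdin and print the names of
# loaded modules whose name suggests a watchdog driver.
# NR > 1 skips the "Module Size Used by" header line.
watchdog_modules() {
    awk 'NR > 1 && tolower($1) ~ /(wdt|softdog|watchdog)/ { print $1 }'
}
```

Run it as: lsmod | watchdog_modules. If anything is listed, or if /dev/watchdog exists, it is worth finding out which process is supposed to be feeding the watchdog. On a machine with a BMC and ipmitool installed, "ipmitool sel list" prints the hardware event log, which may show resets that never reach /var/log/messages.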
On Wed, Jul 15, 2009 at 8:48 AM, Marko Vojinovic vvmarko@gmail.com wrote:
I should also note that resets are abrupt, the system doesn't seem to go through shutdown phase. Thanks for the suggestion, though!
Is it happening every hour at the same time? If so, can you reproduce it if you adjust the system time to just before the reset? This might at least tell you whether the problem is clock related or something else.
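Whether the resets follow the wall clock can be read off the boot timestamps already in /var/log/messages. The sketch below prints the gap between consecutive log lines; the function name boot_intervals is illustrative, and it assumes all lines fall within one month, so only the day of month and the time of day are compared.

```shell
#!/bin/sh
# boot_intervals: read syslog-style lines ("Mon DD HH:MM:SS ...") on stdin
# and print, for every line after the first, the time elapsed since the
# previous line as minutes and seconds.
# Assumption: all lines belong to the same month.
boot_intervals() {
    awk '{
        split($3, t, ":")
        now = $2 * 86400 + t[1] * 3600 + t[2] * 60 + t[3]
        if (NR > 1) {
            gap = now - prev
            printf "%s %s %s -> %dm%02ds\n", $1, $2, $3, int(gap / 60), gap % 60
        }
        prev = now
    }'
}
```

Run it as: grep "kernel: Linux version" /var/log/messages | boot_intervals. For the first three boot lines posted at the top of the thread (22:29:41, 23:30:09, 00:30:36) the gaps come out as 60m28s and 60m27s, so the boot time drifts by about half a minute per cycle instead of landing on the same minute each hour. That may point to a timer counting from boot rather than to something scheduled by the clock.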
You may also want to keep a top session running on the system from a remote machine, to see whether anything spikes just before the reset, and send your system logs to a remote syslog server.
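One way to set up the remote logging, as a sketch only: CentOS 5 ships sysklogd, which forwards messages to another host when a selector in /etc/syslog.conf points at @host. The hostname loghost.example.com is a placeholder, not a machine from this thread.

```
# /etc/syslog.conf on the rebooting machine:
# forward every facility and priority to the remote collector (UDP port 514)
*.*    @loghost.example.com

# /etc/sysconfig/syslog on the collector:
# -r lets syslogd accept messages arriving from the network
SYSLOGD_OPTIONS="-m 0 -r"
```

Restart syslog on both machines afterwards (service syslog restart). Messages are sent as they are logged, so anything written in the last moments before an abrupt reset has a better chance of surviving on the collector than in the local file.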
Can you post the output of the last command?
Maybe we can find something, such as which account was logged in when the server rebooted.
On Thu, Jul 16, 2009 at 10:51 AM, Tran Van Hung tvhungsg@yahoo.com.vn wrote:
On Thu, Jul 16, 2009 at 11:06 AM, Michael Calizo mike.calizo@gmail.com wrote:
can you post the output of last command?
Maybe we can find something like the account currently login when server reboots.
Here goes (note that it is sorted in most-recent-first fashion):
# last -R | less
vmarko pts/1 Thu Jul 16 16:09 still logged in
vmarko pts/1 Thu Jul 16 16:05 - 16:07 (00:02)
vmarko pts/1 Thu Jul 16 11:37 - 11:37 (00:00)
vmarko pts/1 Thu Jul 16 02:48 - 02:59 (00:10)
reboot system boot Wed Jul 15 18:16 (21:59)
reboot system boot Wed Jul 15 15:37 (00:03)
vmarko pts/1 Wed Jul 15 15:34 - 15:34 (00:00)
vmarko pts/1 Wed Jul 15 14:42 - 15:16 (00:34)
reboot system boot Wed Jul 15 14:37 (01:04)
vmarko pts/1 Wed Jul 15 13:38 - crash (00:58)
reboot system boot Wed Jul 15 13:36 (02:04)
reboot system boot Wed Jul 15 12:36 (03:05)
reboot system boot Wed Jul 15 11:35 (04:05)
reboot system boot Wed Jul 15 10:38 (05:02)
reboot system boot Wed Jul 15 09:34 (06:06)
reboot system boot Wed Jul 15 08:34 (07:07)
reboot system boot Wed Jul 15 07:33 (08:07)
reboot system boot Wed Jul 15 06:33 (09:08)
reboot system boot Wed Jul 15 05:32 (10:08)
reboot system boot Wed Jul 15 04:32 (11:09)
reboot system boot Wed Jul 15 03:31 (12:09)
reboot system boot Wed Jul 15 02:31 (13:10)
reboot system boot Wed Jul 15 01:30 (14:10)
reboot system boot Wed Jul 15 00:30 (15:10)
reboot system boot Tue Jul 14 23:30 (16:11)
reboot system boot Tue Jul 14 22:29 (17:11)
reboot system boot Tue Jul 14 21:29 (18:12)
reboot system boot Tue Jul 14 20:28 (19:12)
reboot system boot Tue Jul 14 19:28 (20:13)
reboot system boot Tue Jul 14 18:27 (21:13)
reboot system boot Tue Jul 14 17:27 (22:14)
reboot system boot Tue Jul 14 16:26 (23:14)
vmarko pts/1 Tue Jul 14 15:39 - 15:42 (00:03)
reboot system boot Tue Jul 14 15:26 (1+00:15)
vmarko pts/1 Tue Jul 14 15:11 - crash (00:14)
ljubica pts/1 Tue Jul 14 14:26 - 15:11 (00:44)
ljubica :0 Tue Jul 14 14:26 - 14:54 (00:27)
ljubica :0 Tue Jul 14 14:26 - 14:26 (00:00)
reboot system boot Tue Jul 14 14:25 (1+01:15)
ljubica pts/2 Tue Jul 14 13:27 - 13:27 (00:00)
ljubica pts/1 Tue Jul 14 13:27 - crash (00:58)
ljubica :0 Tue Jul 14 13:27 - crash (00:58)
ljubica :0 Tue Jul 14 13:27 - 13:27 (00:00)
reboot system boot Tue Jul 14 13:25 (1+02:16)
ljubica pts/1 Tue Jul 14 12:48 - crash (00:36)
ljubica :0 Tue Jul 14 12:48 - crash (00:36)
ljubica :0 Tue Jul 14 12:48 - 12:48 (00:00)
reboot system boot Tue Jul 14 12:41 (1+02:59)
ljubica pts/1 Tue Jul 14 11:45 - crash (00:55)
Might the occasional "crash" flag mean something?
Best, :-) Marko
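On the "crash" flag: last prints "crash" when a login has no recorded logout before the next boot record, and "down" when the session was ended by an orderly shutdown. So it describes how the machine went down, not something the user's session did. A small sketch to tally the two in a listing; the function name session_endings is illustrative.

```shell
#!/bin/sh
# session_endings: read `last` output on stdin and count
#   boots - "reboot" records
#   crash - sessions with no logout recorded before the next boot
#   down  - sessions ended by an orderly shutdown
session_endings() {
    awk '
        $1 == "reboot" { boots++; next }
        / - crash /    { crash++ }
        / - down /     { down++ }
        END { printf "boots=%d crash=%d down=%d\n", boots, crash, down }
    '
}
```

Run it as: last -R | session_endings. In the listing above, every session that was cut short is marked crash and none is marked down, which is consistent with the resets being abrupt rather than a clean shutdown.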
On Thu, Jul 16, 2009 at 10:27 AM, Marko Vojinovic vvmarko@gmail.com wrote:
From the last log, it looks like user "ljubica" did something that caused his/her session to crash, and then did something that causes the server to reboot every hour. I would ask him/her what was done; it may be operator error on his/her part.
-Ross
Based on the last output, I would start by looking at the processes invoked by ljubica and vmarko; you might find something there. Anyway, is your server running any DB process? You might also look at the server history to see when this problem started, and investigate any updates that you or others made before it began. The sar command can help.
Mike --
On Fri, Jul 17, 2009 at 12:45 AM, Ross Walker rswwalker@gmail.com wrote:
From the last log it looks like user "ljubica" did something that was
causing his session to crash, then he did something to cause the server to reboot every hour. I would ask him/her what was done, it may be operator error on his/her part.
-Ross
I have had a similar problem since updating to 5.3. We noticed that our nightly backups are not completing because the system reboots in the middle of them. There are no error messages in the log, and the messages on the console are long gone before I get access to it.
I noticed this in the last log:
nenad pts/1 n.local Wed Jul 15 10:08 - 15:56 (1+05:47)
reboot system boot 2.6.18-92.1.22.e Wed Jul 15 10:08 (1+22:46)
nenad pts/4 n.local Wed Jul 15 09:33 - down (00:33)
root pts/4 screamer.local Wed Jul 15 09:30 - 09:32 (00:02)
nenad pts/4 screamer.local Wed Jul 15 09:23 - 09:24 (00:00)
gary pts/4 screamer.local Wed Jul 15 00:58 - 01:46 (00:48)
nenad pts/3 10.10.11.101 Wed Jul 15 00:38 - down (09:28)
nenad pts/2 10.10.11.101 Wed Jul 15 00:30 - down (09:36)
reboot system boot 2.6.18-128.1.10. Tue Jul 14 23:47 (10:19)
nenad pts/1 n.local Mon Jul 13 14:08 - crash (1+09:39)
reboot system boot 2.6.18-128.1.10. Mon Jul 13 14:07 (1+19:59)
nenad pts/2 n.local Mon Jul 13 08:48 - crash (05:19)
reboot system boot 2.6.18-128.1.16. Mon Jul 13 01:07 (2+08:59)
reboot system boot 2.6.18-128.1.16. Sun Jul 12 00:40 (3+09:26)
gary pts/2 screamer.local Sat Jul 11 07:31 - crash (17:09)
reboot system boot 2.6.18-128.1.16. Sat Jul 11 04:10 (4+05:56)
reboot system boot 2.6.18-128.1.16. Fri Jul 10 04:10 (5+05:56)
reboot system boot 2.6.18-128.1.16. Thu Jul 9 00:12 (6+09:53)
nenad pts/2 n.local Mon Jul 6 15:28 - crash (2+08:44)
reboot system boot 2.6.18-128.1.16. Mon Jul 6 15:23 (8+18:43)
On July 13 I switched to the older version of the kernel, but the machine still crashed that night. I am now back on the 2.6.18-92.1.22 kernel but haven't had a chance to run backups, as our admin system is being upgraded.
Note that our remote backups are done over the network with ssh logins. All our logins are over ssh, and the crashes that I see for some users are probably related to ssh or networking, as I know that nothing special was being done at the time. My guess is that it is ssh/network related.
This is a dual-processor Xeon machine (i686) running a Xen kernel. I noticed on some other lists that people are complaining about crashes caused by the Intel e1000 network device on 5.3, which my system has.
Nenad
From: Michael Calizo
Sent: Thursday, July 16, 2009 8:02 PM
To: CentOS mailing list
Subject: Re: [CentOS] My server reboots every hour! Help please!
Base on last output, I would start to look on the process that was invoke by ljubica and vmarko, you might find something from there.
Anyways, is your server running any DB process? You might also look at the server history on when this problem start to happened and investigate any updates that you or the others have done prior to the problem. SAR command can help.
Mike --
On Wed, 2009-07-15 at 13:16 +0100, Marko Vojinovic wrote:
The machine is supposed to be up 24/7 and after each reset the system reboots again, and works normally only to reset again one hour later. However, the only unusual thing I am able to recognize in the logs is winbind daemon whining about something I don't understand:
# tail -f /var/log/messages
Jul 15 13:47:39 sith winbindd[3353]: [2009/07/15 13:47:39, 0] nsswitch/idmap.c:idmap_alloc_init(820)
Jul 15 13:47:39 sith winbindd[3353]: ERROR: Initialization failed for alloc backend, deferred!
Jul 15 13:47:39 sith smbd[3373]: [2009/07/15 13:47:39, 0] auth/auth_util.c:create_builtin_users(810)
Jul 15 13:47:39 sith smbd[3373]: create_builtin_users: Failed to create Users
Jul 15 13:47:39 sith winbindd[2927]: [2009/07/15 13:47:39, 0] nsswitch/winbindd_passdb.c:sid_to_name(126)
Jul 15 13:47:39 sith winbindd[2927]: Possible deadlock: Trying to lookup SID S-1-22-1-99 with passdb backend
Jul 15 13:47:39 sith winbindd[2927]: [2009/07/15 13:47:39, 0] nsswitch/winbindd_passdb.c:sid_to_name(126)
Jul 15 13:47:39 sith winbindd[2927]: Possible deadlock: Trying to lookup SID S-1-1-0 with passdb backend
Jul 15 13:47:39 sith winbindd[2927]: [2009/07/15 13:47:39, 0] nsswitch/winbindd_passdb.c:sid_to_name(126)
Jul 15 13:47:39 sith winbindd[2927]: Possible deadlock: Trying to lookup SID S-1-5-2 with passdb backend
Nevertheless, samba server seems to be running ok, so I am not sure this is related to reboots, and there is nothing else suspicious in the logs AFAICT.
--- I am almost positive the winbind and Samba errors are not related to the reboots.
It looks like you had Samba configured for LDAP-type auth and then switched to user auth, and winbind is now complaining about lookups. If that is the case, turn off winbindd.
John