Hi All,
One of my servers recently started acting very weird. At first I couldn't import any images with cobbler, as rsync kept crashing. I was told to ask on the cobbler list (maybe it's not supported here?) but I left it at that (I'm subscribed to far too many lists already).
Yesterday I wanted to copy some stuff from a USB disk to the server (using rsync to update the files which have changed, plus any new files), but rsync seems to hang without any errors. Now, when I try to update CentOS, I get the same "error": yum hangs.
At the same time I can open a new SSH session and do whatever I like. But it seems that any command which takes time to complete hangs.
For example:
================================================================================
 Package                  Arch      Version                   Repository    Size
================================================================================
Removing:
 iscsi-initiator-utils    x86_64    6.2.0.871-0.12.el5_4.1    installed    1.9 M
Removing for dependencies:
 gnome-applet-vm          x86_64    0.1.2-1.el5               installed    121 k
 libvirt                  x86_64    0.6.3-20.1.el5_4          installed    7.1 M
 libvirt-python           x86_64    0.6.3-20.1.el5_4          installed    431 k
 python-virtinst          noarch    0.400.3-5.el5             installed    1.4 M
 virt-manager             x86_64    0.6.1-8.el5               installed    4.9 M
 virt-viewer              x86_64    0.0.2-3.el5               installed     48 k
 xen                      x86_64    3.0.3-94.el5_4.3          installed    4.7 M

Transaction Summary
================================================================================
Remove        8 Package(s)
Reinstall     0 Package(s)
Downgrade     0 Package(s)

Is this ok [y/N]: y
Downloading Packages:
Running rpm_check_debug
Running Transaction Test
[root@intranet Torrents]# mount /dev/sdc1 /mnt/
Even running "top" does the same, yet I can't kill "top" with CTRL+C, or even "killall -9 top" from a new SSH session.
[root@intranet ~]# ps ax | grep top
20825 pts/6    R+     0:00 grep top
[root@intranet ~]# ps ax | grep rsync
 6536 ?        D      3:10 rsync -avz --progress /mnt/usb-backup/backups/current/home/www/linux/centos /home/www/linux/
 6537 ?        Z      0:00 [rsync] <defunct>
 8630 ?        S      0:00 rsync -avz --progress --stats /mnt/usb-backup/backups/current/home/www/linux/ /home/www/linux/
 8631 ?        Z      0:00 [rsync] <defunct>
20827 pts/6    R+     0:00 grep rsync
[root@intranet ~]# ps ax | grep yum
20390 pts/4    S+     0:02 /usr/bin/python /usr/bin/yum remove iscsi-initiator-utils
20829 pts/6    R+     0:00 grep yum
[root@intranet ~]#
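The "D" in the STAT column above marks a process in uninterruptible sleep, usually waiting on I/O. Such a process does not act on signals, including SIGKILL, until the kernel call it is blocked in returns, which would be consistent with kill -9 having no effect. The following is a minimal sketch for picking those processes out of ps output; the function name dstate_procs is made up for illustration, and it assumes the five-column "ps ax" layout shown above.

```shell
# Print only processes whose state starts with "D" (uninterruptible sleep).
# Expects `ps ax`-style input on stdin: PID TTY STAT TIME COMMAND
# dstate_procs is a hypothetical helper name, not a standard tool.
dstate_procs() {
    awk '$3 ~ /^D/ { print }'
}

# On the live system:
#   ps ax | dstate_procs
# To also see which kernel function each task is sleeping in:
#   ps -eo pid,stat,wchan:32,args | awk '$2 ~ /^D/'
```

Linux counts tasks in uninterruptible sleep toward the load average, so a couple of stuck D-state tasks can keep the load figure raised on an otherwise idle machine.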
/var/log/messages doesn't show me any errors.
Any suggestions on this?
On Sun, 2010-04-11 at 12:58 +0200, Rudi Ahlers wrote:
At the same time I can open a new SSH session and do whatever I like. But it seems that running a command which takes time to complete hangs.
Try killing off those rsyncs and try again. You need to provide some other kind of error message. Use strace. Run tail /var/log/messages and paste the output in your reply. Even if you don't see anything in it, that does not mean someone else won't. You may need a reboot.
John
On Sun, Apr 11, 2010 at 2:04 PM, JohnS jses27@gmail.com wrote:
John, I already said I can't kill the process and tail -f /var/log/messages *really* doesn't show me anything.
I am running tail -f /var/log/messages in one SSH window, and at the same time killed & re-ran "yum remove iscsi-initiator-utils -y" in another SSH window. /var/log/messages has *nothing* to report.
[root@intranet ~]# tail -f /var/log/messages
Apr 11 14:18:50 intranet nmbd[4310]: find_domain_master_name_query_fail:
Apr 11 14:18:50 intranet nmbd[4310]: Unable to find the Domain Master Browser name SOFTDUX<1b> for the workgroup SOFTDUX.
Apr 11 14:18:50 intranet nmbd[4310]: Unable to sync browse lists in this workgroup.
Apr 11 14:18:50 intranet nmbd[4310]: [2010/04/11 14:18:50, 0] nmbd/nmbd_browsesync.c:find_domain_master_name_query_fail(351)
Apr 11 14:18:50 intranet nmbd[4310]: find_domain_master_name_query_fail:
Apr 11 14:18:50 intranet nmbd[4310]: Unable to find the Domain Master Browser name SOFTDUX<1b> for the workgroup SOFTDUX.
Apr 11 14:18:50 intranet nmbd[4310]: Unable to sync browse lists in this workgroup.
Apr 11 14:19:14 intranet snmpd[3912]: error scanning interface data (expected 10, got 0)
Apr 11 14:20:44 intranet snmpd[3912]:last message repeated 6 times
Apr 11 14:22:14 intranet snmpd[3912]:last message repeated 6 times
Apr 11 14:23:44 intranet snmpd[3912]:last message repeated 6 times
The ONLY "fix" is a reboot, but I don't want to reboot every few minutes. I have done it a few times already today, and since this is a dev server, everyone else working on it (mainly web development) has to wait, which cuts into production time.
On Sun, Apr 11, 2010 at 2:25 PM, Rudi Ahlers rudiahlers@gmail.com wrote:
I can't install strace either:
[root@intranet ~]# yum install strace -y
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
 * local-addons: 192.168.1.250
 * local-base: 192.168.1.250
 * local-extras: 192.168.1.250
 * local-updates: 192.168.1.250
 * rpmforge: apt.sw.be
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package strace.x86_64 0:4.5.18-5.el5_4.4 set to be updated
--> Finished Dependency Resolution

Dependencies Resolved

================================================================================
 Package       Arch       Version                 Repository            Size
================================================================================
Installing:
 strace        x86_64     4.5.18-5.el5_4.4        local-updates        177 k

Transaction Summary
================================================================================
Install       1 Package(s)
Upgrade       0 Package(s)

Total size: 177 k
Downloading Packages:
Running rpm_check_debug
Running Transaction Test
And that's where it sits and does nothing. The system's load isn't very high:
[root@intranet ~]# uptime 14:32:25 up 1 day, 2:05, 6 users, load average: 2.02, 2.02, 2.01
and again /var/log/messages reports nothing related to this problem:
[root@intranet ~]# tail -f /var/log/messages
Apr 11 14:19:14 intranet snmpd[3912]: error scanning interface data (expected 10, got 0)
Apr 11 14:20:44 intranet snmpd[3912]:last message repeated 6 times
Apr 11 14:22:14 intranet snmpd[3912]:last message repeated 6 times
Apr 11 14:23:44 intranet snmpd[3912]:last message repeated 6 times
Apr 11 14:25:14 intranet snmpd[3912]:last message repeated 6 times
Apr 11 14:26:44 intranet snmpd[3912]:last message repeated 6 times
Apr 11 14:28:14 intranet snmpd[3912]:last message repeated 6 times
Apr 11 14:29:44 intranet snmpd[3912]:last message repeated 6 times
Apr 11 14:31:14 intranet snmpd[3912]:last message repeated 6 times
Apr 11 14:32:44 intranet snmpd[3912]:last message repeated 6 times
Apr 11 14:33:23 intranet snmpd[3912]:last message repeated 3 times
Apr 11 14:33:23 intranet snmpd[3912]: Received TERM or STOP signal... shutting down...
I stopped snmpd since it's not being used. After that, no other errors came up that would tell me what causes this.
Check dmesg. The kernel may be reporting disk or filesystem IO problems that are not going to syslog.
-geoff
--------------------------------- Geoff Galitz Blankenheim NRW, Germany http://www.galitz.org/ http://german-way.com/blog/
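As a rough illustration of the dmesg check suggested above: the dmesg command prints the kernel ring buffer directly, and its output can be filtered for messages that commonly accompany disk or filesystem trouble. The helper name io_trouble and the pattern list are assumptions chosen for this sketch; the list is not exhaustive and may need adjusting per system.

```shell
# Keep only kernel log lines that look like I/O or filesystem trouble.
# io_trouble is a hypothetical helper; the patterns are illustrative guesses
# at common wording, not a complete list.
io_trouble() {
    grep -Ei 'i/o error|ext3-fs error|journal.*abort|read-only|blocked for more than|timed out'
}

# On the live system:
#   dmesg | io_trouble
```

If nothing matches, grep prints nothing and returns a non-zero status, which only means that none of these particular patterns appeared.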
-----Original Message-----
From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On Behalf Of Rudi Ahlers
Sent: Sunday, 11 April 2010 14:49
To: CentOS mailing list
Subject: Re: [CentOS] everything seems to hang, but system is idle?
On Sun, Apr 11, 2010 at 3:00 PM, Geoff Galitz geoff@galitz.org wrote:
Thanx Geoff,
Already checked that, without any decent lead either:
[root@intranet ~]# tail -f /var/log/dmesg
md: ... autorun DONE.
device-mapper: multipath: version 1.0.5 loaded
EXT3 FS on dm-0, internal journal
kjournald starting. Commit interval 5 seconds
EXT3 FS on dm-1, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting. Commit interval 5 seconds
EXT3 FS on md0, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
Adding 8388600k swap on /dev/nas/swap. Priority:-1 extents:1 across:8388600k
ip_tables: (C) 2000-2006 Netfilter Core Team
Netfilter messages via NETLINK v0.30.
ip_conntrack version 2.4 (8192 buckets, 65536 max) - 304 bytes per conntrack
device vif0.0 entered promiscuous mode
xenbr0: topology change detected, propagating
xenbr0: port 1(vif0.0) entering forwarding state
r8169: peth0: link up
device peth0 entered promiscuous mode
xenbr0: topology change detected, propagating
xenbr0: port 2(peth0) entering forwarding state
virbr0: no IPv6 routers present
r8169: eth1: link up
eth0: no IPv6 routers present
eth1: no IPv6 routers present
fuse init (API version 7.10)
md: syncing RAID array md0
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
md: using 128k window, over a total of 104320 blocks.
md: syncing RAID array md2
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
md: using 128k window, over a total of 244195904 blocks.
md: md2: sync done.
RAID1 conf printout:
 --- wd:1 rd:2
 disk 1, wo:0, o:1, dev:sda1
md: md0: sync done.
RAID1 conf printout:
 --- wd:2 rd:2
 disk 0, wo:0, o:1, dev:hda1
 disk 1, wo:0, o:1, dev:hdb1
usb 1-1: new high speed USB device using ehci_hcd and address 3
usb 1-1: configuration #1 chosen from 1 choice
scsi3 : SCSI emulation for USB Mass Storage devices
usb-storage: device found at 3
usb-storage: waiting for device to settle before scanning
  Vendor: Kingston Model: DT Mini Slim Rev: 1.00
  Type: Direct-Access ANSI SCSI revision: 02
SCSI device sdc: 31506432 512-byte hdwr sectors (16131 MB)
sdc: Write Protect is off
sdc: Mode Sense: 23 00 00 00
sdc: assuming drive cache: write through
SCSI device sdc: 31506432 512-byte hdwr sectors (16131 MB)
sdc: Write Protect is off
sdc: Mode Sense: 23 00 00 00
sdc: assuming drive cache: write through
 sdc: sdc1
sd 3:0:0:0: Attached scsi removable disk sdc
sd 3:0:0:0: Attached scsi generic sg2 type 0
usb-storage: device scan complete
SCSI device sdc: 31506432 512-byte hdwr sectors (16131 MB)
sdc: Write Protect is off
sdc: Mode Sense: 23 00 00 00
sdc: assuming drive cache: write through
 sdc:
SCSI device sdc: 31506432 512-byte hdwr sectors (16131 MB)
sdc: Write Protect is off
sdc: Mode Sense: 23 00 00 00
sdc: assuming drive cache: write through
 sdc:
/dev/sdc is a faulty USB memory stick, which I put in just before posting this, hence the errors. But the hangs were already happening before I put it in.
Thanx Geoff,
Already checked that, without any decent lead either:
Have you tried iostat, vmstat or sar to see if there is unusual activity? Were there any changes to the kernel lately (such as an update or a new module)?
Or perhaps an NFS/CIFS mount gone wonky causing blocking?
-geoff
--------------------------------- Geoff Galitz Blankenheim NRW, Germany http://www.galitz.org/ http://german-way.com/blog/
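One way to act on the vmstat suggestion above, sketched under assumptions: the "b" column of vmstat reports the number of processes in uninterruptible sleep, so a value that stays above zero while the CPU is mostly idle points toward blocked I/O rather than real load. The helper name blocked_counts is made up for this example, and it assumes the usual vmstat layout where "r" and "b" are the first two columns.

```shell
# Extract the "b" (blocked, uninterruptible) column from vmstat output.
# blocked_counts is a hypothetical helper name. It waits for the header
# line that starts with "r b", then prints the second field of each
# numeric sample line; repeated headers are skipped.
blocked_counts() {
    awk '$1 == "r" && $2 == "b" { hdr = 1; next }
         hdr && $2 ~ /^[0-9]+$/ { print $2 }'
}

# On the live system (five samples, one second apart):
#   vmstat 1 5 | blocked_counts
```

The first vmstat sample reports averages since boot, so the later samples are the ones that reflect the current state.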
On Sun, Apr 11, 2010 at 3:32 PM, Geoff Galitz geoff@galitz.org wrote:
Yes, I think it could be a problematic iscsi config. Now that I think of it, the server hadn't been rebooted in about 2 or 3 months, and I did some iscsi testing a while ago. The reboot after a recent power outage could have enabled a faulty configuration. I just rebooted the server again, managed to remove iscsi this time, and will see if this solves the problem.
On Sun, 2010-04-11 at 15:48 +0200, Rudi Ahlers wrote:
Well, along with that I would do a file system check. Then, if it keeps happening, stop just the nmbd service and leave smbd running for CIFS sharing.
John
On Sun, 2010-04-11 at 14:49 +0200, Rudi Ahlers wrote:
And that's where it sits and does nothing. The system's load isn't very high:
I saw this exact symptom once, on a server that had a dead NFS mount.
HTH,
Calin
On Sun, Apr 11, 2010 at 6:58 AM, Rudi Ahlers rudiahlers@gmail.com wrote: [snip]
At the same time I can open a new SSH session and do whatever I like. But it seems that running a command which takes time to complete hangs.
Is the server mounting any remote filesystems?
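A quick way to answer that question is to look at the filesystem type field in /proc/mounts. The sketch below is only illustrative: remote_mounts is a made-up helper name, and the list of types (nfs, nfs4, cifs, smbfs) is an assumption covering the common network filesystems rather than every possibility.

```shell
# List mounts whose filesystem type is a common network filesystem.
# Input format is that of /proc/mounts: device mountpoint fstype options ...
# remote_mounts is a hypothetical helper; the type list is an assumption.
remote_mounts() {
    awk '$3 ~ /^(nfs|nfs4|cifs|smbfs)$/ { print $2, $3, $1 }'
}

# On the live system:
#   remote_mounts < /proc/mounts
```

Note that an iSCSI-backed disk shows up as a local block device with an ordinary type such as ext3, so it would not appear in this list even though its storage is remote.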
On Sun, Apr 11, 2010 at 3:33 PM, Kwan Lowe kwan.lowe@gmail.com wrote:
Nope, but I did notice that iscsi was giving errors, probably left over from an earlier iscsi test (about 2 months ago). The iscsi settings had been removed, and I just rebooted in order to uninstall iscsi, which was successful this time.