Hi,
We have two nodes running CentOS 5.5 (x86_64) with cluster + GFS, offering Samba and NFS services. Recently one node logged the following messages:
Sep 13 08:19:07 NODE1 gfs_controld[3101]: cpg_mcast_joined error 2 handle 2846d7ad00000000 MSG_PLOCK
Sep 13 08:19:07 NODE1 gfs_controld[3101]: send plock message error -1
Sep 13 08:19:11 NODE1 gfs_controld[3101]: cpg_mcast_joined error 2 handle 2846d7ad00000000 MSG_PLOCK
Sep 13 08:19:11 NODE1 gfs_controld[3101]: send plock message error -1
When this happens, access to the Samba services on the other node begins to freeze and this error appears:
Sep 13 08:08:22 NODE2 kernel: INFO: task smbd:23084 blocked for more than 120 seconds.
Sep 13 08:08:22 NODE2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 13 08:08:22 NODE2 kernel: smbd D ffff810001576420 0 23084 6602 23307 19791 (NOTLB)
Sep 13 08:08:22 NODE2 kernel: ffff81003e187e08 0000000000000086 ffff81003e187e24 0000000000000092
Sep 13 08:08:22 NODE2 kernel: ffff810005dbdc38 000000000000000a ffff81003f4f77a0 ffffffff80309b60
Sep 13 08:08:22 NODE2 kernel: 000062f1773ef4c3 000000000000624f ffff81003f4f7988 000000008008c597
Sep 13 08:08:22 NODE2 kernel: Call Trace:
Sep 13 08:08:22 NODE2 kernel: [<ffffffff8875cb7d>] :dlm:dlm_posix_lock+0x172/0x210
Sep 13 08:08:22 NODE2 kernel: [<ffffffff800a1ba4>] autoremove_wake_function+0x0/0x2e
Sep 13 08:08:22 NODE2 kernel: [<ffffffff8882a5b9>] :gfs:gfs_lock+0x9c/0xa8
Sep 13 08:08:22 NODE2 kernel: [<ffffffff8003a142>] fcntl_setlk+0x11e/0x273
Sep 13 08:08:22 NODE2 kernel: [<ffffffff800b878c>] audit_syscall_entry+0x180/0x1b3
Sep 13 08:08:22 NODE2 kernel: [<ffffffff8002e7da>] sys_fcntl+0x269/0x2dc
Sep 13 08:08:22 NODE2 kernel: [<ffffffff8005e28d>] tracesys+0xd5/0xe0
The configuration of the cluster is the following:
<?xml version="1.0"?>
<cluster alias="lcfib" config_version="60" name="lcfib">
    <quorumd device="/dev/gfs-webn/quorum" interval="1" label="quorum" min_score="1" tko="10" votes="2">
        <heuristic interval="10" program="/bin/ping -t1 -c1 numIP.1" score="1" tko="5"/>
    </quorumd>
    <fence_daemon post_fail_delay="0" post_join_delay="3"/>
    <clusternodes>
        <clusternode name="NODE2.fib.upc.es" nodeid="1" votes="1">
            <fence>
                <method name="1">
                    <device lanplus="1" name="NODE2SP"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="NODE1.fib.upc.es" nodeid="2" votes="1">
            <fence>
                <method name="1">
                    <device lanplus="1" name="NODE1SP"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <cman broadcast="yes" expected_votes="4" two_node="0"/>
    <fencedevices>
        <fencedevice agent="fence_ipmilan" auth="md5" ipaddr="192.168.13.77" login="" name="NODE2SP" passwd="5jSTv3Mb"/>
        <fencedevice agent="fence_ipmilan" auth="md5" ipaddr="192.168.13.78" login="" name="NODE1SP" passwd="5jSTv3Mb"/>
    </fencedevices>
    <rm>
        <failoverdomains>
            <failoverdomain name="NODE1-NODE2" ordered="1" restricted="1">
                <failoverdomainnode name="NODE2.fib.upc.es" priority="2"/>
                <failoverdomainnode name="NODE1.fib.upc.es" priority="1"/>
            </failoverdomain>
            <failoverdomain name="NODE2-NODE1" ordered="1" restricted="1">
                <failoverdomainnode name="NODE2.fib.upc.es" priority="1"/>
                <failoverdomainnode name="NODE1.fib.upc.es" priority="2"/>
            </failoverdomain>
        </failoverdomains>
        <resources>
            <script file="/etc/init.d/fibsmb1" name="fibsmb1"/>
            <script file="/etc/init.d/fibsmb2" name="fibsmb2"/>
            <clusterfs device="/dev/gfs-webn/gfs-webn" force_unmount="0" fsid="14417" fstype="gfs" mountpoint="/web" name="web" options=""/>
            <clusterfs device="/dev/gfs-perfils/gfs-assig" force_unmount="0" fsid="21646" fstype="gfs" mountpoint="/assig" name="assig" options=""/>
            <smb name="FIBSMB1" workgroup="FIBSMB"/>
            <smb name="FIBSMB2" workgroup="FIBSMB"/>
            <ip address="numIP.111/24" monitor_link="1"/>
            <ip address="numIP.110/24" monitor_link="1"/>
            <ip address="numIP.112/24" monitor_link="1"/>
        </resources>
        <service autostart="1" domain="NODE2-NODE1" name="samba" recovery="disable">
            <clusterfs ref="web"/>
            <ip ref="numIP.110/24"/>
            <ip ref="numIP.112/24"/>
            <clusterfs ref="assig"/>
            <script ref="fibsmb2"/>
            <smb ref="FIBSMB2"/>
        </service>
        <service domain="NODE2-NODE1" name="sambalin" recovery="disable">
            <clusterfs ref="web"/>
            <ip ref="numIP.111/24"/>
            <smb ref="FIBSMB1"/>
            <script ref="fibsmb1"/>
        </service>
    </rm>
</cluster>
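
As a sanity check on the vote settings in this configuration, here is a minimal Python sketch that sums the votes declared on the clusternode and quorumd elements and reports them next to the expected_votes attribute of cman. The function name vote_summary and the fallback values used when a votes attribute is missing (1 per node, 0 for the quorum disk) are my own assumptions for illustration. It only does arithmetic on the file and does not query the running cluster.

```python
import xml.etree.ElementTree as ET


def vote_summary(conf_xml):
    """Sum the votes declared in a cluster.conf document passed as a string.

    Hypothetical helper: the fallback values for missing "votes"
    attributes (1 per node, 0 for quorumd) are assumptions.
    """
    root = ET.fromstring(conf_xml)

    # Votes declared on each <clusternode> element.
    node_votes = sum(int(n.get("votes", "1")) for n in root.iter("clusternode"))

    # Votes contributed by the quorum disk, if a <quorumd> element exists.
    quorumd = root.find("quorumd")
    qdisk_votes = int(quorumd.get("votes", "0")) if quorumd is not None else 0

    # expected_votes as written on <cman>, or None if it is not set.
    cman = root.find("cman")
    expected = cman.get("expected_votes") if cman is not None else None

    return {
        "node_votes": node_votes,
        "qdisk_votes": qdisk_votes,
        "total_votes": node_votes + qdisk_votes,
        "expected_votes": int(expected) if expected is not None else None,
    }
```

Run against the configuration above (two nodes with votes="1" each, quorumd with votes="2"), the declared total is 4, which matches expected_votes="4", so the vote counts at least look consistent with each other.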
I think it's a configuration problem, and that at some point the cluster nodes cannot communicate with each other. Any hints about this?

Thanks,
Sandra
PS: kernel and cluster package versions:
kernel-2.6.18-194.el5
cman-2.0.115-34.el5
kmod-gfs-0.1.34-12.el5.centos
gfs-utils-0.1.20-7.el5