[CentOS] problem with gfs_controld

Wed Sep 15 08:39:45 UTC 2010
sandra-llistes <sandra-llistes at fib.upc.edu>

Hi,

We have two nodes running CentOS 5.5 x86_64 with cluster+GFS, offering Samba
and NFS services.
Recently one of the nodes started logging the following messages:

Sep 13 08:19:07 NODE1 gfs_controld[3101]: cpg_mcast_joined error 2 handle 2846d7ad00000000 MSG_PLOCK
Sep 13 08:19:07 NODE1 gfs_controld[3101]: send plock message error -1
Sep 13 08:19:11 NODE1 gfs_controld[3101]: cpg_mcast_joined error 2 handle 2846d7ad00000000 MSG_PLOCK
Sep 13 08:19:11 NODE1 gfs_controld[3101]: send plock message error -1
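
When these errors start, this is the sort of state that can be captured on
both nodes (a minimal sketch; the group_tool subcommands are the ones I
remember from the cman package, and "web" is the mount group name on the
assumption that it matches the lock table name chosen at mkfs time):

# cluster membership and quorum as cman sees it
cman_tool status
cman_tool nodes

# state of the fence/dlm/gfs groups on this node
group_tool ls

# gfs_controld's internal debug buffer, which includes the plock traffic
group_tool dump gfs

# posix lock state for one mount group, e.g. "web"
group_tool dump plocks web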

When this happens, access to the Samba services on the other node starts to
freeze and this error appears:

Sep 13 08:08:22 NODE2 kernel: INFO: task smbd:23084 blocked for more than 120 seconds.
Sep 13 08:08:22 NODE2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 13 08:08:22 NODE2 kernel: smbd          D ffff810001576420     0 23084   6602         23307 19791 (NOTLB)
Sep 13 08:08:22 NODE2 kernel:  ffff81003e187e08 0000000000000086 ffff81003e187e24 0000000000000092
Sep 13 08:08:22 NODE2 kernel:  ffff810005dbdc38 000000000000000a ffff81003f4f77a0 ffffffff80309b60
Sep 13 08:08:22 NODE2 kernel:  000062f1773ef4c3 000000000000624f ffff81003f4f7988 000000008008c597
Sep 13 08:08:22 NODE2 kernel: Call Trace:
Sep 13 08:08:22 NODE2 kernel:  [<ffffffff8875cb7d>] :dlm:dlm_posix_lock+0x172/0x210
Sep 13 08:08:22 NODE2 kernel:  [<ffffffff800a1ba4>] autoremove_wake_function+0x0/0x2e
Sep 13 08:08:22 NODE2 kernel:  [<ffffffff8882a5b9>] :gfs:gfs_lock+0x9c/0xa8
Sep 13 08:08:22 NODE2 kernel:  [<ffffffff8003a142>] fcntl_setlk+0x11e/0x273
Sep 13 08:08:22 NODE2 kernel:  [<ffffffff800b878c>] audit_syscall_entry+0x180/0x1b3
Sep 13 08:08:22 NODE2 kernel:  [<ffffffff8002e7da>] sys_fcntl+0x269/0x2dc
Sep 13 08:08:22 NODE2 kernel:  [<ffffffff8005e28d>] tracesys+0xd5/0xe0
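
For the freezes themselves, a quick way to see which processes are stuck in
uninterruptible sleep (state D, like the smbd above) and where they are
sleeping (a sketch using standard procps options):

# list D-state processes and the kernel function they are waiting in
ps axo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'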

The configuration of the cluster is the following:

<?xml version="1.0"?>
<cluster alias="lcfib" config_version="60" name="lcfib">
        <quorumd device="/dev/gfs-webn/quorum" interval="1" label="quorum" min_score="1" tko="10" votes="2">
                <heuristic interval="10" program="/bin/ping -t1 -c1 numIP.1" score="1" tko="5"/>
        </quorumd>
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="NODE2.fib.upc.es" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device lanplus="1" name="NODE2SP"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="NODE1.fib.upc.es" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device lanplus="1" name="NODE1SP"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman broadcast="yes" expected_votes="4" two_node="0"/>
        <fencedevices>
                <fencedevice agent="fence_ipmilan" auth="md5" ipaddr="192.168.13.77" login="" name="NODE2SP" passwd="5jSTv3Mb"/>
                <fencedevice agent="fence_ipmilan" auth="md5" ipaddr="192.168.13.78" login="" name="NODE1SP" passwd="5jSTv3Mb"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="NODE1-NODE2" ordered="1" restricted="1">
                                <failoverdomainnode name="NODE2.fib.upc.es" priority="2"/>
                                <failoverdomainnode name="NODE1.fib.upc.es" priority="1"/>
                        </failoverdomain>
                        <failoverdomain name="NODE2-NODE1" ordered="1" restricted="1">
                                <failoverdomainnode name="NODE2.fib.upc.es" priority="1"/>
                                <failoverdomainnode name="NODE1.fib.upc.es" priority="2"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <script file="/etc/init.d/fibsmb1" name="fibsmb1"/>
                        <script file="/etc/init.d/fibsmb2" name="fibsmb2"/>
                        <clusterfs device="/dev/gfs-webn/gfs-webn" force_unmount="0" fsid="14417" fstype="gfs" mountpoint="/web" name="web" options=""/>
                        <clusterfs device="/dev/gfs-perfils/gfs-assig" force_unmount="0" fsid="21646" fstype="gfs" mountpoint="/assig" name="assig" options=""/>
                        <smb name="FIBSMB1" workgroup="FIBSMB"/>
                        <smb name="FIBSMB2" workgroup="FIBSMB"/>
                        <ip address="numIP.111/24" monitor_link="1"/>
                        <ip address="numIP.110/24" monitor_link="1"/>
                        <ip address="numIP.112/24" monitor_link="1"/>
                </resources>
                <service autostart="1" domain="NODE2-NODE1" name="samba" recovery="disable">
                        <clusterfs ref="web"/>
                        <ip ref="numIP.110/24"/>
                        <ip ref="numIP.112/24"/>
                        <clusterfs ref="assig"/>
                        <script ref="fibsmb2"/>
                        <smb ref="FIBSMB2"/>
                </service>
                <service domain="NODE2-NODE1" name="sambalin" recovery="disable">
                        <clusterfs ref="web"/>
                        <ip ref="numIP.111/24"/>
                        <smb ref="FIBSMB1"/>
                        <script ref="fibsmb1"/>
                </service>
        </rm>
</cluster>
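
For reference, the vote arithmetic we were aiming for with the quorum disk
(assuming I read the cman vote accounting correctly): two nodes at 1 vote each
plus 2 votes from the qdisk gives expected_votes="4", so the quorum threshold
is 4/2 + 1 = 3, and a single node together with the qdisk should stay quorate
if the other node drops out. This can be checked on a running node with
something like:

# confirm the vote totals and quorum threshold cman is actually using
cman_tool status | grep -iE 'votes|quorum'
#   2 nodes x 1 vote + 2 qdisk votes = 4 expected votes
#   quorum = 4/2 + 1 = 3  ->  one node + the qdisk keeps the cluster quorate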


I think it's a configuration problem, and that at some point the two nodes
stop communicating with each other.
Any hints about this?
Thanks,

Sandra

PS: kernel and cluster versions

kernel-2.6.18-194.el5
cman-2.0.115-34.el5
kmod-gfs-0.1.34-12.el5.centos
gfs-utils-0.1.20-7.el5