[CentOS] Pacemaker bugs?

Fri Nov 25 15:24:50 UTC 2016

On 11/25/2016 04:30 AM, Andreas Haumer wrote:
> Hi!
> 
> I think I stumbled on at least two bugs in the CentOS 7.2 pacemaker package,
> though I'm not quite sure if or where to report them.
> 
> I'm using the following package to set up a 2-node active/passive cluster:
> 
> [root@clnode1 ~]# rpm -q pacemaker
> pacemaker-1.1.13-10.el7_2.4.x86_64
> 
> The installation is up-to-date on both nodes as of the current PIT.
> 
> I currently have the following cluster resources running:
> 
> [root@clnode2 ~]# pcs status
> Cluster name: rucluster1
> Last updated: Fri Nov 25 11:26:51 2016          Last change: Fri Nov 25 10:51:32 2016 by root via cibadmin on clnode1
> Stack: corosync
> Current DC: clnode2 (version 1.1.13-10.el7_2.4-44eb2dd) - partition with quorum
> 2 nodes and 12 resources configured
> 
> Online: [ clnode1 clnode2 ]
> 
> Full list of resources:
> 
>  p_ip_cluster   (ocf::heartbeat:IPaddr2):       Started clnode2
>  Master/Slave Set: ms_drbd_r0 [p_drbd_r0]
>      Masters: [ clnode2 ]
>      Slaves: [ clnode1 ]
>  p_fs_drbd1     (ocf::heartbeat:Filesystem):    Started clnode2
>  p_apache       (ocf::heartbeat:apache):        Started clnode2
>  p_dhcpd        (ocf::heartbeat:dhcpd): Started clnode2
>  p_named        (ocf::heartbeat:named): Started clnode2
>  p_slapd        (ocf::heartbeat:slapd): Started clnode2
>  p_postgres     (ocf::heartbeat:pgsql): Started clnode2
>  p_nmb  (systemd:nmb):  Started clnode2
>  p_smb  (systemd:smb):  Started clnode2
>  p_winbind      (systemd:winbind):      Started clnode2
> 
> PCSD Status:
>   clnode1: Online
>   clnode2: Online
> 
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
> 
> 
> The first bug is rather serious, though a workaround exists!
> 
> The cluster works fine, but as soon as I add a cluster resource of
> class "service", the cluster manager software wreaks havoc on node
> failover. In that situation, the lrmd process hangs in an infinite
> loop (neither strace nor ltrace shows any output, so it seems to be
> an internal loop without any system or library call) and almost any
> call to the cluster manager software (crmsh or pcs) runs into a timeout.
> It's quite hard to recover the whole cluster from this situation.
> 
> When I replace the resource class "service" with resource class
> "systemd", everything seems to work just fine.
> 
> I found a rather old, already closed bug for Fedora which looks similar:
> 
> <https://bugzilla.redhat.com/show_bug.cgi?id=1117151>
> 
> 
> Another bug seems to be rather minor: I see the following assertions in the corosync logs:
> 
> Nov 25 11:13:56 [3206] clnode1       crmd:    error: crm_abort: pcmkRegisterNode: Triggered assert at xml.c:594 : node->type == XML_ELEMENT_NODE
> 
> They seem to be related to the drbd resource, but do not appear to cause any functional problem.
> 
> For this particular problem I found the following patch:
> 
> <https://github.com/ClusterLabs/pacemaker/commit/68c7506aa84c69e5f425ef5f3025a9efb41d13da>
> 
> 
> Are these already known bugs?
> (I searched the CentOS bugzilla site but couldn't find any ticket
> describing these bugs)
> 
> 
> Any advice on if or where I should report them?
> 

The new pacemaker from RHEL 7.3 source code is now in CR
(pacemaker-1.1.15-11.el7).

There will be a still newer version in CR later today:
pacemaker-1.1.15-11.el7_3.2
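
To check whether an installed build predates the one in CR, something
like the following sketch may help. The helper name version_lt is made
up for illustration, the two version strings are copied from this
thread, and GNU "sort -V" only approximates RPM's own version
comparison, so treat the result as a hint rather than a definitive
answer.

```shell
#!/bin/sh
# version_lt A B: succeed if A sorts strictly before B under GNU "sort -V".
# Hypothetical helper; an approximation of RPM's comparison rules.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

installed="1.1.13-10.el7_2.4"   # from "rpm -q pacemaker" earlier in this thread
cr_build="1.1.15-11.el7"        # build currently in CR

if version_lt "$installed" "$cr_build"; then
    echo "older than the CR build; to update: yum --enablerepo=cr update pacemaker"
else
    echo "already at or beyond the CR build"
fi
```

The script only prints the suggested yum command and does not run it,
so it is safe to try on a live cluster node. On a real system the
installed value could instead be taken from
"rpm -q --qf '%{VERSION}-%{RELEASE}\n' pacemaker".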
