-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Hi!
I think I stumbled on at least two bugs in the CentOS 7.2 pacemaker package, though I'm not quite sure if or where to report them.
I'm using the following package to set up a 2-node active/passive cluster:
[root@clnode1 ~]# rpm -q pacemaker
pacemaker-1.1.13-10.el7_2.4.x86_64
The installation is up to date on both nodes as of this writing.
I currently have the following cluster resources running:
[root@clnode2 ~]# pcs status
Cluster name: rucluster1
Last updated: Fri Nov 25 11:26:51 2016
Last change: Fri Nov 25 10:51:32 2016 by root via cibadmin on clnode1
Stack: corosync
Current DC: clnode2 (version 1.1.13-10.el7_2.4-44eb2dd) - partition with quorum
2 nodes and 12 resources configured
Online: [ clnode1 clnode2 ]
Full list of resources:
 p_ip_cluster   (ocf::heartbeat:IPaddr2):       Started clnode2
 Master/Slave Set: ms_drbd_r0 [p_drbd_r0]
     Masters: [ clnode2 ]
     Slaves: [ clnode1 ]
 p_fs_drbd1     (ocf::heartbeat:Filesystem):    Started clnode2
 p_apache       (ocf::heartbeat:apache):        Started clnode2
 p_dhcpd        (ocf::heartbeat:dhcpd):         Started clnode2
 p_named        (ocf::heartbeat:named):         Started clnode2
 p_slapd        (ocf::heartbeat:slapd):         Started clnode2
 p_postgres     (ocf::heartbeat:pgsql):         Started clnode2
 p_nmb          (systemd:nmb):                  Started clnode2
 p_smb          (systemd:smb):                  Started clnode2
 p_winbind      (systemd:winbind):              Started clnode2
PCSD Status:
  clnode1: Online
  clnode2: Online
Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
The first bug is rather serious, though a workaround exists!
The cluster works fine, but as soon as I add a cluster resource of class "service", the cluster manager software wreaks havoc on node failover. In that situation, the lrmd process hangs in an infinite loop (neither strace nor ltrace shows any output, so it seems to be an internal loop without any system or library call) and almost any call to the cluster manager software (crmsh or pcs) runs into a timeout. It's quite hard to recover the whole cluster from this situation.
When I replace the resource class "service" with resource class "systemd", everything seems to work just fine.
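For illustration, the workaround could look roughly like the sketch below. The resource name p_example and the unit name foo are hypothetical placeholders, not resources from my cluster, and the monitor interval is just an example value. The small run helper is also only part of this sketch: by default it prints the pcs commands instead of executing them.

```shell
# Sketch only: p_example and foo are hypothetical names.
# With DRY_RUN=1 (the default here) the pcs commands are printed, not executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Remove the resource that was defined with class "service" ...
run pcs resource delete p_example

# ... and recreate it with class "systemd", pointing at the same unit.
run pcs resource create p_example systemd:foo op monitor interval=30s
```

Setting DRY_RUN=0 would run the commands for real, so that should only be done after checking the printed commands against the actual resource names.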
I found a rather old, already closed bug for Fedora which looks similar:
https://bugzilla.redhat.com/show_bug.cgi?id=1117151
Another bug seems to be rather minor: I see the following assertions in the corosync logs:
Nov 25 11:13:56 [3206] clnode1 crmd: error: crm_abort: pcmkRegisterNode: Triggered assert at xml.c:594 : node->type == XML_ELEMENT_NODE
They seem to be related to the drbd resource, but do not seem to cause any functional problem.
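To get an idea of how often such assertions fire, the log can be summarized with something like the following sketch. The log path is an assumption (I believe it is the default location on CentOS 7, but it depends on the logging configuration), and count_asserts is just a helper name made up for this example.

```shell
# Count "Triggered assert" messages per source location in a cluster log.
# Assumption: LOGFILE is where corosync/pacemaker log on this system; adjust as needed.
LOGFILE="${LOGFILE:-/var/log/cluster/corosync.log}"

count_asserts() {
    # $1: log file to scan; prints "<count> Triggered assert at <file>:<line>"
    grep -o 'Triggered assert at [^ ]*' "$1" | sort | uniq -c | sort -rn
}

# Only scan the default log if it exists and is readable.
if [ -r "$LOGFILE" ]; then
    count_asserts "$LOGFILE"
fi
```

Running count_asserts on a copy of the log from each node should show whether the xml.c:594 assertion is the only one and whether it appears on both nodes.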
For this particular problem I found the following patch:
https://github.com/ClusterLabs/pacemaker/commit/68c7506aa84c69e5f425ef5f3025a9efb41d13da
Are these already known bugs? (I searched the CentOS bugzilla site but couldn't find any ticket describing these bugs.)
Any advice on if or where I should report them?
Thanks!
- - andreas
- -- 
Andreas Haumer                     | mailto:andreas@xss.co.at
*x Software + Systeme              | http://www.xss.co.at/
Karmarschgasse 51/2/20             | Tel: +43-1-6060114-0
A-1100 Vienna, Austria             | Fax: +43-1-6060114-71