On 11/25/2016 04:30 AM, Andreas Haumer wrote:
> Hi!
>
> I think I stumbled on at least two bugs in the CentOS 7.2 pacemaker package,
> though I'm not quite sure if or where to report them.
>
> I'm using the following package to set up a 2-node active/passive cluster:
>
> [root@clnode1 ~]# rpm -q pacemaker
> pacemaker-1.1.13-10.el7_2.4.x86_64
>
> The installation is up-to-date on both nodes as of the current PIT.
>
> I currently have the following cluster resources running:
>
> [root@clnode2 ~]# pcs status
> Cluster name: rucluster1
> Last updated: Fri Nov 25 11:26:51 2016    Last change: Fri Nov 25 10:51:32 2016 by root via cibadmin on clnode1
> Stack: corosync
> Current DC: clnode2 (version 1.1.13-10.el7_2.4-44eb2dd) - partition with quorum
> 2 nodes and 12 resources configured
>
> Online: [ clnode1 clnode2 ]
>
> Full list of resources:
>
>  p_ip_cluster   (ocf::heartbeat:IPaddr2):       Started clnode2
>  Master/Slave Set: ms_drbd_r0 [p_drbd_r0]
>      Masters: [ clnode2 ]
>      Slaves: [ clnode1 ]
>  p_fs_drbd1     (ocf::heartbeat:Filesystem):    Started clnode2
>  p_apache       (ocf::heartbeat:apache):        Started clnode2
>  p_dhcpd        (ocf::heartbeat:dhcpd):         Started clnode2
>  p_named        (ocf::heartbeat:named):         Started clnode2
>  p_slapd        (ocf::heartbeat:slapd):         Started clnode2
>  p_postgres     (ocf::heartbeat:pgsql):         Started clnode2
>  p_nmb          (systemd:nmb):                  Started clnode2
>  p_smb          (systemd:smb):                  Started clnode2
>  p_winbind      (systemd:winbind):              Started clnode2
>
> PCSD Status:
>   clnode1: Online
>   clnode2: Online
>
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
>
>
> The first bug is rather serious, though a workaround exists!
>
> The cluster works fine, but as soon as I add a cluster resource of
> class "service", the cluster manager software wreaks havoc on node
> failover.
> In that situation, the lrmd process hangs in an infinite
> loop (neither strace nor ltrace shows any output, so it seems to be
> an internal loop without any system or library call) and almost any
> call to the cluster manager software (crmsh or pcs) runs into a timeout.
> It's quite hard to recover the whole cluster from this situation.
>
> When I replace the resource class "service" with resource class
> "systemd", everything seems to work just fine.
>
> I found a rather old, already closed bug for Fedora which looks similar:
>
> <https://bugzilla.redhat.com/show_bug.cgi?id=1117151>
>
>
> Another bug seems to be rather minor: I see the following assertions in the corosync logs:
>
> Nov 25 11:13:56 [3206] clnode1 crmd: error: crm_abort: pcmkRegisterNode: Triggered assert at xml.c:594 : node->type == XML_ELEMENT_NODE
>
> They seem to be related to the drbd resource, but do not seem to cause any functional problem.
>
> For this particular problem I found the following patch:
>
> <https://github.com/ClusterLabs/pacemaker/commit/68c7506aa84c69e5f425ef5f3025a9efb41d13da>
>
>
> Are these already known bugs?
> (I searched the CentOS bugzilla site but couldn't find any ticket
> describing these bugs)
>
>
> Any advice on if or where I should report them?
>

The new pacemaker built from the RHEL 7.3 source code is now in CR
(pacemaker-1.1.15-11.el7). A still newer version,
pacemaker-1.1.15-11.el7_3.2, will be in CR later today.
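
For anyone who wants to try the CR build, here is a minimal sketch of how the update could be pulled in on one node. It assumes the stock CentOS 7 repo id `cr` (shipped disabled in CentOS-CR.repo). The `run` helper, the `DRY_RUN` variable and the `update_pacemaker_from_cr` function name are made up for this illustration and are not part of any CentOS tool. By default the script only prints the commands; set `DRY_RUN=0` and run it as root to execute them.

```shell
#!/bin/sh
# Sketch only: DRY_RUN, run() and update_pacemaker_from_cr() are
# hypothetical helpers for this example, not part of yum or pacemaker.
# Assumption: the CR repository is defined with the repo id "cr".

DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

update_pacemaker_from_cr() {
    # Show the installed version before the update.
    run rpm -q pacemaker
    # Enable CR for this one transaction only and update pacemaker.
    run yum --enablerepo=cr update pacemaker
    # Show the installed version after the update.
    run rpm -q pacemaker
}

update_pacemaker_from_cr
```

On a two-node cluster it is probably safest to do this one node at a time, putting the node into standby first, and to check afterwards whether the lrmd hang with class "service" resources still occurs. Whether 1.1.15 actually fixes either problem is not confirmed in this thread.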