Hi all,
A recent update to CentOS 5.9 has broken my cluster's ability to fence nodes. I have two Dells, both fenced via their DRAC6 cards. The current configuration in cluster.conf for the fencing devices is:
<fencedevice agent="fence_drac5" cmd_prompt="admin1->" ipaddr="192.168.251.11" login="fencer" name="ms1-drac" passwd="[omitted]" secure="1"/>
After the updates, when one of the two systems comes online and starts the cman service, it starts up fencing. At that point it contacts the other node and reboots it repeatedly, never allowing the system to come back online. If I disable the cman service and re-enable it after both systems come back online, then when cman starts up, it reboots the other node and the process starts over again.
Fencing with DRAC devices is not terribly well documented, so I imagine there was something in the update that changed the way this worked. I originally used system-config-cluster to create the configuration file, but then had to tweak it, adding the fence_drac5 agent manually because the configuration tool didn't support it. I also tried recreating the cluster with conga, via luci and ricci, but no success there either.
Is anyone out there doing clustering with Dells and DRAC6 cards under CentOS 5.9? Or under CentOS 6 for that matter... I'm willing to update if it fixes this.
thanks in advance,
...adam
____________________________________________ Adam Wead Systems and Digital Collections Librarian Rock and Roll Hall of Fame and Museum 216.515.1960 (t) 215.515.1964 (f)
Thanks for the response. I just discovered the problem about 30 minutes ago. post_join_delay was set to the default of 3, meaning that it was only waiting 3 seconds for the node to join before fencing it. Silly. After changing that to 300 seconds, it worked fine.
The config was this way with 5.8 and prior so why it wasn't an issue then, who can say.
I also changed the fencing agent to fence_ipmilan, and configured the user on the DRAC card to be an "administrator" for IPMI.
If it's any help to anyone, I've posted the working cluster.conf file. You can also test your fencing for each drac:
fence_ipmilan -a [drac IP] -l [drac user] -p [password for drac user] -o status
...adam
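For readers following along, here is a minimal sketch of the two cluster.conf pieces described above: the longer post_join_delay and a fence_ipmilan device. It is not the file Adam posted. The device name, IP address and login are carried over from the fence_drac5 line earlier in the thread, post_fail_delay="0" is an assumed value, and the elements go inside your existing <cluster> element.

```xml
<!-- Sketch only: place inside the existing <cluster> element. -->

<!-- Wait 300 seconds for the peer to join before fencing it.
     post_fail_delay="0" is an assumption, not from the thread. -->
<fence_daemon post_join_delay="300" post_fail_delay="0"/>

<fencedevices>
  <!-- Same DRAC as before, now driven over IPMI. The "fencer" user
       needs administrator privilege for IPMI on the DRAC. -->
  <fencedevice agent="fence_ipmilan" name="ms1-drac"
               ipaddr="192.168.251.11" login="fencer" passwd="[omitted]"/>
</fencedevices>
```

Remember to bump config_version on the <cluster> element whenever you change the file, and check each DRAC with the fence_ipmilan status command above before relying on it.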
On Tue, Mar 5, 2013 at 5:57 PM, Joseph L. Casale <jcasale@activenetwerx.com> wrote:
I have two Dell's which are both fenced via their DRAC6 cards.
Without your cluster config, we can only guess. Fencing w/ two nodes requires specific startup config for this scenario. Given that, I presume you can find your issue, or post your conf.
jlc _______________________________________________ CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos
Turns out I spoke too soon.
Increasing the post_join_delay did at least allow me to restart the cman+clvmd+gfs2+rgmanager services on each node, but if I reboot a node, it will not rejoin the cluster.
If I start with both machines up, and cman stopped, I can start cman on one, then the other, and they'll both join the cluster. After that, I start clvmd, gfs2 and rgmanager on one node then the other (all in that order) and the gfs2 partition is mounted on both nodes.
Now, if I reboot one of the nodes, it will leave the cluster, but when it comes back online, it starts up cman and hangs forever on fencing. After a while, "dlm closing connection to node 0" and "dlm closing connection to node 1" appear in the console and the system finishes booting. At that point, it is not in the cluster. I have to stop the cluster services on both nodes (in reverse order: rgmanager, gfs2, clvmd, cman), then restart cman on both nodes and proceed with the rest of the services.
I should add, ricci is running on both nodes, but I'm not using luci and configured the setup with system-config-cluster.
Anyway, I'd appreciate it if anyone could shed light on this. I'm stumped as to why this has changed in 5.9, but it could be just my ignorance of the changes that were made with this latest release.
Many thanks,
...adam
Turns out I spoke too soon.
Slow down, cowboy: man fenced :)
See clean_start, for example. There is more than one param needed for a two-node cluster.
https://fedorahosted.org/cluster/wiki/FAQ/Fencing#fence_stuck
https://fedorahosted.org/cluster/wiki/FAQ/Fencing#fence_startup
Much discussion about this, for example: http://www.mail-archive.com/linux-cluster@redhat.com/msg07289.html ...
Running a cluster is the result of a need for HA. You need to learn and most importantly *lab* these scenarios up to validate your setup and know what to do in all the scenarios you may encounter. The cluster wiki is packed full of good info, give it a read.
jlc
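As a rough illustration of the kind of two-node settings being pointed at here, the fragment below shows the usual cman two-node declaration next to the fence_daemon parameters. This is a sketch, not a recommendation from the thread: two_node and expected_votes are the standard cman attributes for a two-node cluster, clean_start is shown at its default of 0, and the 300-second delay is the value Adam settled on earlier.

```xml
<!-- Sketch only: place inside the existing <cluster> element. -->

<!-- Two-node special case: the cluster stays quorate with a single vote. -->
<cman two_node="1" expected_votes="1"/>

<!-- clean_start="0" (the default) keeps startup fencing enabled.
     Setting it to 1 skips startup fencing; read the fenced man page
     and the FAQ entries above before changing it. -->
<fence_daemon clean_start="0" post_join_delay="300"/>
```

Whatever values you choose, lab them: reboot each node in turn and confirm it rejoins without fencing its peer.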
Brilliant. Many thanks. I love great documentation: "once the guns are out, someone *must* win"