<font size=2 face="sans-serif">Hello all.  I posted this in the forum
and was told to post it to the mailing list instead.  My apologies
for the redundancy if you have already seen and been irritated by my blatherings.</font>

<br>

<br><font size=2 face="sans-serif">Thanks.</font>

<br><font size=2 face="sans-serif">_________________________</font>

<br>

<br><font size=3>I am working on a CentOS clustered LAMP stack and running

into problems. I have searched extensively and have come up empty.<br>

<br>

Here's my setup:<br>

<br>

Two-node cluster on identical hardware: IBM x226 servers with RSA II adapters for fencing.<br>

Configured for Active/Passive failover - no load balancing.<br>

No shared storage - manual rsync of data (shared SSH keys, rsync over SSH,

cron job).<br>

Single shared IP address<br>

<br>

I used luci and ricci to configure the cluster. It's a bit confusing that

there's an 'apache' script but you have to use the custom init script.

I'm past that though.<br>

<br>

The failover function is working when it's kicked off manually from the

luci web interface. I can tell it to transfer the services (IP, httpd,

mysqld) to the secondary server and it works fine.<br>

<br>

I run into problems when I attempt to simulate a failure (a pulled network
cable, for instance). The primary system recognizes the failure and shuts down
its services, but the secondary server never takes them over.
Here is a log excerpt from the primary during a cable pull test:<br>

</font>

<br><tt><font size=3>Jun 16 15:33:27 flex kernel: tg3: eth0: Link is down.<br>

Jun 16 15:33:34 flex clurgmgrd: [2970]: &lt;warning&gt; Link for eth0: Not detected<br>
Jun 16 15:33:34 flex clurgmgrd: [2970]: &lt;warning&gt; No link on eth0...<br>
Jun 16 15:33:34 flex clurgmgrd[2970]: &lt;notice&gt; status on ip "10.6.2.25" returned 1 (generic error)<br>
Jun 16 15:33:34 flex clurgmgrd[2970]: &lt;notice&gt; Stopping service service:web<br>
Jun 16 15:33:35 flex proftpd[6321]: 10.6.2.47 - ProFTPD killed (signal 15)<br>
Jun 16 15:33:35 flex proftpd[6321]: 10.6.2.47 - ProFTPD 1.3.3c standalone mode SHUTDOWN<br>
Jun 16 15:33:39 flex avahi-daemon[2850]: Withdrawing address record for 10.6.2.25 on eth0.<br>
Jun 16 15:33:49 flex clurgmgrd[2970]: &lt;notice&gt; Service service:web is recovering<br>
Jun 16 15:33:49 flex clurgmgrd[2970]: &lt;notice&gt; Recovering failed service service:web<br>
Jun 16 15:33:49 flex clurgmgrd: [2970]: &lt;warning&gt; Link for eth0: Not detected<br>
Jun 16 15:33:49 flex clurgmgrd[2970]: &lt;notice&gt; start on ip "10.6.2.25" returned 1 (generic error)<br>
Jun 16 15:33:49 flex clurgmgrd[2970]: &lt;warning&gt; #68: Failed to start service:web; return value: 1<br>
Jun 16 15:33:49 flex clurgmgrd[2970]: &lt;notice&gt; Stopping service service:web<br>
Jun 16 15:33:49 flex clurgmgrd: [2970]: &lt;err&gt; script:mysqld: stop of /etc/rc.d/init.d/mysqld failed (returned 1)<br>
Jun 16 15:33:49 flex clurgmgrd[2970]: &lt;notice&gt; stop on script "mysqld" returned 1 (generic error)<br>
Jun 16 15:33:49 flex clurgmgrd[2970]: &lt;crit&gt; #12: RG service:web failed to stop; intervention required<br>
Jun 16 15:33:49 flex clurgmgrd[2970]: &lt;notice&gt; Service service:web is failed<br>
Jun 16 15:33:49 flex clurgmgrd[2970]: &lt;crit&gt; #13: Service service:web failed to stop cleanly<br>
Jun 16 15:36:43 flex kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.<br>
Jun 16 15:36:43 flex kernel: tg3: eth0: Flow control is off for TX and off for RX.<br>
Jun 16 16:04:52 flex luci[2904]: Unable to retrieve batch 306226694 status from web2:11111: Unable to disable failed service web before starting it:clusvcadm failed to stop web:<br>
Jun 16 16:05:28 flex clurgmgrd[2970]: &lt;notice&gt; Starting disabled service service:web<br>
Jun 16 16:05:31 flex avahi-daemon[2850]: Registering new address record for 10.6.2.25 on eth0.<br>
Jun 16 16:05:31 flex luci[2904]: Unable to retrieve batch 1997354692 status from web2:11111: module scheduled for execution<br>
Jun 16 16:05:33 flex proftpd[1926]: 10.6.2.47 - ProFTPD 1.3.3c (maint) (built Thu Nov 18 2010 03:38:57 CET) standalone mode STARTUP<br>
Jun 16 16:05:33 flex clurgmgrd[2970]: &lt;notice&gt; Service service:web started<br>

</font></tt>
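<br><font size=3>To make the excerpt above easier to scan, here is a minimal
sketch that keeps only the clurgmgrd lines at or above a chosen severity.
The function name rgmanager_events and the severity ordering (notice, warning,
err, crit) are my own assumptions for illustration, not part of any cluster
tool; the sample lines are copied from the excerpt.</font><br>

```python
import re

# Severity tags as they appear in the excerpt, lowest to highest (assumed ordering).
LEVELS = ["notice", "warning", "err", "crit"]

# Matches both 'clurgmgrd[2970]: <crit> ...' and 'clurgmgrd: [2970]: <warning> ...'.
LINE_RE = re.compile(
    r"^(?P<time>\w{3}\s+\d+\s[\d:]{8})\s+(?P<host>\S+)\s+"
    r"clurgmgrd(?::\s)?\[\d+\]:\s+<(?P<level>\w+)>\s+(?P<msg>.*)$"
)


def rgmanager_events(lines, min_level="warning"):
    """Return (time, level, message) tuples for clurgmgrd lines >= min_level."""
    threshold = LEVELS.index(min_level)
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group("level") not in LEVELS:
            continue  # not a clurgmgrd line, or an unknown severity tag
        if LEVELS.index(m.group("level")) >= threshold:
            events.append((m.group("time"), m.group("level"), m.group("msg")))
    return events


# Sample lines taken from the cable pull test, with wrapped lines rejoined.
sample = [
    "Jun 16 15:33:27 flex kernel: tg3: eth0: Link is down.",
    "Jun 16 15:33:34 flex clurgmgrd: [2970]: <warning> No link on eth0...",
    "Jun 16 15:33:34 flex clurgmgrd[2970]: <notice> Stopping service service:web",
    "Jun 16 15:33:49 flex clurgmgrd: [2970]: <err> script:mysqld: stop of "
    "/etc/rc.d/init.d/mysqld failed (returned 1)",
    "Jun 16 15:33:49 flex clurgmgrd[2970]: <crit> #12: RG service:web failed "
    "to stop; intervention required",
]

if __name__ == "__main__":
    for when, level, msg in rgmanager_events(sample):
        print(when, level.upper(), msg)
```

<font size=3>Filtered this way, the sequence that stands out is the failed
mysqld stop followed by the "failed to stop; intervention required" message.</font><br>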

<br><font size=3><br>

<br>

I have followed the HowTos for setting up the cluster (with the exception

of the shared storage) as closely as possible.<br>

<br>

Here's what I've already ruled out:<br>

<br>

iptables is not running.<br>
SELinux is not running.<br>
The hosts file resolves all IP addresses and host names properly.<br>
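<br>
The name resolution check can be repeated on each node with something like
the sketch below. The node names flex and web2 come from the log; the
addresses are placeholders from the documentation range, not my real ones,
and the helper name check_resolution is made up for illustration.<br>

```python
import socket


def check_resolution(expected, resolve=socket.gethostbyname):
    """Return {name: (expected_ip, resolved_ip_or_None)} for every mismatch."""
    problems = {}
    for name, want in expected.items():
        try:
            got = resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            got = None
        if got != want:
            problems[name] = (want, got)
    return problems


if __name__ == "__main__":
    # Placeholder addresses; substitute the real node IPs before running.
    expected = {"flex": "192.0.2.10", "web2": "192.0.2.11"}
    for name, (want, got) in check_resolution(expected).items():
        print(f"{name}: expected {want}, resolved {got}")
```

An empty result on both nodes means every name resolved to the address expected.<br>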

<br>

I must say that I am not very familiar with how all of these cluster components
work together. All of the Linux clusters I have built thus far have been
heartbeat+mon style clusters.<br>

<br>

I'm looking to find out if there is an additional debug layer that I can
put in place to get more detailed information about what is (or is not)
being exchanged between the two cluster members.<br>

<br>

Many thanks. </font>