[CentOS] Unexplained reboots in DRBD82 + OCFS2 setup

Wed Jun 24 14:10:43 UTC 2009
Kris Buytaert <mlkb at inuits.be>


We're trying to set up a dual-primary DRBD environment with a shared
disk running either OCFS2 or GFS. The environment is CentOS 5.3 with
DRBD82 (we also tried DRBD83 from testing).

Setting up a single-primary disk and running bonnie++ on it works.
Setting up a dual-primary disk, mounting it on only one node (ext3), and
running bonnie++ also works.

When setting up OCFS2 on the /dev/drbd0 disk and mounting it on both
nodes, basic functionality seems to be in place, but usually less than
5-10 minutes after I start bonnie++ as a test on one of the nodes, both
nodes power cycle with no errors in the logfiles, just a crash.

When I'm at the console at the time of the crash, it looks like disk I/O
blocks (you can type, but no actions happen), then a reboot follows: no
panics, no oops, nothing. (sysctl panic values are set to timeouts etc.)
Setting up a dual-primary disk with OCFS2, mounting it on only one node,
and starting bonnie++ causes only that node to crash.

At the DRBD level I get the following error when that node disappears:

drbd0: PingAck did not arrive in time.
drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure )
pdsk(UpToDate -> DUnknown )
drbd0: asender terminated
drbd0: Terminating asender thread

That however is an expected error because of the reboot.
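
For anyone who wants to compare these messages across both nodes, the
state-change line in the log above can be split into its fields with a
few lines of Python. This is only a rough sketch: the function name
parse_drbd_state_change and the regular expression are assumptions based
on the single log line shown here, not anything provided by DRBD.

```python
import re

# Matches fields such as "peer( Primary -> Unknown )" or
# "pdsk(UpToDate -> DUnknown )". The pattern is an assumption derived
# from the one log line quoted in this post.
_FIELD = re.compile(r"(\w+)\(\s*(\w+)\s*->\s*(\w+)\s*\)")


def parse_drbd_state_change(line):
    """Return {field: (old_state, new_state)} for one DRBD log line.

    Lines without any "name( old -> new )" field give an empty dict.
    """
    return {name: (old, new) for name, old, new in _FIELD.findall(line)}


# The wrapped log line from above, joined back into a single line.
line = ("drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) "
        "pdsk(UpToDate -> DUnknown )")
changes = parse_drbd_state_change(line)
for name, (old, new) in changes.items():
    print(f"{name}: {old} -> {new}")
```

Running this over the kernel log of each node and lining up the
timestamps should show which side saw the connection drop first.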

At first I assumed OCFS2 to be the root of this problem, so I moved
forward and set up an iSCSI target on a third node, and used that device
with the same OCFS2 setup. There no crashes occurred and bonnie++
flawlessly completed its test run.

So my attention went back to the combination of DRBD and OCFS2.

I tried both DRBD 8.2 (drbd82-8.2.6-1.el5.centos, kmod-drbd82-8.2.6-2)
and the 83 variant from CentOS Testing.

At first I was trying with the ocfs2 1.4.1-1.el5.i386.rpm version, but
upgrading to 1.4.2-1.el5.i386.rpm didn't change the behaviour.


Does anyone have an idea about this?
How can we get more debug info from OCFS2, apart from heartbeat
tracing, which hasn't told me anything yet, in order to potentially
file a useful bug report?


Thanks in advance,

Kris