On Wed, 2009-06-24 at 12:02 -0700, Sunil Mushran wrote:
> Do you have a separate network path for drbd traffic? If you do
> not, then you are probably overloading the network. In this case,
> I believe drbd is unable to replicate the ios fast enough and thus
> is blocking the o2cb disk heartbeat. One workaround is to increase
> the O2CB_HEARTBEAT_THRESHOLD to more than the default of 60 secs.
> Refer to the ocfs2 faq or ocfs2 1.4 user's guide for more on this.
>

I've already set O2CB_HEARTBEAT_THRESHOLD to different values
(120, 240, etc.), with no change in behaviour.

> And if you want to capture the logs, setup netconsole.
>

/dev/console is a serial device connected to a terminal server. So far
the best I got was a partial timestamp before I saw the output of the
reboot again. It tries to log, but doesn't finish writing it :(
But mostly there is no activity at all on the serial console :(

Any other ideas?

greetings

Kris

> Kris Buytaert wrote:
> > We're trying to set up a dual-primary DRBD environment, with a shared
> > disk with either OCFS2 or GFS. The environment is CentOS 5.3 with
> > DRBD82 (but also tried with DRBD83 from testing).
> >
> > Setting up a single-primary disk and running bonnie++ on it works.
> > Setting up a dual-primary disk, only mounting it on one node (ext3) and
> > running bonnie++ works.
> >
> > When setting up ocfs2 on the /dev/drbd0 disk and mounting it on both
> > nodes, basic functionality seems in place, but usually less than 5-10
> > minutes after I start bonnie++ as a test on one of the nodes, both
> > nodes power cycle with no errors in the logfiles, just a crash.
> >
> > When at the console at the time of the crash, it looks like a disk I/O
> > block happens (you can type, but no actions happen), then a reboot: no
> > panics, no oops, nothing. (sysctl panic values set to timeouts etc.)
> > Setting up a dual-primary disk with ocfs2, only mounting it on one node
> > and starting bonnie++, causes only that node to crash.
> >
> > On the DRBD level I get the following error when that node disappears:
> >
> > drbd0: PingAck did not arrive in time.
> > drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure )
> > pdsk(UpToDate -> DUnknown )
> > drbd0: asender terminated
> > drbd0: Terminating asender thread
> >
> > That, however, is an expected error because of the reboot.
> >
> > At first I assumed OCFS2 to be the root of this problem, so I moved
> > forward and set up an iSCSI target on a 3rd node, and used that device
> > with the same OCFS2 setup. There no crashes occurred and bonnie++
> > flawlessly completed its test run.
> >
> > So my attention went back to the combination of DRBD and OCFS2.
> >
> > I tried both DRBD 8.2 (drbd82-8.2.6-1.el5.centos, kmod-drbd82-8.2.6-2)
> > and the 83 variant from CentOS Testing.
> >
> > At first I was trying with the ocfs2 1.4.1-1.el5.i386.rpm version, but
> > upgrading to 1.4.2-1.el5.i386.rpm didn't change the behaviour.
> >
> >
> > Does anyone have an idea on this?
> > How can we get more debug info from OCFS2, apart from heartbeat
> > tracing, which hasn't taught me anything yet, in order to potentially
> > file a valuable bug report?
> >
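
A note on the threshold values mentioned in this thread: as far as the OCFS2 FAQ describes it, O2CB_HEARTBEAT_THRESHOLD is a count of 2-second heartbeat iterations, not a number of seconds, with threshold = (timeout in seconds / 2) + 1. The sketch below only does that arithmetic, assuming that relation and the 2-second interval hold for the installed version. The function names are made up for illustration and are not part of any OCFS2 tool.

```python
# Assumed relation (from the OCFS2 FAQ): threshold = timeout_secs / 2 + 1.
# HEARTBEAT_INTERVAL_SECS and both function names are illustrative only.
HEARTBEAT_INTERVAL_SECS = 2


def threshold_for_timeout(timeout_secs: int) -> int:
    """Return the O2CB_HEARTBEAT_THRESHOLD value for a desired timeout."""
    return timeout_secs // HEARTBEAT_INTERVAL_SECS + 1


def timeout_for_threshold(threshold: int) -> int:
    """Return the disk heartbeat timeout in seconds for a threshold value."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECS


if __name__ == "__main__":
    # The 60-second default mentioned above, expressed as a threshold.
    print(threshold_for_timeout(60))
    # The thresholds tried in this thread, expressed as timeouts in seconds.
    for value in (120, 240):
        print(value, timeout_for_threshold(value))
```

Under that assumption, the values 120 and 240 tried above already correspond to timeouts of several minutes, so an even larger threshold seems unlikely to change the outcome.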