On Wed, 2009-06-24 at 12:02 -0700, Sunil Mushran wrote:
Do you have a separate network path for drbd traffic? If you do not, then you are probably overloading the network. In this case, I believe drbd is unable to replicate the ios fast enough and thus is blocking the o2cb disk heartbeat. One workaround is to increase the O2CB_HEARTBEAT_THRESHOLD to more than the default of 60 secs. Refer to the ocfs2 faq or ocfs2 1.4 user's guide for more on this.
I've already set O2CB_HEARTBEAT_THRESHOLD to different values (120, 240, etc.), with no change.
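For reference, the OCFS2 FAQ gives the relation O2CB_HEARTBEAT_THRESHOLD = (timeout in seconds / 2) + 1, so the threshold counts 2-second heartbeat iterations rather than seconds. Below is a minimal Python sketch of that arithmetic, showing what thresholds of 120 and 240 would correspond to. The function names are illustrative only and are not part of any OCFS2 tool.

```python
# Sketch of the O2CB_HEARTBEAT_THRESHOLD arithmetic from the OCFS2 FAQ:
#   threshold = (timeout_in_seconds / 2) + 1
# Function names here are hypothetical helpers, not an OCFS2 API.

HEARTBEAT_INTERVAL_SECS = 2  # o2cb disk heartbeat runs once every 2 seconds


def threshold_to_timeout(threshold: int) -> int:
    """Return the heartbeat timeout in seconds implied by a threshold value."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECS


def timeout_to_threshold(timeout_secs: int) -> int:
    """Return the threshold value needed for a desired timeout in seconds."""
    return timeout_secs // HEARTBEAT_INTERVAL_SECS + 1


if __name__ == "__main__":
    # 31 is the threshold that corresponds to the 60 sec default.
    for threshold in (31, 120, 240):
        secs = threshold_to_timeout(threshold)
        print(f"threshold {threshold:>3} -> {secs} secs")
    # threshold  31 -> 60 secs
    # threshold 120 -> 238 secs
    # threshold 240 -> 478 secs
```

If that relation holds on this setup, both values tried are already far above the default, which would suggest the reboot is not simply a heartbeat timeout that is slightly too short.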
And if you want to capture the logs, set up netconsole.
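netconsole sends kernel messages as plain UDP datagrams to a remote host, so the receiving side only needs something listening on the target port. Below is a minimal Python sketch of such a receiver. It assumes the sender targets UDP port 6666 (the usual netconsole default; adjust if the module was loaded with a different port). The function names and the log file path are hypothetical.

```python
# Minimal UDP receiver for netconsole output (a sketch, not a hardened daemon).
# Assumptions: the crashing node sends to UDP port 6666 on this host;
# open_listener/receive_line/capture are illustrative names.
import os
import socket


def open_listener(host: str = "0.0.0.0", port: int = 6666) -> socket.socket:
    """Bind a UDP socket on which netconsole datagrams will arrive."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_line(sock: socket.socket, bufsize: int = 4096):
    """Block until one datagram arrives; return (sender address, text)."""
    data, (addr, _port) = sock.recvfrom(bufsize)
    return addr, data.decode("utf-8", errors="replace")


def capture(logfile: str, host: str = "0.0.0.0", port: int = 6666) -> None:
    """Append every received message to logfile, syncing after each write
    so the last lines before a crash are not lost in a buffer."""
    sock = open_listener(host, port)
    with open(logfile, "a") as out:
        while True:
            addr, text = receive_line(sock)
            out.write(f"{addr} {text}")
            out.flush()
            os.fsync(out.fileno())
```

Calling capture("netconsole.log") on a third machine would run until interrupted. Since both DRBD nodes reboot, the receiver should live on a host outside the cluster.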
/dev/console is a serial device connected to a terminal server, so far the best I got was a partial timestamp before I saw the output of the reboot again ..
It tries to log .. but doesn't finish writing it :( But mostly there is no activity at all on the serial console :(
Any other ideas ?
greetings
Kris
Kris Buytaert wrote:
We're trying to set up a dual-primary DRBD environment, with a shared disk with either OCFS2 or GFS. The environment is CentOS 5.3 with DRBD82 (but I also tried with DRBD83 from testing).
Setting up a single-primary disk and running bonnie++ on it works. Setting up a dual-primary disk, only mounting it on one node (ext3), and running bonnie++ also works.
When setting up ocfs2 on the /dev/drbd0 disk and mounting it on both nodes, basic functionality seems to be in place, but usually less than 5-10 minutes after I start bonnie++ as a test on one of the nodes, both nodes power cycle with no errors in the logfiles, just a crash.
When I am at the console at the time of the crash, it looks like disk I/O blocks (you can type, but no actions happen), then a reboot follows: no panics, no oops, nothing (sysctl panic values set to timeouts etc.). Setting up a dual-primary disk with ocfs2, only mounting it on one node, and starting bonnie++ causes only that node to crash.
At the DRBD level I get the following error when that node disappears:
drbd0: PingAck did not arrive in time.
drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk(UpToDate -> DUnknown )
drbd0: asender terminated
drbd0: Terminating asender thread
That however is an expected error because of the reboot.
At first I assumed OCFS2 to be the root of this problem, so I moved forward and set up an iSCSI target on a 3rd node, and used that device with the same OCFS2 setup. There no crashes occurred and bonnie++ flawlessly completed its test run.
So my attention went back to the combination of DRBD and OCFS2.
I tried both DRBD 8.2 (drbd82-8.2.6-1.el5.centos, kmod-drbd82-8.2.6-2) and the 83 variant from CentOS Testing.
At first I was trying with the ocfs2 1.4.1-1.el5.i386.rpm version, but upgrading to 1.4.2-1.el5.i386.rpm didn't change the behaviour.
Does anyone have an idea on this? How can we get more debug info from OCFS2, apart from heartbeat tracing, which hasn't told me anything yet, in order to potentially file a valuable bug report?
On Jun 25, 2009, at 5:44 AM, Kris Buytaert mlkb@inuits.be wrote:
/dev/console is a serial device connected to a terminal server, so far the best I got was a partial timestamp before I saw the output of the reboot again ..
It tries to log .. but doesn't finish writing it :( But mostly there is no activity at all on the serial console :(
Any other ideas ?
Set up the crash kernel and get a core dump of the system at the time of the crash.
It's the only way to find the culprit.
-Ross
Kris Buytaert wrote:
<big snip>
Have you already tested with GFS/GFS2? I remember (after having discussed it with DRBD people) that OCFS2 was more or less supported on DRBD 7.x, while they advised using GFS/GFS2 on top of DRBD > 8.x devices. Just my two cents though: I've only tested OCFS2 once and never used it in production ;-)