[CentOS] ext4 deadlock issue

Tue Mar 26 21:01:40 UTC 2013

I'm having an occasional problem with a box. It's a Supermicro with a
16-core Xeon and 96 GB of RAM, running CentOS 6.3 with kernel
2.6.32-279.el6.x86_64. It has an Areca 1882ix-24 RAID controller with
24 disks: 23 in RAID6 plus a hot spare. The RAID is divided into 3
partitions, two of 25 TB each plus one that takes up the rest.

Lately, I've noticed sporadic hangs on writing to the RAID, which 
"resolve" themselves with the following message in dmesg:

INFO: task jbd2/sdb2-8:3607 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/sdb2-8   D 000000000000000a     0  3607      2 0x00000080
  ffff881055d03d20 0000000000000046 ffff881055d03ca0 ffffffff811ada77
  ffff8810552e2400 ffff88109c6566e8 0000000000002f38 ffff8810546e1540
  ffff8810546e1af8 ffff881055d03fd8 000000000000fb88 ffff8810546e1af8
Call Trace:
  [<ffffffff811ada77>] ? __set_page_dirty+0x87/0xf0
  [<ffffffff810923be>] ? prepare_to_wait+0x4e/0x80
  [<ffffffffa01e881f>] jbd2_journal_commit_transaction+0x19f/0x14b0 [jbd2]
  [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
  [<ffffffff8107e00c>] ? lock_timer_base+0x3c/0x70
  [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
  [<ffffffffa01eef78>] kjournald2+0xb8/0x220 [jbd2]
  [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
  [<ffffffffa01eeec0>] ? kjournald2+0x0/0x220 [jbd2]
  [<ffffffff81091d66>] kthread+0x96/0xa0
  [<ffffffff8100c14a>] child_rip+0xa/0x20
  [<ffffffff81091cd0>] ? kthread+0x0/0xa0
  [<ffffffff8100c140>] ? child_rip+0x0/0x20

I get two of these messages in close succession, then things proceed
normally. It doesn't happen often, and only under load. I suspect it
happens only once per boot: once it has happened, it doesn't seem to
recur until after a reboot. I'm not certain of this, however.
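
One way to check the once-per-boot hunch would be to count the
hung-task reports in the kernel log. The following is only a rough
sketch: the function name count_hung_tasks is made up for illustration,
and the regular expression assumes the message format shown in the
trace above.

```python
import re
from collections import Counter

# Matches lines such as:
#   INFO: task jbd2/sdb2-8:3607 blocked for more than 120 seconds.
# The pattern is an assumption based on the message quoted above.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def count_hung_tasks(log_text):
    """Return a Counter mapping task name to number of hung-task reports."""
    counts = Counter()
    for line in log_text.splitlines():
        # search() rather than match(), so a leading timestamp or
        # syslog prefix on the line does not prevent a hit.
        match = HUNG_TASK_RE.search(line)
        if match:
            counts[match.group("name")] += 1
    return counts
```

Feeding it the output of dmesg covers the current boot only; the
syslog file (usually /var/log/messages on CentOS 6) would be needed to
compare across reboots.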

Things otherwise seem fine: other logs, the RAID controller event logs,
and the controller's physical disk status and health reports are all
normal. After getting this error once, re-writing the exact same file
to the filesystem works normally with no errors, so I doubt it's a
physical problem with the drives.

Has anyone seen anything like this before? It's not very frequent, but
it's very annoying.

-- 
Joakim Ziegler  -  Post-production Supervisor  -  Terminal
joakim at terminalmx.com   -   044 55 2971 8514   -   5264 0864