Journal Aborts in VMware ESX (Filesystem Corruption)


Adam Tauno Williams

13 Feb 2011

2:09 p.m.

I have several CentOS5 hosts in a VMware ESX 3.5.0 226117 environment using iSCSI storage. Recently we've begun to experience journal aborts resulting in remounted-read-only filesystems as well as other filesystem issues - I can unmount a filesystem and force a check with "fsck -f" and occasionally find errors.
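A forced check is only safe on a filesystem that is not mounted. The following is a minimal sketch of a guard for that step, assuming a Linux guest with /proc/mounts; the helper name is_mounted and the device /dev/sdb2 are illustrative only, and device names under LVM or multipath may appear differently in /proc/mounts.

```shell
#!/bin/sh
# Succeed (exit 0) if the given device appears as a mount source in
# /proc/mounts-style text read from stdin; fail (exit 1) otherwise.
# "is_mounted" is a hypothetical helper name, not a standard tool.
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }'
}

# Usage (run as root; /dev/sdb2 is only an example device):
#   if is_mounted /dev/sdb2 < /proc/mounts; then
#       echo "still mounted: unmount it first"
#   else
#       fsck -f /dev/sdb2
#   fi
```

Note that a filesystem the kernel has remounted read-only still counts as mounted, so it has to be unmounted before the check.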

I've found - https://bugzilla.redhat.com/show_bug.cgi?id=228108 http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=51306 - which seem related but I believe I am running a kernel that contains these fixes.

My kernel is 2.6.18-194.32.1.el5 on one of the most affected hosts.

Does anyone else have experience with similar issues or know of the status of this Bug/KB?

I can install, boot, run, and then at some seemingly random moment -

init_special_inode: bogus i_mode (50632)
init_special_inode: bogus i_mode (137147)
init_special_inode: bogus i_mode (172036)
init_special_inode: bogus i_mode (175720)
init_special_inode: bogus i_mode (72350)
init_special_inode: bogus i_mode (174751)
EXT3-fs error (device sdb2): ext3_lookup: unlinked inode 19698169 in dir #19696695
Aborting journal on device sdb2.
init_special_inode: bogus i_mode (165661)
EXT3-fs error (device sdb2): ext3_lookup: unlinked inode 19698131 in dir #19696695
init_special_inode: bogus i_mode (76763)
init_special_inode: bogus i_mode (3116)
init_special_inode: bogus i_mode (75363)
init_special_inode: bogus i_mode (77034)
init_special_inode: bogus i_mode (132237)
EXT3-fs error (device sdb2): ext3_lookup: unlinked inode 19698139 in dir #19696695
init_special_inode: bogus i_mode (53031)
init_special_inode: bogus i_mode (33361)
init_special_inode: bogus i_mode (77546)
init_special_inode: bogus i_mode (6516)
EXT3-fs error (device sdb2): ext3_lookup: unlinked inode 19698143 in dir #19696695
init_special_inode: bogus i_mode (6442)
init_special_inode: bogus i_mode (72554)
EXT3-fs error (device sdb2): ext3_lookup: unlinked inode 19698142 in dir #19696695
EXT3-fs error (device sdb2): ext3_lookup: unlinked inode 19698164 in dir #19696695
init_special_inode: bogus i_mode (73171)
init_special_inode: bogus i_mode (154432)
init_special_inode: bogus i_mode (34302)
init_special_inode: bogus i_mode (131733)
init_special_inode: bogus i_mode (30773)
ext3_abort called.
EXT3-fs error (device sdb2): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
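For anyone trying to spot the same symptom across several guests, here is a rough sketch that scans kernel log text for the abort signatures quoted above. The function names are hypothetical, the patterns are taken only from the messages in this post, and the log location in the usage comment is an assumption that may differ per system.

```shell
#!/bin/sh
# Count lines in kernel log text (stdin) that match the ext3
# journal-abort signatures seen in the messages above.
count_journal_aborts() {
    grep -c -E 'Aborting journal on device|Detected aborted journal|Remounting filesystem read-only'
}

# Print the unique device names from "Aborting journal on device X." lines.
aborted_devices() {
    sed -n 's/.*Aborting journal on device \([A-Za-z0-9_-]*\).*/\1/p' | sort -u
}

# Usage (the log path is an assumption; adjust for your system):
#   dmesg | count_journal_aborts
#   aborted_devices < /var/log/messages
```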


Kwan Lowe

13 Feb 2011

2:29 p.m.

On Sun, Feb 13, 2011 at 9:09 AM, Adam Tauno Williams awilliam@whitemice.org wrote:

...

I have several CentOS5 hosts in a VMware ESX 3.5.0 226117 environment using iSCSI storage. Recently we've begun to experience journal aborts resulting in remounted-read-only filesystems as well as other filesystem issues - I can unmount a filesystem and force a check with "fsck -f" and occasionally find errors.

I've found - https://bugzilla.redhat.com/show_bug.cgi?id=228108 http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=51306 - which seem related but I believe I am running a kernel that contains these fixes.

I ran into a similar problem, but it was not specifically iSCSI. We ended up changing a setting in sysctl.conf. Give me a few minutes and I will find the setting.

Kwan Lowe

2:40 p.m.

On Sun, Feb 13, 2011 at 9:09 AM, Adam Tauno Williams awilliam@whitemice.org wrote:

...

I have several CentOS5 hosts in a VMware ESX 3.5.0 226117 environment using iSCSI storage. Recently we've begun to experience journal aborts resulting in remounted-read-only filesystems as well as other filesystem issues - I can unmount a filesystem and force a check with "fsck -f" and occasionally find errors.

http://communities.vmware.com/message/245983

The setting we used to resolve it was vm.min_free_kbytes = 8192

Previous to this we were seeing the error pop up every week or so.

Keith Beeby

8:28 p.m.

I am also seeing this issue with CentOS 5.4 and 5.5 on NFS shared storage. According to the VMware knowledge base article, this should have been resolved in the 5.1 update.

Does changing the vm.min_free_kbytes value also resolve the issue on CentOS 5.4 and 5.5?

On 13 Feb 2011, at 14:40, "Kwan Lowe" kwan.lowe@gmail.com wrote:

...

On Sun, Feb 13, 2011 at 9:09 AM, Adam Tauno Williams awilliam@whitemice.org wrote:

...
I have several CentOS5 hosts in a VMware ESX 3.5.0 226117 environment using iSCSI storage. Recently we've begun to experience journal aborts resulting in remounted-read-only filesystems as well as other filesystem issues - I can unmount a filesystem and force a check with "fsck -f" and occasionally find errors.

http://communities.vmware.com/message/245983

The setting we used to resolve was vm.min_free_kbytes = 8192

Previous to this we were seeing the error pop up every week or so.

_______________________________________________
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos

Adam Tauno Williams

14 Feb 2011

12:03 a.m.

On Sun, 2011-02-13 at 20:28 +0000, Keith Beeby wrote:

...

I am also seeing this issue with CentOS 5.4 and 5.5 on NFS shared storage. According to the VMware knowledge base article, this should have been resolved in the 5.1 update. Does changing the vm.min_free_kbytes value also resolve the issue on CentOS 5.4 and 5.5?

I guess we'll see [this issue has become extremely frustrating].

I suppose it is 'good' to see that someone else sees the issue as well. One issue with virtualization is that debugging these types of issues is an order-of-magnitude more difficult [virtualized OS, virtualized storage, virtualization platform, or some interaction of all the above... ugh].

...

On 13 Feb 2011, at 14:40, "Kwan Lowe" kwan.lowe@gmail.com wrote:

...
On Sun, Feb 13, 2011 at 9:09 AM, Adam Tauno Williams awilliam@whitemice.org wrote:

...
I have several CentOS5 hosts in a VMware ESX 3.5.0 226117 environment using iSCSI storage. Recently we've begun to experience journal aborts resulting in remounted-read-only filesystems as well as other filesystem issues - I can unmount a filesystem and force a check with "fsck -f" and occasionally find errors.

http://communities.vmware.com/message/245983 The setting we used to resolve was vm.min_free_kbytes = 8192 Previous to this we were seeing the error pop up every week or so.

Bazooka Joe

9:01 p.m.

On Sun, Feb 13, 2011 at 4:03 PM, Adam Tauno Williams awilliam@whitemice.org wrote:

...

On Sun, 2011-02-13 at 20:28 +0000, Keith Beeby wrote:

...
I am also seeing this issue with CentOS 5.4 and 5.5 on NFS shared storage. According to the VMware knowledge base article, this should have been resolved in the 5.1 update. Does changing the vm.min_free_kbytes value also resolve the issue on CentOS 5.4 and 5.5?

I guess we'll see [this issue has become extremely frustrating].

I suppose it is 'good' to see that someone else sees the issue as well. One issue with virtualization is that debugging these types of issues is an order-of-magnitude more difficult [virtualized OS, virtualized storage, virtualization platform, or some interaction of all the above... ugh].

I am experiencing the same issue.

CentOS: current
ESXi: v3.5 Update 5
Storage: NFS

I am in the process of rebuilding the virtual server using a different os thinking it was just file system errors.

-bazooka

Adam Tauno Williams

10:02 p.m.

On Mon, 2011-02-14 at 13:01 -0800, Bazooka Joe wrote:

...

On Sun, Feb 13, 2011 at 4:03 PM, Adam Tauno Williams awilliam@whitemice.org wrote:

...
On Sun, 2011-02-13 at 20:28 +0000, Keith Beeby wrote:

...
I am also seeing this issue with CentOS 5.4 and 5.5 on NFS shared storage. According to the VMware knowledge base article, this should have been resolved in the 5.1 update. Does changing the vm.min_free_kbytes value also resolve the issue on CentOS 5.4 and 5.5?

I guess we'll see [this issue has become extremely frustrating]. I suppose it is 'good' to see that someone else sees the issue as well. One issue with virtualization is that debugging these types of issues is an order-of-magnitude more difficult [virtualized OS, virtualized storage, virtualization platform, or some interaction of all the above... ugh].

I am experiencing the same issue. CentOS: current; ESXi: v3.5 Update 5; storage: NFS. I am in the process of rebuilding the virtual server using a different os thinking it was just file system errors.

What other OS?

I've experienced one [possibly unrelated] corruption of the /tmp filesystem on an openSUSE 11.1 VM. So far Windows VMs seem immune to the issue.

Adam Tauno Williams

midnight

On Sun, 2011-02-13 at 09:40 -0500, Kwan Lowe wrote:

...

On Sun, Feb 13, 2011 at 9:09 AM, Adam Tauno Williams awilliam@whitemice.org wrote:

...
I have several CentOS5 hosts in a VMware ESX 3.5.0 226117 environment using iSCSI storage. Recently we've begun to experience journal aborts resulting in remounted-read-only filesystems as well as other filesystem issues - I can unmount a filesystem and force a check with "fsck -f" and occasionally find errors.

http://communities.vmware.com/message/245983 The setting we used to resolve was vm.min_free_kbytes = 8192 Previous to this we were seeing the error pop up every week or so.

You made this change to the *virtual machine* [not the host OS]?

This thread indicates this was with VMware Workstation and not ESX (correct)?

Kwan Lowe

10:36 a.m.

On Sun, Feb 13, 2011 at 7:00 PM, Adam Tauno Williams awilliam@whitemice.org wrote:

...

... I can unmount a filesystem and force a check with "fsck -f" and occasionally find errors.

http://communities.vmware.com/message/245983 The setting we used to resolve was vm.min_free_kbytes = 8192 Previous to this we were seeing the error pop up every week or so.

You made this change to the *virtual machine* [not the host OS]?

This thread indicates this was with VMware Workstation and not ESX (correct)?

This was done on the CentOS and RHEL guests on VMWare ESX hosts.

Keith Beeby

12:08 p.m.

Hi,

So the 'fix' is applied directly to the host os, is this the correct thing to do?

sysctl -w vm.min_free_kbytes = 8192

Keith

On 14 Feb 2011, at 10:36, Kwan Lowe wrote:

...

On Sun, Feb 13, 2011 at 7:00 PM, Adam Tauno Williams awilliam@whitemice.org wrote:

...

... I can unmount a filesystem and force a check with "fsck -f" and occasionally find errors.

http://communities.vmware.com/message/245983 The setting we used to resolve was vm.min_free_kbytes = 8192 Previous to this we were seeing the error pop up every week or so.

You made this change to the *virtual machine* [not the host OS]?

This thread indicates this was with VMware Workstation and not ESX (correct)?

This was done on the CentOS and RHEL guests on VMWare ESX hosts.

Adam Tauno Williams

1 p.m.

On Mon, 2011-02-14 at 12:08 +0000, Keith Beeby wrote:

...

Hi, So the 'fix' is applied directly to the host os,

no, to the *guest* OS instances. [please, do not top-post].

...

is this the correct thing to do? sysctl -w vm.min_free_kbytes = 8192

No space(s) I believe.

sysctl -w vm.min_free_kbytes=8192
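A value set with "sysctl -w" applies only until the next reboot. As a rough sketch of keeping it across reboots, the hypothetical helper below rewrites a single key in a sysctl.conf-style file; the name persist_sysctl is not a standard tool, and /etc/sysctl.conf is assumed to be where the guest keeps these settings.

```shell
#!/bin/sh
# Set "key = value" in a sysctl.conf-style file: drop any existing
# uncommented line for the key, then append the new setting.
# "persist_sysctl" is a hypothetical helper name.
persist_sysctl() {
    key=$1 value=$2 file=$3
    # Escape dots so the key is matched literally by grep -E.
    esc=$(printf '%s' "$key" | sed 's/[.]/\\./g')
    tmp=$(mktemp) || return 1
    [ -f "$file" ] || : > "$file"
    # grep -v exits non-zero when it prints nothing; that is not an error here.
    grep -v -E "^[[:space:]]*${esc}[[:space:]]*=" "$file" > "$tmp" || true
    printf '%s = %s\n' "$key" "$value" >> "$tmp"
    # Write back with cat so the file keeps its owner and permissions.
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Usage (as root, inside the guest):
#   sysctl -w vm.min_free_kbytes=8192                        # runtime only
#   persist_sysctl vm.min_free_kbytes 8192 /etc/sysctl.conf  # survive reboot
#   sysctl -p                                                # reload the file
```

The spaces around "=" are accepted in the configuration file, but not on the "sysctl -w" command line.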

I'm still not entirely clear as to why this setting should/will make a difference in maintaining filesystem integrity.

On "Jun 20, 2007" in the aforementioned thread there is the comment: "RHEL5 still needs a "fix" as well, and since it's not yet officially supported from VMware for ESX my guess is it won't get a formal fix until it is certified. I plan to post a patched driver for RHEL5 on my website in the next day or so." - but the comment is from *2007* and RHEL5 is now certified.

http://communities.vmware.com/message/881727#881727 seems like an update that describes my issue; but even that is from 2008.

Reference: VMware KB#1001778 (Note: RHEL5U1 is long since released)

...

On 14 Feb 2011, at 10:36, Kwan Lowe wrote:

...
On Sun, Feb 13, 2011 at 7:00 PM, Adam Tauno Williams awilliam@whitemice.org wrote:

...

... I can unmount a filesystem and force a check with "fsck -f" and occasionally find errors.

http://communities.vmware.com/message/245983 The setting we used to resolve was vm.min_free_kbytes = 8192 Previous to this we were seeing the error pop up every week or so.

You made this change to the *virtual machine* [not the host OS]? This thread indicates this was with VMware Workstation and not ESX (correct)?

This was done on the CentOS and RHEL guests on VMWare ESX hosts.

Kwan Lowe

1:31 p.m.

On Mon, Feb 14, 2011 at 8:00 AM, Adam Tauno Williams awilliam@whitemice.org wrote:

...

On Mon, 2011-02-14 at 12:08 +0000, Keith Beeby wrote:

...
Hi, So the 'fix' is applied directly to the host os,

no, to the *guest* OS instances. [please, do not top-post].

...
is this the correct thing to do? sysctl -w vm.min_free_kbytes = 8192

No space(s) I believe.

sysctl -w vm.min_free_kbytes=8192

I'm still not entirely clear as to why this setting should/will make a difference in maintaining filesystem integrity.

It's certainly possible that the error I was receiving had a different cause, though with similar symptoms. We started seeing filesystems go read-only, and only rebooting would clear it up.

Johnny Hughes

6:37 p.m.

On 02/14/2011 07:31 AM, Kwan Lowe wrote:

...

On Mon, Feb 14, 2011 at 8:00 AM, Adam Tauno Williams awilliam@whitemice.org wrote:

...
On Mon, 2011-02-14 at 12:08 +0000, Keith Beeby wrote:

...
Hi, So the 'fix' is applied directly to the host os,

no, to the *guest* OS instances. [please, do not top-post].

...
is this the correct thing to do? sysctl -w vm.min_free_kbytes = 8192

No space(s) I believe.

sysctl -w vm.min_free_kbytes=8192

I'm still not entirely clear as to why this setting should/will make a difference in maintaining filesystem integrity.

It's certainly possible that the error I was receiving had a different cause, though with similar symptoms. We started seeing filesystems go read-only, and only rebooting would clear it up.

I use that setting on the "Host OS" for VMWare to prevent a whole vm from getting killed.

That setting maintains a minimum amount of free memory, which prevents a large program that requests memory quickly from depleting all available memory and causing the OOM killer to kill the process using the most RAM.

If you are on a Host OS box, the biggest Memory processes are your VMs, and getting one killed off because memory reaches zero is not good.

I don't have any idea how it would fix journal errors on a drive, but I guess it could.

I set it much higher than 8192 on the host machines ... I set it to 131072.
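To get a feel for how much headroom a machine has relative to that floor, here is a rough sketch that compares MemFree with a floor value in kB. The helper names are hypothetical. The comparison is only an approximation, since the kernel derives per-zone watermarks from vm.min_free_kbytes rather than testing the MemFree total directly.

```shell
#!/bin/sh
# Print the MemFree value (in kB) from /proc/meminfo-style text on stdin.
mem_free_kb() {
    awk '$1 == "MemFree:" { print $2 }'
}

# Report whether free memory is above a given floor (both values in kB).
check_headroom() {
    free_kb=$1 floor_kb=$2
    if [ "$free_kb" -gt "$floor_kb" ]; then
        echo "ok: ${free_kb} kB free, floor ${floor_kb} kB"
    else
        echo "low: ${free_kb} kB free, floor ${floor_kb} kB"
    fi
}

# Usage on a Linux guest or host:
#   check_headroom "$(mem_free_kb < /proc/meminfo)" \
#                  "$(cat /proc/sys/vm/min_free_kbytes)"
```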

Kwan Lowe

6:49 p.m.

...

...
It's certainly possible that the error I was receiving had a different cause, though with similar symptoms. We started seeing filesystems go read-only, and only rebooting would clear it up.

I use that setting on the "Host OS" for VMWare to prevent a whole vm from getting killed.

That setting maintains a minimum amount of free memory, which prevents a large program that requests memory quickly from depleting all available memory and causing the OOM killer to kill the process using the most RAM.

If you are on a Host OS box, the biggest Memory processes are your VMs, and getting one killed off because memory reaches zero is not good.

I don't have any idea how it would fix journal errors on a drive, but I guess it could.

It's been a few years since I put in the tuning, but here's some info that might be useful:

http://communities.vmware.com/thread/20690?start=0&tstart=0

In particular, others had reported seeing this error:

"kernel: journal_get_undo_access: No memory for committed data".

I don't recall that error in my case, but it might explain why the tuning fixed the problem. There's a bugzilla for this:

https://bugzilla.redhat.com/show_bug.cgi?id=179605
