[CentOS] Re: OOM condition with file-4.17-9.0.1.el5

Sat May 17 12:36:54 UTC 2008
John <jknappers-argentia at hotmail.com>

"Filipe Brandenburger" <filbranden at gmail.com> 
schreef in bericht 
news:e814db780805161718oeb2763ja80120d58b3abc93 at mail.gmail.com...
> On Fri, May 16, 2008 at 8:30 AM, John 
> <jknappers-argentia at hotmail.com> wrote:
>> We just encountered several serious out-of-memory problems caused by the
>> file 4.17-9.0.1.el5 utility on our mail gateway. CentOS 5.1 kept itself
>> alive by killing its parent process, Amavisd-new. Manually restarting
>> Amavisd-new and flushing the mail queue caused the same problem. Disabling
>> file for the moment caused Amavisd-new to fail, and also defer all mail.
>
> I cannot understand it. In which situations would the "file" utility
> be causing OOM issues? "file" is not a daemon or something that is
> running all the time. And to generate an OOM situation, you have to
> exhaust all the memory on your machine.
>
> Could you give more details here?
>
> How much memory does the machine have?
> How much swap space does the machine have?
> Why do you believe "file" is the offending application?
> Who is using "file" on a regular basis?
> What types of files are being tested for type with the "file" utility?
> Can you produce a specific file which causes "file" to allocate a lot of 
> memory?
> Can you produce the output of "ps" or "top" which shows "file" using a
> lot of memory just before a crash?
>
> If you don't give us such details, it's really hard to believe that
> "file" is what is causing your issue. Please back up your statement.
>
> Filipe

OK, I will try.
The machine configuration:
An HP DL360 G3 with 2 GB of memory and 2 GB of swap, running CentOS 5.1 (x86).

The purpose of file(1) and how it is used:
The machine is our mail gateway, running Postfix, Amavisd-new, Clamd and 
Perl SpamAssassin.
All incoming and outgoing mail is handed to Amavisd-new, which extracts the 
mail parts and runs them through, among other things, file(1).
So every incoming and outgoing mail passes through file(1), which is used to 
enforce our mail policy (no exe, com, hta, pif etc. comes in, even if the 
extension has been changed).
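
Just to illustrate what that check boils down to (a rough sketch of the 
idea, not the actual Amavisd-new code; the part path below is made up):

file -b /var/amavis/tmp/amavis-xxxx/parts/p001
# prints e.g. "MS-DOS executable (EXE)", so that part gets banned,
# no matter what extension the attachment carries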

I cannot provide an example that causes file(1) to consume all memory, because 
I had to get the mail system running again. In hindsight, it would have been 
wise to make a copy of the Postfix deferred queue before starting the system again.
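
For reference, what I should have done before flushing is something like this 
(assuming the default Postfix spool location):

postfix stop
cp -a /var/spool/postfix/deferred /root/deferred-queue-backup
postfix start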

The kernel logged the OOM condition and killed the parent Amavisd-new process 
to keep itself alive.
I saved that part of the log and am adding it at the end of this mail.

This is what happened just before that in the maillog:

May 16 13:16:56 gw-0001 amavis[6058]: (06058-01) FWD via SMTP: 
<696e666f726e6c2c4e4c312c33322c31303334 at bounce.msg1svc.net> -> 
<john.xxx at xxxxx.nl>,BODY=8BITMIME 250 2.6.0 Ok, id=06058-01, from $
May 16 13:16:56 gw-0001 amavis[6059]: (06059-02-2) (!!)TROUBLE in 
check_mail: parts_decode_ext FAILED: file(1) utility (/usr/bin/file) error: 
run_command (open pipe): Can't fork at /usr/lib/perl5/5.8.8/i386-li$
May 16 13:16:56 gw-0001 amavis[6059]: (06059-02-2) (!)PRESERVING EVIDENCE in 
/var/amavis/tmp/amavis-20080516T131638-06059
May 16 13:16:56 gw-0001 amavis[6058]: (06058-01) Passed CLEAN, [backup-mx] 
[66.179.205.75] 
<696e666f726e6c2c4e4c312c33322c31303334 at bounce.msg1svc.net> -> 
<john.xxxx at xxxxx.nl>, Message-ID: <1$
May 16 13:16:56 gw-0001 postfix/lmtp[7373]: 8196E1380A6: 
to=<john.xxxx at xxxxxx.nl>, relay=localhost[127.0.0.1]:10024, delay=12071, 
delays=12052/0.01/0.03/19, dsn=2.0.0, status=sent (250 2.0.0 Ok:$

Maybe Amavisd-new saved the problem message; this is the first entry where 
file(1) failed. The saved message contains a normal HTML message. Nothing 
strange there.

A bit later in the maillog:
May 16 13:17:59 gw-0001 postfix/smtpd[7385]: connect from 
localhost.localdomain[127.0.0.1]
May 16 13:18:02 gw-0001 amavis[5639]: (!)Net::Server: 2008/05/16-13:18:02 
Bad fork [Cannot allocate memory]\n  at line 168 in file 
/usr/lib/perl5/vendor_perl/5.8.8/Net/Server/PreForkSimple.pm
May 16 13:18:02 gw-0001 amavis[5639]: Net::Server: 2008/05/16-13:18:02 
Server closing!

This, combined with the kernel logging at the end, and the fact that 
repeatedly restarting Amavisd-new and flushing the Postfix queue ran into the 
same problem every time, made me start to believe that there might be a 
problem in file(1). After checking with "yum list file", I concluded that 
there were no CentOS updates available for file.
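
(For completeness, the commands for such a check are roughly:

yum list file
yum check-update file

and neither turned up anything newer than 4.17-9.0.1.el5.)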

The easiest way to get a newer version of file was to rebuild the SRPM of 
Fedora 9's file-4.23-5 on CentOS 5. It compiled without problems, and after 
installing it, starting Amavisd-new and flushing the queue, all deferred mail 
was delivered without any problems.
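
In case it helps someone else, the rebuild and recovery went roughly like this 
(the exact SRPM filename, build output path and init script name may differ 
on your system):

rpmbuild --rebuild file-4.23-5.fc9.src.rpm
rpm -Uvh /usr/src/redhat/RPMS/i386/file-4.23-*.rpm
service amavisd start
postqueue -f    # flush the deferred queue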

For me the problem was solved, and I had enough proof for myself to say that 
I assume file(1) was the problem. I agree that this is not enough to convince 
some developers, and the problem file is not easily found. I believe the 
problem file should be somewhere in my Amavisd-new directory, but at this 
moment I don't have a test system to run the complete directory through 
file(1).
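
If I do get a test system, something along these lines should track the 
culprit down; the limit value is arbitrary, the idea is just to cap file(1)'s 
memory so the bad part makes it fail with an allocation error instead of 
triggering the OOM killer again:

ulimit -v 262144    # limit the shell and its children to 256 MB of address space
find /var/amavis/tmp -type f -exec file {} \; > /root/file-scan.log 2>&1
# the part that the old file(1) chokes on should show up as an error in the log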

I hope this provides some of the information you wanted to see.

Regards,
John


May 16 13:17:53 gw-0001 kernel: cmasasd invoked oom-killer: 
gfp_mask=0x201d2, order=0, oomkilladj=0
May 16 13:17:53 gw-0001 kernel:  [<c0457d36>] out_of_memory+0x72/0x1a4
May 16 13:17:53 gw-0001 kernel:  [<c0459161>] __alloc_pages+0x201/0x282
May 16 13:17:53 gw-0001 kernel:  [<c045a3bf>] 
__do_page_cache_readahead+0xc4/0x1c6
May 16 13:17:53 gw-0001 kernel:  [<c0454a21>] sync_page+0x0/0x3b
May 16 13:17:53 gw-0001 kernel:  [<c044fbe5>] 
__delayacct_blkio_end+0x32/0x35
May 16 13:17:53 gw-0001 kernel:  [<c0604c24>] __wait_on_bit_lock+0x4b/0x52
May 16 13:17:53 gw-0001 kernel:  [<c04549a4>] __lock_page+0x51/0x57
May 16 13:17:53 gw-0001 kernel:  [<c04572d8>] filemap_nopage+0x151/0x315
May 16 13:17:53 gw-0001 kernel:  [<c045fda3>] __handle_mm_fault+0x172/0x87b
May 16 13:17:53 gw-0001 kernel:  [<c04b43cf>] do_msgrcv+0x241/0x290
May 16 13:17:53 gw-0001 kernel:  [<c0606c2b>] do_page_fault+0x20a/0x4b8
May 16 13:17:53 gw-0001 kernel:  [<c0606a21>] do_page_fault+0x0/0x4b8
May 16 13:17:53 gw-0001 kernel:  [<c0405a71>] error_code+0x39/0x40
May 16 13:17:53 gw-0001 kernel:  =======================
May 16 13:17:53 gw-0001 kernel: Mem-info:
May 16 13:17:53 gw-0001 kernel: DMA per-cpu:
May 16 13:17:53 gw-0001 kernel: cpu 0 hot: high 0, batch 1 used:0
May 16 13:17:53 gw-0001 kernel: cpu 0 cold: high 0, batch 1 used:0
May 16 13:17:53 gw-0001 kernel: cpu 1 hot: high 0, batch 1 used:0
May 16 13:17:53 gw-0001 kernel: cpu 1 cold: high 0, batch 1 used:0
May 16 13:17:53 gw-0001 kernel: cpu 2 hot: high 0, batch 1 used:0
May 16 13:17:53 gw-0001 kernel: cpu 2 cold: high 0, batch 1 used:0
May 16 13:17:53 gw-0001 kernel: cpu 3 hot: high 0, batch 1 used:0
May 16 13:17:54 gw-0001 kernel: cpu 3 cold: high 0, batch 1 used:0
May 16 13:17:54 gw-0001 kernel: DMA32 per-cpu: empty
May 16 13:17:54 gw-0001 kernel: Normal per-cpu:
May 16 13:17:54 gw-0001 kernel: cpu 0 hot: high 186, batch 31 used:161
May 16 13:17:54 gw-0001 kernel: cpu 0 cold: high 62, batch 15 used:14
May 16 13:17:54 gw-0001 kernel: cpu 1 hot: high 186, batch 31 used:44
May 16 13:17:54 gw-0001 kernel: cpu 1 cold: high 62, batch 15 used:47
May 16 13:17:54 gw-0001 kernel: cpu 2 hot: high 186, batch 31 used:70
May 16 13:17:54 gw-0001 kernel: cpu 2 cold: high 62, batch 15 used:15
May 16 13:17:54 gw-0001 kernel: cpu 3 hot: high 186, batch 31 used:99
May 16 13:17:54 gw-0001 kernel: cpu 3 cold: high 62, batch 15 used:29
May 16 13:17:54 gw-0001 kernel: HighMem per-cpu:
May 16 13:17:54 gw-0001 kernel: cpu 0 hot: high 186, batch 31 used:29
May 16 13:17:54 gw-0001 kernel: cpu 0 cold: high 62, batch 15 used:42
May 16 13:17:54 gw-0001 kernel: cpu 1 hot: high 186, batch 31 used:26
May 16 13:17:54 gw-0001 kernel: cpu 1 cold: high 62, batch 15 used:53
May 16 13:17:54 gw-0001 kernel: cpu 2 hot: high 186, batch 31 used:23
May 16 13:17:54 gw-0001 kernel: cpu 2 cold: high 62, batch 15 used:56
May 16 13:17:54 gw-0001 kernel: cpu 3 hot: high 186, batch 31 used:168
May 16 13:17:54 gw-0001 kernel: cpu 3 cold: high 62, batch 15 used:42
May 16 13:17:54 gw-0001 kernel: Free pages:       47492kB (464kB HighMem)
May 16 13:17:54 gw-0001 kernel: Active:248505 inactive:243082 dirty:0 
writeback:0 unstable:0 free:11873 slab:
4798 mapped-file:1037 mapped-anon:491633 pagetables:2940
May 16 13:17:54 gw-0001 kernel: DMA free:8160kB min:36kB low:44kB high:52kB 
active:2340kB inactive:1364kB pre
sent:16384kB pages_scanned:6857 all_unreclaimable? yes
May 16 13:17:54 gw-0001 kernel: lowmem_reserve[]: 0 0 880 2031
May 16 13:17:54 gw-0001 kernel: DMA32 free:0kB min:0kB low:0kB high:0kB 
active:0kB inactive:0kB present:0kB p
ages_scanned:0 all_unreclaimable? no
May 16 13:17:54 gw-0001 kernel: lowmem_reserve[]: 0 0 880 2031
May 16 13:17:54 gw-0001 kernel: Normal free:38868kB min:2008kB low:2508kB 
high:3012kB active:397440kB inactiv
e:409920kB present:901120kB pages_scanned:1894767 all_unreclaimable? yes
May 16 13:17:54 gw-0001 kernel: lowmem_reserve[]: 0 0 0 9215
May 16 13:17:54 gw-0001 kernel: HighMem free:464kB min:512kB low:1168kB 
high:1828kB active:594240kB inactive:
561044kB present:1179624kB pages_scanned:4549828 all_unreclaimable? yes
May 16 13:17:54 gw-0001 kernel: lowmem_reserve[]: 0 0 0 0
May 16 13:17:54 gw-0001 kernel: DMA: 0*4kB 0*8kB 126*16kB 60*32kB 10*64kB 
0*128kB 2*256kB 0*512kB 1*1024kB 1*
2048kB 0*4096kB = 8160kB
May 16 13:17:54 gw-0001 kernel: DMA32: empty
May 16 13:17:54 gw-0001 kernel: Normal: 1*4kB 28*8kB 2059*16kB 120*32kB 
1*64kB 0*128kB 1*256kB 1*512kB 1*1024
kB 0*2048kB 0*4096kB = 38868kB
May 16 13:17:54 gw-0001 kernel: HighMem: 4*4kB 8*8kB 0*16kB 2*32kB 1*64kB 
2*128kB 0*256kB 0*512kB 0*1024kB 0*
2048kB 0*4096kB = 464kB
May 16 13:17:54 gw-0001 kernel: Swap cache: add 591115, delete 590893, find 
5038951/5045583, race 0+568
May 16 13:17:54 gw-0001 kernel: Free swap  = 0kB
May 16 13:17:54 gw-0001 kernel: Total swap = 2097096kB
May 16 13:17:54 gw-0001 kernel: Free swap:            0kB
May 16 13:17:54 gw-0001 kernel: 524282 pages of RAM
May 16 13:17:54 gw-0001 kernel: 294906 pages of HIGHMEM
May 16 13:17:54 gw-0001 kernel: 5445 reserved pages
May 16 13:17:54 gw-0001 kernel: 357914 pages shared
May 16 13:17:54 gw-0001 kernel: 222 pages swap cached
May 16 13:17:54 gw-0001 kernel: 0 pages dirty
May 16 13:17:54 gw-0001 kernel: 0 pages writeback
May 16 13:17:54 gw-0001 kernel: 1037 pages mapped
May 16 13:17:54 gw-0001 kernel: 4798 pages slab
May 16 13:17:54 gw-0001 kernel: 2940 pages pagetables
May 16 13:17:54 gw-0001 kernel: Out of memory: Killed process 6054 
(amavisd).


Versions used:
uname -a
Linux gw-0001.xxx.yyy.com 2.6.18-53.1.14.el5 #1 SMP Wed Mar 5 11:36:49 EST 
2008 i686 i686 i386 GNU/Linux

rpm -q postfix          postfix-2.4.7-3.el5.mysql_pgsql.jk
rpm -q amavisd-new      amavisd-new-2.5.4-5.el5.jk
rpm -q spamassassin     spamassassin-3.2.4-1.el5.rf
rpm -q clamd            clamd-0.93-2.el5.rf
rpm -q file             file-4.23-5.el5.jk (this is the backported version; 
the problem version is 4.17-9.0.1.el5)