[CentOS] Re: OOM condition with file-4.17-9.0.1.el5

Sat May 17 12:41:19 UTC 2008
John <jknappers-argentia at hotmail.com>

Hello Filip,

I forgot the word NOT in this part of my message.

> For me, the problem was solved, and I had enough proof for myself to say 
> I assume file(1) was the problem. I agree, this is not enough to convince 
> some developers, and the problem file is NOT easily found. But I believe 
> the problem file should be somewhere in my Amavisd-new directory. But at 
> this moment I don't have a test system to run the complete directory 
> through file(1).


John


"John" <jknappers-argentia at hotmail.com> schreef 
in bericht news:g0mjh8$hah$1 at ger.gmane.org...
>
> "Filipe Brandenburger" 
> <filbranden at gmail.com> schreef in bericht 
> news:e814db780805161718oeb2763ja80120d58b3abc93 at mail.gmail.com...
>> On Fri, May 16, 2008 at 8:30 AM, John 
>> <jknappers-argentia at hotmail.com> wrote:
>>> We just encountered several serious Out of Memory problems caused by the
>>> file 4.17-9.0.1.EL5 utility on our mail gateway. CentOS 5.1 kept itself
>>> alive by killing its parent process Amavisd-new. Manually restarting
>>> Amavisd-new and flushing the mail queue caused the same problem. Disabling
>>> file for the moment caused Amavisd-new to fail, and also defer all mail.
>>
>> I cannot understand it. In which situations would the "file" utility
>> be causing OOM issues? "file" is not a daemon or something that is
>> running all the time. And to generate an OOM situation, you have to
>> exhaust all the memory on your machine.
>>
>> Could you give more details here?
>>
>> How much memory does the machine have?
>> How much swap space does the machine have?
>> Why do you believe "file" is the offending application?
>> Who is using "file" on a regular basis?
>> What types of files are being tested for type with the "file" utility?
>> Can you produce a specific file which causes "file" to allocate a lot of 
>> memory?
>> Can you produce the output of "ps" or "top" which shows "file" using a
>> lot of memory just before a crash?
>>
>> If you don't give us such details, it's really hard to believe that
>> "file" is what is causing your issue. Please back up your statement.
>>
>> Filipe
>
> Ok, I will try.
> The machine config:
> The machine is an HP DL360 G3 with 2 GB RAM and 2 GB swap, running 
> CentOS 5.1 x86.
>
> The purpose of file(1) and how it is used:
> The machine is our mail gateway, running Postfix / Amavisd-new / Clamd / 
> Perl SpamAssassin.
> All incoming and outgoing mail is handed to Amavisd-new, which extracts 
> the mail parts and runs them through, among other things, file(1).
> So every incoming / outgoing mail passes through file(1), which is used to 
> enforce mail policy (no exe, com, hta, pif etc. gets in, even if the 
> extension has been changed).
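>
> As a rough illustration of what that policy check relies on (the file 
> names here are made up, and the exact wording of the output varies by 
> file(1) version): file(1) reports the content type from the file's magic 
> bytes, so a renamed extension does not hide an executable.
>
>   $ cp putty.exe putty.jpg
>   $ file putty.jpg
>   putty.jpg: MS-DOS executable PE for MS Windows (GUI) Intel 80386 32-bit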
>
> I cannot provide an example that causes file(1) to consume all memory, 
> because I had to get the mail system running again. Thinking about it 
> now, it would have been wise to make a copy of the Postfix deferred queue 
> before starting the system again.
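>
> Something along these lines would have preserved the evidence (the spool 
> path is the Postfix default, and the backup location is just an example):
>
>   postfix stop
>   cp -a /var/spool/postfix/deferred /root/deferred-backup
>   postfix start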
>
> The kernel logged the OOM condition and killed the parent Amavisd-new 
> process to keep itself alive.
> I saved that part and am adding it here at the end of the mail.
>
> This is what happened just before that in the maillog:
>
> May 16 13:16:56 gw-0001 amavis[6058]: (06058-01) FWD via SMTP: 
> <696e666f726e6c2c4e4c312c33322c31303334 at bounce.msg1svc.net> 
>  -> <john.xxx at xxxxx.nl>,BODY=8BITMIME 250 2.6.0 Ok, 
> id=06058-01, from $
> May 16 13:16:56 gw-0001 amavis[6059]: (06059-02-2) (!!)TROUBLE in 
> check_mail: parts_decode_ext FAILED: file(1) utility (/usr/bin/file) 
> error: run_command (open pipe): Can't fork at 
> /usr/lib/perl5/5.8.8/i386-li$
> May 16 13:16:56 gw-0001 amavis[6059]: (06059-02-2) (!)PRESERVING EVIDENCE 
> in /var/amavis/tmp/amavis-20080516T131638-06059
> May 16 13:16:56 gw-0001 amavis[6058]: (06058-01) Passed CLEAN, [backup-mx] 
> [66.179.205.75] 
> <696e666f726e6c2c4e4c312c33322c31303334 at bounce.msg1svc.net> 
>  -> <john.xxxx at xxxxx.nl>, Message-ID: <1$
> May 16 13:16:56 gw-0001 postfix/lmtp[7373]: 8196E1380A6: 
> to=<john.xxxx at xxxxxx.nl>, 
> relay=localhost[127.0.0.1]:10024, delay=12071, delays=12052/0.01/0.03/19, 
> dsn=2.0.0, status=sent (250 2.0.0 Ok:$
>
> Maybe Amavisd-new saved the problem message.
> This is the first entry where file(1) failed. The saved message contains 
> a normal HTML message. Nothing strange here.
>
> A bit later in the maillog:
> May 16 13:17:59 gw-0001 postfix/smtpd[7385]: connect from 
> localhost.localdomain[127.0.0.1]
> May 16 13:18:02 gw-0001 amavis[5639]: (!)Net::Server: 2008/05/16-13:18:02 
> Bad fork [Cannot allocate memory]\n  at line 168 in file 
> /usr/lib/perl5/vendor_perl/5.8.8/Net/Server/PreForkSimple.pm
> May 16 13:18:02 gw-0001 amavis[5639]: Net::Server: 2008/05/16-13:18:02 
> Server closing!
>
> This, combined with the kernel logging at the end, and the fact that 
> after repeatedly restarting Amavisd-new and flushing the Postfix queue it 
> ran into the same problem again, made me start to believe that there may 
> be a problem in file(1). After checking with "yum list file", I concluded 
> that there were no CentOS updates for file.
>
> The easiest way to get a newer version of file was to rebuild the SRPM of 
> FC9's file-4.23-05 on CentOS 5. It compiled without problems, and after 
> installing it, starting Amavisd-new and flushing the queue, all deferred 
> mail was delivered without any problems.
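>
> Roughly, the rebuild went like this (the exact SRPM file name is from 
> memory and may differ, and the RPMS path assumes building as root on 
> CentOS 5):
>
>   rpmbuild --rebuild file-4.23-5.fc9.src.rpm
>   rpm -Uvh /usr/src/redhat/RPMS/i386/file-4.23-5*.i386.rpm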
>
> For me, the problem was solved, and I had enough proof for myself to say 
> I assume file(1) was the problem. I agree, this is not enough to convince 
> some developers, and the problem file is easily found. But I believe the 
> problem file should be somewhere in my Amavisd-new directory. But at this 
> moment I don't have a test system to run the complete directory through 
> file(1).
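>
> If I do get a test box, a rough way to hunt for the offending part would 
> be to cap the memory file(1) may use and then run it over everything 
> Amavisd-new has preserved (the 256 MB limit and the log file name are 
> arbitrary choices of mine):
>
>   ulimit -v 262144
>   find /var/amavis/tmp -type f -exec file {} \; 2>&1 | tee file-scan.log
>
> Any part on which file(1) fails with the limit in place would be a good 
> candidate for the problem file.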
>
> I hope I have provided some of the information you wanted to see.
>
> Regards,
> John
>
>
> May 16 13:17:53 gw-0001 kernel: cmasasd invoked oom-killer: 
> gfp_mask=0x201d2, order=0, oomkilladj=0
> May 16 13:17:53 gw-0001 kernel:  [<c0457d36>] out_of_memory+0x72/0x1a4
> May 16 13:17:53 gw-0001 kernel:  [<c0459161>] __alloc_pages+0x201/0x282
> May 16 13:17:53 gw-0001 kernel:  [<c045a3bf>] 
> __do_page_cache_readahead+0xc4/0x1c6
> May 16 13:17:53 gw-0001 kernel:  [<c0454a21>] sync_page+0x0/0x3b
> May 16 13:17:53 gw-0001 kernel:  [<c044fbe5>] 
> __delayacct_blkio_end+0x32/0x35
> May 16 13:17:53 gw-0001 kernel:  [<c0604c24>] __wait_on_bit_lock+0x4b/0x52
> May 16 13:17:53 gw-0001 kernel:  [<c04549a4>] __lock_page+0x51/0x57
> May 16 13:17:53 gw-0001 kernel:  [<c04572d8>] filemap_nopage+0x151/0x315
> May 16 13:17:53 gw-0001 kernel:  [<c045fda3>] 
> __handle_mm_fault+0x172/0x87b
> May 16 13:17:53 gw-0001 kernel:  [<c04b43cf>] do_msgrcv+0x241/0x290
> May 16 13:17:53 gw-0001 kernel:  [<c0606c2b>] do_page_fault+0x20a/0x4b8
> May 16 13:17:53 gw-0001 kernel:  [<c0606a21>] do_page_fault+0x0/0x4b8
> May 16 13:17:53 gw-0001 kernel:  [<c0405a71>] error_code+0x39/0x40
> May 16 13:17:53 gw-0001 kernel:  =======================
> May 16 13:17:53 gw-0001 kernel: Mem-info:
> May 16 13:17:53 gw-0001 kernel: DMA per-cpu:
> May 16 13:17:53 gw-0001 kernel: cpu 0 hot: high 0, batch 1 used:0
> May 16 13:17:53 gw-0001 kernel: cpu 0 cold: high 0, batch 1 used:0
> May 16 13:17:53 gw-0001 kernel: cpu 1 hot: high 0, batch 1 used:0
> May 16 13:17:53 gw-0001 kernel: cpu 1 cold: high 0, batch 1 used:0
> May 16 13:17:53 gw-0001 kernel: cpu 2 hot: high 0, batch 1 used:0
> May 16 13:17:53 gw-0001 kernel: cpu 2 cold: high 0, batch 1 used:0
> May 16 13:17:53 gw-0001 kernel: cpu 3 hot: high 0, batch 1 used:0
> May 16 13:17:54 gw-0001 kernel: cpu 3 cold: high 0, batch 1 used:0
> May 16 13:17:54 gw-0001 kernel: DMA32 per-cpu: empty
> May 16 13:17:54 gw-0001 kernel: Normal per-cpu:
> May 16 13:17:54 gw-0001 kernel: cpu 0 hot: high 186, batch 31 used:161
> May 16 13:17:54 gw-0001 kernel: cpu 0 cold: high 62, batch 15 used:14
> May 16 13:17:54 gw-0001 kernel: cpu 1 hot: high 186, batch 31 used:44
> May 16 13:17:54 gw-0001 kernel: cpu 1 cold: high 62, batch 15 used:47
> May 16 13:17:54 gw-0001 kernel: cpu 2 hot: high 186, batch 31 used:70
> May 16 13:17:54 gw-0001 kernel: cpu 2 cold: high 62, batch 15 used:15
> May 16 13:17:54 gw-0001 kernel: cpu 3 hot: high 186, batch 31 used:99
> May 16 13:17:54 gw-0001 kernel: cpu 3 cold: high 62, batch 15 used:29
> May 16 13:17:54 gw-0001 kernel: HighMem per-cpu:
> May 16 13:17:54 gw-0001 kernel: cpu 0 hot: high 186, batch 31 used:29
> May 16 13:17:54 gw-0001 kernel: cpu 0 cold: high 62, batch 15 used:42
> May 16 13:17:54 gw-0001 kernel: cpu 1 hot: high 186, batch 31 used:26
> May 16 13:17:54 gw-0001 kernel: cpu 1 cold: high 62, batch 15 used:53
> May 16 13:17:54 gw-0001 kernel: cpu 2 hot: high 186, batch 31 used:23
> May 16 13:17:54 gw-0001 kernel: cpu 2 cold: high 62, batch 15 used:56
> May 16 13:17:54 gw-0001 kernel: cpu 3 hot: high 186, batch 31 used:168
> May 16 13:17:54 gw-0001 kernel: cpu 3 cold: high 62, batch 15 used:42
> May 16 13:17:54 gw-0001 kernel: Free pages:       47492kB (464kB HighMem)
> May 16 13:17:54 gw-0001 kernel: Active:248505 inactive:243082 dirty:0 
> writeback:0 unstable:0 free:11873 slab:
> 4798 mapped-file:1037 mapped-anon:491633 pagetables:2940
> May 16 13:17:54 gw-0001 kernel: DMA free:8160kB min:36kB low:44kB 
> high:52kB active:2340kB inactive:1364kB pre
> sent:16384kB pages_scanned:6857 all_unreclaimable? yes
> May 16 13:17:54 gw-0001 kernel: lowmem_reserve[]: 0 0 880 2031
> May 16 13:17:54 gw-0001 kernel: DMA32 free:0kB min:0kB low:0kB high:0kB 
> active:0kB inactive:0kB present:0kB p
> ages_scanned:0 all_unreclaimable? no
> May 16 13:17:54 gw-0001 kernel: lowmem_reserve[]: 0 0 880 2031
> May 16 13:17:54 gw-0001 kernel: Normal free:38868kB min:2008kB low:2508kB 
> high:3012kB active:397440kB inactiv
> e:409920kB present:901120kB pages_scanned:1894767 all_unreclaimable? yes
> May 16 13:17:54 gw-0001 kernel: lowmem_reserve[]: 0 0 0 9215
> May 16 13:17:54 gw-0001 kernel: HighMem free:464kB min:512kB low:1168kB 
> high:1828kB active:594240kB inactive:
> 561044kB present:1179624kB pages_scanned:4549828 all_unreclaimable? yes
> May 16 13:17:54 gw-0001 kernel: lowmem_reserve[]: 0 0 0 0
> May 16 13:17:54 gw-0001 kernel: DMA: 0*4kB 0*8kB 126*16kB 60*32kB 10*64kB 
> 0*128kB 2*256kB 0*512kB 1*1024kB 1*
> 2048kB 0*4096kB = 8160kB
> May 16 13:17:54 gw-0001 kernel: DMA32: empty
> May 16 13:17:54 gw-0001 kernel: Normal: 1*4kB 28*8kB 2059*16kB 120*32kB 
> 1*64kB 0*128kB 1*256kB 1*512kB 1*1024
> kB 0*2048kB 0*4096kB = 38868kB
> May 16 13:17:54 gw-0001 kernel: HighMem: 4*4kB 8*8kB 0*16kB 2*32kB 1*64kB 
> 2*128kB 0*256kB 0*512kB 0*1024kB 0*
> 2048kB 0*4096kB = 464kB
> May 16 13:17:54 gw-0001 kernel: Swap cache: add 591115, delete 590893, 
> find 5038951/5045583, race 0+568
> May 16 13:17:54 gw-0001 kernel: Free swap  = 0kB
> May 16 13:17:54 gw-0001 kernel: Total swap = 2097096kB
> May 16 13:17:54 gw-0001 kernel: Free swap:            0kB
> May 16 13:17:54 gw-0001 kernel: 524282 pages of RAM
> May 16 13:17:54 gw-0001 kernel: 294906 pages of HIGHMEM
> May 16 13:17:54 gw-0001 kernel: 5445 reserved pages
> May 16 13:17:54 gw-0001 kernel: 357914 pages shared
> May 16 13:17:54 gw-0001 kernel: 222 pages swap cached
> May 16 13:17:54 gw-0001 kernel: 0 pages dirty
> May 16 13:17:54 gw-0001 kernel: 0 pages writeback
> May 16 13:17:54 gw-0001 kernel: 1037 pages mapped
> May 16 13:17:54 gw-0001 kernel: 4798 pages slab
> May 16 13:17:54 gw-0001 kernel: 2940 pages pagetables
> May 16 13:17:54 gw-0001 kernel: Out of memory: Killed process 6054 
> (amavisd).
>
>
> Versions used:
> uname -a
> Linux gw-0001.xxx.yyy.com 2.6.18-53.1.14.el5 #1 SMP Wed Mar 5 11:36:49 EST 
> 2008 i686 i686 i386 GNU/Linux
>
> rpm -q postfix          postfix-2.4.7-3.el5.mysql_pgsql.jk
> rpm -q amavisd-new      amavisd-new-2.5.4-5.el5.jk
> rpm -q spamassassin     spamassassin-3.2.4-1.el5.rf
> rpm -q clamd            clamd-0.93-2.el5.rf
> rpm -q file             file-4.23-5.el5.jk (this is the backported 
> version; the problem version is 4.17-9.0.1)