Hello Filip,
I forgot the word NOT in this part of my message.
For me, the problem was solved, and I had enough proof for my self to say, I assume file(1) was the problem. I agree, this is not enough for some developers to be convinced, and the problem file is NOT easily found. But I believe the problem file should be somewhere in my Amavisd-new directory. But at this moment I don't have a test system to test the complete directory against file(1)
John
"John" jknappers-argentia@hotmail.com schreef in bericht news:g0mjh8$hah$1@ger.gmane.org...
"Filipe Brandenburger" filbranden@gmail.com schreef in bericht news:e814db780805161718oeb2763ja80120d58b3abc93@mail.gmail.com...
On Fri, May 16, 2008 at 8:30 AM, John jknappers-argentia@hotmail.com wrote:
We just encountered several serious Out of Memory Problems caused by the file 4.17-9.0.1.EL5 utility on our mail Gateway. CentOS 5.1 kept it self alive by killing it parent process Amavisd-new. Manually restarting Amavisd-new and flushing the mail queue caused same problem. Disabling file for the moment, caused Amavisd-new to fail, and also defer all mail.
I cannot understand it. On which situations would the "file" utility be causing OOM issues? "file" is not a daemon or something that is running all the time. And to generate an OOM situation, you have to exhaust all the memory on your machine.
Could you give more details here?
How much memory does the machine have? How much swap space does the machine have? Why do you believe "file" is the offending application? Who is using "file" on a regular basis? What types of files are being tested for type with the "file" utility? Can you produce a specific file which causes "file" to allocate a lot of memory? Can you produce the output of "ps" or "top" which shows "file" using a lot of memory just before a crash?
If you don't give us such details, it's really hard to believe that "file" is what is causing your issue. Please back your statement.
Filipe
Ok, I will try. The machine config Machine is a HP DL360G3, 2gb Mem, 2Gb swap running CentOS5.1 x86
The purpose of and how file(1) is used. The machine is our mail gateway, Running Postfix / Amavisd-new / Clamd / Perl Spamassassin Al incoming and outgoing mail is handed to Amavisd which extract the mail and ran it against for example file(1) So every incoming / outgoing mail passes file(1), where file(1) is used to enforce mail policy (No exe, com, hta, pif etc is coming in, even if the extension is changed).
I cannot provide a example causing file(1) to consume all memory, because, I had to get the mail system running again. Now thinking, It had been wise to make a copy off the postfix deferred queue before starting the system again.
The kernel has logged the OOM condition and killed the parent Amavisd-new to keep itself alive. I saved that part, and adding it here at the end of the mail
This is what happend just before in the maillog
May 16 13:16:56 gw-0001 amavis[6058]: (06058-01) FWD via SMTP: 696e666f726e6c2c4e4c312c33322c31303334@bounce.msg1svc.net -> john.xxx@xxxxx.nl,BODY=8BITMIME 250 2.6.0 Ok, id=06058-01, from $ May 16 13:16:56 gw-0001 amavis[6059]: (06059-02-2) (!!)TROUBLE in check_mail: parts_decode_ext FAILED: file(1) utility (/usr/bin/file) error: run_command (open pipe): Can't fork at /usr/lib/perl5/5.8.8/i386-li$ May 16 13:16:56 gw-0001 amavis[6059]: (06059-02-2) (!)PRESERVING EVIDENCE in /var/amavis/tmp/amavis-20080516T131638-06059 May 16 13:16:56 gw-0001 amavis[6058]: (06058-01) Passed CLEAN, [backup-mx] [66.179.205.75] 696e666f726e6c2c4e4c312c33322c31303334@bounce.msg1svc.net -> john.xxxx@xxxxx.nl, Message-ID: <1$ May 16 13:16:56 gw-0001 postfix/lmtp[7373]: 8196E1380A6: to=john.xxxx@xxxxxx.nl, relay=localhost[127.0.0.1]:10024, delay=12071, delays=12052/0.01/0.03/19, dsn=2.0.0, status=sent (250 2.0.0 Ok:$
Maybe Amavisd-new saved the problem message, This is the first entry where file failed. The saved message contains a normal html message. Nothing strange here.
A bit later in the maillog: ay 16 13:17:59 gw-0001 postfix/smtpd[7385]: connect from localhost.localdomain[127.0.0.1] May 16 13:18:02 gw-0001 amavis[5639]: (!)Net::Server: 2008/05/16-13:18:02 Bad fork [Cannot allocate memory]\n at line 168 in file /usr/lib/perl5/vendor_perl/5.8.8/Net/Server/PreForkSimple.pm May 16 13:18:02 gw-0001 amavis[5639]: Net::Server: 2008/05/16-13:18:02 Server closing!
This combined with the Kernel logging at the end, and after repeatable restarting Amavisd-new and flushing the postfix queue, it ran dead in the same problem, I started to believe that there may be a problem in file(1). After checking with yum list file, I concluded that there were no CentOS updates for file.
The easied way to get a newer version of file was to rebuild the SRPM of FC9's file-4.23-05 on CentOS 5. It compiled without problems and after installing, starting Amavisd-new and flushing the queue al deferred mail was delivered without any problems.
For me, the problem was solved, and I had enough proof for my self to say, I assume file(1) was the problem. I agree, this is not enough for some developers to be convinced, and the problem file is easily found. But I believe the problem file should be somewhere in my Amavisd-new directory. But at this moment I don't have a test system to test the complete directory against file(1)
I hope I provide some of the information you wanted to see.
Regards, John
May 16 13:17:53 gw-0001 kernel: cmasasd invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0 May 16 13:17:53 gw-0001 kernel: [<c0457d36>] out_of_memory+0x72/0x1a4 May 16 13:17:53 gw-0001 kernel: [<c0459161>] __alloc_pages+0x201/0x282 May 16 13:17:53 gw-0001 kernel: [<c045a3bf>] __do_page_cache_readahead+0xc4/0x1c6 May 16 13:17:53 gw-0001 kernel: [<c0454a21>] sync_page+0x0/0x3b May 16 13:17:53 gw-0001 kernel: [<c044fbe5>] __delayacct_blkio_end+0x32/0x35 May 16 13:17:53 gw-0001 kernel: [<c0604c24>] __wait_on_bit_lock+0x4b/0x52 May 16 13:17:53 gw-0001 kernel: [<c04549a4>] __lock_page+0x51/0x57 May 16 13:17:53 gw-0001 kernel: [<c04572d8>] filemap_nopage+0x151/0x315 May 16 13:17:53 gw-0001 kernel: [<c045fda3>] __handle_mm_fault+0x172/0x87b May 16 13:17:53 gw-0001 kernel: [<c04b43cf>] do_msgrcv+0x241/0x290 May 16 13:17:53 gw-0001 kernel: [<c0606c2b>] do_page_fault+0x20a/0x4b8 May 16 13:17:53 gw-0001 kernel: [<c0606a21>] do_page_fault+0x0/0x4b8 May 16 13:17:53 gw-0001 kernel: [<c0405a71>] error_code+0x39/0x40 May 16 13:17:53 gw-0001 kernel: ======================= May 16 13:17:53 gw-0001 kernel: Mem-info: May 16 13:17:53 gw-0001 kernel: DMA per-cpu: May 16 13:17:53 gw-0001 kernel: cpu 0 hot: high 0, batch 1 used:0 May 16 13:17:53 gw-0001 kernel: cpu 0 cold: high 0, batch 1 used:0 May 16 13:17:53 gw-0001 kernel: cpu 1 hot: high 0, batch 1 used:0 May 16 13:17:53 gw-0001 kernel: cpu 1 cold: high 0, batch 1 used:0 May 16 13:17:53 gw-0001 kernel: cpu 2 hot: high 0, batch 1 used:0 May 16 13:17:53 gw-0001 kernel: cpu 2 cold: high 0, batch 1 used:0 May 16 13:17:53 gw-0001 kernel: cpu 3 hot: high 0, batch 1 used:0 May 16 13:17:54 gw-0001 kernel: cpu 3 cold: high 0, batch 1 used:0 May 16 13:17:54 gw-0001 kernel: DMA32 per-cpu: empty May 16 13:17:54 gw-0001 kernel: Normal per-cpu: May 16 13:17:54 gw-0001 kernel: cpu 0 hot: high 186, batch 31 used:161 May 16 13:17:54 gw-0001 kernel: cpu 0 cold: high 62, batch 15 used:14 May 16 13:17:54 gw-0001 kernel: cpu 1 hot: high 186, batch 31 used:44 May 16 13:17:54 gw-0001 kernel: cpu 1 cold: high 62, batch 15 used:47 May 16 13:17:54 gw-0001 kernel: cpu 2 hot: high 186, batch 31 used:70 May 16 13:17:54 gw-0001 kernel: cpu 2 cold: high 62, batch 15 used:15 May 16 13:17:54 gw-0001 kernel: cpu 3 hot: high 186, batch 31 used:99 May 16 13:17:54 gw-0001 kernel: cpu 3 cold: high 62, batch 15 used:29 May 16 13:17:54 gw-0001 kernel: HighMem per-cpu: May 16 13:17:54 gw-0001 kernel: cpu 0 hot: high 186, batch 31 used:29 May 16 13:17:54 gw-0001 kernel: cpu 0 cold: high 62, batch 15 used:42 May 16 13:17:54 gw-0001 kernel: cpu 1 hot: high 186, batch 31 used:26 May 16 13:17:54 gw-0001 kernel: cpu 1 cold: high 62, batch 15 used:53 May 16 13:17:54 gw-0001 kernel: cpu 2 hot: high 186, batch 31 used:23 May 16 13:17:54 gw-0001 kernel: cpu 2 cold: high 62, batch 15 used:56 May 16 13:17:54 gw-0001 kernel: cpu 3 hot: high 186, batch 31 used:168 May 16 13:17:54 gw-0001 kernel: cpu 3 cold: high 62, batch 15 used:42 May 16 13:17:54 gw-0001 kernel: Free pages: 47492kB (464kB HighMem) May 16 13:17:54 gw-0001 kernel: Active:248505 inactive:243082 dirty:0 writeback:0 unstable:0 free:11873 slab: May 16 13:17:54 gw-0001 kernel: Active:248505 inactive:243082 dirty:0 writeback:0 unstable:0 free:11873 slab: 4798 mapped-file:1037 mapped-anon:491633 pagetables:2940 May 16 13:17:54 gw-0001 kernel: DMA free:8160kB min:36kB low:44kB high:52kB active:2340kB inactive:1364kB pre sent:16384kB pages_scanned:6857 all_unreclaimable? yes May 16 13:17:54 gw-0001 kernel: lowmem_reserve[]: 0 0 880 2031 May 16 13:17:54 gw-0001 kernel: DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB p ages_scanned:0 all_unreclaimable? no May 16 13:17:54 gw-0001 kernel: lowmem_reserve[]: 0 0 880 2031 May 16 13:17:54 gw-0001 kernel: Normal free:38868kB min:2008kB low:2508kB high:3012kB active:397440kB inactiv e:409920kB present:901120kB pages_scanned:1894767 all_unreclaimable? yes May 16 13:17:54 gw-0001 kernel: lowmem_reserve[]: 0 0 0 9215 May 16 13:17:54 gw-0001 kernel: HighMem free:464kB min:512kB low:1168kB high:1828kB active:594240kB inactive: 561044kB present:1179624kB pages_scanned:4549828 all_unreclaimable? yes May 16 13:17:54 gw-0001 kernel: lowmem_reserve[]: 0 0 0 0 May 16 13:17:54 gw-0001 kernel: DMA: 0*4kB 0*8kB 126*16kB 60*32kB 10*64kB 0*128kB 2*256kB 0*512kB 1*1024kB 1* 2048kB 0*4096kB = 8160kB May 16 13:17:54 gw-0001 kernel: DMA32: empty May 16 13:17:54 gw-0001 kernel: Normal: 1*4kB 28*8kB 2059*16kB 120*32kB 1*64kB 0*128kB 1*256kB 1*512kB 1*1024 kB 0*2048kB 0*4096kB = 38868kB May 16 13:17:54 gw-0001 kernel: HighMem: 4*4kB 8*8kB 0*16kB 2*32kB 1*64kB 2*128kB 0*256kB 0*512kB 0*1024kB 0* 2048kB 0*4096kB = 464kB May 16 13:17:54 gw-0001 kernel: Swap cache: add 591115, delete 590893, find 5038951/5045583, race 0+568 May 16 13:17:54 gw-0001 kernel: Free swap = 0kB May 16 13:17:54 gw-0001 kernel: Total swap = 2097096kB May 16 13:17:54 gw-0001 kernel: Free swap: 0kB May 16 13:17:54 gw-0001 kernel: 524282 pages of RAM May 16 13:17:54 gw-0001 kernel: 294906 pages of HIGHMEM May 16 13:17:54 gw-0001 kernel: 5445 reserved pages May 16 13:17:54 gw-0001 kernel: 357914 pages shared May 16 13:17:54 gw-0001 kernel: 222 pages swap cached May 16 13:17:54 gw-0001 kernel: 0 pages dirty May 16 13:17:54 gw-0001 kernel: 0 pages writeback May 16 13:17:54 gw-0001 kernel: 1037 pages mapped May 16 13:17:54 gw-0001 kernel: 4798 pages slab May 16 13:17:54 gw-0001 kernel: 2940 pages pagetables May 16 13:17:54 gw-0001 kernel: Out of memory: Killed process 6054 (amavisd).
Version ussed: unane -a Linux gw-0001.xxx.yyy.com 2.6.18-53.1.14.el5 #1 SMP Wed Mar 5 11:36:49 EST 2008 i686 i686 i386 GNU/Linux
rpm -q postfix postfix-2.4.7-3.el5.mysql_pgsql.jk rpm -q amvisd-new amavisd-new-2.5.4-5.el5.jk rpm -q perl-spamassassin spamassassin-3.2.4-1.el5.rf rpm -q clamd clamd-0.93-2.el5.rf rpm -q file file-4.23-5.el5.jk (This is the backported version, it the problem version is 4.17-9.0.1