[CentOS] Find reason for heavy load

Thu Dec 31 20:13:28 UTC 2009

Just a concluding update for anybody who might be interested :)

My apologies for blaming spamassassin in the earlier email. It was
taking so long because of the real problem.

Apparently the odd exim processes related to the mail loop problem I
nipped were still the culprit. I had overlooked the fact that by the
time I caught on to the mail loop issue, there were already hundreds
if not thousands of bounced and rebounced messages in the queue.
Those exim processes were trying to deliver the messages that had
been queued before I terminated the mail loop.
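
For anyone wanting to check a queue for the same symptom, here is a
minimal Python sketch that counts queued bounces per recipient. It
assumes the usual `exim -bp` listing layout (a header line with age,
size, message ID and sender, followed by indented recipient lines,
with bounces carrying the null sender `<>`). The sample listing, the
addresses and the function name are made up for illustration and are
not taken from the server discussed here.

```python
import re
from collections import Counter

# Header line of an `exim -bp` entry: age, size, message ID, <sender>.
# This layout is an assumption based on typical output; adjust as needed.
HEADER = re.compile(r"^\s*(\d+[mhdw])\s+(\S+)\s+(\S+-\S+-\S+)\s+<(.*?)>")


def count_bounces_by_recipient(listing):
    """Count null-sender (<>) messages per still-undelivered recipient."""
    counts = Counter()
    in_bounce = False
    for line in listing.splitlines():
        header = HEADER.match(line)
        if header:
            # An empty sender between the angle brackets marks a bounce.
            in_bounce = header.group(4) == ""
            continue
        recipient = line.strip()
        if not recipient:
            in_bounce = False  # a blank line ends the current entry
            continue
        # Recipients already delivered are marked with a leading "D".
        if in_bounce and not recipient.startswith("D "):
            counts[recipient] += 1
    return counts


# Hypothetical sample in the assumed `exim -bp` format.
SAMPLE = """\
 4d  2.1K 1NQaaa-0001Ab-Cd <>
          holiday.user@example.com

 3d  2.3K 1NQbbb-0002Ef-Gh <>
          holiday.user@example.com

25m  1.2K 1NQccc-0003Ij-Kl <someone@example.org>
          other.user@example.com
"""

if __name__ == "__main__":
    for recipient, n in count_bounces_by_recipient(SAMPLE).most_common():
        print(n, recipient)
```

On a live system the listing would come from the saved output of
`exim -bp` rather than from the sample string. A single mailbox
dominating the count is a hint that a loop has been filling the queue.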

This would have been OK if not for the other problem. The user
apparently has been on a 2-week vacation since the 15th and thought it
was a good idea to enlarge his mailbox before leaving. So there was
this 2.5GB mailbox chock-full of both valid & rebounced mails, plus
the queue of more rebounced mails. Every time exim attempted to
deliver the queued mails to the user's account, the quota system
rejected them. The CPU load was probably due to this never-ending
ping-pong match between exim and the quota.

Yeah, I can't help but feel it was such a noob mistake to let that
develop without realizing it.

Now that I've purged the queue of those bounced messages and done some
other housekeeping for that user, server load has finally gone back to
the expected sub-1.0 levels so I can finally go and enjoy my holiday :)
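
In case it helps someone facing a similar cleanup, below is a rough
Python sketch of how such a purge could be prepared. It picks out the
IDs of queued bounces (null sender `<>`) still addressed to one
mailbox and prints `exim -Mrm` commands for review instead of running
them. The `exim -bp` layout, the sample listing, the address and the
name `bounce_ids_for` are assumptions for illustration, not
necessarily the procedure used on this server.

```python
import re

# Assumed `exim -bp` header layout: age, size, message ID, <sender>.
HEADER = re.compile(r"^\s*\d+[mhdw]\s+\S+\s+(\S+-\S+-\S+)\s+<(.*?)>")


def bounce_ids_for(listing, mailbox):
    """Return IDs of null-sender messages still queued for `mailbox`."""
    ids = []
    current_id = None  # ID of the bounce entry being read, if any
    for line in listing.splitlines():
        header = HEADER.match(line)
        if header:
            msg_id, sender = header.groups()
            # Only track entries whose sender is empty, i.e. bounces.
            current_id = msg_id if sender == "" else None
        elif current_id and line.strip() == mailbox:
            ids.append(current_id)
            current_id = None  # record each message only once
    return ids


# Hypothetical sample in the assumed `exim -bp` format.
SAMPLE = """\
 4d  2.1K 1NQaaa-0001Ab-Cd <>
          holiday.user@example.com

 3d  2.3K 1NQbbb-0002Ef-Gh <>
          holiday.user@example.com

25m  1.2K 1NQccc-0003Ij-Kl <someone@example.org>
          other.user@example.com
"""

if __name__ == "__main__":
    for msg_id in bounce_ids_for(SAMPLE, "holiday.user@example.com"):
        # Print instead of executing so the list can be checked first.
        print("exim -Mrm", msg_id)
```

Printing the commands first leaves room to spot valid mail caught by
mistake before anything is actually removed from the queue.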

On 1/1/10, Noob Centos Admin <centos.admin at gmail.com> wrote:
> I initiated the services shutdown as previously planned and once the
> external services like exim, dovecot, httpd and crond (because it
> kept restarting these services) were stopped, the problem child stood
> out like a sore thumb.
>
> There were two exim instances that didn't go away despite service exim
> stop. Once I killed these two PIDs, the load average started dropping
> rapidly. After a minute or so, the server went back to a happy 0.2~0.3
> load and disk activity became almost negligible.
>
> I think these (orphaned? zombied?) exim instances were related to a
> mail loop problem I discovered earlier today, where one of my clients
> on holiday had a full mailbox and kept bouncing mails from a contact
> whose site was suspended. Although I terminated that loop, it seemed
> that exim had gotten those two instances stuck in limbo, sucking up
> processing power and hitting the disk somewhere unknown since they
> weren't showing up in my exim logs.
>
> After observing for a while, I brought the services back and once
> exim got started, my load went back to 2.x ~ 3.x. Unfortunately while
> I was typing this email, I realized it didn't stop there. I'm up to
> 4.x ~ 5.x load level by now.
>
> So the application that is the cause of the load is definitely exim,
> more specifically I think it's spam assassin because now that the mail
> log entries are coming in slowly, I can read the spamd details and
> mails are taking between 3 and 8 seconds to be checked.
>
> Thanks again to everybody who has offered suggestions and advice, and
> do have a Happy New Year :)
>
>
> On 1/1/10, Noob Centos Admin <centos.admin at gmail.com> wrote:
>> Hi,
>>
>>> I do not know about now but I had to unload the modules in question.
>>> Just clearing the rules was not enough to ensure that the netfilter
>>> connection tracking modules were not using any cpu at all.
>>
>> Thanks for pointing this out. Being a noob admin as my pseudonym
>> states, I'd assumed stopping apf and restarting iptables was
>> sufficient. I'll have to look up unloading modules later.
>>
>>> /me shrugs. When I was the mta admin at Outblaze Ltd. (messaging
>>> business now owned by IBM and called Lotus Live) spammers always ensured
>>> I got called. All they do is just press the big red button (aka start
>>> the script/system) and then go and play while I would have to deal with
>>> whatever was started.
>>
>> Based on the almost precise timing of around 9:30 to 5:30 India time,
>> I'm inclined to think in my case it wasn't so much a spammer pressing
>> a red button but a compromised machine in an office starting up when
>> the user gets into the office and shutting down when he knocks off on
>> time at 5:30 :D
>>
>>> I remember only one occasion when the spams were
>>> launched but neutralized very soon because they were pushing a website
>>> and I found a sample real early and so the anti spam system could just
>>> dump the spams and knock out accounts being used to send the crap.
>>
>> Could I ask how I would knock out the accounts sending the crap if
>> they are not within my systems?
>>
>>> First, try rmmod'ing the netfilter modules after you have cleared away
>>> the state related rules to make sure that you are only using static
>>> rules in netfilter...unless you have done that already..
>>
>> I think I'm only using static rules because after I restart iptables,
>> I would then do a service iptables status to check my rules were in,
>> and that list was very short compared to when APF was active.
>>
>> The good news is, I think I've fixed the big problem after doing my
>> shutdown tests and returned to the original problem.
>>
>