[CentOS] Find reason for heavy load

Thu Dec 31 21:14:58 UTC 2009

On 2009-12-31 15:13, Noob Centos Admin wrote:
> Just a concluding update for anybody who might be interested :)
>
> My apologies for blaming spamassassin in the earlier email. It was
> taking so long because of the real problem.
>
> Apparently the odd exim processes that were related to the mail loop
> problem I nipped were still the culprit. I had overlooked the fact that
> by the time I caught on to the mail loop issue, there were already
> hundreds if not thousands of bounced and rebounced messages in the
> queue. Those exim processes were trying to deliver the messages that
> had been queued before I terminated the mail loop.
>
> This would have been OK if not for the other problem. The user
> apparently went on a 2-week vacation on the 15th and thought it was a
> good idea to enlarge his mailbox before doing so. So there was this
> 2.5GB mailbox chock-full of both valid and rebounced mails, plus the
> queue of more rebounced mails. So every time exim attempted to add the
> queued mails to the user's account, the quota system rejected them. The
> CPU load was probably due to this never-ending ping-pong match between
> exim and the quota.
>
> Yeah, I can't help but feel this must be such a noob mistake allowing
> that to develop without realizing it.
>
> Now that I've purged the queue of those bounced messages and done other
> housekeeping for that user, server load has finally gone back to the
> expected sub-1.0 levels so I can finally go and enjoy my holiday :)
>
>
>
> On 1/1/10, Noob Centos Admin <centos.admin at gmail.com> wrote:
>> I initiated the services shutdown as previously planned, and once the
>> external services like exim, dovecot, httpd and crond (because it kept
>> restarting these services) were stopped, the problem child stood out
>> like a sore thumb.
>>
>> There were two exim instances that didn't go away despite service exim
>> stop. Once I killed these two PIDs, the load average started dropping
>> rapidly. After a minute or so, the server went back to a happy 0.2~0.3
>> load and disk activity became almost negligible.
>>
>> I think these (orphaned? zombied?) exim instances were related to a
>> mail loop problem I discovered earlier today, where one of my clients
>> on holiday had a full mailbox and kept bouncing mails from a contact
>> whose site was suspended. Although I terminated that loop, it seemed
>> that exim had gotten those two instances stuck in limbo, sucking up
>> processing power and hitting the disk somewhere unknown, since they
>> weren't showing up in my exim logs.
>>
>> After observing for a while, I brought the services back, and once exim
>> got started, my load went back to 2.x ~ 3.x. Unfortunately, while I was
>> typing this email, I realized it didn't stop there. I'm up to 4.x ~ 5.x
>> load level by now.
>>
>> So the application that is the cause of the load is definitely exim.
>> More specifically, I think it's spamassassin, because now that the mail
>> log entries are scrolling slowly, I can read the spamd details, and
>> mails are taking between 3 and 8 seconds to be checked.
>>
>> Thanks again to everybody who has offered suggestions and advice, and
>> do have a Happy New Year :)
>>
>>
>> On 1/1/10, Noob Centos Admin <centos.admin at gmail.com> wrote:
>>> Hi,
>>>
>>>> I do not know about now but I had to unload the modules in question.
>>>> Just clearing the rules was not enough to ensure that the netfilter
>>>> connection tracking modules were not using any cpu at all.
>>>
>>> Thanks for pointing this out. Being a noob admin as my pseudonym
>>> states, I'd assumed stopping apf and restarting iptables was
>>> sufficient. I'll have to look up unloading modules later.
>>>
>>>> /me shrugs. When I was the mta admin at Outblaze Ltd. (messaging
>>>> business now owned by IBM and called Lotus Live) spammers always ensured
>>>> I got called. All they do is just press the big red button (aka start
>>>> the script/system) and then go and play while I would have to deal with
>>>> whatever was started.
>>>
>>> Based on the almost precise timing of around 9:30 to 5:30 India time,
>>> I'm inclined to think in my case it wasn't so much a spammer pressing
>>> a red button but a compromised machine in an office that starts up when
>>> the user gets into the office and is shut down when he knocks off on
>>> time at 5:30 :D
>>>
>>>> I remember only one occasion when the spams were
>>>> launched but neutralized very soon because they were pushing a website
>>>> and I found a sample real early and so the anti spam system could just
>>>> dump the spams and knock out accounts being used to send the crap.
>>>
>>> Could I ask how I would knock out the accounts sending the crap if
>>> they are not within my systems?
>>>
>>>> First, try rmmod'ing the netfilter modules after you have cleared away
>>>> the state related rules to make sure that you are only using static
>>>> rules in netfilter...unless you have done that already..
>>>
>>> I think I'm only using static rules, because after I restarted
>>> iptables, I did a service iptables status to check that my rules were
>>> in, and that list was very short compared to when APF was active.
>>>
>>> The good news is, I think I've fixed the big problem after doing my
>>> shutdown tests and returned to the original problem.
>>>
>>

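On the point quoted above about clearing rules not being the same as unloading the connection tracking modules: a short rule list from service iptables status does not show whether those modules are still loaded. One rough way to check is to look through /proc/modules. The Python below is only a sketch; the helper name conntrack_modules is made up for illustration, and matching on the substring "conntrack" is an assumption, since module names differ between kernel versions.

```python
def conntrack_modules(proc_modules_text):
    """Return the names of loaded modules whose name contains 'conntrack'.

    proc_modules_text is the content of /proc/modules, where the first
    whitespace-separated field on each line is the module name.
    """
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and "conntrack" in fields[0]:
            names.append(fields[0])
    return names


if __name__ == "__main__":
    try:
        with open("/proc/modules") as fh:
            loaded = conntrack_modules(fh.read())
    except OSError:
        # Not on Linux, or /proc is not readable.
        loaded = []
    print(loaded if loaded else "no conntrack modules found")
```

An empty result after restarting iptables would suggest the tracking modules are really gone. Anything listed would still need an rmmod, as suggested in the quoted advice.
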
If you (and other people) have learned something, it was worth it :).

Ugo
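
For anyone who finds this thread later with the same symptoms: the backlog of bounces described above would have been visible in the queue listing well before the load climbed. Below is a rough Python sketch that summarizes the text printed by exim -bp. The function name summarize_queue and the regular expression are my own assumptions about the usual listing layout (age, size, message id, sender in angle brackets, an optional "*** frozen ***" marker, then indented recipient lines), so check it against your own exim version before trusting the numbers.

```python
import re
from collections import Counter

# First line of a queue entry: age, size, message id, <sender>,
# optionally followed by the frozen marker. Bounces have an empty sender.
HEADER = re.compile(
    r"^\s*(\d+[mhd])\s+(\S+)\s+(\S+)\s+<([^>]*)>(\s+\*\*\* frozen \*\*\*)?\s*$"
)


def summarize_queue(listing):
    """Count messages, bounces, frozen messages and pending recipients."""
    total = bounces = frozen = 0
    recipients = Counter()
    in_entry = False
    for line in listing.splitlines():
        match = HEADER.match(line)
        if match:
            total += 1
            in_entry = True
            if match.group(4) == "":
                bounces += 1
            if match.group(5):
                frozen += 1
        elif in_entry and line.strip():
            address = line.strip()
            # Lines starting with "D " are recipients already delivered.
            if not address.startswith("D "):
                recipients[address] += 1
    return {
        "total": total,
        "bounces": bounces,
        "frozen": frozen,
        "recipients": recipients,
    }
```

Feeding it the saved output of exim -bp and looking at recipients.most_common(5) should show whether one mailbox accounts for most of the queue, which was the case here. For a quick total, exim -bpc prints just the message count.
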