[CentOS] strange system outage

Thu May 18 16:40:22 UTC 2017
Larry Martell <larry.martell at gmail.com>

On Fri, May 12, 2017 at 12:52 PM, Larry Martell <larry.martell at gmail.com> wrote:
> On Fri, May 12, 2017 at 12:21 PM, Brian Mathis
> <brian.mathis+centos at betteradmin.com> wrote:
>> On Fri, May 12, 2017 at 11:44 AM, Larry Martell <larry.martell at gmail.com>
>> wrote:
>>
>>> On Thu, May 11, 2017 at 7:58 PM, Alexander Dalloz <ad+lists at uni-x.org>
>>> wrote:
>>> > On 11.05.2017 at 20:30, Larry Martell wrote:
>>> >>
>>> >> On Wed, May 10, 2017 at 3:19 PM, Larry Martell <larry.martell at gmail.com>
>>> >> wrote:
>>> >>>
>>> >>> On Wed, May 10, 2017 at 3:07 PM, Jonathan Billings <billings at negate.org>
>>> >>> wrote:
>>> >>>>
>>> >>>> On Wed, May 10, 2017 at 02:40:04PM -0400, Larry Martell wrote:
>>> >>>>>
>>> >>>>> I have a CentOS 7 system that I run a home grown python daemon on. I
>>> >>>>> run this same daemon on many other systems without any incident. On
>>> >>>>> this one system the daemon seems to die or be killed every day around
>>> >>>>> 3:30am. There is nothing in its log or any system logs that tells me
>>> >>>>> why it dies. However in /var/log/messages every day I see something
>>> >>>>> like this:
>>> >>>>
>>> >>>>
>>> >>>> How are you starting this daemon?
>>> >>>
>>> >>>
>>> >>> I am using code something like this:
>>> >>> https://gist.github.com/slor/5946334.
>>> >>>
>>> >>>> Can you check the journal?  Perhaps
>>> >>>> you'll see more useful information than what you see in the syslogs?
>>> >>>
>>> >>>
>>> >>> Thanks, I will do that.
>>> >>
>>> >>
>>> >> Thank you for that suggestion. I was able to get someone to run
>>> >> journalctl and send me the output and it was very interesting.
>>> >>
>>> >> First, there is logging going on continuously during the time when
>>> >> logging stops in /var/log/messages.
>>> >>
>>> >> Second, I see messages like this periodically:
>>> >>
>>> >> May 10 03:57:46 localhost.localdomain python[40222]: detected
>>> >> unhandled Python exception in
>>> >> '/usr/local/motor/motor/core/data/importer.py'
>>> >> May 10 03:57:46 localhost.localdomain abrt-server[40277]: Only 0MiB is
>>> >> available on /var/spool/abrt
>>> >> May 10 03:57:46 localhost.localdomain python[40222]: error sending
>>> >> data to ABRT daemon:
>>> >>
>>> >> This happens at various times of the day, and I do not think it is
>>> >> related to the daemon crashing.
>>> >>
>>> >> But I did see one occurrence of this:
>>> >>
>>> >> May 09 03:49:35 localhost.localdomain python[14042]: detected
>>> >> unhandled Python exception in
>>> >> '/usr/local/motor/motor/core/data/importerd.py'
>>> >> May 09 03:49:35 localhost.localdomain abrt-server[22714]: Only 0MiB is
>>> >> available on /var/spool/abrt
>>> >> May 09 03:49:35 localhost.localdomain python[14042]: error sending
>>> >> data to ABRT daemon:
>>> >>
>>> >> And that is the daemon. But I only see that on this one day, and it
>>> >> crashes every day.
>>> >>
>>> >> And I see this type of message frequently throughout the day, every day:
>>> >>
>>> >> May 09 03:40:01 localhost.localdomain CROND[21447]: (motor) CMD
>>> >> (python /usr/local/motor/motor/scripts/image_mover.py -v1 -d
>>> >> /usr/local/motor/data > ~/last_image_move_log.txt)
>>> >> May 09 03:40:01 localhost.localdomain abrt-server[21453]: Only 0MiB is
>>> >> available on /var/spool/abrt
>>> >> May 09 03:40:01 localhost.localdomain python[21402]: error sending
>>> >> data to ABRT daemon:
>>> >> May 09 03:40:01 localhost.localdomain postfix/postdrop[21456]:
>>> >> warning: uid=0: No space left on device
>>> >> May 09 03:40:01 localhost.localdomain postfix/sendmail[21455]: fatal:
>>> >> root(0): queue file write error
>>> >> May 09 03:40:01 localhost.localdomain crond[2630]: postdrop: warning:
>>> >> uid=0: No space left on device
>>> >> May 09 03:40:01 localhost.localdomain crond[2630]: sendmail: fatal:
>>> >> root(0): queue file write error
>>> >> May 09 03:40:01 localhost.localdomain CROND[21443]: (root) MAIL
>>> >> (mailed 67 bytes of output but got status 0x004b)
>>> >>
>>> >> So it seems there is a space issue.
>>> >>
>>> >> And finally, coinciding with the time that the logging resumes in
>>> >> /var/log/messages I see this every day at that time:
>>> >>
>>> >> May 10 03:57:57 localhost.localdomain
>>> >> run-parts(/etc/cron.daily)[40293]: finished mlocate
>>> >> May 10 03:57:57 localhost.localdomain anacron[33406]: Job `cron.daily'
>>> >> terminated (mailing output)
>>> >> May 10 03:57:57 localhost.localdomain anacron[33406]: Normal exit (1 job
>>> >> run)
>>> >>
>>> >> I need to get my remote hands to get me more info.
>>> >
>>> >
>>> > df -hT; df -i
>>> >
>>> > There is no space left on a vital partition / logical volume.
>>> >
>>> > "Only 0MiB is available on /var/spool/abrt"
>>> >
>>> > "postdrop: warning: uid=0: No space left on device"
>>>
>>> Yes, I saw that and assumed that was the root cause of the issue. But
>>> when I had my guy over in Japan check he found that / had 15G (of 50)
>>> free. We did some more investigating and it seems that when mlocate
>>> runs the disk fills up and bad things happen. Why is that happening?
>>> Is it because 15G of free space is not enough? We ran a du and most of
>>> the space on / was used by /var/log (11G), and /var/lib/mlocate (20G).
>>> Can I disable mlocate and get rid of that large dir?
>>>
>>
>>
>> 20GB for mlocate is absolutely (and suspiciously) huge.  You must have
>> millions and millions of files on that server.  If not, then there's
>> something wrong with mlocate.  'mlocate' can be removed unless you're using
>> it; there's nothing else really dependent on it in CentOS.  You'd need to
>> really evaluate if someone else is using it on that server.
>
> Yes, we do have millions and millions of files (give or take a million
> or so). I am going to disable mlocate and remove the db and see if
> this fixes the issues we've been having.

Since disabling mlocate and removing its db we have not had the daemon crash.
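
For anyone who hits something similar: the confusing part was that the disk
looked fine when checked during the day and only filled up while the nightly
job was running. Below is a minimal sketch, not what we actually ran, of a
Python check roughly equivalent to the "df -hT; df -i" suggested earlier. The
function names and the thresholds are made up for illustration; the paths are
the ones that came up in this thread.

```python
import os
import shutil


def disk_report(path="/"):
    """Return free-space and free-inode figures for the filesystem holding path."""
    usage = shutil.disk_usage(path)  # total, used, free in bytes
    st = os.statvfs(path)            # inode counts (Unix only)
    return {
        "path": path,
        "total_gib": usage.total / 2**30,
        "free_gib": usage.free / 2**30,
        "inodes_total": st.f_files,
        "inodes_free": st.f_favail,
    }


def low_on_space(report, min_free_gib=1.0, min_free_inodes=1000):
    """True if free space or free inodes fall below the (hypothetical) thresholds."""
    if report["free_gib"] < min_free_gib:
        return True
    # Some filesystems report 0 total inodes; skip the inode check there.
    if report["inodes_total"] > 0 and report["inodes_free"] < min_free_inodes:
        return True
    return False


if __name__ == "__main__":
    # Paths taken from the thread; adjust for your own layout.
    for p in ("/", "/var/spool/abrt", "/var/lib/mlocate"):
        if os.path.exists(p):
            r = disk_report(p)
            print(r, "LOW" if low_on_space(r) else "ok")
```

Run from cron every few minutes around the time cron.daily fires, something
like this would show free space dropping while the nightly job runs, which a
single check during the day will not.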

Thanks to all who helped!