On Fri, May 12, 2017 at 12:21 PM, Brian Mathis
<brian.mathis+centos at betteradmin.com> wrote:
> On Fri, May 12, 2017 at 11:44 AM, Larry Martell <larry.martell at gmail.com>
> wrote:
>
>> On Thu, May 11, 2017 at 7:58 PM, Alexander Dalloz <ad+lists at uni-x.org>
>> wrote:
>> > On 11.05.2017 at 20:30, Larry Martell wrote:
>> >>
>> >> On Wed, May 10, 2017 at 3:19 PM, Larry Martell
>> >> <larry.martell at gmail.com> wrote:
>> >>>
>> >>> On Wed, May 10, 2017 at 3:07 PM, Jonathan Billings
>> >>> <billings at negate.org> wrote:
>> >>>>
>> >>>> On Wed, May 10, 2017 at 02:40:04PM -0400, Larry Martell wrote:
>> >>>>>
>> >>>>> I have a CentOS 7 system that I run a home grown python daemon on. I
>> >>>>> run this same daemon on many other systems without any incident. On
>> >>>>> this one system the daemon seems to die or be killed every day around
>> >>>>> 3:30am. There is nothing in its log or any system logs that tells me
>> >>>>> why it dies. However, in /var/log/messages every day I see something
>> >>>>> like this:
>> >>>>
>> >>>> How are you starting this daemon?
>> >>>
>> >>> I am using code something like this:
>> >>> https://gist.github.com/slor/5946334.
>> >>>
>> >>>> Can you check the journal? Perhaps
>> >>>> you'll see more useful information than what you see in the syslogs?
>> >>>
>> >>> Thanks, I will do that.
>> >>
>> >> Thank you for that suggestion. I was able to get someone to run
>> >> journalctl and send me the output, and it was very interesting.
>> >>
>> >> First, there is logging going on continuously during the time when
>> >> logging stops in /var/log/messages.
>> >>
>> >> Second, I see messages like this periodically:
>> >>
>> >> May 10 03:57:46 localhost.localdomain python[40222]: detected
>> >> unhandled Python exception in
>> >> '/usr/local/motor/motor/core/data/importer.py'
>> >> May 10 03:57:46 localhost.localdomain abrt-server[40277]: Only 0MiB is
>> >> available on /var/spool/abrt
>> >> May 10 03:57:46 localhost.localdomain python[40222]: error sending
>> >> data to ABRT daemon:
>> >>
>> >> This happens at various times of the day, and I do not think it is
>> >> related to the daemon crashing.
>> >>
>> >> But I did see one occurrence of this:
>> >>
>> >> May 09 03:49:35 localhost.localdomain python[14042]: detected
>> >> unhandled Python exception in
>> >> '/usr/local/motor/motor/core/data/importerd.py'
>> >> May 09 03:49:35 localhost.localdomain abrt-server[22714]: Only 0MiB is
>> >> available on /var/spool/abrt
>> >> May 09 03:49:35 localhost.localdomain python[14042]: error sending
>> >> data to ABRT daemon:
>> >>
>> >> And that is the daemon. But I only see that on this one day, and it
>> >> crashes every day.
>> >>
>> >> And I see this type of message frequently throughout the day, every
>> >> day:
>> >>
>> >> May 09 03:40:01 localhost.localdomain CROND[21447]: (motor) CMD
>> >> (python /usr/local/motor/motor/scripts/image_mover.py -v1 -d
>> >> /usr/local/motor/data > ~/last_image_move_log.txt)
>> >> May 09 03:40:01 localhost.localdomain abrt-server[21453]: Only 0MiB is
>> >> available on /var/spool/abrt
>> >> May 09 03:40:01 localhost.localdomain python[21402]: error sending
>> >> data to ABRT daemon:
>> >> May 09 03:40:01 localhost.localdomain postfix/postdrop[21456]:
>> >> warning: uid=0: No space left on device
>> >> May 09 03:40:01 localhost.localdomain postfix/sendmail[21455]: fatal:
>> >> root(0): queue file write error
>> >> May 09 03:40:01 localhost.localdomain crond[2630]: postdrop: warning:
>> >> uid=0: No space left on device
>> >> May 09 03:40:01 localhost.localdomain crond[2630]: sendmail: fatal:
>> >> root(0): queue file write error
>> >> May 09 03:40:01 localhost.localdomain CROND[21443]: (root) MAIL
>> >> (mailed 67 bytes of output but got status 0x004b)
>> >>
>> >> So it seems there is a space issue.
>> >>
>> >> And finally, coinciding with the time that the logging resumes in
>> >> /var/log/messages, I see this every day:
>> >>
>> >> May 10 03:57:57 localhost.localdomain
>> >> run-parts(/etc/cron.daily)[40293]: finished mlocate
>> >> May 10 03:57:57 localhost.localdomain anacron[33406]: Job `cron.daily'
>> >> terminated (mailing output)
>> >> May 10 03:57:57 localhost.localdomain anacron[33406]: Normal exit
>> >> (1 job run)
>> >>
>> >> I need to get my remote hands to get me more info.
>> >
>> > df -hT; df -i
>> >
>> > There is no space left on a vital partition / logical volume.
>> >
>> > "Only 0MiB is available on /var/spool/abrt"
>> >
>> > "postdrop: warning: uid=0: No space left on device"
>>
>> Yes, I saw that and assumed it was the root cause of the issue. But
>> when I had my guy over in Japan check, he found that / had 15G (of 50)
>> free. We did some more investigating, and it seems that when mlocate
>> runs, the disk fills up and bad things happen. Why is that happening?
>> Is it because 15G of free space is not enough? We ran a du, and most of
>> the space on / was used by /var/log (11G) and /var/lib/mlocate (20G).
>> Can I disable mlocate and get rid of that large dir?
>
> 20GB for mlocate is absolutely (and suspiciously) huge. You must have
> millions and millions of files on that server. If not, then there's
> something wrong with mlocate. 'mlocate' can be removed unless you're
> using it; nothing else in CentOS really depends on it. You'd need to
> verify that no one else is using it on that server first.

Yes, we do have millions and millions of files (give or take a million or
so). I am going to disable mlocate and remove its db and see if this fixes
the issues we've been having.
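For the archives, this is roughly what I plan to run. It's just a sketch
assuming the stock CentOS 7 mlocate layout (daily cron job in
/etc/cron.daily/mlocate, database in /var/lib/mlocate/mlocate.db); adjust
if your paths differ:

    # option 1: keep the package but stop the daily updatedb run
    chmod -x /etc/cron.daily/mlocate

    # reclaim the ~20G the database is using
    rm -f /var/lib/mlocate/mlocate.db

    # option 2: remove the package outright if nothing needs locate(1)
    yum remove mlocate

    # afterwards, confirm the space actually came back
    df -hT /

If locate turns out to be needed later, the package can always be
reinstalled and the db rebuilt with updatedb.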