[CentOS] strange system outage

Fri May 12 16:21:36 UTC 2017
Brian Mathis <brian.mathis+centos at betteradmin.com>

On Fri, May 12, 2017 at 11:44 AM, Larry Martell <larry.martell at gmail.com>
wrote:

> On Thu, May 11, 2017 at 7:58 PM, Alexander Dalloz <ad+lists at uni-x.org>
> wrote:
> > Am 11.05.2017 um 20:30 schrieb Larry Martell:
> >>
> >> On Wed, May 10, 2017 at 3:19 PM, Larry Martell <larry.martell at gmail.com>
> >> wrote:
> >>>
> >>> On Wed, May 10, 2017 at 3:07 PM, Jonathan Billings <billings at negate.org>
> >>> wrote:
> >>>>
> >>>> On Wed, May 10, 2017 at 02:40:04PM -0400, Larry Martell wrote:
> >>>>>
> >>>>> I have a CentOS 7 system that I run a home grown python daemon on. I
> >>>>> run this same daemon on many other systems without any incident. On
> >>>>> this one system the daemon seems to die or be killed every day around
> >>>>> 3:30am. There is nothing in its log or any system logs that tells me
> >>>>> why it dies. However in /var/log/messages every day I see something
> >>>>> like this:
> >>>>
> >>>>
> >>>> How are you starting this daemon?
> >>>
> >>>
> >>> I am using code something like this:
> >>> https://gist.github.com/slor/5946334.
> >>>
> >>>> Can you check the journal?  Perhaps
> >>>> you'll see more useful information than what you see in the syslogs?
> >>>
> >>>
> >>> Thanks, I will do that.
> >>
> >>
> >> Thank you for that suggestion. I was able to get someone to run
> >> journalctl and send me the output and it was very interesting.
> >>
> >> First, there is logging going on continuously during the time when
> >> logging stops in /var/log/messages.
> >>
> >> Second, I see messages like this periodically:
> >>
> >> May 10 03:57:46 localhost.localdomain python[40222]: detected
> >> unhandled Python exception in
> >> '/usr/local/motor/motor/core/data/importer.py'
> >> May 10 03:57:46 localhost.localdomain abrt-server[40277]: Only 0MiB is
> >> available on /var/spool/abrt
> >> May 10 03:57:46 localhost.localdomain python[40222]: error sending
> >> data to ABRT daemon:
> >>
> >> This happens at various times of the day, and I do not think it is
> >> related to the daemon crashing.
> >>
> >> But I did see one occurrence of this:
> >>
> >> May 09 03:49:35 localhost.localdomain python[14042]: detected
> >> unhandled Python exception in
> >> '/usr/local/motor/motor/core/data/importerd.py'
> >> May 09 03:49:35 localhost.localdomain abrt-server[22714]: Only 0MiB is
> >> available on /var/spool/abrt
> >> May 09 03:49:35 localhost.localdomain python[14042]: error sending
> >> data to ABRT daemon:
> >>
> >> And that is the daemon. But I only see that on this one day, and it
> >> crashes every day.
> >>
> >> And I see this type of message frequently throughout the day, every day:
> >>
> >> May 09 03:40:01 localhost.localdomain CROND[21447]: (motor) CMD
> >> (python /usr/local/motor/motor/scripts/image_mover.py -v1 -d
> >> /usr/local/motor/data > ~/last_image_move_log.txt)
> >> May 09 03:40:01 localhost.localdomain abrt-server[21453]: Only 0MiB is
> >> available on /var/spool/abrt
> >> May 09 03:40:01 localhost.localdomain python[21402]: error sending
> >> data to ABRT daemon:
> >> May 09 03:40:01 localhost.localdomain postfix/postdrop[21456]:
> >> warning: uid=0: No space left on device
> >> May 09 03:40:01 localhost.localdomain postfix/sendmail[21455]: fatal:
> >> root(0): queue file write error
> >> May 09 03:40:01 localhost.localdomain crond[2630]: postdrop: warning:
> >> uid=0: No space left on device
> >> May 09 03:40:01 localhost.localdomain crond[2630]: sendmail: fatal:
> >> root(0): queue file write error
> >> May 09 03:40:01 localhost.localdomain CROND[21443]: (root) MAIL
> >> (mailed 67 bytes of output but got status 0x004b)
> >>
> >> So it seems there is a space issue.
> >>
> >> And finally, coinciding with the time that the logging resumes in
> >> /var/log/messages I see this every day at that time:
> >>
> >> May 10 03:57:57 localhost.localdomain
> >> run-parts(/etc/cron.daily)[40293]: finished mlocate
> >> May 10 03:57:57 localhost.localdomain anacron[33406]: Job `cron.daily'
> >> terminated (mailing output)
> >> May 10 03:57:57 localhost.localdomain anacron[33406]: Normal exit (1 job
> >> run)
> >>
> >> I need to get my remote hands to get me more info.
> >
> >
> > df -hT; df -i
> >
> > There is no space left on a vital partition / logical volume.
> >
> > "Only 0MiB is available on /var/spool/abrt"
> >
> > "postdrop: warning: uid=0: No space left on device"
>
> Yes, I saw that and assumed that was the root cause of the issue. But
> when I had my guy over in Japan check he found that / had 15G (of 50)
> free. We did some more investigating and it seems that when mlocate
> runs the disk fills up and bad things happen. Why is that happening?
> Is it because 15G of free space is not enough? We ran a du and most of
> the space on / was used by /var/log (11G), and /var/lib/mlocate (20G).
> Can I disable mlocate and get rid of that large dir?
>


20GB for /var/lib/mlocate is absolutely (and suspiciously) huge.  Either you
have many millions of files on that server, or something is wrong with
mlocate.  You can remove 'mlocate' if nobody is using it; nothing else in
CentOS really depends on it.  Before you do, check whether anyone else on
that server relies on it.
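
To tell which of those two cases applies, it may help to compare how many
files actually live on / with how big the mlocate directory has grown.  The
following is only a rough Python sketch, not something I have run on that
server: the function names (tree_size_bytes, count_entries) are made up for
illustration, and the two paths are simply the ones mentioned in this thread.

```python
import os
import shutil
import stat


def tree_size_bytes(root):
    """Sum the apparent sizes of regular files under root.

    Symlinks are not followed; unreadable entries are skipped.
    """
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is not readable
            if stat.S_ISREG(st.st_mode):
                total += st.st_size
    return total


def count_entries(root):
    """Count files and directories below root, staying on root's filesystem."""
    root_dev = os.lstat(root).st_dev
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        kept = []
        for d in dirnames:
            try:
                if os.lstat(os.path.join(dirpath, d)).st_dev == root_dev:
                    kept.append(d)
            except OSError:
                pass  # skip directories we cannot stat
        # Prune in place so os.walk does not descend into other mounts.
        dirnames[:] = kept
        count += len(kept) + len(filenames)
    return count


if __name__ == "__main__":
    # Paths taken from this thread; adjust for the system being checked.
    usage = shutil.disk_usage("/")
    print("free on /: %.1f GiB of %.1f GiB"
          % (usage.free / 2.0**30, usage.total / 2.0**30))
    print("/var/lib/mlocate: %.1f GiB"
          % (tree_size_bytes("/var/lib/mlocate") / 2.0**30))
    print("entries on /: %d" % count_entries("/"))
```

It needs to run as root to see everything.  If the entry count comes back
far short of "millions and millions", the size of /var/lib/mlocate points at
mlocate itself rather than at the number of files.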


~ Brian Mathis
@orev