[CentOS] strange system outage

Thu May 11 23:58:06 UTC 2017
Alexander Dalloz <ad+lists at uni-x.org>

On 11.05.2017 at 20:30, Larry Martell wrote:
> On Wed, May 10, 2017 at 3:19 PM, Larry Martell <larry.martell at gmail.com> wrote:
>> On Wed, May 10, 2017 at 3:07 PM, Jonathan Billings <billings at negate.org> wrote:
>>> On Wed, May 10, 2017 at 02:40:04PM -0400, Larry Martell wrote:
>>>> I have a CentOS 7 system that I run a home grown python daemon on. I
>>>> run this same daemon on many other systems without any incident. On
>>>> this one system the daemon seems to die or be killed every day around
>>>> 3:30am. There is nothing in its log or any system logs that tell me
>>>> why it dies. However in /var/log/messages every day I see something
>>>> like this:
>>>
>>> How are you starting this daemon?
>>
>> I am using code something like this: https://gist.github.com/slor/5946334.
>>
>>> Can you check the journal?  Perhaps
>>> you'll see more useful information than what you see in the syslogs?
>>
>> Thanks, I will do that.
> 
> Thank you for that suggestion. I was able to get someone to run
> journalctl and send me the output and it was very interesting.
> 
> First, there is logging going on continuously during the time when
> logging stops in /var/log/messages.
> 
> Second, I see messages like this periodically:
> 
> May 10 03:57:46 localhost.localdomain python[40222]: detected
> unhandled Python exception in
> '/usr/local/motor/motor/core/data/importer.py'
> May 10 03:57:46 localhost.localdomain abrt-server[40277]: Only 0MiB is
> available on /var/spool/abrt
> May 10 03:57:46 localhost.localdomain python[40222]: error sending
> data to ABRT daemon:
> 
> This happens at various times of the day, and I do not think it is
> related to the daemon crashing.
> 
> But I did see one occurrence of this:
> 
> May 09 03:49:35 localhost.localdomain python[14042]: detected
> unhandled Python exception in
> '/usr/local/motor/motor/core/data/importerd.py'
> May 09 03:49:35 localhost.localdomain abrt-server[22714]: Only 0MiB is
> available on /var/spool/abrt
> May 09 03:49:35 localhost.localdomain python[14042]: error sending
> data to ABRT daemon:
> 
> And that is the daemon. But I only see that on this one day, and it
> crashes every day.
> 
> And I see this type of message frequently throughout the day, every day:
> 
> May 09 03:40:01 localhost.localdomain CROND[21447]: (motor) CMD
> (python /usr/local/motor/motor/scripts/image_mover.py -v1 -d
> /usr/local/motor/data > ~/last_image_move_log.txt)
> May 09 03:40:01 localhost.localdomain abrt-server[21453]: Only 0MiB is
> available on /var/spool/abrt
> May 09 03:40:01 localhost.localdomain python[21402]: error sending
> data to ABRT daemon:
> May 09 03:40:01 localhost.localdomain postfix/postdrop[21456]:
> warning: uid=0: No space left on device
> May 09 03:40:01 localhost.localdomain postfix/sendmail[21455]: fatal:
> root(0): queue file write error
> May 09 03:40:01 localhost.localdomain crond[2630]: postdrop: warning:
> uid=0: No space left on device
> May 09 03:40:01 localhost.localdomain crond[2630]: sendmail: fatal:
> root(0): queue file write error
> May 09 03:40:01 localhost.localdomain CROND[21443]: (root) MAIL
> (mailed 67 bytes of output but got status 0x004b)
> 
> So it seems there is a space issue.
> 
> And finally, coinciding with the time that the logging resumes in
> /var/log/messages I see this every day at that time:
> 
> May 10 03:57:57 localhost.localdomain
> run-parts(/etc/cron.daily)[40293]: finished mlocate
> May 10 03:57:57 localhost.localdomain anacron[33406]: Job `cron.daily'
> terminated (mailing output)
> May 10 03:57:57 localhost.localdomain anacron[33406]: Normal exit (1 job run)
> 
> I need to get my remote hands to get me more info.

Check both free space and free inodes:

df -hT; df -i

There is no space left on a vital partition / logical volume. Your own
journal output says so:

"Only 0MiB is available on /var/spool/abrt"

"postdrop: warning: uid=0: No space left on device"
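
Since the daemon is written in Python, the same check could also be done from inside it, so that a full filesystem shows up in the daemon's own log. What follows is only a rough sketch, not something taken from the daemon in question: the function name check_mount, the thresholds and the list of paths are made up for illustration. It relies on shutil.disk_usage and os.statvfs from the standard library (Python 3, Unix only).

```python
import os
import shutil


def check_mount(path, min_free_bytes=100 * 1024 * 1024, min_free_inodes=1000):
    """Report free space and free inodes for the filesystem holding path.

    check_mount and both thresholds are hypothetical; adjust to taste.
    """
    usage = shutil.disk_usage(path)  # total, used, free (in bytes)
    st = os.statvfs(path)            # f_favail: inodes free for non-root
    return {
        "path": path,
        "free_bytes": usage.free,
        "free_inodes": st.f_favail,
        "low_space": usage.free < min_free_bytes,
        # Some filesystems report 0 total inodes; skip the inode check there.
        "low_inodes": st.f_files > 0 and st.f_favail < min_free_inodes,
    }


if __name__ == "__main__":
    # Example paths only; the spool directories are the ones from the journal.
    for p in ("/", "/var", "/var/spool/abrt", "/var/spool/postfix"):
        if os.path.exists(p):
            print(check_mount(p))
```

This roughly mirrors what df -hT and df -i report, and it does not say what is filling the disk; for that something like du on the affected mount point is still needed.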

Alexander