On Fri, May 12, 2017 at 12:21 PM, Brian Mathis
<brian.mathis+centos at betteradmin.com> wrote:
> On Fri, May 12, 2017 at 11:44 AM, Larry Martell <larry.martell at gmail.com>
> wrote:
>
>> On Thu, May 11, 2017 at 7:58 PM, Alexander Dalloz <ad+lists at uni-x.org>
>> wrote:
>> > On 11.05.2017 at 20:30, Larry Martell wrote:
>> >>
>> >> On Wed, May 10, 2017 at 3:19 PM, Larry Martell
>> >> <larry.martell at gmail.com> wrote:
>> >>>
>> >>> On Wed, May 10, 2017 at 3:07 PM, Jonathan Billings
>> >>> <billings at negate.org> wrote:
>> >>>>
>> >>>> On Wed, May 10, 2017 at 02:40:04PM -0400, Larry Martell wrote:
>> >>>>>
>> >>>>> I have a CentOS 7 system that I run a home grown python daemon on. I
>> >>>>> run this same daemon on many other systems without any incident. On
>> >>>>> this one system the daemon seems to die or be killed every day around
>> >>>>> 3:30am. There is nothing in its log or any system logs that tells me
>> >>>>> why it dies. However, in /var/log/messages every day I see something
>> >>>>> like this:
>> >>>>
>> >>>> How are you starting this daemon?
>> >>>
>> >>> I am using code something like this:
>> >>> https://gist.github.com/slor/5946334.
>> >>>
>> >>>> Can you check the journal? Perhaps
>> >>>> you'll see more useful information than what you see in the syslogs?
>> >>>
>> >>> Thanks, I will do that.
>> >>
>> >> Thank you for that suggestion. I was able to get someone to run
>> >> journalctl and send me the output, and it was very interesting.
>> >>
>> >> First, there is logging going on continuously during the time when
>> >> logging stops in /var/log/messages.
>> >>
>> >> Second, I see messages like this periodically:
>> >>
>> >> May 10 03:57:46 localhost.localdomain python[40222]: detected
>> >> unhandled Python exception in
>> >> '/usr/local/motor/motor/core/data/importer.py'
>> >> May 10 03:57:46 localhost.localdomain abrt-server[40277]: Only 0MiB is
>> >> available on /var/spool/abrt
>> >> May 10 03:57:46 localhost.localdomain python[40222]: error sending
>> >> data to ABRT daemon:
>> >>
>> >> This happens at various times of the day, and I do not think it is
>> >> related to the daemon crashing.
>> >>
>> >> But I did see one occurrence of this:
>> >>
>> >> May 09 03:49:35 localhost.localdomain python[14042]: detected
>> >> unhandled Python exception in
>> >> '/usr/local/motor/motor/core/data/importerd.py'
>> >> May 09 03:49:35 localhost.localdomain abrt-server[22714]: Only 0MiB is
>> >> available on /var/spool/abrt
>> >> May 09 03:49:35 localhost.localdomain python[14042]: error sending
>> >> data to ABRT daemon:
>> >>
>> >> And that is the daemon. But I only see that on this one day, and it
>> >> crashes every day.
>> >>
>> >> And I see this type of message frequently throughout the day, every
>> >> day:
>> >>
>> >> May 09 03:40:01 localhost.localdomain CROND[21447]: (motor) CMD
>> >> (python /usr/local/motor/motor/scripts/image_mover.py -v1 -d
>> >> /usr/local/motor/data > ~/last_image_move_log.txt)
>> >> May 09 03:40:01 localhost.localdomain abrt-server[21453]: Only 0MiB is
>> >> available on /var/spool/abrt
>> >> May 09 03:40:01 localhost.localdomain python[21402]: error sending
>> >> data to ABRT daemon:
>> >> May 09 03:40:01 localhost.localdomain postfix/postdrop[21456]:
>> >> warning: uid=0: No space left on device
>> >> May 09 03:40:01 localhost.localdomain postfix/sendmail[21455]: fatal:
>> >> root(0): queue file write error
>> >> May 09 03:40:01 localhost.localdomain crond[2630]: postdrop: warning:
>> >> uid=0: No space left on device
>> >> May 09 03:40:01 localhost.localdomain crond[2630]: sendmail: fatal:
>> >> root(0): queue file write error
>> >> May 09 03:40:01 localhost.localdomain CROND[21443]: (root) MAIL
>> >> (mailed 67 bytes of output but got status 0x004b)
>> >>
>> >> So it seems there is a space issue.
>> >>
>> >> And finally, coinciding with the time that the logging resumes in
>> >> /var/log/messages, I see this every day:
>> >>
>> >> May 10 03:57:57 localhost.localdomain
>> >> run-parts(/etc/cron.daily)[40293]: finished mlocate
>> >> May 10 03:57:57 localhost.localdomain anacron[33406]: Job `cron.daily'
>> >> terminated (mailing output)
>> >> May 10 03:57:57 localhost.localdomain anacron[33406]: Normal exit
>> >> (1 job run)
>> >>
>> >> I need to get my remote hands to get me more info.
>> >
>> > df -hT; df -i
>> >
>> > There is no space left on a vital partition / logical volume.
>> >
>> > "Only 0MiB is available on /var/spool/abrt"
>> >
>> > "postdrop: warning: uid=0: No space left on device"
>>
>> Yes, I saw that and assumed it was the root cause of the issue. But
>> when I had my guy over in Japan check, he found that / had 15G (of 50)
>> free. We did some more investigating, and it seems that when mlocate
>> runs, the disk fills up and bad things happen. Why is that happening?
>> Is it because 15G of free space is not enough? We ran a du, and most of
>> the space on / was used by /var/log (11G) and /var/lib/mlocate (20G).
>> Can I disable mlocate and get rid of that large dir?
>
> 20GB for mlocate is absolutely (and suspiciously) huge. You must have
> millions and millions of files on that server. If not, then there's
> something wrong with mlocate. 'mlocate' can be removed unless you're
> using it; nothing else in CentOS really depends on it. You'd need to
> verify that no one else is using it on that server first.

Yes, we do have millions and millions of files (give or take a million or
so). I am going to disable mlocate and remove its db and see if this fixes
the issues we've been having.
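For the archives, this is roughly what I plan to run. It's just a sketch
assuming the stock CentOS 7 mlocate layout (daily cron job in
/etc/cron.daily/mlocate, database in /var/lib/mlocate/mlocate.db); adjust
if your paths differ:

    # option 1: keep the package but stop the daily updatedb run
    chmod -x /etc/cron.daily/mlocate

    # reclaim the ~20G the database is using
    rm -f /var/lib/mlocate/mlocate.db

    # option 2: remove the package outright if nothing needs locate(1)
    yum remove mlocate

    # afterwards, confirm the space actually came back
    df -hT /

If locate turns out to be needed later, the package can always be
reinstalled and the db rebuilt with updatedb.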