I have a CentOS 7 system that I run a home-grown Python daemon on. I run this same daemon on many other systems without incident. On this one system the daemon seems to die or be killed every day around 3:30am. There is nothing in its log or any system logs that tells me why it dies. However, in /var/log/messages every day I see something like this:
May 10 03:35:58 localhost pure-ftpd: (tool@xx.xx.xx.xx) [NOTICE] /usrMay 10 03:57:57 localhost pure-ftpd: (tool@xx.xx.xx.xx) [NOTICE] /usr/local/motor/data//B31/today/Images/CP0982436.00C_T6PH8_M0-R_1/T6PH8_M0-RN_TT6PH8_M0-R_P4_M1_FX-1_FY4_RR1_TR1_Ver3.jpg uploaded (90666 bytes, 8322.34KB/sec)
Notice how the message that was being printed at 03:35:58 is truncated mid-message and the next message is at 03:57:57 – there are always messages like this around the time the daemon dies. What could be going on here?
On 10.05.2017 at 20:40, Larry Martell wrote:
Have you checked /etc/cron.daily/ for a cron job to restart the pure-ftpd service? /etc/cron.daily is called from /etc/anacrontab with some randomness so that the execution time varies.
Alexander
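For reference, on a stock CentOS 7 install the relevant part of /etc/anacrontab looks roughly like this (exact values can vary per system): cron.daily gets a 5-minute base delay plus up to RANDOM_DELAY extra minutes, and jobs may not start before 3am, which lines up with the ~3:30-4:00 window in question.

    # /etc/anacrontab (excerpt, approximate stock CentOS 7 values)
    RANDOM_DELAY=45          # up to 45 random minutes added to each job's delay
    START_HOURS_RANGE=3-22   # jobs only start between 03:00 and 22:00
    #period  delay  job-identifier  command
    1        5      cron.daily      nice run-parts /etc/cron.daily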
On Wed, May 10, 2017 at 3:05 PM, Alexander Dalloz ad+lists@uni-x.org wrote:
Have you checked /etc/cron.daily/ for a cron job to restart the pure-ftpd service? /etc/cron.daily is called from /etc/anacrontab with some randomness so that the execution time varies.
I will check that, but the pure-ftpd service is not my daemon that is getting killed. I was thinking the sudden stop in the logging and the subsequent time gap would be some sort of clue as to what is happening at that time.
This is a system I do not have remote access to and it's in Japan so I have to ask someone there to do things for me. Such a pain.
On Wed, May 10, 2017 at 3:11 PM, Larry Martell larry.martell@gmail.com wrote:
/etc/cron.daily has:
0yum-daily.cron logrotate man-db.cron mlocate
On Wed, May 10, 2017 at 02:40:04PM -0400, Larry Martell wrote:
How are you starting this daemon? Can you check the journal? Perhaps you'll see more useful information than what you see in the syslogs?
On Wed, May 10, 2017 at 3:07 PM, Jonathan Billings billings@negate.org wrote:
How are you starting this daemon?
I am using code something like this: https://gist.github.com/slor/5946334.
Can you check the journal? Perhaps you'll see more useful information than what you see in the syslogs?
Thanks, I will do that.
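The linked gist implements the classic Unix double-fork daemonization recipe. A minimal sketch of that general pattern (condensed and hypothetical; this is not the actual code from the gist or from the daemon in question):

    import os
    import sys

    def daemonize(stdout_path='daemon.out', stderr_path='daemon.err'):
        # First fork: parent exits, child is reparented to init.
        if os.fork() > 0:
            sys.exit(0)
        os.setsid()              # new session, no controlling terminal
        # Second fork: session leader exits so the daemon can never
        # reacquire a controlling terminal.
        if os.fork() > 0:
            sys.exit(0)
        os.chdir('/')            # avoid pinning any mount point
        os.umask(0)
        sys.stdout.flush()
        sys.stderr.flush()
        # Redirect stdio; the file names here are illustrative only.
        devnull = open(os.devnull, 'rb', 0)
        out = open(stdout_path, 'ab', 0)
        err = open(stderr_path, 'ab', 0)
        os.dup2(devnull.fileno(), sys.stdin.fileno())
        os.dup2(out.fileno(), sys.stdout.fileno())
        os.dup2(err.fileno(), sys.stderr.fileno())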
On Wed, May 10, 2017 at 03:19:05PM -0400, Larry Martell wrote:
Oh, I was assuming that since you called it a daemon, it was something started automatically at boot, rather than something you started manually that then daemonized itself.
If it is dying, are you logged in when it's running? Does it require you to be connected while it runs? Maybe you should run it in a screen/tmux session instead of daemonizing, so you can see stderr/stdout?
On Wed, May 10, 2017 at 4:18 PM, Jonathan Billings billings@negate.org wrote:
On all the other systems this is deployed on, it is started automatically at boot and controlled with service or systemctl. On this system, because I do not have remote access and do not have a very competent set of hands over there, it was never set up to run as a service. So a user manually starts it by running the daemon script.
This had been running on the machine like that for 8 months. The nightly crash just started on April 21 and has happened every day since.
If it is dying, are you logged in when it's running?
I am not 100% sure, but I think the user su's, starts the daemon process, and then exits the su session.
Does it require you to be connected while it runs?
No. The underlying script polls dirs looking for files, and loads data into a db.
Maybe you should run it in a screen/tmux instead of daemonizing, so you can see stderr/stdout?
stdout and stderr are written to a file, and when the daemon gets killed or dies or whatever happens to it, the output in that file abruptly stops, just like what I showed from the messages file.
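For context, the daemon's job as described above is a poll-and-import loop. A minimal sketch of that kind of loop (all names hypothetical; the real importer's logic is not shown in this thread):

    import os
    import time

    def import_file(path):
        # Hypothetical stand-in for the real importer, which loads the
        # file's contents into a database.
        print('importing %s' % path)

    def poll_and_import(watch_dir, interval=60):
        seen = set()
        while True:
            # Scan the whole tree for files we have not imported yet.
            for root, _dirs, files in os.walk(watch_dir):
                for name in files:
                    path = os.path.join(root, name)
                    if path not in seen:
                        import_file(path)
                        seen.add(path)
            time.sleep(interval)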
On Wed, May 10, 2017 at 04:35:07PM -0400, Larry Martell wrote:
Could you enable core dumps (ulimit -c unlimited) and then, after it has died, check for core files?
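Since the daemon is started by hand rather than by systemd, the limit could also be raised from inside the script itself. A short sketch using only the standard library (note that on CentOS 7 the ABRT hook may intercept core dumps via kernel.core_pattern, so cores can end up under /var/spool/abrt rather than the working directory):

    import resource

    # Equivalent to `ulimit -c unlimited` for this process and its
    # children; must run before the crash you want to capture. If the
    # hard limit is already lower than this, only root can raise it.
    resource.setrlimit(resource.RLIMIT_CORE,
                       (resource.RLIM_INFINITY, resource.RLIM_INFINITY))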
On Wed, May 10, 2017 at 3:19 PM, Larry Martell larry.martell@gmail.com wrote:
Thank you for that suggestion. I was able to get someone to run journalctl and send me the output and it was very interesting.
First, there is logging going on continuously during the time when logging stops in /var/log/messages.
Second, I see messages like this periodically:
May 10 03:57:46 localhost.localdomain python[40222]: detected unhandled Python exception in '/usr/local/motor/motor/core/data/importer.py'
May 10 03:57:46 localhost.localdomain abrt-server[40277]: Only 0MiB is available on /var/spool/abrt
May 10 03:57:46 localhost.localdomain python[40222]: error sending data to ABRT daemon:
This happens at various times of the day, and I do not think it is related to the daemon crashing.
But I did see one occurrence of this:
May 09 03:49:35 localhost.localdomain python[14042]: detected unhandled Python exception in '/usr/local/motor/motor/core/data/importerd.py'
May 09 03:49:35 localhost.localdomain abrt-server[22714]: Only 0MiB is available on /var/spool/abrt
May 09 03:49:35 localhost.localdomain python[14042]: error sending data to ABRT daemon:
And that is the daemon. But I only see that on this one day, and it crashes every day.
And I see this type of message frequently throughout the day, every day:
May 09 03:40:01 localhost.localdomain CROND[21447]: (motor) CMD (python /usr/local/motor/motor/scripts/image_mover.py -v1 -d /usr/local/motor/data > ~/last_image_move_log.txt)
May 09 03:40:01 localhost.localdomain abrt-server[21453]: Only 0MiB is available on /var/spool/abrt
May 09 03:40:01 localhost.localdomain python[21402]: error sending data to ABRT daemon:
May 09 03:40:01 localhost.localdomain postfix/postdrop[21456]: warning: uid=0: No space left on device
May 09 03:40:01 localhost.localdomain postfix/sendmail[21455]: fatal: root(0): queue file write error
May 09 03:40:01 localhost.localdomain crond[2630]: postdrop: warning: uid=0: No space left on device
May 09 03:40:01 localhost.localdomain crond[2630]: sendmail: fatal: root(0): queue file write error
May 09 03:40:01 localhost.localdomain CROND[21443]: (root) MAIL (mailed 67 bytes of output but got status 0x004b)
So it seems there is a space issue.
And finally, coinciding with the time that the logging resumes in /var/log/messages, I see this every day:
May 10 03:57:57 localhost.localdomain run-parts(/etc/cron.daily)[40293]: finished mlocate
May 10 03:57:57 localhost.localdomain anacron[33406]: Job `cron.daily' terminated (mailing output)
May 10 03:57:57 localhost.localdomain anacron[33406]: Normal exit (1 job run)
I need to get my remote hands to get me more info.
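A daemon on a box like this could also watch for this condition itself. A small sketch using only the standard library (path and thresholds are arbitrary examples):

    import os

    def free_mib(path='/'):
        # Free space in MiB on the filesystem holding `path`.
        st = os.statvfs(path)
        return st.f_bavail * st.f_frsize // (1024 * 1024)

    def free_inodes(path='/'):
        # Free inodes -- the other way a filesystem "fills up",
        # which is what `df -i` reports.
        return os.statvfs(path).f_favail

    # Example: warn before the filesystem is completely exhausted.
    if free_mib('/') < 1024 or free_inodes('/') < 10000:
        print('warning: / is nearly full')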
On 11.05.2017 at 20:30, Larry Martell wrote:
df -hT; df -i
There is no space left on a vital partition / logical volume.
"Only 0MiB is available on /var/spool/abrt"
"postdrop: warning: uid=0: No space left on device"
Alexander
On Thu, May 11, 2017 at 7:58 PM, Alexander Dalloz ad+lists@uni-x.org wrote:
Yes, I saw that and assumed it was the root cause of the issue. But when I had my guy over in Japan check, he found that / had 15G (of 50) free. We did some more investigating, and it seems that when mlocate runs, the disk fills up and bad things happen. Why is that happening? Is it because 15G of free space is not enough? We ran a du, and most of the space on / was used by /var/log (11G) and /var/lib/mlocate (20G). Can I disable mlocate and get rid of that large dir?
On Fri, May 12, 2017 at 11:44 AM, Larry Martell larry.martell@gmail.com wrote:
20GB for mlocate is absolutely (and suspiciously) huge. You must have millions and millions of files on that server; if not, then there's something wrong with mlocate. mlocate can be removed unless you're using it; nothing else in CentOS really depends on it. You'd need to evaluate whether someone else is using it on that server.
~ Brian Mathis @orev
On Fri, May 12, 2017 at 12:21 PM, Brian Mathis brian.mathis+centos@betteradmin.com wrote:
Yes, we do have millions and millions of files (give or take a million or so). I am going to disable mlocate and remove the db and see if this fixes the issues we've been having.
On Sat, 13 May 2017, Larry Martell wrote:
An alternative to disabling mlocate would be to add the paths with huge numbers of files to PRUNEPATHS in /etc/updatedb.conf.
--- Ian
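For example, /etc/updatedb.conf could be extended along these lines (excerpt only; the stock values vary by install, and /usr/local/motor/data is just the data tree from this thread, standing in for wherever the bulk of the files actually live):

    # /etc/updatedb.conf (excerpt)
    PRUNE_BIND_MOUNTS = "yes"
    PRUNEPATHS = "/afs /media /mnt /net /tmp /var/tmp /usr/local/motor/data"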
On Fri, May 12, 2017 at 12:52 PM, Larry Martell larry.martell@gmail.com wrote:
Since disabling mlocate and removing its db, we have not had the daemon crash.
Thanks to all who helped!