Hi all!
I'm stuck on something really bizarre that is happening to a product I "own" at work. It's a C program, built on CentOS, that runs on CentOS or RHEL, has been in circulation since the early 00's, and is in use at hundreds of sites.
Recently, at multiple customer sites, it has started just going away. No core file (yes, ulimit is configured), nothing in any of its (several) log files. It's just gone.
Running it under strace until it dies reveals that every thread has been given a SIGKILL.
How does one figure out who delivered a SIGKILL? For other, non-fatal signals it is possible to glean the PID of the sending process in a signal handler, but obviously you can't do that for SIGKILL because the app doesn't survive the signal.
I'm grasping at straws here, and am open to almost any kind of suggestion that can be followed-up (as compared to "beats me" which is where I am now).
I'm even wondering if systemd has something to do with it.
Thanks in advance!
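The signal-handler technique mentioned above for catchable signals can be sketched roughly as follows. This is only an illustration, not code from the product: the names record_sender, watch_signal and last_sender are made up for the example. It assumes sigaction(2) with SA_SIGINFO, which hands the handler a siginfo_t whose si_pid field holds the sender's PID for signals sent with kill(2).

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Written by the handler, read by normal code. sig_atomic_t is the type
 * that is safe to store from a signal handler; on Linux it is wide
 * enough to hold a pid_t. */
static volatile sig_atomic_t last_sender_pid = 0;

/* Three-argument handler form, selected by SA_SIGINFO below. */
static void record_sender(int signo, siginfo_t *info, void *context)
{
    (void)signo;
    (void)context;
    /* si_pid is filled in for signals sent with kill(2) or sigqueue(3). */
    last_sender_pid = (sig_atomic_t)info->si_pid;
}

/* Install the recording handler for one catchable signal.
 * Returns 0 on success, -1 on failure (as sigaction does). */
int watch_signal(int signo)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = record_sender;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* PID of whoever sent the most recently caught signal, or 0 if none yet. */
pid_t last_sender(void)
{
    return (pid_t)last_sender_pid;
}
```

A program would call watch_signal(SIGTERM) (or similar) at startup and log last_sender() from its main loop. The same call with SIGKILL fails, because the kernel does not allow a handler for SIGKILL, which is why this approach cannot answer the question above.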
Try checking your /var/log/messages for OOM killer log lines. If your machine is running low on memory, the OOM killer will start killing programs with high memory usage.
Grant
________________________________________
From: CentOS centos-bounces@centos.org on behalf of Fred Smith fredex@fcshome.stoneham.ma.us
Sent: Tuesday, 6 August 2019 10:57 AM
To: centos@centos.org
Subject: [CentOS] another bizarre thing...
On Tue, Aug 06, 2019 at 01:54:56AM +0000, Grant Street wrote:
Try checking your /var/log/messages for OOM killer log lines. If your machine is running low on memory the oom killer will start killing high memory usage programs.
Grant
We have watched top while it runs and there's no evidence of a memory shortage.
"has been in circulation since the early 00's" I assume it is not the same binary since '00?
SIGKILL usually comes from the kernel. Is SELinux enabled? Does the application start "automatically", or is it started by a user?
Ron
On Aug 5, 2019, at 6:57 PM, Fred Smith fredex@fcshome.stoneham.ma.us wrote:
no core file (yes, ulimit is configured)
That’s nowhere near sufficient. To restore classic core file dumps on CentOS 7, you must:
1. Remove Red Hat’s ABRT system, which wants to catch all of this and handle it directly. Say something like “sudo yum remove abrt*”
2. Override the default sysctl telling where core dumps land by writing this file, /etc/sysctl.d/10-core.conf:
    kernel.core_pattern = /tmp/core-%e-%p
    kernel.core_uses_pid = 1
    fs.suid_dumpable = 2
Then apply those settings with “sudo sysctl --system”.
I don’t remember what the default is, which this overrides, but I definitely didn’t want it.
You can choose any pattern you like, just remember what permissions the service runs under, because that’s the permission needed by the process that actually dumps the core to make the file hit the disk. That’s why I chose /tmp in this example: anyone can write there.
3. Raise the limits by writing the following to /etc/security/limits.d/10-core.conf:
    * hard core unlimited
    * soft core unlimited
If this is what you meant by “ulimit,” then great. I suspect you actually meant “ulimit -c unlimited”, though, and I believe that until you do the above, the ulimit shell builtin can have no effect. You have to log out and back in to make this take effect.
Once the above is done, “ulimit -c unlimited” can take effect, but it’s of no value at all in conjunction with systemd services, for example, since those don’t run under a standard shell, so your .bash_profile and such aren’t even exec’d.
4. If your program is launched via systemd, then you must edit /etc/systemd/system.conf and set
DefaultLimitCORE=infinity
then say “sudo systemctl daemon-reexec”
Case matters; “Core” won’t work. Ask me how I know. :)
5. If you have a systemd unit file for your service, you have to set a related value in there as well:
LimitCORE=infinity
You need both because #4 sets the system-wide cap, while this sets the per-service value, which can go no higher than the system cap.
6. Restart the service to apply the above two changes.
Yes, it really is that difficult to enable classic core dumps on CentOS 7. You’re welcome. :)
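For a program you build yourself, the soft-limit part of the recipe above can also be done from inside the process instead of through “ulimit -c unlimited” in a wrapper script. The following is a minimal sketch under that assumption; raise_core_limit is a made-up name, and it only covers the per-process limit. The core_pattern, ABRT and systemd settings above still decide whether and where a core actually lands, and none of this produces a core for SIGKILL.

```c
#include <sys/resource.h>

/* Raise this process's soft core-size limit to its hard limit.
 * An unprivileged process cannot go above the hard limit, so the hard
 * limit itself still has to be set by limits.d or systemd.
 * Returns 0 on success, -1 on failure (errno is set by the failing call). */
int raise_core_limit(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0)
        return -1;

    lim.rlim_cur = lim.rlim_max;   /* soft limit := hard limit */
    return setrlimit(RLIMIT_CORE, &lim);
}
```

Calling raise_core_limit() early in main() keeps the change confined to that one program, much as the wrapper-script approach does.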
On Tue, 2019-08-06 at 05:27 -0600, Warren Young wrote:
That’s nowhere near sufficient. To restore classic core file dumps on CentOS 7, you must:
I was under the impression that a SIGKILL doesn't trigger a core dump anyway. It just kills the process.
P.
On Aug 6, 2019, at 5:35 AM, Pete Biggs pete@biggs.org.uk wrote:
I was under the impression that a SIGKILL doesn't trigger a core dump anyway. It just kills the process.
True; you need a signal whose default action is to dump core, such as SIGABRT, to force a core to drop.
I posted that because if all he did was set the shell’s ulimit value, the lack of core files proves nothing, because there’s half a dozen other things that could be preventing them from dropping.
Wow, thanks for the detailed recipe!
How did we deserve this when it was so easy in the past :-)
On Tue, Aug 06, 2019 at 05:27:54AM -0600, Warren Young wrote:
On Aug 5, 2019, at 6:57 PM, Fred Smith fredex@fcshome.stoneham.ma.us wrote:
no core file (yes, ulimit is configured)
Yeah, I meant "ulimit -c unlimited" is in effect.
I had no idea systemd had made such a drastic change. Or is it that someone at RH decided to make it (nearly) impossible to do? I fail to see how it is beneficial to anyone to make it so hard to get core dump files.
But thanks for the details!
Fred
On Aug 6, 2019, at 7:59 AM, Fred Smith fredex@fcshome.stoneham.ma.us wrote:
yeah, I meant "ulimit -c unlimited" is in effect.
That only affects the shell it’s set for, which isn’t generally important for a service, since we no longer start services via shell scripts in the systemd world.
I had no idea systemd had made such a drastic change.
This isn’t a systemd change, it’s a *system* change. The only reason systemd is involved is that it also has its own defaults, just as your shell does, overridden by the ulimit command. Steps 1-3 remove the system limits, then 4 & 5 remove the systemd limits under that, which can affect your service, if it’s being started via systemd.
or is it that someone at RH decided to make it (nearly) impossible to do? I fail to see how it is beneficial to anyone to make it so hard to get core dump files.
Core dumps are a security risk. They’re memory images of running processes. If you configure your server like I give in my recipe, every process that drops core will create a world-readable file in /tmp showing that process’s memory state, which means you can recover everything it was doing at the time of the crash.
So, if you can find a way to make, say, PAM or sshd drop core, you’ll get live login details in debuggable form, available to anyone who can log into that box.
You definitely want core dumps off by default.
Making core dumps enabled by default is about as sensible as enabling rsh by default.
https://en.wikipedia.org/wiki/Remote_Shell
We stopped doing that on production servers about 20-30 years ago, for more or less the same reason.
On Tue, Aug 06, 2019 at 03:18:06PM -0600, Warren Young wrote:
Making core dumps enabled by default is about as sensible as enabling rsh by default.
Oh, of course. Duh!
What we've always done with this program is to put "ulimit -c unlimited" in the script that sets its environment and then starts the program itself. That minimizes the attack surface.
Setting up as you described earlier, is there a way to allow only a single program to drop core?
On Aug 6, 2019, at 8:48 PM, Fred Smith fredex@fcshome.stoneham.ma.us wrote:
Setting up as you described earlier, is there a way to allow only a single program to drop core?
Of course.
The * in the limits.d file is a “domain” value you can adjust to suit:
https://www.thegeekdiary.com/understanding-etc-security-limits-conf-file-to-set-ulimit/
You’d have to read the systemd docs to figure out the defaults for LimitCORE, but I suspect you don’t get cores until you set this on a per-service basis.
You can also adjust the sysctl pattern path to put cores somewhere secure. That’s the normal use of absolute paths: put the cores into a dropbox directory that only root can read but anyone can write to.
Also, I should point out that my first step, removing ABRT, is a heavy-handed method. Maybe what you *actually* want to do is learn to cooperate with ABRT rather than rip it out entirely.
On Tue, Aug 06, 2019 at 09:02:37PM -0600, Warren Young wrote:
The * in the limits.d file is a “domain” value you can adjust to suit:
https://www.thegeekdiary.com/understanding-etc-security-limits-conf-file-to-set-ulimit/
Also, I should point out that my first step, removing ABRT, is a heavy-handed method. Maybe what you *actually* want to do is learn to cooperate with ABRT rather than rip it out entirely.
How about "simply" disabling and stopping it?
Fred Smith wrote:
How does one figure out who delivered a SIGKILL? For other, non-fatal signals it is possible to glean the PID of the sending process in a signal handler, but obviously you can't do that for SIGKILL because the app doesn't survive the signal.
I had an issue a few years ago where 'something' was killing processes. I found it by writing a simple LD_PRELOAD hack that intercepted kill(2) and logged what it was doing via syslog before doing the actual kill, and used /etc/ld.so.preload to get it loaded by every process ...
James Pearson
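An interposer of the kind described above might look roughly like this. It is a guess at the general shape, not James's actual code. One deliberate difference from the usual LD_PRELOAD idiom: instead of looking up the real kill with dlsym(RTLD_NEXT, ...), this sketch forwards the request with the raw syscall(2) interface, which avoids any chance of recursing into the wrapper. The library and file names in the note below are hypothetical.

```c
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <syslog.h>
#include <unistd.h>

/* Replacement for libc's kill(): log who is asking for what, then hand
 * the request to the kernel unchanged. */
int kill(pid_t pid, int sig)
{
    /* Record the caller's PID and UID, the target and the signal number.
     * No openlog() call here, so the host program's own syslog settings
     * are left alone. */
    syslog(LOG_USER | LOG_NOTICE,
           "kill() called by pid %ld uid %ld: target pid %ld, signal %d",
           (long)getpid(), (long)getuid(), (long)pid, sig);

    /* Forward via the raw system call so this wrapper is never re-entered. */
    return (int)syscall(SYS_kill, pid, sig);
}
```

Built as a shared object (for example "gcc -shared -fPIC -o killwatch.so killwatch.c") and listed in /etc/ld.so.preload, it is loaded into every dynamically linked process. It only sees callers that go through libc's kill(): glibc's raise(), abort() and pthread_kill() use tgkill instead, and signals generated by the kernel itself never pass through it.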
On Tue, Aug 06, 2019 at 03:49:29PM +0000, James Pearson wrote:
I had an issue a few years ago where 'something' was killing processes. I found it by writing a simple LD_PRELOAD hack that intercepted kill(2) and logged what it was doing via syslog before doing the actual kill, and used /etc/ld.so.preload to get it loaded by every process ...
James:
After posting my original mail, I found this URL:
https://www.ibm.com/developerworks/community/blogs/aimsupport/entry/Finding_...
which shows a very simple recipe for programming SystemTap to report SIGKILLs, the UID that sends them, and the target process. We've asked the customer who is helping troubleshoot to implement that and get back to us with the result.
I suspect systemd has something to do with it, but I have absolutely no evidence, just a nagging feeling that since it has its little fingers in all the pies, it could be doing anything and I'd have no way of knowing. :(
I try not to be one of the systemd bashers, but I seem to be losing that battle.
Fred
On Mon, Aug 05, 2019 at 08:57:45PM -0400, Fred Smith wrote:
How does one figure out who delivered a SIGKILL? For other, non-fatal signals it is possible to glean the PID of the sending process in a signal handler, but obviously you can't do that for SIGKILL because the app doesn't survive the signal.
OK, more information.
Found a recipe to cause systemtap to emit a line of text identifying the sender of the SIGKILL.
    probe signal.send {
      if (sig_name == "SIGKILL")
        printf("%s was sent to %s (pid:%d) by %s uid:%d\n", sig_name, pid_name, sig_pid, execname(), uid())
    }
Unfortunately, it says the program is killing itself:
    SIGKILL was sent to myprog (pid:12269) by myprog uid:1000
So... now I'm wondering how one figures that out. Nowhere in my source code does it explicitly raise any signal, much less SIGKILL. So there must be some underlying library or system call or something doing it.
On Wed, 7 Aug 2019 13:38:54 -0400 Fred Smith wrote:
So,... now I'm wondering how one figures that out.
Since it's your program you have the source code.
printf is your friend.
Start adding printf statements (to console and/or to a file at your option) with status reports ("widget counting executing", "addition function executing", "huge explosion executing") and use that to find out where it quits. Add more printf's as needed to narrow it down.
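One detail matters for printf-style tracing in this particular case: SIGKILL gives the process no chance to flush stdio buffers, so a trace line that is still buffered when the kill arrives is simply lost. A minimal sketch of a tracing helper that flushes after every line follows; trace_line and TRACE are made-up names, not part of any library.

```c
#include <stdarg.h>
#include <stdio.h>

/* Write one trace line of the form "file:line: message" to `out` and
 * flush it right away, so the line survives even if the process is
 * killed immediately afterwards.
 * Returns 0 on success, -1 if any write or the flush failed. */
int trace_line(FILE *out, const char *file, int line, const char *fmt, ...)
{
    va_list args;
    int ok = fprintf(out, "%s:%d: ", file, line) >= 0;

    va_start(args, fmt);
    ok = ok && vfprintf(out, fmt, args) >= 0;
    va_end(args);

    ok = ok && fputc('\n', out) != EOF;
    ok = ok && fflush(out) == 0;   /* push the line out of the stdio buffer */
    return ok ? 0 : -1;
}

/* Convenience wrapper: trace to stderr, tagged with the call site. */
#define TRACE(...) trace_line(stderr, __FILE__, __LINE__, __VA_ARGS__)
```

With this in place, a call such as TRACE("widget counting executing, n=%d", n) at each suspect point leaves a trail up to the last statement reached before the kill. In a multithreaded program, adding the thread ID to the message helps tell the threads apart.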
Is this on both EL6 and EL7? If only EL7, it could be control groups causing the issue. The idea of cgroups is to prevent zombie processes, but if you need your program to spawn another process then restart itself while the other process continues to run, you need to launch it in a different control group, or the shutdown of the parent process will also kill the child. In my case, we have an upgrade script which needs to get called, then shut down the calling process in order to upgrade it. For example:
    # Clear any errors in the upgrade control group.
    /bin/systemctl reset-failed upgrade-trigger

    # Launch the upgrader in its own control group.
    /bin/systemd-run --unit=upgrade-trigger --slice=upgrade-trigger /bin/bash /opt/myapp/Upgrade.sh "$1" "$2"
If we don't do this, the upgrade fails as the upgrader gets terminated when the parent application is shut down.
Gregory Young
-----Original Message----- From: CentOS centos-bounces@centos.org On Behalf Of Fred Smith Sent: August 7, 2019 1:39 PM To: centos@centos.org Subject: Re: [CentOS] another bizarre thing...
On Mon, Aug 05, 2019 at 08:57:45PM -0400, Fred Smith wrote:
Hi all!
I'm stuck on something really bizarre that is happening to a product I "own" at work. It's a C program, built on CentOS, runs on CentOs or RHEL, has been in circulation since the early 00's, is in use at hundreds of sites.
recently, at multiple customer sites it has started just going away. no core file (yes, ulimit is configured), nothing in any of its (several) log files. it's just gone.
running it under strace until it dies reveals that every thread has been given a SIGKILL.
How does one figure out who deliverd a SIGKILL? For other, non-fatal, signals it is possible to glean the PID of the sending process in a signal handler, but obviously you can't do that for SIGKILL because the app doesn't survive the signal.
I'm grasping at straws here, and am open to almost any kind of suggestion that can be followed-up (as compared to "beats me" which is where I am now).
OK, more information.
Found a recipe to cause systemtap to emit a line of text identifying the sender of the SIGKILL.
probe signal.send { if (sig_name == "SIGKILL") printf("%s was sent to %s (pid:%d) by %s uid:%d\n", sig_name, pid_name, sig_pid, execname(), uid())
unfortunately, it says the program is killing itself:
SIGKILL was sent to myprog (pid:12269) by myprog uid:1000
So,... now I'm wondering how one figures that out. nowhere in my source code does it explicitly raise any signal, much less SIGKILL. So there must be some underlying library or system call or something doing it.
--
---- Fred Smith -- fredex@fcshome.stoneham.ma.us -----------------------------
I can do all things through Christ who strengthens me.
------------------------------ Philippians 4:13 -------------------------------
_______________________________________________
CentOS mailing list
CentOS@centos.org
https://lists.centos.org/mailman/listinfo/centos
On Thu, Aug 08, 2019 at 05:06:06PM +0000, Young, Gregory wrote:
Is this on both EL6 and EL7? If only EL7, it could be control groups causing the issue. The idea of cgroups is to prevent zombie processes, but if you need your program to spawn another process then restart itself while the other process continues to run, you need to launch it in a different control group, or the shutdown of the parent process will also kill the child. In my case, we have an upgrade script which needs to get called, then shut down the calling process in order to upgrade it. For example:
# Clear any errors in the upgrade control group.
/bin/systemctl reset-failed upgrade-trigger
# Launch the upgrader in its own control group.
/bin/systemd-run --unit=upgrade-trigger --slice=upgrade-trigger /bin/bash /opt/myapp/Upgrade.sh "$1" "$2"
If we don't do this, the upgrade fails because the upgrader gets terminated when the parent application is shut down.
Well, we aren't INTENTIONALLY using control groups. Do we get put into one by the very act of launching a program which then creates threads, and they then all coexist until they're told to stop?
I think it's not the scenario you describe. The main program launches from an init script, does some sanity checks, loads some config files, then spawns the number of threads defined by its configuration. Then all the threads, including the main prog, hang around doing stuff until they're told to stop, which happens all at once for all of them. On a good day, anyway. What is happening now is that they all run fine for some time (an hour or twelve), then they all receive a SIGKILL.
According to a systemtap script I found online, the program is killing itself, but as the guy who wrote it, I don't think so. The script can be seen below in the earlier mail.
As for whether it also fails on C6, I don't know. I've asked our support team to see if they have a C6/EL6 customer who will let them install the latest version for 6 and see what happens, but so far, no joy.
Fred
Hi Fred,
Yep, that's exactly how control groups work in CentOS 7. You don't need to define them (normally), they get assigned when the init script or systemd service launches it. As I mentioned, the idea is to ensure none of those child threads become zombies if the parent dies/crashes/gets killed. For troubleshooting, you could try moving the child threads into their own cgroup, which might help reduce the noise when the parent process gets killed. Of course, you will have to manually kill the child processes during this testing, but it might clear enough of the strace logging for you to see where the parent process is getting killed. Don't forget to undo this debugging step when done, or you will end up with zombies when you legitimately want to shut down the process.
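One way to check the assignment from inside the program, sketched under the assumption of a Linux /proc filesystem: read /proc/self/cgroup at startup and write it to one of the log files. The function name read_own_cgroups is hypothetical. Each line of that file has the form hierarchy-ID:controller-list:cgroup-path, so on a systemd host the name=systemd line shows which unit's cgroup the process landed in.

```c
#include <stdio.h>
#include <string.h>

/* Copy the calling process's cgroup membership (/proc/self/cgroup)
 * into buf as a NUL-terminated string.
 * Returns the number of bytes stored, or -1 if the file cannot be
 * opened or the arguments are unusable. Output longer than
 * bufsize - 1 bytes is silently truncated. */
int read_own_cgroups(char *buf, size_t bufsize)
{
    FILE *fp;
    size_t n;

    if (buf == NULL || bufsize == 0)
        return -1;

    fp = fopen("/proc/self/cgroup", "r");
    if (fp == NULL)
        return -1;

    n = fread(buf, 1, bufsize - 1, fp);
    buf[n] = '\0';
    fclose(fp);
    return (int)n;
}
```

Logging the buffer once at startup on an affected EL7 box would show, for example, whether the init script left the process in its own service cgroup or in the session scope of whoever started it by hand.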
Also, if you haven't already, you may want to convert it to launch from a systemd ".service" file. It gives you a lot of control over startup timeouts, restarts, shutdown commands, process branching, etc. If nothing else, it might help you identify when the process dies, and restart it without intervention...
Gregory Young
--
---- Fred Smith -- fredex@fcshome.stoneham.ma.us -----------------------------
"And he will be called Wonderful Counselor, Mighty God, Everlasting Father, Prince of Peace. Of the increase of his government there will be no end. He will reign on David's throne and over his kingdom, establishing and upholding it with justice and righteousness from that time on and forever."
------------------------------- Isaiah 9:7 (niv) ------------------------------
On Mon, Aug 05, 2019 at 08:57:45PM -0400, Fred Smith (fredex@fcshome.stoneham.ma.us) wrote:
Hi all!
I'm stuck on something really bizarre that is happening to a product I "own" at work. It's a C program, built on CentOS, runs on CentOS or RHEL, has been in circulation since the early 00's, and is in use at hundreds of sites.
Recently, at multiple customer sites, it has started just going away. No core file (yes, ulimit is configured), nothing in any of its (several) log files. It's just gone.
Late to the thread but since it has not been suggested: Have you tried to statically link all libs?
Then use Frank Cox's suggestion to put printf's at locations throughout the source code.
I know the binary will be big (depending on the number of libs), but this way you are sure that it is compiled against a known (your own) set of libs!
Also have you recompiled it and given the new binaries to the customers?
Just an idea ..
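A minimal sketch of what such printf checkpoints could look like, assuming C99 variadic macros are available. trace_point and TRACE are hypothetical names, not anything from Fred's code. Each checkpoint is formatted into one buffer, written with a single fputs() call so lines from different threads do not interleave, and flushed right away so the text has been handed to the kernel before a SIGKILL can arrive.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Write "file:line: message\n" to out and flush immediately.
 * Returns 0 on success, -1 on a formatting or I/O error.
 * Messages longer than the local buffer are truncated. */
int trace_point(FILE *out, const char *file, int line, const char *fmt, ...)
{
    char buf[512];
    va_list ap;
    int used, n;
    size_t len;

    used = snprintf(buf, sizeof buf, "%s:%d: ", file, line);
    if (used < 0 || (size_t)used >= sizeof buf - 1)
        return -1;

    va_start(ap, fmt);
    n = vsnprintf(buf + used, sizeof buf - (size_t)used, fmt, ap);
    va_end(ap);
    if (n < 0)
        buf[used] = '\0';          /* keep the prefix if formatting failed */

    /* Append the newline, leaving room for the terminating NUL. */
    len = strlen(buf);
    if (len > sizeof buf - 2)
        len = sizeof buf - 2;
    buf[len] = '\n';
    buf[len + 1] = '\0';

    if (fputs(buf, out) == EOF || fflush(out) != 0)
        return -1;
    return 0;
}

/* Convenience wrapper that records the call site automatically. */
#define TRACE(...) trace_point(stderr, __FILE__, __LINE__, __VA_ARGS__)
```

Sprinkling calls such as TRACE("entering worker loop, id=%d", id) through the code means the last line that made it out narrows down where each thread was when the process vanished. The id variable here is only a placeholder.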
On Mon, Aug 12, 2019 at 10:16:35AM +1000, Jobst Schmalenbach wrote:
Late to the thread but since it has not been suggested: Have you tried to statically link all libs?
I doubt modern Linux systems will produce a fully-static binary, since many of the system libs come only as .so files.
Then use Frank Cox's suggestion to put printf's at locations throughout the source code.
I know the binary will be big (depending on the number of libs), but this way you are sure that it is compiled against a known (your own) set of libs!
Also have you recompiled it and given the new binaries to the customers?
Yes, every time there's a new RHEL/CentOS version released it gets completely rebuilt on that new release. I don't depend on compatibility between releases. Not to mention whenever maintenance and feeping-creaturism* strike.
* for those not in the know: feeping-creaturism ==> creeping featurism
On Sun, Aug 11, 2019 at 08:52:59PM -0400, Fred Smith (fredex@fcshome.stoneham.ma.us) wrote:
On Mon, Aug 12, 2019 at 10:16:35AM +1000, Jobst Schmalenbach wrote:
Late to the thread but since it has not been suggested: Have you tried to statically link all libs?
I doubt modern Linux systems will produce a fully-static binary, since many of the system libs come only as .so files.
I know that. It's just a question of how keen you are to find the reason ... especially if you have no control over which libraries (even i686) are installed on the other machines.
Depending on how many libraries the binary uses, you could download them and use those as the source for inclusion. You could omit the obvious libs for starters ... and then include those as well if it is still crashing. You only need to distribute those binaries to the people who have problems ...
If a couple of those customers (with failing progs) are helpful, get a "yum list installed" from them, scan the list of libs and see whether something raises eyebrows.
For example, I had one of my machines failing on one prog because it had "glibc.i686" installed due to ftdi. I changed the program using the ftdi libs to use full x86_64 (took me a few hours) and uninstalled "glibc.i686", and suddenly the other prog had no problems!
I know you can't tell people to uninstall, but static linking MIGHT help.
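To see which shared objects a running copy of the program actually loaded, as opposed to what yum reports as installed, glibc's dl_iterate_phdr() can walk the list from inside the process. A minimal sketch, assuming glibc on Linux; print_object and log_loaded_objects are hypothetical names used only for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* dl_iterate_phdr() is a GNU extension */
#endif
#include <link.h>
#include <stdio.h>

/* Callback for dl_iterate_phdr(): print the path of each loaded
 * object to stderr and count it. The main executable is reported
 * with an empty name, so it is counted but not printed. */
static int print_object(struct dl_phdr_info *info, size_t size, void *data)
{
    int *count = data;

    (void)size;
    if (info->dlpi_name != NULL && info->dlpi_name[0] != '\0')
        fprintf(stderr, "loaded: %s\n", info->dlpi_name);
    (*count)++;
    return 0;                  /* 0 tells dl_iterate_phdr to keep going */
}

/* Log every object mapped into this process.
 * Returns the number of objects seen, including the executable itself. */
int log_loaded_objects(void)
{
    int count = 0;

    dl_iterate_phdr(print_object, &count);
    return count;
}
```

Calling log_loaded_objects() once at startup and comparing the output between a healthy site and a failing one might reveal an unexpected library path, such as a 32-bit or locally installed copy.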