[CentOS] another bizarre thing...

Tue Aug 6 18:20:48 UTC 2019
Fred Smith <fredex at fcshome.stoneham.ma.us>

On Tue, Aug 06, 2019 at 03:49:29PM +0000, James Pearson wrote:
> Fred Smith wrote:
> > 
> > Hi all!
> > 
> > I'm stuck on something really bizarre that is happening to a product
> > I "own" at work. It's a C program, built on CentOS, runs on CentOS or
> > RHEL, has been in circulation since the early 2000s, and is in use at
> > hundreds of sites.
> > 
> > Recently, at multiple customer sites it has started just going away.
> > No core file (yes, ulimit is configured), nothing in any of its
> > (several) log files. It's just gone.
> > 
> > Running it under strace until it dies reveals that every thread has
> > been given a SIGKILL.
> > 
> > How does one figure out who delivered a SIGKILL? For other, non-fatal,
> > signals it is possible to glean the PID of the sending process in a
> > signal handler, but obviously you can't do that for SIGKILL because
> > the app doesn't survive the signal.
> > 
> > I'm grasping at straws here, and am open to almost any kind of
> > suggestion that can be followed up (as compared to "beats me", which
> > is where I am now).
> > 
> > I'm even wondering if systemd has something to do with it.
> 
> I had an issue a few years ago where 'something' was killing processes - 
> I found it by writing a simple LD_PRELOAD hack that intercepted kill(2) 
> and logged what it was doing via syslog before doing the actual kill - 
> and used /etc/ld.so.preload to get it loaded by every process ...
> 
> James Pearson

James:

After posting my original mail, I found this URL:

https://www.ibm.com/developerworks/community/blogs/aimsupport/entry/Finding_the_source_of_signals_on_Linux_with_strace_auditd_or_Systemtap?lang=en

which shows a very simple recipe for programming SystemTap to report
SIGKILLs, the UID that sends each one, and the target process. We've
asked the customer who is helping troubleshoot to implement that and
get back to us with the result.
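
For comparison, the LD_PRELOAD approach James describes could be sketched
roughly as follows. This is not his actual code, only a guess at the
general shape: a shared object that defines its own kill(), logs the
caller and target via syslog, and then forwards to the real kill() found
with dlsym(RTLD_NEXT, ...). The log text and the choice to skip signal 0
are assumptions.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* needed for RTLD_NEXT in <dlfcn.h> */
#endif
#include <dlfcn.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <syslog.h>
#include <unistd.h>

/* Interposed kill(2) wrapper: log who is signalling whom, then hand
 * the call to the next kill() in link order (normally libc's). */
int kill(pid_t pid, int sig)
{
    static int (*real_kill)(pid_t, int);

    if (real_kill == NULL)
        real_kill = (int (*)(pid_t, int))dlsym(RTLD_NEXT, "kill");
    if (real_kill == NULL) {
        errno = ENOSYS;         /* no underlying kill() could be found */
        return -1;
    }

    /* Signal 0 is only an existence/permission probe, so skip it. */
    if (sig != 0)
        syslog(LOG_USER | LOG_NOTICE,
               "kill-trace: pid %ld (uid %lu) sends signal %d to pid %ld",
               (long)getpid(), (unsigned long)getuid(), sig, (long)pid);

    return real_kill(pid, sig);
}
```

Assuming the source is saved as killtrace.c (a made-up name), it could be
built with something like "gcc -shared -fPIC -o libkilltrace.so
killtrace.c -ldl" and its full path listed in /etc/ld.so.preload. It only
sees callers that go through libc's kill(): statically linked programs,
direct system calls, tgkill() and signals generated by the kernel itself
(for example by the OOM killer) would not show up.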

I suspect systemd has something to do with it, but I have absolutely
no evidence, just a nagging feeling that since it has its little
fingers in all the pies, it could be doing anything and I'd have
no way of knowing. :(

I try not to be one of the systemd bashers, but I seem to be losing
that battle.

Fred

-- 
-------------------------------------------------------------------------------
 .----    Fred Smith   /              
( /__  ,__.   __   __ /  __   : /     
 /    /  /   /__) /  /  /__) .+'           Home: fredex at fcshome.stoneham.ma.us 
/    /  (__ (___ (__(_ (___ / :__                                 781-438-5471 
-------------------------------- Jude 1:24,25 ---------------------------------