We're trying to debug a problem: a server that reboots spontaneously while one user's large, multithreaded program is running. Sometimes it won't do it for hours, other times it's literally every 10 min. I've run iostat and netstat, and have top and tail -f /var/log/dmesg running: *nada*. Nothing out of the ordinary.
One thing that's constant: as the system's coming back up, we see a segv of pbs_mom (we're using torque for clustering), and every time it saves the core dump, then a second or so later:

  Jun 26 14:29:58 <servername> abrtd: Package 'torque-mom' isn't signed with proper key
  Jun 26 14:29:58 <servername> abrtd: Corrupted or bad dump /var/spool/abrt/ccpp-2012-06-26-14:29:57-3086 (res:2), deleting

I've changed /etc/abrt/abrtd.conf to tell it to save cores from programs not associated with packages (his program's not), and it still does the same thing. Neither the man page nor googling has turned up anything.
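For reference, the kind of change described above would look roughly like the fragment below. This is only a sketch: it assumes the setting meant is ABRT's ProcessUnpackaged option, and which config file holds that option depends on the ABRT version installed, so it may not be the file named above.

```ini
# Assumed option name; the file it belongs in varies by ABRT version.
# Keep crash dumps from executables that do not belong to any installed package.
ProcessUnpackaged = yes
```

abrtd would probably need a restart to pick up the change.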
So, does anyone know either why it thinks it's corrupt, or how I can make it *stop* deleting it?
mark