From: David McGiven <davidmcgivenn at gmail.com>
> I'm running a Supermicro server with the latest CentOS 6.4 (kernel:
> 2.6.32-358.23.2.el6.x86_64) and the latest nvidia driver (331.20).
>
> A few minutes after using the GPU for doing some HPC calculations, the
> server crashes and reboots itself. This is happening every time. I know it
> will be rebooted but I don't know when. Sometimes it's 20 minutes after
> I start using it. Sometimes it's 2 hours.
>
> If I unplug the GPU card and put some stress on the server, it works ok. So
> I suspect there's a bug in the kernel/nvidia driver.
>
> I can't find any messages in /var/log/messages.

Did you check the IPMI logs?
The first thing that comes to my mind is overheating.
Maybe dump the temperatures to a log file every minute and, after the next
reboot, check whether there is a peak...
Or maybe a freeze, with the watchdog kicking in?

JD
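
The "dump the temperatures every minute" idea could be sketched roughly as below. This is only a sketch under assumptions: it assumes Python 2.7 or newer (for subprocess.check_output), that nvidia-smi on this driver supports --query-gpu, and that ipmitool is installed and can reach the BMC. The log path and the function names are made up for illustration. Each sample is fsync'ed so that the last lines have a better chance of surviving a hard reset.

```python
import os
import subprocess
import time
from datetime import datetime

# Assumed commands; swap in whatever sensor tools exist on the server.
COMMANDS = {
    "gpu": ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader"],
    "ipmi": ["ipmitool", "sdr", "type", "Temperature"],
}


def read_sensor(cmd):
    """Run one command; return its output on a single line, or an error note."""
    try:
        out = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
    except (OSError, subprocess.CalledProcessError) as exc:
        return "ERROR: %s" % exc
    lines = [l.strip() for l in out.decode("utf-8", "replace").splitlines()]
    return " | ".join(l for l in lines if l)


def log_once(path, commands=COMMANDS):
    """Append one timestamped line per sensor source and force it to disk."""
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open(path, "a") as fh:
        for name, cmd in sorted(commands.items()):
            fh.write("%s %s %s\n" % (stamp, name, read_sensor(cmd)))
        fh.flush()
        os.fsync(fh.fileno())  # push the data out before a possible crash


def main(path="/var/tmp/temps.log", interval=60):
    """Sample forever, once per `interval` seconds (hypothetical defaults)."""
    while True:
        log_once(path)
        time.sleep(interval)
```

Calling main() in a screen session or from an init script before starting the GPU job, then reading the tail of the log after the next reboot, should show whether the temperatures were climbing just before the crash. A missing tool shows up as an ERROR line rather than stopping the loop.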