[CentOS] Crash and automatic reboot when using the NVIDIA card

Fri Nov 15 11:11:47 UTC 2013
David McGiven <davidmcgivenn at gmail.com>

Hello there,

I'm running a Supermicro server with the latest CentOS 6.4 (kernel:
2.6.32-358.23.2.el6.x86_64) and the latest NVIDIA driver (331.20).

Some time after I start using the GPU for HPC calculations, the server
crashes and reboots itself. This happens every time. I know it will reboot,
but I can't predict when: sometimes it's 20 minutes after I start using the
GPU, sometimes it's 2 hours.

If I remove the GPU card and put some stress on the server, it works fine.
So I suspect there's a bug in the kernel or the NVIDIA driver.

I can't find any related messages in /var/log/messages.

What should I do? Should I file a bug in the CentOS bug tracker? Is there
any way I can gather more information? The server is in a remote location,
so I have a hard time accessing the console.