Panruo Wu wrote:
David McGiven <davidmcgivenn@...> writes:
I'm running a Supermicro server with a fully updated CentOS 6.4 (kernel 2.6.32-358.23.2.el6.x86_64) and the latest NVIDIA driver (331.20).
A few minutes after I start using the GPU for HPC calculations, the server crashes and reboots itself. This happens every time. I know it will reboot, but I don't know when: sometimes it's 20 minutes after I start using it, sometimes 2 hours.
<snip>
I have the same problem on all 4 of my Supermicro machines. I don't know why it happens, but the NVIDIA driver seems to be to blame. I'm using CentOS 6.3 and NVIDIA driver version 304.54 or 319.37.
On our Dell R720s, I'm using the kmod-nvidia package from ELRepo. They don't crash, even when they're running week-long jobs.
mark