Panruo Wu wrote:
> David McGiven <davidmcgivenn at ...> writes:
>>
>> I'm running a Supermicro server with the latest CentOS 6.4 versions
>> (kernel 2.6.32-358.23.2.el6.x86_64) and the latest nvidia driver (331.20).
>>
>> A few minutes after using the GPU for doing some HPC calculations, the
>> server crashes and reboots itself. This is happening every time. I know
>> it will be rebooted but I don't know when. Sometimes it's 20 minutes after
>> starting using it. Sometimes it's 2 hours.
<snip>
> I also have the same problem with all my 4 Supermicro machines. I don't
> know why it happens but nvidia driver seems to be blamed for me.
> I'm using CentOS 6.3 and nVidia driver version 304.54 or 319.37.

On our Dell R720s, I'm using the kmod-nvidia from elrepo. They don't crash...
and that even when they're running week-long jobs.

        mark