Panruo Wu <armiuswu at ...> writes:
>
> David McGiven <davidmcgivenn <at> ...> writes:
>
> >
> > Hello there,
> >
> > I'm running a Supermicro server with the latest CentOS 6.4 versions
> > (kernel: 2.6.32-358.23.2.el6.x86_64) and the latest nvidia driver (331.20).
> >
> > A few minutes after using the GPU for doing some HPC calculations, the
> > server crashes and reboots itself. This is happening every time. I know it
> > will be rebooted but I don't know when. Sometimes it's 20 minutes after
> > starting using it. Sometimes it's 2 hours.
> >
> > If I unplug the GPU card and put some stress on the server, it works ok. So
> > I suspect there's a bug in the kernel/nvidia driver.
> >
> > I can't find any messages on /var/log/messages.
> >
> > What should I do ? Should I file a bug on the centos bugtracking system ?
> > Is there anyway I can gather more information ? The server is in a remote
> > location so I have a hard time accessing the console.
> >
> > Thanks.
> >
>
> Hi there,
>
> I also have the same problem with all my 4 Supermicro machines. I don't
> know why it happens but nvidia driver seems to be blamed for me.
> I'm using CentOS 6.3 and nVidia driver version 304.54 or 319.37.
>
> Best,
> Panruo
>

Hi David,

I think I might have found a way to work around this. In short, just turn
on persistence mode for your GPU, so that the nvidia driver is not unloaded
while the GPU is idle. I suspect that the frequent loading and unloading of
the nvidia driver is buggy and messes up the kernel.

To turn persistence mode on:

    $ nvidia-smi -pm 1

Let me know if this works for you. I have a node that is still running
fine after 4 hours of running all the CUDA 5.5 samples over and over.
No crashes so far.

Panruo
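
In case it helps anyone trying the workaround above, here is a minimal sketch of a script that checks whether persistence mode is already on and only enables it when needed. It assumes nvidia-smi is on the PATH, that the script runs as root, and that `nvidia-smi -q` prints a "Persistence Mode : Enabled/Disabled" line per GPU. The helper name all_persistent is made up for this example. As far as I know the setting is lost on reboot, so something like this would have to run at every boot (for example from /etc/rc.local, which is an assumption about your setup).

```shell
#!/bin/sh
# Sketch only: enable NVIDIA persistence mode if it is not already on.
# Assumptions: nvidia-smi is on PATH and this script runs as root.

# Hypothetical helper: reads `nvidia-smi -q` output on stdin and succeeds
# only if at least one "Persistence Mode" line is present and every such
# line reports "Enabled".
all_persistent() {
    awk -F': *' '
        /Persistence Mode/ { seen = 1; if ($2 !~ /^Enabled/) bad = 1 }
        END { if (seen && !bad) exit 0; exit 1 }
    '
}

if command -v nvidia-smi >/dev/null 2>&1; then
    if nvidia-smi -q | all_persistent; then
        echo "persistence mode already enabled"
    else
        # Same command as in the message above; applies to all GPUs.
        nvidia-smi -pm 1
    fi
else
    echo "nvidia-smi not found" >&2
fi
```

Checking first is only a convenience so that the boot log shows whether the mode had been lost; running `nvidia-smi -pm 1` unconditionally should work just as well.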