[CentOS] Crash and automatic reboot when using the NVIDIA card

Sat Nov 23 00:38:12 UTC 2013
Panruo Wu <armiuswu at gmail.com>

Panruo Wu <armiuswu at ...> writes:

> 
> David McGiven <davidmcgivenn <at> ...> writes:
> 
> > 
> > Hello there,
> > 
> > I'm running a Supermicro server with the latest CentOS 6.4 versions (kernel
> > : 2.6.32-358.23.2.el6.x86_64) and the latest nvidia driver (331.20).
> > 
> > A few minutes after using the GPU for doing some HPC calculations, the
> > server crashes and reboots itself. This is happening every time. I know it
> > will be rebooted but I don't know when. Sometimes it's 20 minutes after
> > starting using it. Sometimes it's 2 hours.
> > 
> > If I unplug the GPU card and put some stress on the server, it works ok. So
> > I suspect there's a bug in the kernel/nvidia driver.
> > 
> > I can't find any messages on /var/log/messages.
> > 
> > What should I do? Should I file a bug on the CentOS bug tracker?
> > Is there any way I can gather more information? The server is in a remote
> > location so I have a hard time accessing the console.
> > 
> > Thanks.
> > 
> 
> Hi there,
> 
> I have the same problem on all 4 of my Supermicro machines. I don't
> know why it happens, but the nvidia driver seems to be to blame.
> I'm using CentOS 6.3 and nVidia driver version 304.54 or 319.37.
> 
> Best,
> Panruo
> 

Hi David,

I think I might have found a workaround. In short, turn on persistence mode
for your GPU so that the NVIDIA driver is not unloaded when the GPU is idle.
I suspect the frequent loading and unloading of the driver is buggy and
corrupts kernel state. To turn persistence mode on (this needs root):

    $ nvidia-smi -pm 1
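
If you want to check whether the setting actually took effect before relying
on it, something along these lines might help. It is only a rough sketch: the
helper name count_persistence_off is made up for illustration, and it assumes
the `nvidia-smi -q` report prints one "Persistence Mode : Enabled/Disabled"
line per GPU, which may differ between driver versions.

```shell
#!/bin/sh
# Hypothetical helper (not part of nvidia-smi): reads `nvidia-smi -q` text
# on stdin and prints how many GPUs report persistence mode as disabled.
# Assumes the report has lines like "Persistence Mode : Disabled".
count_persistence_off() {
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -c 'Persistence Mode *: *Disabled' || true
}

# Only touch the driver if nvidia-smi is installed on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
    off=$(nvidia-smi -q | count_persistence_off)
    if [ "$off" -gt 0 ]; then
        echo "$off GPU(s) without persistence mode; enabling (needs root)"
        nvidia-smi -pm 1
    else
        echo "no GPU reports persistence mode as disabled"
    fi
fi
```

As far as I know the setting does not survive a reboot, so it probably has to
be reapplied at boot, for example by adding the `nvidia-smi -pm 1` line to
/etc/rc.d/rc.local on CentOS 6.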

Let me know if this works for you. One of my nodes has now been running all
the CUDA 5.5 samples in a loop for 4 hours with no crashes so far.
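
In case you want to repeat that kind of soak test, here is a minimal sketch
of the loop. It is not the exact script I ran: the function name soak is made
up, and it simply assumes you point it at a directory of already built
sample binaries (the path is yours to fill in).

```shell
#!/bin/sh
# soak DIR ROUNDS: run every executable file in DIR, ROUNDS times over,
# and return non-zero at the first program that exits with an error.
# Hypothetical helper for illustration only.
soak() {
    dir=$1
    rounds=$2
    i=1
    while [ "$i" -le "$rounds" ]; do
        for prog in "$dir"/*; do
            # Skip anything that is not a regular executable file.
            [ -f "$prog" ] && [ -x "$prog" ] || continue
            if ! "$prog" >/dev/null 2>&1; then
                echo "round $i: $prog failed"
                return 1
            fi
        done
        i=$((i + 1))
    done
    echo "completed $rounds round(s) without a failure"
}
```

Called as `soak /path/to/built/samples 100`, it stops at the first sample
that fails. If the whole machine reboots instead, the loop obviously never
gets to report anything, so the uptime is the real result.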

Panruo