[CentOS] Crash and automatic reboot when using the NVIDIA card

Fri Nov 22 19:36:56 UTC 2013
Panruo Wu <armiuswu at gmail.com>

David McGiven <davidmcgivenn at ...> writes:

> 
> Hello there,
> 
> I'm running a Supermicro server with the latest CentOS 6.4 (kernel
> 2.6.32-358.23.2.el6.x86_64) and the latest NVIDIA driver (331.20).
> 
> Some time after I start using the GPU for HPC calculations, the
> server crashes and reboots itself. This happens every time. I know it
> will reboot, but I don't know when. Sometimes it's 20 minutes after
> I start using the GPU. Sometimes it's 2 hours.
> 
> If I unplug the GPU card and put some stress on the server, it works OK. So
> I suspect there's a bug in the kernel or the NVIDIA driver.
> 
> I can't find any related messages in /var/log/messages.
> 
> What should I do? Should I file a bug in the CentOS bug tracking system?
> Is there any way I can gather more information? The server is in a remote
> location, so I have a hard time accessing the console.
> 
> Thanks.
> 


Hi there,

I have the same problem on all four of my Supermicro machines. I don't
know why it happens, but the NVIDIA driver seems to be to blame.
I'm running CentOS 6.3 with NVIDIA driver version 304.54 or 319.37.
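
On the question of gathering more information without console access,
here is a minimal sketch, not something I have verified on these
machines. It assumes nvidia-smi is on the PATH and that its
"-q -d TEMPERATURE" query works with the installed driver; the log file
name, the polling interval, and the function names are all made up for
illustration. It writes each reading to disk with fsync, so the last
sample before a sudden reboot should still be in the file afterwards.

```python
import os
import subprocess
import time


def append_synced(path, line):
    """Append one line and force it to disk so it survives a sudden reboot."""
    with open(path, "a") as f:
        f.write(line.rstrip("\n") + "\n")
        f.flush()
        os.fsync(f.fileno())


def read_gpu_status():
    """Return the raw text of `nvidia-smi -q -d TEMPERATURE`, or an error note."""
    try:
        out = subprocess.check_output(["nvidia-smi", "-q", "-d", "TEMPERATURE"])
        return out.decode("utf-8", "replace")
    except (OSError, subprocess.CalledProcessError) as exc:
        # nvidia-smi missing or returned an error; record that instead of crashing.
        return "nvidia-smi failed: %r" % (exc,)


def main(log_path="gpu_watch.log", interval_s=5):
    """Poll forever, logging every non-empty output line with a timestamp.

    log_path and interval_s are hypothetical defaults, adjust as needed.
    """
    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        for line in read_gpu_status().splitlines():
            if line.strip():
                append_synced(log_path, stamp + " " + line.strip())
        time.sleep(interval_s)

# Call main() (for example under nohup) before starting the GPU job.
```

If the last temperature logged before each reboot is consistently high,
that would point toward cooling or power rather than the driver, but
that is only a guess until there is data.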


Best,
Panruo