Hello there,
I'm running a Supermicro server with the latest CentOS 6.4 versions (kernel : 2.6.32-358.23.2.el6.x86_64) and the latest nvidia driver (331.20).
Some time after I start using the GPU for HPC calculations, the server crashes and reboots itself. This happens every time. I know it will reboot, but I don't know when. Sometimes it's 20 minutes after I start using it. Sometimes it's 2 hours.
If I unplug the GPU card and put some stress on the server, it works ok. So I suspect there's a bug in the kernel/nvidia driver.
I can't find any messages on /var/log/messages.
What should I do? Should I file a bug on the CentOS bug tracking system? Is there any way I can gather more information? The server is in a remote location so I have a hard time accessing the console.
Thanks.
From: David McGiven davidmcgivenn@gmail.com
I'm running a Supermicro server with the latest CentOS 6.4 versions (kernel : 2.6.32-358.23.2.el6.x86_64) and the latest nvidia driver (331.20). A few minutes after using the GPU for doing some HPC calculations, the server crashes and reboots itself. This is happening every time. I know it will be rebooted but I don't know when. Sometimes it's 20 minutes after starting using it. Sometimes it's 2 hours. If I unplug the GPU card and put some stress on the server, it works ok. So I suspect there's a bug in the kernel/nvidia driver. I can't find any messages on /var/log/messages.
Did you check the IPMI logs? The first thing that comes to my mind would be overheating. Maybe dump the temperatures every minute to a log file and, after the next reboot, check whether there is a peak... Or maybe a freeze plus the watchdog kicking in?
JD
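A minimal sketch of the per-minute temperature logging suggested above. It assumes an nvidia-smi build that supports the --query-gpu option; the log path, the TEMP_CMD and TEMP_LOG variables, and the function names are hypothetical and only for illustration.

```shell
#!/bin/sh
# Sketch: append one timestamped temperature reading per call to a log file.

# Command that prints the temperature; override TEMP_CMD to read other sensors.
TEMP_CMD="${TEMP_CMD:-nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader}"
# Hypothetical log location; pick a path on a local disk.
TEMP_LOG="${TEMP_LOG:-/var/tmp/gpu-temp.log}"

log_temp() {
    # $TEMP_CMD is intentionally unquoted so it splits into command + arguments.
    reading=$($TEMP_CMD 2>/dev/null) || reading="unavailable"
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$reading" >> "$TEMP_LOG"
    # Flush to disk so the last reading has a chance to survive a sudden reboot.
    sync
}

watch_temp() {
    # Log every $1 seconds (default 60) until interrupted.
    while :; do
        log_temp
        sleep "${1:-60}"
    done
}
```

Starting `watch_temp 60 &` before the job and reading the tail of the log after the next reboot should show whether the temperature climbed before the crash. Setting TEMP_CMD to `ipmitool sdr type Temperature` would record the board sensors instead, assuming ipmitool and a BMC are available; `ipmitool sel elist` prints the IPMI event log.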
I am forced to use a windoze 7 box and recently MS decided in its infinite wisdom to update the nvidia driver via windoze update. My machine immediately started with the same symptoms David is having...hanging at indeterminate times, even a BSOD twice. It would do this even when idle during the night.
Googling for an answer resulted in finding a forum related to the nvidia web site on which there was a post suggesting that there were a lot of problems with the current version and we should reinstall back level drivers. The post suggested going back to 314.22. I did so and have not had a single problem since.
YMMV
Regards,
Ron Young 919-621-9015 http://www.linkedin.com/in/ronhyoung
On Fri, Nov 15, 2013 at 6:11 AM, David McGiven davidmcgivenn@gmail.com wrote:
<snip>
_______________________________________________
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos
Hi there,
I have the same problem on all 4 of my Supermicro machines. I don't know why it happens, but the nvidia driver seems to be to blame in my case. I'm using CentOS 6.3 with nVidia driver version 304.54 or 319.37.
Best, Panruo
On our Dell R720s, I'm using the kmod-nvidia from elrepo. They don't crash... and that even when they're running week-long jobs.
mark
Hi David,
I think I might have found a way to work around this. In short, turn on persistence mode for your GPU, so that the nVidia driver will not be unloaded when the GPU is idle. I suspect the frequent loading and unloading of the nvidia driver might have bugs and mess up the kernel. To turn persistence mode on:
$ nvidia-smi -pm 1
Let me know if this works for you. I have a node running strong after 4 hours of running all the cuda 5.5 samples over and over. No crashes so far.
Panruo
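A small sketch for checking that the setting took effect, assuming `nvidia-smi -q` prints a "Persistence Mode" line per GPU; the function name persistence_enabled is hypothetical. As far as I know, the setting is not kept across reboots, so a check like this could be run at boot (as root) before starting jobs.

```shell
#!/bin/sh
# Sketch: read `nvidia-smi -q` output on stdin and succeed only if at least
# one "Persistence Mode" line is present and every such line says Enabled.
persistence_enabled() {
    awk -F: '
        /Persistence Mode/ { n++; if ($2 !~ /Enabled/) bad++ }
        END { exit ((n > 0 && bad == 0) ? 0 : 1) }
    '
}
```

For example, `nvidia-smi -q | persistence_enabled || nvidia-smi -pm 1` would turn persistence mode on only when some GPU reports it as disabled.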
On Fri, Nov 22, 2013 at 2:36 PM, Panruo Wu armiuswu@gmail.com wrote:
A few minutes after using the GPU for doing some HPC calculations, the server crashes and reboots itself. This is happening every time. I know it will be rebooted but I don't know when. Sometimes it's 20 minutes after starting using it. Sometimes it's 2 hours.
I had a similar problem. Under load the system would crash. Turned out to be the fans weren't spinning up correctly.