Tru Huynh wrote:
> On Wed, Mar 26, 2014 at 09:40:17AM -0400, m.roth@5-cent.us wrote:
>> Johnny Hughes wrote:
>> ...
>>> Are you connecting to the server to do X related things remotely ... and therefore need NVIDIA drivers for that?
>> I think you missed that part of my original post: no X. This box has two Tesla GPUs, and my users are using them for heavy duty scientific computing....
> afaik, in order to use your Tesla cards, you need to have the nvidia driver loaded, but ymmv.
I am aware of that. Here's the latest in my fight: I've got one server, which I'd updated but not rebooted, and it's still on the 358.18.1 kernel. I used yum downgrade to take kmod-nvidia and nvidia-x11-drv back to what it had been running, 325.15-1, and it's happy as a clam (after I reloaded the nvidia driver). BUT, I note that modinfo shows /lib/modules/2.6.32-358.18.1.el6.x86_64/weak-updates/nvidia/nvid, which is a link to /lib/modules/2.6.32-358.el6.x86_64/extra/nvidia/nvidia.ko. I find this... odd.
Now, on the other server, the one that's been rebooted into the new 431.5.1 kernel and that I'm still fighting with, I did the same downgrade... and see /lib/modules/2.6.32-431.5.1.el6.x86_64/weak-updates/nvidia/nvidia.ko -> /lib/modules/2.6.32-358.el6.x86_64/extra/nvidia/nvidia.ko
THAT does not look right at all.
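For anyone wanting to check the same thing on their own box, here is a minimal sketch (Python 3) that resolves the weak-updates link for the running kernel and prints which kernel tree the link and its target each live under. The function names kernel_release and check_weak_update are made up for illustration, and the nvidia.ko path layout is assumed to match the one quoted above; it only reports where the link points and makes no judgment about whether that is correct.

```python
import os
import platform
import re

# Matches the kernel release directory directly under /lib/modules/.
_KREL = re.compile(r"^/lib/modules/([^/]+)/")


def kernel_release(path):
    """Return the kernel release embedded in a /lib/modules path, or None."""
    m = _KREL.match(path)
    return m.group(1) if m else None


def check_weak_update(link_path):
    """Resolve a (possible) symlink and report the kernel tree on each end."""
    target = os.path.realpath(link_path)  # follows symlinks; returns the path unchanged if none
    return {
        "link": link_path,
        "target": target,
        "link_kernel": kernel_release(link_path),
        "target_kernel": kernel_release(target),
    }


if __name__ == "__main__":
    # Assumed layout: same weak-updates location as in the post above.
    link = "/lib/modules/%s/weak-updates/nvidia/nvidia.ko" % platform.release()
    info = check_weak_update(link)
    print("link:  ", info["link"])
    print("target:", info["target"])
    if info["link_kernel"] != info["target_kernel"]:
        print("module lives under kernel tree", info["target_kernel"],
              "but is linked into", info["link_kernel"])
```

Running it on both servers and comparing the output would show at a glance whether the two links resolve to the same module file.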
dmesg shows:
NVRM: loading NVIDIA UNIX x86_64 Kernel Module 325.15 Wed Jul 31 18:50:56 PDT 2013
nvidia 0000:05:00.0: irq 113 for MSI/MSI-X
NVRM: RmInitAdapter failed! (0x25:0x48:1157)
NVRM: rm_init_adapter(0) failed
nvidia 0000:05:00.0: irq 113 for MSI/MSI-X
NVRM: RmInitAdapter failed! (0x25:0x48:1157)
NVRM: rm_init_adapter(0) failed
At least I've got one back. As a last resort, I can reboot to the older kernel and see if that works with this version of kmod-nvidia, but I'd *REALLY* like to have the new kernel.
mark