[CentOS] NVidia, again

Wed Mar 26 19:47:17 UTC 2014
m.roth at 5-cent.us <m.roth at 5-cent.us>

Tru Huynh wrote:
> On Wed, Mar 26, 2014 at 09:40:17AM -0400, m.roth at 5-cent.us wrote:
>> Johnny Hughes wrote:
> ...
>> > Are you connecting to the server to do X related things remotely ...
>> > and therefore need NVIDIA drivers for that?
>> >
>> I think you missed that part of my original post: no X. This box has two
>> Tesla GPUs, and my users are using them for heavy duty scientific
>> computing....
>
> afaik, in order to use your Tesla cards, you need to have the nvidia
> driver loaded, but ymmv.
>
I am aware of that. Here's the latest in my fight: I've got one server,
which I'd updated but not rebooted, so it's still running the 358.18.1
kernel. I used yum downgrade to put kmod-nvidia and nvidia-x11-drv back to
what it had been running, 325.15-1, and it's happy as a clam (after I
reloaded the nvidia driver). BUT, I note
that modinfo shows
/lib/modules/2.6.32-358.18.1.el6.x86_64/weak-updates/nvidia/nvid, which is
a link to /lib/modules/2.6.32-358.el6.x86_64/extra/nvidia/nvidia.ko. I
find this... odd.

Now, on the other server, the one that's been rebooted into the new
431.5.1 kernel and that I'm still fighting with, I did the same... and see
/lib/modules/2.6.32-431.5.1.el6.x86_64/weak-updates/nvidia/nvidia.ko ->
/lib/modules/2.6.32-358.el6.x86_64/extra/nvidia/nvidia.ko

THAT does not look right at all.
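
For checking this across boxes, here's a minimal Python sketch. It is not
anything from the kmod packages; the function name weak_update_targets and
its modules_root parameter are made up for illustration. It lists each
symlink under a kernel's weak-updates directory and names the kernel tree
the link actually resolves into.

```python
import os
import platform


def weak_update_targets(kernel, modules_root="/lib/modules"):
    """Return (link, target, tree) for each symlink under weak-updates.

    `tree` is the kernel directory under modules_root that the link
    resolves into. Hypothetical helper, for illustration only.
    """
    weak_dir = os.path.join(modules_root, kernel, "weak-updates")
    root = os.path.realpath(modules_root)
    results = []
    # os.walk yields nothing if weak_dir does not exist.
    for dirpath, _dirs, files in os.walk(weak_dir):
        for name in sorted(files):
            link = os.path.join(dirpath, name)
            if not os.path.islink(link):
                continue
            target = os.path.realpath(link)
            # First path component below modules_root is the kernel tree
            # the module file really lives in.
            tree = os.path.relpath(target, root).split(os.sep)[0]
            results.append((link, target, tree))
    return results


if __name__ == "__main__":
    running = platform.release()
    for link, target, tree in weak_update_targets(running):
        note = "" if tree == running else "  <-- different kernel tree"
        print(f"{link} -> {target}{note}")
```

Run against the 431.5.1 kernel on the rebooted box, a link like the one
above would be reported with 2.6.32-358.el6.x86_64 as the tree it resolves
into.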

dmesg shows
NVRM: loading NVIDIA UNIX x86_64 Kernel Module  325.15  Wed Jul 31 18:50:56 PDT 2013
nvidia 0000:05:00.0: irq 113 for MSI/MSI-X
NVRM: RmInitAdapter failed! (0x25:0x48:1157)
NVRM: rm_init_adapter(0) failed
nvidia 0000:05:00.0: irq 113 for MSI/MSI-X
NVRM: RmInitAdapter failed! (0x25:0x48:1157)
NVRM: rm_init_adapter(0) failed
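
To pull the relevant bits out of a long dmesg, a rough sketch along these
lines should do. It assumes the NVRM message wording shown above, and
summarize_nvrm is a hypothetical helper name, nothing NVIDIA ships.

```python
import re

# Patterns based only on the message wording seen in the dmesg output above.
LOAD_RE = re.compile(r"NVRM: loading NVIDIA UNIX \S+ Kernel Module\s+(\S+)")
FAIL_RE = re.compile(r"NVRM: RmInitAdapter failed! \(([^)]*)\)")


def summarize_nvrm(log_text):
    """Collect driver versions announced and RmInitAdapter failure codes."""
    return {
        "driver_versions": LOAD_RE.findall(log_text),
        "init_failures": FAIL_RE.findall(log_text),
    }
```

Feeding it the saved output of dmesg gives the driver version the kernel
actually loaded and one entry per failed adapter init, which is quicker
than eyeballing the log after every reload.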

At least I've got one back. As a last resort, I can reboot to the older
kernel and see if that works with this version of kmod-nvidia, but I'd
*REALLY* like to have the new kernel.

        mark