[CentOS] NVidia, again

Wed Mar 26 21:47:32 UTC 2014
m.roth at 5-cent.us <m.roth at 5-cent.us>

Ghu! My hosting provider is "pushing security updates and enhancements",
with the result that half the messages I hit <reply> to come up with
nothing.... Can't even remember what I've posted recently....

Anyway,
PaulH wrote:
> Here's the init script I use on our 3-card CUDA box. In particular,
> note the mknod stuff, which might be at issue in your situation.
> (Sorry about line breaks; you may have to guess in a couple spots.)

That's not the problem. ll /dev/nv*
crw-rw-rw-. 1 root root 195,   0 Mar 26 16:26 /dev/nvidia0
crw-rw-rw-. 1 root root 195,   1 Mar 26 16:26 /dev/nvidia1
crw-rw-rw-. 1 root root 195,   2 Mar 26 16:26 /dev/nvidia2
crw-rw-rw-. 1 root root 195,   3 Mar 26 16:26 /dev/nvidia3
crw-rw-rw-. 1 root root 195,   4 Mar 26 16:26 /dev/nvidia4
crw-rw-rw-. 1 root root 195,   5 Mar 26 16:26 /dev/nvidia5
crw-rw-rw-. 1 root root 195,   6 Mar 26 16:26 /dev/nvidia6
crw-rw-rw-. 1 root root 195,   7 Mar 26 16:26 /dev/nvidia7
crw-rw-rw-. 1 root root 195,   8 Mar 26 16:26 /dev/nvidia8
crw-rw-rw-. 1 root root 195,   9 Mar 26 16:26 /dev/nvidia9
crw-rw-rw-. 1 root root 195, 255 Mar 26 16:26 /dev/nvidiactl

Everything *looks* ok....
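
(For reference, the mknod bit in that kind of init script is, I'm guessing,
the usual loop that creates exactly these nodes by hand when the driver
doesn't - something along these lines, with the major/minors matching the
ls above:)

# rough sketch, assuming up to 10 GPUs; major 195, minor N per card,
# minor 255 for the control node
for i in $(seq 0 9); do
    [ -c /dev/nvidia$i ] || mknod -m 666 /dev/nvidia$i c 195 $i
done
[ -c /dev/nvidiactl ] || mknod -m 666 /dev/nvidiactl c 195 255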

Here's where I am at the end of the day (i.e., the hole in the wall where
I'm beating my head). I rebooted to the previous kernel,
2.6.32-358.18.1.el6.x86_64, which is identical to what's running on the
other box (the one I never rebooted).

On the other box, enum_gpu was failing. I downloaded
kmod-nvidia-325.15-1.el6.elrepo.x86_64.rpm and
nvidia-x11-drv-325.15-1.el6.elrepo.x86_64.rpm, did a yum downgrade to
those two... and everything was wonderful. I gave my user the box back.
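
(Roughly, for anyone hitting this later - and assuming yum downgrade is
happy taking the local rpm paths directly, which is how I remember doing
it:)

# assuming the two downloaded rpms are sitting in the current directory
yum downgrade ./kmod-nvidia-325.15-1.el6.elrepo.x86_64.rpm \
              ./nvidia-x11-drv-325.15-1.el6.elrepo.x86_64.rpm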

This one... I rebooted to the same old kernel and did a yum reinstall (I'd
tried installing under the new kernel, so I was trying to replicate what
I'd done on the other box). Everything looks good... but I'm still at
enum_gpu failing, asserting a code error, which I figure means it's not
seeing the GPUs. (The cards are different, mind - this box has M2090s, the
other K20Cs.) But both boxes were working before the updates yesterday....

Identical software... but different results.
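
(The checklist I'll be running through again in the morning is nothing
exotic - just making sure the cards, the module, and the driver all agree:)

lspci | grep -i nvidia    # are the M2090s even visible on the bus?
lsmod | grep nvidia       # is the kernel module actually loaded?
nvidia-smi                # does the driver itself enumerate the GPUs?
dmesg | grep -i nvrm      # any RmInitAdapter or version-mismatch noise?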

rmmod nvidia results in
Mar 26 17:41:09 <server> kernel: nvidia 0000:05:00.0: PCI INT A disabled
Mar 26 17:41:09 <server> kernel: nvidia 0000:42:00.0: PCI INT A disabled
Mar 26 17:41:09 <server> kernel: NVRM: VM: nvidia_exit_module:1127:
0xffff882051e7d7c0, 20 page(s), count = 1, flags = 0x00010011, key =
0x204eca1000, page_table = 0xffff882051e20548
Mar 26 17:41:09 <server> kernel: NVRM: VM: nvidia_exit_module:1127:
0xffff88204f683dc0, 1 page(s), count = 1, flags = 0x00010015, key =
0x204fc34000, page_table = 0xffff88204f698aa8
Mar 26 17:41:09 <server> kernel: NVRM: VM: nvidia_exit_module:1127:
0xffff88205233fd40, 1 page(s), count = 1, flags = 0x00000015, key =
0x2051d7e000, page_table = 0xffff88204f698a48
Mar 26 17:41:09 <server> kernel: NVRM: VM: nvidia_exit_module:1127:
0xffff88205233f840, 1 page(s), count = 1, flags = 0x00010015, key =
0x2050863000, page_table = 0xffff88204fdaa6c8

Now I do a modprobe, and get:
Mar 26 17:42:34 <server> kernel: nvidia 0000:05:00.0: PCI INT A -> GSI 48
(level, low) -> IRQ 48
Mar 26 17:42:34 <server> kernel: nvidia 0000:42:00.0: PCI INT A -> GSI 72
(level, low) -> IRQ 72
Mar 26 17:42:34 <server> kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel
Module  325.15  Wed Jul 31 18:50:56 PDT 2013

Hmmm, a little better, and there may be a library thing here... my
enum_gpu now comes back with "invalid device ordinal" rather than the code
error, but this is in messages:
Mar 26 17:43:29 <server> kernel: NVRM: RmInitAdapter failed! (0x25:0x48:1157)
Mar 26 17:43:29 <server> kernel: NVRM: rm_init_adapter(0) failed

Oh, that's right: the last thing I did was a find on all the .ko's; I
noticed all of them were executable by user (root) and this one was
*not*, so I chmod'ed it....
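
(Rather than guessing at what the mode should be, I'll let rpm tell me
whether I mangled anything - something like this, with the caveat that I'm
not sure offhand where elrepo's kmod package drops nvidia.ko:)

rpm -V kmod-nvidia    # an 'M' in the output means a file's mode differs from the package
find /lib/modules/$(uname -r) -name 'nvidia.ko' -exec ls -l {} \;
# and if I did break it, chmod it back to match the rest of the .ko's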

I'll pick this up tomorrow. If anyone's got a clue, *please*....

      mark "GPUs, why did it have to be GPU's...?"