Ghu! My hosting provider is "pushing security updates and enhancements", with the result that half the messages I hit <reply> to come up with nothing.... Can't even remember what I've posted recently....
Anyway, PaulH wrote:
Here's the init script I use on our 3-card CUDA box. In particular, note the mknod stuff, which might be at issue in your situation. (Sorry about line breaks; you may have to guess in a couple spots.)
That's not the problem.

ll /dev/nv*
crw-rw-rw-. 1 root root 195,   0 Mar 26 16:26 /dev/nvidia0
crw-rw-rw-. 1 root root 195,   1 Mar 26 16:26 /dev/nvidia1
crw-rw-rw-. 1 root root 195,   2 Mar 26 16:26 /dev/nvidia2
crw-rw-rw-. 1 root root 195,   3 Mar 26 16:26 /dev/nvidia3
crw-rw-rw-. 1 root root 195,   4 Mar 26 16:26 /dev/nvidia4
crw-rw-rw-. 1 root root 195,   5 Mar 26 16:26 /dev/nvidia5
crw-rw-rw-. 1 root root 195,   6 Mar 26 16:26 /dev/nvidia6
crw-rw-rw-. 1 root root 195,   7 Mar 26 16:26 /dev/nvidia7
crw-rw-rw-. 1 root root 195,   8 Mar 26 16:26 /dev/nvidia8
crw-rw-rw-. 1 root root 195,   9 Mar 26 16:26 /dev/nvidia9
crw-rw-rw-. 1 root root 195, 255 Mar 26 16:26 /dev/nvidiactl
Everything *looks* ok....
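For anyone who'd rather not eyeball a listing like that, here is a rough sketch of the same check. The function name check_nv_nodes is made up for illustration, and it assumes the only nodes present are /dev/nvidiaN (minor N) and /dev/nvidiactl (minor 255) under major 195, as in the listing above.

```shell
# Hypothetical helper (not part of any NVIDIA package): reads `ls -l` output
# for /dev/nvidia* on stdin and prints any node that is not a character
# device, is not major 195, or whose minor does not match its name.
check_nv_nodes() {
    awk '
    {
        name  = $NF                      # last field is the device path
        major = $5; sub(/,$/, "", major) # "195," -> "195"
        minor = $6
        # Assumption: nvidiactl uses minor 255, nvidiaN uses minor N.
        if (name == "/dev/nvidiactl")
            want = 255
        else
            want = substr(name, length("/dev/nvidia") + 1)
        if (substr($1, 1, 1) != "c" || major + 0 != 195 || minor + 0 != want + 0)
            print "suspect: " name " (" major ", " minor ")"
    }'
}
```

Something like `ls -l /dev/nvidia* | check_nv_nodes` should print nothing if the nodes are sane, which only confirms what the listing already suggests: the device files are probably not where the trouble is.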
Here's the end of the day hole in the wall (where I'm beating my head). I rebooted to the previous kernel, 2.6.32-358.18.1.el6.x86_64, which is identical to what's running on the other box (the one I never rebooted).
On the other box, enum_gpu was failing. I d/l kmod-nvidia-325.15-1.el6.elrepo.x86_64.rpm and nvidia-x11-drv-325.15-1.el6.elrepo.x86_64.rpm, did a yum downgrade to those two... and everything was wonderful. I gave my user the box back.
This one... rebooted to the same old kernel, did a yum reinstall (I'd tried installing under the new kernel, so I was trying to replicate what I'd done on the other box). Everything looks good... but I'm still at enum_gpu failing, asserting a code error, which I figure means it's not seeing the GPUs. (The cards are different: this box has M2090s, and the other has K20Cs.) But they were both working before the updates yesterday....
Identical software... but different results.
rmmod nvidia results in

Mar 26 17:41:09 <server> kernel: nvidia 0000:05:00.0: PCI INT A disabled
Mar 26 17:41:09 <server> kernel: nvidia 0000:42:00.0: PCI INT A disabled
Mar 26 17:41:09 <server> kernel: NVRM: VM: nvidia_exit_module:1127: 0xffff882051e7d7c0, 20 page(s), count = 1, flags = 0x00010011, key = 0x204eca1000, page_table = 0xffff882051e20548
Mar 26 17:41:09 <server> kernel: NVRM: VM: nvidia_exit_module:1127: 0xffff88204f683dc0, 1 page(s), count = 1, flags = 0x00010015, key = 0x204fc34000, page_table = 0xffff88204f698aa8
Mar 26 17:41:09 <server> kernel: NVRM: VM: nvidia_exit_module:1127: 0xffff88205233fd40, 1 page(s), count = 1, flags = 0x00000015, key = 0x2051d7e000, page_table = 0xffff88204f698a48
Mar 26 17:41:09 <server> kernel: NVRM: VM: nvidia_exit_module:1127: 0xffff88205233f840, 1 page(s), count = 1, flags = 0x00010015, key = 0x2050863000, page_table = 0xffff88204fdaa6c8
Now I do a modprobe, and get:

Mar 26 17:42:34 <server> kernel: nvidia 0000:05:00.0: PCI INT A -> GSI 48 (level, low) -> IRQ 48
Mar 26 17:42:34 <server> kernel: nvidia 0000:42:00.0: PCI INT A -> GSI 72 (level, low) -> IRQ 72
Mar 26 17:42:34 <server> kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 325.15 Wed Jul 31 18:50:56 PDT 2013
Hmmm, a little better, and there may be a library thing here... my enum_gpu came back with "invalid device ordinal", not the code error, but this is in messages:

Mar 26 17:43:29 <server> kernel: NVRM: RmInitAdapter failed! (0x25:0x48:1157)
Mar 26 17:43:29 <server> kernel: NVRM: rm_init_adapter(0) failed
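If the "library thing" means a mismatch between the loaded kernel module and the userspace pieces, one way to chase it is to pull the module version and any later NVRM failures straight out of the log. This is only a sketch: both function names are invented here, and the patterns assume the log lines look like the ones pasted above.

```shell
# Hypothetical helper: print the driver version from the most recent
# "NVRM: loading NVIDIA UNIX ... Kernel Module <version>" line on stdin.
nvrm_loaded_version() {
    sed -n 's/.*NVRM: loading NVIDIA UNIX [^ ]* Kernel Module[[:space:]]*\([0-9][0-9.]*\).*/\1/p' |
        tail -n 1
}

# Hypothetical helper: print only the "NVRM: ... failed" lines that appear
# after the most recent module load, so stale failures are ignored.
nvrm_failures_since_load() {
    awk '
    /NVRM: loading NVIDIA/ { n = 0; next }     # new load: forget older failures
    /NVRM: .*failed/       { fail[++n] = $0 }  # remember failures since then
    END { for (i = 1; i <= n; i++) print fail[i] }'
}
```

Fed from the system log (for example `nvrm_loaded_version < /var/log/messages`), the first result can be compared by hand against `rpm -q kmod-nvidia nvidia-x11-drv` and `modinfo -F version nvidia`. If all three agree on 325.15, a version mismatch looks less likely and the RmInitAdapter failure probably needs another explanation.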
oh, that's right: the last thing I did, after doing a find on all the .ko's, was notice that all of them were executable by user (root) and this one was *not*, so I chmod'ed it....
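That find can be turned around to list only the odd ones out. A minimal sketch, with a made-up function name and the directory to search passed in as the argument:

```shell
# Hypothetical helper: list regular *.ko files under directory $1 that are
# missing the owner-execute permission bit.
ko_not_user_exec() {
    find "$1" -type f -name '*.ko' ! -perm -u=x
}
```

Run as `ko_not_user_exec "/lib/modules/$(uname -r)"`, it should show just the module(s) whose permissions differ from the rest, which makes it easy to see whether the chmod changed anything.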
I'll pick this up tomorrow. If anyone's got a clue, *please*....
mark "GPUs, why did it have to be GPU's...?"