Got an HBS (y'know, Honkin' Big Server, one o' them technical terms), a Dell 720 with two Tesla GPUs. I updated the o/s to 6.5, and I cannot get the GPUs recognized. As a last resort, I d/l'ed NVidia's proprietary driver/installer, 325, and it builds fine... I've yum removed the kmod-nvidia I had on the system, nouveau is blacklisted, and when I reboot, lsmod shows me nvidia loaded, which modinfo tells me looks like the one I built... but enum_gpu, which is from a CUDA group, builds... but can't enumerate the GPUs (which is how we wake them up for the users). I see the /dev/nvidia* nodes, and they're a+r, a+w... Oh, and selinux is permissive.
Anyone got a clue? If I can't get this working, I'm going to have to downgrade the system several kernels.
mark
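For concreteness, the checks described above boil down to a handful of standard commands; nothing here is specific to the proprietary installer:

    lsmod | grep nvidia         # is the nvidia module actually loaded?
    modinfo -F version nvidia   # which driver build does the kernel see?
    ls -l /dev/nvidia*          # device nodes present, a+r/a+w?
    getenforce                  # confirm SELinux really is Permissive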
On 03/25/2014 10:36 PM, m.roth@5-cent.us wrote:
Got an HBS, a Dell 720 with two Tesla GPUs. ... Anyone got a clue? If I can't get this working, I'm going to have to downgrade the system several kernels.
Elrepo kmod drivers are not an option? Run nvidia-detect first, then install the packages it selects...
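A sketch of that route, assuming the elrepo repository is already enabled on the box; nvidia-detect is ELRepo's own helper:

    yum install nvidia-detect
    nvidia-detect               # prints the driver package it recommends
    yum install kmod-nvidia     # or whichever package nvidia-detect named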
On 03/25/14 19:31, Ljubomir Ljubojevic wrote:
...
Elrepo kmod drivers are not an option? Run nvidia-detect first, then install the packages it selects...
I had kmod-nvidia from elrepo - as I said, as a last resort I yum removed it (along with nvidia-x11-drv, a dependency) and built the proprietary driver, trying to eliminate all interactions.
mark
On 03/25/2014 04:36 PM, m.roth@5-cent.us wrote:
Got an HBS, a Dell 720 with two Tesla GPUs. ... Anyone got a clue? If I can't get this working, I'm going to have to downgrade the system several kernels.
Do you have an /etc/X11/xorg.conf file, or something in /etc/X11/xorg.conf.d/, that actually names nvidia and not nv as the driver?
On 03/26/14 03:01, Johnny Hughes wrote:
...
Do you have an /etc/X11/xorg.conf file, or something in /etc/X11/xorg.conf.d/, that actually names nvidia and not nv as the driver?
Nope - nothing there.
mark
On 03/26/2014 07:01 AM, mark wrote:
On 03/26/14 03:01, Johnny Hughes wrote:
...
Do you have an /etc/X11/xorg.conf file, or something in /etc/X11/xorg.conf.d/, that actually names nvidia and not nv as the driver?
Nope - nothing there.
When you run the ./NVIDIA-<version> installer to build the driver, one of the last steps asks whether to "automatically update your X configuration file". Select yes for that and it should create an xorg.conf file that will use the nvidia driver.
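For a scripted version of the same thing, something like the following should work - the -a/-s flags are the ones mark mentions below, and nvidia-xconfig ships with the driver, but check ./NVIDIA-<version>.run --help on your copy:

    # accept the license and run the installer non-interactively
    sh ./NVIDIA-Linux-x86_64-325.15.run -a -s
    # generate an xorg.conf that names the nvidia driver, if you want one
    nvidia-xconfig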
Johnny Hughes wrote:
...
When you run the ./NVIDIA-<version> installer to build the driver, one of the last steps asks whether to "automatically update your X configuration file". Select yes for that and it should create an xorg.conf file that will use the nvidia driver.
a) I didn't have that before - did kmod-nvidia handle loading the correct driver *without* an xorg.conf? b) Do you think it'll do the right thing? This *is* a headless server.
And a general question: what *does* kmod-nvidia do? Is it different from, say, setting a flag, or a script that notices you're booting a new kernel and runs the proprietary installer with -a -s?
mark
On 03/26/2014 08:14 AM, m.roth@5-cent.us wrote:
...
a) I didn't have that before - did kmod-nvidia handle loading the correct driver *without* an xorg.conf? b) Do you think it'll do the right thing? This *is* a headless server.
And a general question: what *does* kmod-nvidia do? Is it different from, say, setting a flag, or a script that notices you're booting a new kernel and runs the proprietary installer with -a -s?
Are you connecting to the server to do X related things remotely ... and therefore need NVIDIA drivers for that?
I'll let one of the elrepo guys explain their RPM.
Johnny Hughes wrote:
...
Are you connecting to the server to do X related things remotely ... and therefore need NVIDIA drivers for that?
I think you missed that part of my original post: no X. This box has two Tesla GPUs, and my users are using them for heavy-duty scientific computing.... And my problem is that neither their programs nor the utility I use (which I *think* is part of the CUDA toolkit - I didn't set that part up) can enumerate them... meaning they can't see or use the GPUs.
I'll let one of the elrepo guys explain their RPM.
Fair 'nough. I just threw that out as a general question, not expecting it was yours to answer.
mark
On 03/26/2014 03:40 PM, m.roth@5-cent.us wrote:
...
I think you missed that part of my original post: no X. This box has two Tesla GPUs, and my users are using them for heavy-duty scientific computing.... Neither their programs nor the utility I use (which I *think* is part of the CUDA toolkit) can enumerate them... meaning they can't see or use the GPUs.
Try installing the CUDA Toolkit (https://developer.nvidia.com/cuda-downloads). From their FAQ:

Q: Will the installer replace the driver currently installed on my system?
A: The installer will provide an option to install the included driver, and if selected, it will replace the driver currently on your system.
Lec
On Wed, Mar 26, 2014 at 09:40:17AM -0400, m.roth@5-cent.us wrote:
Johnny Hughes wrote:
...
Are you connecting to the server to do X related things remotely ... and therefore need NVIDIA drivers for that?
I think you missed that part of my original post: no X. This box has two Tesla GPUs, and my users are using them for heavy duty scientific computing....
afaik, in order to use your Tesla cards, you need to have the nvidia driver loaded, but ymmv.
[tru@visu1 build]$ uname -a
Linux visu1.erc 2.6.32-431.5.1.el6.x86_64 #1 SMP Wed Feb 12 00:41:43 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
[tru@visu1 build]$ type nvidia-smi
nvidia-smi is hashed (/usr/bin/nvidia-smi)
[tru@visu1 build]$ rpm -qf /usr/bin/nvidia-smi
nvidia-x11-drv-325.15-1.el6.elrepo.x86_64
[tru@visu1 build]$ nvidia-smi
Wed Mar 26 14:54:44 2014
+------------------------------------------------------+
| NVIDIA-SMI 5.325.15   Driver Version: 325.15         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20Xm         Off  | 0000:82:00.0     Off |                    0 |
| N/A   39C    P0    75W / 235W |    380MB /  5759MB   |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0      9767  /c6/shared/NAMD/2.8/x86_64-CUDA/namd2                  88MB |
|    0      9766  /c6/shared/NAMD/2.8/x86_64-CUDA/namd2                  87MB |
|    0      9768  /c6/shared/NAMD/2.8/x86_64-CUDA/namd2                  89MB |
|    0      9765  /c6/shared/NAMD/2.8/x86_64-CUDA/namd2                  88MB |
+-----------------------------------------------------------------------------+
[tru@visu1 build]$ rpm -qa kmod-nvidia*
kmod-nvidia-325.15-1.el6.elrepo.x86_64
[tru@visu1 build]$ lsmod | grep nvidia
nvidia               9357435  80
i2c_core               31084  2 nvidia,i2c_i801
Tru
Tru Huynh wrote:
On Wed, Mar 26, 2014 at 09:40:17AM -0400, m.roth@5-cent.us wrote:
Johnny Hughes wrote:
...
Are you connecting to the server to do X related things remotely ... and therefore need NVIDIA drivers for that?
I think you missed that part of my original post: no X. This box has two Tesla GPUs, and my users are using them for heavy duty scientific computing....
afaik, in order to use your Tesla cards, you need to have the nvidia driver loaded, but ymmv.
I am aware of that. Here's the latest in my fight: I've got one server, which I'd updated but not rebooted, and it's still on kernel 358.18. I yum downgraded kmod-nvidia and nvidia-x11-drv to what it had been running, 325.15-1, and it's happy as a clam (after I reloaded the nvidia driver). BUT, I note that modinfo shows /lib/modules/2.6.32-358.18.1.el6.x86_64/weak-updates/nvidia/nvidia.ko, which is a link to /lib/modules/2.6.32-358.el6.x86_64/extra/nvidia/nvidia.ko. I find this... odd.
Now, on the other server, the one that's been rebooted to the new 431.5.1 kernel and that I'm still fighting, I did the same... and I see /lib/modules/2.6.32-431.5.1.el6.x86_64/weak-updates/nvidia/nvidia.ko -> /lib/modules/2.6.32-358.el6.x86_64/extra/nvidia/nvidia.ko
THAT does not look right at all.
dmesg shows:
NVRM: loading NVIDIA UNIX x86_64 Kernel Module  325.15  Wed Jul 31 18:50:56 PDT 2013
nvidia 0000:05:00.0: irq 113 for MSI/MSI-X
NVRM: RmInitAdapter failed! (0x25:0x48:1157)
NVRM: rm_init_adapter(0) failed
nvidia 0000:05:00.0: irq 113 for MSI/MSI-X
NVRM: RmInitAdapter failed! (0x25:0x48:1157)
NVRM: rm_init_adapter(0) failed
At least I've got one box back. As a last resort, I can reboot to the older kernel and see if that works with this version of kmod-nvidia, but I'd *REALLY* like to have the new kernel.
mark
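Aside: that symlink layout is how kABI-tracking kmods are supposed to work - the module is built once against the base 2.6.32-358 kernel and symlinked into each newer kernel's weak-updates directory, so a link pointing back at .../2.6.32-358.el6.x86_64/extra/... is expected. If a link is stale or broken, one way to rebuild things using only standard tools (verify the paths on your own box):

    ls -l /lib/modules/$(uname -r)/weak-updates/nvidia/   # what will this kernel load?
    rm -f /lib/modules/$(uname -r)/weak-updates/nvidia/nvidia.ko
    depmod -a
    yum reinstall kmod-nvidia    # let the package scripts recreate the links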
Mark,
Here's the init script I use on our 3-card CUDA box. In particular, note the mknod stuff, which might be at issue in your situation. (Sorry about line breaks; you may have to guess in a couple spots.)
----- %< -----
#!/bin/bash
#
# Startup/shutdown script for nVidia CUDA
#
# chkconfig: 345 80 20
# description: Startup/shutdown script for nVidia CUDA
#
# =====================================================

# Source function library.
. /etc/init.d/functions

DRIVER=nvidia
RETVAL=0

# Create /dev nodes for nvidia devices
function createnodes() {
    # Count the number of NVIDIA controllers found.
    N3D=$(/sbin/lspci | grep -i NVIDIA | grep "3D controller" | wc -l)
    NVGA=$(/sbin/lspci | grep -i NVIDIA | grep "VGA compatible controller" | wc -l)

    N=$(expr $N3D + $NVGA - 1)
    for i in $(seq 0 $N); do
        mknod -m 666 /dev/nvidia$i c 195 $i
        RETVAL=$?
        [ "$RETVAL" = 0 ] || exit $RETVAL
    done

    mknod -m 666 /dev/nvidiactl c 195 255
    RETVAL=$?
    [ "$RETVAL" = 0 ] || exit $RETVAL
}

# Remove /dev nodes for nvidia devices
function removenodes() {
    rm -f /dev/nvidia*
}

# Start daemon
function start() {
    echo -n $"Loading $DRIVER kernel module: "
    modprobe $DRIVER && success || failure
    RETVAL=$?
    echo
    [ "$RETVAL" = 0 ] || exit $RETVAL

    echo -n $"Initializing CUDA /dev entries: "
    createnodes && success || failure
    RETVAL=$?
    echo
    [ "$RETVAL" = 0 ] || exit $RETVAL

    # this can fail without stopping the entire script
    echo -n $"Setting persistence mode: "
    /usr/bin/nvidia-smi -pm 1 && success || failure
}

# Stop daemon
function stop() {
    echo -n $"Unloading $DRIVER kernel module: "
    rmmod -f $DRIVER && success || failure
    RETVAL=$?
    echo
    [ "$RETVAL" = 0 ] || exit $RETVAL

    echo -n $"Removing CUDA /dev entries: "
    removenodes && success || failure
    RETVAL=$?
    echo
    [ "$RETVAL" = 0 ] || exit $RETVAL
}

# See how we were called
case "$1" in
    start)
        start
        ;;
    stop)
        stop
        ;;
    restart)
        stop
        start
        ;;
    *)
        echo $"Usage: $0 {start|stop|restart}"
        RETVAL=1
esac
exit $RETVAL
----- %< -----
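Usage, assuming you save it as /etc/init.d/nvidia (the script name is arbitrary) - the "# chkconfig: 345 80 20" header above is what chkconfig reads:

    chmod +x /etc/init.d/nvidia
    chkconfig --add nvidia
    chkconfig nvidia on
    service nvidia start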
Ghu! My hosting provider is "pushing security updates and enhancements", with the result that half the messages I hit <reply> to come up empty.... I can't even remember what I've posted recently....
Anyway, PaulH wrote:
Here's the init script I use on our 3-card CUDA box. In particular, note the mknod stuff, which might be at issue in your situation. (Sorry about line breaks; you may have to guess in a couple spots.)
That's not the problem:

ll /dev/nv*
crw-rw-rw-. 1 root root 195,   0 Mar 26 16:26 /dev/nvidia0
crw-rw-rw-. 1 root root 195,   1 Mar 26 16:26 /dev/nvidia1
crw-rw-rw-. 1 root root 195,   2 Mar 26 16:26 /dev/nvidia2
crw-rw-rw-. 1 root root 195,   3 Mar 26 16:26 /dev/nvidia3
crw-rw-rw-. 1 root root 195,   4 Mar 26 16:26 /dev/nvidia4
crw-rw-rw-. 1 root root 195,   5 Mar 26 16:26 /dev/nvidia5
crw-rw-rw-. 1 root root 195,   6 Mar 26 16:26 /dev/nvidia6
crw-rw-rw-. 1 root root 195,   7 Mar 26 16:26 /dev/nvidia7
crw-rw-rw-. 1 root root 195,   8 Mar 26 16:26 /dev/nvidia8
crw-rw-rw-. 1 root root 195,   9 Mar 26 16:26 /dev/nvidia9
crw-rw-rw-. 1 root root 195, 255 Mar 26 16:26 /dev/nvidiactl
Everything *looks* ok....
Here's the end-of-day hole in the wall (where I'm beating my head). I rebooted to the previous kernel, 2.6.32-358.18.1.el6.x86_64, which is identical to what's running on the other box (the one I never rebooted).
On the other box, enum_gpu was failing. I d/l'ed kmod-nvidia-325.15-1.el6.elrepo.x86_64.rpm and nvidia-x11-drv-325.15-1.el6.elrepo.x86_64.rpm, did a yum downgrade to those two... and everything was wonderful. I gave my user the box back.
This one... I rebooted to the same old kernel and did a yum reinstall (I'd tried installing under the new kernel, so I was trying to replicate what I'd done on the other box). Everything looks good... but enum_gpu is still failing, asserting a code error, which I figure means it's not seeing the GPUs. (The GPUs are different, though - this box has M2090s, and the other K20Cs.) But both boxes were working before the updates yesterday....
Identical software... but different results.
rmmod nvidia results in:
Mar 26 17:41:09 <server> kernel: nvidia 0000:05:00.0: PCI INT A disabled
Mar 26 17:41:09 <server> kernel: nvidia 0000:42:00.0: PCI INT A disabled
Mar 26 17:41:09 <server> kernel: NVRM: VM: nvidia_exit_module:1127: 0xffff882051e7d7c0, 20 page(s), count = 1, flags = 0x00010011, key = 0x204eca1000, page_table = 0xffff882051e20548
Mar 26 17:41:09 <server> kernel: NVRM: VM: nvidia_exit_module:1127: 0xffff88204f683dc0, 1 page(s), count = 1, flags = 0x00010015, key = 0x204fc34000, page_table = 0xffff88204f698aa8
Mar 26 17:41:09 <server> kernel: NVRM: VM: nvidia_exit_module:1127: 0xffff88205233fd40, 1 page(s), count = 1, flags = 0x00000015, key = 0x2051d7e000, page_table = 0xffff88204f698a48
Mar 26 17:41:09 <server> kernel: NVRM: VM: nvidia_exit_module:1127: 0xffff88205233f840, 1 page(s), count = 1, flags = 0x00010015, key = 0x2050863000, page_table = 0xffff88204fdaa6c8
Now I do a modprobe, and get:
Mar 26 17:42:34 <server> kernel: nvidia 0000:05:00.0: PCI INT A -> GSI 48 (level, low) -> IRQ 48
Mar 26 17:42:34 <server> kernel: nvidia 0000:42:00.0: PCI INT A -> GSI 72 (level, low) -> IRQ 72
Mar 26 17:42:34 <server> kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  325.15  Wed Jul 31 18:50:56 PDT 2013
Hmmm, a little better, and there may be a library thing here... my enum_gpu came back with "invalid device ordinal", not the code error, but this is in messages:
Mar 26 17:43:29 <server> kernel: NVRM: RmInitAdapter failed! (0x25:0x48:1157)
Mar 26 17:43:29 <server> kernel: NVRM: rm_init_adapter(0) failed
Oh, that's right: the last thing I did was a find on all the .ko's; I noticed that all of them were executable by user (root) and this one was *not*, so I chmod'ed it....
I'll pick this up tomorrow. If anyone's got a clue, *please*....
mark "GPUs, why did it have to be GPU's...?"
On Wed, 26 Mar 2014, m.roth@5-cent.us wrote:
Mar 26 17:42:34 <server> kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 325.15 Wed Jul 31 18:50:56 PDT 2013
I'll note for the record that I had trouble with module versions greater than 319; I downgraded to 319.82 and have had much more luck.
Our CUDA machine has two M2090s and one K20c, so both product lines work here.
This is all CentOS 6.5, kernel 2.6.32-431.5.1.el6.x86_64.
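The downgrade itself is nothing exotic - yum can pin a version directly. The exact 319.82 package names here are assumed from ELRepo's usual naming; adjust to what your repo actually carries:

    yum downgrade kmod-nvidia-319.82 nvidia-x11-drv-319.82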
On 03/26/2014 03:40 PM, m.roth@5-cent.us wrote:
I think you missed that part of my original post: no X. This box has two Tesla GPUs, and my users are using them for heavy-duty scientific computing.... Neither their programs nor the utility I use can enumerate them... meaning they can't see or use the GPUs.
What is the error? For example, if I run "CUDA Device Query" (an example from the CUDA toolkit), I get the following error when the kernel module is not the version needed by the compiled CUDA program (CUDA toolkit 5.5 with nvidia kernel module 310.19, installed from nvidia.com):
# ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL
Alexandru Chiscan wrote:
...
What is the error?
I'm not sure what he's getting, but if I run enum_gpu, I get "invalid device ordinal" in enum_gpu.cu at line 23, which seems to be this line in the code (CUDA by Example, chapter 3?): HANDLE_ERROR( cudaGetDeviceCount( &count ) );
mark
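For anyone following along, the failing call boils down to something like the sketch below. This is not the actual enum_gpu source from "CUDA by Example", just a minimal stand-in; error 35 is cudaErrorInsufficientDriver, matching the deviceQuery output above, and "invalid device ordinal" is the string cudaErrorInvalidDevice prints.

    // enum_sketch.cu - build with: nvcc enum_sketch.cu -o enum_sketch
    #include <cstdio>
    #include <cuda_runtime.h>

    int main(void) {
        int count = 0;
        cudaError_t err = cudaGetDeviceCount(&count);
        if (err != cudaSuccess) {
            // e.g. 35 = cudaErrorInsufficientDriver: the CUDA runtime is
            // newer than the kernel driver that's actually loaded
            fprintf(stderr, "cudaGetDeviceCount failed: %s (%d)\n",
                    cudaGetErrorString(err), (int)err);
            return 1;
        }
        printf("%d CUDA device(s) visible\n", count);
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            // cudaErrorInvalidDevice ("invalid device ordinal") comes back
            // when the index doesn't correspond to a usable device
            if (cudaGetDeviceProperties(&prop, i) == cudaSuccess)
                printf("  device %d: %s (compute %d.%d)\n",
                       i, prop.name, prop.major, prop.minor);
        }
        return 0;
    }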