It has been a few days, so I am sending this again in case someone has seen this problem or has a suggestion of where to look, and why these settings might not be taking effect with 5.2 when they did with 5.1.

On Mon, Aug 4, 2008 at 2:00 PM, Rob Lines <rlinesseagate at gmail.com> wrote:
> We were previously running 5.1 x86_64 and recently updated to 5.2
> using yum. Under 5.1 we were having problems when running jobs using
> torque and the solution had been to add the following items to the
> files noted
>
> "* soft memlock unlimited" in /etc/security/limits.conf
> "session required pam_limits.so" in /etc/pam.d/{rsh,sshd}
>
> This changed the max locked memory setting in ulimit as follows:
>
> Before the change
> rsh nodeX ulimit -a
> still gives us
> max locked memory       (kbytes, -l) 32
>
> After the change
> rsh nodeX ulimit -a
> max locked memory       (kbytes, -l) 16505400
>
> The nodes have 16gb of memory.
>
> Now after the 5.2 updates those files are all the same and on most of
> the nodes we haven't yet rebooted them due to long running processes
> but a few nodes have been restarted and now that jobs are starting to
> be put on them we are back to max locked memory of 32k rather than
> 16gb.
>
> The error we are receiving on those jobs is :
>
> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>     This will severely limit memory registrations.
> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>     This will severely limit memory registrations.
> Fatal error in MPI_Init:
> Other MPI error, error stack:
> MPIR_Init_thread(306).......: Initialization failed
> MPID_Init(113)..............: channel initialization failed
> MPIDI_CH3_Init(167).........:
> MPIDI_CH3I_RDMA_init(138)...:
> rdma_setup_startup_ring(333): cannot create cq
> Fatal error in MPI_Init:
> Other MPI error, error stack:
> MPIR_Init_thread(306).......: Initialization failed
> MPID_Init(113)..............: channel initialization failed
> MPIDI_CH3_Init(167).........:
> MPIDI_CH3I_RDMA_init(138)...:
> rdma_setup_startup_ring(333): cannot create cq
> rank 45 in job 1  nodeX_35175   caused collective abort of all ranks
>   exit status of rank 45: return code 1
> rank 44 in job 1  nodeX_35175   caused collective abort of all ranks
>   exit status of rank 44: return code 1
>
>
> The full output of :
>
> rsh nodeX ulimit -a
>
> connect to address x.x.x.x port 544: Connection refused
> Trying krb4 rsh...
> connect to address x.x.x.x port 544: Connection refused
> trying normal rsh (/usr/bin/rsh)
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 135168
> max locked memory       (kbytes, -l) 32
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 1024
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 10240
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 135168
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
>
> Any ideas, suggestions or items I could roll back would be
> appreciated. I looked through the list of packages that were updated
> and the only one that I could see that was related was pam. ssh and
> rsh were not updated.
>
> Thank you,
> Rob
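
In case it helps anyone compare notes, here is a minimal sketch for printing the locked-memory limit from inside a process, so the value a job actually sees can be set against what `rsh nodeX ulimit -a` reports. It assumes Python is available on the nodes; the function names `memlock_limits` and `format_limit` are only illustrative and not part of any tool mentioned above.

```python
import resource


def format_limit(value):
    """Render an rlimit value in kbytes, or 'unlimited', like `ulimit -l`."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    # getrlimit() reports RLIMIT_MEMLOCK in bytes; ulimit -l shows kbytes.
    return str(value // 1024)


def memlock_limits():
    """Return the (soft, hard) RLIMIT_MEMLOCK pair of the current process, in bytes."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print("max locked memory, soft (kbytes): " + format_limit(soft))
    print("max locked memory, hard (kbytes): " + format_limit(hard))
```

Running it once through rsh and once as a job submitted through torque on the same node should show whether the two paths end up with different limits. A soft value of 32 would match the 32768 bytes in the libibverbs warning.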