[CentOS] pam max locked memory issue after updating to 5.2 and rebooting

Mon Aug 4 18:00:28 UTC 2008
Rob Lines <rlinesseagate at gmail.com>

We were previously running 5.1 x86_64 and recently updated to 5.2
using yum.  Under 5.1 we were having problems when running jobs using
torque, and the solution had been to add the following lines to the
files noted:

"*          soft    memlock         unlimited" in /etc/security/limits.conf
"session    required     pam_limits.so" in /etc/pam.d/{rsh,sshd}

This changed the max locked memory setting in ulimit as follows:

Before the change
rsh nodeX ulimit -a
gave us
max locked memory       (kbytes, -l) 32

After the change
rsh nodeX ulimit -a
max locked memory       (kbytes, -l) 16505400

The nodes have 16 GB of memory.
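
Since "ulimit -a" in bash reports only the soft limits, it may also be
worth comparing the soft and hard values for locked memory.  A minimal
sketch, assuming bash is the shell on the node (show_memlock is just a
helper name made up for this example):

```shell
#!/bin/bash
# Print the soft and hard locked-memory limits of the current shell.
# "ulimit -l" reports the value in kbytes, or the word "unlimited".
show_memlock() {
    echo "soft: $(ulimit -S -l)"
    echo "hard: $(ulimit -H -l)"
}

show_memlock
```

The same two ulimit calls could presumably be run remotely, e.g.
rsh nodeX 'ulimit -S -l; ulimit -H -l', provided the remote login shell
is bash.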

Now, after the 5.2 update, those files are unchanged.  We haven't yet
rebooted most of the nodes because of long-running processes, but a
few nodes have been restarted, and now that jobs are starting to be
placed on them we are back to a max locked memory of 32 kbytes rather
than roughly 16 GB.
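
For reference, a check along these lines can confirm that the entries
are still in place on a node.  This is only a sketch: has_entry is a
hypothetical helper written for this example, and the patterns assume
the lines are written exactly as quoted above (whitespace aside).

```shell
#!/bin/bash
# has_entry FILE REGEX: report whether FILE has a line matching REGEX.
# The patterns below are anchored at the start of the line, so
# commented-out entries (lines starting with "#") do not count.
has_entry() {
    if grep -Eq "$2" "$1" 2>/dev/null; then
        echo "present: $1"
    else
        echo "missing: $1"
    fi
}

has_entry /etc/security/limits.conf \
    '^\*[[:space:]]+soft[[:space:]]+memlock[[:space:]]+unlimited'
for f in /etc/pam.d/rsh /etc/pam.d/sshd; do
    has_entry "$f" '^session[[:space:]]+required[[:space:]]+pam_limits\.so'
done
```

A file that does not exist or cannot be read is reported as "missing"
as well, since grep fails on it.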

The error we are receiving on those jobs is:

libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
    This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
    This will severely limit memory registrations.
Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(306).......: Initialization failed
MPID_Init(113)..............: channel initialization failed
MPIDI_CH3_Init(167).........:
MPIDI_CH3I_RDMA_init(138)...:
rdma_setup_startup_ring(333): cannot create cq
Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(306).......: Initialization failed
MPID_Init(113)..............: channel initialization failed
MPIDI_CH3_Init(167).........:
MPIDI_CH3I_RDMA_init(138)...:
rdma_setup_startup_ring(333): cannot create cq
rank 45 in job 1  nodeX_35175   caused collective abort of all ranks
  exit status of rank 45: return code 1
rank 44 in job 1  nodeX_35175   caused collective abort of all ranks
  exit status of rank 44: return code 1
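
The 32768 bytes in the libibverbs warning is the same limit that
ulimit reports in kbytes.  A quick conversion, assuming 1 kbyte = 1024
bytes (the variable names are only for illustration):

```shell
# RLIMIT_MEMLOCK as reported by libibverbs, in bytes
memlock_bytes=32768
# "ulimit -l" reports kbytes, so divide by 1024
memlock_kbytes=$((memlock_bytes / 1024))
echo "$memlock_kbytes"   # prints 32
```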


The full output of:

rsh nodeX ulimit -a

connect to address x.x.x.x port 544: Connection refused
Trying krb4 rsh...
connect to address x.x.x.x port 544: Connection refused
trying normal rsh (/usr/bin/rsh)
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 135168
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 135168
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited


Any ideas, suggestions or items I could roll back would be
appreciated.  I looked through the list of packages that were updated
and the only one that I could see that was related was pam.  ssh and
rsh were not updated.

Thank you,
Rob