It has been a few days, so I am sending this again in case someone has seen this problem or has a suggestion about where to look and why these settings might not be taking effect with 5.2 when they did with 5.1.
On Mon, Aug 4, 2008 at 2:00 PM, Rob Lines rlinesseagate@gmail.com wrote:
We were previously running 5.1 x86_64 and recently updated to 5.2 using yum. Under 5.1 we were having problems when running jobs using torque and the solution had been to add the following items to the files noted
"* soft memlock unlimited" in /etc/security/limits.conf
"session required pam_limits.so" in /etc/pam.d/{rsh,sshd}
This changed the max locked memory setting in ulimit as follows:
Before the change, rsh nodeX ulimit -a gave us:
max locked memory (kbytes, -l) 32

After the change, rsh nodeX ulimit -a gave us:
max locked memory (kbytes, -l) 16505400
The nodes have 16gb of memory.
Now, after the 5.2 update, those files are all unchanged. We haven't yet rebooted most of the nodes because of long-running processes, but a few nodes have been restarted, and now that jobs are starting to be placed on them we are back to a max locked memory of 32 kB rather than 16 GB.
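One way to confirm on a rebooted node that both entries really are still in place is sketched below. This is only a rough check under stated assumptions: the helper names has_pam_limits and has_memlock_entry are made up for illustration, and the patterns only match the exact entries quoted above (uncommented, whitespace-separated).

```shell
#!/bin/bash
# Hypothetical helpers: return 0 if the entry is present and not commented out.
has_pam_limits() {
    grep -Eq '^[[:space:]]*session[[:space:]]+required[[:space:]]+pam_limits\.so' "$1"
}

has_memlock_entry() {
    grep -Eq '^[[:space:]]*\*[[:space:]]+soft[[:space:]]+memlock[[:space:]]+unlimited' "$1"
}

# Check the two PAM service files named in the original fix.
for f in /etc/pam.d/rsh /etc/pam.d/sshd; do
    if [ -r "$f" ] && has_pam_limits "$f"; then
        echo "$f: pam_limits.so present"
    else
        echo "$f: pam_limits.so missing or file unreadable"
    fi
done

# Check the memlock entry in limits.conf.
if [ -r /etc/security/limits.conf ] && has_memlock_entry /etc/security/limits.conf; then
    echo "limits.conf: memlock entry present"
else
    echo "limits.conf: memlock entry missing or file unreadable"
fi
```

If both entries are reported present on a node that still shows 32 kbytes, the files themselves are probably not the problem and the difference lies in how they are being applied.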
The error we are receiving on those jobs is:

libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
    This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
    This will severely limit memory registrations.
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(306).......: Initialization failed
MPID_Init(113)..............: channel initialization failed
MPIDI_CH3_Init(167).........:
MPIDI_CH3I_RDMA_init(138)...:
rdma_setup_startup_ring(333): cannot create cq
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(306).......: Initialization failed
MPID_Init(113)..............: channel initialization failed
MPIDI_CH3_Init(167).........:
MPIDI_CH3I_RDMA_init(138)...:
rdma_setup_startup_ring(333): cannot create cq
rank 45 in job 1 nodeX_35175 caused collective abort of all ranks
  exit status of rank 45: return code 1
rank 44 in job 1 nodeX_35175 caused collective abort of all ranks
  exit status of rank 44: return code 1
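The 32768 bytes in the libibverbs warning is the same limit that ulimit reports as 32 kbytes (32 x 1024 = 32768). A minimal sketch for printing the soft and hard locked-memory limits in bytes, so they can be compared directly with the warning, might look like this. The helper name memlock_bytes is made up for illustration, and the script only reports the limits of the shell it runs in.

```shell
#!/bin/bash
# Hypothetical helper: convert a "ulimit -l" value (kbytes, or the word
# "unlimited") into bytes, the unit used in the libibverbs warning.
memlock_bytes() {
    case "$1" in
        unlimited) echo unlimited ;;
        *) echo $(( $1 * 1024 )) ;;
    esac
}

# Soft and hard locked-memory limits of the current shell.
soft=$(ulimit -S -l)
hard=$(ulimit -H -l)

echo "soft memlock: $(memlock_bytes "$soft") bytes"
echo "hard memlock: $(memlock_bytes "$hard") bytes"
```

Running it through the same path the jobs use (rsh, ssh, or from inside a torque job) should show whether the limit differs depending on how the shell was started.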
The full output of:

rsh nodeX ulimit -a

connect to address x.x.x.x port 544: Connection refused
Trying krb4 rsh...
connect to address x.x.x.x port 544: Connection refused
trying normal rsh (/usr/bin/rsh)
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 135168
max locked memory (kbytes, -l) 32
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 135168
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Any ideas, suggestions or items I could roll back would be appreciated. I looked through the list of packages that were updated and the only one that I could see that was related was pam. ssh and rsh were not updated.
Thank you, Rob