Hello,
I have several systems which I recently updated with
yum -y update
to all the latest packages. These systems use yum-priorities with the CentOS (priority 1), EPEL (priority 5), and rpmforge (priority 10) repositories. After the updates, dhcpd stopped working with a SIGPIPE error, which occurs shortly after it attempts to fork into the background. I worked around that problem by building a new server with no additional repos, only CentOS, and dhcpd works fine on that system. Since then I have found the same problem, or similar problems, with a few more applications. Here is the tail of an strace of pbs_mom as it attempts to fork into the background:
listen(5, 512) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 6
setsockopt(6, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(6, {sa_family=AF_INET, sin_port=htons(15003), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
listen(6, 512) = 0
fcntl(4, F_SETLK, {type=F_UNLCK, whence=SEEK_SET, start=0, len=0}) = 0
clone(Process 23938 attached (waiting for parent)
Process 23938 resumed (parent 23937 ready)
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2aaaaad30db0) = 23938
[pid 23937] exit_group(0) = ?
getsockname(3, 0x7fff6b7728a0, [128]) = -1 ENOTSOCK (Socket operation on non-socket)
fcntl(3, F_GETFD) = 0
dup(3) = 7
fcntl(7, F_SETFD, 0) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 8
close(3) = 0
fcntl(8, F_GETFD) = 0
dup2(8, 3) = 3
fcntl(3, F_SETFD, 0) = 0
close(8) = 0
write(3, "\25\3\1\0\22\334\362\36\233\253\205\2633\323\322q\4\3T\rxK\210", 23) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
Process 23938 detached
This is pretty much the same thing that happened to dhcpd. In both cases the applications work fine in debug mode, when they don't attempt to fork, but quietly die when run normally. A third set of apps, wrappers for the client part of Torque (pbs_mom), do this:
stat("/usr/local/sbin/pbs_iff", {st_mode=S_IFREG|S_ISUID|0755, st_size=21412, ...}) = 0
pipe([5, 6]) = 0
clone(Process 24068 attached (waiting for parent)
Process 24068 resumed (parent 24067 ready)
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2aaaaad31ce0) = 24068
[pid 24067] close(6) = 0
[pid 24067] fcntl(5, F_GETFL) = 0 (flags O_RDONLY)
[pid 24067] read(5, <unfinished ...>
[pid 24068] getsockname(3, {sa_family=AF_INET, sin_port=htons(41855), sin_addr=inet_addr("129.123.148.49")}, [1164321820984213520]) = 0
[pid 24068] getpeername(3, {sa_family=AF_INET, sin_port=htons(636), sin_addr=inet_addr("129.123.20.92")}, [68719476752]) = 0
[pid 24068] fcntl(3, F_GETFD) = 0x1 (flags FD_CLOEXEC)
[pid 24068] dup(3) = 7
[pid 24068] fcntl(7, F_SETFD, FD_CLOEXEC) = 0
[pid 24068] socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 8
[pid 24068] close(3) = 0
[pid 24068] fcntl(8, F_GETFD) = 0
[pid 24068] dup2(8, 3) = 3
[pid 24068] fcntl(3, F_SETFD, 0) = 0
[pid 24068] close(8) = 0
[pid 24068] write(3, "\25\3\1\0\22\346h\357n\r\17x\374B\312\217\374x\276\311\217\342%", 23) = -1 EPIPE (Broken pipe)
[pid 24068] --- SIGPIPE (Broken pipe) @ 0 (0) ---
Process 24068 detached
<... read resumed> "", 4) = 0
--- SIGCHLD (Child exited) @ 0 (0) ---
close(5) = 0
wait4(24068, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGPIPE}], 0, NULL) = 24068
close(4) = 0
write(2, "No Permission.\n", 15No Permission.
) = 15
write(2, "qstat: cannot connect to server "..., 63qstat: cannot connect to server moab.hpc.usu.edu (errno=15007)
) = 63
exit_group(-1) = ?
Once again, the app dies after it attempts to fork into the background. There are other things running on these systems that can successfully fork, and I have been unable to figure out any pattern, other than that if I don't use additional repos it doesn't seem to break. That may be coincidental, though; I haven't repeated it enough yet to be certain.
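In both traces the child's last steps are socket(), dup2() onto the old descriptor number, and then write(), which fails with EPIPE. A minimal Python sketch of just that last step, assuming Linux behaviour for TCP sockets (the function name is made up for illustration): sending on a stream socket that was never connected.

```python
import errno
import socket


def write_to_unconnected_socket(payload: bytes = b"x") -> str:
    """Send on a fresh, never-connected TCP socket and return the errno name.

    CPython ignores SIGPIPE at startup, so the failure surfaces here as an
    OSError instead of killing the process the way it kills the daemons in
    the traces.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.send(payload)
    except OSError as exc:
        # Map the numeric errno to its symbolic name, e.g. 32 -> "EPIPE".
        return errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        sock.close()
    return "OK"


if __name__ == "__main__":
    print(write_to_unconnected_socket())
```

On Linux this should report EPIPE, matching the write() lines above; other platforms may report ENOTCONN instead. It only reproduces the symptom and says nothing about which library performs the descriptor swap.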
Any hints or suggestions would be appreciated. Unfortunately I noticed this after deciding it was "safe" to update *all* my machines and so I'm suffering through a lot of rebuilds/restores because of this.
Thanks,
jbh
On Sun, Jul 6, 2008 at 7:44 PM, John Hanks griznog@gmail.com wrote:
Just found yet another system demonstrating pipe-related weirdness. Here's the tail of an strace where this app (qsub, another part of Torque) hangs after the SIGPIPE:
write(5, "\3\34\177\25\4\32", 6) = 6
write(5, "WINSIZE 36,137,822,504\0\0R(A\240:\0\0\0"..., 80) = 80
write(1, "qsub: job 7.jobs.hpc.usu.edu rea"..., 36qsub: job 7.jobs.hpc.usu.edu ready
) = 36
rt_sigaction(SIGINT, {SIG_IGN}, NULL, 8) = 0
rt_sigaction(SIGTERM, {SIG_IGN}, NULL, 8) = 0
rt_sigaction(SIGALRM, {SIG_IGN}, NULL, 8) = 0
rt_sigaction(SIGTSTP, {SIG_IGN}, NULL, 8) = 0
clone(Process 3149 attached (waiting for parent)
Process 3149 resumed (parent 3143 ready)
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b43da67f770) = 3149
[pid 3149] getsockname(4, <unfinished ...>
[pid 3143] rt_sigaction(SIGCHLD, {0x402c00, [], SA_RESTORER, 0x3aa08301b0}, NULL, 8) = 0
[pid 3143] fcntl(0, F_GETFL) = 0x8002 (flags O_RDWR|O_LARGEFILE)
[pid 3143] read(0, <unfinished ...>
[pid 3149] <... getsockname resumed> {sa_family=AF_INET, sin_port=htons(52700), sin_addr=inet_addr("129.123.148.50")}, [16]) = 0
[pid 3149] getpeername(4, {sa_family=AF_INET, sin_port=htons(636), sin_addr=inet_addr("129.123.20.92")}, [68719476752]) = 0
[pid 3149] fcntl(4, F_GETFD) = 0x1 (flags FD_CLOEXEC)
[pid 3149] dup(4) = 6
[pid 3149] fcntl(6, F_SETFD, FD_CLOEXEC) = 0
[pid 3149] socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 7
[pid 3149] close(4) = 0
[pid 3149] fcntl(7, F_GETFD) = 0
[pid 3149] dup2(7, 4) = 4
[pid 3149] fcntl(4, F_SETFD, 0) = 0
[pid 3149] close(7) = 0
[pid 3149] write(4, "\25\3\1\0\22%\341U\3202\323i\207\240Z\220iTL\202'\264\t", 23) = -1 EPIPE (Broken pipe)
[pid 3149] --- SIGPIPE (Broken pipe) @ 0 (0) ---
Process 3149 detached
<... read resumed> 0x7fffd0682aff, 1) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGPIPE}], WNOHANG|WSTOPPED, NULL) = 3149
kill(3149, SIGTERM) = -1 ESRCH (No such process)
ioctl(0, SNDCTL_TMR_START or TCSETS, {B9600 opost isig icanon echo ...}) = 0
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B9600 opost isig icanon echo ...}) = 0
exit_group(0) = ?
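The 23 bytes in the failing write() can be looked at more closely. The sketch below is an interpretation, not something stated in the thread: it assumes the payload is a TLS record and unpacks the standard 5-byte record header (content type, version, length). The helper name and the content-type table are chosen here for illustration; the payload is copied from the qsub trace as strace printed it.

```python
import struct

# Payload of the failing write() in the qsub trace, copied verbatim.
# strace's octal escapes are also valid in a Python bytes literal.
PAYLOAD = b"\25\3\1\0\22%\341U\3202\323i\207\240Z\220iTL\202'\264\t"

# Content types defined by the TLS record layer.
CONTENT_TYPES = {
    20: "change_cipher_spec",
    21: "alert",
    22: "handshake",
    23: "application_data",
}


def parse_record_header(data: bytes) -> dict:
    """Decode the first 5 bytes as a TLS record header (an assumption)."""
    content_type, major, minor, length = struct.unpack("!BBBH", data[:5])
    return {
        "type": CONTENT_TYPES.get(content_type, "unknown"),
        "version": (major, minor),
        "length": length,
        # True when the declared length matches the bytes that follow.
        "complete": len(data) - 5 == length,
    }


if __name__ == "__main__":
    print(parse_record_header(PAYLOAD))
```

Read this way, the bytes look like a complete 18-byte alert record with version 3.1 (TLS 1.0), which would be consistent with the getpeername() result showing a peer on port 636, the usual LDAP-over-SSL port. That is a guess from the trace alone, not a confirmed diagnosis.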
jbh
On Sun, Jul 6, 2008 at 9:23 PM, John Hanks griznog@gmail.com wrote:
Paul Bijnens pointed out that Ian Forde had similar issues with dhcpd minutes before I posted my message. I missed that one as I scanned the archives, then joined the list to ask my question. My problem is also solved by removing ldap from the services line in /etc/nsswitch.conf; that fixed every app that was previously failing with the SIGPIPE errors. I'm still curious to understand why, but more than that I'm grateful to have a fix. Should have joined the list a long time ago :)
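For anyone checking other machines for the same setup, here is a small Python sketch that lists which nsswitch.conf databases name a given source such as ldap. The function name and the sample contents are hypothetical; on a real system you would pass in the text of /etc/nsswitch.conf instead.

```python
import re
from typing import List

# Hypothetical sample, NOT the configuration from the affected systems.
SAMPLE = """\
# /etc/nsswitch.conf (illustrative only)
passwd:     files ldap
group:      files ldap
services:   files ldap   # the line changed by the fix
hosts:      files dns
netgroup:   files [NOTFOUND=return] nis
"""


def databases_using(source: str, nsswitch_text: str) -> List[str]:
    """Return the databases whose lookup order includes `source`."""
    hits = []
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        database, _, sources = line.partition(":")
        # Remove action items such as [NOTFOUND=return] before tokenizing.
        sources = re.sub(r"\[[^\]]*\]", " ", sources)
        if source in sources.split():
            hits.append(database.strip())
    return hits


if __name__ == "__main__":
    print(databases_using("ldap", SAMPLE))
```

With the sample above, services shows up in the result; after the fix described here (services: files) it would not.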
Thanks,
jbh
On Mon, 2008-07-07 at 07:07 -0600, John Hanks wrote:
Nah - 20 minutes sooner would have done it! I joined the list to get an answer too! ;)
-I