On Sun, Jul 6, 2008 at 7:44 PM, John Hanks <griznog at gmail.com> wrote:
> Hello,
>
> I have several systems which I recently updated with
>
> yum -y update
>
> to all the latest packages. These systems use yum-priorities and use
> the CentOS (priority 1), EPEL (priority 5) and rpmforge (priority 10)
> repositories. After the updates, dhcpd stopped working with a SIGPIPE
> error which occurs shortly after it attempts to fork into the
> background. I worked around that problem by building a new server with
> no additional repos, only CentOS, and dhcpd works fine on that system.
> Since then I have found the problem, or similar problems, with a few
> more applications. Here is the tail of an strace of pbs_mom as it
> attempts to fork into the background:
>
> listen(5, 512) = 0
> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 6
> setsockopt(6, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(6, {sa_family=AF_INET, sin_port=htons(15003), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
> listen(6, 512) = 0
> fcntl(4, F_SETLK, {type=F_UNLCK, whence=SEEK_SET, start=0, len=0}) = 0
> clone(Process 23938 attached (waiting for parent)
> Process 23938 resumed (parent 23937 ready)
> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2aaaaad30db0) = 23938
> [pid 23937] exit_group(0) = ?
> getsockname(3, 0x7fff6b7728a0, [128]) = -1 ENOTSOCK (Socket operation on non-socket)
> fcntl(3, F_GETFD) = 0
> dup(3) = 7
> fcntl(7, F_SETFD, 0) = 0
> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 8
> close(3) = 0
> fcntl(8, F_GETFD) = 0
> dup2(8, 3) = 3
> fcntl(3, F_SETFD, 0) = 0
> close(8) = 0
> write(3, "\25\3\1\0\22\334\362\36\233\253\205\2633\323\322q\4\3T\rxK\210", 23) = -1 EPIPE (Broken pipe)
> --- SIGPIPE (Broken pipe) @ 0 (0) ---
> Process 23938 detached
>
>
> This is pretty much the same thing that happened to dhcpd. In both
> cases the applications work fine in debug mode when they don't
> attempt to fork, but quietly die when run normally.
> A third set of apps, wrappers for the client part of torque
> (pbs_mom), do this:
>
> stat("/usr/local/sbin/pbs_iff", {st_mode=S_IFREG|S_ISUID|0755, st_size=21412, ...}) = 0
> pipe([5, 6]) = 0
> clone(Process 24068 attached (waiting for parent)
> Process 24068 resumed (parent 24067 ready)
> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2aaaaad31ce0) = 24068
> [pid 24067] close(6) = 0
> [pid 24067] fcntl(5, F_GETFL) = 0 (flags O_RDONLY)
> [pid 24067] read(5, <unfinished ...>
> [pid 24068] getsockname(3, {sa_family=AF_INET, sin_port=htons(41855), sin_addr=inet_addr("129.123.148.49")}, [1164321820984213520]) = 0
> [pid 24068] getpeername(3, {sa_family=AF_INET, sin_port=htons(636), sin_addr=inet_addr("129.123.20.92")}, [68719476752]) = 0
> [pid 24068] fcntl(3, F_GETFD) = 0x1 (flags FD_CLOEXEC)
> [pid 24068] dup(3) = 7
> [pid 24068] fcntl(7, F_SETFD, FD_CLOEXEC) = 0
> [pid 24068] socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 8
> [pid 24068] close(3) = 0
> [pid 24068] fcntl(8, F_GETFD) = 0
> [pid 24068] dup2(8, 3) = 3
> [pid 24068] fcntl(3, F_SETFD, 0) = 0
> [pid 24068] close(8) = 0
> [pid 24068] write(3, "\25\3\1\0\22\346h\357n\r\17x\374B\312\217\374x\276\311\217\342%", 23) = -1 EPIPE (Broken pipe)
> [pid 24068] --- SIGPIPE (Broken pipe) @ 0 (0) ---
> Process 24068 detached
> <... read resumed> "", 4) = 0
> --- SIGCHLD (Child exited) @ 0 (0) ---
> close(5) = 0
> wait4(24068, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGPIPE}], 0, NULL) = 24068
> close(4) = 0
> write(2, "No Permission.\n", 15No Permission.
> ) = 15
> write(2, "qstat: cannot connect to server "..., 63qstat: cannot connect to server moab.hpc.usu.edu (errno=15007)
> ) = 63
> exit_group(-1) = ?
>
> Once again, the app dies after it attempts to fork into the
> background.
> There are other things running on these systems that can
> successfully fork and I have been unable to figure out any pattern,
> other than if I don't use additional repos then it doesn't seem to
> break. That may be coincidental though, I haven't repeated it enough
> yet to be certain.
>
> Any hints or suggestions would be appreciated. Unfortunately I noticed
> this after deciding it was "safe" to update *all* my machines and so
> I'm suffering through a lot of rebuilds/restores because of this.
>
> Thanks,
>
> jbh
>

Just found yet another system demonstrating pipe-related weirdness.
Here's the tail of an strace where this app (qsub, another part of
Torque) hangs after the SIGPIPE:

write(5, "\3\34\177\25\4\32", 6) = 6
write(5, "WINSIZE 36,137,822,504\0\0R(A\240:\0\0\0"..., 80) = 80
write(1, "qsub: job 7.jobs.hpc.usu.edu rea"..., 36qsub: job 7.jobs.hpc.usu.edu ready
) = 36
rt_sigaction(SIGINT, {SIG_IGN}, NULL, 8) = 0
rt_sigaction(SIGTERM, {SIG_IGN}, NULL, 8) = 0
rt_sigaction(SIGALRM, {SIG_IGN}, NULL, 8) = 0
rt_sigaction(SIGTSTP, {SIG_IGN}, NULL, 8) = 0
clone(Process 3149 attached (waiting for parent)
Process 3149 resumed (parent 3143 ready)
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b43da67f770) = 3149
[pid 3149] getsockname(4, <unfinished ...>
[pid 3143] rt_sigaction(SIGCHLD, {0x402c00, [], SA_RESTORER, 0x3aa08301b0}, NULL, 8) = 0
[pid 3143] fcntl(0, F_GETFL) = 0x8002 (flags O_RDWR|O_LARGEFILE)
[pid 3143] read(0, <unfinished ...>
[pid 3149] <... getsockname resumed> {sa_family=AF_INET, sin_port=htons(52700), sin_addr=inet_addr("129.123.148.50")}, [16]) = 0
[pid 3149] getpeername(4, {sa_family=AF_INET, sin_port=htons(636), sin_addr=inet_addr("129.123.20.92")}, [68719476752]) = 0
[pid 3149] fcntl(4, F_GETFD) = 0x1 (flags FD_CLOEXEC)
[pid 3149] dup(4) = 6
[pid 3149] fcntl(6, F_SETFD, FD_CLOEXEC) = 0
[pid 3149] socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 7
[pid 3149] close(4) = 0
[pid 3149] fcntl(7, F_GETFD) = 0
[pid 3149] dup2(7, 4) = 4
[pid 3149] fcntl(4, F_SETFD, 0) = 0
[pid 3149] close(7) = 0
[pid 3149] write(4, "\25\3\1\0\22%\341U\3202\323i\207\240Z\220iTL\202\'\264\t", 23) = -1 EPIPE (Broken pipe)
[pid 3149] --- SIGPIPE (Broken pipe) @ 0 (0) ---
Process 3149 detached
<... read resumed> 0x7fffd0682aff, 1) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGPIPE}], WNOHANG|WSTOPPED, NULL) = 3149
kill(3149, SIGTERM) = -1 ESRCH (No such process)
ioctl(0, SNDCTL_TMR_START or TCSETS, {B9600 opost isig icanon echo ...}) = 0
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B9600 opost isig icanon echo ...}) = 0
exit_group(0) = ?

jbh
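
For anyone who wants to poke at the failing step without Torque or dhcpd installed, here is a minimal sketch, assuming Linux and Python 3. It repeats the dup / socket / close / dup2 / write sequence that precedes every SIGPIPE in the traces and reports the resulting errno. It also splits the first five bytes of the 23-byte payload from the first trace as if they were a TLS record header; that reading is an assumption on my part, although type 21 with version 3.1 and length 18 (5 + 18 = 23) would fit a TLS alert being written to the fresh, never-connected socket. The traces alone do not show which library does this. The names `PAYLOAD`, `describe_header` and `replay_descriptor_swap` are made up for illustration, and the socketpair is only a stand-in for the inherited descriptor.

```python
import errno
import os
import socket
import struct

# 23-byte payload copied from the first strace (octal escapes as strace printed them).
PAYLOAD = b"\25\3\1\0\22\334\362\36\233\253\205\2633\323\322q\4\3T\rxK\210"


def describe_header(data):
    """Split the first five bytes as if they were a TLS record header (assumption)."""
    content_type, major, minor, length = struct.unpack("!BBBH", data[:5])
    return content_type, (major, minor), length


def replay_descriptor_swap():
    """Repeat the dup/socket/close/dup2/write sequence; return the errno (0 on success)."""
    # Stand-in for the descriptor the child inherited (fd 3 or 4 in the traces).
    inherited, peer = socket.socketpair()
    fd = inherited.detach()                # keep only the raw descriptor number
    saved = os.dup(fd)                     # dup(3) = 7
    fresh = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # socket(...) = 8
    os.close(fd)                           # close(3)
    os.dup2(fresh.fileno(), fd)            # dup2(8, 3)
    fresh.close()                          # close(8)
    try:
        # Python ignores SIGPIPE by default, so the failure surfaces as an
        # OSError instead of killing the process as it did in the traces.
        os.write(fd, PAYLOAD)              # write(3, ..., 23)
        return 0
    except OSError as exc:
        return exc.errno
    finally:
        os.close(fd)
        os.close(saved)
        peer.close()


if __name__ == "__main__":
    print("header fields:", describe_header(PAYLOAD))
    code = replay_descriptor_swap()
    print("write errno:", errno.errorcode.get(code, code))
```

If the write fails with EPIPE here as well, the swap onto an unconnected TCP socket is enough by itself to explain the error, independent of which packages were updated; a process that leaves SIGPIPE at its default disposition would then be killed at that point, as in the traces.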