Commands failing silently?


Dan Bongert

24 Mar 2008

7:27 p.m.

Hello all:

I have a couple CentOS 4 servers (all up-to-date) that are having strange command failures. I first noticed this with a perl script that uses lots of system calls.

Basically, sometimes a command just won't run:

thoth(52) /tmp> ls
thoth(53) /tmp> ls
thoth(54) /tmp> ls
thoth(55) /tmp> ls
learner  lost+found/
thoth(56) /tmp> ls
learner  lost+found/
thoth(57) /tmp> ls
learner  lost+found/
thoth(58) /tmp> ls
learner  lost+found/
thoth(59) /tmp> ls
learner  lost+found/
thoth(60) /tmp> ls
learner  lost+found/
thoth(61) /tmp> ls
learner  lost+found/
thoth(62) /tmp> ls
thoth(63) /tmp> ls
thoth(64) /tmp> ls
thoth(65) /tmp> ls
thoth(66) /tmp> uname -a
Linux thoth.ssc.wisc.edu 2.6.9-67.0.7.ELsmp #1 SMP Sat Mar 15 06:54:55 EDT 2008 i686 i686 i386 GNU/Linux

Nothing in either dmesg or /var/log/messages seems to indicate any problems. It also doesn't seem to matter what the command is -- ls is the quickest test, but sshd will sometimes fail to spawn children, etc. There aren't a large number of processes on the machine either -- only 122 at the moment.

Has anyone seen this behavior before? Have I been hit with some sort of cunning rootkit? This machine shouldn't be publicly accessible; it's behind our firewall.

Thanks.

-- Dan Bongert dbongert@wisc.edu


Bill Campbell

24 Mar

7:55 p.m.

On Mon, Mar 24, 2008, Dan Bongert wrote:

...

Hello all:

I have a couple CentOS 4 servers (all up-to-date) that are having strange command failures. I first noticed this with a perl script that uses lots of system calls.

Basically, sometimes a command just won't run:

thoth(52) /tmp> ls

...

thoth(66) /tmp> uname -a
Linux thoth.ssc.wisc.edu 2.6.9-67.0.7.ELsmp #1 SMP Sat Mar 15 06:54:55 EDT 2008 i686 i686 i386 GNU/Linux

Nothing in either dmesg or /var/log/messages seems to indicate any problems. It also doesn't seem to matter what the command is -- ls is the quickest test, but sshd will sometimes fail to spawn children, etc. There aren't a large number of processes on the machine either -- only 122 at the moment.

There is a very good chance that the machine has been cracked, and the system's /bin/ls routine replaced by one hacked to hide the cracker's programs. ``rpm -V coreutils procps util-linux'' may well show several critical programs changed.

You can also try running ``strace /bin/ls'' to see what is going on.

Bill
--
INTERNET: bill@celestial.com
Bill Campbell; Celestial Software LLC
URL: http://www.celestial.com/
PO Box 820; 6641 E. Mercer Way; Mercer Island, WA 98040-0820; (206) 236-1676
FAX: (206) 232-9186

When I hear a man applauded by the mob I always feel a pang of pity for him. All he has to do to be hissed is to live long enough. -- H.L. Mencken, Minority Report

Dan Bongert

9:18 p.m.

Bill Campbell wrote:

...

On Mon, Mar 24, 2008, Dan Bongert wrote:

...
Hello all:

I have a couple CentOS 4 servers (all up-to-date) that are having strange command failures. I first noticed this with a perl script that uses lots of system calls.

Basically, sometimes a command just won't run:

thoth(52) /tmp> ls

...

...
thoth(66) /tmp> uname -a
Linux thoth.ssc.wisc.edu 2.6.9-67.0.7.ELsmp #1 SMP Sat Mar 15 06:54:55 EDT 2008 i686 i686 i386 GNU/Linux

Nothing in either dmesg or /var/log/messages seems to indicate any problems. It also doesn't seem to matter what the command is -- ls is the quickest test, but sshd will sometimes fail to spawn children, etc. There aren't a large number of processes on the machine either -- only 122 at the moment.

There is a very good chance that the machine has been cracked, and the system's /bin/ls routine replaced by one hacked to hide the cracker's programs. ``rpm -V coreutils procps util-linux'' may well show several critical programs changed.

Everything seems OK there:

thoth(96) /tmp> sudo rpm -V coreutils procps util-linux

...

You can also try running ``strace /bin/ls'' to see what is going on.

Funnily enough, running strace will work just fine. Though, as I said, just about any command will fail -- 'ls' was just for testing purposes.

...


-- Dan Bongert dbongert@wisc.edu

Peter l Jakobi

10:06 p.m.

On Mon, Mar 24, 2008 at 04:18:49PM -0500, Dan Bongert wrote:

...

...
You can also try running ``strace /bin/ls'' to see what is going on.

Funnily enough, running strace will work just fine. Though, as I said, just about any command will fail -- 'ls' was just for testing purposes.

That's funny. Or it may be that the strace output changes the timing and load enough to hide the problem.

Try redirecting the strace output to a separate file (on a local filesystem or ramdisk), possibly restricted to file operations. Also, check top: are you sure you don't have swap or RAM problems?

-- cu Peter l Jakobi lists@kefk.oa.shuttle.de

mouss

8:13 p.m.

Dan Bongert wrote:

...

Hello all:

I have a couple CentOS 4 servers (all up-to-date) that are having strange command failures. I first noticed this with a perl script that uses lots of system calls.

Basically, sometimes a command just won't run:

thoth(52) /tmp> ls
thoth(53) /tmp> ls
thoth(54) /tmp> ls
thoth(55) /tmp> ls
learner  lost+found/
thoth(56) /tmp> ls
learner  lost+found/
thoth(57) /tmp> ls
learner  lost+found/
thoth(58) /tmp> ls
learner  lost+found/
thoth(59) /tmp> ls
learner  lost+found/
thoth(60) /tmp> ls
learner  lost+found/
thoth(61) /tmp> ls
learner  lost+found/
thoth(62) /tmp> ls
thoth(63) /tmp> ls
thoth(64) /tmp> ls
thoth(65) /tmp> ls
thoth(66) /tmp> uname -a
Linux thoth.ssc.wisc.edu 2.6.9-67.0.7.ELsmp #1 SMP Sat Mar 15 06:54:55 EDT 2008 i686 i686 i386 GNU/Linux

Nothing in either dmesg or /var/log/messages seems to indicate any problems. It also doesn't seem to matter what the command is -- ls is the quickest test, but sshd will sometimes fail to spawn children, etc. There aren't a large number of processes on the machine either -- only 122 at the moment.

Has anyone seen this behavior before? Have I been hit with some sort of cunning rootkit? This machine shouldn't be publicly accessible; it's behind our firewall.

where is /tmp mounted? is this an external disk (usb, ...)? is it an nfs mount?

Dan Bongert

9:19 p.m.

mouss wrote:

...

Dan Bongert wrote:

...
Hello all:

I have a couple CentOS 4 servers (all up-to-date) that are having strange command failures. I first noticed this with a perl script that uses lots of system calls.

thoth(66) /tmp> uname -a
Linux thoth.ssc.wisc.edu 2.6.9-67.0.7.ELsmp #1 SMP Sat Mar 15 06:54:55 EDT 2008 i686 i686 i386 GNU/Linux

Nothing in either dmesg or /var/log/messages seems to indicate any problems. It also doesn't seem to matter what the command is -- ls is the quickest test, but sshd will sometimes fail to spawn children, etc. There aren't a large number of processes on the machine either -- only 122 at the moment.

Has anyone seen this behavior before? Have I been hit with some sort of cunning rootkit? This machine shouldn't be publicly accessible; it's behind our firewall.

where is /tmp mounted? is this an external disk (usb, ...)? is it an nfs mount?

It's a local disk:

thoth(97) /tmp> df -h .
Filesystem            Size  Used Avail Use% Mounted on
/dev/md4               16G   77M   15G   1% /tmp

Though 'ls' was just an example -- just about any program will fail. The 'w' command will fail too:

thoth(118) /tmp> w
 16:06:51 up  5:34,  1 user,  load average: 0.94, 1.46, 2.04
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
dbongert pts/0    copland.ssc.wisc 14:16    0.00s  0.22s  0.05s w
thoth(119) /tmp> w
 16:06:52 up  5:34,  1 user,  load average: 0.94, 1.46, 2.04
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
dbongert pts/0    copland.ssc.wisc 14:16    0.00s  0.22s  0.05s w
thoth(120) /tmp> w
thoth(121) /tmp> w

-- Dan Bongert dbongert@wisc.edu

William L. Maltby

10:08 p.m.

On Mon, 2008-03-24 at 16:19 -0500, Dan Bongert wrote:

...

mouss wrote:

...
Dan Bongert wrote:

...
Hello all:

<snip>

...

Though 'ls' was just an example -- just about any program will fail. The 'w' command will fail too:

thoth(118) /tmp> w
 16:06:51 up  5:34,  1 user,  load average: 0.94, 1.46, 2.04
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
dbongert pts/0    copland.ssc.wisc 14:16    0.00s  0.22s  0.05s w
thoth(119) /tmp> w
 16:06:52 up  5:34,  1 user,  load average: 0.94, 1.46, 2.04
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
dbongert pts/0    copland.ssc.wisc 14:16    0.00s  0.22s  0.05s w
thoth(120) /tmp> w
thoth(121) /tmp> w

Hmmm... Sure it's failing? Maybe just the output is going somewhere else? After the command runs, what does "echo $?" show? Does it even work? Echo is a bash internal command, so I would expect it to never fail.

What is your output device? A serial terminal? If so, could be simple flow control issues. In fact, any serial connection (even a PC emulating a terminal) could suffer from flow control problems. And they would tend to be erratic in nature.

If you are on a normal console, try running the commands similar to this (trying to determine if *something* else is receiving output or not)

<your command> &> /dev/tty

if this works reliably, maybe that's a starting point.

There's a couple kernel guys who frequent this list. Maybe one of them will have a clue as to what could go wrong. Corrupted libraries and whatnot.

You might try that rpm -V command earlier against all packages (add a "a" IIRC). Maybe some library accessed by the coreutils, but which is not itself part of coreutils, is corrupt.

HTH

-- Bill

Dan Bongert

25 Mar

6:21 p.m.

William L. Maltby wrote:

...

On Mon, 2008-03-24 at 16:19 -0500, Dan Bongert wrote:

...
mouss wrote:

...
Dan Bongert wrote:

...
Hello all:

<snip>

...
Though 'ls' was just an example -- just about any program will fail. The 'w' command will fail too:

thoth(118) /tmp> w
 16:06:51 up  5:34,  1 user,  load average: 0.94, 1.46, 2.04
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
dbongert pts/0    copland.ssc.wisc 14:16    0.00s  0.22s  0.05s w
thoth(119) /tmp> w
 16:06:52 up  5:34,  1 user,  load average: 0.94, 1.46, 2.04
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
dbongert pts/0    copland.ssc.wisc 14:16    0.00s  0.22s  0.05s w
thoth(120) /tmp> w
thoth(121) /tmp> w

Hmmm... Sure it's failing? Maybe just the output is going somewhere else? After the command runs, what does "echo $?" show? Does it even work? Echo is a bash internal command, so I would expect it to never fail.

Ok, it's definitely getting an error from somewhere:

thoth(3) /tmp> ls
thoth(4) /tmp> echo $?
141

Although:

thoth(31) ~> top
thoth(32) ~> echo $?
0

...

What is your output device? A serial terminal? If so, could be simple flow control issues. In fact, any serial connection (even a PC emulating a terminal) could suffer from flow control problems. And they would tend to be erratic in nature.

I'm usually sshing into the machine, but I've also experienced the problem on the console.

...

If you are on a normal console, try running the commands similar to this (trying to determine if *something* else is receiving output or not)
<your command> &> /dev/tty
if this works reliably, maybe that's a starting point.

Nope, that fails intermittently as well.

...

There's a couple kernel guys who frequent this list. Maybe one of them will have a clue as to what could go wrong. Corrupted libraries and whatnot.

You might try that rpm -V command earlier against all packages (add a "a" IIRC). Maybe some library accessed by the coreutils, but which is not itself part of coreutils, is corrupt.

Hmm....when I do a 'rpm -Va', I get lots of "at least one of file's dependencies has changed since prelinking" errors. Even if I run prelink manually, and then do a 'rpm -Va' immediately afterwards.

-- Dan Bongert dbongert@wisc.edu

William L. Maltby

8:18 p.m.

On Tue, 2008-03-25 at 13:21 -0500, Dan Bongert wrote:

...

William L. Maltby wrote:

...
On Mon, 2008-03-24 at 16:19 -0500, Dan Bongert wrote:

...
mouss wrote:

...
Dan Bongert wrote:

...
Hello all:

<snip>

...
Though 'ls' was just an example -- just about any program will fail. The 'w' command will fail too:

<snip>

...

...
Hmmm... Sure it's failing? Maybe just the output is going somewhere else? After the command runs, what does "echo $?" show? Does it even work? Echo is a bash internal command, so I would expect it to never fail.

Ok, it's definitely getting an error from somewhere:

thoth(3) /tmp> ls
thoth(4) /tmp> echo $?
141

Although:

thoth(31) ~> top

"~>" ? Got me on that one.

...

thoth(32) ~> echo $?
0

Ditto. Although I should mention that unless you "man bash" and find the magic incantation I can't remember that gets return codes from a pipeline (if that's what "~>" is supposed to be), the return from the last command in the pipeline is what's returned. If echo is from bash, as I expected, it should not fail and should return a 0 code regardless of what happened ahead of it.

Your best tack is simplicity: one command, no pipes, just redirect output with "&>" like so

cat <your file> &>/tmp/test.out

Then you can see if the output file has greater than zero length, use vim on it (if that works), etc.
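
The check described above (one command, no pipes, output redirected to a file, then look at the exit status and at whether the file is non-empty) can be repeated in a loop to measure how often a command fails. This is a minimal sketch, assuming a POSIX shell with mktemp available; the name tally_runs is made up for illustration and does not come from the thread. Note that the command is run directly, so shell aliases are not applied, and its output goes to a file rather than a terminal, which may change the behaviour being chased.

```shell
# tally_runs N command [args...]
# Hypothetical helper: run a command N times, report each non-zero exit
# status, and count how many runs produced no output at all.
# The body is a subshell ( ... ) so its variables stay local.
tally_runs() (
    runs=$1
    shift
    ok=0; failed=0; empty=0; i=0
    out=$(mktemp) || exit 1
    while [ "$i" -lt "$runs" ]; do
        "$@" >"$out" 2>&1
        status=$?
        if [ "$status" -eq 0 ]; then
            ok=$((ok + 1))
        else
            failed=$((failed + 1))
            echo "run $i: exit status $status"
        fi
        # -s is true only when the file exists and is non-empty
        [ -s "$out" ] || empty=$((empty + 1))
        i=$((i + 1))
    done
    rm -f "$out"
    echo "ok=$ok failed=$failed empty=$empty"
)
```

For example, `tally_runs 50 /bin/ls /tmp` would print one line per failed run followed by a summary line.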

...

<snip possibility of serial connection>

...

I'm usually sshing into the machine, but I've also experienced the problem on the console.

SSH via Ethernet or serial? On the console, is the failure as reliable or less frequent?

...

...
If you are on a normal console, try running the commands similar to this (trying to determine if *something* else is receiving output or not)
<your command> &> /dev/tty
if this works reliably, maybe that's a starting point.
Nope, that fails intermittently as well.

I would surmise that means that basic kernel operations are good and there is some common library routine involved.

...

...
There's a couple kernel guys who frequent this list. Maybe one of them will have a clue as to what could go wrong. Corrupted libraries and whatnot.

You might try that rpm -V command earlier against all packages (add a "a" IIRC). Maybe some library accessed by the coreutils, but which is not itself part of coreutils, is corrupt.

Hmm....when I do a 'rpm -Va', I get lots of "at least one of file's dependencies has changed since prelinking" errors. Even if I run prelink manually, and then do a 'rpm -Va' immediately afterwards.

Well, I'd "man rpm" (no, I don't hate you, but I don't do rpm stuff enough to remember it all and *I* am not going to "man rpm" unless I suddenly become quite masochistic :-), select some promising looking options and run it again, redirecting output to a file you can examine (possibly have to get it to a machine that works reliably - "man nc" someone mentioned in another thread looks like a useful tool).

You want to get the diagnostic output from rpm and see what files it complains about. The ones tagged with a "c" are config files and will often show up there. If your system hasn't been compromised, it's safe to ignore these.

Examine all the ones that were unexpectedly tagged and see if there is a pattern.

If your HDs are "smart", maybe a "smartctl -l <more params>" will identify some sectors gone bad in a critical area of your HD.

I don't have a clue why right after prelink is run the rpm would claim they had been changed, unless it's a matter of the rpm data base has not yet been updated. I don't know how it all works together. Maybe the rpm update runs at night or something?

WHERE'S THE KNOWLEDGEABLE FOLKS WHEN NEEDED? It's the blind leading the blind ATM. 8-O

HTH

-- Bill

Kai Schaetzl

26 Mar

11:31 a.m.

William L. Maltby wrote on Tue, 25 Mar 2008 16:18:51 -0400:

...

"~>" ? Got me on that one.

home dir plus prompt. It looks funny, yes :-)

Kai

-- Kai Schätzl, Berlin, Germany Get your web at Conactive Internet Services: http://www.conactive.com

Dan Bongert

27 Mar

3:27 p.m.

Kai Schaetzl wrote:

...

William L. Maltby wrote on Tue, 25 Mar 2008 16:18:51 -0400:

...
"~>" ? Got me on that one.

home dir plus prompt. It looks funny, yes :-)

Yup, that's exactly it -- I had run that command from my homedir instead of from /tmp.

-- Dan Bongert dbongert@wisc.edu

Filipe Brandenburger

26 Mar

2:47 a.m.

Hi,

On Tue, Mar 25, 2008 at 2:21 PM, Dan Bongert dbongert@wisc.edu wrote:

...

thoth(3) /tmp> ls
thoth(4) /tmp> echo $?
141

141 is SIGPIPE. If the process is killed by a signal, the return code will be 128+signal number. 141-128=13, and kill -l says: 13) SIGPIPE.

SIGPIPE means that something that ls is writing to is being closed. That's really strange, and I couldn't find why.
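
The arithmetic above can be wrapped in a small helper. This is a sketch, assuming a shell whose `kill -l` accepts a signal number (bash and dash both do); signal_from_status is a hypothetical name. A program can also choose to exit with a value above 128 on its own, so the result is a strong hint rather than proof.

```shell
# signal_from_status STATUS
# Hypothetical helper: interpret a shell exit status using the
# "128 + signal number" convention described above.
# The body is a subshell ( ... ) so its variables stay local.
signal_from_status() (
    status=$1
    if [ "$status" -gt 128 ]; then
        signum=$((status - 128))
        # kill -l NUMBER prints the signal name without the SIG prefix
        echo "killed by signal $signum (SIG$(kill -l "$signum"))"
    else
        echo "exited normally with status $status"
    fi
)

signal_from_status 141   # prints: killed by signal 13 (SIGPIPE)
```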

I still think strace would be the best way to trace it. Please try:

# rm -f /tmp/ls-strace.txt; strace -o /tmp/ls-strace.txt -tt -s 1024 -f ls --color=tty

Repeat it until ls doesn't print anything. Then less your /tmp/ls-strace.txt file, you'll probably have something like +++ killed by SIGPIPE +++ as the last line of it. Then try to figure out what happened before it got the SIGPIPE. Probably a "write" to something, try to figure out to which file descriptor. If you can't do it, try to post the last few lines of the file here.

Also, can you post the output of this command? # ls -la /proc/$$/fd/

Filipe

Dan Bongert

27 Mar

3:31 p.m.

Filipe Brandenburger wrote:

...

Hi,

On Tue, Mar 25, 2008 at 2:21 PM, Dan Bongert dbongert@wisc.edu wrote:

...
thoth(3) /tmp> ls
thoth(4) /tmp> echo $?
141

141 is SIGPIPE. If the process is killed by a signal, the return code will be 128+signal number. 141-128=13, and kill -l says: 13) SIGPIPE.

SIGPIPE means that something that ls is writing to is being closed. That's really strange, and I couldn't find why.

I still think strace would be the best way to trace it. Please try:

# rm -f /tmp/ls-strace.txt; strace -o /tmp/ls-strace.txt -tt -s 1024 -f ls --color=tty

Repeat it until ls doesn't print anything. Then less your /tmp/ls-strace.txt file, you'll probably have something like +++ killed by SIGPIPE +++ as the last line of it. Then try to figure out what happened before it got the SIGPIPE. Probably a "write" to something, try to figure out to which file descriptor. If you can't do it, try to post the last few lines of the file here.

I tried it, but as I said before, strace somehow interferes with what's going on. I wasn't able to get a program to fail via strace.

...

Also, can you post the output of this command? # ls -la /proc/$$/fd/

thoth(265) /tmp> ls -la /proc/$$/fd/
thoth(266) /tmp> ls -la /proc/$$/fd/
total 5
dr-x------  2 dbongert dbongert  0 Mar 27 10:17 .
dr-xr-xr-x  3 dbongert dbongert  0 Mar 27 10:03 ..
lrwx------  1 dbongert dbongert 64 Mar 27 10:17 0 -> /dev/pts/0
lrwx------  1 dbongert dbongert 64 Mar 27 10:17 1 -> /dev/pts/0
lrwx------  1 dbongert dbongert 64 Mar 27 10:17 2 -> /dev/pts/0
lrwx------  1 dbongert dbongert 64 Mar 27 10:17 255 -> /dev/pts/0
lrwx------  1 dbongert dbongert 64 Mar 27 10:17 3 -> socket:[4425494]

-- Dan Bongert dbongert@wisc.edu

Dan Bongert

3:44 p.m.

Dan Bongert wrote:

...

Filipe Brandenburger wrote:

...
Hi,

On Tue, Mar 25, 2008 at 2:21 PM, Dan Bongert dbongert@wisc.edu wrote:

...
thoth(3) /tmp> ls
thoth(4) /tmp> echo $?
141

141 is SIGPIPE. If the process is killed by a signal, the return code will be 128+signal number. 141-128=13, and kill -l says: 13) SIGPIPE.

SIGPIPE means that something that ls is writing to is being closed. That's really strange, and I couldn't find why.

I still think strace would be the best way to trace it. Please try:

# rm -f /tmp/ls-strace.txt; strace -o /tmp/ls-strace.txt -tt -s 1024 -f ls --color=tty

Repeat it until ls doesn't print anything. Then less your /tmp/ls-strace.txt file, you'll probably have something like +++ killed by SIGPIPE +++ as the last line of it. Then try to figure out what happened before it got the SIGPIPE. Probably a "write" to something, try to figure out to which file descriptor. If you can't do it, try to post the last few lines of the file here.

I tried it, but as I said before, strace somehow interferes with what's going on. I wasn't able to get a program to fail via strace.

...
Also, can you post the output of this command? # ls -la /proc/$$/fd/

thoth(265) /tmp> ls -la /proc/$$/fd/
thoth(266) /tmp> ls -la /proc/$$/fd/
total 5
dr-x------  2 dbongert dbongert  0 Mar 27 10:17 .
dr-xr-xr-x  3 dbongert dbongert  0 Mar 27 10:03 ..
lrwx------  1 dbongert dbongert 64 Mar 27 10:17 0 -> /dev/pts/0
lrwx------  1 dbongert dbongert 64 Mar 27 10:17 1 -> /dev/pts/0
lrwx------  1 dbongert dbongert 64 Mar 27 10:17 2 -> /dev/pts/0
lrwx------  1 dbongert dbongert 64 Mar 27 10:17 255 -> /dev/pts/0
lrwx------  1 dbongert dbongert 64 Mar 27 10:17 3 -> socket:[4425494]

Ok, here I am replying to myself. On a lark, I tried to strace a different program, since I couldn't get strace + ls to fail. Here's the end of the output from 'strace w':

connect(4, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = 0
poll([{fd=4, events=POLLOUT|POLLERR|POLLHUP, revents=POLLOUT|POLLHUP}], 1, 5000) = 1
writev(4, [{"\2\0\0\0\1\0\0\0\2\0\0\0", 12}, {"0\0", 2}], 2) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
+++ killed by SIGPIPE +++

Looks like an nscd problem, and disabling nscd seems to fix it.
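
Reading a trace like the one above by hand amounts to finding the call that returned EPIPE and then looking back for the connect() on the same file descriptor. Below is a rough sketch of that lookup with awk, assuming a trace file with one system call per line, as `strace -o` writes it. The name epipe_peer is hypothetical, and the helper only understands connect() calls that carry a path="..." field, so descriptors opened in other ways are reported as unknown.

```shell
# epipe_peer TRACEFILE
# Hypothetical helper: for every call in an strace log that failed with
# EPIPE, print the socket path that the same descriptor was connected to.
epipe_peer() {
    awk '
        # Remember the path given to each connect(FD, {... path="..."} ...)
        match($0, /connect\([0-9]+/) {
            fd = substr($0, RSTART + 8, RLENGTH - 8)
            if (match($0, /path="[^"]*"/))
                peer[fd] = substr($0, RSTART + 6, RLENGTH - 7)
        }
        # A call that returned EPIPE: its first argument is the descriptor
        /EPIPE/ && match($0, /[a-z_0-9]+\([0-9]+/) {
            fd = substr($0, RSTART, RLENGTH)
            sub(/^[^(]*\(/, "", fd)
            print "EPIPE on fd " fd ": " ((fd in peer) ? peer[fd] : "unknown peer")
        }
    ' "$1"
}
```

Run against a file holding the trace lines quoted above, this would report descriptor 4 and the nscd socket path.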

-- Dan Bongert dbongert@wisc.edu

Ross S. W. Walker

3:53 p.m.

Dan Bongert wrote:

...

Dan Bongert wrote:

...
Filipe Brandenburger wrote:

...
Hi,

On Tue, Mar 25, 2008 at 2:21 PM, Dan Bongert dbongert@wisc.edu wrote:

...
thoth(3) /tmp> ls
thoth(4) /tmp> echo $?
141

141 is SIGPIPE. If the process is killed by a signal, the return code will be 128+signal number. 141-128=13, and kill -l says: 13) SIGPIPE.

SIGPIPE means that something that ls is writing to is being closed. That's really strange, and I couldn't find why.

I still think strace would be the best way to trace it. Please try:

# rm -f /tmp/ls-strace.txt; strace -o /tmp/ls-strace.txt -tt -s 1024 -f ls --color=tty

Repeat it until ls doesn't print anything. Then less your /tmp/ls-strace.txt file, you'll probably have something like +++ killed by SIGPIPE +++ as the last line of it. Then try to figure out what happened before it got the SIGPIPE. Probably a "write" to something, try to figure out to which file descriptor. If you can't do it, try to post the last few lines of the file here.

I tried it, but as I said before, strace somehow interferes with what's going on. I wasn't able to get a program to fail via strace.

...
Also, can you post the output of this command? # ls -la /proc/$$/fd/

thoth(265) /tmp> ls -la /proc/$$/fd/
thoth(266) /tmp> ls -la /proc/$$/fd/
total 5
dr-x------  2 dbongert dbongert  0 Mar 27 10:17 .
dr-xr-xr-x  3 dbongert dbongert  0 Mar 27 10:03 ..
lrwx------  1 dbongert dbongert 64 Mar 27 10:17 0 -> /dev/pts/0
lrwx------  1 dbongert dbongert 64 Mar 27 10:17 1 -> /dev/pts/0
lrwx------  1 dbongert dbongert 64 Mar 27 10:17 2 -> /dev/pts/0
lrwx------  1 dbongert dbongert 64 Mar 27 10:17 255 -> /dev/pts/0
lrwx------  1 dbongert dbongert 64 Mar 27 10:17 3 -> socket:[4425494]

Ok, here I am replying to myself. On a lark, I tried to strace a different program, since I couldn't get strace + ls to fail. Here's the end of the output from 'strace w':

connect(4, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = 0
poll([{fd=4, events=POLLOUT|POLLERR|POLLHUP, revents=POLLOUT|POLLHUP}], 1, 5000) = 1
writev(4, [{"\2\0\0\0\1\0\0\0\2\0\0\0", 12}, {"0\0", 2}], 2) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
+++ killed by SIGPIPE +++

Looks like an nscd problem, and disabling nscd seems to fix it.

Good stuff. Actually, the nscd problem may even be a symptom of an nsswitch problem. Check to make sure you don't have a name service enabled in /etc/nsswitch.conf that isn't actually working.
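
To see which name services are actually configured, the relevant lines of the switch file can be listed per database. This is a sketch, assuming the usual location /etc/nsswitch.conf and the standard "database: source ..." layout. The name nss_sources is a made-up helper, and bracketed actions such as [NOTFOUND=return] are skipped on the assumption that they contain no spaces.

```shell
# nss_sources DATABASE [FILE]
# Hypothetical helper: print, one per line, the sources configured for a
# database (passwd, group, hosts, ...) in an nsswitch.conf-style file.
nss_sources() {
    awk -v db="$1" '
        { sub(/#.*/, "") }                 # strip comments
        $1 == (db ":") {
            for (i = 2; i <= NF; i++)
                if ($i !~ /^\[/)           # skip [STATUS=action] items
                    print $i
        }
    ' "${2:-/etc/nsswitch.conf}"
}
```

Running `nss_sources passwd` and then a lookup such as `getent passwd dbongert` gives a quick way to check whether each listed source answers at all.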

-Ross


mouss

26 Mar

2:11 a.m.

Dan Bongert wrote:

...

mouss wrote:

...
Dan Bongert wrote:

...
Hello all:

I have a couple CentOS 4 servers (all up-to-date) that are having strange command failures. I first noticed this with a perl script that uses lots of system calls.

thoth(66) /tmp> uname -a
Linux thoth.ssc.wisc.edu 2.6.9-67.0.7.ELsmp #1 SMP Sat Mar 15 06:54:55 EDT 2008 i686 i686 i386 GNU/Linux

Nothing in either dmesg or /var/log/messages seems to indicate any problems. It also doesn't seem to matter what the command is -- ls is the quickest test, but sshd will sometimes fail to spawn children, etc. There aren't a large number of processes on the machine either -- only 122 at the moment.

Has anyone seen this behavior before? Have I been hit with some sort of cunning rootkit? This machine shouldn't be publicly accessible; it's behind our firewall.

where is /tmp mounted? is this an external disk (usb, ...)? is it an nfs mount?

It's a local disk:

thoth(97) /tmp> df -h .
Filesystem            Size  Used Avail Use% Mounted on
/dev/md4               16G   77M   15G   1% /tmp

Though 'ls' was just an example -- just about any program will fail. The 'w' command will fail too:

maybe check your PATH. try $ /bin/ls

...

thoth(118) /tmp> w
 16:06:51 up  5:34,  1 user,  load average: 0.94, 1.46, 2.04
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
dbongert pts/0    copland.ssc.wisc 14:16    0.00s  0.22s  0.05s w
thoth(119) /tmp> w
 16:06:52 up  5:34,  1 user,  load average: 0.94, 1.46, 2.04
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
dbongert pts/0    copland.ssc.wisc 14:16    0.00s  0.22s  0.05s w
thoth(120) /tmp> w
thoth(121) /tmp> w

Dan Bongert

27 Mar

3:30 p.m.

mouss wrote:

...

Dan Bongert wrote:

...
mouss wrote:

...
Dan Bongert wrote:

...
Hello all:

I have a couple CentOS 4 servers (all up-to-date) that are having strange command failures. I first noticed this with a perl script that uses lots of system calls.

thoth(66) /tmp> uname -a
Linux thoth.ssc.wisc.edu 2.6.9-67.0.7.ELsmp #1 SMP Sat Mar 15 06:54:55 EDT 2008 i686 i686 i386 GNU/Linux

Nothing in either dmesg or /var/log/messages seems to indicate any problems. It also doesn't seem to matter what the command is -- ls is the quickest test, but sshd will sometimes fail to spawn children, etc. There aren't a large number of processes on the machine either -- only 122 at the moment.

Has anyone seen this behavior before? Have I been hit with some sort of cunning rootkit? This machine shouldn't be publicly accessible; it's behind our firewall.

where is /tmp mounted? is this an external disk (usb, ...)? is it an nfs mount?

It's a local disk:

thoth(97) /tmp> df -h .
Filesystem            Size  Used Avail Use% Mounted on
/dev/md4               16G   77M   15G   1% /tmp

Though 'ls' was just an example -- just about any program will fail. The 'w' command will fail too:

maybe check your PATH. try $ /bin/ls

Ok, here's a heck of a thing. When I run 'ls' using the full path (and also when I unalias it -- I have 'ls' aliased to 'ls -F --color'), 'ls' no longer fails.

However, my other test case, 'w', still fails.

(and these are all test cases because I noticed a nightly job with a lot of system() calls was failing).

-- Dan Bongert dbongert@wisc.edu


discuss@lists.centos.org

16 comments

8 participants


participants (8)

Bill Campbell
Dan Bongert
Filipe Brandenburger
Kai Schaetzl
mouss
Peter l Jakobi
Ross S. W. Walker
William L. Maltby