[CentOS] NFS server problem

Sun May 1 02:15:52 UTC 2005
Alex Tkachenko <tiahino at gmail.com>

Hi Everybody!

I've got a nasty problem with a fresh install of CentOS 4 (actually a
re-install, to rule out any configuration problems accumulated over
time).

We have a c4-based machine (hereafter the Server), which is supposed to
serve a lot of files over NFS. In the course of the setup I was
checking how the backups (using rsync) would work and got a lot of
errors reported by rsync about non-existent (sub)directories. When I
went to check the reported directory on the Client, it was indeed not
accessible, although its name was visible in the parent directory.
Later I found that rsync is irrelevant: the problem can be
reproduced by just running find.
To illustrate (run on the client):

cd /mnt/Server/export
find . > /dev/null
./a/b/c: No such file or directory
ls -l ./a/b
ls: /mnt/server/a/b/c: No such file or directory
(then follows the listing of remaining files in ./a/b)
If you do just 'ls ./a/b', the output includes 'c' among the others.
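
A minimal sketch of a per-directory check for this symptom, assuming a
POSIX shell on the client. The function name check_listed_entries is
made up for illustration. It prints every name that a plain listing
returns but that cannot be stat'ed; names containing newlines are not
handled.

```shell
# check_listed_entries DIR  (hypothetical helper, not part of any tool)
# Prints each entry that shows up in the listing of DIR but does not
# exist when accessed, i.e. the "visible but not accessible" case.
check_listed_entries() {
    dir=$1
    # Plain ls only reads the names; per the report above, the broken
    # entry is still included in this output.
    ls -A "$dir" | while IFS= read -r name; do
        # -e follows symlinks, so accept dangling symlinks via -L.
        if [ ! -e "$dir/$name" ] && [ ! -L "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
}
```

On a healthy mount this prints nothing. In the example above,
'check_listed_entries ./a/b' would be expected to print './a/b/c'.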

NOW THE TRICK:

On the Server, do

ls /export/a/b/c
(expected output follows)

Now on the client everything is OK!

I've found a similar bug here: 
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=144556
Although it is filed against fc3, the symptoms seem to be the same,
but no one there mentioned the server-side trick.

I've tried several client machines (rhl7.3, c3, c4), tcp/udp, v2/v3,
sync/async and also set actimeo=0 - nothing seems to affect the
behavior. And nothing is printed in the logs, neither on the client nor
on the server (granted, I don't have debug enabled).
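
For anyone repeating these tests, here is a rough sketch that
enumerates the client-side combinations as NFS mount option strings.
The function name mount_option_matrix is made up, and adding actimeo=0
to every combination is an assumption; sync/async is left out.

```shell
# mount_option_matrix  (hypothetical helper)
# Prints one "-o" option string per transport/version combination.
mount_option_matrix() {
    for proto in tcp udp; do
        for vers in 2 3; do
            printf 'proto=%s,nfsvers=%s,actimeo=0\n' "$proto" "$vers"
        done
    done
}

# Possible use (as root; the export name Server:/export is a guess):
#   mount_option_matrix | while IFS= read -r opts; do
#       mount -t nfs -o "$opts" Server:/export /mnt/Server/export
#       (cd /mnt/Server/export && find . > /dev/null)
#       umount /mnt/Server/export
#   done
```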

At some point I suspected hardware (like lost or corrupted ethernet
packets), but the directory stays missing on the client side
until the trick is applied on the server side, pretty consistently.

umount/mount may help with that particular dir, but the bug would then
show up in a different place. And as per the bug referenced above, I can
confirm that all the dirs reported missing have the size 12288
(I haven't spotted bigger ones, but maybe because I haven't checked
every one of them).
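
To check more of them, candidate directories could be listed on the
server by size. This is only a sketch: the function name list_big_dirs
is made up, and the 12288 default is just the size observed above.

```shell
# list_big_dirs ROOT [MIN_BYTES]  (hypothetical helper)
# Prints directories under ROOT whose own size (as shown by ls -ld)
# is at least MIN_BYTES; the default is 12288.
list_big_dirs() {
    root=$1
    min=${2:-12288}
    # -size +Nc matches sizes strictly greater than N bytes, so use
    # min-1 to include directories of exactly MIN_BYTES.
    find "$root" -type d -size +"$((min - 1))"c
}

# Example (on the server):
#   list_big_dirs /export
```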

Just in case: the Server is a Dell PowerEdge 2850 connected through a
Gigabit interface (e1000) (latest BIOS updates applied in the course
of troubleshooting).

I did not try to re-install the server with c3, but even if it works,
it does not seem to be an option, because we plan to use selinux,
which is missing in c3 (and yes, I tried disabling selinux with
'setenforce 0').

Any suggestions would be greatly appreciated.

Have a great day,
Alex