NFS / DNS problem

Simone

15 Aug 2007

5:19 a.m.

Hi all,

Today we had a strange problem that took down our website. We understand what happened but not why, so I am hoping someone has seen this before.

Our web servers (web1, web2, web3 ... web10) mount an NFS share (/export/data) from server nfs1. On the web server side we use autofs in the format nfs-dedicated:/export/data, where nfs-dedicated is an alias in our internal DNS servers pointing to nfs1. We run a primary and a secondary DNS (bind) server, ns0 and ns1, authoritative for our zones, and our web servers have both configured in /etc/resolv.conf.

Today we had to run some upgrades on the DNS servers (BIOS, firmware, etc.), so we took down ns0, and our website went down with it. All the NFS shares disappeared from the web servers (the logs show mount/unmount requests timing out), but at the same time the logs on nfs1 show mount and unmount requests coming from the web servers and no errors.

As soon as ns0 came back up, everything returned to normal. Minutes later we took ns1 down for maintenance, and that had no impact on the website.

dig @ns0 nfs-web and dig @ns1 nfs-web give exactly the same results.

Back at the office we tried to reproduce the scenario by configuring iptables on web3 to block traffic to ns0, but web3 kept working fine, falling back to ns1 for name resolution (as you would expect).
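One way to see whether the fallback to ns1 costs time while ns0 is blocked is to time a lookup through the system resolver. The helper below is only a sketch: timed_lookup is a hypothetical name, and it measures whatever the local resolver does (including /etc/hosts or any cache), not a query to one specific nameserver.

```python
import socket
import time


def timed_lookup(name, resolve=socket.gethostbyname, clock=time.monotonic):
    """Resolve `name` and return (address_or_None, seconds_elapsed).

    `resolve` and `clock` are injectable so the helper can be exercised
    without touching the network.
    """
    start = clock()
    try:
        address = resolve(name)
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError.
        address = None
    return address, clock() - start
```

Running timed_lookup("nfs-web") on web3 with and without the iptables rule in place would show whether the lookup still succeeds and how much slower it gets.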

Has anybody seen this happening before? Any comment/suggestion much appreciated.

Thanks

Simone

Peter (CentOS List)

15 Aug

5:58 a.m.

The first thing that popped into my head was reverse lookup, but as I kept reading and saw your test with web3, I think it could have been a sync problem between the two nameservers. Restarting ns1 would have synced all the zones again, so your initial problem is gone and your test with web3 succeeded, in that it didn't lose its mount. Keep an eye on ns1 when you update zones on ns0. I have seen problems where the sync didn't occur automatically and I had to sync "manually" by stopping and starting bind on the secondary server.
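If a stale secondary is suspected, comparing the SOA serial each server reports for the zone (for example from a dig SOA query against ns0 and the same query against ns1) is a quick check. A minimal sketch of the comparison, assuming the two serials have already been fetched; serial_is_behind is a hypothetical helper, and it uses the wrap-around serial arithmetic from RFC 1982:

```python
def serial_is_behind(secondary, primary):
    """Return True if `secondary` is older than `primary` under
    RFC 1982 serial number arithmetic (32-bit values that wrap around)."""
    modulus = 2 ** 32
    half = 2 ** 31
    secondary %= modulus
    primary %= modulus
    # Distance from secondary forward to primary, modulo 2**32.
    diff = (primary - secondary) % modulus
    # diff == 0 means equal; diff == 2**31 is undefined in RFC 1982,
    # and is treated here as "not behind".
    return 0 < diff < half
```

With made-up serials, serial_is_behind(2007080200, 2007080201) is True, meaning the secondary has not yet picked up the primary's latest zone.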

Hope it helps you a little bit.

Peter

CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos

Simone Mangelio

2:20 p.m.

Hi Peter

Thanks for your reply.

Some more info:

/etc/resolv.conf on ns1:

nameserver ns0IP
nameserver ns1IP

While ns0 was down, I can see that even ns1 failed to mount the NFS shares (timed out):

Aug 14 08:30:31 ns1 automount[4093]: >> mount: mount to NFS server 'nfs-web' failed: timed out (retrying).
Aug 14 08:31:53 ns1 last message repeated 2 times
Aug 14 08:32:13 ns1 automount[4093]: >> mount: mount to NFS server 'nfs-web' failed: timed out (giving up).

If I go back through the logs I can see a full zone sync happening on the 2nd of August. No changes have been made since then, so I am pretty confident the zones were OK.

In what way would reverse lookup affect it?

We are still scratching our heads.....

Thanks

Simone

Andreas Rogge

5:09 p.m.

Hi Simone,

What nameservers are configured on the NFS servers?

As far as I can tell, the NFS server does forward and reverse lookups of its clients. So if your NFS server's DNS breaks (e.g. if only ns0 is configured there and you shut down ns0), you might see the issue you described.
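The forward and reverse lookup described here can be checked by hand from the NFS server for any client address. A sketch, assuming the server resolves clients through the normal system resolver; check_client_dns is a hypothetical helper, and whether nfs1 actually performs these lookups on mount is a suggestion in this thread, not something confirmed:

```python
import socket


def check_client_dns(ip, reverse=socket.gethostbyaddr,
                     forward=socket.gethostbyname_ex):
    """Reverse-resolve `ip`, then forward-resolve the returned name and
    report whether the forward result includes the original address.

    `reverse` and `forward` are injectable so the check can be tested
    without real DNS.
    """
    try:
        hostname, _aliases, _addrs = reverse(ip)
    except OSError:
        # No PTR record, or the resolver could not be reached.
        return {"ip": ip, "hostname": None, "confirmed": False}
    try:
        _name, _aliases, addresses = forward(hostname)
    except OSError:
        return {"ip": ip, "hostname": hostname, "confirmed": False}
    return {"ip": ip, "hostname": hostname, "confirmed": ip in addresses}
```

Running this on the NFS server against a web server's address while one nameserver is down would show whether either direction fails or stalls.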

Regards, Andreas

Simone Mangelio

5:45 p.m.

Hi Andreas,

All the servers have the same /etc/resolv.conf file:

search our.domain.com
options ndots:2
nameserver ns0IP
nameserver ns1IP
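With this configuration the resolver tries the nameservers in the order listed, and per resolv.conf(5) the defaults are a 5 second timeout and 2 attempts when no timeout or attempts option is set. The sketch below estimates how long a single lookup waits on unreachable servers before it reaches a live one. It is a simplified model of the first pass through the list only, the function names are hypothetical, and it is not claimed to be the cause of the outage.

```python
def parse_resolv_conf(text):
    """Extract nameservers and the timeout/attempts options from
    resolv.conf-style text. Defaults follow resolv.conf(5):
    timeout 5 seconds, attempts 2."""
    conf = {"nameservers": [], "timeout": 5, "attempts": 2}
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith(("#", ";")):
            continue  # blank line or comment
        if fields[0] == "nameserver" and len(fields) > 1:
            conf["nameservers"].append(fields[1])
        elif fields[0] == "options":
            for opt in fields[1:]:
                key, _, value = opt.partition(":")
                if key in ("timeout", "attempts") and value.isdigit():
                    conf[key] = int(value)
    return conf


def delay_before_answer(conf, dead):
    """Seconds spent waiting on unreachable servers before the first
    reachable one is tried, on the first pass through the list.
    Returns None if every listed server is in `dead`."""
    waited = 0
    for server in conf["nameservers"]:
        if server not in dead:
            return waited
        waited += conf["timeout"]
    return None
```

Under this model a lookup waits 5 seconds when ns0 is unreachable and 0 seconds when only ns1 is, because ns0 is listed first. Whether a delay of that size is enough to make a mount time out depends on how many lookups each mount triggers, which this sketch does not model.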

I can't see any errors on the web servers or NFS servers referring to a host lookup failure.

Thanks

Simone
