[CentOS] Date drift and ntpd

Jason Pyeron wrote, On 08/12/2010 08:01 AM:
>  
> 
>> -----Original Message-----
>> From: centos-bounces at centos.org 
>> [mailto:centos-bounces at centos.org] On Behalf Of Simon Billis
>> Sent: Thursday, August 12, 2010 7:36
>> To: 'CentOS mailing list'
>> Subject: Re: [CentOS] Date drift and ntpd
>>
>> Jason Pyeron sent a missive on 2010-08-12:
>>
>>> We have a local time server and all of our machines are pointed at it
>>> for the time.
>>>
>>> How can the clock drift by a day and a half?
>>>
>>> [root at devserver21 ~]# date
>>> Fri Aug 13 14:43:29 EDT 2010
>>> [root at devserver21 ~]# rdate -s 192.168.1.67
>>> [root at devserver21 ~]# date
>>> Thu Aug 12 07:02:39 EDT 2010
>>> [root at devserver21 ~]# cat /etc/ntp.conf | grep -v ^# | grep -v ^$ 
>>> restrict default nomodify notrap noquery
>>> restrict 127.0.0.1
>>> server 192.168.1.67
>>> server 192.168.1.66
>>> server 192.168.1.65
>>> server  127.127.1.0     # local clock
>>> fudge   127.127.1.0 stratum 10
>>> driftfile /var/lib/ntp/drift
>>> broadcastdelay  0.008
>>> keys            /etc/ntp/keys
>>>
>>>
>> Hi,
>>
>> It is unlikely that the machine in question drifted forward 
>> in time if ntpd was running. Have a look at the logs 
>> /var/log/messages it should contain the ntpd log messages 
> 
> [root at devserver21 ~]# grep ntpd /var/log/messages
> </snip>
> Jul 28 20:34:41 devserver21 ntpd[3475]: synchronized to 192.168.1.65, stratum 3
> Jul 28 21:08:00 devserver21 ntpd[3475]: synchronized to LOCAL(0), stratum 10
> Jul 28 21:08:00 devserver21 ntpd[3475]: frequency error -512 PPM exceeds
> tolerance 500 PPM
> Jul 28 21:08:11 devserver21 ntpd[3475]: synchronized to 192.168.1.66, stratum 3
> Jul 28 21:24:58 devserver21 ntpd[3475]: synchronized to 192.168.1.65, stratum 3
> Jul 28 21:41:26 devserver21 ntpd[3475]: synchronized to 192.168.1.67, stratum 3
> Jul 28 21:42:16 devserver21 ntpd[3475]: synchronized to LOCAL(0), stratum 10
> Jul 28 21:42:16 devserver21 ntpd[3475]: frequency error -512 PPM exceeds
> tolerance 500 PPM
> Jul 28 21:42:34 devserver21 ntpd[3475]: frequency error -512 PPM exceeds
> tolerance 500 PPM
> Jul 28 21:43:37 devserver21 ntpd[3475]: frequency error -512 PPM exceeds
> tolerance 500 PPM

> tolerance 500 PPM
> Jul 28 22:12:07 devserver21 ntpd[3475]: frequency error -512 PPM exceeds
> tolerance 500 PPM
> Jul 28 22:13:13 devserver21 ntpd[3475]: frequency error -512 PPM exceeds
> tolerance 500 PPM
> Jul 28 22:14:17 devserver21 ntpd[3475]: frequency error -512 PPM exceeds
> tolerance 500 PPM
> Jul 28 22:15:11 devserver21 ntpd[3475]: synchronized to 192.168.1.66, stratum 3
> Jul 28 22:31:41 devserver21 ntpd[3475]: synchronized to LOCAL(0), stratum 10
> Jul 28 22:31:41 devserver21 ntpd[3475]: frequency error -512 PPM exceeds
> tolerance 500 PPM

> Jul 29 15:14:01 devserver21 ntpd[3475]: synchronized to LOCAL(0), stratum 10
> Jul 29 15:26:05 devserver21 ntpd[3475]: synchronized to 192.168.1.65, stratum 3
> Jul 29 15:59:17 devserver21 ntpd[3475]: time reset -1.599691 s
> Jul 29 16:03:31 devserver21 ntpd[3475]: synchronized to LOCAL(0), stratum 10
> Jul 29 16:05:38 devserver21 ntpd[3475]: synchronized to 192.168.1.67, stratum 3
> Jul 29 16:08:46 devserver21 ntpd[3475]: synchronized to 192.168.1.66, stratum 3
> Jul 29 16:11:55 devserver21 ntpd[3475]: synchronized to 192.168.1.65, stratum 3

> Jul 29 17:23:57 devserver21 ntpd[3475]: synchronized to 192.168.1.67, stratum 3
> Jul 29 17:24:59 devserver21 ntpd[3475]: synchronized to LOCAL(0), stratum 10
> Jul 29 17:30:46 devserver21 ntpd[3475]: synchronized to 192.168.1.65, stratum 3
> Jul 29 17:47:24 devserver21 ntpd[3475]: synchronized to LOCAL(0), stratum 10
> Aug 12 22:48:29 devserver21 ntpd[3475]: sendto(192.168.1.66): Operation not
> permitted
> [root at devserver21 ~]# uptime
>  08:10:19 up 164 days,  9:56,  2 users,  load average: 0.20, 0.54, 0.81
> [root at devserver21 ~]#

Assumption: this is not any kind of virtual machine.
Assumption: your local time server is NOT a GPS with an ovenized crystal, or even a cell phone time
source, i.e. it is NOT very stable.
Assumption: the time servers that you are following (192.168.1.6[5-7]) are:
	a) each following the same time server(s), or at least have one in common,
	b) peering with one another, and
	c) following time servers that are reasonably stable.
Assumption: the time farm is on real, non-busy hardware (an old Cisco router serving as the internet
connection to 1000+ computers does not qualify as non-busy) and is configured to achieve
maxpoll 10 or higher.

one problem that you have is that your time server farm (192.168.1.6[5-7]) is occasionally losing its
servers, i.e. we see "synchronized to LOCAL(0)" every so often. With a well-configured time farm
that should happen hours to days apart, not minutes apart.
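
To get a feel for how often this happens, a minimal sketch along these lines could count the
fallbacks in the log. The function name count_local_fallbacks is made up for illustration; the
match string is the ntpd message quoted in the log above.

```shell
# Count how many times ntpd fell back to the local clock.
# Reads the named log file, or stdin if no file is given.
# Note: with more than one file, grep -c prints one count per file.
count_local_fallbacks() {
    # -F: treat the pattern as a fixed string, so "(0)" is matched literally
    # -c: print the number of matching lines instead of the lines themselves
    grep -F -c 'synchronized to LOCAL(0)' "$@"
}
```

Usage would be something like `count_local_fallbacks /var/log/messages`. Comparing the count
against the time span covered by the log gives a rough fallback rate.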

the second problem is that a machine which is not intended to be a time server is configured with a
local clock with a stratum better than 15.

suggestion 1: 65 should have its local clock at stratum 13; 66 and 67 should have their local clocks
at stratum 14 or 15; all other machines should either have no local clock or have one at a stratum
no better than 15. Yes, after reading the ntp documentation, I disagree with Red Hat's default.
the net result should be that you don't get any local clock loops in the setup, because you have a
defined leader, and if even the defined leader is lost the other machines should drift stably.
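
As a rough sketch of suggestion 1, the local clock lines in /etc/ntp.conf might look like the
following. The stratum values are the ones suggested above (14 is picked from "14 or 15"); the
per-host layout is an illustration, not a tested configuration.

```
# /etc/ntp.conf on 192.168.1.65 (the defined leader)
server  127.127.1.0     # local clock
fudge   127.127.1.0 stratum 13

# /etc/ntp.conf on 192.168.1.66 and 192.168.1.67
server  127.127.1.0     # local clock
fudge   127.127.1.0 stratum 14

# /etc/ntp.conf on all other machines (e.g. devserver21):
# leave out the "server 127.127.1.0" and "fudge 127.127.1.0" lines entirely
```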

suggestion 2: 65, 66 & 67 should ALL peer with one another for added stability in the time farm.
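
A minimal sketch of suggestion 2, shown for 65 only; 66 and 67 would each list the other two
servers in the same way. This assumes the restrict lines on the three servers allow them to
talk to one another.

```
# /etc/ntp.conf on 192.168.1.65
peer 192.168.1.66
peer 192.168.1.67
```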

suggestion 3: client machines should 'prefer' one of your servers over the others.
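
A minimal sketch of suggestion 3 for a client such as devserver21. Preferring 65 is only an
assumption here, chosen because it is the defined leader in suggestion 1; any one of the three
could carry the keyword.

```
# /etc/ntp.conf on a client
server 192.168.1.65 prefer
server 192.168.1.66
server 192.168.1.67
```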

suggestion 4: see if someone has been messing with the kernel ticks on the machine...
run `tickadj` (see file:///usr/share/doc/ntp-4.2.2p1/tickadj.html).
I had one computer where I needed to tweak the default value up or down by one (I don't remember
which) to make it really stable; this should be a last resort.
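
For a sense of scale, here is a small sketch that converts a frequency error in PPM into seconds
of drift per day. It assumes a nominal tick of 10000 microseconds (the usual value with
USER_HZ=100, which may not match this machine), so a change of one tick unit is about 100 PPM.
The function name ppm_to_seconds_per_day is made up for illustration.

```shell
# Convert a frequency error in parts per million to seconds of drift per day.
# 1 PPM = 1 microsecond per second = 86400 microseconds (0.0864 s) per day.
ppm_to_seconds_per_day() {
    # LC_ALL=C keeps the decimal separator a "." regardless of locale
    LC_ALL=C awk -v ppm="$1" 'BEGIN { printf "%.2f\n", ppm * 86400 / 1000000 }'
}
```

For example, `ppm_to_seconds_per_day 100` gives the daily effect of a one-unit tick change under
the assumption above, and `ppm_to_seconds_per_day 512` gives the daily drift that corresponds to
the 512 PPM figure in the quoted ntpd messages.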

-- 
Todd Denniston
Crane Division, Naval Surface Warfare Center (NSWC Crane)
Harnessing the Power of Technology for the Warfighter