Strange NTP problem

List overview All Threads
Download

newer

older

RE: [CentOS] COBOL

Failed boot drive

Alfred von Campe

20 May 2008 20 May '08

7:46 p.m.

I have 30 identical Lenovo desktop systems running CentOS 5.1. On one of those systems the clock is running slow (5+ minutes from yesterday to this morning and another minute since this morning) despite the fact that NTP is running on all of them and they all have the exact same /etc/ntp.conf file (I compared the MD5 sums of that file on all the systems). Here is the output of "grep ntp /var/log messages" on the system with the problem since I restarted the NTP daemon earlier today:

May 20 11:35:38 hepdsw03 ntpd[31791]: ntpd 4.2.2p1@1.1570-o Sat Nov 10 12:33:50 UTC 2007 (1) May 20 11:35:38 hepdsw03 ntpd[31792]: precision = 1.000 usec May 20 11:35:38 hepdsw03 ntpd[31792]: Listening on interface wildcard, 0.0.0.0#123 Disabled May 20 11:35:38 hepdsw03 ntpd[31792]: Listening on interface wildcard, ::#123 Disabled May 20 11:35:38 hepdsw03 ntpd[31792]: Listening on interface lo, :: 1#123 Enabled May 20 11:35:38 hepdsw03 ntpd[31792]: Listening on interface eth0, fe80::210:c6ff:feab:dd92#123 Enabled May 20 11:35:38 hepdsw03 ntpd[31792]: Listening on interface lo, 127.0.0.1#123 Enabled May 20 11:35:38 hepdsw03 ntpd[31792]: Listening on interface eth0, 10.66.42.109#123 Enabled May 20 11:35:38 hepdsw03 ntpd[31792]: kernel time sync status 0040 May 20 11:35:38 hepdsw03 ntpd[31792]: frequency initialized 0.000 PPM from /var/lib/ntp/drift May 20 11:38:55 hepdsw03 ntpd[31792]: synchronized to LOCAL(0), stratum 10 May 20 11:38:55 hepdsw03 ntpd[31792]: kernel time sync disabled 0001 May 20 11:39:59 hepdsw03 ntpd[31792]: synchronized to 10.101.32.104, stratum 3 May 20 11:40:58 hepdsw03 ntpd[31792]: synchronized to LOCAL(0), stratum 10 May 20 11:42:09 hepdsw03 ntpd[31792]: synchronized to 10.101.32.104, stratum 3 May 20 11:47:26 hepdsw03 ntpd[31792]: synchronized to LOCAL(0), stratum 10 May 20 11:49:31 hepdsw03 ntpd[31792]: synchronized to 10.101.32.104, stratum 3 May 20 11:52:48 hepdsw03 ntpd[31792]: synchronized to LOCAL(0), stratum 10 May 20 11:54:54 hepdsw03 ntpd[31792]: synchronized to 10.101.32.104, stratum 3 May 20 12:01:26 hepdsw03 ntpd[31792]: synchronized to LOCAL(0), stratum 10

Any idea what could be causing this? For now I have disabled the NTP daemon and am running ntpdate once an hour. What complicates the matter is that we are using Kerberos for authentication, and after a day or so the user can not log into his system anymore because of the time skew.

Alfred

Show replies by date

Paul Heinlein

20 May 20 May

8:56 p.m.

On Tue, 20 May 2008, Alfred von Campe wrote:

...

I have 30 identical Lenovo desktop systems running CentOS 5.1. On one of those systems the clock is running slow (5+ minutes from yesterday to this morning and another minute since this morning) despite the fact that NTP is running on all of them and they all have the exact same /etc/ntp.conf file (I compared the MD5 sums of that file on all the systems). Here is the output of "grep ntp /var/log messages" on the system with the problem since I restarted the NTP daemon earlier today:

A slew of 5 min/24 hrs should be in the range of fixable.

...

May 20 11:35:38 hepdsw03 ntpd[31792]: frequency initialized 0.000 PPM from /var/lib/ntp/drift

This is very suspect. Are there any SELinux or other log messages suggesting that ntpd isn't able to write to its drift file? Your local clock is definitely drifting, so a 0.000 value is bogus. It may indicate that there's a disconnect between ntpd and the filesystem.

I'd be interested in the output of "ntpdc -c kerninfo"; on most systems the 'pll frequency' value is a close match to the figure in the drift file.

...

May 20 11:38:55 hepdsw03 ntpd[31792]: synchronized to LOCAL(0), stratum 10 May 20 11:38:55 hepdsw03 ntpd[31792]: kernel time sync disabled 0001 May 20 11:39:59 hepdsw03 ntpd[31792]: synchronized to 10.101.32.104, stratum 3

This is ungood. Sync-ing to local before your network time server means that your machine doesn't want to believe your server -- and you should see a "kernel time sync enabled" message once the machine has sync-ed with the time server.

You said the machines are identical. Could there be any variation in the BIOS revision level or its settings? Sometimes ACPI stuff can mess up ntp.

Also -- the log messages you provide have no "step time server" reference. Do you have a valid /etc/ntp/step-tickers file?

-- Paul Heinlein <> heinlein@madboa.com <> http://www.madboa.com/

Alfred von Campe

21 May 21 May

12:49 a.m.

On May 20, 2008, at 16:56, Paul Heinlein wrote:

...

A slew of 5 min/24 hrs should be in the range of fixable.

If the NTP daemon was doing its job :-).

...

This is very suspect. Are there any SELinux or other log messages suggesting that ntpd isn't able to write to its drift file? Your local clock is definitely drifting, so a 0.000 value is bogus. It may indicate that there's a disconnect between ntpd and the filesystem.

I grep'ed for ntp in the /var/log/messages file and there were no other instances of ntp. SELinux is disabled, and all my systems are built with Kickstart from the same template file. I'm fairly confident that the other 29 systems are configured identically to each other.

...

I'd be interested in the output of "ntpdc -c kerninfo"; on most systems the 'pll frequency' value is a close match to the figure in the drift file.

# ntpdc -c kerninfo pll offset: 0 s pll frequency: 0.000 ppm maximum error: 0.02724 s estimated error: 0 s status: 0001 pll pll time constant: 6 precision: 1e-06 s frequency tolerance: 512 ppm

On a system where it's working, it looks like this:

# ntpdc -c kerninfo pll offset: 0.011131 s pll frequency: 81.440 ppm maximum error: 0.241978 s estimated error: 0.005287 s status: 0001 pll pll time constant: 4 precision: 1e-06 s frequency tolerance: 512 ppm

I re-started the NTP daemon around 5:30 this afternoon, and the clock is already off by 45 seconds. Here are the latest entries from /var/ log/messages:

May 20 17:25:25 hepdsw03 ntpd[4225]: ntpd 4.2.2p1@1.1570-o Sat Nov 10 12:33:50 UTC 2007 (1) May 20 17:25:25 hepdsw03 ntpd[4226]: precision = 1.000 usec May 20 17:25:25 hepdsw03 ntpd[4226]: Listening on interface wildcard, 0.0.0.0#123 Disabled May 20 17:25:25 hepdsw03 ntpd[4226]: Listening on interface wildcard, ::#123 Disabled May 20 17:25:25 hepdsw03 ntpd[4226]: Listening on interface lo, :: 1#123 Enabled May 20 17:25:25 hepdsw03 ntpd[4226]: Listening on interface eth0, fe80::210:c6ff:feab:dd92#123 Enabled May 20 17:25:25 hepdsw03 ntpd[4226]: Listening on interface lo, 127.0.0.1#123 Enabled May 20 17:25:25 hepdsw03 ntpd[4226]: Listening on interface eth0, 10.66.42.109#123 Enabled May 20 17:25:25 hepdsw03 ntpd[4226]: kernel time sync status 0040 May 20 17:25:25 hepdsw03 ntpd[4226]: frequency initialized 0.000 PPM from /var/lib/ntp/drift May 20 17:28:37 hepdsw03 ntpd[4226]: synchronized to LOCAL(0), stratum 10 May 20 17:28:37 hepdsw03 ntpd[4226]: kernel time sync disabled 0001 May 20 17:30:47 hepdsw03 ntpd[4226]: synchronized to 10.101.32.104, stratum 3 May 20 17:35:04 hepdsw03 ntpd[4226]: synchronized to LOCAL(0), stratum 10 May 20 17:37:13 hepdsw03 ntpd[4226]: synchronized to 10.101.32.104, stratum 3 May 20 17:38:18 hepdsw03 ntpd[4226]: synchronized to LOCAL(0), stratum 10

In comparison, here are the entries from a system where NTP is working as expected:

May 20 14:36:51 balboa01 ntpd[3374]: ntpd 4.2.2p1@1.1570-o Sat Nov 10 12:33:50 UTC 2007 (1) May 20 14:36:51 balboa01 ntpd[3375]: precision = 1.000 usec May 20 14:36:51 balboa01 ntpd[3375]: Listening on interface wildcard, 0.0.0.0#123 Disabled May 20 14:36:51 balboa01 ntpd[3375]: Listening on interface wildcard, ::#123 Disabled May 20 14:36:51 balboa01 ntpd[3375]: Listening on interface lo, :: 1#123 Enabled May 20 14:36:51 balboa01 ntpd[3375]: Listening on interface eth0, fe80::21a:6bff:fe46:33d1#123 Enabled May 20 14:36:51 balboa01 ntpd[3375]: Listening on interface lo, 127.0.0.1#123 Enabled May 20 14:36:51 balboa01 ntpd[3375]: Listening on interface eth0, 10.66.43.100#123 Enabled May 20 14:36:51 balboa01 ntpd[3375]: kernel time sync status 0040 May 20 14:40:06 balboa01 ntpd[3375]: synchronized to LOCAL(0), stratum 10 May 20 14:41:10 balboa01 ntpd[3375]: synchronized to 10.101.32.104, stratum 3 May 20 15:00:26 balboa01 ntpd[3375]: time reset -1.533233 s May 20 15:00:26 balboa01 ntpd[3375]: kernel time sync enabled 0001 May 20 15:03:45 balboa01 ntpd[3375]: synchronized to LOCAL(0), stratum 10 May 20 15:05:20 balboa01 ntpd[3375]: synchronized to 10.101.32.104, stratum 3

...

You said the machines are identical. Could there be any variation in the BIOS revision level or its settings? Sometimes ACPI stuff can mess up ntp.

Yes, there may be some BIOS revision differences, but I'm pretty sure that at least one of the other 29 systems has an identical BIOS revision.

...

Also -- the log messages you provide have no "step time server" reference. Do you have a valid /etc/ntp/step-tickers file?

The files exists but is empty on all my systems.

Alfred

Jason Clark

22 May 22 May

3:31 p.m.

I had a similar problem on a different server that I fixed last night. Evidently it had a BIOS level feature that tried to modify the CPU clock rate, much like cpu-freq does within the kernel, and was doing so by messing with the system clock impacting the RTC. I was drifting all over the place until I found and disabled that feature (foxcon board, something like foxstep I believe is what it is called in BIOS). Not sure if your lenovo boards have that feature, but i know that some ASUS boards do.

Jason www.cyborgworkshop.org

Paul Heinlein wrote:

...

On Tue, 20 May 2008, Alfred von Campe wrote:

...
I have 30 identical Lenovo desktop systems running CentOS 5.1. On one of those systems the clock is running slow (5+ minutes from yesterday to this morning and another minute since this morning) despite the fact that NTP is running on all of them and they all have the exact same /etc/ntp.conf file (I compared the MD5 sums of that file on all the systems). Here is the output of "grep ntp /var/log messages" on the system with the problem since I restarted the NTP daemon earlier today:

A slew of 5 min/24 hrs should be in the range of fixable.

...
May 20 11:35:38 hepdsw03 ntpd[31792]: frequency initialized 0.000 PPM from /var/lib/ntp/drift

This is very suspect. Are there any SELinux or other log messages suggesting that ntpd isn't able to write to its drift file? Your local clock is definitely drifting, so a 0.000 value is bogus. It may indicate that there's a disconnect between ntpd and the filesystem.

I'd be interested in the output of "ntpdc -c kerninfo"; on most systems the 'pll frequency' value is a close match to the figure in the drift file.

...
May 20 11:38:55 hepdsw03 ntpd[31792]: synchronized to LOCAL(0), stratum 10 May 20 11:38:55 hepdsw03 ntpd[31792]: kernel time sync disabled 0001 May 20 11:39:59 hepdsw03 ntpd[31792]: synchronized to 10.101.32.104, stratum 3

This is ungood. Sync-ing to local before your network time server means that your machine doesn't want to believe your server -- and you should see a "kernel time sync enabled" message once the machine has sync-ed with the time server.

You said the machines are identical. Could there be any variation in the BIOS revision level or its settings? Sometimes ACPI stuff can mess up ntp.

Also -- the log messages you provide have no "step time server" reference. Do you have a valid /etc/ntp/step-tickers file?

Filipe Brandenburger

21 May 21 May

12:25 a.m.

On Tue, May 20, 2008 at 3:46 PM, Alfred von Campe alfred@von-campe.com wrote:

...

I have 30 identical Lenovo desktop systems running CentOS 5.1. On one of those systems the clock is running slow (5+ minutes from yesterday to this morning and another minute since this morning) despite the fact that NTP is running on all of them and they all have the exact same /etc/ntp.conf file (I compared the MD5 sums of that file on all the systems). Here is the output of "grep ntp /var/log messages" on the system with the problem since I restarted the NTP daemon earlier today:

Hi Alfred,

What is the output of "ntpq -np" ? You should have a line with a star (*), otherwise it is not synchronizing. Start running NTP again, wait for half an hour and issue that command to see what your output is. It could be a problem related to filtering UDP traffic, either with iptables on the machine itself, or some other router in the network your machine is in.

HTH, Filipe

Alfred von Campe

12:53 a.m.

On May 20, 2008, at 20:25, Filipe Brandenburger wrote:

...

What is the output of "ntpq -np" ? You should have a line with a star (*), otherwise it is not synchronizing. Start running NTP again, wait for half an hour and issue that command to see what your output is. It could be a problem related to filtering UDP traffic, either with iptables on the machine itself, or some other router in the network your machine is in.

Bad system:

# ntpq -np remote refid st t when poll reach delay offset jitter ======================================================================== ====== 10.101.32.104 67.128.71.65 3 u 689 1024 377 0.659 54095.7 4263.68 *127.127.1.0 .LOCL. 10 l 28 64 377 0.000 0.000 0.001

Good system:

# ntpq -np remote refid st t when poll reach delay offset jitter ======================================================================== ====== *10.101.32.104 67.128.71.65 3 u 232 256 377 0.280 15.503 1.204 127.127.1.0 .LOCL. 10 l 64 64 377 0.000 0.000 0.001

The bad system is synch'ing with itself (which was apparent from the messages in /var/log/messages). The question is why?

Alfred

Filipe Brandenburger

1:09 a.m.

On Tue, May 20, 2008 at 8:53 PM, Alfred von Campe alfred@von-campe.com wrote:

...

# ntpq -np remote refid st t when poll reach delay offset jitter ============================================================================== 10.101.32.104 67.128.71.65 3 u 689 1024 377 0.659 54095.7 4263.68 *127.127.1.0 .LOCL. 10 l 28 64 377 0.000 0.000 0.001

The bad system is synch'ing with itself (which was apparent from the messages in /var/log/messages). The question is why?

When the offset is too big, NTP will discard the server as a source of synchronization. Try stoping ntpd, setting the date with ntpdate, then starting ntpq -np again to see what happens.

I saw that the jitter appears to be really big too. That might be caused by your server going too slow or too fast in a rate that NTP couldn't keep up to. Once it happened to me, somehow (IRQ problems?) one machine started going so fast that NTP couldn't keep up with it, so I had to put periodic ntpdate's to keep it at the right time. I rebooted it off-hours and the problem went away. Did you try rebooting this machine and seeing if it fixes the problem?

One good side is the "reach = 377", that means your network is good and all NTP packets are being exchanged correctly. 377 is octal for binary 11111111, which means the last 8 NTP packet exchanges were successful.

Good luck! Filipe

Paul Heinlein

4:55 a.m.

On Tue, 20 May 2008, Alfred von Campe wrote:

...

Bad system:

# ntpq -np remote refid st t when poll reach delay offset jitter ============================================================================== 10.101.32.104 67.128.71.65 3 u 689 1024 377 0.659 54095.7 4263.68

Yeah, with an offset of 54 seconds, it's a bad system. :-)

Try this (assuming 10.101.32.104 is your preferred local NTP server):

service ntpd stop echo "10.101.32.104" > /etc/ntp/step-tickers service ntpd start

Adding a server to the step-tickers file will tell the ntpd init script to do an ntpdate sync against that host before starting ntpd.

-- Paul Heinlein <> heinlein@madboa.com <> www.madboa.com

Alfred von Campe

10:32 a.m.

On May 21, 2008, at 0:55, Paul Heinlein wrote:

...

Yeah, with an offset of 54 seconds, it's a bad system. :-)

Try this (assuming 10.101.32.104 is your preferred local NTP server):

service ntpd stop echo "10.101.32.104" > /etc/ntp/step-tickers service ntpd start

Adding a server to the step-tickers file will tell the ntpd init script to do an ntpdate sync against that host before starting ntpd.

In effect, that's what I did manually. When I found out that about this problem, I did the following:

# service ntpd stop # ntpdate ntp # service ntpd start

(ntp is a CNAME record for our local NTP server). But after 24 hours, the system was off my more than 5 minutes. Why would it not bind to the NTP server. Here is some output from tshark while ntpd was running:

# tshark udp port 123 Running as user "root" and group "root". This could be dangerous. Capturing on eth0 0.000000 10.66.42.109 -> 10.101.32.104 NTP NTP client 0.000683 10.101.32.104 -> 10.66.42.109 NTP NTP server 1025.000547 10.66.42.109 -> 10.101.32.104 NTP NTP client 1025.002017 10.101.32.104 -> 10.66.42.109 NTP NTP server 2050.000145 10.66.42.109 -> 10.101.32.104 NTP NTP client 2050.000812 10.101.32.104 -> 10.66.42.109 NTP NTP server 3076.000485 10.66.42.109 -> 10.101.32.104 NTP NTP client 3076.001917 10.101.32.104 -> 10.66.42.109 NTP NTP server 4101.001065 10.66.42.109 -> 10.101.32.104 NTP NTP client 4101.002016 10.101.32.104 -> 10.66.42.109 NTP NTP server 5126.000671 10.66.42.109 -> 10.101.32.104 NTP NTP client 5126.001551 10.101.32.104 -> 10.66.42.109 NTP NTP server 6151.001270 10.66.42.109 -> 10.101.32.104 NTP NTP client 6151.001868 10.101.32.104 -> 10.66.42.109 NTP NTP server 7175.999866 10.66.42.109 -> 10.101.32.104 NTP NTP client 7176.000482 10.101.32.104 -> 10.66.42.109 NTP NTP server 8201.999206 10.66.42.109 -> 10.101.32.104 NTP NTP client 8202.000563 10.101.32.104 -> 10.66.42.109 NTP NTP server 9227.000796 10.66.42.109 -> 10.101.32.104 NTP NTP client 9227.001462 10.101.32.104 -> 10.66.42.109 NTP NTP server

Does this make any sense?

Alfred

Alfred von Campe

22 May 22 May

3:21 p.m.

I'm still having this issue. Here is another update. I noticed that the drift file for the system with the problem contained "0.000". On most other systems this contains a positive number (and on two a negative number). I deleted the drift file, resynch'ed the time with "ntpdate <hostname>", restarted the NTP daemon, and waited for the drift file to be recreated. It again contained "0.000" and the output of "ntpq -np looked like this:

remote refid st t when poll reach delay offset jitter

======================================================================== ====== 10.101.32.104 67.128.71.65 3 u 49 64 377 0.611 1871.40 987.132 *127.127.1.0 .LOCL. 10 l 48 64 377 0.000 0.000 0.001

I replaced the drift file with the contents of the file before the upgrade to CentOS, resynch'ed the time, and restarted the NTP daemon. But after a little while, the system is "bound" to itself again.

BTW, the output of the cron job running ntpdate once an hour showed that the system has a very steady drift of 14.9 seconds every hour. Does that seem excessive? NTP should be able to handle this, right?

Alfred

6439

Age (days ago)

6441

Last active (days ago)

discuss@lists.centos.org

9 comments

4 participants

tags (0)

participants (4)

Alfred von Campe
Filipe Brandenburger
Jason Clark
Paul Heinlein