--------------------- Kernel Begin ------------------------
1 Time(s): Clock: inserting leap second 23:59:60 UTC
---------------------- Kernel End -------------------------
hee hee.
gotta love it....
On 07/01/2012 03:05 PM, Bob Hoffman wrote:
--------------------- Kernel Begin ------------------------
1 Time(s): Clock: inserting leap second 23:59:60 UTC
---------------------- Kernel End -------------------------
hee hee.
gotta love it....
My oracle database running on CentOS 6 didn't love it :-(
Some java processes were >100% CPU after the leap second was added.
Rebooting...
Mogens
You could have just done:
service ntpd stop; date -s "`date`"; service ntpd start
Fixed here without even stopping any JVM.
On Sun, Jul 1, 2012 at 5:07 PM, Mogens Kjaer mk@lemo.dk wrote:
On 07/01/2012 03:05 PM, Bob Hoffman wrote:
--------------------- Kernel Begin ------------------------
1 Time(s): Clock: inserting leap second 23:59:60 UTC
---------------------- Kernel End -------------------------
hee hee.
gotta love it....
My oracle database running on CentOS 6 didn't love it :-(
Some java processes were >100% CPU after the leap second was added.
Rebooting...
Mogens
-- Mogens Kjaer, mk@lemo.dk http://www.lemo.dk _______________________________________________ CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos
On Sun, 1 Jul 2012, Erez Zarum wrote:
You could have just done:
service ntpd stop; date -s "`date`"; service ntpd start
Fixed here without even stopping any JVM.
The interesting thing to me is that my c5 systems just kept on ticking but my c6 systems had the load go through the roof and fill the logs with things like the following:
Jun 30 19:59:59 casper kernel: Clock: inserting leap second 23:59:60 UTC
Jun 30 19:59:59 casper tgtd: work_timer_evt_handler(89) failed to read from timerfd, Resource temporarily unavailable
Jun 30 19:59:59 casper tgtd: work_timer_evt_handler(89) failed to read from timerfd, Resource temporarily unavailable
Jun 30 19:59:59 casper tgtd: work_timer_evt_handler(89) failed to read from timerfd, Resource temporarily unavailable
Regards,
On Sun, 1 Jul 2012, me@tdiehl.org wrote:
To: CentOS mailing list centos@centos.org From: me@tdiehl.org Subject: Re: [CentOS] leap second
On Sun, 1 Jul 2012, Erez Zarum wrote:
You could have just done:
service ntpd stop; date -s "`date`"; service ntpd start
Fixed here without even stopping any JVM.
I thought this was some sort of late April Fools' joke, until I read the article about ntpd on Slashdot.
My CentOS 5.8 box is running ntpd, and I did not notice any problems with it. I do a weekly yum update early Sunday mornings, but AFAIR I have not rebooted the box yet.
Checking qps, it tells me the uptime is 4 days 23hours, 53 minutes.
Kind Regards,
Keith
On Mon, Jul 2, 2012 at 11:02 AM, Keith Roberts keith@karsites.net wrote:
I thought this was some sort of late April Fools' joke, until I read the article about ntpd on Slashdot.
I'm sort of curious about how a bug of this magnitude slips through the QA process (into java and RHEL, not CentOS). With all the furor about y2k, did no one even bother to simulate a leap second ahead of the real occurrence?
My CentOS 5.8 box is running ntpd, and I did not notice any problems with it. I do a weekly yum update early Sunday mornings, but AFAIR I have not rebooted the box yet.
I don't think it affected 5.x.
On Mon, Jul 02, 2012 at 11:09:41AM -0500, Les Mikesell wrote:
I'm sort of curious about how a bug of this magnitude slips through the QA process (into java and RHEL, not CentOS). With all the furor about y2k, did no one even bother to simulate a leap second ahead of the real occurrence?
The kernel bug is a race condition; simulations may not have detected it.
On Mon, Jul 2, 2012 at 11:24 AM, Stephen Harris lists@spuddy.org wrote:
On Mon, Jul 02, 2012 at 11:09:41AM -0500, Les Mikesell wrote:
I'm sort of curious about how a bug of this magnitude slips through the QA process (into java and RHEL, not CentOS). With all the furor about y2k, did no one even bother to simulate a leap second ahead of the real occurrence?
The kernel bug is a race condition; simulations may not have detected it.
The java one seemed to be a pretty sure thing. Was this just openjdk or was the current Oracle version affected too?
Hi Les,
I'm sort of curious about how a bug of this magnitude slips through the QA process (into java and RHEL, not CentOS). With all the furor about y2k, did no one even bother to simulate a leap second ahead of the real occurrence?
... and leap seconds are not even scarce. According to http://en.wikipedia.org/wiki/Leap_second, this was the third one since 2000, and it is actually the first time I heard of any problems.
On the other hand I'm a bit surprised that the problems were comparably few - actually there is a time '01:59:60' for one second, and any plausibility check I've ever seen assumes that minutes and seconds are in the range from 0..59. Wrongly, it seems.
Apparently Google uses an approach that looks much less risky to me - they use a time window over which they 'smear' the leap second by making their time servers lie about the time for a while, making it pass a little bit slower. That way they avoid the unlucky 61st second and still advance the clocks within a reasonable time.
http://googleblog.blogspot.de/2011/09/time-technology-and-leaping-seconds.html
Kind Regards,
Peter.
On Mon, Jul 2, 2012 at 11:24 AM, Peter Eckel lists@eckel-edv.de wrote:
On the other hand I'm a bit surprised that the problems were comparably few - actually there is a time '01:59:60' for one second, and any plausibility check I've ever seen assumes that minutes and seconds are in the range from 0..59. Wrongly, it seems.
Apparently Google uses an approach that looks much less risky to me - they use a time window over which they 'smear' the leap second by making their time servers lie about the time for a while, making it pass a little bit slower. That way they avoid the unlucky 61st second and still advance the clocks within a reasonable time.
http://googleblog.blogspot.de/2011/09/time-technology-and-leaping-seconds.html
Interesting, but I thought that ntp clients always advanced the clock by small fractions of a second anyway even when the master source differs by more.
Hi Les,
Interesting, but I thought that ntp clients always advanced the clock by small fractions of a second anyway even when the master source differs by more.
They do. But the leap second is quite a different thing: actually the time doesn't really diverge from the server's, but the stratum 1 server delivers a totally unexpected 01:59:60, and the stratum 2 server follows.
The Google approach is not to use that time at all, but to slow the clock down a bit on the stratum 2 server so that the stratum 1 (which has the 'genuine' time and jumps to the :60 time stamp after :59) is, after the time window is over, about one second ahead of the stratum 2. So approximately the instant when the stratum 1 server jumps from :59 to :60, the stratum 2 server jumps from :58 to :59, and at the next second tick, they will both jump to 02:00:00 and be in sync again. The same approach would work with a negative leap second, although one has never been needed so far.
The disadvantage of this method is that you have to know in advance when the leap second will happen, which requires tables that regularly have to be updated since it is fairly unpredictable in the long run when a leap second will be necessary. I don't know why they didn't simply use the 'LI' bit in the NTP protocol to determine when to start 'smearing' - at least the article doesn't say they did:
http://www.networksorcery.com/enp/protocol/ntp.htm
Maybe 24 hours notification in advance did not seem long enough for the smear interval. I doubt it, because I would not really like the time to differ from the real time for more than a day.
Best regards,
Peter.
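For illustration, the smear idea described above can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the linear ramp, the 20-hour window and the names `smear_offset` and `WINDOW_SECONDS` are hypothetical and do not come from the thread or the linked article, and Google's real implementation may well differ.

```python
# Hypothetical linear leap smear; NOT Google's actual algorithm.
# The window length is an assumed value chosen only for this example.
WINDOW_SECONDS = 20 * 3600


def smear_offset(seconds_into_window: float,
                 window: float = WINDOW_SECONDS) -> float:
    """Return how many seconds the smeared clock lags the unsmeared clock.

    The lag grows from 0 at the start of the window to 1 at its end, so the
    smeared clock never has to show a :60 second.
    """
    if seconds_into_window <= 0:
        return 0.0
    if seconds_into_window >= window:
        return 1.0
    return seconds_into_window / window


if __name__ == "__main__":
    for hours in (0, 5, 10, 15, 20):
        print(hours, "h into window: lag =", smear_offset(hours * 3600), "s")
```

Once the lag reaches one full second, the upstream server inserts its :60 second and both clocks agree again at the next tick, as described in the message above.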
On Mon, Jul 02, 2012 at 07:37:46PM +0200, Peter Eckel wrote:
Maybe 24 hours notification in advance did not seem long enough for the smear interval. I doubt it, because I would not really like the time to differ from the real time for more than a day.
Yeah, there are some regulatory requirements in some industries that the server clocks be within 100ms of UTC (or, at least, that's how internal audit have interpreted the regulations where I work).
Allowing the clock to drift by a second would normally be bad, but I guess it wouldn't matter on a non-business day. (Well, non-business for 95% of the company where this matters - some areas were still open).
On 07/02/2012 10:37 AM, Peter Eckel wrote:
They do. But the leap second is quite a different thing: actually the time doesn't really diverge from the server's, but the stratum 1 server delivers a totally unexpected 01:59:60, and the stratum 2 server follows.
That's not quite correct. The NTP protocol (as you mentioned later) actually indicates that the current day should include a leap second, the NTP server notifies the kernel that the day should include a leap second, and the kernel inserts the leap second at the end of the day by extending the duration of one of the system clock's seconds.
The "60" second doesn't exist in NTP or in the POSIX system clock, both of which are counters from their respective epochs. The "60" second is present only in time representations that are converted from the system clock or NTP clock.
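A small Python sketch of the point that the ":60" second exists only in converted representations, not in the counter. It uses `calendar.timegm` from the standard library, which does plain arithmetic on the broken-down fields and knows nothing about leap seconds; the variable names are just for this example.

```python
import calendar

# Broken-down UTC times around the leap second at the end of 2012-06-30.
before = calendar.timegm((2012, 6, 30, 23, 59, 59))
leap = calendar.timegm((2012, 6, 30, 23, 59, 60))
after = calendar.timegm((2012, 7, 1, 0, 0, 0))

# The counter only advances by one between 23:59:59 and 00:00:00 ...
print("23:59:59 ->", before)
print("00:00:00 ->", after)
print("difference:", after - before)

# ... so 23:59:60 has no counter value of its own; plain arithmetic maps it
# onto the same number as 00:00:00 of the next day.
print("23:59:60 ->", leap, "(same as 00:00:00:", leap == after, ")")
```

In other words, a POSIX-style seconds counter has no slot for the leap second; something has to repeat or stretch a second to absorb it.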
On 07/02/2012 09:24 AM, Peter Eckel wrote:
On the other hand I'm a bit surprised that the problems were comparably few - actually there is a time '01:59:60' for one second, and any plausibility check I've ever seen assumes that minutes and seconds are in the range from 0..59. Wrongly, it seems.
As far as I've been able to understand it, the problem had nothing to do with validity checks or other date handling code. The problem was simply a bug in the API provided by the Linux kernel for notification of leap seconds. The kernel messed up some internal data that led to futexes going nuts. The affected programs weren't handling dates poorly, they were just threaded applications.
Apparently Google uses an approach that looks much less risky to me - they use a time window over which they 'smear' the leap second by making their time servers lie about the time for a while, making it pass a little bit slower. That way they avoid the unlucky 61st second and still advance the clocks within a reasonable time.
Google's approach was reliable by chance. They used a different kernel API to adjust the clock, and that one didn't break futexes.
On Mon, Jul 2, 2012 at 1:32 PM, Gordon Messmer yinyang@eburg.com wrote:
On 07/02/2012 09:24 AM, Peter Eckel wrote:
On the other hand I'm a bit surprised that the problems were comparably few - actually there is a time '01:59:60' for one second, and any plausibility check I've ever seen assumes that minutes and seconds are in the range from 0..59. Wrongly, it seems.
As far as I've been able to understand it, the problem had nothing to do with validity checks or other date handling code. The problem was simply a bug in the API provided by the Linux kernel for notification of leap seconds. The kernel messed up some internal data that led to futexes going nuts. The affected programs weren't handling dates poorly, they were just threaded applications.
Apparently Google uses an approach that looks much less risky to me - they use a time window over which they 'smear' the leap second by making their time servers lie about the time for a while, making it pass a little bit slower. That way they avoid the unlucky 61st second and still advance the clocks within a reasonable time.
Google's approach was reliable by chance. They used a different kernel API to adjust the clock, and that one didn't break futexes.
So it wasn't anything special about java? I did find one not-very-busy instance of a CentOS 6.x with a java application still running that did not appear to have a problem.
On 07/02/2012 11:45 AM, Les Mikesell wrote:
So it wasn't anything special about java? I did find one not-very-busy instance of a CentOS 6.x with a java application still running that did not appear to have a problem.
Only that java applications tend to be threaded, and threaded applications were the ones likely to be affected by the bug.
On Mon, Jul 2, 2012 at 2:10 PM, Gordon Messmer yinyang@eburg.com wrote:
On 07/02/2012 11:45 AM, Les Mikesell wrote:
So it wasn't anything special about java? I did find one not-very-busy instance of a CentOS 6.x with a java application still running that did not appear to have a problem.
Only that java applications tend to be threaded, and threaded applications were the ones likely to be affected by the bug.
Sooo... Are the 6.x boxes that did not exhibit a problem yet still likely to have it if you start a threaded program or did it have to happen in the 1 second window?
On 07/02/2012 12:54 PM, Les Mikesell wrote:
Sooo... Are the 6.x boxes that did not exhibit a problem yet still likely to have it if you start a threaded program or did it have to happen in the 1 second window?
As far as I know, it could still pop up. The futex handling in the kernel will be screwed up until the system reboots, or until the time is set using an API that wasn't affected by the bug. That's why one of the recommended fixes is just to:
date -s "`date`"
Gordon Messmer wrote:
On 07/02/2012 12:54 PM, Les Mikesell wrote:
Sooo... Are the 6.x boxes that did not exhibit a problem yet still likely to have it if you start a threaded program or did it have to happen in the 1 second window?
As far as I know, it could still pop up. The futex handling in the kernel will be screwed up until the system reboots, or until the time is set using an API that wasn't affected by the bug. That's why one of the recommended fixes is just to:
date -s "`date`"
Dumb question, but I haven't followed this thread that closely - been busy at work - but why not
$ service ntp stop
$ ntpdate
$ service ntp start
?
mark
On 07/02/2012 01:06 PM, m.roth@5-cent.us wrote:
Dumb question, but I haven't followed this thread that closely - been busy at work - but why not
$ service ntp stop
$ ntpdate
$ service ntp start
Today that might work, but would be slower than using "date". On Saturday, I think that would have triggered the bug.
On 7/2/2012 2:06 PM, m.roth@5-cent.us wrote:
Dumb question, but I haven't followed this thread that closely - been busy at work - but why not
$ service ntp stop
$ ntpdate
$ service ntp start
Because that results in a call to adjtimex(2), which is also the syscall used by ntpd, which in turn is affected by the kernel bug.
Calling date(1) instead uses the clock_settime(2) syscall, which isn't affected.
One isn't implemented in terms of the other, for reasons that should be obvious from the manpages.
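As a rough illustration of what the `date -s "`date`"` workaround amounts to according to the explanation above, here is a Python sketch that reads CLOCK_REALTIME and writes the same value straight back through `clock_settime`. The function name `reset_realtime_clock` and its injectable `get_time`/`set_time` parameters are hypothetical, added only so the logic can be tried without root; whether this alone clears the problem on an affected kernel is an assumption based on the thread, not something verified here.

```python
import time


def reset_realtime_clock(get_time=None, set_time=None):
    """Read the wall clock and immediately set it to the value just read.

    By default this uses time.clock_gettime/time.clock_settime on
    CLOCK_REALTIME (Unix only). Actually setting the clock requires root
    privileges (CAP_SYS_TIME on Linux).
    """
    if get_time is None:
        get_time = lambda: time.clock_gettime(time.CLOCK_REALTIME)
    if set_time is None:
        set_time = lambda t: time.clock_settime(time.CLOCK_REALTIME, t)

    now = get_time()
    set_time(now)  # goes through clock_settime(2), not adjtimex(2)
    return now


if __name__ == "__main__":
    # Dry run with a fake setter so nothing is changed on the system.
    reset_realtime_clock(set_time=lambda t: print("would set clock to", t))
```

As in the shell version, ntpd would be stopped before and restarted after the call so that it does not interfere while the clock is being set.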
On 7/2/2012 10:24 AM, Peter Eckel wrote:
... and leap seconds are not even scarce.
An event on an unpredictable schedule averaging 1.7 years since 1972 doesn't count as "scarce"?
That's the answer to Les's outrage, too, by the way. Might as well expect the JRE to have code to deal with cosmic ray damage that gets by ECC, too.
On Mon, Jul 2, 2012 at 5:52 PM, Warren Young warren@etr-usa.com wrote:
On 7/2/2012 10:24 AM, Peter Eckel wrote:
... and leap seconds are not even scarce.
An event on an unpredictable schedule averaging 1.7 years since 1972 doesn't count as "scarce"?
"Unpredictable" means you don't know something is coming in time to test for what to expect from its effect. I don't see how that applies here.
That's the answer to Les's outrage, too, by the way. Might as well expect the JRE to have code to deal with cosmic ray damage that gets by ECC, too.
Well, if there were a well-known, long-standing API for that, and the time it was going to happen were announced months ahead, then yes, I would expect it to be tested too. But, per the earlier discussion, it is a kernel bug, not the JRE. I'd sort of expect java builds to have unit tests for their APIs.
On Mon, 2 Jul 2012 18:11:25 +0200 Peter Eckel wrote:
I did not have any problems on CentOS 5.8, but on one CentOS 6.2 box running a Java application.
I had problems with Firefox on four computers running fully updated CentOS 6. Firefox was suddenly taking up a lot of CPU power showing nothing but a blank webpage, on all four computers. Closing and re-opening Firefox didn't fix it, logging out and back in didn't fix it, but rebooting the machines did.
Some google searching indicates to me that there was a problem with Firefox using a futex that got confused by the leap second, and getting into a loop.
You could have just done:
service ntpd stop; date -s "`date`"; service ntpd start
Fixed here without even stopping any JVM.
Would have loved to know that then ;-)
We have two 8-node clusters that run many Java applications, plus many Java applications on separate servers. I went nuts when all the servers running Java came to 100% CPU all at once!
The guy I spoke to at Red Hat GSS at about 00:45 UTC basically told me to reboot the server, which ended up rebooting our two clusters... Bad... But since computers all over the world crashed, our clients did understand that the problem was far lower than they believed...
On 7/1/2012 10:07 AM, Mogens Kjaer wrote:
On 07/01/2012 03:05 PM, Bob Hoffman wrote:
--------------------- Kernel Begin ------------------------
1 Time(s): Clock: inserting leap second 23:59:60 UTC
---------------------- Kernel End -------------------------
hee hee.
gotta love it....
My oracle database running on CentOS 6 didn't love it :-(
Some java processes were >100% CPU after the leap second was added.
Rebooting...
Mogens
Millions of dollars and years of lobbying by the RIAA and all it took was a leap second to sink The Pirate Bay http://www.zeropaid.com/news/101460/leap-second-crashes-the-pirate-bay/
From: bob bob@bobhoffman.com
To: CentOS mailing list centos@centos.org Sent: Sunday, July 1, 2012 9:55 AM Subject: Re: [CentOS] leap second
On 7/1/2012 10:07 AM, Mogens Kjaer wrote:
On 07/01/2012 03:05 PM, Bob Hoffman wrote:
--------------------- Kernel Begin ------------------------
1 Time(s): Clock: inserting leap second 23:59:60 UTC
---------------------- Kernel End -------------------------
hee hee.
gotta love it....
My oracle database running on CentOS 6 didn't love it :-(
Some java processes were >100% CPU after the leap second was added.
Rebooting...
Mogens
I had a VM crash, but it was on an old 2.4 kernel. I remember this happening last time with some older 2.4 systems.
Hi Mogens,
Some java processes were >100% CPU after the leap second was added.
Same problem here ... OpenNMS had 100% CPU and didn't do anything anymore.
Rebooting is not necessary, though. For me it worked to just set the time manually once, and everything was back to normal.
It doesn't strike me as a particularly good idea to insert a ':60' second - software that does proper sanity checks on date/time values is supposed to barf on that.
Peter.
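To make the sanity-check point concrete, here is a minimal Python sketch of a range check that tolerates a positive leap second instead of rejecting it. The function name `is_plausible_utc_time` is hypothetical, and the check assumes the values are in UTC, where an inserted leap second always appears as 23:59:60; in local time it shows up at other hours (for example 01:59:60 in a UTC+2 zone), which this sketch does not handle.

```python
def is_plausible_utc_time(hour: int, minute: int, second: int) -> bool:
    """Range-check a UTC time of day, allowing a leap second at 23:59:60."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        return False
    if 0 <= second <= 59:
        return True
    # A positive leap second is only ever inserted as 23:59:60 UTC.
    return second == 60 and hour == 23 and minute == 59


if __name__ == "__main__":
    for hms in [(12, 0, 0), (23, 59, 60), (12, 0, 60), (23, 59, 61)]:
        print(hms, "->", is_plausible_utc_time(*hms))
```

A check written as a plain 0..59 range would reject the second case, which is the kind of behaviour the message above is worried about.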
On 07/02/2012 10:21 AM, Peter Eckel wrote:
Hi Mogens,
Some java processes were >100% CPU after the leap second was added.
Same problem here ... OpenNMS had 100% CPU and didn't do anything anymore.
Rebooting is not necessary, though. For me it worked to just set the time manually once, and everything was back to normal.
It doesn't strike me as a particularly good idea to insert a ':60' second - software that does proper sanity checks on date/time values is supposed to barf on that.
Peter.
Hello.
Try executing:
date -s "`date -u`" && service ntpd restart