Drew Weaver wrote:
> -----Original Message-----
>> From: centos-bounces at centos.org [mailto:centos-bounces at centos.org] On
>> Behalf Of Johnny Hughes
>> Sent: Tuesday, February 13, 2007 6:30 AM
>> To: CentOS ML
>> Subject: Re: [CentOS] reboot long uptimes?
>>
>> On Tue, 2007-02-13 at 12:06 +0100, D Ivago wrote:
>>
>>> Hi,
>>>
>>> I was just wondering if I should reboot some servers that are running
>>> over 180 days?
>>>
>>> They are still stable and have no problems, also top shows no zombie
>>> processes or such, but maybe it's better for the hardware (like ext3
>>> disk checks f.e.) to reboot every six months...
>>>
>>
> About the only other reason I can think of is just to make sure it will
> restart when an emergency arises.
>
> For instance, fans, drives, etc.
>
> Some servers will balk if a fan doesn't run. Some servers balk if a hard
> drive isn't up to speed. These types of things only show up during a
> reboot. In the case of SCSI RAIDs with hot-swap drives, if a drive goes
> bad, some equipment will require some action for the boot to continue;
> some won't.
>
> For instance, considering RAID5 hot-swappable...
>
> If it's one drive on a RAID, no biggie; if it's two and you don't have
> hot spares, that is a bigger issue. 'Scheduled' reboots, like when a
> new kernel comes out and you have time to be there and do something, or
> have someone there if needed, are a good time to be sure the self-checks
> done by the server pass.
>
> Basically, the longer the time between reboots, the more likely an error
> will occur. And it would be really bad if three or four of your drives
> suddenly didn't have enough strength to get up to speed... better that
> it is only one, which can be easily swapped out.
>
> --
>
> That's not really statistically accurate.
>
> X event occurring or not occurring has no probable impact on whether
> random event Y occurs.
>
> Where X = rebooting, and Y = 'something funky'.
>
> Something funky could happen 5 minutes after the system starts, or 5
> years.
>
> -Drew
> _______________________________________________
> CentOS mailing list
> CentOS at centos.org
> http://lists.centos.org/mailman/listinfo/centos
>

Drew - I don't think you are correct about those events being
independent. 'X' isn't *rebooting*, it is the number of days *between*
reboots.

Let's define/confirm some terms:

"Drive error" - an error which doesn't kill the drive immediately, but
lurks until the next bounce.

"Reboot period" - the number of days between reboots.

If the probability of a drive developing an error in any one 24h period
is 'p', then the probability of a drive having an error within a reboot
period of 7 days (i.e. for a Windows Vista server ;) ) is approximately
7p (strictly 1-(1-p)^7, but for small p that is close to 7p). If the
reboot period is one year, the probability is approximately 365p.

One point John is making (I think!) is that, particularly with RAID
arrays, dealing with drive errors one at a time is easier than waiting
until there are multiple.

The question at hand: how does a long reboot period affect the
probability of more than one drive error surfacing at a boot event?

Statistically, I believe the following is true. If (as in the example
above) the probability of one drive failing is

  p(1 drive failing) = 365p

then, assuming the drives fail independently, the probability of two
drives failing is

  p(2 drives failing) = 365p * 365p, i.e. days^2 * p^2

Compare this to a 1-day reboot period (an MS Exchange box?):

  p(1 drive failing) = p
  p(2 drives failing) = p^2

So the probability of 'problems' (one drive failing) is linear with
respect to the reboot period (days times p), while the probability of
'disaster' (two drives failing) grows with the square of the reboot
period - 365^2, or about 133,000 times higher for a 365-day period than
for a 1-day one. Of course we hope 'p' is a very low number!
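The arithmetic above can be sketched in a few lines of Python. This is illustrative only: the daily error probability `p` is a made-up number, and the independence assumption is the same one made above, not a measured property of real drives.

```python
# Back-of-the-envelope sketch: probability that lurking drive errors
# have accumulated by reboot time, for different reboot periods.
# Assumes each drive independently develops an error with probability
# p on any given day (p here is hypothetical, not a real drive stat).

def p_error_by_reboot(p_daily, period_days):
    """Probability one drive has developed an error by reboot time.
    Exact form; for small p this is close to period_days * p_daily."""
    return 1 - (1 - p_daily) ** period_days

p = 1e-5  # hypothetical daily error probability

for period in (1, 7, 365):
    p_one = p_error_by_reboot(p, period)
    p_two = p_one ** 2  # two given drives both bad (independence assumed)
    print(f"{period:>3} days: one drive {p_one:.2e}, two drives {p_two:.2e}")

# Ratio of the 'two drives bad at boot' risk, 365-day vs 1-day period:
ratio = p_error_by_reboot(p, 365) ** 2 / p_error_by_reboot(p, 1) ** 2
print(f"ratio: {ratio:.0f}")  # close to 365^2 = 133,225
```

Note the ratio comes out slightly under 365^2 because the exact form 1-(1-p)^n grows a little slower than n*p; for small p the linear approximation used in the post is fine.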
These are the same calculations as for failure in RAID arrays - as
non-intuitive as it may be, more drives in your array means a *greater*
risk of some (any) drive failing - though you can of course mitigate the
*effect* of this easily with hot spares.

Food for thought: how do we mitigate the effect of multiple failures of
this type? Imagine a box that has been running for 10 years. We have to
expect that the box will not keep BIOS time through a cold reboot - not
a problem with NTP. What about the BIOS on the motherboard, video cards,
RAID controllers, etc. - can NVRAM be trusted to still be 'NV' after 10
years of being hot? Obviously the data is safe because of our meticulous
backups???

Regards,
MrKiwi.