Drew Weaver wrote:
-----Original Message-----
From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On Behalf Of Johnny Hughes
Sent: Tuesday, February 13, 2007 6:30 AM
To: CentOS ML
Subject: Re: [CentOS] reboot long uptimes?
On Tue, 2007-02-13 at 12:06 +0100, D Ivago wrote:
Hi,
I was just wondering if I should reboot some servers that have been running for over 180 days?
They are still stable and have no problems, and top shows no zombie processes or anything like that, but maybe it's better for the hardware (like the ext3 disk checks, for example) to reboot every six months...
About the only other reason I can think of is just to make sure it will restart when an emergency arises.
For instance, fans, drives, etc.....
Some servers will balk if a fan doesn't spin up. Some servers balk if a hard drive isn't up to speed. These types of things only show up during a reboot. In the case of SCSI RAIDs and hot-swap drives... if a drive goes bad, some equipment will require some action for the boot to continue... some won't.
For instance, consider a RAID5 with hot-swappable drives...
If it's one drive in a RAID, no biggie... if it's two and you don't have hot spares, that is a bigger issue. A 'scheduled' reboot, like when a new kernel comes out and you have time to be there (or have someone there if needed), is a good time to make sure the self-checks done by the server pass.
Basically, the longer the time between reboots, the more likely it is that an error will occur. And it would be really bad if three or four of your drives suddenly didn't have enough strength to get up to speed... better that it's only one, which can easily be swapped out.
--
That's not really statistically accurate.
Whether or not event X occurs has no impact on the probability that random event Y occurs.
Where X = rebooting, and Y = 'something funky'.
Something funky could happen 5 minutes after the system starts, or 5 years later.
-Drew
Drew - I don't think you are correct about those events being independent.
'X' isn't *rebooting*, it is the number of days *between* reboots.
Let me define/confirm some of the terms: "drive error" - an error which doesn't kill the drive immediately, but lurks until the next bounce. "Reboot period" - the number of days between reboots.
If the probability of a drive failing in any one 24h period is 'p', then the probability of a drive failing in a reboot period of 7 days (ie for a Windows Vista server ;) ) is roughly 7p (strictly 1-(1-p)^7, but for a small 'p' the linear approximation is fine). If the reboot period is one year, the probability is roughly 365p.
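For anyone who wants to sanity-check that, here is a rough Python sketch (the value of 'p' is made up, purely for illustration) comparing the exact figure 1-(1-p)^days with the linear approximation:

p = 0.0001          # assumed daily failure probability for one drive (made up)
for days in (1, 7, 365):
    exact = 1 - (1 - p) ** days    # exact: 1 - P(no failure on any of the days)
    approx = days * p              # the linear approximation used above
    print(f"{days:>3} days: exact={exact:.6f}  linear approx={approx:.6f}")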
One point Johnny is making (I think!) is that, particularly with RAID arrays, dealing with drive errors one at a time is easier than waiting until there are multiple.
The point in question: how does a long reboot period contribute to the probability of more than one drive error being present at any boot event?
Statistically I believe the following is true. If (as in the above example) the probability of one drive failing is

  p(1 drive failing) = 365p

then, assuming the failures are independent, the probability of two drives failing is

  p(2 drives failing) = 365p * 365p, or days^2 * p^2

Compare this to a 1-day reboot period (ie an MS Exchange box?):

  p(1 drive failing)  = p
  p(2 drives failing) = p^2
So the probability of 'problems' (ie one drive failing) is linear with respect to the reboot period (days times p). The probability of 'disaster' (ie two drives failing) grows with the square of the reboot period, so it is massively higher with long reboot periods - roughly 133,000 times higher (365^2 = 133,225) for 365 days than for 1 day. Of course 'p' is a very low number, we hope!
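If you want to play with the numbers yourself, here is a rough Python sketch of the 1-day vs 365-day comparison (same made-up 'p' as above, using the same linear approximation, so treat it as illustrative only):

p = 0.0001   # assumed daily failure probability for one drive (made up)
for days in (1, 365):
    one = days * p           # ~ p(1 particular drive failing in the period)
    two = (days * p) ** 2    # ~ p(2 particular drives both failing)
    print(f"{days:>3} days: one drive ~ {one:.2e}, two drives ~ {two:.2e}")
print("ratio of the two-drive figures:", (365 * p) ** 2 / p ** 2)   # 365^2 = 133,225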
These are the same calculations as for failure in RAID arrays - as counter-intuitive as it may be, more drives in your array means a *greater* risk of some (any) drive failing - however you can of course easily mitigate the *effect* of this with hot spares.
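A quick sketch of that point too - the per-year failure probability here is just an assumed, illustrative number:

p_year = 0.03    # assumed chance that a single drive fails within one year (made up)
for n in (1, 4, 8, 16):
    p_any = 1 - (1 - p_year) ** n    # P(at least one of the n drives fails)
    print(f"{n:>2} drives: P(any drive fails in a year) = {p_any:.3f}")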
Food for thought: how do we mitigate the effect of multiple failures of this type? Imagine the situation where a box has been running for 10 years. We have to expect that the box will not keep BIOS time through a cold reboot - not a problem with ntp. What about the BIOS on the mobo/video cards/RAID controllers - can NVRAM be trusted to be 'NV' after 10 years of being hot? Obviously the data is safe because of our meticulous backups???
Regards,
MrKiwi.