[CentOS] reboot long uptimes?

Wed Feb 14 21:59:48 UTC 2007
MrKiwi <mrkiwi at gmail.com>

Drew Weaver wrote:
>  -----Original Message-----
>> From: centos-bounces at centos.org [mailto:centos-bounces at centos.org] On 
>> Behalf Of Johnny Hughes
>> Sent: Tuesday, February 13, 2007 6:30 AM
>> To: CentOS ML
>> Subject: Re: [CentOS] reboot long uptimes?
>>
>> On Tue, 2007-02-13 at 12:06 +0100, D Ivago wrote:
>>   
>>> Hi,
>>>
>>> I was just wondering if I should reboot some servers that are running
> 
>>> over 180 days?
>>>
>>> They are still stable and have no problems, also top shows no zombie 
>>> processes or  such, but maybe it's better for the hardware (like ext3
> 
>>> disk checks f.e.) to reboot  every six months...
>>>     
>>   
> About the only other reason I can think of is just to make sure it will
> restart when an emergency arises.
> 
> For instance, fans, drives, etc.....
> 
> Some servers will balk if a fan doesn't run. Some servers balk if a hard
> drive isn't up to speed. These types of things only show up during a
> reboot. In the case of scsi raids, hot swap drives... if a drive goes
> bad some equipment will require some action for the boot up to
> continue.. some don't.
> 
> For instance, considering RAID5 hot swappable....
> 
> If it's one drive on a raid, no biggie.. if it's two and you don't have
> hot spares.. that is a bigger issue. 'Scheduled' reboots, like when a
> new kernel comes out and you have time to be there and do something or
> have someone there if needed... that is a good time to make sure the
> self-checks done by the server pass.
> 
> Basically, the longer the time before reboots, the more likely an error
> will occur. And it would be really bad if three or four of your drives
> suddenly didn't have enough strength to get up to speed... better that
> it is only one which can be easily swapped out.
> 
> --
> 
> That's not really statistically accurate.
> 
> X event occurring or not occurring has no probable impact on whether
> random event Y occurs.
> 
> Where X = rebooting, and y = 'something funky'.
> 
> Something funky could happen 5 minutes after the system starts, or 5
> years.
> 
> -Drew
> _______________________________________________
> CentOS mailing list
> CentOS at centos.org
> http://lists.centos.org/mailman/listinfo/centos
> 
Drew - I don't think you are correct about those events 
being independent.

'X' isn't *rebooting*; it is the number of days *between* 
reboots.

To define/confirm some of the terms:
"Drive error" - an error which doesn't kill the drive 
immediately, but lurks until the next bounce.
"Reboot period" - the number of days between each reboot.

If the probability of a drive failing in any one 24h period 
is 'p', then the probability of a drive failing in a reboot 
period of 7 days (i.e. for a Windows Vista server ;) ) is 
roughly 7p, provided 'p' is small. If the reboot period is 
one year, the probability is roughly 365p.
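
As a quick sanity check, the exact probability of at least 
one failure in d days is 1-(1-p)^d, which for small p stays 
close to the linear d*p used above. A minimal Python sketch, 
assuming a purely hypothetical daily failure probability 
p = 0.0001:

# Sketch: exact vs. linear approximation for "at least one
# failure in d days". p is a made-up per-drive, per-day
# failure probability, not a real drive spec.
p = 0.0001
for d in (1, 7, 365):
    exact = 1 - (1 - p) ** d   # exact probability of >=1 failure
    approx = d * p             # linear approximation used above
    print(d, round(exact, 6), round(approx, 6))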

One point John is making (I think!) is that, particularly 
with RAID arrays, dealing with drive errors one at a time is 
easier than waiting until there are multiple.

The point in question: how does a long reboot period 
contribute to the probability of more than one drive error 
occurring at any boot event?

Statistically, I believe the following is true.
If (as in the above example) the probability of 1 drive 
failing is

p(1 drive failing) = 365p

then, assuming the failures are independent, the probability 
of 2 drives failing is

p(2 drives failing) = 365p * 365p
                    = days^2 * p^2

Compare this to a 1-day reboot period (i.e. an MS Exchange box?):

p(1 drive failing)  = p
p(2 drives failing) = p^2

So the probability of a 'problem' (i.e. one drive failing) 
is linear with respect to the reboot period (days times p). 
The probability of a 'disaster' (i.e. two drives failing) 
grows with the square of the reboot period - 365^2, or 
roughly 133,000 times higher for 365 days than for 1 day.
Of course, we hope 'p' is a very low number!
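
For anyone who wants to check the arithmetic, a minimal 
Python sketch of the comparison (p is again a purely 
hypothetical per-drive, per-day failure probability):

# Sketch of the single- vs. double-failure comparison above.
p = 0.0001
for days in (1, 365):
    p_one = days * p           # ~ probability of one drive failing
    p_two = (days * p) ** 2    # ~ probability of two drives failing
    print(days, p_one, p_two)

# Ratio of the two-drive risk, 365-day period vs. 1-day period:
print((365 * p) ** 2 / p ** 2)  # 365^2 = 133225, i.e. ~133,000x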

These are the same calculations as for failure in RAID 
arrays: as non-intuitive as it may be, more drives in your 
array means a *greater* risk of a (any) drive failure, as 
the sketch below shows; however, you can of course easily 
mitigate the *effect* of this with hot-spares.
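
A minimal sketch of that effect, again assuming a purely 
hypothetical per-drive, per-day failure probability p:

# Sketch: probability that at least one of n drives fails in a day.
p = 0.0001
for n in (1, 4, 8, 16):
    p_any = 1 - (1 - p) ** n   # at least one of the n drives fails
    print(n, p_any)
# p_any grows with n: more drives in the array, more chance
# of some drive failing.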

Food for thought:
How do we mitigate the effect of multiple failures of this 
type? Imagine the situation where a box has been running for 
10 years. We have to expect that the box will not keep BIOS 
time across a cold reboot - not a problem with NTP. What 
about the BIOS on the mobo/video cards/RAID controller etc. - 
can NVRAM be trusted to be 'NV' after 10 years of being hot?
Obviously the data is safe because of our meticulous backups???

Regards,

MrKiwi.