[CentOS] reboot long uptimes?

MrKiwi mrkiwi at gmail.com
Wed Feb 14 21:59:48 UTC 2007


Drew Weaver wrote:
>  -----Original Message-----
>> From: centos-bounces at centos.org [mailto:centos-bounces at centos.org] On 
>> Behalf Of Johnny Hughes
>> Sent: Tuesday, February 13, 2007 6:30 AM
>> To: CentOS ML
>> Subject: Re: [CentOS] reboot long uptimes?
>>
>> On Tue, 2007-02-13 at 12:06 +0100, D Ivago wrote:
>>   
>>> Hi,
>>>
>>> I was just wondering if I should reboot some servers that are running
>>> over 180 days?
>>>
>>> They are still stable and have no problems, also top shows no zombie
>>> processes or such, but maybe it's better for the hardware (like ext3
>>> disk checks f.e.) to reboot every six months...
>>>
>>
> About the only other reason I can think of is just to make sure it will
> restart when an emergency arises.
> 
> For instance, fans, drives, etc.....
> 
> Some servers will balk if a fan doesn't run. Some servers balk if a hard
> drive isn't up to speed. These types of things only show up during a
> reboot. In the case of SCSI RAIDs with hot swap drives... if a drive goes
> bad, some equipment will require some action for the boot up to
> continue.. some won't.
> 
> For instance, considering RAID5 hot swappable....
> 
> If it's one drive on a RAID, no biggie.. if it's two and you don't have
> hot spares.. that is a bigger issue. 'Scheduled' reboots, like when a
> new kernel comes out and you have time to be there and do something, or
> have someone there if needed... are a good time to be sure the self
> checks done by the server pass.
> 
> Basically, the longer the time between reboots, the more likely an error
> will occur. And it would be really bad if three or four of your drives
> suddenly didn't have enough strength to get up to speed... better that
> it is only one, which can be easily swapped out.
> 
> --
> 
> That's not really statistically accurate.
> 
> X event occurring or not occurring has no probable impact on whether
> random event Y occurs.
> 
> Where X = rebooting, and Y = 'something funky'.
> 
> Something funky could happen 5 minutes after the system starts, or 5
> years.
> 
> -Drew
> _______________________________________________
> CentOS mailing list
> CentOS at centos.org
> http://lists.centos.org/mailman/listinfo/centos
> 
Drew - I don't think you are correct about those events 
being independent.

'X' isn't *rebooting*, it is the number of days *between* 
reboots.

Let me define/confirm some of the terms:
"Drive error" - an error which doesn't kill the drive 
immediately, but lurks until the next bounce.
"Reboot period" - the number of days between each reboot.

If the probability of a drive failing in any one 24h period 
is 'p', then the probability of a drive failing somewhere in 
a reboot period of 7 days (ie for a Windows Vista server ;) ) 
is roughly 7p, for small p. If the reboot period is one year, 
the probability is roughly 365p.
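
Strictly the exact figure is 1 - (1-p)^days; days*p is the 
usual small-p approximation. A quick Python sanity check, 
using a purely made-up daily failure probability:

    # P(at least one drive error during a reboot period), assuming an
    # independent failure probability p in each 24h period.
    p = 0.0001                     # hypothetical daily failure probability
    for days in (1, 7, 365):
        exact = 1 - (1 - p) ** days
        approx = days * p
        print(days, exact, approx)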

One point John is making (I think!) is that, particularly 
with RAID arrays, dealing with drive errors one at a time is 
easier than waiting until there are multiple.

The point in question: how does a long reboot period 
contribute to the probability of more than one drive error 
occurring at any boot event?

Statistically, I believe the following is true.
If (as in the above example) the probability of 1 drive 
failing is:
P(1 drive failing) = 365p
then, assuming the failures are independent, the probability 
of 2 drives failing is:
P(2 drives failing) = 365p * 365p
or
days^2 * p^2

Compare this to a 1-day reboot period (ie an MS Exchange box?):
P(1 drive failing) = p
P(2 drives failing) = p^2

So the probability of 'problems' (ie one drive failing) is 
linear with respect to reboot period (days times p).
The probability of 'disaster' (ie two drives failing) grows 
with the square of the reboot period, so it is massively 
higher with long reboot periods - about 133,000 times 
(365^2) higher for 365 days than for 1 day.
Of course 'p' is a very low number, we hope!
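
To put rough numbers on that, a minimal sketch (the daily 
failure probability below is made up purely for illustration):

    # Compare P(one drive error) and P(two independent drive errors)
    # for a 1-day and a 365-day reboot period, using the small-p
    # approximation days*p from above.
    p = 0.0001                       # hypothetical daily failure probability
    for days in (1, 365):
        one = days * p               # P(a given drive has an error)
        two = (days * p) ** 2        # P(two given drives both have errors)
        print(days, one, two)
    print(365 ** 2)                  # ratio of the two 'disaster' figures: 133225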

These are the same calcs as for failure in RAID arrays - as 
non-intuitive as it may be, more drives in your array means 
a *greater* risk of some (any) drive failing - however you 
can of course mitigate the *effect* of this easily with 
hot-spares.
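
The same back-of-the-envelope check works for array size 
(again, the per-drive failure probability is a made-up number):

    # P(at least one of N drives fails in a given period) grows with N,
    # even though each individual drive is no less reliable.
    p = 0.01                         # hypothetical per-drive failure probability
    for n_drives in (1, 4, 8, 16):
        print(n_drives, 1 - (1 - p) ** n_drives)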

Food-for-thought?
How do we mitigate the effect of multiple failures of this 
type? Imagine the situation where a box has been running for 
10 years. We have to expect that the box will not keep BIOS 
time during a cold reboot - not a problem with NTP. What 
about the BIOS on the mobo/video cards/RAID controllers etc 
- can NVRAM be trusted to be 'NV' after 10 years of being hot?
Obviously data is safe because of our meticulous backups???

Regards,

MrKiwi.






