Drew Weaver wrote:
-----Original Message-----
From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On Behalf Of Johnny Hughes
Sent: Tuesday, February 13, 2007 6:30 AM
To: CentOS ML
Subject: Re: [CentOS] reboot long uptimes?
On Tue, 2007-02-13 at 12:06 +0100, D Ivago wrote:
Hi,
I was just wondering if I should reboot some servers that have been running for over 180 days?
They are still stable and have no problems, and top shows no zombie processes or anything like that, but maybe it's better for the hardware (like the ext3 disk checks, for example) to reboot every six months...
About the only other reason I can think of is just to make sure it will restart when an emergency arises.
For instance, fans, drives, etc.....
Some servers will balk if a fan doesn't spin up. Some servers balk if a hard drive isn't up to speed. These types of things only show up during a reboot. In the case of SCSI RAIDs and hot-swap drives... if a drive goes bad, some equipment will require some action for the boot to continue... some won't.
For instance, consider a RAID5 with hot-swappable drives...
If it's one drive in a RAID, no biggie... if it's two and you don't have hot spares, that is a bigger issue. A 'scheduled' reboot, like when a new kernel comes out and you have time to be there (or have someone there if needed), is a good time to make sure the self-checks done by the server pass.
Basically, the longer the time between reboots, the more likely it is that an error will occur. And it would be really bad if three or four of your drives suddenly didn't have enough strength to get up to speed... better that it's only one, which can easily be swapped out.
--
That's not really statistically accurate.
Whether or not event X occurs has no impact on the probability that random event Y occurs.
Where X = rebooting, and Y = 'something funky'.
Something funky could happen 5 minutes after the system starts, or 5 years later.
-Drew
Drew - I don't think you are correct about those events being independent.
'X' isn't *rebooting*, it is the number of days *between* reboots.
Let me define/confirm some of the terms: "drive error" - an error which doesn't kill the drive immediately, but lurks until the next bounce. "Reboot period" - the number of days between reboots.
If the probability of a drive failing in any one 24h period is 'p', then the probability of a drive failing in a reboot period of 7 days (ie for a Windows Vista server ;) ) is roughly 7p (strictly 1-(1-p)^7, but for a small 'p' the linear approximation is fine). If the reboot period is one year, the probability is roughly 365p.
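For anyone who wants to sanity-check that, here is a rough Python sketch (the value of 'p' is made up, purely for illustration) comparing the exact figure 1-(1-p)^days with the linear approximation:

p = 0.0001          # assumed daily failure probability for one drive (made up)
for days in (1, 7, 365):
    exact = 1 - (1 - p) ** days    # exact: 1 - P(no failure on any of the days)
    approx = days * p              # the linear approximation used above
    print(f"{days:>3} days: exact={exact:.6f}  linear approx={approx:.6f}")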
One point Johnny is making (I think!) is that, particularly with RAID arrays, dealing with drive errors one at a time is easier than waiting until there are multiple.
The point in question: how does a long reboot period contribute to the probability of more than one drive error being present at any boot event?
Statistically I believe the following is true. If (as in the above example) the probability of one drive failing is

  p(1 drive failing) = 365p

then, assuming the failures are independent, the probability of two drives failing is

  p(2 drives failing) = 365p * 365p, or days^2 * p^2

Compare this to a 1-day reboot period (ie an MS Exchange box?):

  p(1 drive failing)  = p
  p(2 drives failing) = p^2
So the probability of 'problems' (ie one drive failing) is linear with respect to the reboot period (days times p). The probability of 'disaster' (ie two drives failing) grows with the square of the reboot period, so it is massively higher with long reboot periods - roughly 133,000 times higher (365^2 = 133,225) for 365 days than for 1 day. Of course 'p' is a very low number, we hope!
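If you want to play with the numbers yourself, here is a rough Python sketch of the 1-day vs 365-day comparison (same made-up 'p' as above, using the same linear approximation, so treat it as illustrative only):

p = 0.0001   # assumed daily failure probability for one drive (made up)
for days in (1, 365):
    one = days * p           # ~ p(1 particular drive failing in the period)
    two = (days * p) ** 2    # ~ p(2 particular drives both failing)
    print(f"{days:>3} days: one drive ~ {one:.2e}, two drives ~ {two:.2e}")
print("ratio of the two-drive figures:", (365 * p) ** 2 / p ** 2)   # 365^2 = 133,225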
These are the same calculations as for failure in RAID arrays - as counter-intuitive as it may be, more drives in your array means a *greater* risk of some (any) drive failing - however you can of course easily mitigate the *effect* of this with hot spares.
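A quick sketch of that point too - the per-year failure probability here is just an assumed, illustrative number:

p_year = 0.03    # assumed chance that a single drive fails within one year (made up)
for n in (1, 4, 8, 16):
    p_any = 1 - (1 - p_year) ** n    # P(at least one of the n drives fails)
    print(f"{n:>2} drives: P(any drive fails in a year) = {p_any:.3f}")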
Food for thought: how do we mitigate the effect of multiple failures of this type? Imagine the situation where a box has been running for 10 years. We have to expect that the box will not keep BIOS time through a cold reboot - not a problem with ntp. What about the BIOS on the mobo/video cards/RAID controllers - can NVRAM be trusted to be 'NV' after 10 years of being hot? Obviously the data is safe because of our meticulous backups???
Regards,
MrKiwi.