[CentOS] how long to reboot server ?

Fri Sep 3 18:56:50 UTC 2010

On Fri, 2010-09-03 at 06:59 -0400, Stephen Harris wrote:
> On Fri, Sep 03, 2010 at 08:28:57AM +0300, kalinix wrote:
> > On Thu, 2010-09-02 at 16:39 -0400, Stephen Harris wrote:
> 
> > > You never upgrade the application?  The database?  Make config changes?
> > > Wow... to live in such a static world :-)
> > > 
> > > Most of our problems aren't OS related, they're app or config
> > > related... "change shared memory parameters for oracle", "start this at
> > > boot time", "add new network interface"...  these all may prevent the
> > > server from booting cleanly and aren't the OS's fault.  You don't want to
> > > find that out during a crisis scenario!
> 
> > For this kind of issue there are testing servers and a testing
> > environment.
> 
> Which are fine for the testing servers...but how do you verify the change
> was properly implemented into production?
> 

You're kidding, right? You mean you restart the production servers just
to test whether your application works??? IMHO this should be part of
the testing scenario.

> > Gee people, Linux ain't windows, to get rebooted every day. Most of the
> > problems you mentioned can be set on the fly, except of course hw
> 
> The problem _isn't_ the "on the fly" changes.  In fact it's because most
> of this stuff can be done on the fly that implementation issues don't get
> noticed until reboot time.
> 
> Here's a great example that I came across 10 years ago...
> 
> The sybase rc script would su to the sybase user to pick up the required
> environment variables, then start all the databases.  Fine, no problem.
> Except sometime in the past 3 years some new Sybase DBA decided to modify
> the .profile used by the sybase user so that it would ask what version
> of sybase to use.  So when the DBAs su'd to sybase they'd get their
> variables set.  Indeed the DBAs would source this file into their own
> .profile and they were all happy.  This mistake went unnoticed for years
> because the machines didn't reboot... until one day there was a failure
> requiring a reboot... and the machine didn't complete booting.  Why?
> Because the console was waiting for someone to select the sybase version
> to use.
> 

Typical example of a BOFH. Sorry, BDBAFH :). In this case, the sysadmin
should at least be consulted (if not asked for permission) before such
a change is made on a production server. If, let's say, a web
application designer decides one day that the application needs to run
PHP with low security settings, does he just lower the security of the
whole system without asking anyone if he can do that?
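
The Sybase failure quoted above comes down to a startup command that
waits for input which never arrives at boot time. Below is a minimal
Python sketch of a check that may catch this class of problem without a
reboot. It assumes the startup command can be run safely on its own; the
helper name exits_without_input and the 5 second default timeout are
hypothetical, not something from this thread.

```python
import subprocess
import sys


def exits_without_input(cmd, timeout_s=5.0):
    """Return True if cmd finishes by itself while nobody types anything.

    stdin is an open pipe that is never written to, which imitates a
    console with no operator: a command that prompts will block on it.
    """
    proc = subprocess.Popen(
        cmd,
        stdin=subprocess.PIPE,        # kept open, never written to
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        proc.wait(timeout=timeout_s)  # wait() does not close stdin
        return True
    except subprocess.TimeoutExpired:
        # Still running after the timeout: treat it as stuck on a prompt.
        proc.kill()
        proc.wait()
        return False
    finally:
        if proc.stdin:
            proc.stdin.close()


if __name__ == "__main__":
    # Stand-ins for a real startup command (assumption): one child that
    # exits immediately and one that blocks reading from stdin.
    quick = [sys.executable, "-c", "pass"]
    prompting = [sys.executable, "-c", "import sys; sys.stdin.readline()"]
    print("quick command exits:", exits_without_input(quick))
    print("prompting command exits:", exits_without_input(prompting, 1.0))
```

A slow but healthy command would also time out here, so the timeout has
to be chosen per command, and a command that exits with an error still
counts as "exits". The check only says whether the command can finish
with no operator present.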

> > servers. Have you people heard of change management in the first
> > place? What kind of enterprise environment is that where changes are
> > made without any change process? What if such an update breaks the core
> 
> I'm glad you have perfect people who never make mistakes.  I wish we did
> at my place! No amount of paperwork (and, wow, we have lots of that!)
> will prevent mistakes :-(

It's not about paperwork. It's about the change process, which should
be very well implemented and tested, re-tested and tested again. And
when you think it's done, you should re-test once more.
I remember once, on a w2k3 box (alas), when the first SP had just come
out. We had a development team which had deployed a Java portal (don't
ask). Anyway, I asked them to test whether we could deploy the SP on the
production server, as it had several important fixes. Of course they
said it was tested and it was OK to deploy. Which I did. And of course
the portal was scrambled. In the end, it turned out they hadn't tested
the SP.

> 
> > Anyway, what I'm worried about is seeing the "windows
> > philosophy" (rebooting to clean up a memory leak instead of killing
> > the process which generates that leak, rebooting in order to update
> > your applications instead of restarting only that particular
> > application, and so on)
> > becoming dominant in the linux world. And this is not good.
> 
> You're not seeing this.  You're seeing contingency planning and
> verification that services _will_ restart after an outage with minimum
> disruption.

> Prior to this policy my server had been up 1300+ days and was stable.  It
> didn't require patching because I'd removed all unnecessary packages and
> none of the security alerts had any impact on my machine and we hadn't
> encountered any OS bugs needing fixing.
> 
> I've been a Unix "geek" for 20+ years now; I don't like a 90 day reboot
> policy; I just pointed out what we have, and a rationale for it.
> However I don't get to tell the CIO of a fortune 100 (fortune 50;
> fortune 10?) company that his policy is... questionable :-)
> 

I know exactly what you mean. These days managers only look at how
many colors their Excel sheets have. Anyway, I stand up for my
principles by proving they are right. One of them is: never reboot a
Linux box unless you change the kernel or the hardware (and never both
at the same time).

Memory leaks, DBA issues, testing: all of this should be fixed in either
the development or the testing environment, and deployed to production
only after extensive testing.

-- 

Calin

Key fingerprint = 37B8 0DA5 9B2A 8554 FB2B 4145 5DC1 15DD A3EF E857

=================================================
standards, n.: The principles we use to reject other people's code.