On Fri, Sep 03, 2010 at 08:28:57AM +0300, kalinix wrote:
On Thu, 2010-09-02 at 16:39 -0400, Stephen Harris wrote:
You never upgrade the application? The database? Make config changes? Wow... to live in such a static world :-)
Most of our problems aren't OS related, they're app or config related... "change shared memory parameters for oracle", "start this at boot time", "add new network interface"... these all may prevent the server from booting cleanly and aren't the OS's fault. You don't want to find that out during a crisis scenario!
For this kind of issue there are test servers and test environments.
Which are fine for the test servers... but how do you verify the change was properly implemented in production?
Gee people, Linux ain't Windows, to get rebooted every day. Most of the problems you mentioned can be handled on the fly, except of course hw
The problem _isn't_ the "on the fly" changes. In fact it's because most of this stuff can be done on the fly that implementation issues don't get noticed until reboot time.
Here's a great example that I came across 10 years ago...
The Sybase rc script would su to the sybase user to pick up the required environment variables, then start all the databases. Fine, no problem. Except sometime in the past 3 years some new Sybase DBA decided to modify the .profile used by the sybase user so that it would ask which version of Sybase to use. So when the DBAs su'd to sybase they'd be prompted and get their variables set. Indeed the DBAs would source this file into their own .profile and they were all happy. This mistake went unnoticed for years because the machines didn't reboot... until one day there was a failure requiring a reboot... and the machine didn't complete booting. Why? Because the rc script, running on the console, was waiting for someone to select the Sybase version to use.
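A rough way to catch this class of mistake without a reboot is to run the startup command with nobody at the keyboard and see whether it ever finishes. The following is only a sketch under assumptions: the function name waits_for_input and the timeout values are made up for illustration, and a timeout cannot tell "waiting for input" apart from "just slow", so treat a hit as something to investigate rather than proof.

```python
import subprocess


def waits_for_input(command, timeout=5.0):
    """Return True if `command` is still running after `timeout` seconds.

    The child's stdin is an open pipe that never receives any data, so a
    script that prompts (like the modified .profile) will sit there
    forever. `waits_for_input` is a hypothetical helper name.
    """
    proc = subprocess.Popen(
        command,
        stdin=subprocess.PIPE,        # open, but we never write to it
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        proc.wait(timeout=timeout)
        return False                  # finished on its own
    except subprocess.TimeoutExpired:
        proc.kill()                   # still blocked (or merely slow)
        proc.wait()
        return True
    finally:
        if proc.stdin:
            proc.stdin.close()


if __name__ == "__main__":
    # Stand-in for a profile that asks a question before continuing.
    print(waits_for_input(["sh", "-c", "read answer"], timeout=1.0))
    # Stand-in for a profile that only sets variables.
    print(waits_for_input(["sh", "-c", "true"], timeout=5.0))
```

In the real case the command would be whatever the rc script runs (the su to the sybase user), which needs root and is not shown here.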
servers. Did you people hear about change management in the first place? What kind of enterprise environment is that where changes are made without any change process? What if such an update breaks the core
I'm glad you have perfect people who never make mistakes. I wish we did at my place! No amount of paperwork (and, wow, we have lots of that!) will prevent mistakes :-(
Anyway, what I'm worried about is seeing the "Windows philosophy" (rebooting to clean up a memory leak instead of killing the process that generates the leak, rebooting in order to update your applications instead of restarting only that particular application, and so on) becoming dominant in the Linux world. And this is not good.
You're not seeing this. You're seeing contingency planning and verification that services _will_ restart after an outage with minimum disruption.
Prior to this policy my server had been up 1300+ days and was stable. It didn't require patching because I'd removed all unnecessary packages and none of the security alerts had any impact on my machine and we hadn't encountered any OS bugs needing fixing.
I've been a Unix "geek" for 20+ years now; I don't like a 90 day reboot policy; I just pointed out what we have, and a rationale for it. However I don't get to tell the CIO of a Fortune 100 (Fortune 50; Fortune 10?) company that his policy is... questionable :-)