On Fri, 2010-09-03 at 06:59 -0400, Stephen Harris wrote:
On Fri, Sep 03, 2010 at 08:28:57AM +0300, kalinix wrote:
On Thu, 2010-09-02 at 16:39 -0400, Stephen Harris wrote:
You never upgrade the application? The database? Make config changes? Wow... to live in such a static world :-)
Most of our problems aren't OS related, they're app or config related... "change shared memory parameters for oracle", "start this at boot time", "add new network interface"... these all may prevent the server from booting cleanly and aren't the OS's fault. You don't want to find that out during a crisis scenario!
For this kind of issue there are testing servers and a testing environment.
Which are fine for the testing servers...but how do you verify the change was properly implemented into production?
You're kidding right? You mean you restart the production servers just to test if your application works??? IMHO this should be part of the testing scenario.
Gee people, Linux ain't Windows, to get rebooted every day. Most of the problems you mentioned can be set on the fly, except of course hw
The problem _isn't_ the "on the fly" changes. In fact it's because most of this stuff can be done on the fly that implementation issues don't get noticed until reboot time.
Here's a great example that I came across 10 years ago...
The sybase rc script would su to the sybase user to pick up the required environment variables, then start all the databases. Fine, no problem. Except sometime in the past 3 years some new Sybase DBA decided to modify the .profile used by the sybase user so that it would ask what version of sybase to use. So when the DBAs su'd to sybase they'd get their variables set. Indeed the DBAs would source this file into their own .profile and they were all happy. This mistake went unnoticed for years because the machines didn't reboot... until one day there was a failure requiring a reboot... and the machine didn't complete booting. Why? Because the console was waiting for someone to select the sybase version to use.
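For illustration only, here is a minimal sketch of the kind of guard that would have avoided the hang: prompt for a version only when a real terminal is attached, and otherwise fall back to a default. This is not the original .profile (that was shell, where the equivalent check would be something like a `[ -t 0 ]` test); the function name `pick_sybase_version` and the default version string are hypothetical.

```python
import io
import sys

# Hypothetical default; the original thread does not say which versions existed.
DEFAULT_VERSION = "12.5"


def pick_sybase_version(stream=None, default=DEFAULT_VERSION):
    """Return the Sybase version to use, prompting only on a real terminal."""
    stream = sys.stdin if stream is None else stream
    # isatty() is False for pipes and files, which is what a non-interactive
    # caller (such as a boot-time rc script) would normally present.
    if not stream.isatty():
        return default
    sys.stderr.write(f"Sybase version [{default}]: ")
    sys.stderr.flush()
    answer = stream.readline().strip()
    # An empty answer (just pressing Enter) also falls back to the default.
    return answer or default


if __name__ == "__main__":
    # A StringIO is not a terminal, so no prompt is shown and the input is ignored.
    print(pick_sybase_version(io.StringIO("15.0\n")))
```

With a check like this, the DBAs still get their prompt when they su interactively, while a non-interactive start at boot takes the default instead of waiting on the console.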
Typical example of BOFH. Sorry, BDBAFH :). In this case, the sysadmin should at least have been consulted (if not asked for permission) before such a change was made on a production server. If, let's say, a web application designer one day decides that the application needs to run PHP with low security settings, does he just lower the security of the whole system without asking anyone whether he can do that?
servers. Did you people hear about change management in the first place? What kind of enterprise environment is that where changes are made without any change process? What if such an update breaks the core
I'm glad you have perfect people who never make mistakes. I wish we did at my place! No amount of paperwork (and, wow, we have lots of that!) will prevent mistakes :-(
It's not about paperwork. It's about the change process, which should be very well implemented and tested, re-tested and tested again. And when you think it's done, you should re-test once more. I remember once, on a w2k3 server (alas), when the first SP had just come out. We had a development team which had deployed a Java portal (don't ask). Anyway, I asked them to test whether we could deploy the SP on the production server, as it had several important fixes. Of course they said it was tested and it was ok to deploy it. Which I did. And of course the portal was scrambled. In the end, it turned out they hadn't tested the SP.
Anyway, what I'm worried about is seeing the "Windows philosophy" (rebooting to clean up a memory leak instead of killing the process which generates that leak, rebooting in order to update your applications instead of restarting only that particular application, and so on) becoming dominant in the Linux world. And this is not good.
You're not seeing this. You're seeing contingency planning and verification that services _will_ restart after an outage with minimum disruption.
Prior to this policy my server had been up 1300+ days and was stable. It didn't require patching because I'd removed all unnecessary packages and none of the security alerts had any impact on my machine and we hadn't encountered any OS bugs needing fixing.
I've been a Unix "geek" for 20+ years now; I don't like a 90 day reboot policy; I just pointed out what we have, and a rationale for it. However I don't get to tell the CIO of a fortune 100 (fortune 50; fortune 10?) company that his policy is... questionable :-)
I know exactly what you mean. These days managers only look at how many colors their Excel sheets have. Anyway, I stand up for my principles, proving they are right. One of them is: never reboot a Linux machine unless you change the kernel or the hardware (never both at the same time).
Memory leaks, DBA issues, testing: all this should be fixed in either the development or the testing environment, and deployed to production only after extensive testing.