On Tue, Jul 8, 2014 at 11:25 AM, Lamar Owen lowen@pari.edu wrote:
Memory tests are redundant with ECC. (I know; I have an older SuperMicro server here that passes memory testing in POST but throws nearly continuous ECC errors in operation; it does operate, though). If it fails during spinup, flag the failure while spinning up another server.
I don't think that is generally true. I've seen several IBM systems disable memory during POST and come up running with a smaller amount.
Virtual servers have no need of POST (they also don't save as much power; although dynamic load balancing can do some predictive heuristics and spin up host hypervisors as needed and do live migration of server processes dynamically).
Our services that need scaling need all of the hardware capability and aren't virtualized. That might change someday...
To detect failures early, spin up every server in a rotating sequence with a testing instance, and skip POST entirely.
If you have to, spin up the server in a stateless mode and put it to sleep. Then wake it up with dynamic state.
Our servers tend to just run till they die. If we didn't need them we wouldn't have bought them in the first place. I suppose there are businesses with different processes that come and go, but I'm not sure that is desirable.
Long POSTs need to go away, with better fault tolerance after spinup being far more desirable, much like the promise of the old-as-dirt Tandem NonStop system. (I say the 'promise' rather than the 'implementation' for a reason...)
If you need load balancing anyway you just run enough spares to cover the failures.
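As a rough sketch of what "enough spares" could mean in numbers: if you assume each server fails independently with some probability over the window you care about, a binomial calculation gives the smallest spare count that keeps the required number of servers up with a chosen confidence. The function names, the independence assumption, and the default target below are all illustrative, not anything we actually run.

```python
from math import comb


def prob_at_least_up(total, needed, p_fail):
    """Probability that at least `needed` of `total` servers are up,
    assuming each fails independently with probability `p_fail`."""
    p_up = 1.0 - p_fail
    return sum(
        comb(total, k) * p_up**k * p_fail**(total - k)
        for k in range(needed, total + 1)
    )


def spares_needed(needed, p_fail, target=0.999, max_spares=1000):
    """Smallest number of spares so that `needed` servers are up
    with probability >= `target` (hypothetical sizing helper)."""
    for spares in range(max_spares + 1):
        if prob_at_least_up(needed + spares, needed, p_fail) >= target:
            return spares
    raise ValueError("target not reachable within max_spares")
```

Calling spares_needed(needed, p_fail) with your own load requirement and an observed failure rate gives a starting figure; correlated failures (shared power, a bad firmware push) break the independence assumption and would call for more headroom.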