On 3/8/2011 12:31 PM, Michael Eager wrote:
>
>>> Any suggestions where I might look for a clue?
>>
>> Probably something hardware related. Bad memory, overheating, power
>> supply, etc. I've even seen some rare cases where a bios update would
>> fix it although it didn't make much sense for a machine to run for
>> years, then need a firmware change.
>
> The system is on a UPS and temps seem reasonable.
> Locating a transient memory problem is time consuming.
> Identifying a power supply which sometimes spikes is
> even more difficult. I'd like to have a clue about the
> likely problem before shutting down the server for an
> extended period.
>
> I'll set up sar and sensord to periodically log system
> status and see if this gives me a clue for the next
> time this happens.

The times I've seen things like that it would happen too quickly to log
anything. One other possibility is an individual bad CPU fan, but then
you might have to shut down completely for a while to wake it up.

--
   Les Mikesell
    lesmikesell at gmail.com