On 2016-05-29 10:42, John Cenile wrote:
Also, the last message in /var/log/messages before the crash was:
<snip />
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@May 29 07:30:10 *hostname* kernel: imklog 5.8.10, log source = /proc/kmsg started
Which seems very concerning.
Hi John,
TL;DR: prevention.
I can't say what happened, but I've a long-standing dread of situations like yours. Here are some ways to prepare for (or prevent) the next time this happens. Possibly you're already doing all this, but a recitation here might help someone else too.
- Set up remote logging. I favor rsyslog, but you can also use syslog-ng. Have one central logging server so you can look for signs of trouble that preceded the crash even when the victim's own logs are full of nulls. (A minimal forwarding sketch follows this list.)
- Keep baselines from the guest VMs. You can run rpm --verify and preserve the output off-host (as the last step of a yum update). Disable the nightly prelink behavior (was this ever a good idea?) to keep verify results comparable between runs. Post-crash, mount the victim read-only and re-run the verify to pinpoint what part of the filesystem was clobbered. Knowing what was clobbered (and when) can help: not long ago an errant script in production cleared the wrong directory, but only when transaction volume crested some threshold, wiping out a critical monitoring script. (A sketch of the whole cycle follows this list.)
- Treat your hosts like cattle, not pets. Automating creation and maintenance of hosts gives you more and better options for recovery when hosts go insane.
- Test and re-test your storage system. There are bugs lurking in every storage code base and in every HBA's firmware. The physical connectors in your data path are built on a mass of compromises and contradictory design goals and are just waiting to fail. Flush bugs out before putting gear into production. (See the burn-in sketch below.)
- Restores, not backups, are your friends.[1] I ran into a bug in GNU tar (this year) that left me with silently corrupted archives, but only for thin-provisioned virtual filesystems >16GB that compressed to <8GB. Only a full restore unearthed the ugly truth. (There's a restore-test sketch below.)
- Consider ECC RAM. Once you have a few tens of GB you've essentially got your own cosmic ray detector. If you figure your time at $50/hour and it takes ten hours to deal with one ephemeral, mysterious incident, then springing for $500 worth of ECC RAM is a good bet. Figure in the cost of downtime and it's a no-brainer. (Where the error counters show up is sketched below.)
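
Remote logging, as a minimal sketch using rsyslog's legacy directive syntax (which matches the 5.x version in your log excerpt). The hostname and port are placeholders, and in practice you'd add queueing and probably TLS on top of this:

    # on each guest, append to /etc/rsyslog.conf
    # (@@ forwards over TCP, a single @ would be UDP)
    *.* @@loghost.example.com:514

    # on the central log server, accept TCP syslog
    $ModLoad imtcp
    $InputTCPServerRun 514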
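
For the baseline bullet, roughly what I mean; the baseline host, device names, and paths here are made up, so adjust to taste:

    # last step of 'yum update': capture a baseline and ship it off-host
    rpm -Va > /tmp/rpm-verify-before.txt
    scp /tmp/rpm-verify-before.txt baselines.example.com:/srv/baselines/

    # keep nightly prelink from churning binaries, so later runs stay comparable
    sed -i 's/^PRELINKING=.*/PRELINKING=no/' /etc/sysconfig/prelink

    # post-crash, from a rescue host: mount the victim read-only, re-verify
    # against its own rpmdb, and diff against the saved baseline (copied back
    # locally) to see what was clobbered
    mount -o ro /dev/mapper/victim-root /mnt/victim
    rpm -Va --root=/mnt/victim > /tmp/rpm-verify-after.txt
    diff /tmp/rpm-verify-before.txt /tmp/rpm-verify-after.txt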
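
For storage burn-in, one option is a verified write/read pass with fio before the box takes production traffic. This is a sketch against a scratch file (the mount point, size, and job parameters are arbitrary), not a substitute for a proper qualification suite:

    # write a checksummed pattern through the whole stack, then read it back and verify
    fio --name=burnin --filename=/mnt/scratch/burnin.dat --size=32g \
        --rw=randwrite --bs=4k --iodepth=32 --ioengine=libaio --direct=1 \
        --verify=crc32c --do_verify=1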
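
On restores: the only thing that would have caught my tar problem was actually extracting the archive and comparing checksums against the source tree. A sketch, with made-up archive and data paths (expect some noise from files that legitimately changed since the backup ran):

    # restore into a scratch area
    mkdir -p /scratch/restore-test
    tar -xzf /backups/guest01-20160528.tar.gz -C /scratch/restore-test

    # checksum both trees and compare
    ( cd /srv/data && find . -type f -exec md5sum {} + | sort -k2 ) > /tmp/live.md5
    ( cd /scratch/restore-test/srv/data && find . -type f -exec md5sum {} + | sort -k2 ) > /tmp/restored.md5
    diff /tmp/live.md5 /tmp/restored.md5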
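
And once you do have ECC, the counters are worth watching. On Linux, with the EDAC driver for your chipset loaded, per-memory-controller corrected and uncorrected error counts show up in sysfs; a climbing ce_count is your cosmic ray detector registering hits:

    # corrected (ce) and uncorrected (ue) error counts per memory controller
    grep -H . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count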
Best regards,