[CentOS] Unable to boot CentOS 6 - Segmentation Error
cpolish at surewest.net
Sun May 29 17:03:26 UTC 2016
On 2016-05-29 10:42, John Cenile wrote:
> Also, the last message in /var/log/messages before the crash was:
<snip />
> ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@May 29 07:30:10 *hostname* kernel: imklog 5.8.10, log source = /proc/kmsg started
>
> Which seems very concerning.
Hi John,
TL;DR: prevention.
I can't say what happened, but I've a long-standing dread of
your situation. Here are some ways to prepare for (or prevent)
the next time this happens. Possibly you're already doing all
this, but a recitation here might help someone else too.
- Set up remote logging. I favor rsyslog, but you can also
use syslog-ng. Have one central logging server. This way you
can look for signs of trouble that preceded the crash.
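  With rsyslog 5.x as shipped on CentOS 6, the pieces are small.
  A minimal sketch; loghost.example.com and the port are
  placeholders for your own central server:

    # On each guest: forward everything to the central server over
    # TCP ("@@" means TCP, a single "@" means UDP).
    echo '*.* @@loghost.example.com:514' > /etc/rsyslog.d/remote.conf
    service rsyslog restart

    # On the central server: load the TCP listener and open the port.
    printf '$ModLoad imtcp\n$InputTCPServerRun 514\n' > /etc/rsyslog.d/listen.conf
    service rsyslog restart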
- Keep baselines from the guest VMs. You can run rpm --verify
and preserve the output off-host (as the last step of every
yum update). Disable the nightly prelink run (was this ever a
good idea?) so that successive verify results stay comparable.
Post-crash, mount the victim read-only and re-run the verify
to pinpoint what part of the filesystem was clobbered.
Knowing what was clobbered (and when) can help. Not long ago
an errant script in production cleared the wrong directory,
but only when transaction volume crested some threshold,
wiping out a critical monitoring script. A sketch follows.
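  Something like this on CentOS 6 (the loghost destination and
  paths are placeholders; rpm -Va, prelink, and
  /etc/sysconfig/prelink are stock):

    # One-time: stop the nightly prelink job from rewriting
    # binaries, then undo existing prelinking so checksums settle.
    sed -i 's/^PRELINKING=.*/PRELINKING=no/' /etc/sysconfig/prelink
    prelink -ua

    # After each yum update: record a baseline and push it off-host.
    rpm -Va > /root/rpm-verify-$(date +%F).txt 2>&1
    scp /root/rpm-verify-$(date +%F).txt loghost:/srv/baselines/$(hostname)/

    # Post-crash: mount the victim read-only and verify against the
    # RPM database on that filesystem, then diff the result against
    # the newest saved baseline.
    mount -o ro /dev/vg_victim/lv_root /mnt/victim   # device name is a placeholder
    rpm -Va --root=/mnt/victim > /tmp/verify-crash.txt 2>&1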
- Treat your hosts like cattle, not pets. Automating creation
and maintenance of hosts gives you more and better options
for recovery when hosts go insane.
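  For example, on a CentOS 6 KVM host a guest can be rebuilt
  unattended from a kickstart file. Just a sketch: every name,
  size, and URL below is a placeholder:

    # Rebuild a guest from scratch; ks.cfg carries the whole
    # configuration, so "recover" can mean "recreate".
    virt-install \
        --name web01 \
        --ram 2048 \
        --disk path=/var/lib/libvirt/images/web01.img,size=20 \
        --location http://mirror.centos.org/centos/6/os/x86_64/ \
        --extra-args "ks=http://ks.example.com/centos6-web.cfg" \
        --nographics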
- Test and re-test your storage system. There are bugs lurking
in every storage code base and every HBA's firmware. The
physical connectors in your data path are built on a mass
of compromises and contradictory design goals and are just
waiting to fail. Flush bugs out before putting gear into
production.
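  A burn-in sketch (destructive: it wipes the device, so only
  before production; /dev/sdX is a placeholder):

    # Four destructive write/read passes over the raw device; any
    # error here is a disk, cable, HBA, or firmware problem you
    # found early instead of late.
    badblocks -wsv /dev/sdX

    # Follow with a long SMART self-test and check the results.
    smartctl -t long /dev/sdX
    smartctl -a /dev/sdX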
- Restores, not backups, are your friends.[1] I ran into a
bug in GNU tar (this year) that left me with silently
corrupted archives, but only for thin-provisioned virtual
filesystems >16GB that compressed to <8GB. Only a full
restore unearthed the ugly truth.
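  The only way I know to catch that class of bug is a periodic
  restore drill, sketched here (archive name and paths are
  placeholders):

    # Restore into a scratch area, never over the live system.
    mkdir -p /srv/restore-test
    tar -xzf /backups/host-full.tar.gz -C /srv/restore-test

    # Compare the restored tree against the source, file by file.
    # Silent corruption shows up here, not in tar's exit status.
    diff -r /etc /srv/restore-test/etc   # repeat per backed-up tree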
- Consider ECC RAM. Once you have a few tens of GB you've
essentially got your own cosmic ray detector. If you
figure your time at $50/hour and it takes ten hours to deal
with one ephemeral, mysterious incident, then springing
for $500 worth of ECC RAM is a good bet. Figure in the cost
of downtime and it's a no-brainer.
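  And once the ECC RAM is in, the kernel's EDAC counters will
  tell you when it pays off; a sketch, assuming the edac driver
  for your chipset is loaded:

    # Corrected (ce) vs. uncorrectable (ue) error counts per memory
    # controller; a climbing ce_count means ECC is quietly saving you.
    grep . /sys/devices/system/edac/mc/mc*/ce_count \
           /sys/devices/system/edac/mc/mc*/ue_count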
Best regards,
--
Charles Polisher
[1] http://web.archive.org/web/20070920215346/http://people.qualcomm.com/ggr/GoB.txt