[CentOS] Unable to boot CentOS 6 - Segmentation Error
cpolish at surewest.net
Sun May 29 17:03:26 UTC 2016
On 2016-05-29 10:42, John Cenile wrote:
> Also, the last message in /var/log/messages before the crash was:
<snip />
> ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@May 29 07:30:10 *hostname* kernel: imklog 5.8.10, log source = /proc/kmsg started
>
> Which seems very concerning.
Hi John,
TL;DR: prevention.
I can't say what happened, but I've a long-standing dread of
your situation. Here are some ways to prepare for (or prevent)
the next time this happens. Possibly you're already doing all
this, but a recitation here might help someone else too.
- Set up remote logging. I favor rsyslog, but you can also
use syslog-ng. Have one central logging server. This way you
can look for signs of trouble that preceded the crash.
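  With rsyslog 5.x as shipped on CentOS 6, the pieces are small.
  A minimal sketch; loghost.example.com and the port are
  placeholders for your own central server:

    # On each guest: forward everything to the central server over
    # TCP ("@@" means TCP, a single "@" means UDP).
    echo '*.* @@loghost.example.com:514' > /etc/rsyslog.d/remote.conf
    service rsyslog restart

    # On the central server: load the TCP listener and open the port.
    printf '$ModLoad imtcp\n$InputTCPServerRun 514\n' > /etc/rsyslog.d/listen.conf
    service rsyslog restart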
- Keep baselines from the guest VMs. You can run rpm --verify
and preserve the output off-host (as the last step of every
yum update). Disable the nightly prelink run (was this ever a
good idea?) so that successive verify results stay comparable.
Post-crash, mount the victim read-only and re-run the verify
to pinpoint what part of the filesystem was clobbered.
Knowing what was clobbered (and when) can help. Not long ago
an errant script in production cleared the wrong directory,
but only when transaction volume crested some threshold,
wiping out a critical monitoring script. A sketch follows.
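  Something like this on CentOS 6 (the loghost destination and
  paths are placeholders; rpm -Va, prelink, and
  /etc/sysconfig/prelink are stock):

    # One-time: stop the nightly prelink job from rewriting
    # binaries, then undo existing prelinking so checksums settle.
    sed -i 's/^PRELINKING=.*/PRELINKING=no/' /etc/sysconfig/prelink
    prelink -ua

    # After each yum update: record a baseline and push it off-host.
    rpm -Va > /root/rpm-verify-$(date +%F).txt 2>&1
    scp /root/rpm-verify-$(date +%F).txt loghost:/srv/baselines/$(hostname)/

    # Post-crash: mount the victim read-only and verify against the
    # RPM database on that filesystem, then diff the result against
    # the newest saved baseline.
    mount -o ro /dev/vg_victim/lv_root /mnt/victim   # device name is a placeholder
    rpm -Va --root=/mnt/victim > /tmp/verify-crash.txt 2>&1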
- Treat your hosts like cattle, not pets. Automating creation
and maintenance of hosts gives you more and better options
for recovery when hosts go insane.
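  For example, on a CentOS 6 KVM host a guest can be rebuilt
  unattended from a kickstart file. Just a sketch: every name,
  size, and URL below is a placeholder:

    # Rebuild a guest from scratch; ks.cfg carries the whole
    # configuration, so "recover" can mean "recreate".
    virt-install \
        --name web01 \
        --ram 2048 \
        --disk path=/var/lib/libvirt/images/web01.img,size=20 \
        --location http://mirror.centos.org/centos/6/os/x86_64/ \
        --extra-args "ks=http://ks.example.com/centos6-web.cfg" \
        --nographics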
- Test and re-test your storage system. There are bugs lurking
in every storage code base and every HBA's firmware. The
physical connectors in your data path are built on a mass
of compromises and contradictory design goals and are just
waiting to fail. Flush bugs out before putting gear into
production.
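  A burn-in sketch (destructive: it wipes the device, so only
  before production; /dev/sdX is a placeholder):

    # Four destructive write/read passes over the raw device; any
    # error here is a disk, cable, HBA, or firmware problem you
    # found early instead of late.
    badblocks -wsv /dev/sdX

    # Follow with a long SMART self-test and check the results.
    smartctl -t long /dev/sdX
    smartctl -a /dev/sdX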
- Restores, not backups, are your friends.[1] I ran into a
bug in GNU tar (this year) that left me with silently
corrupted archives, but only for thin-provisioned virtual
filesystems >16GB that compressed to <8GB. Only a full
restore unearthed the ugly truth.
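  The only way I know to catch that class of bug is a periodic
  restore drill, sketched here (archive name and paths are
  placeholders):

    # Restore into a scratch area, never over the live system.
    mkdir -p /srv/restore-test
    tar -xzf /backups/host-full.tar.gz -C /srv/restore-test

    # Compare the restored tree against the source, file by file.
    # Silent corruption shows up here, not in tar's exit status.
    diff -r /etc /srv/restore-test/etc   # repeat per backed-up tree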
- Consider ECC RAM. Once you have a few tens of GB you've
essentially got your own cosmic ray detector. If you
figure your time at $50/hour and it takes ten hours to deal
with one ephemeral, mysterious incident, then springing
for $500 worth of ECC RAM is a good bet. Figure in the cost
of downtime and it's a no-brainer.
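  And once the ECC RAM is in, the kernel's EDAC counters will
  tell you when it pays off; a sketch, assuming the edac driver
  for your chipset is loaded:

    # Corrected (ce) vs. uncorrectable (ue) error counts per memory
    # controller; a climbing ce_count means ECC is quietly saving you.
    grep . /sys/devices/system/edac/mc/mc*/ce_count \
           /sys/devices/system/edac/mc/mc*/ue_count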
Best regards,
--
Charles Polisher
[1] http://web.archive.org/web/20070920215346/http://people.qualcomm.com/ggr/GoB.txt