Hi all, I had an issue this morning with one of my virtual machines. It wouldn't boot (into any runlevel), nor could I chroot into the root partition using a rescue disk.
Unfortunately I didn't grab a screenshot, however the error(s) when booting were:
/pre-pivot/50selinux-loadpolicy.sh: 14
<other messages>
init: readahead main process (425) killed by SEGV signal
init: readahead-collector main process (421) killed by SEGV signal
init: rcS pre-start process (425) killed by SEGV signal
init: Error while reading from descriptor: Bad file descriptor
init: readahead-collector post-stop process (424) killed by SEGV signal
init: rcS post-stop process (427) killed by SEGV signal
init: readahead-disable-services main process (428) killed by SEGV signal
When using a rescue CD and chrooting into the root partition, I would get:
Segmentation Fault: Core Dumped
In the end, the fix was to boot into a rescue CD with networking, and SCP the entire contents of /bin and /sbin from another (working) server to the broken installation. This finally allowed CentOS 6 to boot correctly.
So I'm left to assume some of the files in /bin or /sbin were corrupt.
My question is: does anyone have any ideas on how this might have happened? I did run a quick memory test from the rescue CD (it didn't run to completion) and it found no errors. The virtual machine runs on VMware alongside 3 other VMs, which all seem fine. There wasn't any unexpected power loss either.
Thanks.
Also, the last message in /var/log/messages before the crash was:
^@^@^@^@^@^@^@^@ [... long run of NUL bytes snipped ...] ^@^@^@^@May 29 07:30:10 *hostname* kernel: imklog 5.8.10, log source = /proc/kmsg started
Which seems very concerning.
On 29 May 2016 at 10:27, John Cenile jcenile1983@gmail.com wrote:
<snip />
On 2016-05-29 10:42, John Cenile wrote:
Also, the last message in /var/log/messages before the crash was:
<snip />
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@May 29 07:30:10 *hostname* kernel: imklog 5.8.10, log source = /proc/kmsg started
Which seems very concerning.
Hi John,
TL;DR: prevention.
I can't say what happened, but I've a long-standing dread of your situation. Here are some ways to prepare for (or prevent) the next time this happens. Possibly you're already doing all this, but a recitation here might help someone else too.
- Set up remote logging. I favor rsyslog, but you can also use syslog-ng. Have one central logging server. This way you can look for signs of trouble that preceded the crash.
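For rsyslog, forwarding everything to a central host can be as simple as one line on each guest (the hostname is a placeholder):

```
# /etc/rsyslog.conf -- forward all messages to the central log host
# over TCP (@@ = TCP, a single @ = UDP); loghost.example.com is a placeholder
*.* @@loghost.example.com:514
```

That way the last gasps of a dying guest survive somewhere its own disk corruption can't reach.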
- Keep baselines from the guest VMs. You can run rpm --verify and preserve the output off-host (as the last step in yum update). Disable the nightly prelink job (was this ever a good idea?) to make comparing results more meaningful. Post-crash, mount the victim read-only and re-run the verify to pinpoint what part of the filesystem was clobbered. Knowing what was clobbered (and when) can help. Not long ago an errant script in production cleared the wrong directory, but only when transaction volume crested some threshold, wiping out a critical monitoring script.
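To illustrate the baseline idea, here's a rough Python sketch that diffs a saved `rpm -Va` baseline against a post-crash run; the verify lines below are made-up examples, and in practice you'd read both from files kept off-host:

```python
def changed_since_baseline(baseline_lines, current_lines):
    """Return verify-output lines present now but absent from the baseline."""
    baseline = {line.rstrip() for line in baseline_lines}
    return [line.rstrip() for line in current_lines
            if line.rstrip() and line.rstrip() not in baseline]

# Hypothetical rpm -Va output: one expected config drift, one new casualty
baseline = ["S.5....T.  c /etc/ssh/sshd_config"]
current  = ["S.5....T.  c /etc/ssh/sshd_config",
            "SM5....T.    /bin/bash"]
print(changed_since_baseline(baseline, current))  # -> ['SM5....T.    /bin/bash']
```

Only the new damage shows up, so you're not wading through routine config-file drift while the machine is down.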
- Treat your hosts like cattle, not pets. Automating creation and maintenance of hosts gives you more and better options for recovery when hosts go insane.
- Test and re-test your storage system. There are bugs lurking in every storage code base and every HBA's firmware. The physical connectors in your data path are built on a mass of compromises and contradictory design goals and are just waiting to fail. Flush bugs out before putting gear into production.
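For a quick sanity check of a storage path, a write/read-back round trip is the basic idea; real burn-in tools (badblocks, fio) are far more thorough, but a sketch looks like this (path and sizes are placeholders):

```python
import hashlib
import os

def burn_in(path, block_size=4096, blocks=256, seed=b"burn-in"):
    """Write deterministic pseudo-random blocks to path, then read them back."""
    def block(i):
        # 32-byte SHA-256 digests tiled up to block_size (block_size % 32 == 0)
        return hashlib.sha256(seed + i.to_bytes(8, "big")).digest() * (block_size // 32)
    with open(path, "wb") as f:
        for i in range(blocks):
            f.write(block(i))
        f.flush()
        os.fsync(f.fileno())          # push the data through the page cache
    with open(path, "rb") as f:
        for i in range(blocks):
            if f.read(block_size) != block(i):
                return False          # data didn't survive the round trip
    return True
```

Run it against a scratch file on the suspect datastore before trusting the gear with anything you care about.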
- Restores, not backups, are your friends.[1] I ran into a bug in GNU tar (this year) that left me with silently corrupted archives, but only for thin-provisioned virtual filesystems >16GB that compressed to <8GB. Only a full restore unearthed the ugly truth.
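To make that concrete, a backup routine can exercise the restore path itself and compare checksums before trusting the archive. A sketch, using Python's stdlib tarfile purely for illustration rather than GNU tar:

```python
import hashlib
import os
import tarfile
import tempfile

def sha256(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def backup_and_verify(src_path):
    """Archive src_path, restore it into a scratch dir, compare checksums."""
    with tempfile.TemporaryDirectory() as work:
        archive = os.path.join(work, "backup.tar.gz")
        with tarfile.open(archive, "w:gz") as tar:
            tar.add(src_path, arcname="data")
        restore_dir = os.path.join(work, "restore")
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(restore_dir)   # the step most backup jobs skip
        restored = os.path.join(restore_dir, "data")
        return sha256(src_path) == sha256(restored)
```

A nightly job that refuses to report success until `backup_and_verify` passes would have caught that tar bug the day it appeared, not on restore day.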
- Consider ECC RAM. Once you have a few tens of GBs you've essentially got your own cosmic ray detector. If you figure your time at $50/hour and it takes ten hours to deal with one ephemeral, mysterious incident, then springing for $500 worth of ECC RAM is a good bet. Figure in the cost of downtime and it's a no-brainer.
Best regards,