[Centos] crashing app

Wed Jan 26 04:23:04 UTC 2005
Jonathan Dill <jfdill_2 at jfdill.com>

Michael Best wrote:

> A reformat reinstall is nothing more than erasing the disk and 
> starting over, give or take some log files.
>
> *) Erase all files associated with the application and reinstall, test
> *) swap space is being corrupted?  turn off swap
>    make swap with the check for bad blocks option, turn swap back on
> *) Run harddrive diagnostics perhaps the harddrives are going bad
> *) Run memtest+, if you can do this as an overnight burn in
> *) Perhaps Steam is the problem, who knows?

First off, I'd set up a way to recover the system to a state from before 
it stopped working correctly.  Mirroring the files to another disk is 
probably the fastest way: then you could just boot off KNOPPIX and rsync 
the backup files over what's there.  If you were running from within 
VMware, you could just schedule snapshots and revert to a previous 
snapshot of the system from when it was working properly.  Restoring 
from tape is too slow and a pain to do, not something you want to get 
stuck doing every couple of weeks.  It's much easier and faster to just 
rsync off an extra disk or another computer.
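
A minimal sketch of the mirror step, assuming the spare disk is mounted 
at /mnt/backup (a made-up path, adjust to your setup).  It only prints 
the rsync command unless you set RUN=1, so you can look at it first:

```shell
#!/bin/sh
# Sketch only: the paths below are assumptions, not from any real setup.
SRC="${SRC:-/}"                 # live root filesystem
DEST="${DEST:-/mnt/backup/}"    # hypothetical mount point of the spare disk

# Print the rsync command instead of running it unless RUN=1 is set.
mirror() {
    # -a archive mode, -H keep hard links, -x stay on one filesystem,
    # --delete drops files from the copy that are gone from the source
    set -- rsync -aHx --delete "$1" "$2"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

mirror "$SRC" "$DEST"
```

To restore, you would boot KNOPPIX, mount both partitions, and run the 
same kind of rsync with the source and destination swapped.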

If possible, I would repartition the system with two "root" partitions 
called "/" and "/r2", where "/r2" is just an rsync backup of "/".  You 
would set up LILO or GRUB entries to be able to boot off /r2, and you 
would need to modify /r2/etc/fstab so that the copy mounts its own 
partition as root when you "run" on it.  Say, for example, you sync 
/ -> /r2 daily.  On the day the app stops working, you just reboot the 
system into /r2, the system is probably working again, and then you sync 
/r2 -> /.  I have used this approach a lot so that I could install a 
newer OS on /r2 but leave the old OS on /.  If there was a problem 
getting any custom applications to work, I could just reboot to the old 
OS.  Some people have complained that it "wastes disk space", so lately 
I split the OS into "/" and "/usr/share".  Then when I want to install a 
newer OS, I move /usr/share to a loop filesystem on /home, for example, 
and the /usr/share partition becomes /r2.  It's a bit more complicated 
but it works.
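
The fstab change for the /r2 copy could look roughly like this.  The 
device names (/dev/hda2 for /, /dev/hda3 for /r2) and the helper name 
swap_roots are made up for illustration:

```shell
#!/bin/sh
# Sketch only.  Hypothetical layout: / on /dev/hda2, /r2 on /dev/hda3.

# Daily sync, left as a comment because it overwrites /r2.  The fstab is
# excluded so the edited copy on /r2 survives each sync:
#   rsync -aHx --delete --exclude=/etc/fstab / /r2/

# swap_roots: read an fstab on stdin and print a copy in which the
# mount points "/" and "/r2" are exchanged.
swap_roots() {
    awk '
        $2 == "/"   { $2 = "/r2"; print; next }
        $2 == "/r2" { $2 = "/";   print; next }
        { print }
    '
}

# Demo on two sample lines; nothing on disk is touched.
printf '%s\n' \
    '/dev/hda2 / ext3 defaults 1 1' \
    '/dev/hda3 /r2 ext3 defaults 1 2' | swap_roots
```

You would run something like "swap_roots < /etc/fstab > /r2/etc/fstab" 
once.  awk rewrites the changed lines with single spaces, and the fsck 
pass numbers in the sixth field are left alone, so adjust those by hand 
if you care.  The boot loader entry for /r2 also has to name the /r2 
partition as its root device.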

Offhand, I'd burn a stresslinux CD and try memtest86 and the appropriate 
versions of cpuburn to see if anything weird happens:

http://www.stresslinux.org/

Then, still running off the CD, I'd try "mkswap -c" on the swap 
partition and "e2fsck -c -f" on all of the filesystems, assuming you're 
using ext2 or ext3.  The usual disclaimer applies: BACK UP your system 
first.  Maybe Steam is getting its data out of the "wrong place" w.r.t. 
the Linux kernel, like reading from fs buffers that may be out of sync, 
as happened with "dump".  That is just a wild guess, though, because I 
deleted the original message and have no idea what Steam is.  What 
follows is just some general advice to come up with a plan of attack, 
plus some brainstorming ideas.
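
To work out which check goes with which partition, something like the 
following might help.  It is only a sketch: plan_checks is a made-up 
helper that reads fstab-format lines and prints the commands without 
running any of them, and it skips entries that use LABEL= instead of a 
plain /dev name:

```shell
#!/bin/sh
# Sketch only.  plan_checks reads fstab-format lines on stdin and prints
# one check command per swap or ext2/ext3 entry.  Nothing is executed.
plan_checks() {
    awk '
        /^[ \t]*#/        { next }   # skip comment lines
        $1 !~ /^\/dev\//  { next }   # only plain device names
        $3 == "swap"                 { print "mkswap -c " $1 }
        $3 == "ext2" || $3 == "ext3" { print "e2fsck -c -f " $1 }
    '
}

# Demo on sample lines (hypothetical devices).
printf '%s\n' \
    '/dev/hda1 /boot ext3 defaults 1 2' \
    '/dev/hda2 / ext3 defaults 1 1' \
    '/dev/hda5 swap swap defaults 0 0' | plan_checks
```

From the CD you would feed it the fstab of the installed system, read 
the list over, and then run the commands by hand with the filesystems 
unmounted and swap turned off.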

I'd avoid having any part of Steam rely on files on an NFS filesystem.  
If that is your setup, that would be my #1 suspect.  If Steam uses any 
databases, you might want to periodically run whatever tools they 
provide to optimize / fix the tables.

Try to break the problem down into "layers", analogous to the OSI 
layers, according to what is processing the data where, and then check 
out the problem at each "layer".  Just roughly, something like:

application
libraries
database
kernel
filesystem
hardware
network

Those might not be in the "right" order but could give you some ideas.  
The goal is to try to isolate and identify which "layer" the problem is 
occurring at.  I have solved problems where the cause was unexpected, 
like ypbind losing its connection to the NIS server.  That seemed like 
it shouldn't have affected the application, but it did for some reason, 
perhaps because the application used certain system calls that ended up 
tying into NIS.
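
As a rough starting point for walking the layers, here is a sketch that 
prints one first-look command per layer.  The choice of commands is 
mine, $APP and $DISK are placeholders for your application binary and 
disk device, and nothing is actually run:

```shell
#!/bin/sh
# Sketch only.  layer_probe prints a suggested first command for a
# layer; $APP and $DISK are placeholders left for you to fill in.
layer_probe() {
    case "$1" in
        application) echo "strace -f -o /tmp/app.trace \$APP" ;;
        libraries)   echo "ldd \$APP" ;;
        database)    echo "(depends on the database in use)" ;;
        kernel)      echo "dmesg | tail -n 50" ;;
        filesystem)  echo "df -h; mount" ;;
        hardware)    echo "smartctl -a \$DISK" ;;
        network)     echo "ypwhich; netstat -i" ;;
        *)           echo "no probe defined for $1"; return 1 ;;
    esac
}

for layer in application libraries database kernel filesystem hardware network; do
    printf '%-12s %s\n' "$layer" "$(layer_probe "$layer")"
done
```

The point is just to have something concrete to try at each layer so 
you can rule layers out one at a time.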

--jonathan