Michael Best wrote:
A reformat reinstall is nothing more than erasing the disk and starting over, give or take some log files.
*) Erase all files associated with the application and reinstall, test
*) swap space is being corrupted? turn off swap, make swap with the check for bad blocks option, turn swap back on
*) Run harddrive diagnostics, perhaps the harddrives are going bad
*) Run memtest+, if you can do this as an overnight burn in
*) Perhaps Steam is the problem, who knows?
First off, I'd set up a way to recover the system to a state from before it stopped working correctly. Mirroring the files to another disk is probably the fastest way: when things break, you boot off KNOPPIX and rsync the backup files over what's there. If you were running inside VMware, you could schedule snapshots and revert to one taken when the system was working properly. Restoring from tape is slow and a pain, not something you want to get stuck doing every couple of weeks. It's much easier and faster to rsync from an extra disk or another computer.
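As a rough sketch of the restore step, here is a small Python helper that only builds the rsync command line so you can look at it before running anything. The paths /mnt/backup and /mnt/root are made up for the example (wherever you mounted the mirror and the damaged root from the rescue CD), and the flag selection is my assumption of a sensible default, not the one true recipe.

```python
import shlex


def build_restore_command(backup_root, target_root, dry_run=True):
    """Build (but do not run) an rsync command that copies a mirror back over a system."""
    # A trailing slash on the source means "copy the contents of this directory".
    source = backup_root.rstrip("/") + "/"
    target = target_root.rstrip("/") + "/"
    argv = [
        "rsync",
        "-a",             # archive mode: permissions, owners, times, symlinks, devices
        "-H",             # preserve hard links
        "-x",             # do not cross filesystem boundaries
        "--numeric-ids",  # keep uid/gid numbers; the rescue CD has its own passwd file
        "--delete",       # remove files that are not present in the backup
    ]
    if dry_run:
        argv.append("-n")  # only report what would change
    argv += [source, target]
    return argv


if __name__ == "__main__":
    # Hypothetical mount points, assumed for illustration only.
    print(shlex.join(build_restore_command("/mnt/backup", "/mnt/root")))
```

It defaults to a dry run on purpose, since --delete pointed at the wrong directory is a good way to lose the one working copy.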
If possible, I would repartition the system with two "root" partitions, "/" and "/r2", where "/r2" is just an rsync backup of "/". You would set up LILO or GRUB entries so you can boot off /r2, and you need to modify /r2/etc/fstab, since the root partition is a different device when you "run" on that copy. Say, for example, you sync / -> /r2 daily. The day the app stops working, you reboot into /r2, the system is probably working again, and then you sync /r2 -> /.
I have used this approach a lot to install a newer OS on /r2 while leaving the old OS on /, so that if there was a problem getting any custom applications to work I could just reboot to the old OS. Some people have complained that it "wastes disk space", so lately I split the OS into "/" and "/usr/share"; when I want to install a newer OS, I move /usr/share to a loop filesystem on /home, for example, and the /usr/share partition becomes /r2. It's a bit more complicated, but it works.
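The fstab change on the copy is the part that is easy to forget, so here is a minimal sketch of it in Python. It assumes the simplest layout: both partitions are listed in fstab, and on the copy you just want their mount points swapped so the old root shows up under /r2. The device names in the example are hypothetical.

```python
def fstab_for_second_root(fstab_text, primary="/", secondary="/r2"):
    """Return fstab text with the mount points of the two root partitions swapped."""
    swap = {primary: secondary, secondary: primary}
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        # Leave blank lines, comments, and unrelated entries exactly as they are.
        if len(fields) < 2 or fields[0].startswith("#") or fields[1] not in swap:
            out.append(line)
            continue
        fields[1] = swap[fields[1]]  # second column is the mount point
        out.append("\t".join(fields))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical devices, assumed for illustration only.
    sample = (
        "/dev/hda1 / ext3 defaults 1 1\n"
        "/dev/hda2 /r2 ext3 defaults 1 2\n"
    )
    print(fstab_for_second_root(sample), end="")
```

You would feed it the contents of /etc/fstab and write the result to /r2/etc/fstab after each sync. The fsck pass numbers in the last column are left untouched, so you may want to adjust those by hand.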
Offhand, I'd burn a stresslinux CD and try memtest86 and the appropriate versions of cpuburn to see if anything weird happens.
Then, still running off the CD, I'd try "mkswap -c" on the swap partition and "e2fsck -c -f" on all of the filesystems, assuming you're using ext2 or ext3. The usual disclaimer applies: BACK UP your system first. Maybe Steam is getting its data out of the "wrong place" with respect to the Linux kernel, like reading from fs buffers that may be out of sync, as happened with "dump". That is just a wild guess, though, because I deleted the original message and have no idea what Steam is. What follows is just some general advice and brainstorming to help come up with a plan of attack.
I'd avoid having any part of Steam rely on files on an NFS filesystem. If that is your setup, it would be my #1 suspect. If Steam uses any databases, you might want to periodically run whatever tools they provide to optimize or repair the tables.
Try to break the problem down into "layers", analogous to the OSI layers, according to what is processing the data where, and then check out the problem at each "layer". Roughly something like:
application
libraries
database
kernel
filesystem
hardware
network
Those might not be in the "right" order, but they could give you some ideas. The goal is to isolate and identify which "layer" the problem is occurring at. I have solved problems where the cause was unexpected, like ypbind losing its connection to the NIS server. That seemed like it shouldn't have affected the application, but it did for some reason, such as the application using certain system calls that ended up tying into NIS.
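If you want to make the layer-by-layer idea a bit more mechanical, something like the following Python sketch could work. The bottom-up order is my own guess, and the checks in the example are stubs; in real use each one would be whatever test you trust for that layer (a memtest result, a ping to the NIS server, a database integrity check, and so on).

```python
# Assumed bottom-up order; rearrange to suit your own setup.
LAYERS = ["hardware", "kernel", "filesystem", "network",
          "database", "libraries", "application"]


def first_failing_layer(checks, order=LAYERS):
    """Run one check per layer in order; return the first layer that fails, or None."""
    for layer in order:
        check = checks.get(layer)
        if check is None:
            continue  # no check written for this layer yet
        try:
            ok = check()
        except Exception:
            ok = False  # a check that crashes counts as a failure
        if not ok:
            return layer
    return None


if __name__ == "__main__":
    # Stub checks standing in for real tests; purely hypothetical.
    checks = {
        "hardware": lambda: True,
        "network": lambda: False,      # e.g. the NIS server is unreachable
        "application": lambda: False,  # the symptom everyone noticed
    }
    print(first_failing_layer(checks))
```

The point of going bottom up is that a failure low in the stack usually explains the failures above it, so the first layer reported is the one to look at first.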
--jonathan