On Wed, 2006-07-12 at 19:33 -0400, Paul wrote:
> OK, I'm still trying to solve this. Though the server has been up rock
> steady, but the errors concern me. I built this on a test box months ago
> and now that I am thinking, I may have built it originally on a drive of a
> different manufacturer, although about the same size (20g). This may have
> something to do with it. What is the easiest way to get these errors
> taken care of? I've tried e2fsck, and also ran fsck on Vol00. Looks like
> I made a fine mess of things. Is there a way to fix it without reloading

AFAIK, there is no "easiest way". From my *limited* knowledge, you have
a couple of different problems (maybe) and they have not been identified
yet. I'll offer some guesses and suggestions, but without my own
hard-headed stubbornness in play, results are even more iffy.

> Centos? Here are some outputs:
>
> snapshot from /var/log/messages:
>
> Jul 12 04:03:21 hostname kernel: hda: dma_intr: status=0x51 { DriveReady
> SeekComplete Error }
> Jul 12 04:03:21 hostname kernel: hda: dma_intr: error=0x84 {
> DriveStatusError BadCRC }
> Jul 12 04:03:21 hostname kernel: ide: failed opcode was: unknown

I've experienced these regularly on a certain brand of older drive
(*really* older, probably not your situation). Maxtor IIRC. Anyway, the
problem occurred mostly on cold boot or when re-spinning the drive after
it slept. It apparently had a really *slow* spin-up speed and a timeout
would occur (not handled in the protocol, I guess). Your post doesn't
mention whether this might be related. If all your log occurrences tend
to indicate it happens only after long periods of inactivity, or upon
cold boot, it might not be an issue. But even there, hdparm might be
some help.

Also, if it does seem to happen only on cold boot or after long periods
of "sleeping", is it possible that a bunch of things starting at the
same time are taxing the power supply? Is the PS "weak"? Remember that
a PS not only needs a maximum wattage sufficient to support the maximum
draw of all devices at the same time (plus a margin for safety), but
that the various 5/12 volt lines also have their own limits. Different
PSs have different limits on those lines, and often they are not
published on the PS label. Lots of 12 or 5 volt draws at the same time
(as happens in a non-sequenced start-up) might be producing an
unacceptable voltage or amperage drop.

Is your PCI bus 33/66/100 MHz? Do you get messages on boot saying
"assume 33MHz.... use idebus=66"? I hear it's OK to have an idebus
param that is too fast, but it's a problem if your bus is faster than
what the kernel thinks it is. (I'll put a rough grub.conf example a bit
further down.)

Re-check and make sure all cables are well-seated and that power is
well connected. Speaking of cables, is the cable new or "old"? Maybe it
has a small intermittent break? Try replacing it. Try using an
80-conductor (UDMA?) cable, if not using that already.

If the problem is only on cold boot, can you get a DC volt-meter on the
power connector? If so, look for the voltages to "sag". That might tell
you that you are taxing your PS. Or use the labels, do the math, and
calculate whether you are close to the max wattage in a worst-case
scenario.

I suggest using hdparm (*very* carefully) to see if the problem can be
replicated on demand. Take the drive into various reduced-power modes,
restart it, and see if the problem is fairly consistent.
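Something along these lines is what I have in mind -- device name taken
from your log (hda), and only as a sketch, on a quiet box, after
backups:

  # terminal 1: watch for the dma_intr / BadCRC messages as they happen
  tail -f /var/log/messages

  # terminal 2: see what power state the drive is in right now
  hdparm -C /dev/hda

  # spin it down to standby, give it a minute or two...
  hdparm -y /dev/hda

  # ...then force it back up with a read test and watch whether the
  # errors show up right at the wake-up
  hdparm -t /dev/hda

If the errors only ever appear right at spin-up, that would point back
toward the slow-spin-up/power theory rather than the cable.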
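And circling back to the idebus= bit for a second: if your boot messages
do suggest idebus=66 and you want to try pinning it, it just gets tacked
onto the kernel line in /boot/grub/grub.conf. Very roughly -- your
kernel version and root= will differ, this is only the shape of it:

  title CentOS (2.6.9-34.EL)
          root (hd0,0)
          kernel /vmlinuz-2.6.9-34.EL ro root=/dev/VolGroup00/LogVol00 idebus=66
          initrd /initrd-2.6.9-34.EL.img

The only part that matters there is the idebus=66 appended to the
kernel line.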
>
> sfdisk -l:
>
> Disk /dev/hda: 39870 cylinders, 16 heads, 63 sectors/track
> Warning: The partition table looks like it was made
>   for C/H/S=*/255/63 (instead of 39870/16/63).
> For this listing I'll assume that geometry.
> Units = cylinders of 8225280 bytes, blocks of 1024 bytes, counting from 0
>
>    Device Boot Start     End   #cyls    #blocks   Id  System
> /dev/hda1   *      0+     12      13-    104391   83  Linux
> /dev/hda2         13    2500    2488   19984860   8e  Linux LVM
> /dev/hda3          0       -       0          0    0  Empty
> /dev/hda4          0       -       0          0    0  Empty
> Warning: start=63 - this looks like a partition rather than
> the entire disk. Using fdisk on it is probably meaningless.
> [Use the --force option if you really want this]

What does your BIOS show for this drive? It's likely that the drive was
labeled (or copied from a drive that was labeled) in another machine.
The "key" for me is the "255" vs. "16". The only fix here (not that it's
important to do, though) is to get the drive properly labeled for this
machine: back up the data, make sure the BIOS is set correctly, and
fdisk (or sfdisk) it to get the partitions correct.

WARNING! Although this can be done "live", use sfdisk -l -uS to get the
starting sector numbers and make the new partitions match. When you
re-label at "255", some of the calculated translations internal to the
drivers(?) might change (do things *still* translate to CHS on modern
drives? I'll need to look into that some day. I bet not.). Also, the
*desired* starting and ending sectors of the partitions are likely to
change. What I'm saying is that the final partitioning will likely be
"non-standard" in layout and lying in wait to bite your butt.

I would back up the data, change the BIOS, and sfdisk it (or fdisk or
cfdisk, or any other partitioner, your choice). If the system is hot,
sfdisk -R will re-read the params and get them into the kernel. Then
reload the data (if needed). If it's done "hot": single user, or run
level 1, mounted "ro", of course. Careful reading of the sfdisk man
page will let you script and test (on another drive) parts of this.

Easy enough so far? >:-)

>
> sfdisk -lf

The "f" does you no good here, as you can see. It is really useful only
when trying to change the disk label. What would be useful (maybe) to
you is "-uS".

> <snip>

HTH

--
Bill
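P.S. If you do go the re-label route, the bookkeeping side of it is
roughly this (the dump file names are just examples, and none of it
replaces a real backup of the data):

  # save the current table in both forms before touching anything
  sfdisk -d /dev/hda > /root/hda.sfdisk.dump
  sfdisk -l -uS /dev/hda > /root/hda.sectors.txt

  # the dump can be fed straight back in later if you need to bail out:
  #   sfdisk /dev/hda < /root/hda.sfdisk.dump

  # if you repartition while the system is up, make the kernel re-read
  # the table afterwards
  sfdisk -R /dev/hda

The -uS listing gives you the starting sectors to match when you re-do
the partitions at "255".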