On Wed, 2006-07-12 at 19:33 -0400, Paul wrote:
> OK, I'm still trying to solve this. Though the server has been up rock
> steady, but the errors concern me. I built this on a test box months ago
> and now that I am thinking, I may have built it originally on a drive of a
> different manufacturer, although about the same size (20g). This may have
> something to do with it. What is the easiest way to get these errors
> taken care of? I've tried e2fsck, and also ran fsck on Vol00. Looks like
> I made a fine mess of things. Is there a way to fix it without reloading

AFAIK, there is no "easiest way". From my *limited* knowledge, you have
a couple of different problems (maybe) and they have not been identified
yet. I'll offer some guesses and suggestions, but without my own
hard-headed stubbornness in play, results are even more iffy.

> Centos? Here are some outputs:
>
> snapshot from /var/log/messages:
>
> Jul 12 04:03:21 hostname kernel: hda: dma_intr: status=0x51 { DriveReady
> SeekComplete Error }
> Jul 12 04:03:21 hostname kernel: hda: dma_intr: error=0x84 {
> DriveStatusError BadCRC }
> Jul 12 04:03:21 hostname kernel: ide: failed opcode was: unknown

I've experienced these regularly on a certain brand of older drive
(*really* older, probably not your situation). Maxtor IIRC. Anyway, the
problem occurred mostly on cold boot or when re-spinning the drive after
it slept. It apparently had a really *slow* spin-up speed and a timeout
would occur (not handled in the protocol, I guess). Your post doesn't
mention whether this might be related. If all your log occurrences tend
to indicate it happens only after long periods of inactivity, or upon
cold boot, it might not be an issue. But even there, hdparm might be
some help.

Also, if it does seem to happen only on cold boot or after long periods
of "sleeping", is it possible that a bunch of things starting at the
same time are taxing the power supply? Is the PS "weak"? Remember that
a PS not only needs a maximum wattage sufficient to support the maximum
draw of all devices at the same time (plus a margin for safety), but
that the various 5/12 volt lines also have their own limits. Different
PSs have different limits on those lines, and often they are not
published on the PS label. Lots of 12 or 5 volt draws at the same time
(as happens in a non-sequenced start-up) might be producing an
unacceptable voltage or amperage drop.

Is your PCI bus 33/66/100 MHz? Do you get messages on boot saying
"assume 33MHz.... use idebus=66"? I hear it's OK to have an idebus
param that is too fast, but it's a problem if your bus is faster than
what the kernel thinks it is. (I'll put a rough grub.conf example a bit
further down.)

Re-check and make sure all cables are well-seated and that power is
well connected. Speaking of cables, is the cable new or "old"? Maybe it
has a small intermittent break? Try replacing it. Try using an
80-conductor (UDMA?) cable, if not using that already.

If the problem is only on cold boot, can you get a DC volt-meter on the
power connector? If so, look for the voltages to "sag". That might tell
you that you are taxing your PS. Or use the labels, do the math, and
calculate whether you are close to the max wattage in a worst-case
scenario.

I suggest using hdparm (*very* carefully) to see if the problem can be
replicated on demand. Take the drive into various reduced-power modes,
restart it, and see if the problem is fairly consistent.
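Something along these lines is what I have in mind -- device name taken
from your log (hda), and only as a sketch, on a quiet box, after
backups:

  # terminal 1: watch for the dma_intr / BadCRC messages as they happen
  tail -f /var/log/messages

  # terminal 2: see what power state the drive is in right now
  hdparm -C /dev/hda

  # spin it down to standby, give it a minute or two...
  hdparm -y /dev/hda

  # ...then force it back up with a read test and watch whether the
  # errors show up right at the wake-up
  hdparm -t /dev/hda

If the errors only ever appear right at spin-up, that would point back
toward the slow-spin-up/power theory rather than the cable.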
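And circling back to the idebus= bit for a second: if your boot messages
do suggest idebus=66 and you want to try pinning it, it just gets tacked
onto the kernel line in /boot/grub/grub.conf. Very roughly -- your
kernel version and root= will differ, this is only the shape of it:

  title CentOS (2.6.9-34.EL)
          root (hd0,0)
          kernel /vmlinuz-2.6.9-34.EL ro root=/dev/VolGroup00/LogVol00 idebus=66
          initrd /initrd-2.6.9-34.EL.img

The only part that matters there is the idebus=66 appended to the
kernel line.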
>
> sfdisk -l:
>
> Disk /dev/hda: 39870 cylinders, 16 heads, 63 sectors/track
> Warning: The partition table looks like it was made
>   for C/H/S=*/255/63 (instead of 39870/16/63).
> For this listing I'll assume that geometry.
> Units = cylinders of 8225280 bytes, blocks of 1024 bytes, counting from 0
>
>    Device Boot Start     End   #cyls    #blocks   Id  System
> /dev/hda1   *      0+     12      13-    104391   83  Linux
> /dev/hda2         13    2500    2488   19984860   8e  Linux LVM
> /dev/hda3          0       -       0          0    0  Empty
> /dev/hda4          0       -       0          0    0  Empty
> Warning: start=63 - this looks like a partition rather than
> the entire disk. Using fdisk on it is probably meaningless.
> [Use the --force option if you really want this]

What does your BIOS show for this drive? It's likely that the drive was
labeled (or copied from a drive that was labeled) in another machine.
The "key" for me is the "255" vs. "16". The only fix here (not that it's
important to do, though) is to get the drive properly labeled for this
machine: back up the data, make sure the BIOS is set correctly, and
fdisk (or sfdisk) it to get the partitions correct.

WARNING! Although this can be done "live", use sfdisk -l -uS to get the
starting sector numbers and make the new partitions match. When you
re-label at "255", some of the calculated translations internal to the
drivers(?) might change (do things *still* translate to CHS on modern
drives? I'll need to look into that some day. I bet not.). Also, the
*desired* starting and ending sectors of the partitions are likely to
change. What I'm saying is that the final partitioning will likely be
"non-standard" in layout and lying in wait to bite your butt.

I would back up the data, change the BIOS, and sfdisk it (or fdisk or
cfdisk, or any other partitioner, your choice). If the system is hot,
sfdisk -R will re-read the params and get them into the kernel. Then
reload the data (if needed). If it's done "hot": single user, or run
level 1, mounted "ro", of course. Careful reading of the sfdisk man
page will let you script and test (on another drive) parts of this.

Easy enough so far? >:-)

>
> sfdisk -lf

The "f" does you no good here, as you can see. It is really useful only
when trying to change the disk label. What would be useful (maybe) to
you is "-uS".

> <snip>

HTH

--
Bill
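P.S. If you do go the re-label route, the bookkeeping side of it is
roughly this (the dump file names are just examples, and none of it
replaces a real backup of the data):

  # save the current table in both forms before touching anything
  sfdisk -d /dev/hda > /root/hda.sfdisk.dump
  sfdisk -l -uS /dev/hda > /root/hda.sectors.txt

  # the dump can be fed straight back in later if you need to bail out:
  #   sfdisk /dev/hda < /root/hda.sfdisk.dump

  # if you repartition while the system is up, make the kernel re-read
  # the table afterwards
  sfdisk -R /dev/hda

The -uS listing gives you the starting sectors to match when you re-do
the partitions at "255".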