[CentOS] Now I can't shutdown [was: Screen blanks after initial setup (Centos 5)]

Sat May 19 20:36:40 UTC 2007
Itay <centos at nospammail.net>

On Sat, 19 May 2007, William L. Maltby wrote:

> On Sat, 2007-05-19 at 17:54 +0300, Itay wrote:

[snip]

>> I tried the padding technique - the media errors were gone;
>> kernel panic - stayed.  I hope that you or others may help me
>> with this.
>
> Most likely it will be others. My ignorance is boundless and,
> fortunately, my ego is inversely proportional to that! :-)
>
> I'm glad the media errors are gone.

Which leaves me with the more difficult alternatives.  Arrrgh.

[snip]

>> 3 I tried several things, each one of them ended in *kernel panic*
>>    either before logging in as root, or some minutes after.  The panic
>>    appeared after idling the machine for some time.
>
> *sniff* Smells hardware-related. But whether it's bad hardware or kernel
> handling of it, I'm too ignorant to hazard a guess. I googled and found
> your original post (BTW, don't hijack threads, even your own. It made
> it more difficult to find your brief originally-posted hardware ref). :-O

I thought (and still do) that the two issues were related, and
that modifying the subject line and including a [was:...]
clause was therefore sufficient.  Sorry for the extra work.

> I was going to ask about x586 or C5 processors, but I did manage to find
> your OP and saw AMD 4200+, IIRC. So we don't have to worry about that.

:-)

>> 4 A couple of strange things
>>    + I have found out that the default run level was set to 3.
>>      When, as a root I tried 'telinit 5', the machine responded with a
>>      blank screen.  I had to reset.
>
> Have you tried a <CTRL>-<ALT>-<F1> when this happens? Since desktop is
> being started on tty7, if it fails and seems blank, maybe switching to
> virtual console 1 will work, if the machine is still alive. If so, maybe
> some answers are there (view /var/log/messages, the X log, etc.).

Wasn't able to switch to virtual consoles.  (I begin to suspect 
that there are some problems with the keyboard as well, though.)
No clues in /var/log/messages.
And no X.log at all!

>>    + Rebooting the machine was accompanied with messages regarding
>>      ntp/clock skew.  Later, I have found out that I have lost the
>>      network connection, probably while playing with the
>>      installation, so this probably explains the clock skew.
>>      Am not sure if this has any relevance.
>>    + At no point was I prompted to set up a non-root user.
>
> IIRC, when I did my C5 install, I got that prompt. If that's normal, it
> may mean that the problem actually bit you during the install phase and
> not everything got done correctly.

Possibly.  But there were no hints for that in install.log and 
anaconda.*log*

>> 5 For each crash / kernel panic I got a screen-load of trace and other
>>    cryptic output.  Each time, so it seems, the output was different.
>>    *Q* Is there a way to dump those messages into a file?
>
> I'm too ignorant to answer that. But if you do get up and running for a
> few minutes in a text console, clues may be lying around
> in /var/log/messages. Search backwards for "restart" (twice) or some
> other word, like "panic", and read around there.

No hints except for what I have mentioned below.
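
For reference, the log search suggested above can be sketched as a
small shell function.  This is only a sketch: the function name
search_log and the five lines of context are assumptions, not
something from this thread, and -A/-B are grep extensions (available
in GNU grep) rather than POSIX options.

```shell
#!/bin/sh
# Print log lines that mention a keyword, with surrounding context.
# Usage: search_log [keyword] [logfile]
# search_log is a hypothetical helper name; defaults are assumptions.
search_log() {
    keyword="${1:-panic}"
    logfile="${2:-/var/log/messages}"
    # -i: ignore case, -n: show line numbers,
    # -B 5 / -A 5: five lines of context before and after each match
    grep -i -n -B 5 -A 5 -- "$keyword" "$logfile"
}
```

Something like 'search_log restart /var/log/messages | less' would
then page through each match together with its neighbouring lines.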

>> 6 Only suspicious thing I have found in /var/log/messages was lines
>>    like this
>>
>> May 19 11:27:36 bilbo kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
>> May 19 11:27:36 bilbo kernel: ata1.00: tag 0 cmd 0xb0 Emask 0x1 stat 0x51 err 0x4 (device error)
>> May 19 11:27:36 bilbo kernel: ata1: EH complete
>>
>> 7 Also, /var/log/secure had these errors - I believe for every reboot.
>>
>> ...
>> May 19 11:25:30 bilbo login: ROOT LOGIN ON tty1
>> May 19 11:26:06 bilbo login: pam_unix(login:session): session closed for user root
>> May 19 11:26:09 bilbo sshd[2677]: Received signal 15; terminating.
>> May 19 11:27:26 bilbo sshd[2687]: Server listening on :: port 22.
>> May 19 11:27:26 bilbo sshd[2687]: error: Bind to port 22 on 0.0.0.0 failed: Address already in use.
>> May 19 11:29:51 bilbo login: pam_unix(login:session): session opened for user root by LOGIN(uid=0)
>> May 19 11:29:51 bilbo login: pam_selinux(login:session): Warning!  Could not get new context for /dev/tty1, not relabeling: Invalid argument
>> May 19 11:29:51 bilbo login: pam_selinux(login:session): usercon=(null), prev_context=system_u:object_r:tty_device_t
>> May 19 11:29:51 bilbo login: ROOT LOGIN ON tty1
>
> I'm too ignorant to answer authoritatively.
>>
>>> My *guess* is that the application related errors you reported may be a
>>> result of certain installation steps terminating early due to the false
>>> I/O errors reported by the kernel/driver(s).
>>>
>>> HTH
>>> --
>>> Bill
>>
>> Any recommendation how to proceed?
>> (The most pressing question: is it the hardware? Should I take
>> the box back to the seller?)
>
> If the panics are random, IIRC, could be memory, could be ... But a good
> run of memtest86 from the install CD should help determine that. Also,
> it is not uncommon for new hardware to have the occasional loose
> connector or PCI card. Maybe too small power supply. Maybe CPU fan not
> spinning. Maybe ambient temperature of the room is too high and internal
> box temperature excessive.

Running memtest overnight now (it has been running for 2 hours already).
If it were a question of excess heat, I would expect to have
trouble during the memtest run as well, no?

> If you suspect hardware, check all connectors. Make sure memory, power
> supply connectors and PCI cards are firmly seated. Make sure your power
> supply is adequate (my EPOX board needed much more than the PS for the
> ACER box, into which the EPOX was originally installed, could supply.
> Had random panics, usually near startup times, sometimes a few minutes
> after. That's natural because the ACER had an integrated SiS chip set
> which needs much less power than the Via-based EPOX).
>
> Make sure the CPU fan is seated and working.
>
> Is your AC power from the electric company reliable? Fluctuations of 20%
> are not uncommon here. Battery backup with power conditioning helps a
> lot.

Actually, the AC power here is not stable enough.  But there
were no fluctuations that I could notice during my attempts
this morning.  We'll keep this in mind, though.

> Since you mentioned a delay sometimes (IIRC), heat sounds like a
> possible culprit. If the room is cool, take the covers off and see if it
> runs longer. If it stays up long enough, do

Again: memtest'ing for a few hours should produce a similar
challenge, I should think.
I could try running Knoppix 5 for a while and straining the CPU
somehow.

>  # cat /proc/acpi/thermal_zone/THRM/temperature
>  temperature:             36 C
>
> Make sure it's in the range for the AMD you have. BTW, mine is lower
> than it used to be. I added an expensive Zalman FHS a few months back.
> May try overclocking someday if I get enough interest.

We'll check tomorrow when attempting to reboot into centos.
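
The temperature check quoted above could be wrapped in a small shell
function along these lines.  It assumes the file has the single-line
"temperature: 36 C" format shown in the quote.  The function name
check_temp and the 60 C default limit are hypothetical placeholders;
the real limit should come from the processor's specifications.

```shell
#!/bin/sh
# Read an ACPI thermal zone file of the form "temperature:   36 C"
# and warn when the reading exceeds a limit.
# Usage: check_temp [zone_file] [limit_in_C]
# The 60 C default is an arbitrary placeholder, not an AMD figure.
check_temp() {
    zone_file="${1:-/proc/acpi/thermal_zone/THRM/temperature}"
    limit="${2:-60}"
    # The second whitespace-separated field holds the number.
    temp=$(awk '/^temperature:/ { print $2 }' "$zone_file")
    if [ -z "$temp" ]; then
        echo "could not read temperature from $zone_file" >&2
        return 2
    fi
    if [ "$temp" -gt "$limit" ]; then
        echo "WARNING: $temp C exceeds limit of $limit C"
        return 1
    fi
    echo "OK: $temp C"
}
```

Calling check_temp every minute or so while the machine idles would
show whether the reading climbs before a panic.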

> Use google with "site:centos.org" added, e.g. like this
>
>    screen blanks after initial setup site:centos.org
>
> in advanced search fields (I had site:... in the "all of the words"
> field and "screen blanks after initial setup" in the "exact phrase"
> field). You'll find lots of instances of kernel panics discussed on the
> list and some suggestions, in some cases, for "noapic" and similar
> boot-time parameters.

Yup.  I noticed that some of them were related to nVidia
hardware.  Well, my box has a few nVidia components, so maybe...

Thanks.
-- 
   Itay Furman  <centos at nospammail.net>
--