[CentOS] Diagnosing random hangs

Tue Dec 19 14:03:55 UTC 2006
Alfred von Campe <alfred at 110.net>

On Dec 18, 2006, at 16:17, Mark Belanger wrote:

> I have many different CentOS machines that are hanging
> regularly.  I believe this is due to something our application
> is doing - not a CentOS-specific problem.

I have the same problem.  I even posted something to this list titled  
"Strange system hangs" on 11/27 but didn't get any responses.

> When the machines hang, there is no access to the console
> or remote access (ssh, rsh, etc.).

I have that symptom as well.  There is no way to do any debugging
after it gets into that state.  So I added the following two lines to
the /etc/syslog.conf file to forward messages to a central server:

   kern.*                                        @<central server>
   *.info;mail.none;authpriv.none;cron.none      @<central server>
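
As a quick check that these forwarding rules reach the central server,
a small test sender can help.  This is only a sketch, assuming the "@"
targets above use classic UDP syslog on port 514 and that the central
syslogd is configured to accept remote messages; the "hangtest" tag and
the function names are made up for illustration.

```python
import socket

FACILITY_USER = 1   # syslog facility "user"
SEVERITY_INFO = 6   # syslog severity "info", matched by a *.info selector


def build_packet(message, facility=FACILITY_USER, severity=SEVERITY_INFO):
    """Build a BSD-syslog style datagram of the form "<PRI>tag: message"."""
    priority = facility * 8 + severity
    # "hangtest" is a hypothetical tag, chosen so the line is easy to grep for.
    return f"<{priority}>hangtest: {message}".encode("ascii", "replace")


def send_test_message(server, message, port=514):
    """Send one UDP syslog datagram to the server and return the bytes sent."""
    packet = build_packet(message)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (server, port))
    return packet
```

Calling send_test_message() with the central server's name and then
looking for a "hangtest" line in that server's log would confirm the
path works before the next hang.  UDP gives no delivery confirmation,
so a missing line could also mean a firewall or a syslogd that is not
listening for remote messages.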

Should I add any other levels to the selector field?  BTW, my systems
are running a completely stock CentOS distribution EXCEPT for the
binary nVidia driver, which was the only way I could get these
systems to drive the 20" LCD displays at their native 1600x1200
resolution using the correct refresh rate.

I had another report of a hang this morning, but in this case even  
though the machine appears frozen (the screen saver is stuck and I  
can't get to the alternate consoles), I can in fact log into the  
machine remotely and top shows me that the X server is using 100% of  
the CPU:

   top - 08:44:22 up 10 days, 23:00, 10 users,  load average: 1.04, 1.01, 1.00
   Tasks: 115 total,   2 running, 113 sleeping,   0 stopped,   0 zombie
   Cpu(s): 99.7% us,  0.3% sy,  0.0% ni,  0.0% id,  0.0% wa,  0.0% hi,  0.0% si
   Mem:   3113468k total,  1361240k used,  1752228k free,    87312k buffers
   Swap:  3047416k total,        0k used,  3047416k free,   957756k cached

     PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    4381 root      25   0 67748  42m 7776 R 99.8  1.4 782:53.37 X

I also see the following in /var/log/messages:

   Dec 18 19:56:02 hepdsw04 kernel: NVRM: Xid (0001:00): 8, Channel 00000001
   Dec 18 19:56:03 hepdsw04 kernel: NVRM: Xid (0001:00): 9, Channel 00000020 Instance 00000000 Intr 00100000
   Dec 18 19:56:09 hepdsw04 Synergy 1.3.1: NOTE: CServerProxy.cpp, 315: server is dead
   Dec 18 19:56:10 hepdsw04 kernel: NVRM: Xid (0001:00): 8, Channel 00000020
   Dec 18 19:56:11 hepdsw04 kernel: NVRM: Xid (0001:00): 9, Channel 00000020 Instance 00000000 Intr 00100000
   Dec 18 19:56:18 hepdsw04 kernel: NVRM: Xid (0001:00): 8, Channel 00000020
   Dec 18 19:56:19 hepdsw04 kernel: NVRM: Xid (0001:00): 9, Channel 00000020 Instance 00000000 Intr 00100000
   Dec 18 19:56:26 hepdsw04 kernel: NVRM: Xid (0001:00): 8, Channel 00000020
   Dec 18 19:56:27 hepdsw04 kernel: NVRM: Xid (0001:00): 9, Channel 00000020 Instance 00000000 Intr 00100000
   Dec 18 19:56:34 hepdsw04 kernel: NVRM: Xid (0001:00): 8, Channel 00000001

What is the meaning of the NVRM entries?  The Synergy entry is from  
the keyboard/mouse sharing Synergy utility (great program BTW, I  
couldn't live without it).
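
To see how often each Xid code shows up, the NVRM lines can be tallied
with a short script.  This is a minimal sketch that assumes the
/var/log/messages format shown above; the summarize_xids name is
hypothetical, and the script says nothing about what the codes mean.

```python
import re
from collections import Counter

# Matches lines such as:
#   Dec 18 19:56:02 hepdsw04 kernel: NVRM: Xid (0001:00): 8, Channel 00000001
XID_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+"
    r"NVRM: Xid \((?P<bus>[^)]*)\): (?P<code>\d+),"
)


def summarize_xids(lines):
    """Return a Counter mapping (host, Xid code) to the number of occurrences."""
    counts = Counter()
    for line in lines:
        match = XID_RE.match(line.strip())
        if match:  # non-NVRM lines (e.g. the Synergy entry) are skipped
            counts[(match.group("host"), int(match.group("code")))] += 1
    return counts


if __name__ == "__main__":
    # Three lines copied from the log excerpt above, used as sample input.
    sample = [
        "Dec 18 19:56:02 hepdsw04 kernel: NVRM: Xid (0001:00): 8, Channel 00000001",
        "Dec 18 19:56:03 hepdsw04 kernel: NVRM: Xid (0001:00): 9, Channel 00000020 Instance 00000000 Intr 00100000",
        "Dec 18 19:56:10 hepdsw04 kernel: NVRM: Xid (0001:00): 8, Channel 00000020",
    ]
    for (host, code), n in sorted(summarize_xids(sample).items()):
        print(f"{host}: Xid {code} seen {n} time(s)")
```

Passing an open /var/log/messages file object instead of the sample
list works the same way, since the function only iterates over lines.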

Anyway, sorry to inject my own problems into this thread, but maybe  
these hangs are all related.

Alfred