I have many different CentOS machines that are hanging regularly. I believe this is due to something our application is doing, not a CentOS-specific problem.
When the machines hang, there is no access to the console and no remote access (ssh, rsh, etc.).
Any tips on debugging this issue? It is becoming quite a show stopper as we migrate our product from Solaris to Linux.
tia,
FYI - The "application" is a collection of programs that communicate with each other and with a large chip tester via a proprietary serial bus. The hangs are random but fairly frequent, ranging from several per day to several per week.
-Mark
On Mon, Dec 18, 2006 at 04:17:59PM -0500, Mark Belanger wrote:
I have many different CentOS machines that are hanging regularly. I believe this is due to something our application is doing, not a CentOS-specific problem. When the machines hang, there is no access to the console and no remote access (ssh, rsh, etc.).
Do you mean that there's no *access* to the console, or that it doesn't *respond* on the console?
Matthew Miller wrote:
Do you mean that there's no *access* to the console, or that it doesn't *respond* on the console?
X is frozen, there is no way to switch consoles or kill X (Ctrl-Alt-F1 and Ctrl-Alt-Backspace do nothing), and there is no way to access the machine remotely.
-Mark
On Mon, Dec 18, 2006 at 04:44:38PM -0500, Mark Belanger wrote:
X is frozen, there is no way to switch consoles or kill X (Ctrl-Alt-F1 and Ctrl-Alt-Backspace do nothing), and there is no way to access the machine remotely.
Is the primary, problematic application an X app, or is X running just because?
Matthew Miller wrote:
Is the primary, problematic application an X app, or is X running just because?
It is a collection of apps, some of which are X based and some of which are not. The user interface is X, so it is required. I suppose I could run them in a VNC server to take X and the nvidia driver out of the equation.
-Mark
It is a collection of apps, some of which are X based and some of which are not. The user interface is X, so it is required. I suppose I could run them in a VNC server to take X and the nvidia driver out of the equation.
Or run the X apps on an external X server. I've had decent luck using Cygwin/X on my Windows machines: set up putty or another ssh client to enable X forwarding, run Cygwin's shell, run `startx`, let the xterm open on your desktop and minimize it, then log onto the Linux/Unix machine with putty/crt and run your X app. It should open on your Windows desktop. This works much better than VNC.
... I've had decent luck using Cygwin/X on my Windows machines...
I would recommend Xming (http://sourceforge.net/projects/xming) instead of Cygwin/X. It runs noticeably faster than Cygwin/X, and is even faster if you use fullscreen mode. We have also had trouble with a Windows webcam driver breaking Cygwin/X, but it did not interfere with Xming.
Dan
Mark Belanger wrote:
X is frozen, there is no way to switch consoles or kill X (Ctrl-Alt-F1 and Ctrl-Alt-Backspace do nothing), and there is no way to access the machine remotely.
if you don't need X running, I'd stop loading it entirely (edit /etc/inittab, and change the default runlevel to 3), and before your app hangs, log onto the system console, and leave this command running as root...
# tail -f /var/log/messages /var/log/secure
this way, any error logging will be displayed as the system crashes.
On Mon, Dec 18, 2006 at 01:54:24PM -0800, John R Pierce wrote:
if you don't need X running, I'd stop loading it entirely (edit /etc/inittab, and change the default runlevel to 3), and before your app hangs, log onto the system console, and leave this command running as root...
# tail -f /var/log/messages /var/log/secure
this way, any error logging will be displayed as the system crashes.
Or, you can start your application running in X, then switch to Ctrl-Alt-F1, login there, and run the tails on the console.
If this isn't possible, I would use syslog to forward *.debug to another syslog server on the microscopic chance that the system has some transmittable last words.
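A forwarding rule of that kind might look like the sketch below. It assumes the classic sysklogd syntax used in /etc/syslog.conf, where `@host` sends matching messages to a remote syslogd over UDP. The helper name `add_forward_rule` and the host `loghost.example.com` are placeholders of mine, not anything from this thread. The receiving syslogd normally has to be started with -r before it accepts remote messages.

```shell
#!/bin/sh
# Sketch only: append a "*.debug @host" forwarding rule to a syslog.conf-style file.
# add_forward_rule and loghost.example.com are hypothetical names used for illustration.
add_forward_rule() {
    conf="$1"       # file to edit, e.g. /etc/syslog.conf
    loghost="$2"    # remote syslog server
    # Selector and action are separated by a tab, which older syslogd versions expect.
    rule=$(printf '*.debug\t@%s' "$loghost")
    # Append only if the exact line is not already there.
    grep -qxF -e "$rule" "$conf" 2>/dev/null || printf '%s\n' "$rule" >> "$conf"
}

# Typical use, as root, followed by a syslogd restart:
#   add_forward_rule /etc/syslog.conf loghost.example.com
#   service syslog restart
```

Calling the helper twice leaves a single rule in the file, so it is safe to rerun from a setup script.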
John R Pierce wrote:
if you don't need X running, I'd stop loading it entirely (edit /etc/inittab, and change the default runlevel to 3), and before your app hangs, log onto the system console, and leave this command running as root...
# tail -f /var/log/messages /var/log/secure
this way, any error logging will be displayed as the system crashes.
X is required - though I could accomplish the same thing by logging in remotely. So far, the log files haven't shown anything interesting.
-Mark
CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos
On 12/18/06, Mark Belanger mark_belanger@ltx.com wrote:
X is frozen, there is no way to switch consoles or kill X (Ctrl-Alt-F1 and Ctrl-Alt-Backspace do nothing), and there is no way to access the machine remotely.
Consider setting up a serial console so that you might be able to use magic SysRq to gather information on the kernel and processes. The serial console should not be affected by X freezing (i.e. you would have a separate terminal session set up to it, preferably in a GUI environment not on that machine).
Cheers...james
P.S. The documentation on magic SysRq is in the kernel docs (Documentation/sysrq.txt).
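As a rough sketch of what that looks like in practice: SysRq has to be enabled before the hang, and it is worth checking ahead of time that its output actually reaches the log or the serial console. The helper below writes to the standard /proc interfaces. The function name `sysrq_send` and the two override variables are my own additions so the script can be dry-run against ordinary files. On a machine that is already hung you would use the keyboard (Alt-SysRq-<key>) or a BREAK on the serial console followed by the key, rather than a shell.

```shell
#!/bin/sh
# Sketch only: enable magic SysRq and send one command key through /proc.
# sysrq_send, SYSRQ_ENABLE and SYSRQ_TRIGGER are hypothetical names; the
# variables default to the real /proc paths and exist only for dry runs.
sysrq_send() {
    key="$1"
    enable="${SYSRQ_ENABLE:-/proc/sys/kernel/sysrq}"
    trigger="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"
    # Turn SysRq on, then write the command key to the trigger file.
    echo 1 > "$enable" || return 1
    echo "$key" > "$trigger"
}

# Commonly used keys (run as root; output goes to the kernel log/console):
#   sysrq_send t    # dump the task list
#   sysrq_send m    # dump memory information
#   sysrq_send p    # dump registers
#   sysrq_send s    # sync disks
```

To keep SysRq enabled across reboots, kernel.sysrq = 1 can go in /etc/sysctl.conf.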
On 12/18/06, Mark Belanger mark_belanger@ltx.com wrote:
Any tips on debugging this issue? It is becoming quite a show stopper as we migrate our product from Solaris to Linux.
Have you tried the magic SysRq sequence on the console?
Cheers...james
On 12/18/06, James Olin Oden james.oden@gmail.com wrote:
Have you tried the magic SysRq sequence on the console?
Oh, I just noticed "proprietary serial bus". Does that mean you have your own device driver(s)? Still, see if magic SysRq works, but what you do with the output should be driven by your driver writers.
Cheers...james
James Olin Oden wrote:
Oh, I just noticed "proprietary serial bus". Does that mean you have your own device driver(s)? Still, see if magic SysRq works, but what you do with the output should be driven by your driver writers.
We do have our own drivers (supplied by a 3rd party). These drivers are my biggest suspect.
-Mark
On Mon, December 18, 2006 10:17 pm, Mark Belanger wrote:
I have many different CentOS machines that are hanging regularly. I believe this is due to something our application is doing, not a CentOS-specific problem.
A normal unprivileged userland application should not be able to bring down the kernel. Do you use any additional drivers, e.g. for the serial bus? If so, it is a good idea to enable kernel crash dumps, and send it to your driver developers to analyze the bug.
Other than that, you could (in no particular order):
- Let syslog send messages to some remote site.
- As others suggested, use remote X or a serial console to be able to track important messages.
- 'systrace -f' the X11 program, and redirect the output somewhere safe, to see the last actions that were performed by the program.
With kind regards, Daniel de Kok
On Dec 18, 2006, at 16:17, Mark Belanger wrote:
I have many different CentOS machines that are hanging regularly. I believe this is due to something our application is doing, not a CentOS-specific problem.
I have the same problem. I even posted something to this list titled "Strange system hangs" on 11/27 but didn't get any responses.
When the machines hang, there is no access to the console or remote access(ssh, rsh, etc).
I have that symptom as well. There is no way to do any debugging after it gets into that state, so I added the following two lines to the /etc/syslog.conf file:
kern.* @<central server>
*.info;mail.none;authpriv.none;cron.none @<central server>
Should I add any other levels to the selector field? BTW, my systems are running completely stock CentOS distribution EXCEPT for the binary nVidia driver, which was the only way I could get these systems to drive the 20" LCD displays at their native 1600x1200 resolution using the correct refresh rate.
I had another report of a hang this morning, but in this case even though the machine appears frozen (the screen saver is stuck and I can't get to the alternate consoles), I can in fact log into the machine remotely and top shows me that the X server is using 100% of the CPU:
top - 08:44:22 up 10 days, 23:00, 10 users, load average: 1.04, 1.01, 1.00
Tasks: 115 total, 2 running, 113 sleeping, 0 stopped, 0 zombie
Cpu(s): 99.7% us, 0.3% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si
Mem: 3113468k total, 1361240k used, 1752228k free, 87312k buffers
Swap: 3047416k total, 0k used, 3047416k free, 957756k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4381 root 25 0 67748 42m 7776 R 99.8 1.4 782:53.37 X
I also see the following in /var/log/messages:
Dec 18 19:56:02 hepdsw04 kernel: NVRM: Xid (0001:00): 8, Channel 00000001
Dec 18 19:56:03 hepdsw04 kernel: NVRM: Xid (0001:00): 9, Channel 00000020 Instance 00000000 Intr 00100000
Dec 18 19:56:09 hepdsw04 Synergy 1.3.1: NOTE: CServerProxy.cpp, 315: server is dead
Dec 18 19:56:10 hepdsw04 kernel: NVRM: Xid (0001:00): 8, Channel 00000020
Dec 18 19:56:11 hepdsw04 kernel: NVRM: Xid (0001:00): 9, Channel 00000020 Instance 00000000 Intr 00100000
Dec 18 19:56:18 hepdsw04 kernel: NVRM: Xid (0001:00): 8, Channel 00000020
Dec 18 19:56:19 hepdsw04 kernel: NVRM: Xid (0001:00): 9, Channel 00000020 Instance 00000000 Intr 00100000
Dec 18 19:56:26 hepdsw04 kernel: NVRM: Xid (0001:00): 8, Channel 00000020
Dec 18 19:56:27 hepdsw04 kernel: NVRM: Xid (0001:00): 9, Channel 00000020 Instance 00000000 Intr 00100000
Dec 18 19:56:34 hepdsw04 kernel: NVRM: Xid (0001:00): 8, Channel 00000001
What is the meaning of the NVRM entries? The Synergy entry is from the keyboard/mouse sharing Synergy utility (great program BTW, I couldn't live without it).
Anyway, sorry to inject my own problems into this thread, but maybe these hangs are all related.
Alfred
What is the meaning of the NVRM entries? The Synergy entry is from the keyboard/mouse sharing Synergy utility (great program BTW, I couldn't live without it).
The NVRM: Xid messages are from your nvidia driver/module.
-Jay
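To get a feel for which Xid codes recur and how often, the messages can be tallied from the log. This is only a sketch that assumes the line format shown in the excerpt above; `xid_summary` is a made-up helper name. What each code means has to be looked up in NVIDIA's own documentation.

```shell
#!/bin/sh
# Count "NVRM: Xid" kernel messages per Xid code in a syslog file.
# xid_summary is a hypothetical helper; the pattern assumes lines such as
#   ... kernel: NVRM: Xid (0001:00): 8, Channel 00000001
xid_summary() {
    logfile="${1:-/var/log/messages}"
    # Keep only the numeric code that follows "Xid (...): ", then count duplicates.
    sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\),.*/\1/p' "$logfile" |
        sort -n | uniq -c
}

# Example: xid_summary /var/log/messages
```

Each output line is a count followed by the Xid code, so a code that dominates the list right before every hang is a reasonable thing to report to the driver vendor.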
Quoting Mark Belanger mark_belanger@ltx.com:
I have many different CentOS machines that are hanging regularly. I believe this is due to something our application is doing, not a CentOS-specific problem.
When the machines hang, there is no access to the console and no remote access (ssh, rsh, etc.).
Any tips on debugging this issue? It is becoming quite a show stopper as we migrate our product from Solaris to Linux.
Consider setting up a serial console (you'll still be able to run X11 on your keyboard/monitor). Alternatively, you might try setting up a network console (netconsole).
Check out serial-console.txt and networking/netconsole.txt in the kernel documentation.
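For the netconsole route, a minimal sketch: the module takes a single netconsole= parameter of the form src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac. The helper name `netconsole_param`, the port numbers and the addresses in the example are placeholders I picked for illustration, not values from this thread.

```shell
#!/bin/sh
# Build the netconsole= module parameter described in netconsole.txt:
#   netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]
# netconsole_param is a hypothetical helper; ports 6665/6666 are arbitrary choices.
netconsole_param() {
    src_ip="$1"; dev="$2"; tgt_ip="$3"; tgt_mac="$4"
    printf 'netconsole=6665@%s/%s,6666@%s/%s\n' "$src_ip" "$dev" "$tgt_ip" "$tgt_mac"
}

# On the machine that hangs (as root), with placeholder addresses:
#   modprobe netconsole "$(netconsole_param 192.168.1.10 eth0 192.168.1.20 00:11:22:33:44:55)"
# On the receiving machine, listen for the UDP packets, for example with netcat:
#   nc -u -l -p 6666      # some netcat variants want "nc -u -l 6666" instead
```

The serial-console equivalent is a kernel command-line option such as console=ttyS0,9600n8 alongside console=tty0, as described in serial-console.txt.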