CentOS 5.5 Java Process Death

Martin Hewitt

10 Feb 2011

6:37 p.m.

Hi all,

I'm running CentOS 5.5 Final, Java version "1.6.0_17" OpenJDK Runtime Environment (IcedTea6 1.7.5) (rhel-1.16.b17.el5-x86_64) OpenJDK 64-Bit Server VM (build 14.0-b16, mixed mode) installed via Yum.

We have a Java application, packaged as a jar, running on our servers which periodically crawls RSS feeds and writes the articles to a database.

Randomly, and seemingly without cause, these processes die: not because the application exits or because I kill it, but because something kills them without leaving a trace.

My first step in diagnosing this was to log all output from the application, as well as sending stderr and stdout to a logfile, but none of these output logs contain anything that would indicate why these processes have died.

My next instinct was the kernel-level out of memory killer, but the system is never low on memory (8GB installed, routinely showing 6.5GB in free cache) and usually has somewhere between 1GB and 3GB of memory free at any given point in time.

I next thought the system could be hitting bad memory, segfaulting, and killing the process because of that, but I've mirrored the system on an identically configured server in a different datacentre, and the processes are still being killed.

The Java virtual machine does trigger JVM core dumps on exit, so the process is being killed by something, but the JVM dumps don't have any useful information.

My question is: does anyone know what might be causing it, and where I should start looking to diagnose the cause?

Thanks.

m.roth＠5-cent.us

10 Feb 2011

6:58 p.m.

The hard (but correct) way would be to put try/catch blocks in the code, and work your way down. Trying to debug it using a debugger might be really problematic if you can't repeatably provoke it. I *suppose* you could attach strace to it, and dump the output into a file (on a filesystem with a *lot* of disk space)....

mark

Martin Hewitt

7:20 p.m.

Hi Mark,

Thanks, I didn't know about the strace command, so that's useful. Fortunately, this is on a dedicated server, so there's a fair amount of free disk.

I've also remembered that one server was previously running CentOS 5.4, so I'm rebuilding the mirror server with 5.4 to see if that makes a difference.

Thanks for the help.

Martin

CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos

m.roth＠5-cent.us

7:37 p.m.

Hey, Martin,

If you can do the code changes (and the try/catch is *supposed* to be in there, according to Java style), work your way down, y'know...

main
    ...
    try { first actual call to do the job }
    catch { writeln error; }

And if it fails there, then you know; otherwise, go to the next main call, sorry, "invocation of a method"....

Then again, this time in each of the main function calls under that, and step down until you find the function it's dying in. That'll give you a much better handle on what's happening.
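A minimal Java sketch of the wrap-and-step-down approach described above. The step names (fetchFeeds, writeArticles) and the StepRunner helper are hypothetical and not taken from the application under discussion. Note that a try/catch only sees failures raised inside the JVM; a process killed from outside will not pass through any catch block.

```java
import java.util.ArrayList;
import java.util.List;

class StepRunner {
    /** One named unit of work; the step names used below are illustrative only. */
    interface Step {
        void run() throws Exception;
    }

    /**
     * Runs a step inside its own try/catch and records where it failed.
     * Catching Throwable (not just Exception) also records Errors such as
     * OutOfMemoryError, which an Exception handler would miss.
     */
    static boolean runStep(String name, Step step, List<String> log) {
        try {
            step.run();
            log.add("ok: " + name);
            return true;
        } catch (Throwable t) {
            log.add("FAILED in " + name + ": " + t);
            return false;
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<String>();
        // Hypothetical top-level calls; replace with the real ones.
        runStep("fetchFeeds", new Step() {
            public void run() { /* first actual call to do the job */ }
        }, log);
        runStep("writeArticles", new Step() {
            public void run() throws Exception {
                throw new IllegalStateException("simulated failure");
            }
        }, log);
        for (String line : log) {
            System.err.println(line);
        }
    }
}
```

Once a step is reported as failed, the same wrapper can be applied to the calls inside that step to narrow things down further.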

Good luck.

mark

Martin Hewitt

11 Feb 2011

1:42 a.m.

Hi Mark,

I've exhausted the Java avenues for debugging this issue. Since my last email, the process I pointed strace at has been killed, but I'm afraid the rather raw format of the strace file is lost on me. The last six lines of the output file are:

clone(child_stack=0x4202a250, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x4202a9d0, tls=0x4202a940, child_tidptr=0x4202a9d0) = 23241
futex(0x4202a9d0, FUTEX_WAIT, 23241, NULL) = -1 EINTR (Interrupted system call)
--- SIGHUP (Hangup) @ 0 (0) ---
futex(0x2ab0b620a000, FUTEX_WAKE_PRIVATE, 1) = 1
rt_sigreturn(0x2ab0b620a000) = -1 EINTR (Interrupted system call)
futex(0x4202a9d0, FUTEX_WAIT, 23241, NULL <unfinished ... exit status 129>

The SIGHUP is new information, and appears to be what's causing the java app to exit. Surely Java should be aware of the Interrupted system call?

There are no other signals in the output file, and the only EINTRs are in the passage above.

Looks like I need to delve back into Java...

Martin
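On whether Java can be made aware of the signal: HotSpot-based JVMs normally treat SIGHUP, SIGINT and SIGTERM as shutdown requests and run any registered shutdown hooks before exiting (unless started with -Xrs). The sketch below, built around a hypothetical ExitLogger class, writes a line at that point so the exit at least leaves a trace. It is an illustration under those assumptions, not a fix, and a SIGKILL would still bypass it.

```java
import java.util.Date;

class ExitLogger {
    /** Builds the line written when the JVM begins an orderly shutdown. */
    static String shutdownMessage(Date when) {
        return when + " JVM shutting down (normal exit, System.exit, or a handled signal)";
    }

    /** Registers a shutdown hook and an uncaught-exception handler; returns the hook. */
    static Thread install() {
        Thread hook = new Thread(new Runnable() {
            public void run() {
                System.err.println(shutdownMessage(new Date()));
                System.err.flush();
            }
        }, "exit-logger");
        Runtime.getRuntime().addShutdownHook(hook);

        // Also record any exception that ends a thread without being caught.
        Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler() {
            public void uncaughtException(Thread t, Throwable e) {
                System.err.println("Uncaught in thread " + t.getName() + ": " + e);
            }
        });
        return hook;
    }

    public static void main(String[] args) {
        install();
        // Stand-in for the real application's work; the hook runs when main returns.
        System.out.println("working...");
    }
}
```

If the shutdown line shows up in the log right before a death, the JVM was asked to exit rather than crashing, which points the search outside the application code.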

Keith Roberts

7:05 a.m.

Do you have different versions of Java from different vendors installed? I don't use IcedTea as it's not always 100% compatible. Try to use just *one* vendor's version of Java as your active Java installation. I only use Sun's SDK as I have noticed problems using other vendors' versions.

HTH

Keith Roberts

--
Websites: http://www.karsites.net http://www.php-debuggers.net http://www.raised-from-the-dead.org.uk
All email addresses are challenge-response protected with TMDA [http://tmda.net]

Martin Hewitt

8:53 a.m.

Hi Keith,

Interesting idea. I've built the Sun SDK on one server and left the yum-installed version on the other, and have started the same Java application on both servers with strace, so I'll see if there's any difference.

Thanks for all the help,

Martin

Keith Roberts

2:28 p.m.

Well, it's not an idea, Martin, it's what I've learnt by experience ;). For example, the newer versions of the Eclipse IDE will not work with GCJ. Please see this excerpt from the installation directory docs of Helios - Eclipse 3.6:

/eclipse/readme/readme_eclipse.html

3.1.2 General - GCJ

GCJ is an effort by the GCC team to provide an open source Java compiler and runtime environment to interpret Java bytecode. Unfortunately, the GCJ runtime environment is not an environment that is often tested on by Eclipse developers.

The most common problems surrounding GCJ are:

* Eclipse does not start at all
* Eclipse throws a 'java.lang.ClassNotFoundException: org.eclipse.core.runtime.Plugin' that can be found in the logs (located in workspace/.metadata/.log)

The workspace's log file is a good place to check to identify whether GCJ is being used or not. Every Eclipse log session is prepended with information about the runtime environment that was used to run Eclipse. The log may include something like the following:

java.fullversion=GNU libgcj 4.2.1 (Debian 4.2.1-5)

If Eclipse does start, one can check which runtime environment is being used to run Eclipse by going to Help > About Eclipse SDK > Installation Details > Configuration. The About dialog itself can also provide other information; the build identifier can be of particular interest as it is tagged by some distributions. This allows the user to identify whether Eclipse was downloaded through the distribution's package management system or directly from the eclipse.org web site.

Eg: Build id: M20070212-1330 (Ubuntu version: 3.2.2-0ubuntu3)

The two most common workarounds are:

* download the Eclipse binary from eclipse.org directly
* run Eclipse using an alternate Java runtime environment

To download Eclipse, try one of the links below:

* http://www.eclipse.org/downloads/
* http://download.eclipse.org/eclipse/downloads/

It is imperative that 64-bit builds are downloaded and used if a 64-bit Java runtime environment has been installed. Below are two sample tarball names of version 3.6.0 of the Eclipse SDK packaged for 32-bit and 64-bit processors.

eclipse-SDK-3.6-linux-gtk.tar.gz (32-bit)
eclipse-SDK-3.6-linux-gtk-x86_64.tar.gz (64-bit)

To run Eclipse with an alternate Java runtime environment, the path to the Java virtual machine's binary must be identified. With an Eclipse installation from the distribution, altering the $PATH variable to include the path to the alternate Java runtime environment is often not enough as the Eclipse that Linux distributions package often performs a scan internally to pick up GCJ by itself whilst ignoring what's on the $PATH. An example of the terminal's output is shown below:

searching for compatible vm...
testing /usr/lib/jvm/java-7-icedtea...not found
testing /usr/lib/jvm/java-gcj...found

Once the path to the virtual machine's binary has been identified, try running Eclipse with the following command:

./eclipse -vm /path/to/jre/bin/java

For an actual example, it might look something like the following:

./eclipse -vm /usr/lib/jvm/sun-java-6/bin/java
./eclipse -vm /opt/sun-jdk-1.6.0.02/bin/java

If this seems to solve the problem, it is likely that the problem really was related to the use of GCJ as the Java runtime for running Eclipse. The eclipse.ini file located within Eclipse's folder can be altered to automatically pass this argument to Eclipse at startup...

I use the following from Sun's download website:

jdk-6u18-linux-i586-rpm.bin

The latest 32bit version is available from here:

http://www.oracle.com/technetwork/java/javase/install-linux-rpm-137089.html

I've noticed that other Java applications do not work correctly on other vendors' Java offerings either, which is the reason for grabbing the rpm.bin from Sun/Oracle and installing that.

HTH

Keith Roberts

m.roth＠5-cent.us

2:13 p.m.

Martin Hewitt wrote:

> clone(child_stack=0x4202a250,

At a guess, looks like it's creating a child process.
<snip>

> futex(0x4202a9d0, FUTEX_WAIT, 23241, NULL) = -1 EINTR (Interrupted system call)
> --- SIGHUP (Hangup) @ 0 (0) ---
> futex(0x2ab0b620a000, FUTEX_WAKE_PRIVATE, 1) = 1
> rt_sigreturn(0x2ab0b620a000) = -1 EINTR (Interrupted system call)
> futex(0x4202a9d0, FUTEX_WAIT, 23241, NULL <unfinished ... exit status 129>
>
> The SIGHUP is new information, and appears to be what's causing the java app to exit. Surely Java should be aware of the Interrupted system call?
>
> There are no other signals in the output file, and the only EINTRs are in the passage above.

Does the exit status of 129 say anything other than SIGHUP?

> Looks like I need to delve back into Java...

Yeah. I think you need to try what I was suggesting: start wrapping function calls in try/catch, and work your way down. When you find the one it fails in, go into that function, er, method, and wrap the calls in there (and/or even put a writeln in a few choice spots), until you find the exact function the SIGHUP (or whatever) is happening in.

mark "why, yes, I *was* a developer longer than I've been an admin"
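On the exit status question above: by the usual Unix convention, a status above 128 means 128 plus a signal number, so 129 corresponds to signal 1, which is SIGHUP on Linux. A small sketch of that arithmetic follows; the ExitStatus helper is hypothetical and its name table is deliberately incomplete.

```java
class ExitStatus {
    /** Returns the signal number encoded in an exit status, or -1 if there is none. */
    static int signalNumber(int status) {
        return (status > 128 && status < 256) ? status - 128 : -1;
    }

    /** Names for a few common Linux signal numbers; not a complete table. */
    static String signalName(int signal) {
        switch (signal) {
            case 1:  return "SIGHUP";
            case 2:  return "SIGINT";
            case 9:  return "SIGKILL";
            case 11: return "SIGSEGV";
            case 15: return "SIGTERM";
            default: return "signal " + signal;
        }
    }

    public static void main(String[] args) {
        int status = 129; // value reported at the end of the strace output
        System.out.println(status + " -> " + signalName(signalNumber(status)));
        // prints "129 -> SIGHUP"
    }
}
```

So the 129 agrees with the SIGHUP line in the trace and does not point to a second, separate cause.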

Martin Hewitt

14 Feb 2011

10:54 a.m.

Hi Mark,

Over the weekend I've been testing the environment under various circumstances, and it seems that the kill issue is not confined to one app - it's afflicting all jars I've packaged with Eclipse.

I added in as many try...catch blocks as I could and got no useful output, but it occurred to me that the Eclipse loader is adding in another level of code between my application and the kernel.

Due to the fact that Eclipse uses a jar-in-jar loader to package in classpath libraries, I'm going to be experimenting today with a different jar packager and with executing the application without jar packaging.

Martin

Mathieu Baudier

11:36 a.m.

...

I added in as many try...catch blocks as I could and got no useful output, but it occurred to me that the Eclipse loader is adding in another level of code between my application and the kernel.

Can you please give more details about this "additional" code? How did you find out?

Do you mean that the application is running in an OSGi runtime? Can you please give a bit more detail about the architecture and deployment of your application?

Is it a headless application or with an Eclipse UI?

I have had similar issues recently with the OpenJDK shipped in CentOS, and if your application is based on OSGi I may be able to help you analyze further.

Martin Hewitt

11:56 a.m.

Hi Mathieu,

...

Can you please give more details about this "additional" code? How did you find out?

When I package a "Runnable JAR" using the Eclipse Export wizard, in the manifest file, the main-class is given as org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader, which I presume is a little bit of code to redirect the main method to the main method of my actual application. This is the "extra layer" I was referring to.

...

Do you mean that the application is running in an OSGi runtime? Can you please give a bit more detail about the architecture and deployment of your application?

The architecture is quite simple: the primary test case I'm using is an HTTP request forwarder, but I'm keeping it idle to monitor its state. It sets up an HTTP server on port 8080, listens for requests that match a certain domain, and forwards them on. As I said, no requests are being passed through it while it's in this sandboxed environment.

It is a headless application, executed as follows:

...

java -jar /path/to/my/application.jar > out.log 2>&1

or, in the strace environment:

...

strace -o strace.out.log java -jar /path/to/my/application.jar > out.log 2>&1

This app is running on a server, but it's just plain Java code, using Jetty as the HTTP server, and no frameworks.
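For reference, the usual way to send both stdout and stderr to a single log is to redirect stdout to the file first and then point stderr at the same destination (`> out.log 2>&1`). A minimal shell sketch, assuming a hypothetical `emit` function as a stand-in for the Java application and a temporary file in place of out.log:

```shell
# Hypothetical stand-in for the application: writes one line to each stream
emit() {
    echo "to stdout"
    echo "to stderr" >&2
}

# Temporary file used here in place of out.log
log_file=$(mktemp)

# Redirect stdout to the log first, then point stderr at the same destination
emit > "$log_file" 2>&1

# Both lines should now be in the log
cat "$log_file"
```

Order matters here: written as `2>&1 > out.log`, stderr would be duplicated from the original stdout before stdout is redirected, so error output would not reach the log.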

Martin


Mathieu Baudier

12:17 p.m.

...

When I package a "Runnable JAR" using the Eclipse Export wizard, in the manifest file, the main-class is given as org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader, which I presume is a little bit of code to redirect the main method to the main method of my actual application. This is the "extra layer" I was referring to.

OK, if I understand correctly, Eclipse packages a big jar containing all your code and jar dependencies, and then uses its own classloader to access them.

As you suggested, this is an interesting trail to follow. I have already had issues with "exotic" classloaders using OpenJDK on CentOS.

Do try a "pure" Java deployment (java -cp myjar1:myjar2:... com.example.MyAppWithMainMethod) and see if the issue still happens.

What was the result of your tests with Sun JRE (cf. your post from Feb 11th)? Do you have the issue with Sun JRE as well?

Martin Hewitt

1:28 p.m.

Hi Mathieu,

On 14 Feb 2011, at 12:17, Mathieu Baudier wrote:

...

...
When I package a "Runnable JAR" using the Eclipse Export wizard, in the manifest file, the main-class is given as org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader, which I presume is a little bit of code to redirect the main method to the main method of my actual application. This is the "extra layer" I was referring to.

OK, if I understand correctly, Eclipse packages a big jar containing all your code and jar dependencies, and then uses its own classloader to access them.

Yes, this seems to be the case.

...

As you suggested, this is an interesting trail to follow. I have already had issues with "exotic" classloaders using OpenJDK on CentOS.

Do try a "pure" Java deployment (java -cp myjar1:myjar2:... com.example.MyAppWithMainMethod) and see if the issue still happens.

What was the result of your tests with Sun JRE (cf. your post from Feb 11th)? Do you have the issue with Sun JRE as well?

Yes, I tried combinations of CentOS 5.4 and 5.5 with the yum-installed JRE and the Sun JRE; all combinations had the same problem, which again led me to believe the JAR was the problem.

Martin

Martin Hewitt

17 Feb

8:02 a.m.


Hi All,

I've been running our apps as purely as I can (java -cp /path/to/libs/* path.to.the.App) and they're still being sent SIGHUP signals for reasons I can't understand.

I've added timestamps to my strace output and it seems to come out of the blue:

19:49:07.438591 futex(0x40d889d0, FUTEX_WAIT, 29119, NULL) = -1 EINTR (Interrupted system call)
01:45:14.275055 --- SIGHUP (Hangup) @ 0 (0) ---
01:45:14.275106 futex(0x2b3f32c7a000, FUTEX_WAKE_PRIVATE, 1) = 1
01:45:14.275417 rt_sigreturn(0x2b3f32c7a000) = -1 EINTR (Interrupted system call)
01:45:14.275461 futex(0x40d889d0, FUTEX_WAIT, 29119, NULL <unfinished ... exit status 129>
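The `exit status 129` at the end of the trace fits the common convention of reporting a signal death as 128 plus the signal number, and SIGHUP is signal 1. A minimal shell sketch of that arithmetic, assuming `sleep` as a stand-in for the Java process (the variable names are illustrative only):

```shell
# Stand-in for the long-running Java process (assumption for illustration)
sleep 30 &
pid=$!

# Send the same signal that strace recorded
kill -HUP "$pid"

# Collect the child's exit status; a signal death is reported as 128 + signal number
status=0
wait "$pid" || status=$?

# Recover the signal number and look up its name
signal_number=$((status - 128))
signal_name=$(kill -l "$signal_number")

echo "exit status:   $status"
echo "signal number: $signal_number"
echo "signal name:   $signal_name"
```

Reading the status this way only says which signal arrived, not who sent it; that still has to come from correlating timestamps with other events on the machine.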

Does anyone know why this signal would be sent?

Martin

Mathieu Baudier

8:35 a.m.

...

I've been running our apps as purely as I can (java -cp /path/to/libs/* path.to.the.App) and they're still being sent SIGHUP signals for reasons I can't understand.

So, to sum up, you have tried:
- various classloading approaches
- various JVMs
- various systems

I must say that I'm really puzzled by your problem, especially since your app does not sound very complex and does not use JNI.

I would do the following: reproduce the problem cleanly with OpenJDK and submit it to the IcedTea project as a bug. They may be able to help you more, since they know what is going on in the JVM.

Last question: did you always have the problem, or did it suddenly appear? (If it appeared suddenly, after which changes in the app code, update to the OS, etc.?)

Cameron Kerr

18 Feb

4:33 a.m.

On 17/02/2011, at 9:35 PM, Mathieu Baudier wrote:

...

...
I've been running our apps as purely as I can (java -cp /path/to/libs/* path.to.the.App) and they're still being sent SIGHUP signals for reasons I can't understand.

I have only just joined this thread, but your description of unexplained SIGHUPs jogged a memory from long ago of a case where the cause turned out to be bad memory.

...

might be worth checking, stranger things have happened.

Cheers, Cameron

Martin Hewitt

9:47 a.m.

Hi Cameron,

On 18 February 2011 04:33, Cameron Kerr cameron@humbledown.org wrote:

...

On 17/02/2011, at 9:35 PM, Mathieu Baudier wrote:

...
...
I've been running our apps as purely as I can (java -cp /path/to/libs/* path.to.the.App) and they're still being sent SIGHUP signals for reasons I can't understand.

I have only just joined this thread, but your description of unexplained SIGHUPs jogged a memory from long ago of a case where the cause turned out to be bad memory.

...

might be worth checking, stranger things have happened.

Cheers, Cameron

Thanks, but I've got these tests running on a couple of machines of different generations, so I've ruled out the hardware as being at fault.

Martin


Martin Hewitt

9:53 a.m.

It's strange how one can wake up and suddenly notice a pattern...

Looking through the straces and the disconnect timestamps of the SSH sessions, it seems that the processes are dying as soon as, or shortly after, the SSH session is closed.

My command is something along the lines of:

java -cp /path/to/shared/libs/*:/path/to/class/directory/ path.to.MyApp > out.log 2>&1 &

Does anyone have an idea as to why this process is closing when the SSH window that started it closes?

Martin


Michael Gliwinski

9:49 a.m.

On Friday 18 Feb 2011 09:53:39 Martin Hewitt wrote:

...

My command is something along the lines of:

java -cp /path/to/shared/libs/*:/path/to/class/directory/ path.to.MyApp > out.log 2>&1 &

Does anyone have an idea as to why this process is closing when the SSH window that started it closes?

Try adding 'nohup' before 'java'. Closing the SSH session closes the shell, which sends HUP to its children.
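A minimal shell sketch of that behaviour, assuming `sleep` as a stand-in for the `java` command and a temporary file in place of out.log: the process is started under `nohup`, sent a SIGHUP by hand (roughly what happens when the session's shell exits), and then checked to see whether it is still running.

```shell
# Temporary file used here in place of out.log
log_file=$(mktemp)

# Stand-in for the java command (assumption): a long sleep started under nohup
nohup sleep 30 > "$log_file" 2>&1 &
pid=$!

# Give nohup a moment to set up and exec the command
sleep 1

# Simulate the hangup delivered when the login shell goes away
kill -HUP "$pid"
sleep 1

# kill -0 sends no signal; it only checks that the process still exists
if kill -0 "$pid" 2>/dev/null; then
    survived=yes
else
    survived=no
fi
echo "survived SIGHUP: $survived"

# Clean up the stand-in process
kill -TERM "$pid" 2>/dev/null || true
wait "$pid" 2>/dev/null || true
```

For the command in this thread, that amounts to putting `nohup` in front of `java` while keeping the redirections and the trailing `&`.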

But it is not your main problem, is it? I mean, the app wasn't always started manually from an interactive shell?

-- Michael Gliwinski, Henderson Group Information Services

Martin Hewitt

10:06 a.m.

On 18 February 2011 09:49, Michael Gliwinski Michael.Gliwinski@henderson-group.com wrote:

...

On Friday 18 Feb 2011 09:53:39 Martin Hewitt wrote:

...
My command is something along the lines of:

java -cp /path/to/shared/libs/*:/path/to/class/directory/ path.to.MyApp > out.log 2>&1 &

Does anyone have an idea as to why this process is closing when the SSH window that started it closes?

Try adding 'nohup' before 'java'. Closing SSH session closes the shell which sends HUP to its children.

I've just discovered this command, and have added it to the invocation.

...

But, it is not your main problem is it? I mean the app wasn't always started manually from an interactive shell?

You know, I've been debugging this for so long that I just can't remember. The processes are either started manually or from a web trigger, which could cause the same behaviour if/when the web server worker thread is destroyed or renewed.


Anthony

19 Feb

2:18 a.m.

On 18/02/11 20:49, Michael Gliwinski wrote:

...

Try adding 'nohup' before 'java'. Closing SSH session closes the shell which sends HUP to its children.

I religiously use 'screen' when logging in remotely to do any work. Not only has it saved me from interrupted work when the connection breaks, but it also saves me from having to remember to use 'nohup' before starting any jobs!

Ciao, Ak.

Martin Hewitt

21 Feb

10:30 a.m.

After 3 days of continuous operation (I barely managed 9 hours before), it seems I have narrowed this down to a saddeningly basic cause: the process is sent the SIGHUP signal when its owner process dies.

Using the nohup prefix solves the problem.

Thanks for all the help on this everyone!

Martin


m.roth＠5-cent.us

18 Feb

1:18 p.m.

Martin Hewitt wrote:

...

It's strange how one can wake up and suddenly notice a pattern...

Looking through the straces, and the disconnect timestamps of the SSH sessions, it seems that the processes are dying as soon as, or shortly after the SSH session is closed.

My command is something along the lines of:

java -cp /path/to/shared/libs/*:/path/to/class/directory/ path.to.MyApp > out.log 2>&1 &

Does anyone have an idea as to why this process is closing when the SSH window that started it closes?

<snip> Just for the sheer halibut*, try nohup <java cmd>

It's been something like 10 years or more since I had to do that, but....

mark

* I know, it's fishy....
