On 06/06/2015 02:23 AM, Markus "Shorty" Uckelmann wrote:
When we start a job the first time after several hours we get a lot of timeouts. A second run mostly helps.
In addition to capturing swap use before and after a run that times out, I'd cold boot all of the systems involved and see whether the job times out then as well. If it does, you likely need to prime your caches before a job, break the job into smaller pieces, or extend your timeout.
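For the before/after swap capture, something along these lines might do. It is only a sketch: it assumes a Linux box where /proc/meminfo reports SwapTotal and SwapFree in kB, and the function names are mine, not from any existing tool.

```python
def swap_used_kb(meminfo_text):
    """Return swap in use (kB), given the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        tokens = rest.split()
        if sep and tokens:
            # The first token after the colon is the numeric value.
            fields[key.strip()] = int(tokens[0])
    return fields["SwapTotal"] - fields["SwapFree"]


def read_swap_used_kb(path="/proc/meminfo"):
    """Convenience wrapper: read the live file and report swap in use."""
    with open(path) as f:
        return swap_used_kb(f.read())


# Hypothetical usage around a job run:
#   before = read_swap_used_kb()
#   ... start the job and wait for it to finish or time out ...
#   after = read_swap_used_kb()
#   print("swap delta: %d kB" % (after - before))
```

A drop in swap use across the slow first run would suggest the job spent its time paging things back in.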
BTW: Is there a way to find out which parts of a program are swapped out without using monsters like Valgrind? Damn, sounds like an interesting start of the week...
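The smaps file (/proc/<pid>/smaps) has that information, in a general sense: each mapping of the process gets a "Swap:" line saying how many kB of it are currently swapped out. If you want to know what variables hold references to the areas that are swapped out, you'll need a debugger.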
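To get a per-mapping summary out of smaps without heavier tooling, a small script could total the "Swap:" lines under each mapping header. This is a rough sketch that assumes the usual smaps layout (an address-range header line followed by "Key: value kB" lines); the function name and the "[anon]" label for unnamed mappings are my own choices.

```python
import re
from collections import defaultdict

# A mapping header starts with a hex address range, e.g. "00400000-0040b000 r-xp ..."
HEADER = re.compile(r"^[0-9a-f]+-[0-9a-f]+\s")


def swap_by_mapping(smaps_text):
    """Return {mapping name: swapped-out kB}, given the text of an smaps file."""
    totals = defaultdict(int)
    name = None
    for line in smaps_text.splitlines():
        if HEADER.match(line):
            # Fields: range, perms, offset, dev, inode, then an optional path.
            parts = line.split(None, 5)
            name = parts[5].strip() if len(parts) > 5 else "[anon]"
        elif line.startswith("Swap:") and name is not None:
            totals[name] += int(line.split()[1])
    return dict(totals)


# Hypothetical usage for a process whose PID is in `pid`:
#   with open("/proc/%d/smaps" % pid) as f:
#       for name, kb in sorted(swap_by_mapping(f.read()).items()):
#           if kb:
#               print("%8d kB  %s" % (kb, name))
```

This tells you which libraries, heap, or anonymous regions are swapped out, not which variables live there.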