I just spent a while trying to debug some weird slowdown issues, and I thought I’d let the mailing list know in case it helps anyone else, or in case someone has a cleaner solution (or a better way of debugging the underlying issue).
Just by way of context: we have a cluster using Infiniband and CentOS 6 (patched up to the latest updates as of a couple of weeks ago, so 6.9 plus anything newer) and OpenMPI 3.1.0. Most of the nodes are “Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz” or “Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz”, currently with kernel 2.6.32-696.30.1.el6.x86_64.
Recently (I believe since we upgraded to that kernel, though I’m not 100% sure that’s really when it started), certain parallel, floating-point-compute-intensive jobs would sometimes (maybe 1 time in 20) run very slowly. The overall speed was reduced by about a factor of 2, but the slowdown appeared to be localized to specific routines, which would go from taking ~1 s to ~50-100 s. I first assumed it had to do with communication, but as I made the timing information more fine-grained, it became clear that nearly trivial loops like

      DO I=1,GRIDC%RL%NP
         DWORK1(I,ISP)= REAL( DWORK(I,ISP) ,KIND=q)
      ENDDO

were a large part of the problem. The only thing I could find after poking around that was at all odd is that, when this happened, khugepaged would take 100% of a CPU for tens of seconds on the affected nodes. It appears, although it’s not completely certain yet (the problem shows up rarely enough that it’s hard to rule anything out definitively), that disabling THP by putting “never” in /sys/kernel/mm/transparent_hugepage/enabled and /sys/kernel/mm/transparent_hugepage/defrag fixes the problem.
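For concreteness, what I mean by disabling THP amounts to something like the following (as root; the exact sysfs paths may differ on other kernels):

      echo never > /sys/kernel/mm/transparent_hugepage/enabled
      echo never > /sys/kernel/mm/transparent_hugepage/defrag

Note that this doesn’t survive a reboot; as far as I understand, the transparent_hugepage=never kernel boot parameter is the usual way to make it permanent.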
I guess I am also still wondering what exactly khugepaged does, and what kind of problems I should worry about now that I’ve disabled it. The documentation I’ve found for it doesn’t really explain what transparent huge pages are actually good for.
Noam