On Wed, January 7, 2015 10:54 am, Les Mikesell wrote:
On Wed, Jan 7, 2015 at 10:43 AM, Valeri Galtsev <galtsev@kicp.uchicago.edu> wrote:
Not junk - these are mostly IBM 3550/3650 boxes - pretty much top of the line in their day (before the M2/3/4 versions). They have Adaptec RAID controllers.
I never had Adaptec in _my_ list of good RAID hardware... But certainly I cannot be the one to offer judgement on hardware I avoid to the best of my ability. If you can afford it, I would do the test: replace Adaptec with something else (in my list it would be either 3ware or LSI or Areca), leaving the rest of the hardware as it is, and see if the problems continue. I do realize that there is more to it than just pulling one card and sticking another in its place (that's why I said if you can "afford" it, meaning in a more general sense, not just monetary).
It's not something happening as a repeatable thing or that I could consider better/worse after replacing something. Maybe 3 times a year across a few hundred machines and generally not repeating on the same ones. But if there is anything in common it is on very 'active' filesystems.
Too bad... Reminds me of one of my 32-node clusters, in which one of the nodes crashed once a month (always a different node, so any given node could be expected to run for 32 months before a crash ;-( ). Too bad for troubleshooting. Only after 6 months did I pinpoint a particular brand of RAM mixed into each node - when I got rid of it, the trouble ended... I would bet on the Adaptec cards in your case... though ideally I shouldn't be offering judgement on hardware of a brand I almost never use. Good luck!
Valeri
++++++++++++++++++++++++++++++++++++++++
Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247
++++++++++++++++++++++++++++++++++++++++