On Wed, January 7, 2015 10:54 am, Les Mikesell wrote:
> On Wed, Jan 7, 2015 at 10:43 AM, Valeri Galtsev
> <galtsev at kicp.uchicago.edu> wrote:
>>>
>>> Not junk - these are mostly IBM 3550/3650 boxes - pretty much top of
>>> the line in their day (before the M2/3/4 versions), They have
>>> Adaptec raid controllers,
>>
>> I never had Adaptec in _my_ list of good RAID hardware... But certainly
>> I cannot be the one to offer judgement on hardware I avoid to the best
>> of my ability. If you can afford, I would do the test: replace Adaptec
>> with something else (in my list it would be either 3ware or LSI or
>> areca), leaving the rest of hardware as it is. And see if the problems
>> continue. I do realize that there is more to it than just pulling one
>> card and sticking another in its place (that's why I said if you can
>> "afford" it meaning in more general sense, not just monetary).
>
> It's not something happening as a repeatable thing or that I could
> consider better/worse after replacing something. Maybe 3 times a year
> across a few hundred machines and generally not repeating on the same
> ones. But if there is anything in common it is on very 'active'
> filesystems.
>

Too bad... This reminds me of one of my 32-node clusters, in which one of
the nodes crashed about once a month. It was always a different node, so
any single node would run about 32 months on average before crashing ;-(
Too bad for troubleshooting. Only after 6 months did I pinpoint a
particular brand of RAM that was mixed into each node. When I got rid of
it, the trouble ended...

I would bet on the Adaptec cards in your case... though ideally I
shouldn't be offering judgement on hardware of a brand I almost never
use.

Good luck!

Valeri

++++++++++++++++++++++++++++++++++++++++
Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247
++++++++++++++++++++++++++++++++++++++++