[CentOS] reboot - is there a timeout on filesystem flush?

Wed Jan 7 17:10:47 UTC 2015
Valeri Galtsev <galtsev at kicp.uchicago.edu>

On Wed, January 7, 2015 10:54 am, Les Mikesell wrote:
> On Wed, Jan 7, 2015 at 10:43 AM, Valeri Galtsev
> <galtsev at kicp.uchicago.edu> wrote:
>>> Not junk - these are mostly IBM 3550/3650 boxes - pretty much top of
>>> the line in their day (before the M2/3/4 versions). They have
>>> Adaptec RAID controllers,
>> I never had Adaptec in _my_ list of good RAID hardware... But certainly I
>> cannot be the one to offer judgement on hardware I avoid to the best of
>> my ability. If you can afford it, I would do the test: replace Adaptec with
>> something else (in my list it would be either 3ware or LSI or Areca),
>> leaving the rest of the hardware as it is. And see if the problems continue.
>> I do realize that there is more to it than just pulling one card and
>> sticking another in its place (that's why I said if you can "afford" it,
>> meaning in a more general sense, not just monetary).
> It's not something happening as a repeatable thing or that I could
> consider better/worse after replacing something.  Maybe 3 times a year
> across a few hundred machines and generally not repeating on the same
> ones. But if there is anything in common it is on very 'active'
> filesystems.

Too bad... Reminds me of one of my 32-node clusters, in which one of the
nodes crashed once a month (always a different node, so on average any
given node would run 32 months before a crash ;-( ). Too bad for
troubleshooting. Only after 6 months did I pinpoint a particular brand of RAM
mixed into each node - when I got rid of it, the trouble ended... I would bet
on the Adaptec cards in your case... though ideally I shouldn't be offering
judgement on hardware of a brand I almost never use. Good luck!


Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247