I have a friend who has been having an intermittent problem with his nice, shiny new IBM x226 server. It has a Xeon processor, 1.5 GB of RAM, and a hardware RAID controller, and it runs CentOS 3.
The system will sometimes run for a couple of weeks, then simply lock up -- nothing on the console, no response to pings, no "caps-lock" lights, no kernel panic indicators, nothing in the logs to indicate the problem. Other times the machine may lock up three times in one day. At the last call to me he did report that the hard drive lights flicker every once in a while during one of the system's catatonic states, but other than that there was no indication of life.
We have been in contact with IBM service and have run their hardware tests, which the machine passes flawlessly. We have now come to the point where they claim that "CentOS is not a supported OS" and even though they acknowledge CentOS as a RHEL clone, they argue that there is no support path for them to follow to pursue possible software problems.
The system owner is ready to buy RHEL, just to get service on his machine, but I am wondering -- how difficult is it to convert from CentOS to RHEL? Is it similar to the conversion from RHEL to CentOS or from WBEL to CentOS? I would like to avoid a complete reload.
Does anyone else have any suggestions?
TIA,
Barry
Barry L. Kline wrote:
The system owner is ready to buy RHEL, just to get service on his machine, but I am wondering -- how difficult is it to convert from CentOS to RHEL? Is it similar to the conversion from RHEL to CentOS or from WBEL to CentOS? I would like to avoid a complete reload.
In all honesty, if you are trying to extract support from IBM, I would personally recommend a complete reinstall. Once the problem is sorted out (and it's Not CentOS) he can always migrate back to CentOS via one of the published paths.
Does anyone else have any suggestions?
Does Redhat support that box? They must if you have tried support.
.dn
Hi Donavan. Thanks for the reply.
donavan nelson wrote:
In all honesty, if you are trying to extract support from IBM, I would personally recommend a complete reinstall. Once the problem is sorted out (and it's Not CentOS) he can always migrate back to CentOS via one of the published paths.
It may come to a complete reload, but what a PITA! I'm hoping to avoid it, if possible.
Does Redhat support that box? They must if you have tried support.
I'm not sure if Redhat supports the box or not. Our tech support was through IBM and their comment about RH was so that they could have a support path to work things through. They claimed RHEL as one of the supported OSes.
On Tue, 01 Feb 2005 14:33:27 -0500, Barry L. Kline wrote:
I have a friend who has been having an intermittent problem with his nice, shiny new IBM x226 server. It has a Xeon processor, 1.5 GB of RAM, and a hardware RAID controller, and it runs CentOS 3.
The system will sometimes run for a couple of weeks, then simply lock up -- nothing on the console, no response to pings, no "caps-lock" lights, no kernel panic indicators, nothing in the logs to indicate the problem. Other times the machine may lock up three times in one day. At the last call to me he did report that the hard drive lights flicker every once in a while during one of the system's catatonic states, but other than that there was no indication of life.
I had a very similar problem at a customer of mine. It took a few months to figure out the problem.
The system was running backups of the /proc directory, and that would crash the system on a random basis because it would create the wrong drive size. After that, no crash, ever.
-- Thanks syv@911networks.com When the network has to work
Syv Ritch wrote:
I had a very similar problem at a customer of mine. It took a few months to figure out the problem.
The system was running backups of the /proc directory, and that would crash the system on a random basis because it would create the wrong drive size. After that, no crash, ever.
Hi Syv. Thanks for the reply.
When you say that the system was running backups of the proc directory, do you mean a userland application was doing it (e.g. that the user requested) or was it something that the system does automatically?
The reason I ask is that the only backup being done on that system is the one done across the network via rsnapshot. It wasn't running when the lockups occurred, and it doesn't touch the /proc directory when it does run.
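For anyone who wants to double-check the same thing on their own box, here is a minimal Python sketch. It does not parse an rsnapshot configuration; it assumes you have already copied the backed-up directories and the excluded directories into two plain lists. The function names and the example paths are made up for illustration.

```python
from pathlib import PurePosixPath

# Assumption: /proc is the only pseudo-filesystem of interest here;
# extend the tuple if you want to check other directories as well.
PSEUDO_DIRS = ("/proc",)


def is_under(path, root):
    """Return True if path equals root or lies somewhere below it."""
    p, r = PurePosixPath(path), PurePosixPath(root)
    return p == r or r in p.parents


def unprotected_pseudo_dirs(backup_roots, excludes, pseudo_dirs=PSEUDO_DIRS):
    """Return the pseudo directories a backup of backup_roots would descend into."""
    hits = []
    for pseudo in pseudo_dirs:
        covered = any(is_under(pseudo, root) for root in backup_roots)
        excluded = any(is_under(pseudo, ex) for ex in excludes)
        if covered and not excluded:
            hits.append(pseudo)
    return hits


if __name__ == "__main__":
    # Hypothetical example: backing up the whole filesystem with no excludes.
    print(unprotected_pseudo_dirs(["/"], []))
```

An empty result means none of the listed backup roots reach /proc, or that an exclude already covers it.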
Barry
Barry L. Kline wrote:
Does anyone else have any suggestions?
Hey Barry -- have you tried upgrading/downgrading along the kernel line? I.e., try the core kernel from the very first release, then maybe a kernel from update 2, then the update 4 era kernels?
Same goes for glibc -- see if you can try upgrading/downgrading along the update chain and see if it has any effect, coupled with the above kernels. Something as simple as a different gcc used to compile the kernels/glibcs may have introduced a bugaboo down deep where you'd never find it...
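If the list of candidate kernels gets long, the walk along the update chain can be organized as a binary search rather than trying each one in turn. The Python below is only a rough sketch under some loud assumptions: the oldest kernel in the list is known to be stable, the newest is known to lock up, and stability flips exactly once along the chain. With a lockup that can take weeks to show, "stable" really means "survived an observation window you trust", so the `is_stable` callback stands in for a human answer. The kernel labels are placeholders, not real package versions.

```python
def first_unstable(kernels, is_stable):
    """Binary-search an ordered kernel list for the first unstable entry.

    Assumes kernels[0] is known good, kernels[-1] is known bad, and that
    stability flips exactly once along the list.
    Returns (index_of_first_unstable, labels_tested_along_the_way).
    """
    lo, hi = 0, len(kernels) - 1
    tested = []
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tested.append(kernels[mid])
        if is_stable(kernels[mid]):
            lo = mid   # problem was introduced later in the chain
        else:
            hi = mid   # problem was introduced here or earlier
    return hi, tested


if __name__ == "__main__":
    # Placeholder labels, oldest first.
    chain = ["release", "update1", "update2", "update3", "update4"]
    # Hypothetical outcome: everything before update3 survived the test window.
    survived = {"release", "update1", "update2"}
    index, tested = first_unstable(chain, lambda k: k in survived)
    print(chain[index], tested)
```

The same loop works for the glibc chain, though testing kernel and glibc combinations together multiplies the number of runs.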
Just a thought. -te
Troy Engel wrote:
Barry L. Kline wrote:
Hey Barry -- have you tried upgrading/downgrading along the kernel line? I.e., try the core kernel from the very first release, then maybe a kernel from update 2, then the update 4 era kernels?
That's not a bad idea... I'll have to give it a try. This machine does not face the internet so security patches won't be as big of a problem.
Right now I have them booted on the uniprocessor kernel (it's a hyperthreaded CPU) to see if it gains any stability there. Next stop is to go into the BIOS and turn off hyperthreading (just to make sure).
Failing that I'll try dropping back kernels and see what happens.
Thanks Troy!
Barry
Barry L. Kline wrote:
Failing that I'll try dropping back kernels and see what happens.
Another possible test, although it's the worst case scenario - compile your own kernel using only the patches from the SRPM that you really need. Sometimes RedHat kernels actually break the real kernel in obscure ways.
Historical anecdote: back in RH73, the Intel L440GX+ motherboard (server mobo) was everywhere. The initial 7.3 kernels worked fine, then along the way they broke - no L440GX+ motherboard would work anymore. At that point I had to run hand-compiled kernels on all the servers (same core version as RPM patched ones) and it worked perfectly fine.
Eventually newer kernels came out that at least worked with 'noapic' on the boot line (still have a few running, knock on wood), but unpatched kernels worked just fine without this as before. There's a mile-long bug buried somewhere on the redhat.com bugzilla.
So, there is some history of RedHat patches breaking a perfectly good kernel - you may possibly be in this situation, but it's a real bugger to figure out.
-te