Hi all,
I have a server with an Intel DG35EC motherboard, a Q9300 CPU and 8GB of Kingston DDRII RAM which can't take a lot of load. I have 4 XEN VPS's on there, which don't consume more than 4GB of RAM at this stage. Yet the machine's load skyrockets at times. I've moved the XEN VPS's to another server, with 4GB RAM, and it doesn't have the same problems.
So, apart from memtest86, how else can I stress test the server to find out what the problem is?
Rudi Ahlers wrote:
4 instances of mprime (www.mersenne.org), running the torture test, each set to affinity on a different CPU.
and, next time get a real server board with ECC.
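For reference, a rough sketch of that mprime approach, assuming mprime has been downloaded from mersenne.org and unpacked in the current directory (-t starts the torture test in recent versions, and the CPU numbers are just placeholders for a quad-core box):

# one torture-test instance per core, pinned with taskset (from util-linux)
taskset -c 0 ./mprime -t &
taskset -c 1 ./mprime -t &
taskset -c 2 ./mprime -t &
taskset -c 3 ./mprime -t &

Left running for a few hours, a marginal CPU, RAM or cooling problem usually shows up as reported errors or a lock-up.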
On Tue, Nov 18, 2008 at 9:19 AM, John R Pierce pierce@hogranch.com wrote:
Next time I won't ask, don't worry
John, just because the machines we use to serve web content to our clients don't use the grade of equipment you prefer, and can afford, doesn't mean the equipment other people use is inferior or worthless.
I have a problem with one of my machines, and have narrowed it down to either the CPU, RAM or motherboard, but before I take it back to the suppliers, I need to know what is wrong. They will switch it on and see that it works. But it's not taking the load that I expect it could. In fact, it's not taking the same load as a machine with an Intel E6750 Core 2 Duo & 4GB RAM. This server should be 2 - 4 times faster & handle 2 - 4 times the load of the E6750, yet it doesn't, and I need to know why. I don't appreciate being told that the hardware I have is inferior.
Rudi Ahlers wrote:
ECC memory would have caught any memory errors (including memory timing) and given a diagnostic, and we wouldn't be having this conversation; this system would be in production and you'd be working on the next customer's job.
Oh yeah, those 'server' motherboards generally use registered/buffered memory, which can handle higher memory fanouts and support a full load of memory banks robustly.
I meant to suggest the other night: go into the Intel BIOS, find the memory settings area, set it to custom timings, and add a clock to each of the timings - if it's 4-4-4-12, try 5-5-5-15 (or whatever the next increment is). Running 8GB on a desktop board, I'm guessing you have all slots full; this increases the capacitive load on the address and data bus, and makes marginal timing more marginal.
On Tue, Nov 18, 2008 at 10:24 AM, John R Pierce pierce@hogranch.com wrote:
John, I know what ECC does. I have 2 Dell PE860 servers with 8GB ECC DDRII RAM as well, and they're both giving RAM problems. I had to swap out the RAM twice with the suppliers already, and swapped out a motherboard on one of the servers. Honestly, ECC isn't my favourite to use.
At the same time, I have about 8 servers with cheap Gigabyte motherboards and non-ECC RAM, which have been running for close to 4 years now, without any hiccups at all.
It's the first time I try the Intel board, since it's supposed to be a step up from the desktop boards, and has 4 memory slots as opposed to only 2.
The server had the same problems when I only had 4GB RAM (2 slots used & 2 slots open), so I don't think that the capacitive load is the problem here. Right now the server is still at the datacentre - which is 2 hours' drive there & back with traffic - so I'm going to get it later today / tonight, as soon as I've moved all the data across to the slower Gigabyte server, and then I can try the RAM timings in the BIOS.
But, how can I put a LOT of load onto it, and see what's causing the problem? For all I know, the motherboard could be faulty, or the CPU, or maybe even the SATA bus?
Rudi Ahlers wrote on Tue, 18 Nov 2008 11:25:31 +0200:
But, how can I put a LOT of load onto it, and see what's causing the problem
http://httpd.apache.org/docs/2.0/programs/ab.html as a starter
Kai
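As a concrete illustration of using ab, something along these lines would do (the URL, request count and concurrency here are placeholders; point it at a real page served by the box under test):

# 10000 requests, 50 concurrent, against one page on the suspect server
ab -n 10000 -c 50 http://server-under-test.example/index.html

Run it from another machine on the same LAN so the load generator itself isn't competing for the server's CPU.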
On Tue, 2008-11-18 at 11:25 +0200, Rudi Ahlers wrote:
But, how can I put a LOT of load onto it, and see what's causing the problem? For all I know, the motherboard could be faulty, or the CPU, or maybe even the SATA bus?
stress! Configured correctly it will abuse a server pretty hard
http://weather.ou.edu/~apw/projects/stress/ http://dag.wieers.com/rpm/packages/stress/
run like
cd /var/tmp
screen -dmS stress
screen -rx stress
stress --cpu 16 --io 8 --vm 12 --vm-bytes 512M --hdd 4 --hdd-bytes 1G --timeout 86400
Oh and be sure to set a timeout if you run it remotely or you'll lock yourself out ;)
Rudi Ahlers wrote:
John, I know what ECC does. I have 2 Dell PE860 servers with 8GB ECC DDRII RAM as well, and they're both giving RAM problems. I had to swap out the RAM twice with the suppliers already, and swapped out a motherboard on one of the servers. Honestly, ECC isn't my favourite to use.
Wow! Nobody doing serious business would go without it (I work for a couple of banks and government agencies), but that's your choice and I respect that. But if you want to talk about five 9's, then you'd surely go with ECC and other invaluable features like a watchdog timer, management cards, BIOS serial redirection, chipkill, etc.
It all depends on your needs, I agree, but don't reject server grade hardware so easily!
At the same time, I have about 8 servers with cheap Gigabyte motherboards and non-ECC RAM, which have been running for close to 4 years now, without any hiccups at all.
That's bad stats. It's not because my neighbour has a problem with his Mercedes and I have no problem with my 4 Hyundais that Hyundais are better than Mercedes! Not only that, but while sitting 6 adults in a mini Hyundai may be possible, we'll be much more comfortable in the big Mercedes! Know what I mean?
It's the first time I try the Intel board, since it's supposed to be a step up from the desktop boards, and has 4 memory slots as opposed to only 2.
... and limited by the fanout of the CPU / chipset... As you put in more memory, you'll have to relax the timings and use a memory brand that is certified for the mainboard.
The server had the same problems when I only had 4GB RAM (2 slots used & 2 slots open), so I don't think that the capacitive load is the problem here. Right now the server is still at the datacentre - which is 2 hours' drive there & back with traffic - so I'm going to get it later today / tonight, as soon as I've moved all the data across to the slower Gigabyte server, and then I can try the RAM timings in the BIOS.
This could be a chipset problem, bad power supply, and the list goes on.
But, how can I put a LOT of load onto it, and see what's causing the problem? For all I know, the motherboard could be faulty, or the CPU, or maybe even the SATA bus?
Putting high load without having hardware monitoring won't tell you much IMHO.
I'd first test the power supply. Then remove everything you can and test with Memtest86+ (let's say, overnight, and while you're at it watch the power supply under load).
Swap the memory with some you know is good. If the problem persists, you could possibly have a chipset problem.
Good luck.
Guy Boisvert, ing IngTegration inc.
On Wed, Nov 19, 2008 at 11:48 PM, Guy Boisvert boisvert.guy@videotron.ca wrote:
Wow! Nobody doing serious business would go without it (I work for a couple of banks and government agencies), but that's your choice and I respect that. But if you want to talk about five 9's, then you'd surely go with ECC and other invaluable features like a watchdog timer, management cards, BIOS serial redirection, chipkill, etc.
It all depends on your needs, I agree, but don't reject server grade hardware so easily!
I wasn't rejecting server grade hardware. I was a bit irritated by the fact that I don't have server grade hardware, and everyone says get proper hardware. It ticked me off a bit that only a server can be good, and not a standard desktop which is also used to serve content to many people over the net.
That's bad stats. It's not because my neighbour has a problem with his Mercedes and I have no problem with my 4 Hyundais that Hyundais are better than Mercedes! Not only that, but while sitting 6 adults in a mini Hyundai may be possible, we'll be much more comfortable in the big Mercedes! Know what I mean?
Sure, but my argument will stay the same: "you don't need a 10 ton truck to move a 2 ton load", and I think the same applies to everything else. No need to get something totally over-specced & over-priced to do the same job. For mission critical stuff I'd go the server grade route, but for this it isn't really necessary in my eyes. The clients wouldn't pay the higher price for the more expensive hardware either.
... and limited by the fanout of the CPU / chipset... As you put in more memory, you'll have to relax the timings and use a memory brand that is certified for the mainboard.
This could be a chipset problem, bad power supply, and the list goes on.
Well, this is what I want to find out, what exactly is causing the problem :)
Putting high load without having hardware monitoring won't tell you much IMHO.
I'd first test the power supply. Then remove everything you can and test with Memtest86+ (let's say, overnight, and while you're at it watch the power supply under load).
I don't know how to test the PSU. Everything starts up fine, the fans work fine, etc. Memtest86+ didn't report any errors.
I wasn't rejecting server grade hardware. I was a bit irritated by the fact that I don't have server grade hardware, and everyone says get proper hardware. It ticked me off a bit that only a server can be good, and not a standard desktop which is also used to serve content to many people over the net.
Rudi,
Don't be irritated; you can afford server grade hardware, no problem.
Go to eBay and search for
DL380 G3 Dual
or
DL360 G3 Dual
Both units do hardware RAID stock. If you need more than 72GB of hot-swap hardware RAID1 disk space in total, or if disk space needs to grow a lot over time, get a DL380 now. The DL380 will do hardware RAID 1 and 5.
The space you need and the SCSI drives you purchase will be the larger cost factor, depending on your decisions. Hot-swap drives above 72GB in size will be the limiting factor in price if you want huge drives; 72GB and below are very inexpensive.
Your current machine can serve, and possibly meet your needs, yet it wasn't designed to be a set-and-forget *server*.
We regularly see these machines for 50 to 150 dollars; with drives, add another 50 to 100 bucks unless you need to go for quantity or super large.
Or get a Buy It Now package of some sort.
Way less expensive and much more reliable than any worthwhile desktops out there.
The only drawback is knowing who to buy from and who not to, because of shipping costs and whether the seller is a pro and packs properly.
- rh
RobertH wrote on Thu, 20 Nov 2008 09:15:03 -0800:
DL380 G3 Dual
And how does he squeeze that in 1U?
AFAIR, Rudi's located in South Africa and has already stated several times that prices there for server-grade stuff are not as cheap as you can get in some other areas of the world. That may apply to eBay deliveries to SA as well (if they ship to SA at all). And, looking at the eBay ads, I sure wouldn't buy such a machine. I can understand that Rudi builds his own servers in that situation; I do this sometimes as well. One thing I would have avoided if possible, though, is buying a board with such a new chipset that you can't even get lm_sensors to run. (Or you didn't research too well, Rudi. You can upgrade lm_sensors from rpmforge and you can get coretemp.ko patches for some chipsets. I had to do this myself for an "oldish" Intel 5000 server chipset.) That's another thing where server-grade stuff comes in nicely: the machines usually include a BMC, so you are not dependent on lm_sensors and the kernel.
Kai
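For reference, a minimal sketch of getting lm_sensors going on CentOS, assuming the rpmforge repository is already configured (exact package versions, and whether coretemp supports this particular chipset, will vary):

yum install lm_sensors      # base package; a newer build is available from rpmforge
sensors-detect              # probe for supported monitoring chips and answer the prompts
service lm_sensors start
sensors                     # report voltages, fan speeds and temperatures
modprobe coretemp           # per-core CPU temperatures, where the kernel/patch supports the CPU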
Kai,
It can be purchased and shipped to him, no biggie.
:-)
You ask how a DL380 G3 fits in 1U?
Not a bright question. Don't care.
:-)
All the previous info was FYI.
The DL360 G3 unit was noted in the email as well. It is 1U.
Don't care & it doesn't matter if you won't buy said equipment.
I have built between 600 and 1000 computers and would still rather have industrial-type hardware if at all possible.
For all of you on a serious budget who want a long-term, bullet-proof basic server, the hardware mentioned does a most excellent job, as do the G4 SCSI units.
They have hot-swap hardware RAID SCSI drives, they can also be purchased with dual hot-swap redundant power supplies, & they have what are called iLO ports so you can get into the machine remotely and power it on or off, or see and interact with the console remotely (just like you were there in person) if you accidentally foobar a box and cannot get into it with ssh etc.
Way better than a home build in many circumstances, yet not all, of course.
- rh
RobertH wrote on Thu, 20 Nov 2008 13:09:58 -0800:
The DL360 G3 unit was noted in the email as well. It is 1U.
Sorry, I overlooked that. Doesn't change the rest of what I wrote, though.
Kai
Kai,
Don't be sorry, I miss things in email here and there too.
I make more *general* mistakes than anyone I've ever met.
Yet, when such inexpensive, need-meeting, industrial hardware is available, I just cannot imagine building and fighting with higher-cost, frequently less reliable systems unless there are *a lot* of substantive reasons to justify it.
- rh
On Thu, Nov 20, 2008 at 9:31 PM, Kai Schaetzl maillists@conactive.com wrote:
Hi Kai,
It's as you stated: server grade hardware is a) rather expensive in South Africa, b) very limited in terms of what we can actually get, and c) comes with crappy support from vendors.
I'd LOVE to purchase some SuperMicro servers, but there's only 1 supplier in South Africa, and they clearly state that they don't sell loose components, which is a problem, since their RAM, HDD's & CPU's are approx 30 - 50% more expensive than from other suppliers, for the exact same components. They also won't allow you (me) to upgrade the server myself; I need to take it in to them and they do it. They are also the only people who are allowed to even work on the machines. I don't like this very much, and don't see the benefit in supporting a monopoly like this. From Dell's side the support is almost the same, but they didn't have a problem with the fact that I upgraded the HDD & RAM myself. I got the components cheaper from another supplier / importer / retailer than from Dell directly. And really, how can KingMax RAM or Seagate HDD's from one supplier be better than from another supplier? I only use the recommended types, i.e. ECC (non-registered), and Seagate RAID edition SATAII HDD's.
The motherboard I now have was the only one I could find at the time which wasn't XEON and had 4 memory slots. Since most suppliers who sell these kinds of motherboards don't keep the nice ones in stock, I need to take what I can get. I honestly didn't even think of checking to see if lm_sensors would work with it. All the other motherboards, even the Socket 775 server boards, either had 2 memory slots, or had to be ordered & would take 2 weeks to get here.
The problem is that I have already spent a lot of money on this "server", and it can't even be taken back for a credit return, only for swap-outs of the exact same components, and I need to use it. To get a new server will mean a whole new server from Dell or the SuperMicro people, which like I said would work out between 30 - 50% more than the exact same components I now have. In fact, the SuperMicro doesn't come with a Q9300, but with a Q6600 - which by chance is 15% more expensive than what I paid for the Q9300.
Ebay isn't an option, purely because of the import duties in our country, and if I did import it, how / where would I find support / replacement components for it? The suppliers here will only support servers which were purchased from them. I have tried this route already. I told one guy that I got a server from someone else and need to replace the motherboard. "Sorry, we can't support you, since you didn't purchase it from us" - or something to that effect was the response.
I'm sitting with a very expensive paper weight right now, and I don't know what to do. The same websites are running very well on a machine with a Gigabyte G31MX-S motherboard + 4GB DDRII 800 RAM + C2D 6750 CPU. This is what baffles me: how can the same load on a slower machine work fine, but on the faster one not?
On Fri, 2008-11-21 at 18:38 +0200, Rudi Ahlers wrote:
<snip>
I'm sitting with a very expensive paper weight right now, and I don't know what to do. The same websites are running very well on a machine with a Gigabyte G31MX-S motherboard + 4GB DDRII 800 RAM + C2D 6750 CPU. This is what baffles me: how can the same load on a slower machine work fine, but on the faster one not?
Having watched all this thread, I note that certain things are not mentioned. Assuming that you followed all the previous suggestions, I'll add my own that is based on practical experience some years back, and one recent experience.
Like you, I always built my own. Since you have no way to check the PS, try removing all components you can and see if that helps. _Usually_ a weak PS will show symptoms on boot, since all things are spinning up and doing max current draw, but sometimes not. Some BIOSes have settings that allow or automatically "spin up" drives in a stepped sequence. This would not stress the PS as much. Keep in mind that PS's have different amperage draw capabilities for different rails. A seemingly "sufficient" PS in terms of wattage may be weak on one or more of the rails. Specs for the mobo and PS might indicate a problem.
Have you checked the voltage settings in the BIOS for the CPU and memory? Many/most these days automatically detect, but...
Check the spec sheets for the CPU and memory sticks.
I recently upgraded a mobo's memory and it would not boot or run reliably. The spec for the memory was not available and I left the settings as with the previous memory. Not wanting to fry the sticks and possibly void the warranty, I picked up the whole thing and carried it back to my local supplier. I explained the symptoms and told him I suspected memory voltage but didn't want to try/fry the sticks and risk the warranty.
Hmmm... he said. Well, long story short, he eventually kicked up the voltage (I guess the "auto" in the BIOS was flaky or something) and all worked. Required +.2 volts. Most memory sticks can be run at slightly higher (+.1, +.2) volts without harm. Larger memory may require a slight increase in voltage. I guess the "automatic" settings can't always be trusted.
Running about 6 months now, NPs.
Another thing about pulling all components you can: if there is some kind of IRQ conflict, this can (used to?) cause slowdowns. Maybe that will be shown there. But that should also leave some traces in the /var/log/messages or dmesg log.
Let's presume that the "obvious" problem is not the problem. What if it is not hardware directly?
Examine your /var/log/dmesg carefully for any "suspect" messages. I've also found that occasionally drivers selected by the system may not be exactly correct. Check the specs for mobo and add-in cards and see if it looks like the best drivers for the chip sets are loaded (lsmod and modinfo help here).
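As a rough illustration of those checks (the module names below are only examples; substitute whatever lspci and lsmod actually show on this board):

dmesg | grep -i -e error -e fail -e irq     # suspicious messages and IRQ routing complaints
cat /proc/interrupts                        # see how interrupts are spread across the cores
lspci                                       # identify the onboard chipset, SATA and NIC devices
lsmod | grep -i -e ata -e e1000             # which storage / network drivers are loaded (example patterns)
modinfo ata_piix                            # description and supported hardware for a given module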
Grab any old performance/diagnostics software (maybe some on this list have current knowledge - I don't) and run it. Compare to published data for same or similar systems.
Enable sar on the system, run the reports and see where the slowdowns are.
I haven't used multi-core yet, but I would first check to see if all the cores are being effectively used. Maybe top will help here? Not sure.
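One way to cover both of the last two points, assuming the sysstat package is installed or installable (the intervals and sample counts below are arbitrary):

yum install sysstat        # provides sar, iostat and mpstat
sar -u 5 12                # overall CPU utilisation, 12 samples 5 seconds apart
sar -d 5 12                # per-device disk activity
mpstat -P ALL 5            # per-core load, to check whether all four cores are actually used
top                        # pressing '1' inside top also shows each core separately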
BIOS: some have oddball (not really, but legacy issues abound) settings that may limit amount of memory seen/used? Keep an eye out for those. Memory timings may not be properly detected and set. Check the specs for the memory and see if the BIOS has them properly set. BTW, _some_ memory and mobo combos will allow faster settings, but be careful. I haven't dinked with them for a long time, so I can't make any Q & A suggestions.
Have you upgraded to the latest BIOS on the system? Most retail mobos come with an early BIOS version that has... "issues". Check the manufacturers web site and see if there is a later BIOS.
OTHER: Of course, you have manually "re-seated" all connections, yes? A slightly loose cable, add-in card or memory not fully seated can do things such as you describe.
Visually inspect cables for "micro-fractures". Better, if you have access to meters, check for excessive resistance or opens. If not, try changing out cables. You might want to look in this area only if SAR reports show slow disk activity. Also hdparm might give some information. Maybe some settings there would help too.
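A couple of quick disk-side checks along those lines (/dev/sda is a placeholder for the actual drive; smartctl comes from the smartmontools package):

hdparm -tT /dev/sda                  # cached vs buffered read throughput
hdparm -I /dev/sda | grep -i dma     # confirm that DMA/UDMA modes are actually enabled
smartctl -a /dev/sda                 # SMART counters; UDMA CRC errors often point at a bad cable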
That's all I can think of ATM. I hope something here is of use.
Does this system have shared video/system RAM? If you have video memory shared with system memory, there is going to be memory that can't be tested unless you rotate memory chips or put in a VGA card. In the memtest86+ 2.10 configuration, set it for no reserved memory and watch memtest corrupt the video output on a shared-memory system.
I have some several-year-old DL360's and ML370's and love 'em - especially the hw RAID - but my local supplier hasn't had any for several months. Up until a few months ago, password reset info on eBay was sent in the clear, so I have a very hard time trusting eBay. It would be great if something like LinuxBIOS / OpenBIOS could stress-test the machine and then disable any RAM addresses that proved flaky - whether ECC or not.
Rob Townley wrote:
Wondering if you had heard of this project? I don't trust doing this sort of thing myself (paranoid), but..
http://rick.vanrein.org/linux/badram/
I think that (or another similar project) can even take memtest86 output and use that as a map for blocking out the bad parts of the ram chips.
nate
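For what it's worth, a sketch of how the BadRAM approach ends up looking, assuming a kernel patched with BadRAM support and an address/mask pair derived from memtest86's error output (the values and kernel path below are made-up placeholders, not real addresses):

# appended to the kernel line in /boot/grub/grub.conf of the patched kernel:
kernel /vmlinuz-2.6.18.badram ro root=/dev/VolGroup00/LogVol00 badram=0x01f5a000,0xfffff000

The patched kernel then simply never allocates pages in the masked-out region.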
on 11-21-2008 10:10 AM nate spake the following:
I wouldn't trust that anymore. That is from a day when RAM was priced like it was made out of gold or platinum. Bad RAM will just make an ever-expanding hole over time.
On Fri, Nov 21, 2008 at 7:28 PM, William L. Maltby CentOS4Bill@triad.rr.com wrote:
On Fri, 2008-11-21 at 18:38 +0200, Rudi Ahlers wrote:
<snip>
I'm sitting with a very expensive paper weight right now, and I don't know what to do. The same websites are running very well on a machine with a Gigabyte G31MX-S motherboard + 4GB DDRII 800 RAM + C2D 6750 CPU. This is what baffles me: how can the same load on a slower machine work fine, but on the faster one not?
Having watched all this thread, I note that certain things are not mentioned. Assuming that you followed all the previous suggestions, I'll add my own that is based on practical experience some years back, and one recent experience.
Like you, I always built my own. Since you have no way to check the PS, try removing all components you can and see if that helps. _Usually_ a weak PS will show symptoms on boot, since all things are spinning up and doing max current draw, but sometimes not. Some BIOSes have settings that allow or automatically "spin up" drives in a stepped sequence. This would not stress the PS as much. Keep in mind that PS's have different amperage draw capabilities for different rails. A seemingly "sufficient" PS in terms of wattage may be weak on one or more of the rails. Specs for the mobo and PS might indicate a problem.
I also thought the problem was related to the power supply, but I don't have a spare one of these at the moment. I did, however, swap out the PSU with a standard 350W PSU, and the symptoms were the same, so it's not PSU related in this case. This also reminds me that I should get a spare PSU ASAP :)
Have you checked the voltage settings in the BIOS for the CPU and memory? Many/most these days automatically detect, but...
I normally leave those on automatic, since I don't like running components outside the suppliers' specs.
Check the spec sheets for the CPU and memory sticks.
I recently upgraded a mobo's memory and it would not boot or run reliably. The spec for the memory was not available and I left the settings as with the previous memory. Not wanting to fry the sticks and possibly void the warranty, I picked up the whole thing and carried it back to my local supplier. I explained the symptoms and told him I suspected memory voltage but didn't want to try/fry the sticks and risk the warranty.
Hmmm... he said. Well, long story short, he eventually kicked up the voltage (I guess the "auto" in the BIOS was flaky or something) and all worked. Required +.2 volts. Most memory sticks can be run at slightly higher (+.1, +.2) volts without harm. Larger memory may require a slight increase in voltage. I guess the "automatic" settings can't always be trusted.
Running about 6 months now, NPs.
Another thing about pulling all components you can: if there is some kind of IRQ conflict, this can (used to?) cause slowdowns. Maybe that will be shown there. But that should also leave some traces in the /var/log/messages or dmesg log.
There are no add-in cards, nor a CD-ROM / DVD-ROM, only the on-board devices & the HDD's. Taking the HDD's out doesn't help much, since the problems only occur when there's a bit of load on the system.
Let's presume that the "obvious" problem is not the problem. What if it is not hardware directly?
Examine your /var/log/dmesg carefully for any "suspect" messages. I've also found that occasionally drivers selected by the system may not be exactly correct. Check the specs for mobo and add-in cards and see if it looks like the best drivers for the chip sets are loaded (lsmod and modinfo help here).
/var/log/messages didn't show anything related to the problem, at all.
Grab any old performance/diagnostics software (maybe some on this list have current knowledge - I don't) and run it. Compare to published data for same or similar systems.
Enable sar on the system, run the reports and see where the slowdowns are.
I haven't used multi-core yet, but I would first check to see if all the cores are being effectively used. Maybe top will help here? Not sure.
BIOS: some have oddball (not really, but legacy issues abound) settings that may limit amount of memory seen/used? Keep an eye out for those. Memory timings may not be properly detected and set. Check the specs for the memory and see if the BIOS has them properly set. BTW, _some_ memory and mobo combos will allow faster settings, but be careful. I haven't dinked with them for a long time, so I can't make any Q & A suggestions.
Have you upgraded to the latest BIOS on the system? Most retail mobos come with an early BIOS version that has... "issues". Check the manufacturers web site and see if there is a later BIOS.
No, I don't like BIOS upgrades unless absolutely necessary.
OTHER: Of course, you have manually "re-seated" all connections, yes? A slightly loose cable, add-in card or memory not fully seated can do things such as you describe.
Other than the RAM, the only other things to reseat are the power cables & SATA cables :)
But I did make an interesting discovery when I tried to install a fresh copy of CentOS on a new HDD. The installation itself didn't succeed. Every time I had to choose an option, on any screen during installation, all the fans would spin up to their max speed & everything would be really slow. It's almost like trying to install CentOS on a 486 computer. Yet none of the heatsinks felt warm, not even as warm as the hard drives. So I came to the conclusion that the motherboard is faulty. Right now I only have a spare Gigabyte motherboard handy, which when I used it didn't give me any problems whatsoever. I'm using the same 1U chassis with limited air flow and small fans, and it runs as smoothly as it should.
I have since swapped out the motherboard with the supplier, and the new motherboard seems to run very well. Installation took about 20 minutes to complete.
Thanx for all your help :)
Rudi Ahlers wrote:
I got the components cheaper from another supplier / importer / retailer than from Dell directly. And really, how can KingMax RAM or Seagate HDD's from one supplier be better than from another supplier? I only use the recommended types, i.e ECC (non registered), and Seagate RAID edition SATAII HDD's.
I dunno about Dell, but with most vendors, their own 'branded' hard drives have customized firmware that's been tested and validated to work in all their various RAID systems.
its a lot of little things. a Sun 72GB SCSI drive will always be an exact size, no matter what "72GB" drive is in it, while a whitebox generic drive from the same OEM(seagate/hitachi/etc) might be 50MB bigger or 10MB smaller or whatever. this really matters when you replace a raid drive. raid controllers in particular interact with hard drive firmware in some rather complex and subtle ways, and the drives really need to be tested and qualified for a specific application. as an example, a seagate ST3100xxxx drive might have 100 or more variations, indicated by different part numbers (the 9L9005-xxx number in the case of Seagate) to meet specific OEM requirements. mix and match the generic 'whitebox' versions of the drives in systems, and you're the one doing the qualification testing in production.
Memory has a lot of little specs that aren't readily apparent, and "DDR2-533 Registered ECC" can have differing CAS timings, different voltages, and even if all that is identical on paper, may or may not work reliably in a given system due to timing subtleties.. The HP or Sun or whatever ram has been fully qualified to work in their systems and most importantly is supported by their field service people. The stuff you get cheaper at mom-n-pops compuRus, who knows, you're the one doing the 'qualification testing' on your production systems.
since you've mentioned dell, I'd have to say, in my personal experience, Dell's are the cheapest and least reliable of the brand name servers... their field service in the US at least is decent, but they have a far higher 'infant mortality' rate than about anything else I've used (mostly HP, Sun, IBM).
Your SuperMicro vendor doesn't want anyone else's parts in the systems he sells and warranties because he doesn't want to be responsible for fixing ensuing problems. He's selling stuff he knows works, knows meets the specifications, and that he's warrantying and supporting. If you bought a new Volkswagen, then installed an aftermarket camshaft, and the engine ate a valve, you wouldn't expect Volkswagen to repair the piston damage, would you?
On Fri, Nov 21, 2008 at 8:17 PM, John R Pierce pierce@hogranch.com wrote:
Rudi Ahlers wrote:
I got the components cheaper from another supplier / importer / retailer than from Dell directly. And really, how can KingMax RAM or Seagate HDD's from one supplier be better than from another supplier? I only use the recommended types, i.e ECC (non registered), and Seagate RAID edition SATAII HDD's.
I dunno about Dell, but with most vendors, their own 'branded' hard drives have customized firmware that's been tested and validated to work in all their various RAID systems.
its a lot of little things. a Sun 72GB SCSI drive will always be an exact size, no matter what "72GB" drive is in it, while a whitebox generic drive from the same OEM(seagate/hitachi/etc) might be 50MB bigger or 10MB smaller or whatever. this really matters when you replace a raid drive. raid controllers in particular interact with hard drive firmware in some rather complex and subtle ways, and the drives really need to be tested and qualified for a specific application. as an example, a seagate ST3100xxxx drive might have 100 or more variations, indicated by different part numbers (the 9L9005-xxx number in the case of Seagate) to meet specific OEM requirements. mix and match the generic 'whitebox' versions of the drives in systems, and you're the one doing the qualification testing in production.
This is interesting, thank you for letting me know :)
Memory has a lot of little specs that aren't readily apparent, and "DDR2-533 Registered ECC" can have differing CAS timings, different voltages, and even if all that is identical on paper, may or may not work reliably in a given system due to timing subtleties.. The HP or Sun or whatever ram has been fully qualified to work in their systems and most importantly is supported by their field service people. The stuff you get cheaper at mom-n-pops compuRus, who knows, you're the one doing the 'qualification testing' on your production systems.
since you've mentioned dell, I'd have to say, in my personal experience, Dell's are the cheapest and least reliable of the brand name servers... their field service in the US at least is decent, but they have a far higher 'infant mortality' rate than about anything else I've used (mostly HP, Sun, IBM).
Sun, IBM & HP servers in our country are far overrated, and they don't deal with the "small companies", only the larger corporates, so it's not really an option for me. Intel servers are easier to get hold of, but also very expensive. Tyan - I don't know who sells those in our country.
Your SuperMicro vendor doesn't want anyone else's parts in the systems he sells and warranties because he doesn't want to be responsible for fixing ensuing problems. He's selling stuff he knows works, knows meets the specifications, and that he's warrantying and supporting. If you bought a new Volkswagen, then installed an aftermarket camshaft, and the engine ate a valve, you wouldn't expect Volkswagen to repair the piston damage, would you?
Not quite. The CPU, RAM & HDD's that they sell are 30% more expensive than the other suppliers' for the same thing, and this makes the servers more expensive than they need to be. I'm sitting with a lot of CPU's, RAM & HDD's which I'd still like to use, and don't see the point in throwing them in the bin to buy new ones. My other big problem is that if I want to upgrade anything, I need to take the servers back to their warehouse, which with traffic is 2 - 3 hours' drive from the DC, during office hours, and then there's a 2 day turn-around on upgrades. Our Dells can get upgraded by ourselves; we get the component from Dell and then schedule upgrades for a Sunday night - very convenient. And Dell will also come to the DC 24/7 if needed. For this reason, I don't want to purchase from the current SuperMicro supplier.
See, the thing in our country is, some companies have monopolies in their market, and they set the trends for how their clients may use their products / services - which doesn't always make business sense. It has nothing to do with "he doesn't want anyone else's parts in the systems he sells and warranties because he doesn't want to be responsible for fixing ensuing problems. He's selling stuff he knows works." Even our Intel suppliers (there are a few of them) don't have this stupid policy. If I want to upgrade, I get the necessary components and upgrade when convenient, not when the supplier feels they can do it.
Rudi Ahlers wrote:
It has nothing to do with "he doesn't want anyone else's parts in the systems he sells and warranties because he doesn't want to be responsible for fixing ensuing problems. He's selling stuff he knows works."
I'm confused. Aren't you the same person who just put together some stuff that doesn't work well - or bought from a supplier that wasn't picky about parts?
On Mon, Dec 1, 2008 at 1:07 AM, Les Mikesell lesmikesell@gmail.com wrote:
Yes, the motherboard ended up being faulty. But what I said here is about a different supplier. I work with about 8 suppliers, and there's only one supplier in the whole country who supplies SuperMicro, but their after-sales support really sucks, which is why I won't support them, and also why I can't use SuperMicro.
Currently, when one of my Dell's give hassles, Dell will come out within 4 hours to fix it up. On the stuff that I build myself, I can drive to a supplier and get a replacement component and have it swapped out within an hour. And since I use desktop type components, I can also use components from other suppliers, not just one.
on 11-21-2008 8:38 AM Rudi Ahlers spake the following:
I'm sitting with a very expensive paper weight right now, and I don't know what to do. The same websites are running very well on a machine with a Gigabyte G31MX-S motherboard + 4GB DDRII 800 RAM + C2D 6750 CPU. This is what baffles me: how can the same load on a slower machine work fine, but on the faster one not?
I don't want to get flamed for this suggestion, but have you tried running something with a newer kernel, like Ubuntu? I think they also have a live CD with a fairly current kernel. Sometimes if you can't get the hardware you want, you can't run the software you want either. It could be that the MB just isn't properly supported by the current kernel in 5.2. I have seen several desktop systems running the SATA bus in what amounts to PIO mode. That would heavily tax a system that is also trying to do useful work. The more the I/O sat in wait, the worse it would get.
Rudi Ahlers wrote:
I kind of lost track of this thread and missed any useful performance information if you posted it. Is this the same machine that has mysterious crashes? If so, I would just give up, or replace the RAM and power supply and then give up.
Performance wise, what kind of load does it have? Most servers that aren't doing graphics or number crunching are limited by disk i/o, not cpu so the disks and controllers are the interesting things to compare although if the controller requires a lot of CPU intervention (ide, sata in some modes) it may look like a cpu problem.
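One quick way to see that split on a running box is the sysstat tools (the 5-second interval below is arbitrary):

iostat -x 5      # per-disk %util and await, plus %iowait vs %user/%system CPU time
vmstat 5         # the 'wa' column is time spent waiting on I/O, 'id' is genuinely idle

A machine that looks "CPU bound" but shows high %iowait is really waiting on its disks or controller.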
on 11-17-2008 11:36 PM Rudi Ahlers spake the following:
I have a problem with one of my machines, and have narrowed it down to either the CPU, RAM or motherboard, but before I take it back to the suppliers, I need to know what is wrong. They will switch it on and see that it works. But it's not taking the load that I expect it could. In fact, it's not taking the same load as a machine with an Intel E6750 Core 2 Duo & 4GB RAM. This server should be 2 - 4 times faster & handle 2 - 4 times the load of the E6750, yet it doesn't, and I need to know why. I don't appreciate being told that the hardware I have is inferior.
Does the board recommend a certain memory config? I have had systems that specify a RAM module down to the part number, and others just don't work the same. Some boards can be real picky, and some boards also don't have enough heatsinking on their supporting chips and need a little extra ventilation.
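To see exactly what modules the board has detected, and at what speed, something like dmidecode can help (run as root; it ships with CentOS):

dmidecode -t memory     # per-slot module size, speed, type and manufacturer as reported by the BIOS

Comparing that output against the board's qualified-memory list is a cheap first check.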
Rudi Ahlers wrote:
John, just because the machines we use to serve web content to our clients don't use the grade of equipment you prefer, and can afford, doesn't mean the equipment other people use is inferior or worthless.
I have a problem with one of my machines, and have narrowed it down to either the CPU, RAM or motherboard, but before I take it back to the suppliers, I need to know what is wrong. They will switch it on and see that it works. But it's not taking the load that I expect it could. In fact, it's not taking the same load as a machine with an Intel E6750 Core 2 Duo & 4GB RAM. This server should be 2 - 4 times faster & handle 2 - 4 times the load of the E6750, yet it doesn't, and I need to know why. I don't appreciate being told that the hardware I have is inferior.
Hi Rudi,
John is a veteran on this list and you could probably learn many things from him. I suggest you read:
http://www.mit.edu/~jcb/tact.html
(that could maybe explain why you seemed irritated by John!)
By the way, server grade hardware is just that: meant to serve, with specific features included for that particular kind of job. That doesn't mean that a workstation board won't be able to do a server job at all. Having said that, I have managed servers for a long time and I can assure you that using serious server hardware with ECC translates into lower costs in the long run.
Sure you can still have problems with server grade hardware! But then, we could try to obtain statistics and MTBF to get a better idea.
But I find the ECC error indicator to be invaluable, and there are many other features that will help to pinpoint problems rapidly.
So Rudi, I can understand that you may have hardware problems and probably pressure to solve them, but IMHO it's just a classic example of TCO. And I read that you had problems with Dell servers - then try something else! It's OT, but I can say that I have many Tyan and HP servers in production and no problems at all.
Hope you'll solve your problem.
Regards,
Guy Boisvert, ing. IngTegration inc.
Rudi Ahlers schrieb:
Hi all,
I have a server with an Intel DG35EC motherboard, a Q9300 CPU and 8GB of Kingston DDRII RAM which can't take a lot of load. I have 4 XEN VPS's on there, which don't consume more than 4GB of RAM at this stage. Yet the machine's load skyrockets at times. I've moved the XEN VPS's to another server, with 4GB RAM, and it doesn't have the same problems.
So, apart from memtest86, how else can I stress test the server to find out what the problem is?
http://oca.microsoft.com/en/windiag.asp
(Yeah, it's MSFT - but I heard good things about it - memtest is not everything....)
I'm not sure if 8 GB and non-ECC (and non-buffered!) actually works that well....
Rainer
On Tue, Nov 18, 2008 at 09:52:15AM +0100, Rainer Duffner wrote:
Worse, from the appendix on that page:
"System requirements: ... Windows Memory Diagnostic is limited to testing only the first 4 gigabytes (GB) of RAM. If you have more than 4 GB of RAM, the remaining RAM after the first 4 GB will not be tested by Windows Memory Diagnostic."
Thanks for the pointer anyway. ;)
Tru
On Tue, Nov 18, 2008 at 10:59 AM, Tru Huynh tru@centos.org wrote:
I don't use Windows, so this wouldn't have helped in any case :)
On Tue, Nov 18, 2008 at 11:36 AM, Rainer Duffner rainer@ultra-secure.de wrote:
Rudi Ahlers schrieb:
I don't use Windows, so this wouldn't have helped in any case :)
It's a boot CD...
Rainer
Oh, my bad. I saw the microsoft.com URL :)
On Tue, Nov 18, 2008 at 11:19:51AM +0200, Rudi Ahlers wrote:
I don't use Windows, so this wouldn't have helped in any case :)
It's a diag floppy/CD-ROM; you don't need Windows... except to expand it onto the media.
At least read the URL!
Rudi, you have a strange attitude:
- you ask for help with a hardware issue
- that is nearly off-topic here
- there is not much people can help you with besides giving advice
- most of that advice you choose to ignore (fine with me)
If your hardware fails, replace it and bug your vendor; there is nothing more to say.
Tru
On Tue, Nov 18, 2008 at 11:47 AM, Tru Huynh tru@centos.org wrote:
On Tue, Nov 18, 2008 at 11:19:51AM +0200, Rudi Ahlers wrote:
I don't use Windows, so this wouldn't have helped in any case :)
It's a diag floppy/CD-ROM; you don't need Windows... except to expand it onto the media.
At least read the URL!
Rudi, you have a strange attitude:
- you ask for help with a hardware issue
- that is nearly off-topic here
- there is not much people can help you with besides giving advice
- most of that advice you choose to ignore (fine with me)
If your hardware fails, replace it and bug your vendor; there is nothing more to say.
Tru
Tru Huynh (mirrors, CentOS-3 i386/x86_64 Package Maintenance) http://pgp.mit.edu:11371/pks/lookup?op=get&search=0xBEFA581B
Tru,
The hardware works, but the moment I start running server-based applications (i.e. XEN VPS's), the load goes very high. I'm running CentOS Linux on it, and was thinking this would be a great place to get help, but I can see that I'm wrong, since the hardware that I chose to use (and can afford in my country) is clearly not the right choice.
And although you're right in saying that a hardware problem is off-topic, I need a way to prove to the suppliers that it is in fact a problem with the hardware. Everything works fine when you switch it on, yet when I start up the XEN virtual machines the load goes excessively high. I have reinstalled the OS, but since I use yum to update to the latest version of everything, it could very well be an OS / kernel / software bug as well. I don't know, and I was hoping to get some insight on it from this list.
My other choice is to go and purchase Windows & install it, to see what happens. Then, if the same problem persists I can say it's hardware, if not, then it's software related.
Sorry for sounding so rude in my earlier posts, I just spent 3 days without sleep @ the datacentre trying to sort this out, and I need to tell my clients why the machine performs so poorly compared to the previous one which only has a Core 2 Duo CPU with 4GB RAM in it. See my problem?
On Tue, Nov 18, 2008 at 11:55 AM, Rudi Ahlers rudiahlers@gmail.com wrote:
On Tue, Nov 18, 2008 at 11:47 AM, Tru Huynh tru@centos.org wrote:
On Tue, Nov 18, 2008 at 11:19:51AM +0200, Rudi Ahlers wrote:
I don't use Windows, so this wouldn't have helped in any case :)
It's a diag floppy/CD-ROM; you don't need Windows... except to expand it onto the media.
At least read the URL!
Rudi, you have a strange attitude:
- you ask for help with a hardware issue
- that is nearly off-topic here
- there is not much people can help you with besides giving advice
- most of that advice you choose to ignore (fine with me)
If your hardware fails, replace it and bug your vendor; there is nothing more to say.
Tru
Tru Huynh (mirrors, CentOS-3 i386/x86_64 Package Maintenance) http://pgp.mit.edu:11371/pks/lookup?op=get&search=0xBEFA581B
Tru,
The hardware works, but the moment I start running server-based applications (i.e. XEN VPS's), the load goes very high. I'm running CentOS Linux on it, and was thinking this would be a great place to get help, but I can see that I'm wrong, since the hardware that I chose to use (and can afford in my country) is clearly not the right choice.
And although you're right in saying that a hardware problem is off-topic, I need a way to prove to the suppliers that it is in fact a problem with the hardware. Everything works fine when you switch it on, yet when I start up the XEN virtual machines the load goes excessively high. I have reinstalled the OS, but since I use yum to update to the latest version of everything, it could very well be an OS / kernel / software bug as well. I don't know, and I was hoping to get some insight on it from this list.
My other choice is to go and purchase Windows & install it, to see what happens. Then, if the same problem persists I can say it's hardware, if not, then it's software related.
Sorry for sounding so rude in my earlier posts, I just spent 3 days without sleep @ the datacentre trying to sort this out, and I need to tell my clients why the machine performs so poorly compared to the previous one which only has a Core 2 Duo CPU with 4GB RAM in it. See my problem?
--
Kind Regards Rudi Ahlers
Oh, and don't take this the wrong way, but the link to a Microsoft-related program is (in my opinion) even more OT. Isn't there something similar for Linux that I can use? I'd prefer not to go the Windows route, if that's OK with you.
Rudi Ahlers schrieb:
Oh, and don't take this the wrong way, but the link to a microsoft related program (in my opinion) is even more OT. Isn't there something similar for Linux that I can use? I'd prefer not to go the Windows route, if that's ok with you.
SPEC 2006
;-)))
Rainer
On Nov 18, 2008, at 5:11 AM, Rainer Duffner rainer@ultra-secure.de wrote:
Rudi Ahlers schrieb:
Oh, and don't take this the wrong way, but the link to a microsoft related program (in my opinion) is even more OT. Isn't there something similar for Linux that I can use? I'd prefer not to go the Windows route, if that's ok with you.
SPEC 2006
;-)))
Or the LTP (Linux Test Project).
-Ross
On Tue, Nov 18, 2008 at 11:55:58AM +0200, Rudi Ahlers wrote:
Tru,
The hardware works, but the moment I start running server based application (i.e. XEN VPS's), then the load goes very high.
Install sysstat and read the collected data with sar(1). Check your iowait. You don't provide any valuable data, only partial information about what you think is wrong...
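For example (a minimal sketch, assuming the stock sysstat package from the base repos; the sa18 file name is just a day-of-month example):

    yum install sysstat            # also drops the sa1/sa2 collectors into /etc/cron.d
    sar -u 5 12                    # live CPU usage, including %iowait: 12 samples, 5 seconds apart
    sar -u -f /var/log/sa/sa18     # replay the data collected on the 18th of the month
    iostat -x 5                    # extended per-device stats: utilisation, await, queue size

A high %iowait while the guests are running would point at the disk subsystem rather than at the CPU or RAM.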
And although you're right in saying that a hardware problem is off-topic, I need a way to prove to the suppliers that it is in fact a problem with the hardware. Since everything works fine when you switch it on, yet when I start-up the XEN virtual machines, the load goed exsesively high.
Give figures, logs, and information on your Xen VM setups (file-backed disks, LVM, ...) and which OS runs in the Xen domUs.
You force people to ask you questions in order to try to help you; that's not the way it should be if you want people to be interested in helping you... /rant finished.
My other choice is to go and purchase Windows & install it, to see what happens. Then, if the same problem persists I can say it's hardware, if not, then it's software related.
Try another Linux distribution with Xen support, or the BSDs. Even if it's software/driver (CentOS-5) related, that will not help you much before/until it is fixed upstream...
Sorry for sounding so rude in my earlier posts, I just spent 3 days without sleep @ the datacentre trying to sort this out, and I need to tell my clients why the machine performs so poorly compared to the previous one which only has a Core 2 Duo CPU with 4GB RAM in it. See my problem?
Sure, take a quick break :) while audit/sysstat collects data.
Tru
So, apart from memtest86 how else can I stress test the server to find out what the problem is?
Have you looked at Inquisitor? There is a nice article about it which includes a download link at http://www.linux.com/articles/149774
Hope this helps.
Barry
Hi Rudi,
On Tue, Nov 18, 2008 at 02:13, Rudi Ahlers rudiahlers@gmail.com wrote:
...which can't take a lot of load... ...the machine sky rockets at some times...
The problem you have is that the Load Average is too high?
If that is indeed your problem, there is no way that this can be a memory or CPU issue, since those would cause crashes and not high Load Average.
If what you have is high Load Average, check this:
- Your machine has 8GB RAM. Are you using the 64-bit version of CentOS? There would be an overhead in using a 32-bit PAE version on a machine with more than 4GB; last time I tried it (some years ago) the overhead was big enough to make a difference in the server's performance.
- Your machine has SATA. If you don't use the correct SATA settings in the BIOS, CentOS may use it in a backwards-compatible mode and you will not get enough performance out of it (see previous posts on problems with SATA and with AHCI). If that's the case, changing the BIOS settings might make a huge difference, but beware that if you do, your machine may no longer boot with the OS you installed right now. The better thing to do would be to reinstall it once you have found the right setting.
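A couple of quick commands to verify both points (a sketch; /dev/sda is just an example device name):

    uname -m                           # x86_64 = 64-bit kernel; i686 means a 32-bit (PAE) kernel with your 8GB
    dmesg | grep -iE 'ahci|ata[0-9]'   # shows whether the controller came up as AHCI or legacy/compatible IDE
    hdparm -tT /dev/sda                # rough sequential read speed of the first disk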
And next time, please state your problem clearly ("high Load Average") instead of jumping the gun and saying you have a CPU or RAM issue which does not seem to be the case here.
HTH, Filipe
On Tue, Nov 18, 2008 at 3:58 PM, Filipe Brandenburger filbranden@gmail.com wrote:
Hi Rudi,
On Tue, Nov 18, 2008 at 02:13, Rudi Ahlers rudiahlers@gmail.com wrote:
...which can't take a lot of load... ...the machine sky rockets at some times...
The problem you have is that the Load Average is too high?
If that is indeed your problem, there is no way that this can be a memory or CPU issue, since those would cause crashes and not high Load Average.
If what you have is high Load Average, check this:
- Your machine has 8GB RAM. Are you using the 64-bit version of
CentOS? There would be an overhead in using a 32-bit PAE version on a machine with more than 4GB, last time I tried it (some years ago) the overhead was big enough to make a difference in the server's performance.
- Your machine has SATA. If you don't use the correct SATA settings on
the BIOS, CentOS may use it in a backwards compatible mode and you will not get enough performance out of it (see previous posts on problems on SATA and on AHCI). If that's the case, changing the BIOS settings might make a huge difference, but beware that if you do your machine may no longer boot with the OS you installed right now. Better thing to do would be to reinstall it once you found the right setting.
And next time, please state your problem clearly ("high Load Average") instead of jumping the gun and saying you have a CPU or RAM issue which does not seem to be the case here.
HTH, Filipe
Hi Flippie,
I have checked the BIOS settings, purely because the new HDD was installed in a machine without AHCI settings, so I had to change the settings in the BIOS to native IDE mode (the only other mode this motherboard supports).
The reason why I'm suspecting the MB / RAM / CPU is that I already swapped the HDDs out and reinstalled CentOS - first it was x86_64, now it's i386 (well, i686 as per uname -a). The only service that runs on the host node is HyperVM (which includes the Xen tools, PHP, Apache and MySQL).
I have the exact same setup on a few other machines, using Gigabyte motherboards + 4GB RAM. Other than that, the HDDs are the same, the OS is the same, and HyperVM is the same. I basically run yum upgrade once a week on all the machines. The only difference is this one has an Intel DG35EC motherboard with a Q9300 Quad Core CPU on it, which is supposed to be more power-efficient than some of the Core 2 Duo CPUs on the other machines.
As a matter of interest, all 5 virtual machines have been running on a Gigabyte motherboard + i6450 CPU + 4GB RAM since yesterday, and it's very, very stable.
So, my thinking is, it's the motherboard. It could also be the RAM, but I'm not 100% sure yet. The machine had 4GB initially, and then I added another 4GB hoping the problem would go away, but it didn't.
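Before returning the board, it may be worth capturing the same numbers on both the good and the bad machine while the guests are running. A sketch, assuming the stock CentOS Xen tools:

    uname -a                                            # dom0 kernel and architecture
    xm info | egrep 'total_memory|nr_cpus|xen_caps'     # what the hypervisor sees on each box
    xm list                                             # per-guest memory and VCPU allocation
    xentop -b -i 3                                      # batch-mode snapshot of guest CPU usage
    vmstat 5 12                                         # run queue, swap and iowait on the host

Comparing the vmstat output from the two machines should show whether the extra load is CPU time, iowait or something else entirely.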
On Tue, 2008-11-18 at 16:48 +0200, Rudi Ahlers wrote:
I have the exact same setup on a few other machines, using Gigabyte motherboards + 4GB RAM. Other than that, the HDD's are the same, the OS is the same, and HyperVM is the same. I basically run yum upgrade once a week on all the machines. The only difference is this one has an Intel DG35EC motherboard with a Q9300 Quad Core CPU on it, which is supposed to be more power efficient than some of the Core 2 Duo CPU's on the other machine.
As a matter of interest, all 5 Virtual Machines have been running on a Gigabyte motherboard + i6450 CPU + 4GB RAM since yesterday, and it's very very stable.
So, my thinking is, it's the motherboard. It could also be the RAM, but I'm not 100% sure yet. The machine had 4GB initially, and then I added another 4GB hoping the problem would go away, but it didn't.
I seem to recall that one of the differences between AMD and Intel virtualization is that AMD chips have additional memory management capabilities that are specific to virtualization on the CPU chip, where Intel processors require additional support circuitry. The fact that your problems surface when you're running xen suggests that possibly the additional support isn't functioning correctly. Is it possible that there's some obfuscated BIOS setting that's necessary to enable it, or that it's just not present on the motherboard?
Dave
I seem to recall that one of the differences between AMD and Intel virtualization is that AMD chips have additional memory management capabilities that are specific to virtualization on the CPU chip, where Intel processors require additional support circuitry. The fact that your problems surface when you're running xen suggests that possibly the additional support isn't functioning correctly. Is it possible that there's some obfuscated BIOS setting that's necessary to enable it, or that it's just not present on the motherboard?
Dave
Hi Dave,
My experience & knowledge of AMD is limited, so I stick to what I know: Intel. The only setting I know of in the BIOS related to virtualization is Intel's VT - which is enabled. But even when it was disabled I had the problem. I only enabled it last week to see if I could install FreeBSD as a fully virtualized guest.
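If you want to double-check the VT state from the running system rather than from the BIOS screen, something like this should work (a sketch; note that a Xen dom0 kernel may hide the vmx flag in /proc/cpuinfo, so the xm commands are the more reliable check):

    grep -c vmx /proc/cpuinfo        # >0 on a bare-metal kernel means VT-x is exposed
    xm dmesg | grep -i 'vmx\|hvm'    # the hypervisor boot log should mention VMX/HVM when VT is usable
    xm info | grep xen_caps          # hvm-3.0-x86_32 / hvm-3.0-x86_64 listed when full virtualization is available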
Rudi Ahlers wrote:
Hi all,
I have a server, with an Intel DG35EC motherboard, Q9300 CPU, 8GB Kingston DDRII RAM, which can't take a lot of load. I have 4 XEN VPS's on there, which don't consume more than 4GB RAM at this stage. Yet the machine's load sky-rockets at times. I've moved the XEN VPS's to another server, with 4GB RAM, and it doesn't cause the same problems.
So, apart from memtest86 how else can I stress test the server to find out what the problem is?
I think I mentioned this already but I use the Cerberus test suite
http://sourceforge.net/projects/va-ctcs/
Haven't had to use it in a while, but it works quite well; a lot of big OEMs use it for their burn-in tests. For me it found problems much faster than memtest86. Apparently it was developed by VA Linux (if you're familiar with that name).
Been meaning to set up a PXE Linux boot environment with this in it so I can run it without the full-blown OS installed, but haven't had a chance yet.
nate
I think I mentioned this already but I use the Cerberus test suite
http://sourceforge.net/projects/va-ctcs/
Haven't had to use it in a while, but it works quite well; a lot of big OEMs use it for their burn-in tests. For me it found problems much faster than memtest86. Apparently it was developed by VA Linux (if you're familiar with that name).
Been meaning to set up a PXE Linux boot environment with this in it so I can run it without the full-blown OS installed, but haven't had a chance yet.
nate
Thanx nate, I'll check it out :)
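For what it's worth, if all you need is a quick CPU + memory load generator rather than a full burn-in suite, the small 'stress' utility can also do the job (an assumption: it is not in the base repos, so it would have to come from a third-party repo or be built from source):

    stress --cpu 4 --vm 4 --vm-bytes 1500M --timeout 3600s
    # 4 CPU spinners plus 4 workers each allocating and touching 1.5GB, for one hour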
on 11-17-2008 11:13 PM Rudi Ahlers spake the following:
Hi all,
I have a server, with an Intel DG35EC motherboard, Q9300 CPU, 8GB Kingston DDRII RAM, which can't take a lot of load. I have 4 XEN VPS's on there, which don't consume more than 4GB RAM at this stage. Yet the machine's load sky-rockets at times. I've moved the XEN VPS's to another server, with 4GB RAM, and it doesn't cause the same problems.
So, apart from memtest86 how else can I stress test the server to find out what the problem is?
how can I stress a server?
Tell it the printer is pregnant?
Sorry... I couldn't resist. ;-P
It has been a long day.