I was having problems with the same server locking up to the point where I can't even get in via SSH. I've already used HTB/TC to reserve bandwidth for my SSH port, but the problem now isn't an attack on the bandwidth. So I'm trying to figure out if there's a way to ensure that SSH is given CPU and I/O priority.
However, so far my reading seems to imply that it's probably not going to help if the issue is I/O related, and/or it would require escalating SSH to such levels (above paging/filesystem processes) that it becomes a really bad idea.
Since I'm not the only person who faces problems trying to remotely access a locked-up server, surely somebody must have come up with a solution that didn't involve somebody/something hitting the power button?
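For reference, the kind of HTB setup described above (a small guaranteed class for SSH traffic on port 22) looks roughly like this; the interface name and rates are assumptions, not the actual values used:

  # reserve a slice of outbound bandwidth for SSH (sketch; adjust dev/rates)
  tc qdisc add dev eth0 root handle 1: htb default 20
  tc class add dev eth0 parent 1: classid 1:1 htb rate 100mbit
  tc class add dev eth0 parent 1:1 classid 1:10 htb rate 1mbit ceil 100mbit prio 0
  tc class add dev eth0 parent 1:1 classid 1:20 htb rate 99mbit ceil 100mbit prio 1
  tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip sport 22 0xffff flowid 1:10

This only protects network bandwidth, which is exactly why it doesn't help when the box is starved for CPU or disk I/O.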
On 29.06.2011 at 21:50, Emmanuel Noobadmin wrote:
Since I'm not the only person who faces problems trying to remotely access a locked-up server, surely somebody must have come up with a solution that didn't involve somebody/something hitting the power button?
Yes, it's called "out of band management". Have dial-in access to IPMI/iLO interfaces, or just an APC remote-controlled power switch to power off the server.
Rainer
On 29.06.2011 at 21:50, Emmanuel Noobadmin wrote:
Since I'm not the only person who faces problems trying to remotely access a locked-up server, surely somebody must have come up with a solution that didn't involve somebody/something hitting the power button?
Yes, it's called "out of band management". Have dial-in access to IPMI/iLO interfaces, or just an APC remote-controlled power switch to power off the server.
Perhaps this suggestion is applicable: set up a cron job where the sshd server is restarted (once or several times per day, or per week, etc.).
At one time, I had a server at an ISP that, with time, became woefully underpowered (the anti-spam/anti-virus program ate CPU power and RAM) to the point that occasionally, and with increasing frequency (once a week?), sshd would become unresponsive. This required that someone be at the console to restart sshd or, if the problem was not understood, reboot the box.
Having sshd restarted in cron worked until we got a new, soopa-doopa box.
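A minimal version of that cron approach, assuming a CentOS-style init script (the time and path are illustrative):

  # /etc/cron.d/restart-sshd -- restart sshd nightly at 03:00 (sketch)
  0 3 * * * root /sbin/service sshd restart >/dev/null 2>&1

Of course, this only helps when sshd itself is what has wedged, not when the whole box is starved for I/O.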
Rainer
Max
On 29.06.2011 at 22:08, Max Pyziur wrote:
On 29.06.2011 at 21:50, Emmanuel Noobadmin wrote:
Since I'm not the only person who faces problems trying to remotely access a locked-up server, surely somebody must have come up with a solution that didn't involve somebody/something hitting the power button?
Yes, it's called "out of band management". Have dial-in access to IPMI/iLO interfaces, or just an APC remote-controlled power switch to power off the server.
Perhaps this suggestion is applicable: set up a cron job where the sshd server is restarted (once or several times per day, or per week, etc.).
If the problem is lack of I/O, only power-on/off will work. Or shutting down the offending process(es).
OOB management is a necessity nevertheless. You don't have to be a control freak to love it ;-)
At Wed, 29 Jun 2011 16:08:02 -0400 (EDT) CentOS mailing list centos@centos.org wrote:
On 29.06.2011 at 21:50, Emmanuel Noobadmin wrote:
Since I'm not the only person who faces problems trying to remotely access a locked-up server, surely somebody must have come up with a solution that didn't involve somebody/something hitting the power button?
Yes, it's called "out of band management". Have dial-in access to IPMI/iLO interfaces, or just an APC remote-controlled power switch to power off the server.
Perhaps this suggestion is applicable: set up a cron job where the sshd server is restarted (once or several times per day, or per week, etc.).
At one time, I had a server at an ISP that, with time, became woefully underpowered (the anti-spam/anti-virus program ate CPU power and RAM) to the point that occasionally, and with increasing frequency (once a week?), sshd would become unresponsive. This required that someone be at the console to restart sshd or, if the problem was not understood, reboot the box.
Having sshd restarted in cron worked until we got a new, soopa-doopa box.
If the problem is excessive load because Sendmail / Mimedefang / spamd / etc. is too busy handling tons of mail/spam being dumped on your server, you might want to look at these sendmail options:
ConnectionRateThrottle (34.8.12) MaxDaemonChildren (34.8.35)
also
QueueLA (34.8.50) RefuseLA (34.8.54)
setting these can keep Sendmail (and Mimedefang, spamd, etc.) from overwhelming the system.
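In sendmail.mc terms, those options map to m4 defines along these lines (the values are illustrative, not recommendations):

  define(`confCONNECTION_RATE_THROTTLE', `5')dnl
  define(`confMAX_DAEMON_CHILDREN', `40')dnl
  define(`confQUEUE_LA', `8')dnl
  define(`confREFUSE_LA', `12')dnl

then rebuild sendmail.cf (e.g. make -C /etc/mail) and restart sendmail.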
Rainer
Max
On 6/30/11, Robert Heller heller@deepsoft.com wrote:
If the problem is excessive load because Sendmail / Mimedefang / spamd / etc. is too busy handling tons of mail/spam being dumped on your server, you might want to look at these sendmail options:
Mail was my first suspect because I had similar issues with exim/spamd locking up badly on another server. But usually that includes a high CPU % as well. Although this suspicion did help me pinpoint one of the causes, a script that periodically went through the email accounts/Maildirs; that was fixed after learning about ionice on the list.
For a while I thought the problem was solved, but these past couple of days it's acting up again, nothing's jumping out screaming "I'm the problem!", and not being able to SSH in to see exactly what's going on is making it difficult.
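For anyone searching the archives, the ionice trick mentioned here is just a matter of running the offending script in the idle I/O class, e.g. (the script name is hypothetical):

  ionice -c3 nice -n 19 /usr/local/bin/scan-maildirs.sh

With the CFQ scheduler, class 3 (idle) only gets disk time when nothing else wants it.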
On 6/29/2011 4:12 PM, Emmanuel Noobadmin wrote:
On 6/30/11, Robert Heller heller@deepsoft.com wrote:
If the problem is excessive load because Sendmail / Mimedefang / spamd / etc. is too busy handling tons of mail/spam being dumped on your server, you might want to look at these sendmail options:
Mail was my first suspect because I had similar issues with exim/spamd locking up badly on another server. But usually that includes a high CPU % as well. Although this suspicion did help me pinpoint one of the causes, a script that periodically went through the email accounts/Maildirs; that was fixed after learning about ionice on the list.
For a while I thought the problem was solved, but these past couple of days it's acting up again, nothing's jumping out screaming "I'm the problem!", and not being able to SSH in to see exactly what's going on is making it difficult.
What's the physical disk system? I remember seeing something like that long ago, where a raid controller had a large write cache that normally made it seem fast, but once in a while either filling it to a high-water mark or something else would trigger it to completely catch up before responding again - which could take several minutes with everything blocked. And nothing else ever looked out of the ordinary.
On 6/30/11, Les Mikesell lesmikesell@gmail.com wrote:
OK, but without knowing the cause, you already know the cure. Make the virtual servers not share physical disks - they will always want a single head to be in different places at the same time.
Same old problem: budget :D Also, I expect similar setups in the future, so I need to be able to know why and not simply throw hardware at it, since the amount of disk activity is relatively low. The curious part is that this doesn't appear to happen during expected heavy usage. It almost never occurs during working hours on a weekday, ever since I ioniced the other script.
And there is also probably some ugly stuff about how using files for virtual disk
Unfortunately yes, this was one part I misread/misunderstood; I should have gone with raw partitions. However, the real amount of I/O on these isn't expected to be high, especially not during a lull hour like 1am or on a Sunday.
images and perhaps LVM on both the real and virtual side makes your disk blocks misaligned. Fixing that might help too.
No LVM on either side, kept unnecessary layers off the guest. And I manually fdisk'd the drive to ensure 4K alignment.
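A quick way to sanity-check that alignment, as a sketch:

  # list partitions in sectors; start sectors divisible by 8 are 4K-aligned
  fdisk -lu /dev/sda

(2048, i.e. a 1MiB boundary, is the usual aligned starting sector with newer tools.)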
What's the physical disk system? I remember seeing something like that long ago, where a raid controller had a large write cache that normally made it seem fast, but once in a while either filling it to a high-water mark or something else would trigger it to completely catch up before responding again - which could take several minutes with everything blocked. And nothing else ever looked out of the ordinary.
Standard Intel-based board, onboard SATA controller with a pair of SATA2 disks mirrored with mdadm. As I said, budget setup :D
On 6/29/2011 4:47 PM, Emmanuel Noobadmin wrote:
OK, but without knowing the cause, you already know the cure. Make the virtual servers not share physical disks - they will always want a single head to be in different places at the same time.
Same old problem: budget :D
If an extra SATA disk (or pair) is an issue I'd worry about whether your paycheck is going to bounce. Head contention is the one thing you can't virtualize away, although adding a bunch of RAM can sometimes help. And a virtual machine with its own raid set is still a pretty cheap machine.
Also, I expect similar setups in the future, so I need to be able to know why and not simply throw hardware at it, since the amount of disk activity is relatively low. The curious part is that this doesn't appear to happen during expected heavy usage. It almost never occurs during working hours on a weekday, ever since I ioniced the other script.
If you have sysstat installed, sar should tell you when the busy times occur. Maybe you can match it up with a cron job or some email user's connection that might be downloading a bazillion messages. Are you sure it isn't the daily build of the mlocate database running from cron.daily? That's probably not pretty when running simultaneously on multiple VMs that have lots of little files on the same physical disk.
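If sysstat is already collecting data, something like this pulls the history for a given day (the file name depends on the day of the month):

  sar -q -f /var/log/sa/sa29    # load average / run queue history
  sar -u -f /var/log/sa/sa29    # CPU utilization
  sar -d -f /var/log/sa/sa29    # per-device I/O statistics

which should let you line the load spikes up against cron times or mail traffic.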
At Thu, 30 Jun 2011 05:12:12 +0800 CentOS mailing list centos@centos.org wrote:
On 6/30/11, Robert Heller heller@deepsoft.com wrote:
If the problem is excessive load because Sendmail / Mimedefang / spamd / etc. is too busy handling tons of mail/spam being dumped on your server, you might want to look at these sendmail options:
Mail was my first suspect because I had similar issues with exim/spamd locking up badly on another server. But usually that includes a high CPU % as well. Although this suspicion did help me pinpoint one of the causes, a script that periodically went through the email accounts/Maildirs; that was fixed after learning about ionice on the list.
For a while I thought the problem was solved, but these past couple of days it's acting up again, nothing's jumping out screaming "I'm the problem!", and not being able to SSH in to see exactly what's going on is making it difficult.
I have discovered that my VPS (which is a mail and web server) would become impossible to ssh into sometimes. If I was patient enough, slogin would eventually get me on the system. ps would show lots and lots of sendmail, mimedefang, spamd, and clamav processes and insane load average values. I generally could manage to stop sendmail, and the load average would begin to go down as the various mail-related processes wound down (once things became sane, I'd restart sendmail and any crashed daemons). I put in sendmail settings to throttle back on accepting connections when things got excessively 'busy'. This was NOT anything running on my server, but was caused by some overeager spambot (or spambot farm) pushing a vast amount of spam at my server. This is a 'random' event and does not seem to follow any sort of meaningful or predictable schedule. I guess being proactive with sendmail settings, including the throttling settings and populating the accessdb with DSL/cable modem networks (DISCARD) and various other random troublesome networks (REJECT), helps. (The networks in the accessdb cut off lots of connections without firing up mimedefang and crew.) I also have the SpamCop rule enabled as well.
If the machine is a public-facing SMTP server, I would look first to see if you are getting the problem I was having. Maybe look at the maillog to see if the volume of incoming mail is just overwhelming the system. In that case you need to do things to keep sendmail from running too many processes, either by throttling the connection rate and/or by using the accessdb to discard or reject connections from known problem networks.
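The accessdb entries described here look something like this (the networks shown are placeholders):

  # /etc/mail/access -- illustrative entries only
  Connect:192.0.2                     REJECT
  Connect:dsl-pool.example.net        DISCARD

rebuilt with: makemap hash /etc/mail/access < /etc/mail/access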
Robert Heller wrote:
If the machine is a public-facing SMTP server, I would look first to see if you are getting the problem I was having. Maybe look at the maillog to see if the volume of incoming mail is just overwhelming the system. In that case you need to do things to keep sendmail from running too many processes, either by throttling the connection rate and/or by using the accessdb to discard or reject connections from known problem networks.
A very simple solution is to implement a reverse DNS check. My Postfix mail server refuses to accept any mail from a host without valid reverse DNS. I also use(d) greylisting and a few other measures, but the reverse DNS check helped immensely in lowering the spam that comes to my mailboxes. I would say that the reduction is some 70-80%.
Ljubomir
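For Postfix, the reverse DNS check described above can be expressed as a restriction like this (a sketch; fold it into your existing restriction lists):

  # /etc/postfix/main.cf
  smtpd_client_restrictions =
      permit_mynetworks,
      reject_unknown_reverse_client_hostname

followed by a postfix reload.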
On 6/30/11, Rainer Duffner rainer@ultra-secure.de wrote:
Yes, it's called "out of band management". Have dial-in access to IPMI/iLO interfaces or just an APC remote controlled power-switch to power-off the server.
I don't want to reboot the server every time something like that happens. I expect pretty nasty problems will develop after a few dozen unclean shutdowns like that.
Would ILO work on a server that's unresponsive due to heavy load? The actual network access isn't a problem, so dial-up isn't necessary. The other problem is that the server in question probably doesn't have ILO features on the mainboard.
On 29.06.2011 at 22:15, Emmanuel Noobadmin wrote:
On 6/30/11, Rainer Duffner rainer@ultra-secure.de wrote:
Yes, it's called "out of band management". Have dial-in access to IPMI/iLO interfaces, or just an APC remote-controlled power switch to power off the server.
I don't want to reboot the server every time something like that happens. I expect pretty nasty problems will develop after a few dozen unclean shutdowns like that.
Would ILO work on a server that's unresponsive due to heavy load?
ILO used to be a separate board with a separate NIC and a separate CPU etc. Nowadays, it's just an additional chip on the board.
It works until the power-supply is fried.
The actual network access isn't a problem, so dial-up isn't necessary. The other problem is that the server in question probably doesn't have ILO features on the mainboard.
If it's a server that actually deserves that name, it should have IPMI on board. You can buy add-on PCI-cards for OOB-management, though.
On 6/30/11, Rainer Duffner rainer@ultra-secure.de wrote:
If it's a server that actually deserves that name, it should have IPMI on board.
Problem is, some of us work for budget-constrained customers and define a server by purpose and not specifications. So very often they buy servers based on budget and on whether it's good enough to run most applications for X users. Unfortunately, very often I'm the one who ends up managing these, simply because our applications run on them.
You can buy add-on PCI-cards for OOB-management, though.
Thanks for the information, although unless they are really cheap...
On 6/29/2011 3:26 PM, Emmanuel Noobadmin wrote:
On 6/30/11, Rainer Duffner rainer@ultra-secure.de wrote:
If it's a server that actually deserves that name, it should have IPMI on board.
Problem is, some of us work for budget-constrained customers and define a server by purpose and not specifications. So very often they buy servers based on budget and on whether it's good enough to run most applications for X users. Unfortunately, very often I'm the one who ends up managing these, simply because our applications run on them.
You can buy add-on PCI-cards for OOB-management, though.
Thanks for the information, although unless they are really cheap...
The seriously on-the-cheap approach is to run a few virtual servers on hardware slightly better than one of the individual servers would need. You are much less likely to kill the host (especially something like ESXi) to the point where you can't connect, and more likely to be able to afford the out-of-band management where you need it since you have fewer boxes.
Les Mikesell wrote:
You can buy add-on PCI-cards for OOB-management, though.
Thanks for the information, although unless they are really cheap...
The seriously on-the-cheap approach is to run a few virtual servers on hardware slightly better than one of the individual servers would need. You are much less likely to kill the host (especially something like ESXi) to the point where you can't connect, and more likely to be able to afford the out-of-band management where you need it since you have fewer boxes.
The OP has stated that he manages servers for his clients, meaning that they are at separate customer sites, probably even in different cities. Virtualization is an option only if they all belong to the same company or you provide cloud computing.
Ljubomir
On 6/30/11, Les Mikesell lesmikesell@gmail.com wrote:
The seriously on-the-cheap approach is to run a few virtual servers on hardware slightly better than one of the individual servers would need.
Actually THAT is the fundamental problem ;) The physical server is frankly much more powerful than the two guests running on it. I have the same applications + public web/email running on old dual-core machines with less memory than the guests.
Nothing that's being done is out of the ordinary, except that something ordinary, coupled with two virtual guests doing the same thing on the same physical disks, causes everything to go haywire. But because "it" is otherwise normal, I haven't figured out a way to pinpoint what it is after the previous issue was solved.
You are much less likely to kill the host (expecially something like ESXi) to the point where you can't connect,
Just my luck :D
and more likely to be able to afford the out-of-band management where you need it since you have fewer boxes.
Unfortunately not the case. Most of these are basically applications + email/web servers for small/medium customers so they are usually scattered at the client's office or different datacenters.
On 6/29/2011 4:04 PM, Emmanuel Noobadmin wrote:
On 6/30/11, Les Mikesell lesmikesell@gmail.com wrote:
The seriously on-the-cheap approach is to run a few virtual servers on hardware slightly better than one of the individual servers would need.
Actually THAT is the fundamental problem ;) The physical server is frankly much more powerful than the two guests running on it. I have the same applications + public web/email running on old dual-core machines with less memory than the guests.
Nothing that's being done is out of the ordinary, except that something ordinary, coupled with two virtual guests doing the same thing on the same physical disks, causes everything to go haywire. But because "it" is otherwise normal, I haven't figured out a way to pinpoint what it is after the previous issue was solved.
OK, but without knowing the cause, you already know the cure. Make the virtual servers not share physical disks - they will always want a single head to be in different places at the same time. And there is also probably some ugly stuff about how using files for virtual disk images and perhaps LVM on both the real and virtual side makes your disk blocks misaligned. Fixing that might help too.
Les Mikesell wrote:
On 6/29/2011 4:04 PM, Emmanuel Noobadmin wrote:
On 6/30/11, Les Mikesell lesmikesell@gmail.com wrote:
The seriously on-the-cheap approach is to run a few virtual servers on hardware slightly better than one of the individual servers would need.
Actually THAT is the fundamental problem ;) The physical server is frankly much more powerful than the two guests running on it. I have the same applications + public web/email running on old dual-core machines with less memory than the guests.
<snip>
OK, but without knowing the cause, you already know the cure. Make the virtual servers not share physical disks - they will always want a single head to be in different places at the same time. And there is also probably some ugly stuff about how using files for virtual disk images and perhaps LVM on both the real and virtual side makes your disk blocks misaligned. Fixing that might help too.
Here's another one, that I got from another admin talking to VMware: watch out just how many virtual CPUs you assign to each VM. If you've assigned 4, it is actually going to sit there waiting until it gets 4 virtual CPUs. As of '09, VMware was recommending assigning 2.
mark
On 6/30/11, m.roth@5-cent.us m.roth@5-cent.us wrote:
Here's another one, that I got from another admin talking to VMware: watch out just how many virtual CPUs you assign to each VM. If you've assigned 4, it is actually going to sit there waiting until it gets 4 virtual CPUs. As of '09, VMware was recommending assigning 2.
That was one of the first things I was careful about when setting this one up. The guests got 1 and 2 cores, leaving 1 spare core for the host. I manually pinned the guests to specific cores as well, to avoid any potential issues from spinlocking. But it's not helping, apparently.
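For reference, pinning with libvirt/KVM is done per vCPU, roughly like this (the guest names and core numbers are assumptions):

  virsh vcpupin guest1 0 1    # guest1 vCPU 0 -> host core 1
  virsh vcpupin guest2 0 2    # guest2 vCPU 0 -> host core 2
  virsh vcpupin guest2 1 3    # guest2 vCPU 1 -> host core 3

leaving core 0 for the host, as described above.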
On 6/29/2011 4:22 PM, m.roth@5-cent.us wrote:
Here's another one, that I got from another admin talking to VMware: watch out just how many virtual CPUs you assign to each VM. If you've assigned 4, it is actually going to sit there waiting until it gets 4 virtual CPUs. As of '09, VMware was recommending assigning 2.
That is, of course, when you have overcommitted them... If you have specified a certain number, you don't get a timeslice until that number are available at once - which sort of makes sense.
-- Les Mikesell lesmikesell@gmail.com
On Wed, Jun 29, 2011 at 5:22 PM, m.roth@5-cent.us wrote:
Les Mikesell wrote:
On 6/29/2011 4:04 PM, Emmanuel Noobadmin wrote:
On 6/30/11, Les Mikesell lesmikesell@gmail.com wrote:
The seriously on-the-cheap approach is to run a few virtual servers on hardware slightly better than one of the individual servers would need.
Actually THAT is the fundamental problem ;) The physical server is frankly much more powerful than the two guests running on it. I have the same applications + public web/email running on old dual-core machines with less memory than the guests.
<snip> OK, but without knowing the cause, you already know the cure. Make the virtual servers not share physical disks - they will always want a single head to be in different places at the same time. And there is also probably some ugly stuff about how using files for virtual disk images and perhaps LVM on both the real and virtual side makes your disk blocks misaligned. Fixing that might help too.
Here's another one, that I got from another admin talking to VMware: watch out just how many virtual CPUs you assign to each VM. If you've assigned 4, it is actually going to sit there waiting until it gets 4 virtual CPUs. As of '09, VMware was recommending assigning 2.
mark
This is no longer true [1], but it's still a good idea to only assign as many CPUs as you need.
[1] Source: VMware Engineer at VMware Forum 2011.
-☙ Brian Mathis ❧-
Brian Mathis wrote:
On Wed, Jun 29, 2011 at 5:22 PM, m.roth@5-cent.us wrote:
<snip>
Here's another one, that I got from another admin talking to VMware: watch out just how many virtual CPUs you assign to each VM. If you've assigned 4, it is actually going to sit there waiting until it gets 4 virtual CPUs. As of '09, VMware was recommending assigning 2.
This is no longer true [1], but it's still a good idea to only assign as many CPUs as you need.
[1] Source: VMware Engineer at VMware Forum 2011.
Ah, thanks! Yeah, the problem was with overcommitting. Glad to hear that's solved.
mark
--On Thursday, June 30, 2011 05:04:07 AM +0800 Emmanuel Noobadmin centos.admin@gmail.com wrote:
On 6/30/11, Les Mikesell lesmikesell@gmail.com wrote:
The seriously on-the-cheap approach is to run a few virtual servers on hardware slightly better than one of the individual servers would need.
Actually THAT is the fundamental problem ;) The physical server is frankly much more powerful than the two guests running on it. I have the same applications + public web/email running on old dual-core machines with less memory than the guests.
I don't recall you mentioning which VM solution you're using.
Some problematic areas that I've seen when using VMs:
+ memory ballooning sometimes causes problems (I've not actually seen it, but I've seen various warnings about having it enabled and resultant flakiness, and I run with it disabled)
+ I/O stacks not doing TCP segment offload correctly. This is an ugly one that bit me hard and took a while to track down. It's happened in both ESXi and Xen (and I'm not saying that KVM isn't affected, either).
The symptom of this is that things seem to be fine under low load, but as network traffic starts to increase, TCP sessions start stalling out or dying. I've seen it to the point where I can't even maintain an ssh session long enough to get a login prompt.
What it comes down to is that the top-level (virtual) OS decides to hand off TCP segmentation to the (virtual) NIC. To make a long story short, between the guest OS, the virtual NICs, the virtual switches, the host OS, and the physical NICs, there exists a path (depending on versions and hardware) where everyone thinks somebody else is doing TCP segment handling, and nobody is. So as I/O goes up or fragmentation occurs, the protocol goes into the toilet. Sometimes you miss packets and sometimes the data is corrupt.
Disabling TCP segment offload in both the host and guest avoids the problem (forcing the OS to do it instead of the VM & physical layers). Be aware of reboots and update processes that want to re-enable it ...
Devin
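The workaround described above boils down to turning the offloads off with ethtool in both host and guest, e.g. (the interface name is an assumption):

  ethtool -K eth0 tso off gso off gro off

and putting the same command somewhere persistent (rc.local or the interface's ifup scripts) so an update or reboot doesn't quietly turn it back on.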
On 6/30/11, Devin Reade gdr@gno.org wrote:
I don't recall you mentioning which VM solution you're using.
KVM :)
Some problematic areas that I've seen when using VMs:
- memory ballooning sometimes causes problems (I've not actually seen it, but I've seen various warnings about having it enabled and resultant flakiness, and I run with it disabled)
This might be one of the problems, because I just realized that while the swap used is still pretty small at around 200MB, it's about 5x the "normal" amount of about 40MB. But since I set an initial 1GB with an upper limit of 1.5GB, I'd expect the amount of memory available to be at least 1.5GB when swap usage goes up. However, this isn't the case; the ballooning doesn't seem to be happening, so maybe that's part of the problem: one of them just wanted to use a bit more memory for whatever reason, didn't get it, started hitting swap, and the I/O starts going crazy.
I/O stacks not doing TCP segment offload correctly. This is an ugly one that bit me hard and took a while to track down. It's happened in both ESXi and Xen (and I'm not saying that KVM isn't affected, either).
The symptom of this is that things seem to be fine under low load, but as network traffic starts to increase, TCP sessions start stalling out or dying. I've seen it to the point where I can't even maintain an ssh session long enough to get a login prompt.
This might be possible, but at the moment I'll consider it unlikely since the problem doesn't usually happen during high-load periods, i.e. not when the users are connecting to the email or app service during working hours.
So I'll KIV this first and see if simply setting the max/current memory without relying on ballooning works.
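A sketch of doing that with libvirt/KVM, assuming the guest is named guest1 (the name and sizes are assumptions): make the current allocation equal to the maximum so the guest gets its full memory up front instead of waiting on the balloon driver.

  # make <currentMemory> equal to <memory> in the domain XML (values in KiB), e.g.
  #   <memory>1572864</memory>
  #   <currentMemory>1572864</currentMemory>
  virsh edit guest1
  virsh shutdown guest1 && virsh start guest1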
On 29.06.2011 at 22:26, Emmanuel Noobadmin wrote:
On 6/30/11, Rainer Duffner rainer@ultra-secure.de wrote:
If it's a server that actually deserves that name, it should have IPMI on board.
Problem is, some of us work for budget-constrained customers and define a server by purpose and not specifications. So very often they buy servers based on budget and on whether it's good enough to run most applications for X users. Unfortunately, very often I'm the one who ends up managing these, simply because our applications run on them.
I'd go for a power-switch then. Less logic.
http://computers.shop.ebay.com/Computers-Networking-/58058/i.html?_nkw=remot... http://www.amazon.com/NP-0801D-Switchable-manufactured-Temperature-Monitorin...
You can buy add-on PCI-cards for OOB-management, though.
Thanks for the information, although unless they are really cheap...
Define "cheap". I live and work in 2011's 6th most expensive city of the world....
Virtualization is an option, but the trouble is: if the server is I/O-constrained anyway, virtualization won't help. Everything will just be even slower.
On 6/29/2011 3:43 PM, Rainer Duffner wrote:
Virtualization is an option, but the trouble is: if the server is I/O-constrained anyway, virtualization won't help. Everything will just be even slower.
That's sort-of true, but you don't have to manage the host through the same interface the guests use - and you need to solve the problem causing the issue anyway, not continue to work around it with restarts.
On Wednesday, June 29, 2011 04:43:09 PM Rainer Duffner wrote:
Virtualization is an option, but the trouble is: if the server is I/O-constrained anyway, virtualization won't help. Everything will just be even slower.
That depends. More expensive servers that would be suitable for virtualization host use also tend to have better I/O subsystems and faster disks. Relative to a 'cheap' system with much poorer base I/O bandwidth.
On 29.06.2011 at 23:17, Lamar Owen wrote:
On Wednesday, June 29, 2011 04:43:09 PM Rainer Duffner wrote:
Virtualization is an option, but the trouble is: if the server is I/O-constrained anyway, virtualization won't help. Everything will just be even slower.
That depends. More expensive servers that would be suitable for virtualization host use also tend to have better I/O subsystems and faster disks. Relative to a 'cheap' system with much poorer base I/O bandwidth.
The OP clearly stated that he's probably not running a datacenter full of DL580g7 servers...
On Wednesday, June 29, 2011 05:20:26 PM Rainer Duffner wrote:
On 29.06.2011 at 23:17, Lamar Owen wrote:
More expensive servers that would be suitable for virtualization host use also tend to have better I/O subsystems and faster disks. Relative to a 'cheap' system with much poorer base I/O bandwidth.
The OP clearly stated that he's probably not running a datacenter full of DL580g7 servers...
Yeah, I saw that. I was just addressing the I/O slowdown thing, where if you double the money you might very well get more than double the performance, and get two VM's running faster than on the cheaper hardware. But it seems he's already doing some virt.
Just not enough detail to sort that out.
Although it would really be interesting to me to see scheduler settings that would indeed allow something of a 'privileged' ssh or an OOB console that would be responsive even under a punishing load with lots of swapping, which is what the OP originally asked about.
Although it would really be interesting to me to see scheduler settings that would indeed allow something of a 'privileged' ssh or an OOB console that would be responsive even under a punishing load with lots of swapping, which is what the OP originally asked about.
I'd be interested to hear thoughts on this. We have a small 1U test server with 2 entry-level SATA drives that was brought to its knees twice this week by an overzealous Java process. Load averages were up around 60+ and, as a result, SSH access would time out. I don't know if this behaviour is typical across operating systems, but it's frustrating to find yourself locked out of a server just because a single process went to town on the I/O subsystem.
Cheers
Steve
On 30.06.2011 08:36, Steve Barnes wrote:
Although it would really be interesting to me to see scheduler settings that would indeed allow something of a 'privileged' ssh or an OOB console that would be responsive even under a punishing load with lots of swapping, which is what the OP originally asked about.
I'd be interested to hear thoughts on this. We have a small 1U test server with 2 entry-level SATA drives that was brought to its knees twice this week by an overzealous Java process. Load averages were up around 60+ and, as a result, SSH access would time out. I don't know if this behaviour is typical across operating systems, but it's frustrating to find yourself locked out of a server just because a single process went to town on the I/O subsystem.
Cheers
Steve
CentOS 6 will support cgroups, by which you can control cpu, memory and I/O.
http://www.mjmwired.net/kernel/Documentation/cgroups.txt
http://www.mjmwired.net/kernel/Documentation/cgroups/blkio-controller.txt
Alexander
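As a rough sketch of what that looks like on a kernel with the blkio controller (the paths and weight value are illustrative):

  mkdir -p /cgroup/blkio
  mount -t cgroup -o blkio none /cgroup/blkio
  mkdir /cgroup/blkio/lowprio
  echo 100 > /cgroup/blkio/lowprio/blkio.weight   # lower weight = smaller share (allowed range 100-1000)
  echo $PID > /cgroup/blkio/lowprio/tasks         # move the offending process into the group

so a runaway process can be confined instead of taking the whole disk with it.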
On Thu, Jun 30, 2011 at 4:38 AM, Alexander Dalloz ad+lists@uni-x.org wrote:
On 30.06.2011 08:36, Steve Barnes wrote:
Although it would really be interesting to me to see scheduler settings
that would indeed allow something of a 'privileged' ssh or an OOB console that would be responsive even under a punishing load with lots of swapping, which is what the OP originally asked about.
I'd be interested to hear thoughts on this. We have a small 1U test server with 2 entry-level SATA drives that was brought to its knees twice this week by an overzealous Java process. Load averages were up around 60+ and, as a result, SSH access would time out. I don't know if this behaviour is typical across operating systems, but it's frustrating to find yourself locked out of a server just because a single process went to town on the I/O subsystem.
Cheers
Steve
CentOS 6 will support cgroups, by which you can control cpu, memory and I/O.
http://www.mjmwired.net/kernel/Documentation/cgroups.txt
http://www.mjmwired.net/kernel/Documentation/cgroups/blkio-controller.txt
Just tried the disktop.stp script on a Linux 2.6.38 and it looks nice. The possibilities! :)
http://sourceware.org/systemtap/examples/io/disktop.stp
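For anyone wanting to try it on CentOS, running it is roughly (you need the systemtap package plus kernel-debuginfo matching the running kernel):

  yum install systemtap
  stap -v disktop.stp    # periodically prints the processes doing the most disk I/O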
Steve Barnes wrote:
I'd be interested to hear thoughts on this. We have a small 1U test server with 2 entry-level SATA drives that was brought to its knees twice this week by an overzealous Java process. Load averages were up around 60+ and, as a result, SSH access would time out. I don't know if this behaviour is typical across operating systems, but it's frustrating to find yourself locked out of a server just because a single process went to town on the I/O subsystem.
That privileged SSH process should be spawned from a RAM disk to avoid being caught up in I/O problems.
Another solution would be to have the core root directory (no /var/log and similar) on a RAM disk (either as a cache or as duplicated files) so important processes are independent of I/O problems.
Or maybe having that core root tree on a separate HDD and a separate HDD controller.
Ljubomir
2011/6/30 rainer@ultra-secure.de:
Steve Barnes wrote:
[...]
Or maybe having that core root tree on a separate HDD and a separate HDD controller.
Unfortunately, all this does not matter at all. The problem is: sshd is swapped out, and the system needs to swap out something else first before it can bring sshd back in.
How about buying more memory and faster hard disks?
-- Eero
Steve Barnes wrote:
[...]
Or maybe having that core root tree on a separate HDD and a separate HDD controller.
Unfortunately, all this does not matter at all. The problem is: sshd is swapped out, and the system needs to swap out
Hm, I thought the problem was I/O, not memory? If memory is not the problem then it has nothing to do with swapping (more correctly paging).
Simon
On 6/30/11, Simon Matter simon.matter@invoca.ch wrote:
Hm, I thought the problem was I/O, not memory? If memory is not the problem then it has nothing to do with swapping (more correctly paging).
After looking through the various replies here and rechecking whatever logs I managed to get, it might in a way be related to swapping - not on the host, which I am trying to get into, but on the guest.
On 6/30/11 6:11 AM, Emmanuel Noobadmin wrote:
On 6/30/11, Simon Mattersimon.matter@invoca.ch wrote:
Hm, I thought the problem was I/O, not memory? If memory is not the problem then it has nothing to do with swapping (more correctly paging).
After looking through the various replies here and rechecking whatever logs I managed to get, it might in a way be related to swapping - not on the host, which I am trying to get into, but on the guest.
Again, fixable by not sharing the disk the guest uses with the disk the host needs to load programs from... The disk head is always going to be in the wrong place.
But, odds are that the source of the problem is starting too many mail delivery programs, especially if they, or the user's local procmail, starts a spamassassin instance per message. Look at the mail logs for a problem time to see if you had a flurry of messages coming in. Sendmail/MimeDefang is fairly good at queuing the input and controlling the processes running at once but even with that you may have to throttle the concurrent sendmail processes.
On 6/30/11, Les Mikesell lesmikesell@gmail.com wrote:
Again, fixable by not sharing the disk the guest uses with the disk the host needs to load programs from... The disk head is always going to be in the wrong place.
Well, let's just say my original recommended specification for this particular setup was an HP ML110 G6 with 8GB, 2x 250GB for the host and 2x 1.5TB storage drives; I told them the extra memory and drives could be bought OTS so they don't have to pay HP prices for that.
What I ended up working with is a no-brand desktop quad core with a pair of 500GB... so the chances of convincing them to fork out extra for hardware aren't good. Unfortunately, I'm stuck with making things work because managing the server was part of the contract sold with the apps.
But, odds are that the source of the problem is starting too many mail delivery programs, especially if they, or the user's local procmail, starts a spamassassin instance per message. Look at the mail logs for a problem time to see if you had a flurry of messages coming in. Sendmail/MimeDefang is fairly good at queuing the input and controlling the processes running at once but even with that you may have to throttle the concurrent sendmail processes.
Does it make a difference if I'm running Exim instead of sendmail/MimeDefang? Right now it doesn't look like a mail run, more like an httpd run, because it's starting to look like a large number of httpd threads were spawned just before that.
Unfortunately, I also discovered that logrotate was wrongly configured and I only have daily logs. Fixed that, hopefully, and shall see if I get something better to work on if it strikes again.
On 6/30/2011 12:39 PM, Emmanuel Noobadmin wrote:
But, odds are that the source of the problem is starting too many mail delivery programs, especially if they, or the user's local procmail, starts a spamassassin instance per message. Look at the mail logs for a problem time to see if you had a flurry of messages coming in. Sendmail/MimeDefang is fairly good at queuing the input and controlling the processes running at once but even with that you may have to throttle the concurrent sendmail processes.
Does it make a difference if I'm running Exim instead of sendmail/MimeDefang?
The principle is the same but the way to control it would be different. Spamassassin is a perl program that uses a lot of memory and takes a lot of resources to start up. If you run a lot of copies at once, expect the machine to crawl or die. MimeDefang, being mostly perl itself, runs spamassassin in its own process and has a way to control the number of instances - and does it in a way that doesn't tie a big perl process to every sendmail instance. Other systems might run the spamd background process and queue up the messages to scan. The worst case is something that starts a new process for every received message and keeps the big perl/spamassassin process running for the duration - you might also see this with spamassassin runs happening in each user's .procmailrc. One thing that might help is to make sure the spam/virus check operations happen in an order that starts with the least resource usage and the most likely checks to cause rejection so spamassassin might not have to run so much.
Right now it doesn't look like a mail run, more like an httpd run, because it's starting to look like a large number of httpd threads were spawned just before that.
The same principle applies there, especially if you have big cgi programs or mod_perl, mod_python, mod_php (etc.) modules that use a lot of resources. You are probably running in pre-forking mode so those programs quickly stop sharing memory in the child processes (perl is particularly bad about this since variable reference counts are always being updated). Even if you handle normal load, you might have a problem when a search engine indexer walks your links and fires off more copies than usual. You can get an idea of how much of a problem you have here by looking at the RES size of the httpd processes in top. If they are big and fairly variable, you have some pages/modules/programs that consume a lot of memory. You can limit the number of concurrent processes, and in some cases it might help to reduce their life (MaxRequestsPerChild).
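As a sketch, the knobs being described live in the prefork section of httpd.conf; the numbers here are placeholders to be sized against the RES figures you see in top:

  <IfModule prefork.c>
      StartServers          8
      MaxClients           60
      MaxRequestsPerChild 500
  </IfModule>

Keeping MaxClients times the typical RES size below the RAM you can spare is the usual rule of thumb.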
On 7/1/11, Les Mikesell lesmikesell@gmail.com wrote:
The principle is the same but the way to control it would be different. Spamassassin is a perl program that uses a lot of memory and takes a lot of resources to start up. If you run a lot of copies at once, expect the machine to crawl or die.
This I had experienced before, which is why the first thing I look at usually is the mail processes.
MimeDefang, being mostly perl itself, runs spamassassin in its own process and has a way to control the number of instances - and does it in a way that doesn't tie a big perl process to every sendmail instance. Other systems might run the spamd background process and queue up the messages to scan. The worst case is something that starts a new process for every received message and keeps the big perl/spamassassin process running for the duration - you might also see this with spamassassin runs happening in each user's .procmailrc. One thing that might help is to make sure the spam/virus check operations happen in an order that starts with the least resource usage and the most likely checks to cause rejection so spamassassin might not have to run so much.
I do have greylisting and other measures in place to reject as much mail as possible before spamd runs, so there's probably not much more I could do on that side without learning to program Exim conf.
The same principle applies there, especially if you have big cgi programs or mod_perl, mod_python, mod_php (etc.) modules that use a lot of resources. You are probably running in pre-forking mode so those programs quickly stop sharing memory in the child processes (perl is particularly bad about this since variable reference counts are always being updated). Even if you handle normal load, you might have a problem when a search engine indexer walks your links and fires off more copies than usual. You can get an idea of how much of a problem you have here by looking at the RES size of the httpd processes in top. If they are big and fairly variable, you have some pages/modules/programs that consume a lot of memory. You can limit the number of concurrent processes, and in some cases it might help to reduce their life (MaxRequestsPerChild).
I'll keep this in mind if the current fix doesn't hold up (no ballooning, higher starting memory for the VM) which it appears to so far.
Oh, one other thing... Do the web programs use mysql for anything? I've seen mysql do some really dumb things on a 3-table join, like make a temporary table containing all the join possibilities, sort it, then return the small number of rows you asked for with a LIMIT. Maybe it is better these days, but that used to happen even when there were indexes on the fields involved, and if any of the tables were big it would take a huge amount of disk activity.
Most of the apps run off mysql; the likely culprit could be the Wordpress corporate blog they have, since that probably invites all kinds of spambots and whatnot. It's definitely not our customized apps, since we basically have an audit trail of every single command issued to the system, and so, although I don't have the relevant httpd logs due to the logrotate error, I'm certain no cron jobs were running and nobody was accessing it at those times.
On 6/30/2011 12:39 PM, Emmanuel Noobadmin wrote:
Right now it doesn't look like an mail run, more like a httpd run because it's starting to look like a large number of httpd threads was spawned just before that.
Oh, one other thing... Do the web programs use mysql for anything? I've seen mysql do some really dumb things on a 3-table join, like make a temporary table containing all the join possibilities, sort it, then return the small number of rows you asked for with a LIMIT. Maybe it is better these days, but that used to happen even when there were indexes on the fields involved, and if any of the tables were big it would take a huge amount of disk activity.
At Fri, 1 Jul 2011 01:39:19 +0800 CentOS mailing list centos@centos.org wrote:
On 6/30/11, Les Mikesell lesmikesell@gmail.com wrote:
Again, fixable by not sharing the disk the guest uses with the disk the host needs to load programs from... The disk head is always going to be in the wrong place.
Well, let's just say my original recommended specification for this particular setup was an HP ML110 G6 with 8GB, 2x 250GB for the host and 2x 1.5TB storage drives; I told them the extra memory and drives could be bought OTS so they don't have to pay HP prices for that.
What I ended up working with is a no-brand desktop quad core with a pair of 500GB... so the chances of convincing them to fork out extra for hardware aren't good. Unfortunately, I'm stuck with making things work because managing the server was part of the contract sold with the apps.
But, odds are that the source of the problem is starting too many mail delivery programs, especially if they, or the user's local procmail, starts a spamassassin instance per message. Look at the mail logs for a problem time to see if you had a flurry of messages coming in. Sendmail/MimeDefang is fairly good at queuing the input and controlling the processes running at once but even with that you may have to throttle the concurrent sendmail processes.
Does it make a difference if I'm running Exim instead of sendmail/MimeDefang?
Probably not. I suspect that Exim also has a throttling parameter setting.
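For the record, Exim does have equivalent knobs in its main configuration section; as a sketch, with illustrative values:

  smtp_accept_max = 50
  smtp_accept_queue_per_connection = 10
  queue_only_load = 8
  deliver_queue_load_max = 10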
Right now it doesn't look like a mail run, more like an httpd run, because it's starting to look like a large number of httpd threads were spawned just before that.
OK, there are probably settings for Apache to run fewer threads. Probably better to have a "Server too busy" type of message than a wedged server. (And most likely the extra httpd threads will just be spambots of some sort anyway -- who cares if they get tossed...)
Unfortunately, I also discovered that logrotate was wrongly configured and I only have daily logs. Fixed that, hopefully, and shall see if I get something better to work on if it strikes again.
On 6/30/2011 4:53 PM, Robert Heller wrote:
Right now it doesn't look like a mail run, more like an httpd run, because it's starting to look like a large number of httpd threads were spawned just before that.
OK, there are probably settings for Apache to run fewer threads. Probably better to have a "Server too busy" type of message than a wedged server. (And most likely the extra httpd threads will just be spambots of some sort anyway -- who cares if they get tossed...)
With the launch of Living Social, we have had a few clients use that service, and you will suddenly have all Apache instances running and the server acting anywhere from very laggy to all but unresponsive. I have cut back on the total number of Apache instances due to these 'non-attacks', which are much like a DoS attack. It seems the first day is horrid, the second not so bad, and it wanes from there.
This really raises a new question of what to do to handle such broadcast ads. We run very conservative server loads, but...
I don't recommend running it all the time, only when you need to catch something, but server-status can be your friend. You can run a refresh in your browser... leave it running in a tab set to refresh, say, once every minute or five. It will show the instances of Apache and the files being accessed. Much faster than digging through logs in a virtual server environment. This feature is built into Apache, but is not on by default. Look at your httpd.conf file.
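Enabling it looks roughly like this in httpd.conf on Apache 2.2 (restrict it to your own address; the IP below is a placeholder):

  ExtendedStatus On
  <Location /server-status>
      SetHandler server-status
      Order deny,allow
      Deny from all
      Allow from 192.0.2.10
  </Location>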
On 6/30/11, rainer@ultra-secure.de rainer@ultra-secure.de wrote:
Unfortunately, all this does not matter at all. The problem is: sshd is swapped out, and the system needs to swap out something else first before it can bring sshd back in.
There appear to be some functions available to programs to lock their process pages in memory, mlock and mlockall. But I can't seem to find a command-line equivalent that might be able to keep sshd locked into memory.
In any case, I've ioniced and reniced sshd and will see if that helps.
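For reference, that sort of thing looks like the following (run as root; ionice's realtime class requires the CFQ scheduler):

  renice -10 -p $(pgrep -x sshd)
  ionice -c1 -n0 -p $(pgrep -x sshd)

As noted above, this still won't help once sshd's pages have been pushed out to swap.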
On Thu, 30 Jun 2011, rainer@ultra-secure.de wrote:
Steve Barnes wrote:
[...]
Or maybe having that core root tree on a separate HDD and a separate HDD controller.
Unfortunately, all this does not matter at all. The problem is: sshd is swapped out, and the system needs to swap out something else first before it can bring sshd back in.
Reduce your chances of it being kicked out into swap as a result of i/o:
sysctl -q vm.swappiness=0
If that improves things, add an appropriate line into sysctl.conf.
jh
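The persistent form of that suggestion, for reference:

  # /etc/sysctl.conf
  vm.swappiness = 0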
Although it would really be interesting to me to see scheduler settings that would indeed allow something of a 'privileged' ssh or an OOB console that would be responsive even under a punishing load with lots of swapping, which is what the OP originally asked about.
I should add, we have OOB management facilities available, but even the console login was unresponsive. The one SSH login that was logged in at the time the trouble started wasn't capable of terminating any of the problematic processes or issuing a graceful reboot sequence. Pressing the power button and/or Ctrl+Alt+Delete would return "Shutdown: already in progress" (which I eventually gave up waiting for).
Cheers
Steve
On Wed, Jun 29, 2011 at 4:15 PM, Emmanuel Noobadmin centos.admin@gmail.com wrote:
On 6/30/11, Rainer Duffner rainer@ultra-secure.de wrote:
Yes, it's called "out of band management". Have dial-in access to IPMI/iLO interfaces, or just an APC remote-controlled power switch to power off the server.
I don't want to reboot the server every time something like that happens. I expect pretty nasty problems will develop after a few dozen unclean shutdowns like that.
Would ILO work on a server that's unresponsive due to heavy load? The actual network access isn't a problem, so dial-up isn't necessary. The other problem is that the server in question probably doesn't have ILO features on the mainboard.
Doing a hard power-off is extreme, but could be the last resort option.
ILO is just one product (by HP) that provides out-of-band management for servers. Dell has DRAC, and there are others. They allow you access to the server's console as if you are standing there, as well as other functions like power on/off, virtual CD drive, etc... These are usually built into the server, so you can't really add them on later.
You can get similar functionality by using a remote IP-based KVM. They only provide the remote console, not power on/off or virtual CD. For a single server, a low cost option is the Lantronix Spider or Spider Duo. It provides a remote console for a single server for a few hundred $$$s.
An alternative that is usable for Linux servers is a remote serial console; it allows you to ssh into it and then connect to the serial port of the server. You will need to set up the BIOS, grub, and a serial getty to be able to log in to a server this way. wti.com makes a good one that I currently use.
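A minimal CentOS-style sketch of that serial console setup (the speed, unit number and runlevels are assumptions to adapt):

  # /boot/grub/grub.conf
  serial --unit=0 --speed=115200
  terminal --timeout=5 serial console
  #   ... and append to the kernel line:  console=tty0 console=ttyS0,115200n8

  # /etc/inittab (CentOS 5) -- spawn a login on the serial port
  S0:2345:respawn:/sbin/agetty ttyS0 115200 vt100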
All of these solutions are "out of band" meaning they do not directly interface with the operating system, so if there's a problem with the server, they are not affected by it.
Your name suggests you are new to sysadmin work. One of the lessons here is to always have at least one method of out-of-band management as part of the non-negotiable requirements for a server, especially a remote one.
-☙ Brian Mathis ❧-
--On Thursday, June 30, 2011 04:15:07 AM +0800 Emmanuel Noobadmin centos.admin@gmail.com wrote:
Would ILO work on a server that's unresponsive due to heavy load?
ILO or any other OOB solution gives you the functionality of sitting at the console. So if the problem is one that would cause the console to be unresponsive, you're still not going to be able to log in. OTOH, if console is responsive but network-based access such as ssh is not, then OOB may help.
As others have said, OOB is always a good idea. When the money managers question the cost, compare it to the cost of servers becoming unavailable, someone having to travel to the site, etc. For aftermarket stuff, in addition to what others have mentioned, search the list archives for threads mentioning 'ipeps'.
Getting back to determining the source of the problem:
I would suggest that you might want to look at running sar. It's something that will collect various system statistics constantly, so it's good, when there's an event, for going back after the fact and showing what led up to it. It may not tell you the actual problem, but the statistics can help isolate the cause.
Be aware of Heisenberg, though. You suspect that your problem is I/O based. Sar is going to increase your I/O in order to log the stats to disk. If you have sar sampling too often, you're going to increase how often your problem happens (or make it happen faster). If you don't sample often enough, the lack of resolution in the data can hide what is actually going on with the system. (Grabbing a number out of the air, you might be able to start sampling at once per minute.) If you can log sar stats to a different disk, it might help.
Also be aware that when things grind to a halt, you're probably not going to get stats. So what you see (after reboot) may just include the *start* of the event.
sar is part of the sysstat rpm.
Devin
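On CentOS the sampling interval lives in the sysstat cron file; the stock entry samples every 10 minutes, and tightening it to once a minute would look like this (the path is /usr/lib/sa on 32-bit systems):

  # /etc/cron.d/sysstat
  */1 * * * * root /usr/lib64/sa/sa1 1 1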
On Wed, Jun 29, 2011 at 4:50 PM, Emmanuel Noobadmin centos.admin@gmail.com wrote:
I was having problems with the same server locking up to the point where I can't even get in via SSH. I've already used HTB/TC to reserve bandwidth for my SSH port, but the problem now isn't an attack on the bandwidth. So I'm trying to figure out if there's a way to ensure that SSH is given CPU and I/O priority.
However, so far my reading seems to imply that it's probably not going to help if the issue is I/O related, and/or it would require escalating SSH to such levels (above paging/filesystem processes) that it becomes a really bad idea.
Since I'm not the only person who faces problems trying to remotely access a locked-up server, surely somebody must have come up with a solution that didn't involve somebody/something hitting the power button?
I would approach this issue from another perspective: who's locking up the server (as in eating all resources) and how to stop/constrain it. You can try to renice the sshd process and see what happens. I'm not entirely sure what 'locked up' means in this context.
On 6/30/11, Giovanni Tirloni gtirloni@sysdroid.com wrote:
I would approach this issue from another perspective: who's locking up the server (as in eating all resources) and how to stop/constrain it. You can try to renice the sshd process and see what happens. I'm not entirely sure what 'locked up' means in this context.
Server's unresponsive to the external world. It isn't dead; on two occasions, when it happened at times like Sunday or 1am in the night, I could afford to wait it out and see that it eventually did recover from whatever it was.
It's almost definitely related to disk I/O due to the VM guests fighting over the disks where their virtual disk files are. However, the hard part is figuring out the exact factors; I know CPU isn't an issue, having set up scripts to log top output when the load goes above 5.
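A sketch of that kind of load watchdog, for anyone wanting to do the same (the threshold and log path are arbitrary):

  #!/bin/bash
  # snapshot top when the 1-minute load average crosses a threshold; run from cron
  THRESHOLD=5
  LOAD=$(awk '{print int($1)}' /proc/loadavg)
  if [ "$LOAD" -ge "$THRESHOLD" ]; then
      { date; top -b -n1; } >> /var/log/load-snapshots.log
  fi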
On Wed, Jun 29, 2011 at 5:57 PM, Emmanuel Noobadmin centos.admin@gmail.com wrote:
On 6/30/11, Giovanni Tirloni gtirloni@sysdroid.com wrote:
I would approach this issue from another perspective: who's locking up the server (as in eating all resources) and how to stop/constrain it. You can try to renice the sshd process and see what happens. I'm not entirely sure what 'locked up' means in this context.
Server's unresponsive to the external world. It isn't dead; on two occasions, when it happened at times like Sunday or 1am in the night, I could afford to wait it out and see that it eventually did recover from whatever it was.
It's almost definitely related to disk I/O due to the VM guests fighting over the disks where their virtual disk files are. However, the hard part is figuring out the exact factors; I know CPU isn't an issue, having set up scripts to log top output when the load goes above 5.
Linux includes I/O in how it calculates the load average so you're not measuring CPU alone.
What does top show? Any error messages in /var/log during the time the server is unresponsive? Is network responsive? Latency normal too?
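One quick way to tell the two apart is to look for tasks stuck in uninterruptible sleep (state D), since they inflate the load average just like runnable tasks; for example:

ps -eo state,pid,comm | awk '$1 == "D"'    # tasks currently blocked on I/O
vmstat 5                                   # watch the b (blocked) and wa (iowait) columns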
On 6/30/11, Giovanni Tirloni gtirloni@sysdroid.com wrote:
Linux includes I/O in how it calculates the load average so you're not measuring CPU alone.
On the host, it's expected: I've got two qemu-kvm processes running at 100% CPU. Within the guest VM, top looks like this: high load but low CPU%.
top - 10:21:40 up 1 day, 59 min,  0 users,  load average: 16.72, 6.05, 2.29
Tasks: 176 total,   1 running, 175 sleeping,   0 stopped,   0 zombie
Cpu(s):  3.3%us,  1.2%sy,  1.2%ni, 91.2%id,  2.7%wa,  0.1%hi,  0.2%si,  0.0%st
Mem:   1017392k total,   970564k used,    46828k free,     1436k buffers
Swap:  2040244k total,   200572k used,  1839672k free,    30344k cached
What does top show? Any error messages in /var/log during the time the server is unresponsive? Is network responsive? Latency normal too?
I think the network is responsive: pings work, but nothing else does. There are no error messages on either the host or the guest; faillog, messages and dmesg give no clue. Which is why I figured I really need to be logged in so I can check and, if necessary, kill processes one by one, innocent or not, until I find the culprit while it's going crazy.
At Thu, 30 Jun 2011 05:31:05 +0800 CentOS mailing list centos@centos.org wrote:
On 6/30/11, Giovanni Tirloni gtirloni@sysdroid.com wrote:
Linux includes I/O in how it calculates the load average so you're not measuring CPU alone.
On the host, it's expected: I've got two qemu-kvm processes running at 100% CPU. Within the guest VM, top looks like this: high load but low CPU%.
top - 10:21:40 up 1 day, 59 min,  0 users,  load average: 16.72, 6.05, 2.29
Tasks: 176 total,   1 running, 175 sleeping,   0 stopped,   0 zombie
Cpu(s):  3.3%us,  1.2%sy,  1.2%ni, 91.2%id,  2.7%wa,  0.1%hi,  0.2%si,  0.0%st
Mem:   1017392k total,   970564k used,    46828k free,     1436k buffers
Swap:  2040244k total,   200572k used,  1839672k free,    30344k cached
What does top show? Any error messages in /var/log during the time the server is unresponsive? Is network responsive? Latency normal too?
I think the network is responsive: pings work, but nothing else does. There are no error messages on either the host or the guest; faillog, messages and dmesg give no clue. Which is why I figured I really need to be logged in so I can check and, if necessary, kill processes one by one, innocent or not, until I find the culprit while it's going crazy.
This looks a lot like my server looked/looks when being hit by the spambot(s): lots of I/O, not much CPU, moving message bytes around, etc. What does maillog look like? There won't be errors, but how much message traffic is there?
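A crude way to gauge that, assuming a syslog-style maillog (field positions and the date pattern will need adjusting for your MTA and locale):

# count maillog lines per hour for one day, as a rough proxy for traffic
grep 'Jun 29 ' /var/log/maillog | awk '{print $3}' | cut -d: -f1 | sort | uniq -c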
On Thu, Jun 30, 2011 at 03:50:30AM +0800, Emmanuel Noobadmin wrote:
I was having problems with the same server locking up to the point I can't even get in via SSH. I've already used HTB/TC to reserve bandwidth for my SSH port but the problem now isn't an attack on the bandwidth. So I'm trying to figure out if there's a way to ensure that SSH is given cpu and i/o priority.
As you've probably figured out, the short answer is no. There are sometimes workarounds, of course.
Since I'm not the only person who face problems trying to remotely access a locked up server, surely somebody must had come up with a solution that didn't involve somebody/something hitting the power button?
In addition to the suggestions already made, one possibility is to attach a serial console or IP KVM. Logging in may still be awful, but at least you won't have to go through sshd. I've been able to log in through a serial getty when sshd was not responding or taking too long (this works maybe 50-75% of the time; the rest of the time it's too late, and even getty is unresponsive). You have the added advantage of being able to log in directly as root if you have PermitRootLogin no in your sshd_config.
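For what it's worth, the serial-getty side can be set up roughly like this on a CentOS 5-style box (the port, speed and terminal type here are assumptions; adjust them to the hardware):

echo 'S0:2345:respawn:/sbin/agetty ttyS0 9600 vt100' >> /etc/inittab
echo 'ttyS0' >> /etc/securetty    # permit root logins on the serial line (harmless if already listed)
init q                            # make init reread inittab
# optionally add console=ttyS0,9600 to the kernel line in grub.conf so boot
# and panic messages go out the serial port as well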
If your I/O problem is due to running out of memory and thrashing swap, you can try to be more aggressive with the OOM killer settings.
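One hedged way to do that on 2.6-era kernels is to shield sshd from the OOM killer and, if your kernel has the sysctl, have it kill the task that triggered the OOM instead of hunting for a victim:

for pid in $(pgrep -x sshd); do
    echo -17 > /proc/$pid/oom_adj          # -17 means "never OOM-kill this process"
done
sysctl -w vm.oom_kill_allocating_task=1    # only if your kernel supports it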
As someone else mentioned, it might help if you elaborated on "locked up". What are the common scenarios you see?
--keith
On Wed, 29 Jun 2011, Keith Keller wrote:
In addition to the suggestions already made, one possibility is to attach a serial console or IP KVM. Logging in may still be awful, but at least you won't have to go through sshd. I've been able to log in through a serial getty when sshd was not responding or taking too long (this works maybe 50-75% of the time; the rest of the time it's too late, and even getty is unresponsive). You have the added advantage of being able to log in directly as root if you have PermitRootLogin no in your sshd_config.
Even with OOB console access, there's still the problem of /bin/login timing out on highly loaded servers. The login.c source in the util-linux package hardwires the login timeout to 60 seconds. If your server can't process the login request in under a minute (not unusual if the load average is high and/or the machine is using swap), you can't login via *any* console.
So if killing the machine doesn't appeal to you, you still need OOB console access plus
* a patched version of /bin/login with a longer timeout, or
* a process-watcher that aggressively kills known troublemakers, or
* a remotely accessible console that never logs out.
I actually relied for a while on the last choice. I had a remotely accessible root shell that never logged out. When things got sluggish, I was able to /bin/kill to my heart's content. It wasn't a pretty solution, but it kept me running until I was able to solve the problem properly.
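For the process-watcher option above, a very rough sketch, run from cron every minute or two (the threshold and the process name are placeholders for whatever your known troublemaker is):

#!/bin/bash
# kill a known troublemaker whenever the 1-minute load average gets too high
THRESHOLD=20
OFFENDER=clamd    # placeholder: substitute the real culprit
load=$(awk '{print int($1)}' /proc/loadavg)
if [ "$load" -ge "$THRESHOLD" ]; then
    logger -t watcher "load $load >= $THRESHOLD, killing $OFFENDER"
    pkill -9 -x "$OFFENDER"
fi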
On 6/30/11, Paul Heinlein heinlein@madboa.com wrote:
I actually relied for a while on the last choice. I had a remotely accessible root shell that never logged out. When things got sluggish, I was able to /bin/kill to my heart's content. It wasn't a pretty solution, but it kept me running until I was able to solve the problem properly.
Would this work without the OOB hardware? E.g. if I leave a detached screen'd SSH session open from another server, then ionice + nice that shell on the problem server?
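Roughly, that idea would look like this on the problem server (assuming a root shell and the cfq I/O scheduler):

screen -S rescue       # named screen session, started over SSH
renice -10 -p $$       # raise this shell's CPU priority
ionice -c2 -n0 -p $$   # and give it the top best-effort I/O priority
# detach with Ctrl-a d, reattach later with: screen -r rescue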
On Thu, 30 Jun 2011, Emmanuel Noobadmin wrote:
On 6/30/11, Paul Heinlein heinlein@madboa.com wrote:
I actually relied for a while on the last choice. I had a remotely accessible root shell that never logged out. When things got sluggish, I was able to /bin/kill to my heart's content. It wasn't a pretty solution, but it kept me running until I was able to solve the problem properly.
Would this work without the OOB hardware? E.g. if I leave a detached screen'd SSH session open from another server, then ionice + nice that shell on the problem server?
I wouldn't rely on that setup.
In my case, the problem server was connected to a Digi console server via its serial port. On another (non-problematic) server, I opened a screen session and connected to the console on the problem server. I had to adjust some timeouts here and there to ensure an eternal console. :-)
On 06/29/11 14:50, Emmanuel Noobadmin wrote:
I was having problems with the same server locking up to the point I can't even get in via SSH.
investigate instead of band-aiding...
1) syslog to a remote host (rough sketch at the end of this message). remote syslogging rarely stops working even when the system is disk/iowait bound.
2) log diving. is there anything in the logs around the time of the incidents?
large emails (100MB+ in the body, not as an attachment) can freak out some versions of SpamAssassin... server load would reach 300+, which timed out SSH connections. the syslogs took time to wade through, but they pinpointed the recurring issue.
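for the remote-syslog bit in 1), a minimal sketch (the loghost name is a placeholder, and the syntax is the classic sysklogd/rsyslog form over UDP):

echo '*.info    @loghost.example.com' >> /etc/syslog.conf
service syslog restart    # CentOS 5; on CentOS 6 edit rsyslog.conf and restart rsyslog
# the receiving box must accept remote messages, e.g. on a CentOS 5 loghost set
# SYSLOGD_OPTIONS="-r" in /etc/sysconfig/syslog and restart syslog there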