I have a CentOS 6.5 x86_64 system that's been running problem-free for quite a while.
Recently, it has locked up hard several times. It's a headless server, but I do have IP KVM. However, when it's locked up, all I can see are a few lines of kernel stack trace. There are no hints to the problem in the system logs. I even enabled remote logging of syslog, hoping to catch the errors that way. No luck.
I ran memtest86+ for about 36 hours, no problems.
I've tried to strip away just about all running services. It's just a home file server. I haven't had a crash in a while, but I also haven't had it running very long.
But even while it's up, I have severe input lag in the shell. I'll type a few characters, and 2 to 10 or so seconds pass before anything echoes to the screen.
I've checked top: practically zero CPU load.
It's not swapping - 16 GB of RAM, 0 swap used. The most memory-heavy process is java (for CrashPlan backups).
iostat shows 0% disk utilization.
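For anyone repeating the swap check in a script rather than reading it off top, here is a minimal Python sketch. It assumes the usual Linux /proc/meminfo layout (SwapTotal and SwapFree lines, values in kB). The function name and the sample values are made up for illustration and are not taken from this server.

```python
def swap_used_kb(meminfo_text):
    """Return swap in use (kB), computed as SwapTotal - SwapFree."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only "Name:  <number> [unit]" style lines.
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields["SwapTotal"] - fields["SwapFree"]


# Hypothetical sample in /proc/meminfo format (values invented for illustration).
SAMPLE_MEMINFO = """\
MemTotal:       16330148 kB
MemFree:         9120344 kB
SwapTotal:       8388604 kB
SwapFree:        8388604 kB
"""

if __name__ == "__main__":
    print(swap_used_kb(SAMPLE_MEMINFO))
    # On a real Linux box, read the live file instead:
    # with open("/proc/meminfo") as f:
    #     print(swap_used_kb(f.read()))
```

A result of 0 matches the "0 swap used" observation above; anything well above zero would point back at memory pressure.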
Anyone seen anything like this? Where else can I check to try to determine the source of this lag (which I suspect might be related to the recent crashes)?
Thanks, Matt
Is it under some type of DDoS attack?
What's running on this machine? In front of it?
-----Original Message----- From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On Behalf Of Matt Garman Sent: Thursday, October 09, 2014 11:45 PM To: CentOS mailing list Subject: [CentOS] centos 6.5 input lag
Thanks, Matt _______________________________________________ CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos
On Thu, Oct 9, 2014 at 11:20 PM, Joseph L. Brunner joe@affirmedsystems.com wrote:
> Is it under some type of DDoS attack?
> What's running on this machine? In front of it?
A DDoS attack seems unlikely, though I suppose it's possible. Sitting between the lagging machine and the Internet is a pfSense box. All the other machines in the house have no issues, and they all route through the pfSense system.
Right now, the only stuff running on it:
- CrashPlan (java backup application)
- Munin
- Apache (only for Munin, no external access [i.e. no port forwarding from pfSense])
- mpd (music player daemon)
Thanks, Matt
If this is a server - is it possible your raid card battery died?
We have seen issues where the BBWC fails and the box crawls.
The only other thing on the hardware side that comes to mind is actual bad sectors if this is not a raided virtual drive.
From the OS side can you keep the box up long enough to do a yum update?
thanks
-----Original Message----- From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On Behalf Of Matt Garman Sent: Friday, October 10, 2014 7:48 AM To: CentOS mailing list Subject: Re: [CentOS] centos 6.5 input lag
On Fri, Oct 10, 2014 at 4:11 PM, Joseph L. Brunner joe@affirmedsystems.com wrote:
> If this is a server - is it possible your raid card battery died?
It is a server, but a home file server. The raid card has no battery backup, and in fact has been flashed to pure HBA mode. Actual RAID'ing is done at the software level.
> The only other thing on the hardware side that comes to mind is actual bad sectors if this is not a raided virtual drive.
The system has eight total drives: two SSDs in RAID-1 for the OS, five 3.5" spinning drives in RAID-6, and a single 3.5" drive normally used for mythtv recordings (though mythtv has been stopped for a long time now to try to debug the issue).
> From the OS side can you keep the box up long enough to do a yum update?
Yes, I updated everything except packages beginning with "l" ("el" / lowercase 'L'), because those generated a number of conflicts that I haven't had time to resolve.
Update on this problem:
From another system, I initiated a constant ping on my laggy server.
I noticed that every 10 to 20 seconds, one or more ICMP packets would drop. These drops coincided with the input lag I was experiencing.
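To spot the drops without watching the terminal, the ping output can be scanned for gaps in the sequence numbers. The following is a minimal Python sketch, assuming Linux iputils-style reply lines that contain "icmp_seq=N". The function names, the sample address, and the sample output are hypothetical and only illustrate the idea.

```python
import re

# Matches the sequence number in a Linux ping reply line.
SEQ_RE = re.compile(r"icmp_seq=(\d+)")


def received_sequences(ping_output):
    """Return the icmp_seq numbers of all replies found in the ping output."""
    return [int(m.group(1)) for m in SEQ_RE.finditer(ping_output)]


def missing_sequences(seqs):
    """Return sequence numbers absent between the first and last reply seen."""
    if not seqs:
        return []
    seen = set(seqs)
    return [n for n in range(min(seqs), max(seqs) + 1) if n not in seen]


# Hypothetical capture: replies 3 and 4 never arrived.
SAMPLE_PING = """\
64 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=0.31 ms
64 bytes from 192.168.1.10: icmp_seq=2 ttl=64 time=0.29 ms
64 bytes from 192.168.1.10: icmp_seq=5 ttl=64 time=0.33 ms
64 bytes from 192.168.1.10: icmp_seq=6 ttl=64 time=0.30 ms
"""

if __name__ == "__main__":
    print(missing_sequences(received_sequences(SAMPLE_PING)))
```

Saving the output of a long-running ping to a file and feeding it to these functions gives a list of lost packets that can be lined up against the moments the shell stalled.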
I did a web search for "linux periodically hangs" and found this Serverfault post that had a lot in common with my symptoms:
http://serverfault.com/questions/371666/linux-bonded-interfaces-hanging-peri...
I in fact have bonded interfaces on the laggy server. When I checked the bonding config, I realized a while ago I had changed from balance-rr / mode 0, to 802.3ad / mode 4. (I did this because I kept getting "bond0: received packet with own address as source address" when using balance-rr with a bridge interface. The bridge interface was for using KVM.)
For now, I simply disabled one of the slave interfaces, and the lag / dropped ICMP packets problem has gone away.
Like the Serverfault poster, I have an HP ProCurve 1800-24G switch. The switch is supposed to support 802.3ad link aggregation. It's not a managed switch, so I (perhaps incorrectly) assumed that 802.3ad would magically just work. Either there is more required to make it work, or its implementation is broken. Curiously, however, running my bond0 in 802.3ad mode did work without any issue for over a month.
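One way to check whether the switch is actually negotiating 802.3ad is to look at /proc/net/bonding/bond0. The Python sketch below is illustrative only: it assumes the usual layout of that file ("Bonding Mode:", "Slave Interface:" and per-slave "Aggregator ID:" lines), and it assumes that slaves ending up in different aggregators is a sign that LACP did not come up on all links. The sample text and function names are hypothetical, not output from this server.

```python
def bonding_mode(text):
    """Return the value of the 'Bonding Mode:' line, or None if absent."""
    for line in text.splitlines():
        if line.startswith("Bonding Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def slave_aggregators(text):
    """Map each slave interface name to its per-slave 'Aggregator ID'."""
    result = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Slave Interface:"):
            current = line.split(":", 1)[1].strip()
        elif line.startswith("Aggregator ID:") and current is not None:
            # Aggregator IDs seen before the first slave section are skipped.
            result[current] = int(line.split(":", 1)[1])
    return result


def slaves_share_aggregator(text):
    """True if every slave reports the same aggregator ID."""
    return len(set(slave_aggregators(text).values())) == 1


# Hypothetical sample in /proc/net/bonding/bond0 format.
SAMPLE_BOND = """\
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
MII Status: up

802.3ad info
Active Aggregator Info:
	Aggregator ID: 1
	Number of ports: 1

Slave Interface: eth0
MII Status: up
Aggregator ID: 1

Slave Interface: eth1
MII Status: up
Aggregator ID: 2
"""

if __name__ == "__main__":
    print(bonding_mode(SAMPLE_BOND))
    print(slave_aggregators(SAMPLE_BOND))
    print(slaves_share_aggregator(SAMPLE_BOND))
```

On a live system the text would come from reading /proc/net/bonding/bond0 (bond0 being the bond name used above). If the slaves do not share an aggregator, the switch-side link aggregation settings are the next thing to check.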
Anyway, hopefully this might help someone else struggling with a similar problem.
On Fri, Oct 10, 2014 at 4:17 PM, Matt Garman matthew.garman@gmail.com wrote:
> I in fact have bonded interfaces on the laggy server. When I checked the bonding config, I realized a while ago I had changed from balance-rr / mode 0, to 802.3ad / mode 4. (I did this because I kept getting "bond0: received packet with own address as source address" when using balance-rr with a bridge interface. The bridge interface was for using KVM.)
See the comments about mode 0 in this thread: http://lists.centos.org/pipermail/centos-virt/2014-March/003720.html and in particular http://lists.centos.org/pipermail/centos-virt/2014-March/003733.html
> Like the Serverfault poster, I have an HP ProCurve 1800-24G switch. The switch is supposed to support 802.3ad link aggregation. It's not a managed switch, so I (perhaps incorrectly) assumed that 802.3ad would magically just work. Either there is more required to make it work, or its implementation is broken. Curiously, however, running my bond0 in 802.3ad mode did work without any issue for over a month.
I'm unfamiliar with these switches. The Cisco switches we use, all managed, require explicit configuration for LACP/802.3ad.