Hi,
I'm running a CentOS 4. server and I sometimes face a weird performance problem. Here is how I discovered it.
This server runs OpenVZ virtual machines, and one of them is an Asterisk server for my personal use. The first symptom was that the voice quality became flaky. So I logged on to the server to see what could be eating CPU cycles; when I ran top, it took almost one minute before top actually displayed anything. Another hint is that when I run dstat (a monitoring utility that combines iostat, vmstat and other stats), I often get a "missed xx ticks" message, where xx is a number.
Example (current):
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|   in   out|  int   csw
  3   2  93   2   0   0| 106k  273k|    0     0|  0.2   0.4| 1039   389
  3   6  91   0   0   0|    0 6416k| 276k  275k|    0     0| 2160  6822 missed 55 ticks
  4  10  84   2   0   0|1200k 1992k|  82k   93k|    0     0| 1188  6275 missed 29 ticks
  1   0  99   0   0   0|    0 1312k|  65k   66k|    0     0| 1050  1114 missed 38 ticks
  2   1  96   0   0   0|    0 1168k|  57k   59k|    0     0|  491   877 missed 13 ticks
  3   1  94   1   0   0|    0 6016k| 181k  176k|    0     0| 2169  5996 missed 50 ticks
  4   2  91   1   0   0|  28k 8744k| 216k  214k|    0     0| 2159  5438 missed 37 ticks
  1   1  98   0   0   0|    0 2632k|  93k   91k|    0     0|  983  1381 missed 34 ticks
  1   1  98   1   0   0|    0 5624k| 113k  110k|    0     0| 1569  2643 missed 52 ticks
  1   1  98   1   0   0|    0 2432k|  29k   28k|    0     0|  679   647 missed 12 ticks
  0   0 100   0   0   0|    0     0|  60B  374B|    0     0|   13    15
  2   3  94   0   0   0|    0 1872k| 209k  210k|    0     0| 1375  3590 missed 30 ticks
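For anyone comparing runs, the "missed xx ticks" notes can be pulled out of a saved dstat capture and totalled. This is only a minimal Python sketch, assuming the output was saved as plain text; the helper name missed_ticks is made up for illustration and is not part of dstat, and the sample keeps only the tail of each row from the capture above.

```python
import re

# dstat appends a note such as "missed 55 ticks" to a row when it fell behind.
MISSED = re.compile(r"missed (\d+) ticks")


def missed_ticks(dstat_output):
    """Return every 'missed N ticks' count found in captured dstat text."""
    return [int(n) for n in MISSED.findall(dstat_output)]


# Tail of each of the 12 rows in the capture above (int/csw columns only).
sample = """\
|1039 389
|2160 6822 missed 55 ticks
|1188 6275 missed 29 ticks
|1050 1114 missed 38 ticks
| 491 877 missed 13 ticks
|2169 5996 missed 50 ticks
|2159 5438 missed 37 ticks
| 983 1381 missed 34 ticks
|1569 2643 missed 52 ticks
| 679 647 missed 12 ticks
| 13 15
|1375 3590 missed 30 ticks
"""

counts = missed_ticks(sample)
# prints "10 of 12 samples flagged, 350 ticks missed"
print(f"{len(counts)} of 12 samples flagged, {sum(counts)} ticks missed")
```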
The problem is currently occurring, but it doesn't seem to be affecting voice quality for now, so I have some time to try to find the cause. The only solution I've found so far is to reboot... But hey, this isn't a Windows 98 machine :)!
I tried restarting the VZ system, which restarts all the VMs, but it didn't solve the problem. I can't tell if the problem occurs on a stock CentOS kernel, because the server is running production (but non-critical) virtual machines, so it is always running the OpenVZ kernel.
So here is what I've done for now:
- top shows a load of about 0.4
- vmstat 1 10 shows this:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy  id wa
 0  0    592 191092 381720 537956    0    0    53    68    4     3  3  2  93  2
 0  0    592 190720 381720 537956    0    0     0     0   32    60  1  1  98  0
 0  0    592 191092 381720 537956    0    0     0     0   41    59  0  0 100  0
 1  0    592 191092 381728 537948    0    0     0  2584  311    96 10  4  66 19
 0  0    592 189968 381732 537944    0    0     0  2080  222   174  2  3  79 16
 0  1    592 189968 381732 537944    0    0     0  3244  170    73 10  4  73 12
 0  0    592 190216 381732 537944    0    0     0   136   76   113  1  2  93  4
 0  0    592 189844 381732 537944    0    0     0     0   33    69  1  1  98  0
 0  0    592 189844 381732 537944    0    0     0     0   24    32  0  0 100  0
 0  0    592 190340 381732 537944    0    0     0     0   28    42  0  0 100  0
- iostat -x 1 (excerpt):
Device:  rrqm/s wrqm/s    r/s    w/s  rsec/s  wsec/s   rkB/s   wkB/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00 171.00   0.00 124.00    0.00 2368.00    0.00 1184.00    19.10     0.14   1.13   0.02   0.20
sdb        0.00   0.00   0.00   0.00    0.00    0.00    0.00    0.00     0.00     0.00   0.00   0.00   0.00
sdc        0.00 171.00   0.00 124.00    0.00 2368.00    0.00 1184.00    19.10     0.17   1.35   0.02   0.30
sdd        0.00   0.00   0.00   0.00    0.00    0.00    0.00    0.00     0.00     0.00   0.00   0.00   0.00
md0        0.00   0.00   0.00   0.00    0.00    0.00    0.00    0.00     0.00     0.00   0.00   0.00   0.00
md2        0.00   0.00   0.00   0.00    0.00    0.00    0.00    0.00     0.00     0.00   0.00   0.00   0.00
md1        0.00   0.00   0.00 294.00    0.00 2352.00    0.00 1176.00     8.00     0.00   0.00   0.00   0.00
dm-0       0.00   0.00   0.00   0.00    0.00    0.00    0.00    0.00     0.00     0.00   0.00   0.00   0.00
dm-1       0.00   0.00   0.00   0.00    0.00    0.00    0.00    0.00     0.00     0.00   0.00   0.00   0.00
dm-2       0.00   0.00   0.00 294.00    0.00 2352.00    0.00 1176.00     8.00     0.30   1.01   0.02   0.50
dm-3       0.00   0.00   0.00 294.00    0.00 2352.00    0.00 1176.00     8.00     0.30   1.01   0.02   0.50
dm-4       0.00   0.00   0.00   0.00    0.00    0.00    0.00    0.00     0.00     0.00   0.00   0.00   0.00
dm-5       0.00   0.00   0.00   0.00    0.00    0.00    0.00    0.00     0.00     0.00   0.00   0.00   0.00
dm-6       0.00   0.00   0.00   0.00    0.00    0.00    0.00    0.00     0.00     0.00   0.00   0.00   0.00
Any insight would be greatly appreciated. It is not critical, but I'd be glad to be able to finally pinpoint and solve the problem.
Hardware: HP Netserver, software raid, SCSI disks, 1.7 GB RAM.
I can provide more information if needed.
Thanks,
Ugo
Ugo Bellavance wrote:
Another hint is that pings are really slow. Even pinging localhost takes a very long time. The first reply is fast, but the second takes ages to come.
It seems to be blocking here:
recvmsg(3, 0xbfbf84b0, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily unavailable)
gettimeofday({1239887784, 389347}, NULL) = 0
poll(
The rest comes as soon as there is another response:
[{fd=3, events=POLLIN|POLLERR}], 1, 999) = 0
gettimeofday({1239887903, 119727}, NULL) = 0
gettimeofday({1239887903, 119791}, NULL) = 0
sendmsg(3, {msg_name(16)={sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.1")}, msg_iov(1)=[{"\10\0\335\2018)\0\4\0370\347I\357\323\1\0\10\t\n\v\f\r\16\17\20\21\22\23\24\25\26\27"..., 64}], msg_controllen=0, msg_flags=0}, MSG_CONFIRM) = 64
recvmsg(3, {msg_name(16)={sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.1")}, msg_iov(1)=[{"E\0\0T\26\264\0\0@\1e\363\177\0\0\1\177\0\0\1\0\0\345\2018)\0\4\0370\347I"..., 192}], msg_controllen=20, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=0x1d /* SCM_??? */, ...}, msg_flags=0}, 0) = 84
write(1, "64 bytes from hn01.domain"..., 82) = 82
recvmsg(3, 0xbfbf84b0, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily unavailable)
gettimeofday({1239887903, 120785}, NULL) = 0
poll(
Then it blocks again...
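The two gettimeofday() results on either side of the poll() call show how long it really slept. In the trace, poll() was given a 999 ms timeout and returned 0 (a timeout), so it should have come back after about one second. Here is a rough Python sketch of that arithmetic, using only the timestamps from the trace; the helper name elapsed_us is illustrative.

```python
# (seconds, microseconds) pairs copied from the gettimeofday() calls in the
# strace output: the one just before poll() and the one just after it returned.
before_poll = (1239887784, 389347)
after_poll = (1239887903, 119727)

POLL_TIMEOUT_MS = 999  # third argument to poll() in the trace


def elapsed_us(start, end):
    """Microseconds between two (sec, usec) pairs, using integer arithmetic."""
    return (end[0] - start[0]) * 1_000_000 + (end[1] - start[1])


gap_us = elapsed_us(before_poll, after_poll)

# prints "poll() returned after 118.73 s for a 0.999 s timeout"
print(f"poll() returned after {gap_us / 1e6:.2f} s "
      f"for a {POLL_TIMEOUT_MS / 1000:.3f} s timeout")
```

If those timestamps can be trusted, the kernel took almost two minutes to expire a one-second timer, which would point at timer handling rather than at gettimeofday() itself.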
This confuses Nagios, which is running in a VM on this server.
Could 'gettimeofday' be the problem? 'date' runs without delay.
Thanks,
Ugo
On Thu, 16 Apr 2009, Ugo Bellavance wrote:
This server runs OpenVZ virtual machines, and one of them is an Asterisk server for my personal use. The first symptom was that the voice quality became flaky. So I logged on to the server to see what could be eating CPU cycles; when I ran top, it took almost one minute before top actually displayed anything. Another hint is that when I run dstat (a monitoring utility that combines iostat, vmstat and other stats), I often get a "missed xx ticks" message, where xx is a number.
Another hint is that pings are really slow. Even pinging localhost is very long. The first reply is fast, but the second takes ages to come.
I am glad that my dstat tool gave you the information about missing ticks, because all the other tools were basically giving you wrong statistics.
Dstat's statistics are wrong too, but at least it gives you a hint that you shouldn't trust the numbers. :)
How to fix this for OpenVZ I can't tell, but I am sure the OpenVZ forums have smart people with some insight.
If you find that information, I would be very interested to know what the solution is for OpenVZ. (Since you're running their kernel, I would have thought this would not be an issue, though.)
Thanks for keeping us posted!
On Thu, 2009-04-16 at 09:12 -0400, Ugo Bellavance wrote:
----- That's a known problem with the Kernel and VM Kernel. You need the fixed kernel.
On Sat, 2009-04-18 at 15:54 -0400, Ugo Bellavance wrote:
JohnS wrote:
That's a known problem with the Kernel and VM Kernel. You need the fixed kernel.
Do you have a URL for the bug or something?
I updated to the latest kernel.
I was running 2.6.18-92.1.18.el5.028stab060.2PAE.
Regards,
--- One or two people have recompiled a kernel to be compatible with CentOS, and it is on their site; maybe they can join in and give you the URL.
Please read: http://www.vmware.com/pdf/vmware_timekeeping.pdf
I don't recall anyone really saying whether a fix was indeed put into that PAE kernel or any specific kernel for CentOS. Can the CentOS kernel builder comment, please?
Also See: +1 http://wiki.centos.org/TipsAndTricks/VMWare_Server?highlight=(100hz)
JohnStanley
On Sun, Apr 19, 2009 at 8:34 AM, JohnS jses27@gmail.com wrote:
I don't recall anyone really saying whether a fix was indeed put into that PAE kernel or any specific kernel for CentOS. Can the CentOS kernel builder comment, please?
Also See: +1 http://wiki.centos.org/TipsAndTricks/VMWare_Server?highlight=(100hz)
JohnStanley
As noted at the top of that wiki page, the contents need to be updated. When I added that note, I intended to do it asap but have not had a chance to do so. However, the link referenced in there ( http://kb.vmware.com/kb/1006427 ) is the best source for timekeeping at this moment. In short, CentOS no longer offers 100Hz kernels because the divider=10 kernel option now works.
Akemi
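To make the divider=10 remark concrete: the option divides the kernel's compiled-in timer frequency, so a 1000 Hz kernel booted with divider=10 ticks at 100 Hz. Below is a small Python sketch of that arithmetic. The 1000 Hz base rate and the sample command line are assumptions for illustration only (check your own /proc/cmdline and kernel config), and effective_tick_hz is a made-up helper name.

```python
def effective_tick_hz(cmdline, base_hz=1000):
    """Timer tick rate implied by a divider=N option on a kernel command line.

    base_hz is an assumption about how the kernel was built (HZ=1000 here);
    without a divider= option the base rate is returned unchanged.
    """
    for token in cmdline.split():
        if token.startswith("divider="):
            return base_hz / int(token.split("=", 1)[1])
    return float(base_hz)


# Hypothetical command lines, not taken from the server in this thread.
print(effective_tick_hz("ro root=/dev/sda1"))             # 1000.0
print(effective_tick_hz("ro root=/dev/sda1 divider=10"))  # 100.0
```

Whether the OpenVZ kernel honours this option is not something this thread confirms.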
On Sun, 2009-04-19 at 09:02 -0700, Akemi Yagi wrote:
---- Thanks Akemi for the update on it. That should fix him up.
JohnStanley
JohnS wrote:
Thanks Akemi for the update on it. That should fix him up.
I'm not using VMWare, I'm using OpenVZ...
Ugo
On Mon, 2009-04-20 at 08:12 -0400, Ugo Bellavance wrote:
I'm not using VMWare, I'm using OpenVZ...
Ugo
----- Perhaps pursue your question with OpenVZ. Try the kernel option?
Can you find the first time this problem occurred? How about trying older kernel versions?
You are either dealing with misbehaving hardware/driver or you need to tweak the settings on your clock source.
Believe it or not, this issue affects VMware as well, and the VMware docs are a good source of information for fixing it, so read them.
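As a starting point for the clock source check, this is a minimal Python sketch that reports which clock source the kernel is using and which alternatives it offers. It assumes the kernel exposes the generic clocksource interface in sysfs at the path below; older or differently patched kernels may not, in which case the function returns (None, []). The helper name read_clocksources is illustrative.

```python
from pathlib import Path

# Assumed sysfs location of the clocksource interface; it may be absent on
# kernels that do not use the generic clocksource framework.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base=CLOCKSOURCE_DIR):
    """Return (current, available) clock sources, or (None, []) if not exposed."""
    try:
        current = (base / "current_clocksource").read_text().strip()
        available = (base / "available_clocksource").read_text().split()
    except OSError:
        return None, []
    return current, available


current, available = read_clocksources()
print("current clock source:", current)
print("available clock sources:", available)
```

Comparing the output while the problem is occurring and after a fresh reboot could show whether the clock source changed in between.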
On 4/20/09, Ugo Bellavance ugob@lubik.ca wrote:
I'm not using VMWare, I'm using OpenVZ...
Ugo
CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos