I am working on high load daemon development, which listens on UDP and processes packets. Last few months I noticed some strange issue when it takes 500-700 ms to answer packet, while usually it takes 20 ms. I've run strace on all daemon processes and found this thing:
13:35:36.979887 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2309, ...}) = 0 13:35:36.979944 write(3, "[2014-04-15 13:35:36,979] WARNING WORKER 26 - [pkt#105132/AUTH] loadPresets - memory used: 0 kb\n", 99) = 99 13:35:37.599793 sendto(10, "Q\0\0\0\rSELECT 1\0", 14, MSG_NOSIGNAL, NULL, 0) = 14 13:35:37.599865 poll([{fd=10, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=10, revents=POLLIN}])
You can see that between write and sendto passed around 600 ms. At this time server was not overloaded (LA = 0.4, 16 cores). There were free memory. There were no load on disks. So I took straces of other daemon processes plus database processes. And then did:
grep '13:35:37.4' * grep '13:35:37.3' * grep '13:35:37.2' *
None of commans show any lines. So I guess the system was just doing nothing during this 600 ms.
Is there any way to diagnose this issue? What it might be?
I'm going to guess that your storage controller or drives was performing some internal operation or reset during that time, which blocked IO to the kernel and likely had disk-dependent tasks in iowait.
On Tue, Apr 15, 2014 at 5:25 PM, Andrii Zinchenko mail@zinok.org wrote:
I am working on high load daemon development, which listens on UDP and processes packets. Last few months I noticed some strange issue when it takes 500-700 ms to answer packet, while usually it takes 20 ms. I've run strace on all daemon processes and found this thing:
13:35:36.979887 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2309, ...}) = 0 13:35:36.979944 write(3, "[2014-04-15 13:35:36,979] WARNING WORKER 26 - [pkt#105132/AUTH] loadPresets - memory used: 0 kb\n", 99) = 99 13:35:37.599793 sendto(10, "Q\0\0\0\rSELECT 1\0", 14, MSG_NOSIGNAL, NULL, 0) = 14 13:35:37.599865 poll([{fd=10, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=10, revents=POLLIN}])
You can see that between write and sendto passed around 600 ms. At this time server was not overloaded (LA = 0.4, 16 cores). There were free memory. There were no load on disks. So I took straces of other daemon processes plus database processes. And then did:
grep '13:35:37.4' * grep '13:35:37.3' * grep '13:35:37.2' *
None of commans show any lines. So I guess the system was just doing nothing during this 600 ms.
Is there any way to diagnose this issue? What it might be?
-- Andrii Zinchenko mail@zinok.org _______________________________________________ CentOS mailing list CentOS@centos.org http://lists.centos.org/mailman/listinfo/centos