[CentOS] Re: DRBD NFS load issues
Jed Reynolds
lists at bitratchet.com
Mon Jan 7 06:49:42 UTC 2008
Ugo Bellavance wrote:
> Jed Reynolds wrote:
>> My NFS setup is a heartbeat setup on two servers running Active/Passive
>> DRBD. The NFS servers themselves are 1x 2 core Opterons with 8G ram and
>> 5TB space with 16 drives and a 3ware controller. They're connected to a
>> HP procurve switch with bonded ethernet. The sync-rates between the two
>> DRBD nodes seem to safely reach 200Mbps or better. The processors on the
>> active NFS servers run with a load of 0.2, so it seems mighty healthy.
>> Until I do a serious backup.
>>
>> I have a few load-balanced web nodes and two database nodes as NFS
>> clients. When I start backing up my database to a mounted NFS partition,
>> a plain rsync drives the NFS box through the roof and forces a failover.
>> I can do my backup using --bwlimit=1500, but then I'm not anywhere close
>> to a fast backup, just 1.5MB/s. My backups are probably 40G. (The
>> database has fast disks, and database-to-database copies run at up to
>> 60MB/s - close to 500Mbps.) I obviously do not have a networking issue.
>>
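For reference, the throttled backup described above amounts to something
like the following sketch. The paths and the helper name are made up;
note that rsync's --bwlimit takes KB/s, so 1500 is the ~1.5 MB/s figure
mentioned:

```shell
#!/bin/sh
# Sketch of the throttled backup described above. Paths are hypothetical;
# rsync's --bwlimit is in KB/s, so 1500 caps the copy at roughly 1.5 MB/s.
build_backup_cmd() {
    # $1 = bwlimit in KB/s, $2 = source, $3 = NFS-mounted destination
    echo "rsync -a --bwlimit=$1 $2 $3"
}
# Prints the command instead of running it, so it can be inspected first:
build_backup_cmd 1500 /var/lib/db-backup/ /mnt/nfs/backup/
```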
>> The processor loads up like this:
>> bwlimit 1500 load 2.3
>> bwlimit 2500 load 3.5
>> bwlimit 4500 load 5.5+
>>
>> The DRBD secondary seems to run at about 1/2 the load of the primary.
>>
>> What I'm wondering is--why is this thing *so* load sensitive? Is it
>> DRBD? Is it NFS? I'm guessing that since I only have two cores in the
>> NFS boxes, a prolonged transfer makes NFS dominate one core and DRBD
>> dominate the other, so I'm saturating my processors.
>
> Is your CPU usage 100% all the time?
>
Not 100% user or 100% system--not even close.
Wow. Looks like a lot of I/O wait time to me, actually.
Looking at the stats below, I'd think that with so much wait time it's
either disk or network latency. I wonder if packets going through the
drbd device are the wrong size, or if the drbd device is waiting on a
response from the secondary? Seems strange.
The only other thing running on that system is memcached, which uses 11%
cpu, with about 200 connections open from other hosts. There were 8 nfsd
instances.
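Eight is the stock nfsd thread count on CentOS. If the server is starving
under load, raising RPCNFSDCOUNT in /etc/sysconfig/nfs is the usual knob
to try; a sketch (the helper name is mine, the variable and path are the
stock CentOS ones):

```shell
#!/bin/sh
# Raise the nfsd thread count. RPCNFSDCOUNT in /etc/sysconfig/nfs is the
# stock CentOS setting; 8 is the default. set_nfsd_count is just a helper.
set_nfsd_count() {
    # $1 = new thread count, $2 = config file (normally /etc/sysconfig/nfs)
    sed -i "s/^RPCNFSDCOUNT=.*/RPCNFSDCOUNT=$1/" "$2"
}
# e.g. set_nfsd_count 16 /etc/sysconfig/nfs, then restart the nfs service
```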
>
> Can you send us the output of vmstat -n 5 5
> when you're doing a backup?
>
This is with rsync at bwlimit=2500
top - 22:37:23 up 3 days, 10:07, 4 users, load average: 4.67, 2.37, 1.30
Tasks: 124 total, 1 running, 123 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.3% us, 1.3% sy, 0.0% ni, 9.3% id, 87.7% wa, 0.3% hi, 1.0% si
Cpu1 : 0.0% us, 3.3% sy, 0.0% ni, 8.0% id, 83.7% wa, 1.7% hi, 3.3% si
Mem: 8169712k total, 8148616k used, 21096k free, 296636k buffers
Swap: 4194296k total, 160k used, 4194136k free, 6295284k cached
$ vmstat -n 5 5
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff   cache  si  so   bi    bo    in    cs us sy id wa
 0 10    160  24136 304208 6277104   0   0   95    38    22    63  0  2 89  9
 0 10    160  28224 304228 6277288   0   0   36    64  2015   707  0  3  0 97
 0  0    160  28648 304316 6280328   0   0  629    28  3332  1781  0  4 65 31
 0  8    160  26784 304384 6283388   0   0  629   106  4302  3085  1  5 70 25
 0  0    160  21520 304412 6287304   0   0  763   104  3487  1944  0  4 78 18
$ vmstat -n 5 5
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff   cache  si  so   bi    bo    in    cs us sy id wa
 0  0    160  26528 301516 6287820   0   0   95    38    22    63  0  2 89  9
 0  0    160  21288 301600 6292768   0   0  999    86  4856  3273  0  2 87 11
 2  8    160  19408 298304 6283960   0   0  294 15293 33983 15309  0 22 53 25
 0 10    160  28360 298176 6281232   0   0   34   266  2377   858  0  2  0 97
 0 10    160  33680 298196 6281552   0   0   32    48  1937   564  0  1  4 96
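Those wa numbers are the story: the box spends most of its time waiting
on I/O. A quick way to summarize the wa column from a vmstat run (field
position assumes the standard procps layout shown above):

```shell
#!/bin/sh
# Average the iowait ("wa") column from vmstat output. Skips the two
# header lines; field 16 is "wa" in the standard procps vmstat layout.
avg_wa() {
    awk 'NR > 2 { sum += $16; n++ } END { if (n) print int(sum / n) }'
}
# Usage: vmstat -n 5 5 | avg_wa
```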