I’ve posted this on the forums at https://www.centos.org/forums/viewtopic.php?f=47&t=57926&p=244614#p2... - posting to the list in the hopes of getting more eyeballs on it.
We have a cluster of 23 HP DL380p Gen8 hosts running Kafka. Basic specs:
2x E5-2650
128 GB RAM
12 x 4 TB 7200 RPM SATA drives connected to an HP H220 HBA
Dual port 10 GB NIC
The drives are configured as one large RAID-10 volume with mdadm; the filesystem is XFS. The OS is not installed on these drives - we PXE boot a CentOS image we've built with minimal packages installed, and do the OS configuration via Puppet. Originally the hosts were running CentOS 6.5 with Kafka 0.8.1, without issue. We recently upgraded to CentOS 7.2 and Kafka 0.9, and that's when the trouble started.
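(For readers trying to reproduce a similar layout, a minimal sketch of how an array like this is typically built - the array name, mount point and options here are illustrative, not taken from the hosts above:)

  mdadm --create /dev/md/kafka --level=10 --raid-devices=12 --chunk=512 /dev/sd[a-l]
                                        # near-2 is the default RAID-10 layout
  mkfs.xfs /dev/md/kafka
  mount -o noatime /dev/md/kafka /data/kafka   # mount point illustrative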
What we're seeing is that when the weekly raid-check script executes, performance nose-dives and I/O wait skyrockets. The raid check starts out fairly fast (20000K/sec - the limit that's been set), but then quickly drops to about 4000K/sec. The dev.raid.speed_limit sysctls are at the defaults:
dev.raid.speed_limit_max = 200000
dev.raid.speed_limit_min = 1000
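(For reference, those are the kernel-wide md resync/check throttles, in KB/s per device; they can be read and raised at runtime with sysctl - the value below is only an example:)

  sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max
  sysctl -w dev.raid.speed_limit_min=50000    # example value, not a recommendation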
Here's 10 seconds of iostat output, which illustrates the issue:
[root@r1k1log] # iostat 1 10
Linux 3.10.0-327.18.2.el7.x86_64 (r1k1)   05/24/16   _x86_64_   (32 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle 8.80 0.06 1.89 14.79 0.00 74.46
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 52.59 2033.16 10682.78 1210398902 6359779847 sdb 52.46 2031.25 10682.78 1209265338 6359779847 sdc 52.40 2033.21 10683.53 1210433924 6360229587 sdd 52.22 2031.16 10683.53 1209212513 6360229587 sdf 52.20 2031.17 10682.06 1209216701 6359354331 sdg 52.62 2033.22 10684.17 1210437080 6360606756 sdh 52.57 2031.21 10684.17 1209242746 6360606756 sde 51.67 2033.17 10682.06 1210408935 6359354331 sdj 51.90 2031.13 10684.48 1209191501 6360795559 sdi 52.47 2033.16 10684.48 1210399262 6360795559 sdk 52.09 2033.15 10684.36 1210396915 6360724971 sdl 51.95 2031.20 10684.36 1209235241 6360724971 md127 138.20 74.49 64101.35 44348810 38161468777
avg-cpu: %user %nice %system %iowait %steal %idle 8.57 0.09 1.33 26.19 0.00 63.81
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 28.00 512.00 8416.00 512 8416 sdb 28.00 512.00 8416.00 512 8416 sdc 25.00 448.00 8876.00 448 8876 sdd 24.00 448.00 8364.00 448 8364 sdf 23.00 448.00 8192.00 448 8192 sdg 24.00 512.00 7680.00 512 7680 sdh 24.00 512.00 7680.00 512 7680 sde 23.00 448.00 8192.00 448 8192 sdj 23.00 512.00 7680.00 512 7680 sdi 23.00 512.00 7680.00 512 7680 sdk 23.00 512.00 7680.00 512 7680 sdl 23.00 512.00 7680.00 512 7680 md127 101.00 0.00 48012.00 0 48012
avg-cpu: %user %nice %system %iowait %steal %idle 6.50 0.00 1.04 24.27 0.00 68.19
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 26.00 512.00 9216.00 512 9216 sdb 26.00 512.00 9216.00 512 9216 sdc 27.00 576.00 9204.00 576 9204 sdd 28.00 576.00 9716.00 576 9716 sdf 31.00 768.00 9728.00 768 9728 sdg 28.00 512.00 10240.00 512 10240 sdh 28.00 512.00 10240.00 512 10240 sde 31.00 768.00 9728.00 768 9728 sdj 28.00 512.00 9744.00 512 9744 sdi 28.00 512.00 9744.00 512 9744 sdk 27.00 512.00 9728.00 512 9728 sdl 27.00 512.00 9728.00 512 9728 md127 114.00 0.00 57860.00 0 57860
avg-cpu: %user %nice %system %iowait %steal %idle 9.24 0.00 1.32 20.02 0.00 69.42
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 50.00 512.00 20408.00 512 20408 sdb 50.00 512.00 20408.00 512 20408 sdc 48.00 512.00 19984.00 512 19984 sdd 48.00 512.00 19984.00 512 19984 sdf 50.00 704.00 19968.00 704 19968 sdg 47.00 512.00 19968.00 512 19968 sdh 47.00 512.00 19968.00 512 19968 sde 50.00 704.00 19968.00 704 19968 sdj 48.00 512.00 19972.00 512 19972 sdi 48.00 512.00 19972.00 512 19972 sdk 48.00 512.00 19980.00 512 19980 sdl 48.00 512.00 19980.00 512 19980 md127 241.00 0.00 120280.00 0 120280
avg-cpu: %user %nice %system %iowait %steal %idle 7.98 0.00 0.98 18.42 0.00 72.63
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 39.00 640.00 14076.00 640 14076 sdb 39.00 640.00 14076.00 640 14076 sdc 36.00 512.00 14324.00 512 14324 sdd 36.00 512.00 14324.00 512 14324 sdf 36.00 576.00 13824.00 576 13824 sdg 43.00 1024.00 13824.00 1024 13824 sdh 43.00 1024.00 13824.00 1024 13824 sde 36.00 576.00 13824.00 576 13824 sdj 44.00 1024.00 14104.00 1024 14104 sdi 44.00 1024.00 14104.00 1024 14104 sdk 45.00 1024.00 14336.00 1024 14336 sdl 45.00 1024.00 14336.00 1024 14336 md127 168.00 0.00 84488.00 0 84488
avg-cpu: %user %nice %system %iowait %steal %idle 7.39 0.00 1.01 19.48 0.00 72.13
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 22.00 896.00 4096.00 896 4096 sdb 22.00 896.00 4096.00 896 4096 sdc 19.00 640.00 4344.00 640 4344 sdd 19.00 640.00 4344.00 640 4344 sdf 18.00 512.00 5120.00 512 5120 sdg 18.00 512.00 5120.00 512 5120 sdh 18.00 512.00 5120.00 512 5120 sde 18.00 512.00 5120.00 512 5120 sdj 18.00 512.00 4624.00 512 4624 sdi 18.00 512.00 4624.00 512 4624 sdk 18.00 512.00 4608.00 512 4608 sdl 18.00 512.00 4608.00 512 4608 md127 57.00 0.00 27912.00 0 27912
avg-cpu: %user %nice %system %iowait %steal %idle 10.92 0.00 1.58 21.84 0.00 65.66
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 23.00 576.00 7168.00 576 7168 sdb 23.00 576.00 7168.00 576 7168 sdc 29.00 896.00 7680.00 896 7680 sdd 29.00 896.00 7680.00 896 7680 sdf 31.00 1024.00 7680.00 1024 7680 sdg 31.00 1024.00 7680.00 1024 7680 sdh 31.00 1024.00 7680.00 1024 7680 sde 31.00 1024.00 7680.00 1024 7680 sdj 30.00 1024.00 7168.00 1024 7168 sdi 31.00 1024.00 7680.00 1024 7680 sdk 32.00 1024.00 7424.00 1024 7424 sdl 32.00 1024.00 7424.00 1024 7424 md127 89.00 0.00 44800.00 0 44800
avg-cpu: %user %nice %system %iowait %steal %idle 13.89 0.03 2.63 21.54 0.00 61.91
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 30.00 960.00 7680.00 960 7680 sdb 30.00 960.00 7680.00 960 7680 sdc 32.00 1024.00 7684.00 1024 7684 sdd 32.00 1024.00 7684.00 1024 7684 sdf 31.00 1024.00 7680.00 1024 7680 sdg 31.00 1024.00 7680.00 1024 7680 sdh 31.00 1024.00 7680.00 1024 7680 sde 31.00 1024.00 7680.00 1024 7680 sdj 32.00 1024.00 8192.00 1024 8192 sdi 31.00 1024.00 7680.00 1024 7680 sdk 26.00 704.00 7680.00 704 7680 sdl 26.00 704.00 7680.00 704 7680 md127 92.00 0.00 46596.00 0 46596
avg-cpu: %user %nice %system %iowait %steal %idle 14.24 0.00 2.22 19.89 0.00 63.65
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 33.00 1024.00 7244.00 1024 7244 sdb 33.00 1024.00 7244.00 1024 7244 sdc 31.00 1024.00 7668.00 1024 7668 sdd 31.00 1024.00 7668.00 1024 7668 sdf 31.00 1024.00 7680.00 1024 7680 sdg 26.00 768.00 6672.00 768 6672 sdh 26.00 768.00 6672.00 768 6672 sde 31.00 1024.00 7680.00 1024 7680 sdj 21.00 512.00 6656.00 512 6656 sdi 21.00 512.00 6656.00 512 6656 sdk 27.00 832.00 7168.00 832 7168 sdl 27.00 832.00 7168.00 832 7168 md127 88.00 0.00 43088.00 0 43088
avg-cpu: %user %nice %system %iowait %steal %idle 8.02 0.13 1.42 23.90 0.00 66.53
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 30.00 1024.00 7168.00 1024 7168 sdb 30.00 1024.00 7168.00 1024 7168 sdc 29.00 960.00 7168.00 960 7168 sdd 29.00 960.00 7168.00 960 7168 sdf 23.00 512.00 7668.00 512 7668 sdg 28.00 768.00 7680.00 768 7680 sdh 28.00 768.00 7680.00 768 7680 sde 23.00 512.00 7668.00 512 7668 sdj 30.00 1024.00 6672.00 1024 6672 sdi 30.00 1024.00 6672.00 1024 6672 sdk 30.00 1024.00 7168.00 1024 7168 sdl 30.00 1024.00 7168.00 1024 7168 md127 87.00 0.00 43524.00 0 43524
Details of the array:
[root@r1k1] # cat /proc/mdstat
Personalities : [raid10]
md127 : active raid10 sdf[5] sdi[8] sdh[7] sdk[10] sdb[1] sdj[9] sdc[2] sdd[3] sdl[11] sde[13] sdg[12] sda[0]
      23441323008 blocks super 1.2 512K chunks 2 near-copies [12/12] [UUUUUUUUUUUU]
      [======>..............]  check = 30.8% (7237496960/23441323008) finish=62944.5min speed=4290K/sec

unused devices: <none>

[root@r1k1] # mdadm --detail /dev/md127
/dev/md127:
        Version : 1.2
  Creation Time : Thu Sep 18 09:57:57 2014
     Raid Level : raid10
     Array Size : 23441323008 (22355.39 GiB 24003.91 GB)
  Used Dev Size : 3906887168 (3725.90 GiB 4000.65 GB)
   Raid Devices : 12
  Total Devices : 12
    Persistence : Superblock is persistent

    Update Time : Tue May 24 15:32:56 2016
          State : active, checking
 Active Devices : 12
Working Devices : 12
 Failed Devices : 0
  Spare Devices : 0

         Layout : near=2
     Chunk Size : 512K

   Check Status : 30% complete

           Name : localhost:kafka
           UUID : b6b98e3e:65ee06c3:3599d781:98908041
         Events : 2459193

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync set-A   /dev/sda
       1       8       16        1      active sync set-B   /dev/sdb
       2       8       32        2      active sync set-A   /dev/sdc
       3       8       48        3      active sync set-B   /dev/sdd
      13       8       64        4      active sync set-A   /dev/sde
       5       8       80        5      active sync set-B   /dev/sdf
      12       8       96        6      active sync set-A   /dev/sdg
       7       8      112        7      active sync set-B   /dev/sdh
       8       8      128        8      active sync set-A   /dev/sdi
       9       8      144        9      active sync set-B   /dev/sdj
      10       8      160       10      active sync set-A   /dev/sdk
      11       8      176       11      active sync set-B   /dev/sdl
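(For anyone following along: on CentOS the weekly raid-check cron script essentially drives this through the md sync_action interface, so a running check can also be aborted or re-queued by hand; a sketch, assuming md127 as above:)

  cat /sys/block/md127/md/sync_action            # reports "check" while the scrub runs
  echo idle  > /sys/block/md127/md/sync_action   # abort the current check
  echo check > /sys/block/md127/md/sync_action   # queue a fresh one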
We've tried changing the I/O scheduler, queue_depth, queue_type, read-ahead, etc., but nothing has helped. We've also upgraded all of the firmware and installed HP's mpt2sas driver.
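(For reference, the knobs mentioned above live under /sys/block; an illustrative sketch, with sda standing in for each member disk:)

  cat /sys/block/sda/queue/scheduler               # shows the active I/O scheduler
  echo deadline > /sys/block/sda/queue/scheduler
  cat /sys/block/sda/device/queue_depth
  echo 4096 > /sys/block/sda/queue/read_ahead_kb   # per-disk read-ahead, in KB
  blockdev --getra /dev/md127                      # read-ahead on the md device itself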
We have four other Kafka clusters, though those are HP DL180 G6 servers. We completed the same CentOS 6.5 -> 7.2 / Kafka 0.8 -> 0.9 upgrade on those clusters, and there has been no impact on their performance.
We've been banging our heads against the wall for a few weeks now, really hoping someone from the community can point us in the right direction.
Thanks,
Kelly Lesperance
Kelly Lesperance wrote:
<SNIP> Really stupid question: are the drives in it the ones that came with the unit?
mark, who, a few years ago, found serious issues with green drives in a server....
They are:
[root@r1k1 ~] # hdparm -I /dev/sda
/dev/sda:
ATA device, with non-removable media
        Model Number:       MB4000GCWDC
        Serial Number:      S1Z06RW9
        Firmware Revision:  HPGD
        Transport:          Serial, SATA Rev 3.0
Thanks,
Kelly
Kelly Lesperance wrote:
<SNIP> One more stupid question: could the configuration of the card for how the drives are accessed have been accidentally changed?
mark
[merging]
The HBA the drives are attached to has no configuration that I’m aware of. We would have had to accidentally change 23 of them ☺
Thanks,
Kelly
What is the HBA the drives are attached to? Have you done a quick benchmark on a single disk to check if this is a raid problem or further down the stack?
Regards, Dennis
The HBA is an HP H220.
We haven’t really benchmarked individual drives – all 12 are in the one RAID-10 array, and I’m unsure how we would test individual drives without breaking it.
Trying ‘hdparm -tT /dev/sda’ now – it’s been running for 25 minutes so far…
Kelly
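(As an aside: a read-only pass over each member device doesn't disturb an md array, so something like the following - device names illustrative - gives a rough per-disk sequential-read figure without breaking anything:)

  for d in /dev/sd[a-l]; do
      echo "$d"
      dd if="$d" of=/dev/null bs=1M count=2048 iflag=direct 2>&1 | tail -1
  done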
On 5/25/2016 11:44 AM, Kelly Lesperance wrote:
The HBA is an HP H220.
OH. It's a very good idea to verify the driver is at the same revision level as the firmware. Not 100% sure how you do this under CentOS; my H220 system is running FreeBSD and is at revision P20 for both firmware and driver. HP's firmware, at least what I could find, was a fairly old P14 or something, so I had to re-flash mine with 'generic' LSI firmware. This isn't exactly a recommended thing to do, but it's sure working fine for me.
John R Pierce wrote:
On 5/25/2016 11:44 AM, Kelly Lesperance wrote:
The HBA is an HP H220.
OH. its a very good idea to verify the driver is at the same revision level as the firmware. not 100% sure how you do this under CentOS, my H220 system is running FreeBSD, and is at revision P20, both firmware and driver. HP's firmware, at least what I could find, was a fairly old P14 or something, so I had to re-flash mine with 'generic' LSI firmware, this isn't exactly a recommended thing to do, but its sure working fine for me.
Not sure if dmidecode will tell you, but you might see if you can run smartctl -i
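(smartctl is in the smartmontools package and should work here even with the drives behind the IT-mode HBA; a quick sketch:)

  smartctl -i /dev/sda        # model, serial, firmware revision
  smartctl -H /dev/sda        # overall health self-assessment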
Also, you could, on boot, go into the card's firmware interface, and that'll tell you, somewhere, what the firmware version is. Not sure if MegaRAID will work with this card - if it does, you really want it... even though it has an actively user-hostile interface.
mark
LSI/Avago’s web pages don’t have any downloads for the SAS2308, so I think I’m out of luck wrt MegaRAID.
Bounced the node, confirmed MPT Firmware 15.10.09.00-IT. HP Driver is v 15.10.04.00.
Both are the latest from HP.
Unsure why, but the module itself reports version 20.100.00.00:
[root@r1k1 sys] # cat module/mpt2sas/version
20.100.00.00
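(That 20.100.00.00 is the version string reported by whichever mpt2sas module is actually loaded - a driver version, tracked separately from the card's firmware. One way to compare the two on a running system, roughly:)

  modinfo mpt2sas | grep -i ^version     # version of the module that modprobe would load
  dmesg | grep -i mpt2sas                # the HBA firmware shows up as FWVersion(...)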
Kelly Lesperance wrote:
LSI/Avago’s web pages don’t have any downloads for the SAS2308, so I think I’m out of luck wrt MegaRAID.
Bounced the node, confirmed MPT Firmware 15.10.09.00-IT. HP Driver is v 15.10.04.00.
Both are the latest from HP.
Unsure why, but the module itself reports version 20.100.00.00:
[root@r1k1 sys] # cat module/mpt2sas/version
20.100.00.00
Suggestion: if these are new, they're under warranty, and if it's a hardware issue, call HP tech support and open a ticket with them - they might have an answer.
mark
Already done – they’re not being very helpful, as we don’t have a support contract, just standard warranty.
I should rephrase that – some parts of HP are helping us, but the team I opened the case with isn’t being very helpful.
Kelly Lesperance wrote:
Already done – they’re not being very helpful, as we don’t have a support contract, just standard warranty.
Right. We get support for five years, but then we keep things well past that - we don't get rid of them till they're dying. (Don't talk to me about "wasting tax dollars".)
And I don't care for HP "support" - they don't want to give you *anything* unless you're paying for support. The only ones worse are a) none of the above, and b) Sun/Oracle (I refer to dealing with their "tech support" as self-abuse).
mark
On 5/25/2016 12:20 PM, m.roth@5-cent.us wrote:
Also, you could either, on boot, go into the card's firmware interface, and that'll tell you, somewhere, what the firmware version is. Not sure if MegaRAID will work with this card - if it does, you really want it..even though it has an actively user-hostile interface.
In IT mode, the 2308 is a straight SAS host bus adapter; all drives are presented directly to the host OS as native SAS devices.
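(A quick way to see exactly what the HBA is presenting to the OS - lsscsi is an extra package, while the by-path listing is available everywhere:)

  lsscsi -t                    # devices with their SAS transport addresses
  ls -l /dev/disk/by-path/     # roughly maps each sdX back to the HBA port/phy it hangs off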
I installed the latest firmware and driver (mpt2sas) from HP on one system. The driver is v20; it appears the firmware may still be v15, though:
[   11.128979] mpt2sas version 20.100.00.00 loaded
[   11.513836] mpt2sas0: LSISAS2308: FWVersion(15.10.09.00), ChipRevision(0x05), BiosVersion(07.39.00.00)
Hdparm didn’t get far:
[root@r1k1 ~] # hdparm -tT /dev/sda
/dev/sda:
 Timing cached reads:   Alarm clock
[root@r1k1 ~] #
On 2016-05-25 19:13, Kelly Lesperance wrote:
Hdparm didn’t get far:
[root@r1k1 ~] # hdparm -tT /dev/sda
/dev/sda:
 Timing cached reads:   Alarm clock
[root@r1k1 ~] #
Hi Kelly,
Try running 'iostat -xdmc 1'. Look for a single drive that has substantially greater await than ~10msec. If all the drives except one are taking 6-8msec, but one is very much more, you've got a drive that drags down the whole array's performance.
Ignore the very first output from the command - it's an average of the disk subsystem since boot.
Post a representative output along with the contents of /proc/mdstat.
Good luck,
Charles Polisher
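(One way to watch for that without eyeballing every column: with the sysstat version shown in this thread, await is field 10 of the extended iostat output, so a quick filter for drives averaging over 10 ms might look like the following - adjust the field number for other sysstat versions:)

  iostat -xd 1 | awk '/^sd/ && $10+0 > 10'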
Hi Charles,
Looks to me like all of the drives are performing roughly the same – there’s certainly not one that sticks out (also note this is happening on all 23 nodes in the cluster).
Thanks!
Kelly
[root@r1k1.kafka.log10.blackberry sys] # cat /proc/mdstat
Personalities : [raid10]
md127 : active raid10 sdc[2] sdh[7] sdb[1] sdf[5] sde[13] sdg[12] sdj[9] sdk[10] sda[0] sdl[11] sdd[3] sdi[8]
      23441323008 blocks super 1.2 512K chunks 2 near-copies [12/12] [UUUUUUUUUUUU]
      [>....................]  check =  0.0% (618944/23441323008) finish=108288.4min speed=3607K/sec

unused devices: <none>

[root@r1k1.kafka.log10.blackberry sys] # iostat -xdmc 1 10
Linux 3.10.0-327.18.2.el7.x86_64 (r1k1.kafka.log10.blackberry)   05/26/16   _x86_64_   (32 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle 12.76 0.07 2.48 0.16 0.00 84.53
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdi 0.01 0.56 0.26 26.44 0.06 11.30 871.39 9.67 362.22 5.14 365.71 6.68 17.83 sdk 0.01 0.56 0.26 26.56 0.06 11.30 867.70 9.53 355.45 5.05 358.84 6.58 17.65 sdc 0.01 0.46 0.26 26.34 0.06 11.29 874.67 9.73 365.89 4.86 369.38 6.81 18.11 sdd 0.01 0.46 0.20 26.34 0.07 11.29 876.98 10.40 391.99 5.33 394.93 7.17 19.02 sda 0.01 0.49 0.26 26.53 0.06 11.29 868.24 9.48 353.91 4.96 357.36 6.57 17.61 sdj 0.01 0.56 0.20 26.44 0.07 11.30 873.73 10.04 376.87 5.48 379.68 6.91 18.40 sdl 0.01 0.56 0.20 26.56 0.07 11.30 869.99 9.77 365.16 5.92 367.92 6.72 17.99 sdh 0.01 0.57 0.21 26.79 0.07 11.30 862.30 9.65 357.60 5.27 360.31 6.63 17.90 sde 0.01 0.47 0.26 26.13 0.06 11.29 881.38 10.60 401.47 6.62 405.41 7.35 19.41 sdf 0.01 0.47 0.20 26.13 0.07 11.29 883.71 9.53 361.85 5.24 364.64 6.73 17.73 sdg 0.01 0.57 0.26 26.79 0.06 11.30 859.99 10.15 375.20 5.26 378.82 6.86 18.57 sdb 0.01 0.49 0.20 26.53 0.07 11.29 870.69 9.85 368.48 5.35 371.23 6.79 18.15 md127 0.00 0.00 2.51 156.82 0.77 67.77 881.06 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle 25.51 0.03 4.37 1.05 0.00 69.04
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdi 0.00 1.00 8.00 30.00 0.50 14.18 791.16 1.06 28.03 0.50 35.37 6.97 26.50 sdk 0.00 0.00 8.00 30.00 0.50 14.52 809.47 0.93 24.32 0.00 30.80 7.87 29.90 sdc 0.00 1.00 9.00 32.00 0.56 15.21 787.90 1.13 27.54 0.67 35.09 6.90 28.30 sdd 0.00 1.00 10.00 32.00 0.62 15.21 772.19 1.29 30.69 0.70 40.06 6.76 28.40 sda 0.00 0.00 8.00 38.00 0.50 15.54 714.09 1.40 30.35 0.38 36.66 7.91 36.40 sdj 0.00 1.00 8.00 30.00 0.50 14.18 791.16 1.05 27.68 0.50 34.93 7.00 26.60 sdl 0.00 0.00 8.00 30.00 0.50 14.52 809.47 0.90 23.61 0.25 29.83 7.66 29.10 sdh 0.00 1.00 13.00 34.00 0.81 14.11 650.04 1.17 24.98 0.31 34.41 6.60 31.00 sde 0.00 0.00 16.00 31.00 1.00 14.54 676.94 1.20 25.45 0.31 38.42 7.13 33.50 sdf 0.00 0.00 16.00 31.00 1.00 14.54 676.94 1.19 25.38 0.31 38.32 5.57 26.20 sdg 0.00 1.00 13.00 34.00 0.81 14.11 650.04 1.22 25.98 0.31 35.79 6.70 31.50 sdb 0.00 0.00 8.00 38.00 0.50 15.54 714.09 1.31 28.41 0.25 34.34 8.02 36.90 md127 0.00 0.00 0.00 198.00 0.00 86.59 895.60 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle 21.31 0.00 2.99 0.00 0.00 75.69
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdi 0.00 0.00 8.00 7.00 0.50 3.50 546.13 0.13 8.47 6.25 11.00 8.47 12.70 sdk 0.00 0.00 8.00 8.00 0.50 3.98 574.00 0.16 9.88 1.00 18.75 10.06 16.10 sdc 0.00 0.00 8.00 8.00 0.50 4.00 576.00 0.12 7.25 0.62 13.88 7.25 11.60 sdd 0.00 0.00 8.00 8.00 0.50 4.00 576.00 0.12 7.44 0.50 14.38 7.44 11.90 sda 0.00 0.00 8.00 8.00 0.50 4.00 576.00 0.13 8.00 0.50 15.50 8.00 12.80 sdj 0.00 0.00 8.00 7.00 0.50 3.50 546.13 0.18 12.20 9.25 15.57 12.20 18.30 sdl 0.00 0.00 8.00 9.00 0.50 4.48 600.47 0.11 6.94 1.00 12.22 6.59 11.20 sdh 0.00 0.00 8.00 9.00 0.50 3.51 482.82 0.10 6.12 0.50 11.11 6.12 10.40 sde 0.00 0.00 8.00 9.00 0.50 4.00 542.59 0.16 9.65 0.25 18.00 9.65 16.40 sdf 0.00 0.00 8.00 9.00 0.50 4.00 542.59 0.13 7.65 0.25 14.22 7.65 13.00 sdg 0.00 0.00 8.00 9.00 0.50 3.51 482.82 0.13 7.59 0.50 13.89 7.59 12.90 sdb 0.00 0.00 8.00 8.00 0.50 4.00 576.00 0.11 6.62 2.12 11.12 6.62 10.60 md127 0.00 0.00 0.00 49.00 0.00 23.00 961.14 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle 18.70 4.21 4.24 0.00 0.00 72.85
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdi 0.00 0.00 8.00 15.00 0.50 6.09 586.78 0.22 9.17 0.25 13.93 6.26 14.40 sdk 0.00 0.00 8.00 14.00 0.50 5.55 563.27 0.25 11.68 2.38 17.00 7.59 16.70 sdc 0.00 0.00 8.00 13.00 0.50 6.50 682.67 0.15 7.00 0.25 11.15 6.00 12.60 sdd 0.00 0.00 8.00 13.00 0.50 6.50 682.67 0.17 7.95 0.25 12.69 6.86 14.40 sda 0.00 0.00 8.00 14.00 0.50 6.50 652.00 0.26 11.77 0.62 18.14 7.86 17.30 sdj 0.00 0.00 8.00 15.00 0.50 6.09 586.78 0.34 14.35 2.00 20.93 9.87 22.70 sdl 0.00 0.00 8.00 13.00 0.50 5.05 541.33 0.25 11.86 0.50 18.85 7.57 15.90 sdh 0.00 0.00 10.00 17.00 0.62 7.14 589.04 0.33 12.19 0.60 19.00 7.41 20.00 sde 0.00 0.00 8.00 18.00 0.50 6.68 565.85 0.31 11.77 0.25 16.89 7.00 18.20 sdf 0.00 0.00 8.00 18.00 0.50 6.68 565.85 0.42 16.12 2.25 22.28 9.96 25.90 sdg 0.00 0.00 10.00 17.00 0.62 7.14 589.04 0.33 12.30 0.60 19.18 6.59 17.80 sdb 0.00 0.00 8.00 14.00 0.50 6.50 652.00 0.27 12.14 2.25 17.79 8.00 17.60 md127 0.00 0.00 0.00 91.00 0.00 38.47 865.76 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle 16.69 0.03 3.08 0.03 0.00 80.16
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdi 0.00 0.00 18.00 14.00 1.12 7.00 520.00 0.15 4.84 0.50 10.43 4.62 14.80 sdk 0.00 0.00 16.00 14.00 1.00 7.00 546.13 0.14 4.77 0.38 9.79 4.77 14.30 sdc 0.00 0.00 16.00 13.00 1.00 6.50 529.66 0.14 5.00 0.38 10.69 5.00 14.50 sdd 0.00 0.00 16.00 13.00 1.00 6.50 529.66 0.15 5.10 0.38 10.92 5.10 14.80 sda 0.00 0.00 16.00 18.00 1.00 7.54 514.59 0.21 6.12 1.31 10.39 6.26 21.30 sdj 0.00 0.00 18.00 14.00 1.12 7.00 520.00 0.13 4.25 0.50 9.07 4.03 12.90 sdl 0.00 0.00 16.00 14.00 1.00 7.00 546.13 0.13 4.47 0.31 9.21 4.47 13.40 sdh 7.00 0.00 10.00 13.00 1.06 6.50 673.39 0.10 4.57 0.50 7.69 4.57 10.50 sde 6.00 0.00 10.00 13.00 1.00 6.50 667.83 0.15 6.35 0.60 10.77 6.35 14.60 sdf 6.00 0.00 10.00 13.00 1.00 6.50 667.83 0.14 6.22 0.60 10.54 6.22 14.30 sdg 7.00 0.00 10.00 13.00 1.06 6.50 673.39 0.10 4.39 0.50 7.38 4.39 10.10 sdb 0.00 0.00 16.00 19.00 1.00 7.57 501.71 0.13 3.77 0.31 6.68 3.77 13.20 md127 0.00 0.00 0.00 85.00 0.00 40.57 977.60 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle 22.73 0.00 5.91 0.06 0.00 71.30
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdi 242.00 0.00 1048.00 1.00 80.62 0.01 157.43 4.38 4.18 4.17 8.00 0.37 38.50 sdk 334.00 0.00 954.00 0.00 80.50 0.00 172.81 7.27 7.62 7.62 0.00 0.45 43.20 sdc 294.00 0.00 994.00 0.00 80.50 0.00 165.86 5.56 5.59 5.59 0.00 0.40 39.80 sdd 249.00 0.00 1039.00 0.00 80.50 0.00 158.68 4.11 3.95 3.95 0.00 0.37 38.00 sda 268.00 0.00 1020.00 11.00 80.50 0.18 160.26 5.47 5.31 5.14 21.36 0.58 60.20 sdj 253.00 0.00 1037.00 1.00 80.62 0.01 159.10 4.42 4.26 4.26 3.00 0.37 38.80 sdl 257.00 0.00 1031.00 0.00 80.50 0.00 159.91 5.13 4.98 4.98 0.00 0.37 38.30 sdh 224.00 0.00 1064.00 1.00 80.50 0.00 154.81 3.80 3.57 3.57 10.00 0.36 38.30 sde 247.00 0.00 1041.00 0.00 80.50 0.00 158.37 4.96 4.77 4.77 0.00 0.37 38.60 sdf 220.00 0.00 1068.00 0.00 80.50 0.00 154.37 3.70 3.47 3.47 0.00 0.33 35.40 sdg 242.00 0.00 1046.00 1.00 80.50 0.00 157.47 5.05 4.82 4.81 13.00 0.39 40.80 sdb 239.00 0.00 1049.00 10.00 80.50 0.15 155.97 4.77 4.51 4.43 12.10 0.45 47.90 md127 0.00 0.00 0.00 13.00 0.00 0.17 26.46 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle 28.14 0.03 6.00 0.00 0.00 65.83
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdi 0.00 5.00 12.00 37.00 0.75 0.29 43.27 1.13 23.04 0.67 30.30 4.98 24.40 sdk 0.00 17.00 16.00 34.00 1.00 0.34 54.88 2.95 59.02 0.56 86.53 3.98 19.90 sdc 0.00 4.00 8.00 33.00 0.50 0.25 37.66 1.14 27.88 0.25 34.58 5.66 23.20 sdd 0.00 4.00 8.00 33.00 0.50 0.25 37.66 0.52 12.83 0.25 15.88 4.02 16.50 sda 0.00 3.00 16.00 21.00 1.00 0.17 64.86 0.26 7.14 1.06 11.76 3.92 14.50 sdj 0.00 5.00 12.00 37.00 0.75 0.29 43.27 0.84 17.24 0.42 22.70 4.47 21.90 sdl 0.00 17.00 16.00 34.00 1.00 0.34 54.88 2.98 59.56 0.56 87.32 3.92 19.60 sdh 0.00 4.00 8.00 26.00 0.50 0.20 41.88 0.67 19.71 1.75 25.23 4.50 15.30 sde 0.00 4.00 8.00 22.00 0.50 0.19 47.20 0.39 12.83 2.38 16.64 3.93 11.80 sdf 0.00 4.00 8.00 22.00 0.50 0.19 47.20 0.35 11.60 2.25 15.00 3.67 11.00 sdg 0.00 4.00 8.00 26.00 0.50 0.20 41.88 0.67 19.62 1.50 25.19 4.12 14.00 sdb 0.00 3.00 16.00 21.00 1.00 0.17 64.86 0.42 11.27 1.00 19.10 6.73 24.90 md127 0.00 0.00 0.00 210.00 0.00 1.44 14.06 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle 35.57 0.00 10.34 0.00 0.00 54.08
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdi 0.00 0.00 9.00 11.00 0.56 0.04 62.00 0.14 7.00 1.78 11.27 6.50 13.00 sdk 0.00 0.00 8.00 11.00 0.50 0.05 58.95 0.10 5.26 0.88 8.45 5.26 10.00 sdc 0.00 0.00 16.00 16.00 1.00 0.07 68.75 0.17 5.44 0.44 10.44 5.44 17.40 sdd 0.00 0.00 16.00 16.00 1.00 0.07 68.75 0.17 5.38 1.00 9.75 5.38 17.20 sda 0.00 0.00 8.00 16.00 0.50 0.06 48.08 0.20 8.54 0.75 12.44 5.42 13.00 sdj 0.00 0.00 9.00 11.00 0.56 0.04 62.00 0.13 6.65 0.44 11.73 5.90 11.80 sdl 0.00 0.00 8.00 11.00 0.50 0.05 58.95 0.12 6.16 1.62 9.45 6.16 11.70 sdh 0.00 0.00 16.00 30.00 1.00 0.11 49.63 0.32 6.85 0.44 10.27 4.39 20.20 sde 0.00 0.00 16.00 6.00 1.00 0.02 95.27 0.10 4.41 0.44 15.00 4.41 9.70 sdf 0.00 0.00 16.00 6.00 1.00 0.02 95.27 0.14 6.59 4.06 13.33 6.55 14.40 sdg 0.00 0.00 16.00 29.00 1.00 0.11 50.56 0.31 6.80 0.44 10.31 4.82 21.70 sdb 0.00 0.00 8.00 16.00 0.50 0.06 48.08 0.24 10.17 0.62 14.94 6.75 16.20 md127 0.00 0.00 0.00 89.00 0.00 0.36 8.24 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle 35.50 0.03 7.43 0.00 0.00 57.04
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdi 74.00 0.00 21.00 7.00 5.94 1.53 546.00 0.47 16.71 17.57 14.14 4.89 13.70 sdk 70.00 0.00 26.00 10.00 6.00 1.41 421.56 0.41 10.94 11.73 8.90 4.33 15.60 sdc 77.00 0.00 11.00 9.00 5.50 1.57 723.60 0.64 32.00 42.64 19.00 10.65 21.30 sdd 77.00 0.00 11.00 9.00 5.50 1.57 723.60 1.19 59.60 96.36 14.67 12.10 24.20 sda 71.00 1.00 24.00 11.00 5.94 1.53 437.09 0.51 14.46 14.38 14.64 5.09 17.80 sdj 74.00 0.00 21.00 7.00 5.94 1.53 546.00 0.58 20.79 20.57 21.43 7.04 19.70 sdl 70.00 0.00 26.00 11.00 6.00 1.91 437.84 0.39 10.54 11.04 9.36 4.32 16.00 sdh 77.00 0.00 11.00 7.00 5.50 1.52 798.67 0.43 24.17 33.82 9.00 6.61 11.90 sde 77.00 0.00 11.00 6.00 5.50 1.52 845.18 0.58 34.24 36.91 29.33 13.71 23.30 sdf 77.00 0.00 11.00 6.00 5.50 1.52 845.18 0.60 35.35 45.36 17.00 10.06 17.10 sdg 77.00 0.00 11.00 8.00 5.50 1.52 757.05 0.43 22.95 32.00 10.50 6.89 13.10 sdb 71.00 1.00 24.00 11.00 5.94 1.53 437.09 0.60 17.14 13.67 24.73 9.03 31.60 md127 0.00 0.00 0.00 52.00 0.00 9.57 376.96 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle 27.06 0.03 6.00 0.00 0.00 66.91
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdi 14.00 0.00 10.00 9.00 1.50 4.06 599.58 0.13 6.84 2.60 11.56 6.63 12.60 sdk 14.00 0.00 10.00 10.00 1.50 5.00 665.60 0.13 7.05 2.50 11.60 6.35 12.70 sdc 14.00 0.00 11.00 10.00 1.56 4.01 543.24 0.15 7.33 1.00 14.30 7.24 15.20 sdd 14.00 0.00 11.00 10.00 1.56 4.01 543.24 0.15 7.14 1.00 13.90 7.05 14.80 sda 14.00 0.00 11.00 10.00 1.56 4.20 561.52 0.12 5.38 0.91 10.30 5.43 11.40 sdj 14.00 0.00 10.00 9.00 1.50 4.06 599.58 0.26 13.68 3.60 24.89 13.47 25.60 sdl 14.00 0.00 10.00 9.00 1.50 4.50 646.74 0.13 6.63 1.30 12.56 6.47 12.30 sdh 13.00 0.00 11.00 9.00 1.50 4.00 563.60 0.11 5.70 1.18 11.22 5.55 11.10 sde 14.00 0.00 10.00 8.00 1.50 4.00 625.78 0.09 4.78 1.10 9.38 4.67 8.40 sdf 14.00 0.00 10.00 8.00 1.50 4.00 625.78 0.14 8.06 4.00 13.12 7.17 12.90 sdg 13.00 0.00 11.00 9.00 1.50 4.00 563.60 0.14 7.00 1.91 13.22 6.80 13.60 sdb 14.00 0.00 11.00 10.00 1.56 4.20 561.52 0.17 7.67 1.73 14.20 7.71 16.20 md127 0.00 0.00 0.00 56.00 0.00 25.27 924.14 0.00 0.00 0.00 0.00 0.00 0.00
On 05/25/2016 09:54 AM, Kelly Lesperance wrote:
What we're seeing is that when the weekly raid-check script executes, performance nose dives, and I/O wait skyrockets. The raid check starts out fairly fast (20000K/sec - the limit that's been set), but then quickly drops down to about 4000K/Sec. dev.raid.speed sysctls are at the defaults:
It looks like some pretty heavy writes are going on at the time. I'm not sure what you mean by "nose dives", but I'd expect *some* performance impact of running a read-intensive process like a RAID check at the same time you're running a write-intensive process.
Do the same write-heavy processes run on the other clusters, where you aren't seeing performance issues?
avg-cpu: %user %nice %system %iowait %steal %idle 9.24 0.00 1.32 20.02 0.00 69.42
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 50.00 512.00 20408.00 512 20408 sdb 50.00 512.00 20408.00 512 20408 sdc 48.00 512.00 19984.00 512 19984 sdd 48.00 512.00 19984.00 512 19984 sdf 50.00 704.00 19968.00 704 19968 sdg 47.00 512.00 19968.00 512 19968 sdh 47.00 512.00 19968.00 512 19968 sde 50.00 704.00 19968.00 704 19968 sdj 48.00 512.00 19972.00 512 19972 sdi 48.00 512.00 19972.00 512 19972 sdk 48.00 512.00 19980.00 512 19980 sdl 48.00 512.00 19980.00 512 19980 md127 241.00 0.00 120280.00 0 120280
All of our Kafka clusters are fairly write-heavy. The cluster in question is our second-heaviest – we haven’t yet upgraded the heaviest, due to the issues we’ve been experiencing in this one.
Here is an iostat example from a host within the same cluster, but without the RAID check running:
[root@r2k1 ~] # iostat -xdmc 1 10
Linux 3.10.0-327.13.1.el7.x86_64 (r2k1)   05/27/16   _x86_64_   (32 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle 8.87 0.02 1.28 0.21 0.00 89.62
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdd 0.02 0.55 0.15 27.06 0.03 11.40 859.89 1.02 37.40 36.13 37.41 6.86 18.65 sdf 0.02 0.48 0.15 26.99 0.03 11.40 862.17 0.15 5.56 40.94 5.37 7.27 19.73 sdk 0.03 0.58 0.22 27.10 0.03 11.40 857.01 1.60 58.49 36.20 58.67 7.17 19.58 sdb 0.02 0.52 0.15 27.43 0.03 11.40 848.37 0.02 0.78 42.84 0.55 7.07 19.50 sdj 0.02 0.55 0.15 27.11 0.03 11.40 858.28 0.62 22.70 41.97 22.59 7.43 20.27 sdg 0.03 0.68 0.22 27.76 0.03 11.40 836.98 0.76 27.10 34.36 27.04 7.33 20.51 sde 0.03 0.48 0.22 26.99 0.03 11.40 860.43 0.33 12.07 33.16 11.90 7.34 19.98 sda 0.03 0.52 0.22 27.43 0.03 11.40 846.65 0.57 20.48 36.42 20.35 7.34 20.31 sdh 0.02 0.68 0.15 27.76 0.03 11.40 838.63 0.47 16.66 40.96 16.53 7.20 20.09 sdc 0.03 0.55 0.22 27.06 0.03 11.40 858.19 0.74 27.30 36.96 27.22 7.55 20.58 sdi 0.03 0.53 0.22 27.13 0.03 11.40 856.04 1.60 58.50 27.43 58.75 5.21 14.24 sdl 0.02 0.56 0.15 27.11 0.03 11.40 858.27 1.12 41.09 27.89 41.16 5.00 13.63 md127 0.00 0.00 2.53 161.84 0.36 68.39 856.56 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle 13.11 0.00 1.82 1.07 0.00 84.01
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdd 0.00 0.00 0.00 81.00 0.00 38.48 972.95 51.00 219.06 0.00 219.06 6.37 51.60 sdf 0.00 1.00 0.00 73.00 0.00 33.70 945.33 55.02 235.86 0.00 235.86 7.12 52.00 sdk 0.00 1.00 0.00 56.00 0.00 25.70 939.73 60.45 223.79 0.00 223.79 9.29 52.00 sdb 0.00 2.00 0.00 70.00 0.00 34.48 1008.70 58.88 292.81 0.00 292.81 7.37 51.60 sdj 0.00 3.00 0.00 62.00 0.00 29.87 986.60 59.32 243.48 0.00 243.48 8.26 51.20 sdg 0.00 1.00 0.00 49.00 0.00 23.43 979.45 60.37 234.98 0.00 234.98 10.53 51.60 sde 0.00 1.00 0.00 61.00 0.00 27.95 938.38 58.17 239.57 0.00 239.57 8.52 52.00 sda 0.00 2.00 0.00 56.00 0.00 27.48 1004.88 56.27 202.88 0.00 202.88 9.27 51.90 sdh 0.00 1.00 0.00 70.00 0.00 33.57 982.19 59.00 277.84 0.00 277.84 7.43 52.00 sdc 0.00 0.00 0.00 64.00 0.00 30.06 961.89 58.20 268.30 0.00 268.30 8.08 51.70 sdi 0.00 3.00 0.00 116.00 0.00 55.62 981.94 44.54 199.72 0.00 199.72 4.56 52.90 sdl 0.00 1.00 0.00 128.00 0.00 60.31 964.88 43.91 215.94 0.00 215.94 4.11 52.60 md127 0.00 0.00 0.00 1143.00 0.00 538.90 965.59 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle 15.70 0.00 1.97 0.44 0.00 81.89
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdd 0.00 0.00 0.00 119.00 0.00 56.39 970.42 42.84 639.45 0.00 639.45 6.66 79.20 sdf 0.00 1.00 0.00 129.00 0.00 61.21 971.84 48.89 672.04 0.00 672.04 6.34 81.80 sdk 0.00 0.00 0.00 152.00 0.00 72.62 978.53 61.02 716.76 0.00 716.76 5.74 87.20 sdb 0.00 1.00 0.00 133.00 0.00 62.86 967.88 54.10 695.35 0.00 695.35 6.45 85.80 sdj 0.00 0.00 0.00 146.00 0.00 68.36 958.85 69.22 767.12 0.00 767.12 6.85 100.00 sdg 0.00 0.00 0.00 146.00 0.00 69.87 980.11 77.99 789.53 0.00 789.53 6.85 100.00 sde 0.00 1.00 0.00 141.00 0.00 66.96 972.60 56.21 707.61 0.00 707.61 6.21 87.60 sda 0.00 1.00 0.00 147.00 0.00 69.86 973.22 62.21 728.76 0.00 728.76 6.32 92.90 sdh 0.00 0.00 0.00 134.00 0.00 62.61 956.90 55.79 711.49 0.00 711.49 6.63 88.90 sdc 0.00 0.00 0.00 136.00 0.00 64.81 975.94 61.46 753.57 0.00 753.57 6.93 94.20 sdi 0.00 0.00 0.00 93.00 0.00 42.67 939.61 17.60 419.10 0.00 419.10 4.63 43.10 sdl 0.00 0.00 0.00 80.00 0.00 38.02 973.20 11.00 340.79 0.00 340.79 4.25 34.00 md127 0.00 0.00 0.00 87.00 0.00 40.99 964.97 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle 12.11 0.00 1.35 0.00 0.00 86.54
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdd 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.01 15.00 0.00 15.00 15.00 1.50 sdf 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.01 11.00 0.00 11.00 11.00 1.10 sdk 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.01 11.00 0.00 11.00 11.00 1.10 sdb 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.01 7.00 0.00 7.00 7.00 0.70 sdj 0.00 0.00 0.00 2.00 0.00 0.06 64.50 0.01 733.50 0.00 733.50 7.50 1.50 sdg 0.00 0.00 0.00 10.00 0.00 2.88 588.90 0.55 1212.80 0.00 1212.80 15.50 15.50 sde 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.01 12.00 0.00 12.00 12.00 1.20 sda 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.01 11.00 0.00 11.00 11.00 1.10 sdh 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.02 20.00 0.00 20.00 20.00 2.00 sdc 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.02 17.00 0.00 17.00 17.00 1.70 sdi 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.01 12.00 0.00 12.00 12.00 1.20 sdl 0.00 0.00 0.00 1.00 0.00 0.00 1.00 0.02 17.00 0.00 17.00 17.00 1.70 md127 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle 15.22 0.00 1.50 0.00 0.00 83.28
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdk 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdj 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdh 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdi 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdl 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md127 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle 16.96 0.09 1.63 0.16 0.00 81.16
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdd 0.00 0.00 0.00 8.00 0.00 0.66 168.25 0.09 11.50 0.00 11.50 8.75 7.00 sdf 0.00 0.00 0.00 5.00 0.00 0.52 213.20 0.08 16.20 0.00 16.20 16.20 8.10 sdk 0.00 0.00 0.00 3.00 0.00 0.50 342.00 0.06 20.33 0.00 20.33 20.33 6.10 sdb 0.00 0.00 0.00 3.00 0.00 0.50 342.00 0.05 16.67 0.00 16.67 16.67 5.00 sdj 0.00 0.00 0.00 4.00 0.00 0.98 500.50 0.06 14.50 0.00 14.50 11.00 4.40 sdg 0.00 1.00 0.00 4.00 0.00 0.63 322.50 0.14 36.00 0.00 36.00 32.75 13.10 sde 0.00 0.00 0.00 5.00 0.00 0.52 213.20 0.07 13.60 0.00 13.60 13.60 6.80 sda 0.00 0.00 0.00 3.00 0.00 0.50 342.00 0.05 15.67 0.00 15.67 15.67 4.70 sdh 0.00 1.00 0.00 4.00 0.00 0.63 322.50 0.06 14.50 0.00 14.50 11.50 4.60 sdc 0.00 0.00 0.00 8.00 0.00 0.66 168.25 0.11 13.25 0.00 13.25 10.62 8.50 sdi 0.00 0.00 0.00 4.00 0.00 0.98 500.50 0.06 15.50 0.00 15.50 12.00 4.80 sdl 0.00 0.00 0.00 3.00 0.00 0.50 342.00 0.04 13.67 0.00 13.67 13.67 4.10 md127 0.00 0.00 0.00 17.00 0.00 3.78 455.53 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle 14.08 0.00 1.50 0.00 0.00 84.42
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdk 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdj 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdh 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdi 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdl 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md127 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle 14.89 0.00 1.98 0.00 0.00 83.13
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdd 0.00 0.00 0.00 90.00 0.00 41.31 940.01 27.25 302.80 0.00 302.80 7.07 63.60 sdf 0.00 0.00 0.00 87.00 0.00 41.35 973.44 22.73 261.30 0.00 261.30 6.92 60.20 sdk 0.00 2.00 0.00 97.00 0.00 42.08 888.42 39.86 410.94 0.00 410.94 8.10 78.60 sdb 0.00 0.00 0.00 87.00 0.00 41.07 966.82 24.39 280.30 0.00 280.30 7.14 62.10 sdj 0.00 1.00 0.00 91.00 0.00 41.94 943.92 36.37 399.62 0.00 399.62 8.44 76.80 sdg 0.00 0.00 0.00 86.00 0.00 40.67 968.48 31.76 369.33 0.00 369.33 8.81 75.80 sde 0.00 0.00 0.00 87.00 0.00 41.35 973.44 30.80 354.05 0.00 354.05 9.01 78.40 sda 0.00 0.00 0.00 87.00 0.00 41.07 966.82 32.61 374.80 0.00 374.80 8.57 74.60 sdh 0.00 0.00 0.00 86.00 0.00 40.67 968.48 29.52 343.23 0.00 343.23 8.56 73.60 sdc 0.00 0.00 0.00 89.00 0.00 40.81 939.07 32.80 360.15 0.00 360.15 8.91 79.30 sdi 0.00 1.00 0.00 91.00 0.00 41.94 943.92 19.60 215.34 0.00 215.34 5.62 51.10 sdl 0.00 2.00 0.00 97.00 0.00 42.08 888.42 19.59 201.93 0.00 201.93 4.69 45.50 md127 0.00 0.00 0.00 535.00 0.00 248.42 950.95 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle 11.08 0.00 1.41 0.00 0.00 87.51
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdd 0.00 5.00 0.00 42.00 0.00 0.38 18.55 2.25 53.52 0.00 53.52 4.93 20.70 sdf 0.00 0.00 0.00 35.00 0.00 0.21 12.43 1.62 46.17 0.00 46.17 5.29 18.50 sdk 0.00 23.00 0.00 42.00 0.00 0.44 21.40 1.99 47.29 0.00 47.29 4.64 19.50 sdb 0.00 9.00 0.00 58.00 0.00 0.34 12.02 2.77 47.78 0.00 47.78 4.12 23.90 sdj 0.00 1.00 0.00 39.00 0.00 0.24 12.79 1.79 45.97 0.00 45.97 5.21 20.30 sdg 0.00 11.00 0.00 66.00 0.00 0.40 12.45 3.60 54.47 0.00 54.47 3.42 22.60 sde 0.00 0.00 0.00 35.00 0.00 0.21 12.43 2.13 61.00 0.00 61.00 8.89 31.10 sda 0.00 9.00 0.00 58.00 0.00 0.34 12.02 2.48 42.81 0.00 42.81 3.71 21.50 sdh 0.00 11.00 0.00 66.00 0.00 0.40 12.45 4.81 72.83 0.00 72.83 3.80 25.10 sdc 0.00 5.00 0.00 43.00 0.00 0.88 41.93 1.99 63.81 0.00 63.81 5.00 21.50 sdi 0.00 1.00 0.00 39.00 0.00 0.24 12.79 1.31 33.69 0.00 33.69 4.03 15.70 sdl 0.00 23.00 0.00 42.00 0.00 0.44 21.40 1.23 29.33 0.00 29.33 3.71 15.60 md127 0.00 0.00 0.00 313.00 0.00 2.01 13.14 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle 16.16 0.03 1.66 0.00 0.00 82.15
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdk 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdj 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdh 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdi 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdl 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md127 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
On 2016-05-26, 11:50 PM, "centos-bounces@centos.org on behalf of Gordon Messmer" <centos-bounces@centos.org on behalf of gordon.messmer@gmail.com> wrote:
On 05/25/2016 09:54 AM, Kelly Lesperance wrote:
What we're seeing is that when the weekly raid-check script executes, performance nose dives, and I/O wait skyrockets. The raid check starts out fairly fast (20000K/sec - the limit that's been set), but then quickly drops down to about 4000K/Sec. dev.raid.speed sysctls are at the defaults:
It looks like some pretty heavy writes are going on at the time. I'm not sure what you mean by "nose dives", but I'd expect *some* performance impact of running a read-intensive process like a RAID check at the same time you're running a write-intensive process.
Do the same write-heavy processes run on the other clusters, where you aren't seeing performance issues?
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           9.24   0.00     1.32    20.02    0.00  69.42

Device:  tps  kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sda      50.00  512.00  20408.00  512  20408
sdb      50.00  512.00  20408.00  512  20408
sdc      48.00  512.00  19984.00  512  19984
sdd      48.00  512.00  19984.00  512  19984
sdf      50.00  704.00  19968.00  704  19968
sdg      47.00  512.00  19968.00  512  19968
sdh      47.00  512.00  19968.00  512  19968
sde      50.00  704.00  19968.00  704  19968
sdj      48.00  512.00  19972.00  512  19972
sdi      48.00  512.00  19972.00  512  19972
sdk      48.00  512.00  19980.00  512  19980
sdl      48.00  512.00  19980.00  512  19980
md127    241.00  0.00  120280.00  0  120280
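For anyone wanting to tune this, the md rate-limit sysctls quoted above can be inspected and raised at runtime; a rough sketch in bash, where the 50000 KB/s figure is purely illustrative rather than a recommendation:

# current per-device limits, in KB/s
sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max

# guarantee the check a larger minimum share even under competing I/O
sysctl -w dev.raid.speed_limit_min=50000

# persist across reboots (file name is arbitrary), then reload
echo 'dev.raid.speed_limit_min = 50000' > /etc/sysctl.d/90-raid-check.conf
sysctl --system

Raising speed_limit_min trades more impact on foreground writes for a faster, more predictable check window.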
I did some additional testing - I stopped Kafka on the host, and kicked off a disk check, and it ran at the expected speed overnight. I started Kafka this morning, and the raid check's speed immediately dropped down to ~2000K/Sec.
I then enabled the write-back cache on the drives (hdparm -W1 /dev/sd*). The raid check is now running between 100000K/Sec and 200000K/Sec, and has been for several hours (it fluctuates, but seems to stay within that range). Write-back cache is NOT enabled for the drives on the hosts we haven't upgraded yet, but the speeds are similar (I kicked off a raid check on one of our CentOS 6 hosts as well, the window seems to be 150000 - 200000K/Sec on that host).
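For reference, a rough sketch of the commands involved (bash; the device names and md127 are the ones that appear elsewhere in this thread, and the check trigger is the manual equivalent of what the weekly raid-check script does):

# query the current write-back cache setting on every member disk
for d in /dev/sd{a..l}; do hdparm -W "$d"; done

# enable the on-drive write-back cache
# (volatile cache - acknowledged writes can be lost on power failure)
hdparm -W1 /dev/sd{a..l}

# kick off an md consistency check and watch its progress and speed
echo check > /sys/block/md127/md/sync_action
cat /proc/mdstat
cat /sys/block/md127/md/sync_speed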
Kelly
On 2016-05-27, 9:21 AM, "Kelly Lesperance" klesperance@blackberry.com wrote:
All of our Kafka clusters are fairly write-heavy. The cluster in question is our second-heaviest – we haven’t yet upgraded the heaviest, due to the issues we’ve been experiencing in this one.
Here is an iostat example from a host within the same cluster, but without the RAID check running:
[root@r2k1 ~] # iostat -xdmc 1 10
Linux 3.10.0-327.13.1.el7.x86_64 (r2k1)  05/27/16  _x86_64_  (32 CPU)

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           8.87   0.02     1.28     0.21    0.00  89.62

Device:  rrqm/s  wrqm/s  r/s  w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdd      0.02  0.55  0.15  27.06  0.03  11.40  859.89  1.02  37.40  36.13  37.41  6.86  18.65
sdf      0.02  0.48  0.15  26.99  0.03  11.40  862.17  0.15  5.56  40.94  5.37  7.27  19.73
sdk      0.03  0.58  0.22  27.10  0.03  11.40  857.01  1.60  58.49  36.20  58.67  7.17  19.58
sdb      0.02  0.52  0.15  27.43  0.03  11.40  848.37  0.02  0.78  42.84  0.55  7.07  19.50
sdj      0.02  0.55  0.15  27.11  0.03  11.40  858.28  0.62  22.70  41.97  22.59  7.43  20.27
sdg      0.03  0.68  0.22  27.76  0.03  11.40  836.98  0.76  27.10  34.36  27.04  7.33  20.51
sde      0.03  0.48  0.22  26.99  0.03  11.40  860.43  0.33  12.07  33.16  11.90  7.34  19.98
sda      0.03  0.52  0.22  27.43  0.03  11.40  846.65  0.57  20.48  36.42  20.35  7.34  20.31
sdh      0.02  0.68  0.15  27.76  0.03  11.40  838.63  0.47  16.66  40.96  16.53  7.20  20.09
sdc      0.03  0.55  0.22  27.06  0.03  11.40  858.19  0.74  27.30  36.96  27.22  7.55  20.58
sdi      0.03  0.53  0.22  27.13  0.03  11.40  856.04  1.60  58.50  27.43  58.75  5.21  14.24
sdl      0.02  0.56  0.15  27.11  0.03  11.40  858.27  1.12  41.09  27.89  41.16  5.00  13.63
md127    0.00  0.00  2.53  161.84  0.36  68.39  856.56  0.00  0.00  0.00  0.00  0.00  0.00

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          13.11   0.00     1.82     1.07    0.00  84.01

Device:  rrqm/s  wrqm/s  r/s  w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdd      0.00  0.00  0.00  81.00  0.00  38.48  972.95  51.00  219.06  0.00  219.06  6.37  51.60
sdf      0.00  1.00  0.00  73.00  0.00  33.70  945.33  55.02  235.86  0.00  235.86  7.12  52.00
sdk      0.00  1.00  0.00  56.00  0.00  25.70  939.73  60.45  223.79  0.00  223.79  9.29  52.00
sdb      0.00  2.00  0.00  70.00  0.00  34.48  1008.70  58.88  292.81  0.00  292.81  7.37  51.60
sdj      0.00  3.00  0.00  62.00  0.00  29.87  986.60  59.32  243.48  0.00  243.48  8.26  51.20
sdg      0.00  1.00  0.00  49.00  0.00  23.43  979.45  60.37  234.98  0.00  234.98  10.53  51.60
sde      0.00  1.00  0.00  61.00  0.00  27.95  938.38  58.17  239.57  0.00  239.57  8.52  52.00
sda      0.00  2.00  0.00  56.00  0.00  27.48  1004.88  56.27  202.88  0.00  202.88  9.27  51.90
sdh      0.00  1.00  0.00  70.00  0.00  33.57  982.19  59.00  277.84  0.00  277.84  7.43  52.00
sdc      0.00  0.00  0.00  64.00  0.00  30.06  961.89  58.20  268.30  0.00  268.30  8.08  51.70
sdi      0.00  3.00  0.00  116.00  0.00  55.62  981.94  44.54  199.72  0.00  199.72  4.56  52.90
sdl      0.00  1.00  0.00  128.00  0.00  60.31  964.88  43.91  215.94  0.00  215.94  4.11  52.60
md127    0.00  0.00  0.00  1143.00  0.00  538.90  965.59  0.00  0.00  0.00  0.00  0.00  0.00

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          15.70   0.00     1.97     0.44    0.00  81.89

Device:  rrqm/s  wrqm/s  r/s  w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdd      0.00  0.00  0.00  119.00  0.00  56.39  970.42  42.84  639.45  0.00  639.45  6.66  79.20
sdf      0.00  1.00  0.00  129.00  0.00  61.21  971.84  48.89  672.04  0.00  672.04  6.34  81.80
sdk      0.00  0.00  0.00  152.00  0.00  72.62  978.53  61.02  716.76  0.00  716.76  5.74  87.20
sdb      0.00  1.00  0.00  133.00  0.00  62.86  967.88  54.10  695.35  0.00  695.35  6.45  85.80
sdj      0.00  0.00  0.00  146.00  0.00  68.36  958.85  69.22  767.12  0.00  767.12  6.85  100.00
sdg      0.00  0.00  0.00  146.00  0.00  69.87  980.11  77.99  789.53  0.00  789.53  6.85  100.00
sde      0.00  1.00  0.00  141.00  0.00  66.96  972.60  56.21  707.61  0.00  707.61  6.21  87.60
sda      0.00  1.00  0.00  147.00  0.00  69.86  973.22  62.21  728.76  0.00  728.76  6.32  92.90
sdh      0.00  0.00  0.00  134.00  0.00  62.61  956.90  55.79  711.49  0.00  711.49  6.63  88.90
sdc      0.00  0.00  0.00  136.00  0.00  64.81  975.94  61.46  753.57  0.00  753.57  6.93  94.20
sdi      0.00  0.00  0.00  93.00  0.00  42.67  939.61  17.60  419.10  0.00  419.10  4.63  43.10
sdl      0.00  0.00  0.00  80.00  0.00  38.02  973.20  11.00  340.79  0.00  340.79  4.25  34.00
md127    0.00  0.00  0.00  87.00  0.00  40.99  964.97  0.00  0.00  0.00  0.00  0.00  0.00

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          12.11   0.00     1.35     0.00    0.00  86.54

Device:  rrqm/s  wrqm/s  r/s  w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdd      0.00  0.00  0.00  1.00  0.00  0.00  1.00  0.01  15.00  0.00  15.00  15.00  1.50
sdf      0.00  0.00  0.00  1.00  0.00  0.00  1.00  0.01  11.00  0.00  11.00  11.00  1.10
sdk      0.00  0.00  0.00  1.00  0.00  0.00  1.00  0.01  11.00  0.00  11.00  11.00  1.10
sdb      0.00  0.00  0.00  1.00  0.00  0.00  1.00  0.01  7.00  0.00  7.00  7.00  0.70
sdj      0.00  0.00  0.00  2.00  0.00  0.06  64.50  0.01  733.50  0.00  733.50  7.50  1.50
sdg      0.00  0.00  0.00  10.00  0.00  2.88  588.90  0.55  1212.80  0.00  1212.80  15.50  15.50
sde      0.00  0.00  0.00  1.00  0.00  0.00  1.00  0.01  12.00  0.00  12.00  12.00  1.20
sda      0.00  0.00  0.00  1.00  0.00  0.00  1.00  0.01  11.00  0.00  11.00  11.00  1.10
sdh      0.00  0.00  0.00  1.00  0.00  0.00  1.00  0.02  20.00  0.00  20.00  20.00  2.00
sdc      0.00  0.00  0.00  1.00  0.00  0.00  1.00  0.02  17.00  0.00  17.00  17.00  1.70
sdi      0.00  0.00  0.00  1.00  0.00  0.00  1.00  0.01  12.00  0.00  12.00  12.00  1.20
sdl      0.00  0.00  0.00  1.00  0.00  0.00  1.00  0.02  17.00  0.00  17.00  17.00  1.70
md127    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          15.22   0.00     1.50     0.00    0.00  83.28

Device:  rrqm/s  wrqm/s  r/s  w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdd      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sdf      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sdk      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sdb      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sdj      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sdg      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sde      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sda      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sdh      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sdc      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sdi      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sdl      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
md127    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          16.96   0.09     1.63     0.16    0.00  81.16

Device:  rrqm/s  wrqm/s  r/s  w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdd      0.00  0.00  0.00  8.00  0.00  0.66  168.25  0.09  11.50  0.00  11.50  8.75  7.00
sdf      0.00  0.00  0.00  5.00  0.00  0.52  213.20  0.08  16.20  0.00  16.20  16.20  8.10
sdk      0.00  0.00  0.00  3.00  0.00  0.50  342.00  0.06  20.33  0.00  20.33  20.33  6.10
sdb      0.00  0.00  0.00  3.00  0.00  0.50  342.00  0.05  16.67  0.00  16.67  16.67  5.00
sdj      0.00  0.00  0.00  4.00  0.00  0.98  500.50  0.06  14.50  0.00  14.50  11.00  4.40
sdg      0.00  1.00  0.00  4.00  0.00  0.63  322.50  0.14  36.00  0.00  36.00  32.75  13.10
sde      0.00  0.00  0.00  5.00  0.00  0.52  213.20  0.07  13.60  0.00  13.60  13.60  6.80
sda      0.00  0.00  0.00  3.00  0.00  0.50  342.00  0.05  15.67  0.00  15.67  15.67  4.70
sdh      0.00  1.00  0.00  4.00  0.00  0.63  322.50  0.06  14.50  0.00  14.50  11.50  4.60
sdc      0.00  0.00  0.00  8.00  0.00  0.66  168.25  0.11  13.25  0.00  13.25  10.62  8.50
sdi      0.00  0.00  0.00  4.00  0.00  0.98  500.50  0.06  15.50  0.00  15.50  12.00  4.80
sdl      0.00  0.00  0.00  3.00  0.00  0.50  342.00  0.04  13.67  0.00  13.67  13.67  4.10
md127    0.00  0.00  0.00  17.00  0.00  3.78  455.53  0.00  0.00  0.00  0.00  0.00  0.00

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          14.08   0.00     1.50     0.00    0.00  84.42

Device:  rrqm/s  wrqm/s  r/s  w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdd      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sdf      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sdk      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sdb      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sdj      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sdg      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sde      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sda      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sdh      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sdc      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sdi      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sdl      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
md127    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          14.89   0.00     1.98     0.00    0.00  83.13

Device:  rrqm/s  wrqm/s  r/s  w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdd      0.00  0.00  0.00  90.00  0.00  41.31  940.01  27.25  302.80  0.00  302.80  7.07  63.60
sdf      0.00  0.00  0.00  87.00  0.00  41.35  973.44  22.73  261.30  0.00  261.30  6.92  60.20
sdk      0.00  2.00  0.00  97.00  0.00  42.08  888.42  39.86  410.94  0.00  410.94  8.10  78.60
sdb      0.00  0.00  0.00  87.00  0.00  41.07  966.82  24.39  280.30  0.00  280.30  7.14  62.10
sdj      0.00  1.00  0.00  91.00  0.00  41.94  943.92  36.37  399.62  0.00  399.62  8.44  76.80
sdg      0.00  0.00  0.00  86.00  0.00  40.67  968.48  31.76  369.33  0.00  369.33  8.81  75.80
sde      0.00  0.00  0.00  87.00  0.00  41.35  973.44  30.80  354.05  0.00  354.05  9.01  78.40
sda      0.00  0.00  0.00  87.00  0.00  41.07  966.82  32.61  374.80  0.00  374.80  8.57  74.60
sdh      0.00  0.00  0.00  86.00  0.00  40.67  968.48  29.52  343.23  0.00  343.23  8.56  73.60
sdc      0.00  0.00  0.00  89.00  0.00  40.81  939.07  32.80  360.15  0.00  360.15  8.91  79.30
sdi      0.00  1.00  0.00  91.00  0.00  41.94  943.92  19.60  215.34  0.00  215.34  5.62  51.10
sdl      0.00  2.00  0.00  97.00  0.00  42.08  888.42  19.59  201.93  0.00  201.93  4.69  45.50
md127    0.00  0.00  0.00  535.00  0.00  248.42  950.95  0.00  0.00  0.00  0.00  0.00  0.00

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          11.08   0.00     1.41     0.00    0.00  87.51

Device:  rrqm/s  wrqm/s  r/s  w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdd      0.00  5.00  0.00  42.00  0.00  0.38  18.55  2.25  53.52  0.00  53.52  4.93  20.70
sdf      0.00  0.00  0.00  35.00  0.00  0.21  12.43  1.62  46.17  0.00  46.17  5.29  18.50
sdk      0.00  23.00  0.00  42.00  0.00  0.44  21.40  1.99  47.29  0.00  47.29  4.64  19.50
sdb      0.00  9.00  0.00  58.00  0.00  0.34  12.02  2.77  47.78  0.00  47.78  4.12  23.90
sdj      0.00  1.00  0.00  39.00  0.00  0.24  12.79  1.79  45.97  0.00  45.97  5.21  20.30
sdg      0.00  11.00  0.00  66.00  0.00  0.40  12.45  3.60  54.47  0.00  54.47  3.42  22.60
sde      0.00  0.00  0.00  35.00  0.00  0.21  12.43  2.13  61.00  0.00  61.00  8.89  31.10
sda      0.00  9.00  0.00  58.00  0.00  0.34  12.02  2.48  42.81  0.00  42.81  3.71  21.50
sdh      0.00  11.00  0.00  66.00  0.00  0.40  12.45  4.81  72.83  0.00  72.83  3.80  25.10
sdc      0.00  5.00  0.00  43.00  0.00  0.88  41.93  1.99  63.81  0.00  63.81  5.00  21.50
sdi      0.00  1.00  0.00  39.00  0.00  0.24  12.79  1.31  33.69  0.00  33.69  4.03  15.70
sdl      0.00  23.00  0.00  42.00  0.00  0.44  21.40  1.23  29.33  0.00  29.33  3.71  15.60
md127    0.00  0.00  0.00  313.00  0.00  2.01  13.14  0.00  0.00  0.00  0.00  0.00  0.00

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          16.16   0.03     1.66     0.00    0.00  82.15

Device:  rrqm/s  wrqm/s  r/s  w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdd      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sdf      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sdk      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sdb      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sdj      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sdg      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sde      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sda      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sdh      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sdc      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sdi      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
sdl      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
md127    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
On 2016-05-26, 11:50 PM, Gordon Messmer <gordon.messmer@gmail.com> wrote:
On 05/25/2016 09:54 AM, Kelly Lesperance wrote:
What we're seeing is that when the weekly raid-check script executes, performance nose dives, and I/O wait skyrockets. The raid check starts out fairly fast (20000K/sec - the limit that's been set), but then quickly drops down to about 4000K/Sec. dev.raid.speed sysctls are at the defaults:
It looks like some pretty heavy writes are going on at the time. I'm not sure what you mean by "nose dives", but I'd expect *some* performance impact of running a read-intensive process like a RAID check at the same time you're running a write-intensive process.
Do the same write-heavy processes run on the other clusters, where you aren't seeing performance issues?
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           9.24   0.00     1.32    20.02    0.00  69.42

Device:  tps  kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sda      50.00  512.00  20408.00  512  20408
sdb      50.00  512.00  20408.00  512  20408
sdc      48.00  512.00  19984.00  512  19984
sdd      48.00  512.00  19984.00  512  19984
sdf      50.00  704.00  19968.00  704  19968
sdg      47.00  512.00  19968.00  512  19968
sdh      47.00  512.00  19968.00  512  19968
sde      50.00  704.00  19968.00  704  19968
sdj      48.00  512.00  19972.00  512  19972
sdi      48.00  512.00  19972.00  512  19972
sdk      48.00  512.00  19980.00  512  19980
sdl      48.00  512.00  19980.00  512  19980
md127    241.00  0.00  120280.00  0  120280
Kelly Lesperance wrote:
I did some additional testing - I stopped Kafka on the host, and kicked off a disk check, and it ran at the expected speed overnight. I started Kafka this morning, and the raid check's speed immediately dropped down to ~2000K/Sec.
I then enabled the write-back cache on the drives (hdparm -W1 /dev/sd*). The raid check is now running between 100000K/Sec and 200000K/Sec, and has been for several hours (it fluctuates, but seems to stay within that range). Write-back cache is NOT enabled for the drives on the hosts we haven't upgraded yet, but the speeds are similar (I kicked off a raid check on one of our CentOS 6 hosts as well, the window seems to be 150000 - 200000K/Sec on that host).
<snip> Perhaps I missed where you answered this: is this software RAID, or hardware? And I think you said you're upgrading existing boxes?
mark
Software RAID 10. Servers are HP DL380 Gen 8s, with 12x4 TB 7200 RPM drives.
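A quick way to confirm that layout and the health of the array (a sketch; md127 is the md device that appears in the iostat output above):

# overall state, member disks, and any running check/resync
cat /proc/mdstat

# RAID level, layout and chunk size, plus per-member status
mdadm --detail /dev/md127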
On 2016-06-01, 3:52 PM, m.roth@5-cent.us wrote:
Kelly Lesperance wrote:
I did some additional testing - I stopped Kafka on the host, and kicked off a disk check, and it ran at the expected speed overnight. I started Kafka this morning, and the raid check's speed immediately dropped down to ~2000K/Sec.
I then enabled the write-back cache on the drives (hdparm -W1 /dev/sd*). The raid check is now running between 100000K/Sec and 200000K/Sec, and has been for several hours (it fluctuates, but seems to stay within that range). Write-back cache is NOT enabled for the drives on the hosts we haven't upgraded yet, but the speeds are similar (I kicked off a raid check on one of our CentOS 6 hosts as well, the window seems to be 150000 - 200000K/Sec on that host).
<snip> Perhaps I missed where you answered this: is this software RAID, or hardware? And I think you said you're upgrading existing boxes?
mark
On 2016-06-01 20:07, Kelly Lesperance wrote:
Software RAID 10. Servers are HP DL380 Gen 8s, with 12x4 TB 7200 RPM drives.
On 2016-06-01, 3:52 PM, m.roth@5-cent.us wrote:
Kelly Lesperance wrote:
I did some additional testing - I stopped Kafka on the host, and kicked off a disk check, and it ran at the expected speed overnight. I started Kafka this morning, and the raid check's speed immediately dropped down to ~2000K/Sec.
I then enabled the write-back cache on the drives (hdparm -W1 /dev/sd*). The raid check is now running between 100000K/Sec and 200000K/Sec, and has been for several hours (it fluctuates, but seems to stay within that range). Write-back cache is NOT enabled for the drives on the hosts we haven't upgraded yet, but the speeds are similar (I kicked off a raid check on one of our CentOS 6 hosts as well, the window seems to be 150000 - 200000K/Sec on that host).
Hi Kelly,
I hope this is relevant -- you might want to try the most recent kernel in git to see if your problem is fixed.
Best regards,
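For anyone who wants to test that suggestion on CentOS 7, a minimal sketch of building an installable RPM from the mainline tree (assumes the usual kernel build dependencies such as gcc, make, bc, openssl-devel and rpm-build are already installed; paths are illustrative):

git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
cd linux
cp /boot/config-$(uname -r) .config   # start from the running distro config
make olddefconfig                     # accept defaults for any new options
make -j$(nproc) rpm-pkg               # produces kernel RPMs under ~/rpmbuild/RPMS
# install the resulting kernel RPM, reboot into it, and re-run the raid check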