Hello. We've started a virtualisation project and have run into a problem. Our current setup is:
- Intel 2312WPQJR servers as compute nodes
- Intel R2312GL4GS as the storage server, with a dual-port Intel InfiniBand controller
- Mellanox SwitchX IS5023 InfiniBand switch for the interconnect
The nodes run CentOS 6.5 with the built-in InfiniBand stack (Linux v0002 2.6.32-431.el6.x86_64); the storage server runs CentOS 6.4, also with the built-in drivers (Linux stor1.colocat.ru 2.6.32-279.el6.x86_64).
On the storage server an array is set up which appears in the system as /storage/s01; it is exported via NFS. The nodes mount it with:

  /bin/mount -t nfs -o rdma,port=20049,rw,hard,timeo=600,retrans=5,async,nfsvers=3,intr 192.168.1.1:/storage/s01 /home/storage/sata/01

and mount shows:

  192.168.1.1:/storage/s01 on /home/storage/sata/01 type nfs (rw,rdma,port=20049,hard,timeo=600,retrans=5,nfsvers=3,intr,addr=192.168.1.1)
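For context, a minimal sketch of what the corresponding server-side NFS/RDMA setup typically looks like on a CentOS 6 storage box with the in-kernel server; the export path matches the one above, but the client subnet, export options and exact commands are assumptions, not taken from this setup:

  # /etc/exports on the storage server (options illustrative)
  /storage/s01  192.168.1.0/24(rw,async,no_root_squash)

  # register the NFS/RDMA listener on port 20049 once nfsd is running
  modprobe svcrdma
  echo "rdma 20049" > /proc/fs/nfsd/portlist
  exportfs -ra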
Then we create virtual machines with virsh, using the virtio disk bus. Everything is fine until we run a Windows guest on KVM. It may work for 2 hours or 2 days, but under heavy load it hangs the mount (i.e. /sata/02 and /sata/03 stay accessible, but any access to 01 hangs the console completely). The only way out is a hardware reset of the node. If we mount without rdma, everything is fine.
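For reference, a quick way to confirm how a guest's disk is actually attached (the domain name vm01 below is only a placeholder, not one of our machines):

  # show the <disk> section of the libvirt definition
  virsh dumpxml vm01 | grep -A6 '<disk'
  # expect bus='virtio' on <target>, and note the cache= attribute on <driver>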
What can we do in this case? If any debug output or other info is needed, please ask.
Best regards, Nikolay.
I'm not an expert on this, but here is what I would try:
1. Are you using the latest virtio drivers on your Windows guest(s)? http://alt.fedoraproject.org/pub/alt/virtio-win/latest/images/
2. Is there any particular reason you use CentOS 6.4 on your storage server? I would update it to CentOS 6.5 so you would have matching Linux kernels on both of your systems.
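(Roughly, assuming the stock CentOS repositories; 2.6.32-431.el6 is the kernel the nodes already run:)

  yum clean all
  yum update     # brings the 6.4 storage server up to CentOS 6.5
  reboot         # boot into the updated kernel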
Zoltan
Thanks for the answer.
> I'm not an expert on this, but here is what I would try:
>
> - Are you using the latest virtio drivers on your Windows guest(s)?
>   http://alt.fedoraproject.org/pub/alt/virtio-win/latest/images/

Yes, that is the version we are using.

> - Is there any particular reason you use CentOS 6.4 on your storage
>   server? I would update it to CentOS 6.5 so you would have matching
>   Linux kernels on both of your systems.

OK, we will schedule maintenance and try that, but when we had 6.4 everywhere the behaviour was the same.
Try looking at the logs on the storage server for more information. What kind of errors are you getting?
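For example, something along these lines on the storage server while the hang is being reproduced (stock CentOS 6 log location; the filter terms are only a suggestion):

  tail -f /var/log/messages | grep -i -E 'nfs|rdma|rpc'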
Sometimes there are messages like:

  Feb 17 04:11:28 stor1 rpc.idmapd[3116]: nss_getpwnam: name '0' does not map into domain 'localdomain'

and nothing more. We have tailed the logs on both the storage server and the node - nothing. With debugging enabled we got around 10 GB of messages, but nothing in them points to the problem :(
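(For anyone reproducing this: one way to keep that debug output manageable is to enable only the transport-related flags with rpcdebug instead of full debugging; the flag selection below is a suggestion, not what was actually run.)

  # on the node, before reproducing the hang
  rpcdebug -m rpc -s xprt trans    # RPC transport-level debugging only
  rpcdebug -m nfs -s proc
  # ...reproduce the hang, collect /var/log/messages...
  rpcdebug -m rpc -c               # clear the flags again
  rpcdebug -m nfs -c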
The problem seems to be on the NFS server side. Try the links below for tuning the NFS server for better performance (a small example of one such knob follows the links):
1. http://www.tldp.org/HOWTO/NFS-HOWTO/performance.html
2. http://www.techrepublic.com/blog/linux-and-open-source/tuning-nfs-for-better...
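One concrete knob from those guides, as it applies to CentOS 6 (the thread count of 64 is only an example, not a recommendation for this particular setup):

  # /etc/sysconfig/nfs
  RPCNFSDCOUNT=64

  # or, on a running server, change it on the fly:
  rpc.nfsd 64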
--
Regards
Ashishkumar S. Yadav
Thanks, that was done, but we will check it again.
Done. The problem still exists, but only with Windows. A much heavier load from the Linux/FreeBSD VMs doesn't cause anything.
OK, new info here. Tuning is done, and we got the following on a node running a Windows 2008 Server guest. The nodes with *nix guests are still working fine. There is nothing in the logs on the storage server; on the node:
  Mar 20 09:42:22 v0004 kernel: rpcrdma: connection to 192.168.1.1:20049 closed (-103)
  Mar 20 09:42:42 v0004 kernel: rpcrdma: connection to 192.168.1.1:20049 on mlx4_0, memreg 5 slots 32 ird 16
  Mar 20 09:42:49 v0004 kernel: ------------[ cut here ]------------
  Mar 20 09:42:49 v0004 kernel: WARNING: at kernel/softirq.c:159 local_bh_enable_ip+0x7d/0xb0() (Not tainted)
  Mar 20 09:42:49 v0004 kernel: Hardware name: S2600WP
  Mar 20 09:42:49 v0004 kernel: Modules linked in: act_police cls_u32 sch_ingress cls_fw sch_sfq sch_htb ebt_arp ebt_ip ebtable_nat ebtables xprtrdma nfs lockd fscache auth_rpcgss nfs_acl sunrpc bridge stp llc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 openvswitch(U) vhost_net macvtap macvlan tun kvm_intel kvm iTCO_wdt iTCO_vendor_support sr_mod cdrom sb_edac edac_core lpc_ich mfd_core igb i2c_algo_bit ptp pps_core sg i2c_i801 i2c_core ioatdma dca mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core ext4 jbd2 mbcache usb_storage sd_mod crc_t10dif ahci isci libsas scsi_transport_sas wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
  Mar 20 09:42:49 v0004 kernel: Pid: 0, comm: swapper Not tainted 2.6.32-431.5.1.el6.x86_64 #1
  Mar 20 09:42:49 v0004 kernel: Call Trace:
  Mar 20 09:42:49 v0004 kernel: <IRQ>  [<ffffffff81071e27>] ? warn_slowpath_common+0x87/0xc0
  Mar 20 09:42:49 v0004 kernel: [<ffffffff81071e7a>] ? warn_slowpath_null+0x1a/0x20
  Mar 20 09:42:49 v0004 kernel: [<ffffffff8107a3ed>] ? local_bh_enable_ip+0x7d/0xb0
  Mar 20 09:42:49 v0004 kernel: [<ffffffff8152a7fb>] ? _spin_unlock_bh+0x1b/0x20
  Mar 20 09:42:49 v0004 kernel: [<ffffffffa04554f0>] ? rpc_wake_up_status+0x70/0x80 [sunrpc]
  Mar 20 09:42:49 v0004 kernel: [<ffffffffa044e79c>] ? xprt_wake_pending_tasks+0x2c/0x30 [sunrpc]
  Mar 20 09:42:49 v0004 kernel: [<ffffffffa05322fc>] ? rpcrdma_conn_func+0x9c/0xb0 [xprtrdma]
  Mar 20 09:42:49 v0004 kernel: [<ffffffffa0535450>] ? rpcrdma_qp_async_error_upcall+0x40/0x80 [xprtrdma]
  Mar 20 09:42:49 v0004 kernel: [<ffffffffa01c11cb>] ? mlx4_ib_qp_event+0x8b/0x100 [mlx4_ib]
  Mar 20 09:42:49 v0004 kernel: [<ffffffffa0166c54>] ? mlx4_qp_event+0x74/0xf0 [mlx4_core]
  Mar 20 09:42:49 v0004 kernel: [<ffffffffa0154057>] ? mlx4_eq_int+0x557/0xcb0 [mlx4_core]
  Mar 20 09:42:49 v0004 kernel: [<ffffffffa0455396>] ? rpc_wake_up_task_queue_locked+0x186/0x270 [sunrpc]
  Mar 20 09:42:49 v0004 kernel: [<ffffffffa01547c4>] ? mlx4_msi_x_interrupt+0x14/0x20 [mlx4_core]
  Mar 20 09:42:49 v0004 kernel: [<ffffffff810e6eb0>] ? handle_IRQ_event+0x60/0x170
  Mar 20 09:42:49 v0004 kernel: [<ffffffff810e980e>] ? handle_edge_irq+0xde/0x180
  Mar 20 09:42:49 v0004 kernel: [<ffffffffa0153362>] ? mlx4_cq_completion+0x42/0x90 [mlx4_core]
  Mar 20 09:42:49 v0004 kernel: [<ffffffff8100faf9>] ? handle_irq+0x49/0xa0
  Mar 20 09:42:49 v0004 kernel: [<ffffffff815312ec>] ? do_IRQ+0x6c/0xf0
  Mar 20 09:42:49 v0004 kernel: [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
  Mar 20 09:42:49 v0004 kernel: [<ffffffff8107a893>] ? __do_softirq+0x73/0x1e0
  Mar 20 09:42:49 v0004 kernel: [<ffffffff810e6eb0>] ? handle_IRQ_event+0x60/0x170
  Mar 20 09:42:49 v0004 kernel: [<ffffffff8100c30c>] ? call_softirq+0x1c/0x30
  Mar 20 09:42:49 v0004 kernel: [<ffffffff8100fa75>] ? do_softirq+0x65/0xa0
  Mar 20 09:42:49 v0004 kernel: [<ffffffff8107a795>] ? irq_exit+0x85/0x90
  Mar 20 09:42:49 v0004 kernel: [<ffffffff815312f5>] ? do_IRQ+0x75/0xf0
  Mar 20 09:42:49 v0004 kernel: [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
  Mar 20 09:42:49 v0004 kernel: <EOI>  [<ffffffff812e09ae>] ? intel_idle+0xde/0x170
  Mar 20 09:42:49 v0004 kernel: [<ffffffff812e0991>] ? intel_idle+0xc1/0x170
  Mar 20 09:42:49 v0004 kernel: [<ffffffff814268f7>] ? cpuidle_idle_call+0xa7/0x140
  Mar 20 09:42:49 v0004 kernel: [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
  Mar 20 09:42:49 v0004 kernel: [<ffffffff8150cf1a>] ? rest_init+0x7a/0x80
  Mar 20 09:42:49 v0004 kernel: [<ffffffff81c26f8f>] ? start_kernel+0x424/0x430
  Mar 20 09:42:49 v0004 kernel: [<ffffffff81c2633a>] ? x86_64_start_reservations+0x125/0x129
  Mar 20 09:42:49 v0004 kernel: [<ffffffff81c26453>] ? x86_64_start_kernel+0x115/0x124
  Mar 20 09:42:49 v0004 kernel: ---[ end trace ddc1b92aa1d57ab7 ]---
  Mar 20 09:42:49 v0004 kernel: rpcrdma: connection to 192.168.1.1:20049 closed (-103)
  Mar 20 09:43:19 v0004 kernel: rpcrdma: connection to 192.168.1.1:20049 on mlx4_0, memreg 5 slots 32 ird 16
and so on.
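(For anyone correlating those rpcrdma lines: the "memreg 5 slots 32 ird 16" values come from the client-side RPC/RDMA transport parameters. Assuming this kernel exposes the usual xprtrdma sysctls, they can be inspected as shown below; the values in the comments are the expected defaults, not recommendations to change anything.)

  sysctl sunrpc.rdma_memreg_strategy       # 5 = FRMR memory registration
  sysctl sunrpc.rdma_slot_table_entries    # 32 outstanding RPC slots
  sysctl sunrpc.rdma_max_inline_read sunrpc.rdma_max_inline_write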