I am trying to copy about 7 TB of data with rsync between two servers in the same data center. The backend storage is an EMC VMAX3.
After about 30-40 GB of data has been copied, multipath paths start failing:
Dec 15 01:57:53 test.example.com multipathd: 360000970000196801239533037303434: Recovered to normal mode
Dec 15 01:57:53 test.example.com multipathd: 360000970000196801239533037303434: remaining active paths: 1
Dec 15 01:57:53 test.example.com kernel: sd 1:0:2:20: [sdeu] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
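To see whether the DID_ERROR results cluster on a few devices or are spread across many, the kernel lines can be tallied per sd device. This is only a sketch: the function name `count_did_errors` is made up for illustration, and it assumes the log lines have the same shape as the kernel line quoted above.

```shell
# Tally "hostbyte=DID_ERROR" results per sd device from syslog text on stdin.
# count_did_errors is a hypothetical helper name, not an existing tool.
count_did_errors() {
    grep 'hostbyte=DID_ERROR' |
        sed -n 's/.*\[\(sd[a-z]*\)\].*/\1/p' |   # keep only the [sdXX] device name
        sort | uniq -c | sort -rn                # count per device, busiest first
}
```

It could be run as `count_did_errors < /var/log/messages`, assuming syslog goes to the CentOS 6 default location.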
[root@test log]# multipath -ll |grep -i fail
|- 1:0:0:15 sdq 65:0 failed ready running
- 3:0:0:15 sdai 66:32 failed ready running
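When comparing failures across runs, it helps to reduce the `multipath -ll` output to just the SCSI address (host:channel:target:LUN) and device name of each failed path. A minimal sketch, assuming the path lines look like the two shown above; `failed_paths` is a hypothetical helper name.

```shell
# Print "H:C:T:L device" for every failed path in `multipath -ll` output on stdin.
# failed_paths is a hypothetical helper name, not part of multipath-tools.
failed_paths() {
    awk '/failed/ {
        # The tree-drawing prefix varies, so search for the H:C:T:L field
        # instead of relying on a fixed column position.
        for (i = 1; i < NF; i++)
            if ($i ~ /^[0-9]+:[0-9]+:[0-9]+:[0-9]+$/) { print $i, $(i + 1); break }
    }'
}
```

Usage would be `multipath -ll | failed_paths`; saving the output after each failed copy makes it easy to diff which targets were affected.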
We are using the default multipath.conf.
HBA driver version: 8.07.00.26.06.8-k
HBA model: QLogic Corp. ISP8324-based 16Gb Fibre Channel to PCI Express Adapter
OS: CentOS 64-bit, kernel 2.6.32-642.6.2.el6.x86_64
Hardware: Intel/HP ProLiant DL380 Gen9
I have already gone through this solution and checked with EMC; everything looks good: https://access.redhat.com/solutions/438403
Some more info:
- There are no dropped or errored packets on the network side.
- The filesystem is mounted with noatime,nodiratime.
- The filesystem is ext4 (already tried xfs, same error).
- LVM is in striped mode (started with the linear option, then converted to striped).
- THP is already disabled:
  echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
- Whenever multipath starts failing, the process goes into D state.
- System firmware has been upgraded.
- Tried the latest version of the QLogic driver.
- Tried different schedulers (noop, deadline, cfq).
- Tried a different tuned profile (enterprise-storage).
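Because several of these settings were changed during testing, it can be worth confirming what is actually in effect on the running system. The sysfs files for THP and the I/O scheduler both mark the active value with square brackets, so one small parser covers both. This is a sketch only; `active_choice` and `list_schedulers` are hypothetical helper names.

```shell
# Print the active value from a sysfs file of the form "a [b] c".
active_choice() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$1"
}

# List the active I/O scheduler of every sd* device under a sysfs block root.
# The root defaults to /sys/block; passing another directory is only for testing.
list_schedulers() {
    local root="${1:-/sys/block}" f dev
    for f in "$root"/sd*/queue/scheduler; do
        [ -r "$f" ] || continue
        dev="${f#"$root"/}"
        printf '%s %s\n' "${dev%%/*}" "$(active_choice "$f")"
    done
}
```

For example, `active_choice /sys/kernel/mm/redhat_transparent_hugepage/enabled` should print `never` if THP is disabled, and `list_schedulers` shows whether every path device really uses the scheduler under test.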
I was able to collect a vmcore at the time of the issue:
      KERNEL: /usr/lib/debug/lib/modules/2.6.32-642.6.2.el6.x86_64/vmlinux
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 36
        DATE: Fri Dec 16 00:11:26 2016
      UPTIME: 01:48:57
LOAD AVERAGE: 0.41, 0.49, 0.60
       TASKS: 1238
    NODENAME: test.example.com
     RELEASE: 2.6.32-642.6.2.el6.x86_64
     VERSION: #1 SMP Wed Oct 26 06:52:09 UTC 2016
     MACHINE: x86_64  (2297 Mhz)
      MEMORY: 511.9 GB
       PANIC: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000018"
         PID: 15840
     COMMAND: "kjournald"
        TASK: ffff884023446ab0  [THREAD_INFO: ffff88103def4000]
         CPU: 2
       STATE: TASK_RUNNING (PANIC)
-- View this message in context: http://centos.1050465.n5.nabble.com/Result-hostbyte-DID-ERROR-driverbyte-DRI... Sent from the CentOS mailing list archive at Nabble.com.
On Jan 3, 2017, at 2:59 PM, lakhera2017 plakhera@salesforce.com wrote:
|- 1:0:0:15 sdq 65:0 failed ready running
- 3:0:0:15 sdai 66:32 failed ready running
Does the same SAN target fail each time?
What brand/model/firmware SAN switch is between initiator and target?
Does the HBA show any SCSI aborts?
Hi Steven,
Please find my answers inline.
On Wed, Jan 4, 2017 at 5:48 PM, Steven Tardy-2 [via CentOS] <ml-node+s1050465n5746476h5@n5.nabble.com> wrote:
Does the same SAN target fail each time?
No, it is a different target every time.
What brand/model/firmware SAN switch is between initiator and target?
Cisco MDS 9710
NX-OS Version 6.2.15
8 Gb SFP end-to-end connectivity
VMAX3 Enginuity Build Version : 5977.813.785
Does the HBA show any SCSI aborts?
Reply from EMC:
*ENG can see the ab3e/cc3e error logs on a write of 0x180 blocks that spans tracks from head B to head C.*
*The first 0x100 blocks transferred okay.*
*But when we send receiver ready for the remaining 80 blocks, the host sends an ABTS, so we need to find out why the host is aborting the write.*
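The block counts in EMC's reply are hexadecimal, so the part left over after the first 0x100 blocks is 0x180 - 0x100 = 0x80 blocks. A quick conversion shows the sizes involved. Note that the 512-byte block size below is an assumption; the reply does not state it.

```shell
# Convert the block counts quoted by EMC into sizes.
# BLOCK_BYTES=512 is an assumption; the thread does not give the block size.
BLOCK_BYTES=512

blocks_to_kib() {
    echo $(( $1 * BLOCK_BYTES / 1024 ))
}

total=$(( 0x180 ))           # whole write
first=$(( 0x100 ))           # part that transferred okay
rest=$(( total - first ))    # part for which the host sent the abort

printf 'total:     0x%x blocks = %d KiB\n' "$total" "$(blocks_to_kib "$total")"
printf 'first:     0x%x blocks = %d KiB\n' "$first" "$(blocks_to_kib "$first")"
printf 'remaining: 0x%x blocks = %d KiB\n' "$rest"  "$(blocks_to_kib "$rest")"
```

Under that assumption the write is 192 KiB in total, split into 128 KiB that completed and 64 KiB that was aborted.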
CentOS mailing list
https://lists.centos.org/mailman/listinfo/centos