Hello all,
I have already had a discussion on the software RAID mailing list and want to switch to this one :)
I am having a really strange problem with my md0 device on CentOS 7. After a reboot of my server, md0 was gone. While trying to find the problem, I found the following:
Booting any installed kernel gives me NO md0 device (ls /dev/md* doesn't return anything). A 'cat /proc/partitions' shows me now /dev/sd[a-d]1 partition. partprobe and an mdadm assemble give me "disk busy".
[root@quad live]# cat mdstat
Personalities : [raid6] [raid5] [raid4] [raid10]
unused devices: <none>

[root@quad ~]# partprobe
device-mapper: remove ioctl on WDC_WD20EFRX-68AX9N0_WD-WMC301255087p1 failed: Device or resource busy
Warning: parted was unable to re-read the partition table on /dev/mapper/WDC_WD20EFRX-68AX9N0_WD-WMC301255087 (Device or resource busy). This means Linux won't know anything about the modifications you made.
....
....

[root@quad ~]# mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm: /dev/sda1 is busy - skipping
mdadm: /dev/sdb1 is busy - skipping
mdadm: /dev/sdc1 is busy - skipping
mdadm: /dev/sdd1 is busy - skipping
Booting my CentOS from a USB stick for rescue, everything works: the md0 device exists and is mounted (rw).
[root@quad usb-rescue]# cat mount | grep '/data'
/dev/mapper/data-store on /mnt/sysimage/store type xfs (rw,noatime,seclabel,attr2,largeio,nobarrier,inode64,logbufs=8,logbsize=256k,sunit=256,swidth=768,noquota)
/dev/mapper/data-tm on /mnt/sysimage/var/lib/vdr/video type xfs (rw,noatime,seclabel,attr2,largeio,nobarrier,inode64,logbufs=8,logbsize=256k,sunit=256,swidth=768,noquota)
3rd option: booting the installed rescue kernel from disk, I get an md0 device, but it is not started. When I stop md0, I can't assemble it anymore (disk busy).
/dev/md0:
        Version : 1.2
  Creation Time : Wed Aug 20 19:28:52 2014
     Raid Level : raid5
  Used Dev Size : 1953382272 (1862.89 GiB 2000.26 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Thu Aug 17 22:38:14 2017
          State : active, Not Started
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 128K

           Name : quad.core.sartori.at:0 (local to host quad.core.sartori.at)
           UUID : 9d020f27:c0542472:b95a18d2:5741114d
         Events : 25458

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
       4       8       49        3      active sync   /dev/sdd1
Does anyone have an idea in which direction the problem could lie? Are more logs needed? Please help, I have no ideas anymore.
regards Andy
On 08/18/2017 12:35 PM, Mr Typo wrote:
mdadm: /dev/sda1 is busy - skipping
mdadm: /dev/sdb1 is busy - skipping
mdadm: /dev/sdc1 is busy - skipping
mdadm: /dev/sdd1 is busy - skipping
That's plenty strange. The output of "lsblk" might tell you why those devices are busy.
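As a rough illustration of what to look for in that output, here is a minimal sketch. It assumes the list form `lsblk -rno NAME,TYPE` (raw output, no header), and it assumes that the children of a disk are printed right after the disk itself. The function name `mpath_claims` is made up for this example. It reports every disk that has a device of type "mpath" stacked on top of it.

```shell
# Reads `lsblk -rno NAME,TYPE` output on stdin and prints "<disk> <map>"
# for each disk that has a multipath map stacked on it.
# Assumption: lsblk lists a disk first and its dependents directly after it.
mpath_claims() {
    awk '$2 == "disk"  { disk = $1 }        # remember the current whole disk
         $2 == "mpath" { print disk, $1 }'  # a map sitting on that disk
}
```

It could be run as `lsblk -rno NAME,TYPE | mpath_claims`. Any line it prints points at a disk that another layer has claimed, which would be one explanation for mdadm reporting the partitions as busy.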
Hello Gordon,
Yeah, it is really strange. From one boot to the next, everything is f** up (2 months between).
any idea?
[root@quad live]# lsblk
NAME                                       MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                                            8:0  0   1.8T  0 disk
├─sda1                                         8:1  0   1.8T  0 part
└─WDC_WD20EFRX-68AX9N0_WD-WMC1T2547260       253:3  0   1.8T  0 mpath
  └─WDC_WD20EFRX-68AX9N0_WD-WMC1T2547260p1   253:8  0   1.8T  0 part
sdb                                           8:16  0   1.8T  0 disk
├─sdb1                                        8:17  0   1.8T  0 part
└─WDC_WD20EFRX-68AX9N0_WD-WMC301255087       253:4  0   1.8T  0 mpath
  └─WDC_WD20EFRX-68AX9N0_WD-WMC301255087p1   253:9  0   1.8T  0 part
sdc                                           8:32  0   1.8T  0 disk
├─sdc1                                        8:33  0   1.8T  0 part
└─WDC_WD20EFRX-68EUZN0_WD-WCC4M2668622       253:5  0   1.8T  0 mpath
  └─WDC_WD20EFRX-68EUZN0_WD-WCC4M2668622p1   253:7  0   1.8T  0 part
sdd                                           8:48  0   1.8T  0 disk
├─sdd1                                        8:49  0   1.8T  0 part
└─WDC_WD20EFRX-68EUZN0_WD-WMC4M2878723       253:2  0   1.8T  0 mpath
  └─WDC_WD20EFRX-68EUZN0_WD-WMC4M2878723p1   253:6  0   1.8T  0 part
sde                                           8:64  0 119.2G  0 disk
├─sde1                                        8:65  0   500M  0 part  /boot
└─sde2                                        8:66  0 118.8G  0 part
  ├─centos-swap                              253:0  0     2G  0 lvm   [SWAP]
  ├─centos-root                              253:1  0    50G  0 lvm   /
  └─centos-home                             253:10  0  66.8G  0 lvm   /home
CentOS mailing list CentOS@centos.org https://lists.centos.org/mailman/listinfo/centos
On 08/19/2017 12:06 PM, Mr Typo wrote:
sda                                            8:0  0   1.8T  0 disk
├─sda1                                         8:1  0   1.8T  0 part
└─WDC_WD20EFRX-68AX9N0_WD-WMC1T2547260       253:3  0   1.8T  0 mpath
  └─WDC_WD20EFRX-68AX9N0_WD-WMC1T2547260p1   253:8  0   1.8T  0 part
You haven't said anything about multipath hardware yet, and you've been referring to "sda1", etc, which makes me think that you probably don't have multipath hardware.
If that's true, then the problem is probably that someone installed the multipath software on this system after the last time it booted successfully. One fix could be to boot from install media and use rescue mode to get a shell. Inside the rescue environment, remove the multipath software.
If you did have multipath hardware, you'd be assembling the multipath targets, like WDC_WD20EFRX-68AX9N0_WD-WMC1T2547260p1, rather than sda1.
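For a box without multipath hardware, a less drastic first step than removing the package might look like the sketch below. It is only a sketch: the `run` wrapper and the function name `release_md_members` are invented for this example, and by default nothing is executed, the commands are just printed. It assumes the service is called multipathd and that `multipath -F` (flush all unused multipath maps) is enough to release the disks.

```shell
#!/bin/sh
# Hypothetical helper: with DRY_RUN=1 (the default here) commands are only
# printed; set DRY_RUN=0 and run as root to actually execute them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Stop multipathd, keep it from starting at boot, flush the unused
# multipath maps, then let mdadm try to assemble the arrays again.
release_md_members() {
    run systemctl stop multipathd
    run systemctl disable multipathd
    run multipath -F
    run mdadm --assemble --scan
}
```

Calling `release_md_members` prints the four commands for review; `DRY_RUN=0 release_md_members` would run them. If the maps come back on the next boot, removing the package as suggested above (`yum remove device-mapper-multipath` on CentOS 7) is the more thorough fix.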
Hello Gordon,
Thank you for the tip. I had an eye on multipathd during my debugging, but I ignored it because it has been installed for years now (and stopping the service still gave me the device busy stuff). I assume that another rpm enabled the multipath service and that it was fiddling around.
Disabling the multipath stuff helped.
Thank you again for your hint!
got my kudos .)
regards Andreas
18. Aug 2017 13:35 by euroregistrar@gmail.com:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>snip
Are you definitely using cables rated for SATA III? Have you checked the power connections? Have you checked the power supply voltages during spin-up and later?
Is there tension or major twisting forces on the SATA cables? I've seen this cause intermittent problems, and it was solved by using a longer cable that reduced the stress at the connector.
Are the drives getting hot (your model shouldn't have a heat issue under normal conditions)? Are the drives bolted into a system? Drives can be sensitive to vibration, and identical, unmounted drives will tend to shake each other and can produce rotational torque as well (especially when they are the same model, as they'll all have the same resonances in that case). Either can cause problems with keeping the heads over the track reliably.
I'd definitely run all the SMART tests. Start with the conveyance test and then the short self-test, and possibly the long test. Do check the drive temperatures immediately after each test to make sure they aren't getting too hot.
I assume you've done an fsck on the file systems? If not it might be good to check.
Are you using the motherboard's SATA interfaces or an add-on card? If using a card, I'd check the firmware version on the card and what the manufacturer is offering for updates.
Are the drives still under warranty? If so, try WD tech support. Also check that all the RAID tools are properly installed with their dependencies met. It could be other hardware/drivers interfering. You might reset the BIOS to "optimized settings". Which software RAID package are you using?
Other than that I'd possibly suspect a software problem; I'm not familiar with software RAIDs myself (haven't used one, but know what they are). Or possibly a problem with the drive that is intermittent or complex in how it fails.
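The SMART test sequence suggested above can be sketched with smartctl from smartmontools. This is an illustration only: /dev/sda is a placeholder, the helper name `drive_temp` is made up, and it assumes the drive reports attribute 194 as Temperature_Celsius in the usual ten-column attribute table. Not every drive supports the conveyance test.

```shell
# Reads `smartctl -A` output on stdin and prints the raw value of the
# Temperature_Celsius attribute (column 10 of the attribute table).
drive_temp() {
    awk '$2 == "Temperature_Celsius" { print $10; exit }'
}

# Possible sequence, run as root, one test at a time (not executed here):
#   smartctl -t conveyance /dev/sda    # may be unsupported on some drives
#   smartctl -t short /dev/sda
#   smartctl -t long /dev/sda
#   smartctl -l selftest /dev/sda      # results of finished self-tests
#   smartctl -A /dev/sda | drive_temp  # temperature after each test
```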
Are you definitely using cables rated for SATA III? Have you checked the power connections? Have you checked the power supply voltages during spin-up and later?
Yeah. The setup has been running for years now. As I said: booting from the USB stick -> everything works.
Is there tension or major twisting forces on the SATA cables? I've seen this cause intermittent problems, and it was solved by using a longer cable that reduced the stress at the connector.
Nope. Checked, and it works.
Are the drives getting hot (your model shouldn't have a heat issue under normal conditions)? Are the drives bolted into a system? Drives can be sensitive to vibration, and identical, unmounted drives will tend to shake each other and can produce rotational torque as well (especially when they are the same model, as they'll all have the same resonances in that case). Either can cause problems with keeping the heads over the track reliably.
Nope. The issue first arose after the box had been down for several hours. The box is in my cellar, so in a good environment.
I'd definitely run all the SMART tests. Start with the conveyance test and then the short self-test, and possibly the long test. Do check the drive temperatures immediately after each test to make sure they aren't getting too hot.
Output of the test after my reply.
I assume you've done an fsck on the file systems? If not it might be good to check.
No, I did not. I am running XFS, and the filesystem is not corrupt, so no repair is needed. I can access the data when booting from USB.
Are you using the motherboard's SATA interfaces or an add-on card? If using a card, I'd check the firmware version on the card and what the manufacturer is offering for updates.
Motherboard SATA. HP MicroServer Gen8.
Are the drives still under warranty? If so, try WD tech support. Also check that all the RAID tools are properly installed with their dependencies met. It could be other hardware/drivers interfering. You might reset the BIOS to "optimized settings". Which software RAID package are you using?
mdadm has no dependencies, but I reinstalled the package. Version 3.4-14.el7_3.1.
Other than that I'd possibly suspect a software problem; I'm not familiar with software RAIDs myself (haven't used one, but know what they are). Or possibly a problem with the drive that is intermittent or complex in how it fails.
Software problem sounds great. I would like to find out why it's not working. I could reinstall the complete box, but that is not my intention: it takes lots of time and I am not learning something new :)
regards Andy