Hey,
anyone using an LSI MegaRAID experienced "disappearing drives"...? We installed 6 new C6 servers, each with a Supermicro SMC2108 (LSI MegaRAID) controller and 3 PX-128M5Pro SSDs (RAID1 + hot spare). Two weeks later (and with almost no activity on them apart from the installation, since they are not in production yet), megacli sees (based on the slot numbers):
- on one server: only the 2nd disk of the RAID1 plus the ex-hot-spare brought online. NO sign of the first disk, no error message, it just "disappeared"...
- on another server: it apparently lost 1 of the 2 RAID1 disks and then the second disk of the RAID1... Now I can only see the spare brought online, so lonely...
Is it "normal" for megacli to "hide" failed disks? The controller's firmware is the latest; the SSDs' firmware is not... will try to flash them. BTW, any tips on flashing them through the RAID controller instead of having to remove them all and connect them to a plain SATA controller...? This is a bit scary...
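For reference, I am just poking at the controller with the MegaCli64 binary from LSI's package; the exact option spellings may vary between MegaCli versions:

MegaCli64 -PDList -aALL      # every physical drive the controller currently knows about
MegaCli64 -LDPDInfo -aALL    # the logical drives plus the physical drives backing them
MegaCli64 -AdpAllInfo -aALL  # adapter details, including the firmware package version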
Thx, JD
I have about 20 servers running CentOS 6 with LSI RAID controllers, all using MegaCli64, and I have not had problems. I did have a problem with a white-box build in an SM chassis, though: on occasion one node would fail to see a couple of drives on boot. In that case, a cold power-off fixed the problem. I only have two SM chassis, so my sample set is small.
Given that, though, I'd look at the SM backplane before the LSI controller itself.
On 18/07/13 11:26, John Doe wrote:
<snip>
John Doe wrote:
<snip>
You're saying that if you use megacli, it doesn't show the physical drive?
As someone else said, I'd look at the SM box: I have *nothing* but very bad experience with SM's quality control (try sending 4? 5? 6? out of 20 boxes from Penguin, a vendor that's all SM, back for repair).
mark "that doesn't count the couple or so sent back *twice*"
On 2013/07/18 05:57 AM, m.roth@5-cent.us wrote:
<snip>
We run 3 CentOS 6 + MR9286-8e systems and we've always been able to see the drives via the CLI. The only time a system didn't report drives was when one of the cards had a slightly bent metal bracket, such that screwing the bracket down lifted the back of the card a bit. Moving the box jolted that card enough that on boot the motherboard no longer saw the card at all, and of course none of the drives hanging off it either... straightening the bracket fixed that problem. We've seen this problem two or three times out of the 8 or so cards we've had (the original configuration used more cards than the current one; four cards are so much cheaper than 8! =P)
Now, we've tested the systems heavily, "killing" drives, changing things around, etc, and never had a drive vanish unless it was pulled out of the connector, so, no, megacli does not hide dead drives.
Thanks! Miranda
On 7/18/2013 8:26 AM, John Doe wrote:
We installed 6 new C6 servers, each with a Supermicro SMC2108 (LSI MegaRAID) controller and 3 PX-128M5Pro SSDs (RAID1 + hot spare).
I helped deploy a couple of petabytes of storage behind LSI MegaRAID SAS 9260-8i's, which is essentially the same card. Never had any storage disappear.
The thing that bothers me is that the controller sees all the drives at first and later does not see some of them anymore; it just "forgets" about them like they never existed. I would have expected to still see them, but in a failed state... Here, megacli just lists info for the remaining drive(s). So I miss all the "post mortem" info, like the SMART status or the error counts if they had any... Am I missing an option to add to megacli to show the failed ones too, maybe? Having used HP RAID controllers, I am used to seeing all drives, even failed ones.
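If the controller keeps an event log, I guess something along these lines should dump it (I have not tried it yet, and the option names may differ on older MegaCli builds):

MegaCli64 -AdpEventLog -GetEvents -f lsi-events.log -aALL   # write the adapter event log to lsi-events.log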
Anyway, I"ll have to check the drives, backplane and cabling...
Thx for all the answers, JD
If these drives do not have TLER, do not use them with LSI controllers.
From: Drew Weaver drew.weaver@thenap.com
If these drives do not have TLER do not use them with LSI controllers.
Not sure about TLER on those Plextors... This is what megacli says:
----------------------------------------
Enclosure Device ID: 252
Slot Number: 0
Drive's position: DiskGroup: 0, Span: 0, Arm: 0
Enclosure position: N/A
Device Id: 0
WWN: 4154412020202020
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 119.242 GB [0xee7c2b0 Sectors]
Non Coerced Size: 118.742 GB [0xed7c2b0 Sectors]
Coerced Size: 118.277 GB [0xec8e000 Sectors]
Sector Size: 0
Logical Sector Size: 0
Physical Sector Size: 0
Firmware state: Online, Spun Up
Commissioned Spare : No
Emergency Spare : No
Device Firmware Level: 1.02
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x4433221100000000
Connected Port Number: 0(path0)
Inquiry Data: P02302103634 PLEXTOR PX-128M5Pro 1.02
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Solid State Device
Drive: Not Certified
Drive Temperature : N/A
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : No
----------------------------------------
Apart from that, I found the LSI event logs...
Command timeout on PD 00(e0xfc/s0)
. . .
PD 00(e0xfc/s0) Path ... reset
Error on PD 00(e0xfc/s0)
State change on PD 00(e0xfc/s0) from ONLINE(18) to FAILED
State change on VD 00/0 from OPTIMAL(3) to DEGRADED(2)
Command timeout on PD 00(e0xfc/s0)
PD 00(e0xfc/s0) Path ... reset
State change on PD 00(e0xfc/s0) from FAILED(11) to UNCONFIGURED_BAD(1)
. . .
Exact same behavior for the 2 servers and 3 SSDs... So it seems the controller marks them first as FAILED and then as UNCONFIGURED_BAD...
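Side note: if I read the MegaCli docs right, a drive stuck in UNCONFIGURED_BAD can usually be brought back with something like this (enclosure 252 / slot 0 taken from the listing above, so double-check against your own -PDList output). I won't touch these until I understand why they failed, though:

MegaCli64 -PDMakeGood -PhysDrv[252:0] -a0   # UNCONFIGURED_BAD -> UNCONFIGURED_GOOD
MegaCli64 -PDHSP -Set -PhysDrv[252:0] -a0   # then re-add it as a hot spare so the RAID1 can rebuild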
Thx, JD
We have experienced similar behavior with (to be blunt, non-Intel) SSDs, and with spinning rust without TLER, on Dell PERC controllers (which are the same as LSI controllers): the drives simply "fall out" of the RAID arrays they are in after a random period of time.
This seems to "just happen" with certain SSDs, in the beginning we pushed very hard to try and understand why; now we just use different SSDs.
The ones we've had problems with are: OCZ Vertex, Samsung 840/840 pro, etc Ones we've never had issues with are: Intel 520, Intel S3700
I know this doesn't really help you, but you could see if using a different SSD makes the problem go away.
John Doe wrote:
From: Drew Weaver drew.weaver@thenap.com
If these drives do not have TLER do not use them with LSI controllers.
Not sure about TLER on those Plextors...
<snip> TLER would only show up on something that looks at a *very* low level on the physical drive. What I know is that you can see it with smartctl - from the man page: scterc[,READTIME,WRITETIME] - [ATA only] prints values and descriptions of the SCT Error Recovery Control settings. These are equivalent to TLER (as used by Western Digital), CCTL (as used by Samsung and Hitachi) and ERC (as used by Seagate). READ- TIME and WRITETIME arguments (deciseconds) set the specified values. Values of 0 disable the feature, other values less than 65 are probably not supported. For RAID configurations, this is typically set to 70,70 deciseconds.
Note that knowing this was the result of a *lot* of research a couple or so years ago. One *good* thing *seems* to be WD's new Red line, which is targeted toward NAS, they say... because they've put TLER back to something appropriate, like 7 sec or so, where it was 2 *minutes* for their "desktop" drives (WD disallowed changing it in firmware around '09, and the other OEMs followed suit). What makes Red good, if they work, is that they're only about one-third more than the low-cost drives, whereas the "server-grade" drives are 2-3 *times* the cost (look at the price of Seagate Constellations, for example).
mark
----
I would also like to note that until the Reds were released you had to use RE to get TLER, and now apparently RE, SE, and Red (cost in that order) all support TLER.
The thing that worries me about Red is that they're listed as only supporting up to 5 drives in an array -- how are they limiting that?
I think they probably could've just merged Red and SE into one line of drives, but I guess they limited Red to 3TB, so if you want a 4TB part you have to get the SE.
Something in the back of my mind tells me that RE, SE, and Red are the exact same hardware with different FW.