[CentOS] LSI MegaRAID experience...

Drew Weaver drew.weaver at thenap.com
Fri Jul 19 13:41:05 UTC 2013


Not sure about TLER on those Plextors...
This is what megacli says:
----------------------------------------
Enclosure Device ID: 252
Slot Number: 0
Drive's position: DiskGroup: 0, Span: 0, Arm: 0
Enclosure position: N/A
Device Id: 0
WWN: 4154412020202020
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 119.242 GB [0xee7c2b0 Sectors]
Non Coerced Size: 118.742 GB [0xed7c2b0 Sectors]
Coerced Size: 118.277 GB [0xec8e000 Sectors]
Sector Size:  0
Logical Sector Size:  0
Physical Sector Size:  0
Firmware state: Online, Spun Up
Commissioned Spare : No
Emergency Spare : No
Device Firmware Level: 1.02
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x4433221100000000
Connected Port Number: 0(path0)
Inquiry Data: P02302103634        PLEXTOR PX-128M5Pro                     1.02
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Solid State Device
Drive:  Not Certified
Drive Temperature : N/A
PI Eligibility:  No
Drive is formatted for PI information:  No
PI: No PI
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : No
----------------------------------------
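As a side note, TLER on a SATA drive corresponds to SCT Error Recovery Control, which can usually be queried through the MegaRAID pass-through with smartmontools. A rough sketch, assuming MegaCli64 and smartctl are installed, the controller is adapter 0, and the drive is device id 0 behind /dev/sda (adjust these for your system):

```shell
# Dump per-drive details (the kind of output shown above) for adapter 0.
MegaCli64 -PDList -a0

# Query SCT Error Recovery Control (the TLER mechanism) on the drive
# with device id 0, via the MegaRAID pass-through. A drive without
# TLER will report "SCT Error Recovery Control command not supported"
# or show it disabled.
smartctl -d megaraid,0 -l scterc /dev/sda
```

If the drive supports it, `smartctl -d megaraid,0 -l scterc,70,70 /dev/sda` would set a 7-second recovery timeout, but many consumer SSDs simply don't implement SCT ERC at all.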


Apart from that, I found the LSI event logs...
  Command timeout on PD 00(e0xfc/s0)
  . . .
  PD 00(e0xfc/s0) Path ... reset
  Error on PD 00(e0xfc/s0)
  State change on PD 00(e0xfc/s0) from ONLINE(18) to FAILED
  State change on VD 00/0 from OPTIMAL(3) to DEGRADED(2)
  Command timeout on PD 00(e0xfc/s0)
  PD 00(e0xfc/s0) Path ... reset
  State change on PD 00(e0xfc/s0) from FAILED(11) to UNCONFIGURED_BAD(1)
  . . .

Exact same behavior for the 2 servers and 3 SSDs...
So it seems the controller first marks them FAILED and then UNCONFIGURED_BAD...
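For anyone who wants to dig through the same events, here is a sketch of pulling the controller event log with MegaCli and filtering for the interesting lines; the adapter number and output file name are assumptions:

```shell
# Dump the full controller event log for adapter 0 into a text file.
MegaCli64 -AdpEventLog -GetEvents -f events.txt -a0

# Pick out the timeouts, resets and state transitions like those above.
grep -E 'Command timeout|reset|State change' events.txt
```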
---------------------------
We have experienced similar behavior with (to be blunt, non-Intel) SSDs, and with spinning rust lacking TLER, on Dell PERC controllers (which are rebranded LSI controllers): the drives simply "fall out" of the RAID arrays they are in after a random period of time.

This seems to "just happen" with certain SSDs. In the beginning we pushed very hard to understand why; now we just use different SSDs.

The ones we've had problems with are: OCZ Vertex, Samsung 840/840 Pro, etc.
Ones we've never had issues with are: Intel 520, Intel S3700

I know this doesn't really help you, but you could see if using a different SSD makes the problem go away.


