I was just trying to be helpful.
*backs away slowly*
Cameron
On Fri, Jan 20, 2017 at 5:16 PM, Valeri Galtsev galtsev@kicp.uchicago.edu wrote:
On Fri, January 20, 2017 7:00 pm, Cameron Smith wrote:
Hi Valeri,
Before you pull a drive you should check to make sure that doing so won't kill the whole array.
Wow! What did I say to make you treat me as an ultimate idiot!? ;-) All my comments, at least in my own reading, were about the things you need to do to make sure that when you hot-unplug a bad drive, it is indeed the failed drive you are replacing.
Valeri
MegaCli can help you prevent a storage disaster and can give you more insight into your RAID and the status of the virtual disks and the disks that make up each array.
MegaCli will let you see the health and status of each drive. Does it have media errors, is it in predictive failure mode, what firmware version does it have, etc. MegaCli will also let you see the status of the enclosure, the adapter and the virtual disks (logical disks).
Before you pull a drive it's a good idea to properly prepare it for removal after confirming that it's OK to remove it.
Here are a few commands:
OFFLINE A DISK
  MegaCli -PDOffline -PhysDrv[32:0] -a0
MARK A DISK AS MISSING
  MegaCli -pdmarkmissing -physdrv[32:0] -a0
MARK A DISK AS PREPARED FOR REMOVAL
  MegaCli -pdprprmv -physdrv[32:0] -a0
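The three steps above could be wrapped in a small shell function. This is only a sketch: `prepare_removal` and the `DRY_RUN` variable are made-up names (not part of MegaCli), and 32:0 and adapter 0 are just the example values from the commands above. By default it only prints the commands so they can be reviewed before anything touches the controller.

```shell
# Hypothetical helper: offline a drive, mark it missing, prepare it for removal.
# $1 = enclosure:slot (e.g. 32:0), $2 = adapter index (e.g. 0).
# DRY_RUN defaults to 1, which only prints the commands.
prepare_removal() {
    local drive="$1"
    local adapter="$2"
    local step
    for step in -PDOffline -pdmarkmissing -pdprprmv; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "MegaCli $step -PhysDrv[$drive] -a$adapter"
        else
            # Stop at the first step the controller rejects.
            MegaCli "$step" "-PhysDrv[$drive]" "-a$adapter" || return 1
        fi
    done
}
```

Running `prepare_removal 32:0 0` prints the three commands in order; setting `DRY_RUN=0` would actually run them, so confirm the drive address first.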
Here are some easy overview commands that I run when first looking at the storage on a system:
  MegaCli -AdpAllInfo -aAll |grep -A 8 "Device Present"
  MegaCli -PDList -aALL |grep "Firmware state"
  MegaCli -PDList -aALL |grep "Media Error Count"
  MegaCli -PDList -aALL |grep "Predictive Failure Count"
  MegaCli -PDList -aALL |grep "Inquiry Data"
  MegaCli -PDList -aALL |grep "Device Firmware Level"
  MegaCli -PDList -aALL |grep "Drive has flagged"
  MegaCli -PDList -aALL |grep Temperature
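Since seven of those commands query the same `-PDList` output, they can be folded into a single pass. A minimal sketch, where `pd_summary` is a made-up function name and the patterns are exactly the strings grepped for above:

```shell
# Hypothetical helper: read `MegaCli -PDList -aALL` output on stdin and keep
# only the fields from the overview commands, so the controller is queried once.
pd_summary() {
    grep -E 'Firmware state|Media Error Count|Predictive Failure Count|Inquiry Data|Device Firmware Level|Drive has flagged|Temperature'
}

# Usage on a host with MegaCli installed:
#   MegaCli -PDList -aALL | pd_summary
```

This keeps each drive's fields together in the order the controller reports them, rather than grouping by field as the separate greps do.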
I also call MegaCli from bash scripts that run from cron.hourly on my older Dell 11th-generation servers; they check the health status of my arrays and email me if there is an issue.
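A check of that kind might look roughly like the sketch below. It is not the actual script: `check_pd_health` is a made-up name, the parsing assumes `-PDList` prints `Slot Number: N` before each drive's `Media Error Count: N` and `Predictive Failure Count: N` lines, and the mail recipient `root` is a placeholder.

```shell
# Hypothetical helper: read PDList-style text on stdin and print one line per
# non-zero Media Error Count or Predictive Failure Count.
# Exit status: 0 if every counter is zero, 1 if anything was flagged.
check_pd_health() {
    awk -F': *' '
        /^Slot Number/ { slot = $2 }
        /^(Media Error Count|Predictive Failure Count)/ && $2 + 0 > 0 {
            printf "slot %s: %s = %s\n", slot, $1, $2
            bad = 1
        }
        END { exit bad + 0 }
    '
}

# Possible use from a cron.hourly script (mail only when something is flagged):
#   report="$(MegaCli -PDList -aALL | check_pd_health)" ||
#       printf '%s\n' "$report" | mail -s "RAID warning on $(hostname)" root
```

It only looks at the two error counters; a fuller check would probably also look at the firmware state and the virtual disk status.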
Cameron Smith Technical Operations Manager Network Redux, LLC Cell: 503-926-4928
On Fri, Jan 20, 2017 at 3:38 PM, Valeri Galtsev galtsev@kicp.uchicago.edu wrote:
On Fri, January 20, 2017 5:16 pm, Joseph L. Casale wrote:
This is why, before configuring and installing everything, you may want to attach drives one at a time, and upon boot take a note of which physical drive number the controller has for that drive, and definitely label it so you will know which drive to pull when a drive failure is reported.
Sorry Valeri, that only works if you're the only guy in the org.
Well, this is true, I'm only one sysadmin working for two departments here...
In reality, you cannot and should not rely on this given how easily it can change and more than likely someone won't update it.
Would you walk up to a production unit in a degraded state and simply pull out a drive and risk a production issue? I wouldn't...
I routinely do: I just hot-remove the failed drive from running production systems and replace it with a good drive (note what I said about my job above, though). None of our users ever notices. When I do it, the only chance I am usually taking is of making a degraded RAID6 (with one drive failed) even more degraded, so that it is no longer fault tolerant, though still online with all its data. But even that chance is slim, given that I take all precautions when I initially set up the box.
You need to assert the position of the drive and prepare it in the array controller for removal, then swap, scan, add to virtual disk then initiate rebuild.
Hm, not certain what process you describe. Most of my controllers are 3ware and LSI. I just pull the failed drive (and I know the failed physical drive number), put a good one in its place, and the rebuild starts right away. I have a couple of Areca ones (I love them too!); I don't remember if I have to manually initiate the rebuild. (I'm lucky in using good drives - very careful in choosing good ones ;-).
Not to mention if it's a busy system, confirm that the IO load from the rebuild is not having an impact on the application. You may need to lower the rate.
Indeed, in the 3ware configuration there is a choice of several grades of rebuild vs IO; I usually choose slower rebuild - faster IO. If I have only one drive failing on me during a year in a given rack, there is almost zero chance of a second drive failing for quite some time (we had a heated discussion about it once and I still stand by my opinion that drive failures are independent events). So, my degraded RAID-6 can keep running and even still stay redundant ("single redundant", akin to RAID-5) for the period of the rebuild, even if that takes quite long.
Valeri
CentOS mailing list CentOS@centos.org https://lists.centos.org/mailman/listinfo/centos
++++++++++++++++++++++++++++++++++++++++
Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247
++++++++++++++++++++++++++++++++++++++++