Greetings,
I've hit this exact 'bug':
https://bugzilla.redhat.com/show_bug.cgi?id=491311
I need to remove the mappings manually. I assume this is done via 'multipath -F' followed by 'multipath -v2'? Has anyone done this on a production system? We can do it during hours of low activity, but we would prefer to keep the databases on this host online at the time. The LUNs themselves have been completely removed from the host and are no longer visible on the HBA bus.
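In other words, something like the following, assuming the stale maps are no longer held open by LVM or a mounted filesystem (mpath23 here is just an example name):

multipath -ll        # list the current maps and spot the stale ones
multipath -f mpath23 # flush a single unused map by name
multipath -F         # or flush all unused maps at once
multipath -v2        # rescan and rebuild maps for the paths that remain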
Regards, Eugene Vilensky evilensky@gmail.com
Eugene Vilensky wrote:
> Greetings,
> I've hit this exact 'bug':
> https://bugzilla.redhat.com/show_bug.cgi?id=491311
> I need to remove the mappings manually. I assume this is done via 'multipath -F' followed by 'multipath -v2'? Has anyone done this on a production system? We can do it during hours of low activity, but we would prefer to keep the databases on this host online at the time. The LUNs themselves have been completely removed from the host and are no longer visible on the HBA bus.
Just wondering what sort of impact this has on your system? If the paths are gone they won't be used, so what does it matter?
I have a couple batch processes that run nightly:
- One of them takes a consistent snapshot of a standby mysql database and exports the read-write snapshot to a QA host nightly
- The other takes another consistent snapshot of a standby mysql database and exports the read-write snapshot to a backup server
In both cases the process involves removing the previous snapshots from the QA and backup servers respectively, before re-creating new snapshots and presenting them back to the original servers on the same LUN IDs. As part of the process I delete *all* device mappings for the snapshotted LUNs on the destination servers with these commands:
# remove the partition (p1) maps first, then whatever is left; dmsetup refuses anything still in use
for i in `/sbin/dmsetup ls | grep p1 | awk '{print $1}'`; do dmsetup remove $i; done
for i in `/sbin/dmsetup ls | awk '{print $1}'`; do dmsetup remove $i; done
I just restart the multipathing service after I present new LUNs to the system. Both systems have been doing this daily for about two months now and it works great.
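On my systems that restart step is nothing fancy, roughly something like this (assuming the stock CentOS 5 multipathd init script):

/etc/init.d/multipathd restart   # pick up the newly presented LUNs
multipath -v2                    # rebuild the multipath maps
multipath -ll                    # sanity-check the new maps and their paths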
In these two cases I am currently using VMware raw device mapping on the remote systems, so while I'm using multipath there is only one path (visible to the VM; the MPIO is handled by the host). Prior to that I used software iSCSI on CentOS 5.2 (no 5.3 yet) and did the same thing, because I found restarting software iSCSI on CentOS 5.2 to be unreliable (more than one kernel panic during testing). The reason I use MPIO with only one path is so that I can maintain a consistent configuration across systems: I don't need to worry about which host has one path, two, or four; I treat them all the same, since multipathing is automatic.
On CentOS 4.x with software iSCSI I didn't remove the paths, I just let them go stale. I restarted software iSCSI and multipath as part of the snapshot process (software iSCSI is more solid as far as restarting goes under 4.x; I had two panics in six months with multiple systems restarting every day). Thankfully I use LVM, because the device names changed all the time; at some point I was up to something like /dev/sddg.
But whether you're removing dead paths or restarting multipath on a system to detect new ones, I have not seen it have any noticeable impact on the system, production or not.
I think device mapper will even prevent you from removing a device that is still in use.
[root@dc1-backup01:/var/log-ng]# dmsetup ls
350002ac005ce0714       (253, 0)
350002ac005d00714       (253, 2)
350002ac005d00714p1     (253, 10)
350002ac005cf0714       (253, 1)
350002ac005d10714       (253, 4)
350002ac005ce0714p1     (253, 7)
san--p--mysql002b--db-san--p--mysql002b--db     (253, 17)
350002ac005d10714p1     (253, 9)
350002ac005d20714       (253, 3)
san--p--mysql002b--log-san--p--mysql002b--log   (253, 13)
san--p--pd1mysql001b--log-san--p--pd1mysql001b--log     (253, 14)
san--p--mysql001b--log-san--p--mysql001b--log   (253, 16)
350002ac005d30714       (253, 5)
350002ac005cf0714p1     (253, 8)
350002ac005d20714p1     (253, 6)
san--p--pd1mysql001b--db-san--p--pd1mysql001b--db       (253, 12)
san--p--mysql001b--db-san--p--mysql001b--db     (253, 15)
350002ac005d30714p1     (253, 11)
[root@dc1-backup01:/var/log-ng]# dmsetup remove 350002ac005d20714p1
device-mapper: remove ioctl failed: Device or resource busy
Command failed
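If you want to check up front whether a map is still held open, dmsetup can tell you, something like this (using the same map name from the failed remove above):

dmsetup info 350002ac005d20714p1   # an "Open count" above 0 is why the remove fails
dmsetup ls --tree                  # shows which maps are stacked on which devices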
nate
> Just wondering what sort of impact this has on your system? If the paths are gone they won't be used, so what does it matter?
Right now I have backgrounded a 'vgscan -v' operation that froze, which has never happened before. I assume it is trying to scan the /dev/mpath23 device that is backed by these four downed paths, and I am worried about what would happen if I removed the maps manually while it is in this state.
I am surprised there is not an error return of some kind between vgscan and dm-multipath if all paths for a particular mpath device are down...
Eugene Vilensky wrote:
>> Just wondering what sort of impact this has on your system? If the paths are gone they won't be used, so what does it matter?
> Right now I have backgrounded a 'vgscan -v' operation that froze, which has never happened before. I assume it is trying to scan the /dev/mpath23 device that is backed by these four downed paths, and I am worried about what would happen if I removed the maps manually while it is in this state.
> I am surprised there is not an error return of some kind between vgscan and dm-multipath if all paths for a particular mpath device are down...
Check lsof to see what it's hung on... it's been a while since I've run into that sort of issue.
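Something along these lines should show what the backgrounded vgscan is stuck on (the PID is whatever yours happens to be):

ps -eo pid,stat,wchan:30,cmd | grep vgscan   # a 'D' state means it's blocked in the kernel on I/O
lsof -p <PID>                                # which device nodes vgscan has open
cat /proc/<PID>/wchan                        # the kernel function it is sleeping in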
When I mount volumes I have a special init script that handles it. All of my SAN volumes are LVM and follow a particular naming scheme so the script can detect them easily. The script is here; perhaps the ideas behind it could be useful for you:
http://portal.aphroland.org/~aphro/mount_san.init
Sample runs:
start:
[root@dc1-mysql002b:~]# /etc/init.d/mount_san start
Scanning and activating SAN-based volume groups
  PV /dev/sdc1 is in exported VG san-p-mysql002b-log [1023.99 GB / 983.99 GB free]
  PV /dev/sdb1 is in exported VG san-p-mysql002b-db [2.00 TB / 1.90 TB free]
  Total: 2 [3.00 TB] / in use: 2 [3.00 TB] / in no VG: 0 [0 ]
  Volume group "san-p-mysql002b-log" successfully imported
  Volume group "san-p-mysql002b-db" successfully imported
Checking LVM SAN filesystems..
e2fsck 1.39 (29-May-2006)
/dev/san-p-mysql002b-log/san-p-mysql002b-log: clean, 84/10240 files, 391173/10485760 blocks
e2fsck 1.39 (29-May-2006)
/dev/san-p-mysql002b-db/san-p-mysql002b-db: clean, 353/25600 files, 4633225/26214400 blocks
Finished checking LVM SAN filesystems..
Scanning and mounting multipathed filesystems.....[/san/MrT/mysql/db]....[/san/MrT/mysql/log]..done!
stop:
[root@dc1-mysql002b:~]# /etc/init.d/mount_san stop
Scanning and un-mounting multipathed filesystems.....[/san/MrT/mysql/db]....[/san/MrT/mysql/log]..done!
Scanning and exporting SAN based volume groups..
  Volume group "san-p-mysql002b-log" successfully exported
  Volume group "san-p-mysql002b-db" successfully exported
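Condensed, the start run above boils down to roughly this (the actual script at the URL does the volume group detection and error handling; the names here are just the ones from the sample output):

vgimport san-p-mysql002b-db san-p-mysql002b-log        # import the exported VGs found on the SAN LUNs
vgchange -ay san-p-mysql002b-db san-p-mysql002b-log    # activate their logical volumes
e2fsck -p /dev/san-p-mysql002b-db/san-p-mysql002b-db   # check the filesystems before mounting
e2fsck -p /dev/san-p-mysql002b-log/san-p-mysql002b-log
mount /dev/san-p-mysql002b-db/san-p-mysql002b-db /san/MrT/mysql/db
mount /dev/san-p-mysql002b-log/san-p-mysql002b-log /san/MrT/mysql/log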
With CentOS 5.x I had to fix rc.sysinit to skip the 'noauto' file systems, otherwise the system pukes on boot (this wasn't a problem in CentOS 4.x). I originally came up with the script because I needed a way to mount software iSCSI file systems after the network was up and unmount them cleanly before the network went down. RHEL/CentOS 5 introduced better support for this (haven't tried it; my system works), but in 4.x it was an issue that caused I/O hangs on shutdown and prevented file systems from being mounted automatically on boot.
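For the curious, the fstab entries for those file systems look something like this (the options are an approximation; the point is the noauto so the boot-time mount leaves them alone):

/dev/san-p-mysql002b-db/san-p-mysql002b-db    /san/MrT/mysql/db    ext3    noauto,defaults    0 0
/dev/san-p-mysql002b-log/san-p-mysql002b-log  /san/MrT/mysql/log   ext3    noauto,defaults    0 0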
Been using this script on many systems for about a year and a half now.
I don't recall ever using or needing vgscan myself. The man page for vgimport mentions using it for previously recognized volume groups, but I've used vgimport even for first-time volume recognition and it's always worked.
nate