Are there any ways to improve/manage the speed of pvmove? The man page doesn't show any documented switches for priority scheduling. iostat shows the system way underutilized even though the LV whose PEs are being migrated is continuously, if slowly, being written to.
Thanks! jlc
Sorry 'bout that previous one. Wrong key combo hit!
On Tue, 2008-02-12 at 19:57 -0700, Joseph L. Casale wrote:
Are there any ways to improve/manage the speed of pvmove?
Not that I am aware of. Keep in mind that a *lot* of work is being done.
You could "nice" it. "man nice". Since there is likely to be a lot of I/O happening, it may not help much.
The man page doesn't show any documented switches for priority scheduling. iostat shows the system way underutilized even though the LV whose PEs are being migrated is continuously, if slowly, being written to.
If the drives are on the same channel, or other devices on the channel are also flooding the channel, that would be expected. Does "swapon -s" show a lot of swap being used? Does top give a clue? I suspect a lot of CPU may also be involved.
Thanks! jlc
<snip sig stuff>
You could "nice" it. "man nice". Since there is likely to be a lot of I/O happening, it may not help much.
Ok, here's a noob question :) - What process would I nice?
If the drives are on the same channel, or other devices on the channel are also flooding the channel, that would be expected. Does "swapon -s" show a lot of swap being used? Does top give a clue? I suspect a lot of CPU may also be involved.
swapon -s shows 0 swap in use; top shows the CPUs at under 1%.
Thanks! jlc
On Tue, 2008-02-12 at 20:41 -0700, Joseph L. Casale wrote:
You could "nice" it. "man nice". Since there is likely to be a lot of I/O happening, it may not help much.
Ok, here's a noob question :) - What process would I nice?
If you run pvmove from the command line, "nice -n -20 pvmove" for example (a negative niceness raises the priority and needs root). If you start lvm and run pvmove inside that, then "nice -n -20 lvm".
But based on the 1% CPU usage, it probably won't help much.
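If the move is already running, something like this might be worth a try (the PID is made up, and this uses the classic "renice <priority> -p <pid>" form; negative priorities need root):

# pidof pvmove
12345
# renice -20 -p 12345
  (raises the pvmove process to the highest CPU priority)

Again, with the CPU nearly idle I doubt it buys much; the wait looks like it is in the disks.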
If the drives are on the same channel, or other devices on the channel are also flooding the channel, that would be expected. Does "swapon -s" show a lot of swap being used? Does top give a clue? I suspect a lot of CPU may also be involved.
swapon -s shows 0 swap in use; top shows the CPUs at under 1%.
My guess then is that the writes to the HD are just large and slow. If the two HDs are on the same channel, that would make it even slower. If the drives are older/slower models, ditto. If they have a small on-board cache, same thing.
But I really have a hunch that it is just a lot of I/O wait time due to either metadata maintenance and checkpointing and/or I/O failures, which have very long timeouts before failure is recognized and *then* alternate block assignment and mapping is done.
Thanks! jlc
<snip sig stuff>
But I really have a hunch that it is just a lot of I/O wait time due to either metadata maintenance and checkpointing and/or I/O failures, which have very long timeouts before failure is recognized and *then* alternate block assignment and mapping is done.
One of the original arrays just needs to be rebuilt with more members; there are no errors, but I believe you are right about simple I/O wait time.
Going from sdd to sde:
# iostat -d -m -x
Linux 2.6.18-53.1.6.el5 (host)    02/12/2008

Device:   rrqm/s  wrqm/s    r/s    w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz   await   svctm  %util
sdd         0.74    0.00   1.52  42.72    0.11    1.75     86.41      0.50   11.40    5.75  25.43
sde         0.00    0.82   0.28   1.04    0.00    0.11    177.52      0.13   98.71   53.55   7.09

Not very impressive :) Two different SATA II based arrays on an LSI controller, and 5% complete in ~7 hours == a week to complete! I ran this command from an ssh session on my workstation (that was clearly a dumb move). Given what I have gleaned from reading about the robustness of the pvmove command, if the session bails, how much time am I likely to lose by restarting? Are the checkpoints frequent?
Thanks! jlc
On Tue, 2008-02-12 at 22:24 -0700, Joseph L. Casale wrote:
But I really have a hunch that it is just a lot of I/O wait time due to either metadata maintenance and checkpointing and/or I/O failures, which have very long timeouts before failure is recognized and *then* alternate block assignment and mapping is done.
One of the original arrays just needs to be rebuilt with more members; there are no errors, but I believe you are right about simple I/O wait time.
Going from sdd to sde:
# iostat -d -m -x
Linux 2.6.18-53.1.6.el5 (host)    02/12/2008

Device:   rrqm/s  wrqm/s    r/s    w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz   await   svctm  %util
sdd         0.74    0.00   1.52  42.72    0.11    1.75     86.41      0.50   11.40    5.75  25.43
sde         0.00    0.82   0.28   1.04    0.00    0.11    177.52      0.13   98.71   53.55   7.09

Not very impressive :) Two different SATA II based arrays on an LSI controller, and 5% complete in ~7 hours == a week to complete! I ran this command from an ssh session on my workstation (that was clearly a dumb move). Given what I have gleaned from reading about the robustness of the pvmove command, if the session bails, how much time am I likely to lose by restarting? Are the checkpoints frequent?
Beyond my ken on the checkpoint frequency. Never had to use them. I'm in a situation where I can start 'em up and walk away. My best thought is to read the description of it in the man page and make a best-guess about letting it run or not.
Sorry I can't offer more, but I'd be spewing FUD if I tried!
I suggest that with an estimated 1 week completion, you can't lose much by killing it and restarting. Other checkpoints I've used in the past have *very* low overhead and easily justify their use.
I would anticipate this to be the same. IIRC from the man page description, it is essentially just marking completed portions and updating metadata to reflect the new status. With such a straightforward process, restart should be almost instantaneous with very low loss of time.
Again, this is all supposition as I don't know the code.
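One data point from the docs, for whatever it is worth (I have not timed a restart myself): the pvmove man page says an interrupted move can be picked up again just by running it with no arguments, and there is an --abort switch to cancel one.

# pvmove
  (resumes any move left unfinished after an interruption or reboot)
# pvmove --abort
  (abandons an in-progress move)

So a killed session should cost little more than the chunk that was in flight.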
Thanks! jlc
<snip sig stuff>
On 13/02/2008 05:24, Joseph L. Casale wrote:
But I really have a hunch that it is just a lot of I/O wait time due to either metadata maintenance and checkpointing and/or I/O failures, which have very long timeouts before failure is recognized and *then* alternate block assignment and mapping is done.
One of the original arrays just needs to be rebuilt with more members; there are no errors, but I believe you are right about simple I/O wait time.
Going from sdd to sde:
# iostat -d -m -x
Linux 2.6.18-53.1.6.el5 (host)    02/12/2008

Device:   rrqm/s  wrqm/s    r/s    w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz   await   svctm  %util
sdd         0.74    0.00   1.52  42.72    0.11    1.75     86.41      0.50   11.40    5.75  25.43
sde         0.00    0.82   0.28   1.04    0.00    0.11    177.52      0.13   98.71   53.55   7.09

Not very impressive :) Two different SATA II based arrays on an LSI controller, and 5% complete in ~7 hours == a week to complete! I ran this command from an ssh session on my workstation (that was clearly a dumb move). Given what I have gleaned from reading about the robustness of the pvmove command, if the session bails, how much time am I likely to lose by restarting? Are the checkpoints frequent?
Thanks! jlc
Running iostat like this will give you utilisation statistics since boot, which will not be indicative of what's happening now. If you give it a reporting interval, say 10 seconds (iostat -m -x 10), I am guessing you will see very different data (likely high r/s, w/s, await, and derived values).
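For example (the interval and count are arbitrary):

# iostat -d -m -x 10 3

The first report is still the since-boot average; the second and third cover each 10-second window, which is the part worth looking at.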
Running iostat like this will give you utilisation statistics since boot, which will not be indicative of what's happening now. If you give it a reporting interval, say 10 seconds (iostat -m -x 10), I am guessing you will see very different data (likely high r/s, w/s, await, and derived values).
Thanks for all the pointers guys! jlc
on 2/12/2008 9:24 PM Joseph L. Casale spake the following:
But I really have a hunch that it is just a lot of I/O wait time due to either metadata maintenance and checkpointing and/or I/O failures, which have very long timeouts before failure is recognized and *then* alternate block assignment and mapping is done.
One of the original arrays just needs to be rebuilt with more members; there are no errors, but I believe you are right about simple I/O wait time.
Going from sdd to sde:
# iostat -d -m -x
Linux 2.6.18-53.1.6.el5 (host)    02/12/2008

Device:   rrqm/s  wrqm/s    r/s    w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz   await   svctm  %util
sdd         0.74    0.00   1.52  42.72    0.11    1.75     86.41      0.50   11.40    5.75  25.43
sde         0.00    0.82   0.28   1.04    0.00    0.11    177.52      0.13   98.71   53.55   7.09

Not very impressive :) Two different SATA II based arrays on an LSI controller, and 5% complete in ~7 hours == a week to complete! I ran this command from an ssh session on my workstation (that was clearly a dumb move). Given what I have gleaned from reading about the robustness of the pvmove command, if the session bails, how much time am I likely to lose by restarting? Are the checkpoints frequent?
Thanks! jlc
I know it is too late for this one, but I usually run long running remote commands in a screen session just in case I lose the session.
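A minimal example, using the sdd-to-sde move from earlier in the thread (the session name is arbitrary):

# screen -S pvmove
# pvmove /dev/sdd /dev/sde
  (detach with Ctrl-a d; the move keeps running if the ssh connection drops)
# screen -r pvmove
  (reattach later to check on progress)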
On Wed, Feb 13, 2008 at 11:18:10AM -0700, Joseph L. Casale enlightened us:
I know it is too late for this one, but I usually run long running remote commands in a screen session just in case I lose the session.
What provides 'screen' in CentOS? Also, is there a resource for finding out what yum packages provide when searching for a util?
Funny, your choice of language.
[hyclak@euclid ~]$ yum provides screen
Loading "priorities" plugin
Searching Packages:
Setting up repositories
Reading repository metadata in from local files

screen.i386                     4.0.2-5            base
Matched from:
screen
I believe that answers both of your questions.
Matt
Funny, your choice of language.
/me wiping frantic look off face
Hilarious... But you had me going for a moment; I thought I had slipped and spoken the way I would when asking a buddy. I can't tell you how many times I've needed that; I always searched the net until I came up with someone else's post that included the info...
Thanks! jlc
On Tue, 2008-02-12 at 19:57 -0700, Joseph L. Casale wrote:
<snip>
iostat shows the system way underutilized even though the LV whose PEs are being migrated is continuously, if slowly, being written to.
I finally thought about that last line. Makes sense, because metadata tracking must be done as various pieces are moved and a checkpoint is written (note the comment in the man page about being able to restart without providing any parameters). And that is the drive that is failing too! It may be a lot of write failures followed by alternate block assignments going on at the hardware level. Just a SWAG (Scientific Wild-Assed Guess).
Thanks! jlc
<snip sig stuff>
Joseph L. Casale wrote:
Are there any ways to improve/manage the speed of pvmove? The man page doesn't show any documented switches for priority scheduling. iostat shows the system way underutilized even though the LV whose PEs are being migrated is continuously, if slowly, being written to.
I don't believe pvmove actually does any of the lifting; it merely creates a mirrored PV area in device-mapper and then hangs around monitoring its progress. Once the mirror is synced up, it throws a couple of barriers and removes the original PV from the mirror, leaving the new PV as the new location for the data.
That is how the move continues through reboots. All the lifting is actually done in device-mapper and its state is preserved there. On restart, LVM will read its metadata to determine if there is a pvmove in progress and then spawn a pvmove to wait for it to complete so it can remove the mirror.
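As a side note, one way to watch that mirror syncing from another shell, assuming stock LVM2/device-mapper tools (the field list is just a convenient subset):

# lvs -a -o name,copy_percent,devices
  (the hidden pvmove0 volume shows the percentage synced)
# dmsetup status
  (the underlying mirror target reports sectors synced)

pvmove itself also takes -i <seconds> when started, to print its own progress at that interval.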
Any slowness is due to disk I/O errors and retries being thrown around.
You should really run LVM on top of a RAID 1 (software or hardware makes no difference), but LVM is more for storage management than for fault tolerance and redundancy.
-Ross
I don't believe pvmove actually does any of the lifting; it merely creates a mirrored PV area in device-mapper and then hangs around monitoring its progress. Once the mirror is synced up, it throws a couple of barriers and removes the original PV from the mirror, leaving the new PV as the new location for the data.
That is how the move continues through reboots. All the lifting is actually done in device-mapper and its state is preserved there. On restart, LVM will read its metadata to determine if there is a pvmove in progress and then spawn a pvmove to wait for it to complete so it can remove the mirror.
Any slowness is due to disk I/O errors and retries being thrown around.
You should really run LVM on top of a RAID 1 (software or hardware makes no difference), but LVM is more for storage management than for fault tolerance and redundancy.
-Ross
The LDs provided to LVM through the RAID controller are all fault tolerant...
Good info, Thanks! jlc
Joseph L. Casale wrote:
I don't believe pvmove actually does any of the lifting; it merely creates a mirrored PV area in device-mapper and then hangs around monitoring its progress. Once the mirror is synced up, it throws a couple of barriers and removes the original PV from the mirror, leaving the new PV as the new location for the data.
That is how the move continues through reboots. All the lifting is actually done in device-mapper and its state is preserved there. On restart, LVM will read its metadata to determine if there is a pvmove in progress and then spawn a pvmove to wait for it to complete so it can remove the mirror.
Any slowness is due to disk I/O errors and retries being thrown around.
You should really run LVM on top of a RAID 1 (software or hardware makes no difference), but LVM is more for storage management than for fault tolerance and redundancy.
-Ross
The LDs provided to LVM through the RAID controller are all fault tolerant...
If the PVs are fault tolerant, then I don't know why pvmove would be running so slowly; there should be no I/O errors being thrown, as a bad drive would be marked as faulty and taken offline.
What are you pvmoving again?
-Ross
What are you pvmoving again?
-Ross
Ok, here is what happened: I have a box running iet, exporting an LV that started out as two 750 gig HDs mirrored off an 8-channel LSI SAS controller. I needed more space, and added three 400 gig HDs in a RAID 5 VD to this VG. Yes, I now need even more space, but I only have 8 channels, so... I'm moving it all over to seven 750s in a RAID 5, either with a hot spare, or maybe eight 750s in a RAID 6; I don't know yet.
All VDs on the controller are optimal and nothing is degraded, but I need to move all this data off the darn thing to free up the original LD so I can break it and recreate it.
jlc
Joseph L. Casale wrote:
What are you pvmoving again?
-Ross
Ok, here is what happened: I have a box running iet, exporting an LV that started out as two 750 gig HDs mirrored off an 8-channel LSI SAS controller. I needed more space, and added three 400 gig HDs in a RAID 5 VD to this VG. Yes, I now need even more space, but I only have 8 channels, so... I'm moving it all over to seven 750s in a RAID 5, either with a hot spare, or maybe eight 750s in a RAID 6; I don't know yet.
Don't know? Where are you pvmoving everything now?
It would be a whole lot easier to get the new array fully set up, initialized and tested, then add it as a new PV to the existing VG and do the pvmove once, rather than pvmoving it twice.
If you put the new array on a newer higher end controller and leave the existing setup as it is and pvmove between them things would move a lot faster.
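Roughly, that single-move sequence would look like this (a sketch only; "vg0" and the device names are placeholders for the real VG and arrays):

# pvcreate /dev/sdf
  (the new array, once it is built and tested)
# vgextend vg0 /dev/sdf
  (add it to the existing VG)
# pvmove /dev/sdd /dev/sdf
  (migrate the extents off the old PV; repeat for each PV being retired)
# vgreduce vg0 /dev/sdd
  (drop the old PV from the VG once it is empty)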
All VDs on the controller are optimal and nothing is degraded, but I need to move all this data off the darn thing to free up the original LD so I can break it and recreate it.
Is that array on a different controller?
Is that array fully initialized?
Does the controller have a BBU write-back cache?
Maybe I am missing some important parts of the picture here?
-Ross
Don't know? Where are you pvmoving everything now?
Where do I begin... The scenario is "no cash to do it right", so the interim step involves a temporary migration to a non-fault-tolerant setup. The server is a 1U HP and I don't have another controller that matches the remaining interface in that small server.
If I continue to explain all that I have to do, you'll likely not be impressed. Sigh, I can only do what I can!
Regardless, your help has been valuable! jlc
Joseph L. Casale
Don't know? Where are you pvmoving everything now?
Where do I begin... The scenario is "no cash to do it right", so the interim step involves a temporary migration to a non-fault-tolerant setup. The server is a 1U HP and I don't have another controller that matches the remaining interface in that small server.
Ah, well you are using SAS drives, so there is some cash there...
You need to learn how to shake the money maker; it's the only way we can get our jobs done these days. Tell management that there is no more room to get projects X or Y done because they need to invest in upgrading storage, or if it's for fault tolerance, tell them what the worst-case scenario will be. That usually gets them to find that extra $$ to make it happen.
What industry do you work in?
If I continue to explain all that I have to do, you'll likely not be impressed. Sigh, I can only do what I can!
That's not true! I'm unimpressed now ;-)
-Ross
Ah, well you are using SAS drives, so there is some cash there...
My bad, SAS controller with SATA II drives :(
What industry do you work in?
All sorts, odd company: We do everything from automotive accessories to home building!
That's not true! I'm unimpressed now ;-)
-Ross
Love your honesty! <vbg> jlc
Joseph L. Casale wrote:
Ah, well you are using SAS drives, so there is some cash there...
My bad, SAS controller with SATA II drives :(
What industry do you work in?
All sorts, odd company: We do everything from automotive accessories to home building!
That's not true! I'm unimpressed now ;-)
-Ross
Love your honesty!
Since you're moving the data over to a new server/array combo, have you thought about using LTO tapes to back it up and restore it on the new server?
I know it isn't as sexy as LVM pv duplication and such, but it works...
If the LTO drives are too expensive, why not just rent one for this activity? You need to buy the tapes, but that's not too much expense.
-Ross
Since you're moving the data over to a new server/array combo, have you thought about using LTO tapes to back it up and restore it on the new server?
I know it isn't as sexy as LVM pv duplication and such, but it works...
We have an HP autoloader; I thought of doing that actually, and I think I might :) I'll let it run through the weekend and make a decision on Monday. The autoloader is hooked up to a Windows box running the scourge of my life (Backup Exec 9 for Windows), and I didn't know how to interface it easily with the data without installing an agent on the client running the initiator, which I thought would be just as painfully slow! The LV is exported through iet and is formatted NTFS.
Suggestions welcome :)
Jlc
Ps. My solution ain't so sexy; it involves a non-fault-tolerant interim period, so I am not pleased, to say the least!
Joseph L. Casale wrote:
Since you're moving the data over to a new server/array combo, have you thought about using LTO tapes to back it up and restore it on the new server?
I know it isn't as sexy as LVM pv duplication and such, but it works...
We have an HP autoloader; I thought of doing that actually, and I think I might :) I'll let it run through the weekend and make a decision on Monday. The autoloader is hooked up to a Windows box running the scourge of my life (Backup Exec 9 for Windows), and I didn't know how to interface it easily with the data without installing an agent on the client running the initiator, which I thought would be just as painfully slow! The LV is exported through iet and is formatted NTFS.
Suggestions welcome :)
Well I suppose you have nightly backups of the data set already?
Maybe just abort the pvmove, let the Friday full backup run, then on Saturday do a full restore on the new server over iSCSI and bring it online that way.
I am facing the same issue with a migration of our VMs to a new iSCSI setup this year; around 1 TB of VMs needs to be forklifted over. I thought about exotic ways to move it, but I think in the end it will be good ole Backup Exec and tape.
Hey! Or maybe just use robocopy from one iSCSI volume to the other on the Windows side!
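Something like this, as a rough sketch (the drive letters are placeholders for the two mounted iSCSI volumes):

robocopy X:\ Y:\ /MIR /R:1 /W:1 /NP

/MIR mirrors the whole tree, /R and /W keep it from stalling forever on a locked file, and /NP keeps the per-file progress noise out of the log.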
-Ross
I am facing the same issue with a migration of our VMs to a new iSCSI setup this year; around 1 TB of VMs needs to be forklifted over. I thought about exotic ways to move it, but I think in the end it will be good ole Backup Exec and tape.
You're not running ESX, are you? Heh, I just did the same thing on a much smaller scale. I couldn't afford the long downtime while a copy took place, so I shut the VMs off, snapshotted them, and restarted them. I then scripted all files "without" 00000 in the name to rsync over (ssssslowly). After that I only had to shut each VM off, sync the small snapshots, and restart the VMs on the other storage. It only took a few minutes.
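For anyone wanting to copy the trick, roughly (the paths and the exclude pattern are illustrative; the 00000 pattern matches the snapshot delta files described above):

# rsync -av --exclude='*00000*' /old-storage/vms/ /new-storage/vms/
  (first pass while the VMs run on their snapshots; copies only the big base disks)
# rsync -av /old-storage/vms/ /new-storage/vms/
  (second pass after the VMs are shut down; picks up the small deltas)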
jlc