Are there any ways to improve/manage the speed of pvmove? The man page doesn't show any documented switches for priority scheduling. iostat shows the system way underutilized even though the LV whose PEs are being migrated is continuously, if slowly, being written to.
Thanks! jlc
Sorry 'bout that previous one. Wrong key combo hit!
On Tue, 2008-02-12 at 19:57 -0700, Joseph L. Casale wrote:
Are there any ways to improve/manage the speed of pvmove?
Not that I am aware of. Keep in mind that a *lot* of work is being done.
You could "nice" it. "man nice". Since there is likely to be a lot of I/O happening, it may not help much.
The man page doesn't show any documented switches for priority scheduling. iostat shows the system way underutilized even though the LV whose PEs are being migrated is continuously, if slowly, being written to.
If the drives are on the same channel, or other devices on the channel are also flooding the channel, that would be expected. Does "swapon -s" show a lot of swap being used? Does top give a clue? I suspect a lot of CPU may also be involved.
Thanks! jlc
<snip sig stuff>
You could "nice" it. "man nice". Since there is likely to be a lot of I/O happening, it may not help much.
Ok, here's a noob question :) - What process would I nice?
If the drives are on the same channel, or other devices on the channel are also flooding the channel, that would be expected. Does "swapon -s" show a lot of swap being used? Does top give a clue? I suspect a lot of CPU may also be involved.
swapon -s shows 0 swap in use; top shows the CPUs at under 1%.
Thanks! jlc
On Tue, 2008-02-12 at 20:41 -0700, Joseph L. Casale wrote:
You could "nice" it. "man nice". Since there is likely to be a lot of I/O happening, it may not help much.
Ok, here's a noob question :) - What process would I nice?
If you run pvmove from the command line, "nice -n -20 pvmove" for example (a negative niceness raises the priority and needs root). If you start lvm and run pvmove inside that, then "nice -n -20 lvm".
But based on the 1% CPU usage, it probably won't help much.
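If the move is already running, something like this might be worth a try (the PID is made up, and this uses the classic "renice <priority> -p <pid>" form; negative priorities need root):

# pidof pvmove
12345
# renice -20 -p 12345
  (raises the pvmove process to the highest CPU priority)

Again, with the CPU nearly idle I doubt it buys much; the wait looks like it is in the disks.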
If the drives are on the same channel, or other devices on the channel are also flooding the channel, that would be expected. Does "swapon -s" show a lot of swap being used? Does top give a clue? I suspect a lot of CPU may also be involved.
swapon -s shows 0 swap in use; top shows the CPUs at under 1%.
My guess then is that the writes to the HD are just large and slow. If the two HDs are on the same channel, that would make it even slower. If the drives are older/slower models, ditto. If they have a small on-board cache, same thing.
But I really have a hunch that it is just a lot of I/O wait time due to either metadata maintenance and checkpointing and/or I/O failures, which have very long timeouts before failure is recognized and *then* alternate block assignment and mapping is done.
Thanks! jlc
<snip sig stuff>
But I really have a hunch that it is just a lot of I/O wait time due to either metadata maintenance and checkpointing and/or I/O failures, which have very long timeouts before failure is recognized and *then* alternate block assignment and mapping is done.
One of the original arrays just needs to be rebuilt with more members; there are no errors, but I believe you are right about simple I/O wait time.
Going from sdd to sde:
# iostat -d -m -x
Linux 2.6.18-53.1.6.el5 (host)    02/12/2008

Device:   rrqm/s  wrqm/s    r/s    w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz   await   svctm  %util
sdd         0.74    0.00   1.52  42.72    0.11    1.75     86.41      0.50   11.40    5.75  25.43
sde         0.00    0.82   0.28   1.04    0.00    0.11    177.52      0.13   98.71   53.55   7.09

Not very impressive :) Two different SATA II based arrays on an LSI controller, and 5% complete in ~7 hours == a week to complete! I ran this command from an ssh session on my workstation (that was clearly a dumb move). Given what I have gleaned from reading about the robustness of the pvmove command, if the session bails, how much time am I likely to lose by restarting? Are the checkpoints frequent?
Thanks! jlc
On Tue, 2008-02-12 at 22:24 -0700, Joseph L. Casale wrote:
But I really have a hunch that it is just a lot of I/O wait time due to either metadata maintenance and checkpointing and/or I/O failures, which have very long timeouts before failure is recognized and *then* alternate block assignment and mapping is done.
One of the original arrays just needs to be rebuilt with more members; there are no errors, but I believe you are right about simple I/O wait time.
Going from sdd to sde:
# iostat -d -m -x
Linux 2.6.18-53.1.6.el5 (host)    02/12/2008

Device:   rrqm/s  wrqm/s    r/s    w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz   await   svctm  %util
sdd         0.74    0.00   1.52  42.72    0.11    1.75     86.41      0.50   11.40    5.75  25.43
sde         0.00    0.82   0.28   1.04    0.00    0.11    177.52      0.13   98.71   53.55   7.09

Not very impressive :) Two different SATA II based arrays on an LSI controller, and 5% complete in ~7 hours == a week to complete! I ran this command from an ssh session on my workstation (that was clearly a dumb move). Given what I have gleaned from reading about the robustness of the pvmove command, if the session bails, how much time am I likely to lose by restarting? Are the checkpoints frequent?
Beyond my ken on the checkpoint frequency. Never had to use them. I'm in a situation where I can start 'em up and walk away. My best thought is to read the description of it in the man page and make a best-guess about letting it run or not.
Sorry I can't offer more, but I'd be spewing FUD if I tried!
I suggest that with an estimated 1 week completion, you can't lose much by killing it and restarting. Other checkpoints I've used in the past have *very* low overhead and easily justify their use.
I would anticipate this to be the same. IIRC from the man page description, it is essentially just marking completed portions and updating metadata to reflect the new status. With such a straightforward process, restart should be almost instantaneous with very low loss of time.
Again, this is all supposition as I don't know the code.
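One data point from the docs, for whatever it is worth (I have not timed a restart myself): the pvmove man page says an interrupted move can be picked up again just by running it with no arguments, and there is an --abort switch to cancel one.

# pvmove
  (resumes any move left unfinished after an interruption or reboot)
# pvmove --abort
  (abandons an in-progress move)

So a killed session should cost little more than the chunk that was in flight.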
Thanks! jlc
<snip sig stuff>
On 13/02/2008 05:24, Joseph L. Casale wrote:
But I really have a hunch that it is just a lot of I/O wait time due to either metadata maintenance and checkpointing and/or I/O failures, which have very long timeouts before failure is recognized and *then* alternate block assignment and mapping is done.
One of the original arrays just needs to be rebuilt with more members; there are no errors, but I believe you are right about simple I/O wait time.
Going from sdd to sde:
# iostat -d -m -x
Linux 2.6.18-53.1.6.el5 (host)    02/12/2008

Device:   rrqm/s  wrqm/s    r/s    w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz   await   svctm  %util
sdd         0.74    0.00   1.52  42.72    0.11    1.75     86.41      0.50   11.40    5.75  25.43
sde         0.00    0.82   0.28   1.04    0.00    0.11    177.52      0.13   98.71   53.55   7.09

Not very impressive :) Two different SATA II based arrays on an LSI controller, and 5% complete in ~7 hours == a week to complete! I ran this command from an ssh session on my workstation (that was clearly a dumb move). Given what I have gleaned from reading about the robustness of the pvmove command, if the session bails, how much time am I likely to lose by restarting? Are the checkpoints frequent?
Thanks! jlc
Running iostat like this will give you utilisation statistics since boot, which will not be indicative of what's happening now. If you give it a reporting interval, say 10 seconds (iostat -m -x 10), I am guessing you will see very different data (likely high r/s, w/s, await, and derived values).
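For example (the interval and count are arbitrary):

# iostat -d -m -x 10 3

The first report is still the since-boot average; the second and third cover each 10-second window, which is the part worth looking at.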
Running iostat like this will give you utilisation statistics since boot, which will not be indicative of what's happening now. If you give it a reporting interval, say 10 seconds (iostat -m -x 10), I am guessing you will see very different data (likely high r/s, w/s, await, and derived values).
Thanks for all the pointers guys! jlc
on 2/12/2008 9:24 PM Joseph L. Casale spake the following:
But I really have a hunch that it is just a lot of I/O wait time due to either metadata maintenance and checkpointing and/or I/O failures, which have very long timeouts before failure is recognized and *then* alternate block assignment and mapping is done.
One of the original arrays just needs to be rebuilt with more members; there are no errors, but I believe you are right about simple I/O wait time.
Going from sdd to sde:
# iostat -d -m -x
Linux 2.6.18-53.1.6.el5 (host)    02/12/2008

Device:   rrqm/s  wrqm/s    r/s    w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz   await   svctm  %util
sdd         0.74    0.00   1.52  42.72    0.11    1.75     86.41      0.50   11.40    5.75  25.43
sde         0.00    0.82   0.28   1.04    0.00    0.11    177.52      0.13   98.71   53.55   7.09

Not very impressive :) Two different SATA II based arrays on an LSI controller, and 5% complete in ~7 hours == a week to complete! I ran this command from an ssh session on my workstation (that was clearly a dumb move). Given what I have gleaned from reading about the robustness of the pvmove command, if the session bails, how much time am I likely to lose by restarting? Are the checkpoints frequent?
Thanks! jlc
I know it is too late for this one, but I usually run long running remote commands in a screen session just in case I lose the session.
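A minimal example, using the sdd-to-sde move from earlier in the thread (the session name is arbitrary):

# screen -S pvmove
# pvmove /dev/sdd /dev/sde
  (detach with Ctrl-a d; the move keeps running if the ssh connection drops)
# screen -r pvmove
  (reattach later to check on progress)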
On Wed, Feb 13, 2008 at 11:18:10AM -0700, Joseph L. Casale enlightened us:
I know it is too late for this one, but I usually run long running remote commands in a screen session just in case I lose the session.
What provides 'screen' in CentOS? Also, is there a resource for finding out what yum packages provide when searching for a util?
Funny, your choice of language.
[hyclak@euclid ~]$ yum provides screen
Loading "priorities" plugin
Searching Packages:
Setting up repositories
Reading repository metadata in from local files

screen.i386                     4.0.2-5            base
Matched from:
screen
I believe that answers both of your questions.
Matt
Funny, your choice of language.
/me wiping frantic look off face
Hilarious... But you had me going for a moment; I thought I had slipped and spoken the way I would when asking a buddy. I can't tell you how many times I've needed that; I always searched the net until I came up with someone else's post that included the info...
Thanks! jlc
On Tue, 2008-02-12 at 19:57 -0700, Joseph L. Casale wrote:
<snip>
iostat shows the system way underutilized even though the LV whose PEs are being migrated is continuously, if slowly, being written to.
I finally thought about that last line. Makes sense, because metadata tracking must be done as various pieces are moved and a checkpoint is written (note the comment in the man page about being able to restart without providing any parameters). And that is the drive that is failing too! It may be a lot of write failures followed by alternate block assignments going on at the hardware level. Just a SWAG (Scientific Wild-Assed Guess).
Thanks! jlc
<snip sig stuff>
Joseph L. Casale wrote:
Are there any ways to improve/manage the speed of pvmove? The man page doesn't show any documented switches for priority scheduling. iostat shows the system way underutilized even though the LV whose PEs are being migrated is continuously, if slowly, being written to.
I don't believe pvmove actually does any of the lifting; it merely creates a mirrored PV area in device-mapper and then hangs around monitoring its progress. Once the mirror is synced up, it throws a couple of barriers and removes the original PV from the mirror, leaving the new PV as the new location for the data.
That is how the move continues through reboots. All the lifting is actually done in device-mapper and its state is preserved there. On restart, LVM will read its metadata to determine if there is a pvmove in progress and then spawn a pvmove to wait for it to complete so it can remove the mirror.
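As a side note, one way to watch that mirror syncing from another shell, assuming stock LVM2/device-mapper tools (the field list is just a convenient subset):

# lvs -a -o name,copy_percent,devices
  (the hidden pvmove0 volume shows the percentage synced)
# dmsetup status
  (the underlying mirror target reports sectors synced)

pvmove itself also takes -i <seconds> when started, to print its own progress at that interval.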
Any slowness is due to disk I/O errors and retries being thrown around.
You should really run LVM on top of a RAID 1 (software or hardware makes no difference), but LVM is more for storage management than for fault tolerance and redundancy.
-Ross
I don't believe pvmove actually does any of the lifting; it merely creates a mirrored PV area in device-mapper and then hangs around monitoring its progress. Once the mirror is synced up, it throws a couple of barriers and removes the original PV from the mirror, leaving the new PV as the new location for the data.
That is how the move continues through reboots. All the lifting is actually done in device-mapper and its state is preserved there. On restart, LVM will read its metadata to determine if there is a pvmove in progress and then spawn a pvmove to wait for it to complete so it can remove the mirror.
Any slowness is due to disk I/O errors and retries being thrown around.
You should really run LVM on top of a RAID 1 (software or hardware makes no difference), but LVM is more for storage management than for fault tolerance and redundancy.
-Ross
The LDs provided to LVM through the RAID controller are all fault tolerant...
Good info, Thanks! jlc
Joseph L. Casale wrote:
I don't believe pvmove actually does any of the lifting; it merely creates a mirrored PV area in device-mapper and then hangs around monitoring its progress. Once the mirror is synced up, it throws a couple of barriers and removes the original PV from the mirror, leaving the new PV as the new location for the data.
That is how the move continues through reboots. All the lifting is actually done in device-mapper and its state is preserved there. On restart, LVM will read its metadata to determine if there is a pvmove in progress and then spawn a pvmove to wait for it to complete so it can remove the mirror.
Any slowness is due to disk I/O errors and retries being thrown around.
You should really run LVM on top of a RAID 1 (software or hardware makes no difference), but LVM is more for storage management than for fault tolerance and redundancy.
-Ross
The LDs provided to LVM through the RAID controller are all fault tolerant...
If the PVs are fault tolerant, then I don't know why pvmove would be running so slowly; there should be no I/O errors being thrown, as a bad drive would be marked as faulty and taken offline.
What are you pvmoving again?
-Ross
What are you pvmoving again?
-Ross
Ok, here is what happened: I have a box running iet, exporting an LV that started out as two 750 gig HDs mirrored off an 8-channel LSI SAS controller. I needed more space, and added three 400 gig HDs in a RAID 5 VD to this VG. Yes, I now need even more space, but I only have 8 channels, so... I'm moving it all over to seven 750s in a RAID 5, either with a hot spare, or maybe eight 750s in a RAID 6; I don't know yet.
All VDs on the controller are optimal and nothing is degraded, but I need to move all this data off the darn thing to free up the original LD so I can break it and recreate it.
jlc
Joseph L. Casale wrote:
What are you pvmoving again?
-Ross
Ok, here is what happened: I have a box running iet, exporting an LV that started out as two 750 gig HDs mirrored off an 8-channel LSI SAS controller. I needed more space, and added three 400 gig HDs in a RAID 5 VD to this VG. Yes, I now need even more space, but I only have 8 channels, so... I'm moving it all over to seven 750s in a RAID 5, either with a hot spare, or maybe eight 750s in a RAID 6; I don't know yet.
Don't know? Where are you pvmoving everything now?
It would be a whole lot easier to get the new array fully set up, initialized and tested, then add it as a new PV to the existing VG and do the pvmove once, rather than pvmoving it twice.
If you put the new array on a newer higher end controller and leave the existing setup as it is and pvmove between them things would move a lot faster.
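Roughly, that single-move sequence would look like this (a sketch only; "vg0" and the device names are placeholders for the real VG and arrays):

# pvcreate /dev/sdf
  (the new array, once it is built and tested)
# vgextend vg0 /dev/sdf
  (add it to the existing VG)
# pvmove /dev/sdd /dev/sdf
  (migrate the extents off the old PV; repeat for each PV being retired)
# vgreduce vg0 /dev/sdd
  (drop the old PV from the VG once it is empty)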
All VDs on the controller are optimal and nothing is degraded, but I need to move all this data off the darn thing to free up the original LD so I can break it and recreate it.
Is that array on a different controller?
Is that array fully initialized?
Does the controller have a BBU write-back cache?
Maybe I am missing some important parts of the picture here?
-Ross
Don't know? Where are you pvmoving everything now?
Where do I begin... The scenario is "no cash to do it right", so the interim step involves a temporary migration to a non-fault-tolerant setup. The server is a 1U HP and I don't have another controller that matches the remaining interface in that small server.
If I continue to explain all that I have to do, you'll likely not be impressed. Sigh, I can only do what I can!
Regardless, your help has been valuable! jlc
Joseph L. Casale
Don't know? Where are you pvmoving everything now?
Where do I begin... The scenario is "no cash to do it right", so the interim step involves a temporary migration to a non-fault-tolerant setup. The server is a 1U HP and I don't have another controller that matches the remaining interface in that small server.
Ah, well you are using SAS drives, so there is some cash there...
You need to learn how to shake the money maker; it's the only way we can get our jobs done these days. Tell management that there is no more room to get projects X or Y done because they need to invest in upgrading storage, or if it's for fault tolerance, tell them what the worst-case scenario will be. That usually gets them to find that extra $$ to make it happen.
What industry do you work in?
If I continue to explain all that I have to do, you'll likely not be impressed. Sigh, I can only do what I can!
That's not true! I'm unimpressed now ;-)
-Ross
Ah, well you are using SAS drives, so there is some cash there...
My bad, SAS controller with SATA II drives :(
What industry do you work in?
All sorts, odd company: We do everything from automotive accessories to home building!
That's not true! I'm unimpressed now ;-)
-Ross
Love your honesty! <vbg> jlc
Joseph L. Casale wrote:
Ah, well you are using SAS drives, so there is some cash there...
My bad, SAS controller with SATA II drives :(
What industry do you work in?
All sorts, odd company: We do everything from automotive accessories to home building!
That's not true! I'm unimpressed now ;-)
-Ross
Love your honesty!
Since you're moving the data over to a new server/array combo, have you thought about using LTO tapes to back it up and restore it on the new server?
I know it isn't as sexy as LVM pv duplication and such, but it works...
If the LTO drives are too expensive, why not just rent one for this activity? You need to buy the tapes, but that's not too much expense.
-Ross
Since you're moving the data over to a new server/array combo, have you thought about using LTO tapes to back it up and restore it on the new server?
I know it isn't as sexy as LVM pv duplication and such, but it works...
We have an HP autoloader; I thought of doing that actually, and I think I might :) I'll let it run through the weekend and make a decision on Monday. The autoloader is hooked up to a Windows box running the scourge of my life (Backup Exec 9 for Windows), and I didn't know how to interface it easily with the data without installing an agent on the client running the initiator, which I thought would be just as painfully slow! The LV is exported through iet and is formatted NTFS.
Suggestions welcome :)
Jlc
Ps. My solution ain't so sexy; it involves a non-fault-tolerant interim period, so I am not pleased, to say the least!
Joseph L. Casale wrote:
Since you're moving the data over to a new server/array combo, have you thought about using LTO tapes to back it up and restore it on the new server?
I know it isn't as sexy as LVM pv duplication and such, but it works...
We have an HP autoloader; I thought of doing that actually, and I think I might :) I'll let it run through the weekend and make a decision on Monday. The autoloader is hooked up to a Windows box running the scourge of my life (Backup Exec 9 for Windows), and I didn't know how to interface it easily with the data without installing an agent on the client running the initiator, which I thought would be just as painfully slow! The LV is exported through iet and is formatted NTFS.
Suggestions welcome :)
Well I suppose you have nightly backups of the data set already?
Maybe just abort the pvmove, let the Friday full backup run, then on Saturday do a full restore on the new server over iSCSI and bring it online that way.
I am facing the same issue with a migration of our VMs to a new iSCSI setup this year; around 1 TB of VMs needs to be forklifted over. I thought about exotic ways to move it, but I think in the end it will be good ole Backup Exec and tape.
Hey! Or maybe just use robocopy from one iSCSI volume to the other on the Windows side!
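Something like this, as a rough sketch (the drive letters are placeholders for the two mounted iSCSI volumes):

robocopy X:\ Y:\ /MIR /R:1 /W:1 /NP

/MIR mirrors the whole tree, /R and /W keep it from stalling forever on a locked file, and /NP keeps the per-file progress noise out of the log.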
-Ross
I am facing the same issue with a migration of our VMs to a new iSCSI setup this year; around 1 TB of VMs needs to be forklifted over. I thought about exotic ways to move it, but I think in the end it will be good ole Backup Exec and tape.
You're not running ESX, are you? Heh, I just did the same thing on a much smaller scale. I couldn't afford the long downtime while a copy took place, so I shut the VMs off, snapshotted them, and restarted them. I then scripted all files "without" 00000 in the name to rsync over (ssssslowly). After that I only had to shut each VM off, sync the small snapshots, and restart the VMs on the other storage. It only took a few minutes.
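For anyone wanting to copy the trick, roughly (the paths and the exclude pattern are illustrative; the 00000 pattern matches the snapshot delta files described above):

# rsync -av --exclude='*00000*' /old-storage/vms/ /new-storage/vms/
  (first pass while the VMs run on their snapshots; copies only the big base disks)
# rsync -av /old-storage/vms/ /new-storage/vms/
  (second pass after the VMs are shut down; picks up the small deltas)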
jlc