[CentOS] How to speed up Rsync transfers

Mon Feb 25 11:07:41 UTC 2008
William L. Maltby <CentOS4Bill at triad.rr.com>

On Sun, 2008-02-24 at 17:53 -0600, Dan Carl wrote:
> ----- Original Message ----- 
> From: "William L. Maltby" <CentOS4Bill at triad.rr.com>
> <snip list header and now irrelevant stuff>

> > In that case, it sounds like you need a local staging that can be
> > quickly done before starting upload sync. Then the upload can run 24/7.
> > How you might want to deal with new updates that happen before the
> > previous upload finishes is going to be an interesting problem.
> >
> This is exactly the situation I'm trying to avoid.
> Right now it's less than 2GB new/edited images a day so the rsync backup 
> finishes before the script runs again.

General strategy: 1) maximize local operations to minimize intrusion
into the time-constrained resource window by using out-of-band available
resources and 2) minimize in-band demands.

> But I can't take it for granted that this will always be the case.
> Any ideas would be appreciated. What do you mean by local staging?
> 

1) E.g. if local HD space is available, do a local rsync from live ->
backup copy *prior* to the communications window. This can be done even
during normal hours while users are making files, at low priority (see
"man nice"), *if* something like an LVM snapshot is available. The
snapshot assures you that activity starting in the live environment
after the local copy begins doesn't get included. Partial files that
were being written when the snapshot was taken can still get in, but
they will be "corrected" on the next cycle.
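
A rough sketch of that sequence, in case it helps. It only builds and
prints the commands so you can eyeball them before running anything.
The volume group, logical volume, snapshot size, mount point and
staging directory are all made-up names for illustration; substitute
your own, and check that the volume group has free extents for the
snapshot.

```python
import shlex


def staging_commands(vg="vg0", lv="images", snap="images_snap",
                     snap_size="5G", mount_point="/mnt/images_snap",
                     staging_dir="/backup/staging/"):
    """Return the command sequence for one snapshot-based staging run.

    All default names here are hypothetical placeholders.
    """
    origin = f"/dev/{vg}/{lv}"
    snap_dev = f"/dev/{vg}/{snap}"
    return [
        # Freeze a point-in-time view of the live volume.
        ["lvcreate", "--snapshot", "--size", snap_size, "--name", snap, origin],
        # Mount the snapshot read-only so the copy cannot alter it.
        ["mount", "-o", "ro", snap_dev, mount_point],
        # Low-priority local rsync from the snapshot to the staging copy.
        ["nice", "-n", "19", "rsync", "-a", "--delete",
         mount_point + "/", staging_dir],
        # Clean up so the snapshot does not fill up and slow the origin.
        ["umount", mount_point],
        ["lvremove", "-f", snap_dev],
    ]


if __name__ == "__main__":
    for argv in staging_commands():
        print(shlex.join(argv))
```

Note that "nice" only lowers CPU priority. The copy still competes
with the users for disk I/O, so test it during a quiet period first.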

In this scenario, it may be easier to use one of the "canned" utilities
like amanda or backuppc that have been extensively discussed on this
list. I've never used these things, so I don't really know if they are
appropriate in this scheme. However, there is nothing wrong with
hand-crafted stuff if you've the inclination and need.

Keep in mind that over time the local rsync will tend to take longer as
directory numbers and sizes grow unless there is also a significant
amount of file deletion by the users going on. So you may want to
schedule several low-priority snapshot/rsync runs throughout the
workday.

Don't be afraid to seek/request some kind of RAID/NAS/SAN resource if
the data is mission-critical, growing constantly and volatile. It may
not be needed now, but look down the road so you don't get into a
constant cycle of scrambling to keep up with needs.

Ditto for additional bandwidth to the remote. It should be cheaper in
the long run if resource demand is certain to grow significantly.

> I'd like the backup to run from 7pm to 7am and then if it didn't finish to
> resume again the next night.

2) You mention images, so I'm not sure much can be gained by compression
because many types of image files are already compressed to a great
degree. But if there are a large number that can be (further) compressed
for significant gain, compress them *prior* to the start of the
communication window. You may need to do some testing to tell which file
types are suited for further compression.
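
One way to do that testing is to sample the files and measure how much
zlib actually saves, grouped by extension. This is only a sketch: zlib
stands in for whatever compressor you would really use, and the 1 MiB
sample size is an arbitrary assumption.

```python
import os
import zlib
from collections import defaultdict


def compression_gain(data, level=6):
    """Fraction of bytes saved by zlib (can be slightly negative)."""
    if not data:
        return 0.0
    return 1.0 - len(zlib.compress(data, level)) / len(data)


def gain_by_extension(root, sample_bytes=1 << 20):
    """Average compression gain per file extension under root.

    Only the first sample_bytes of each file are read, to keep the
    test quick on large image files.
    """
    gains = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            ext = os.path.splitext(name)[1].lower()
            with open(os.path.join(dirpath, name), "rb") as fh:
                gains[ext].append(compression_gain(fh.read(sample_bytes)))
    return {ext: sum(g) / len(g) for ext, g in gains.items()}
```

Extensions whose average gain is near zero (typically JPEG and other
already-compressed formats) are not worth the CPU time.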

The downside to this is that you no longer have an rsync-amenable image
on the local backup side. Instead of rsync, additional scripting with
hand-crafted copy operations would be needed. However, this is easily
handled using a time-stamp file in conjunction with find's time tests
(e.g. "-newer") to select only things which have been modified since
the previous local copy started.
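
A minimal sketch of the time-stamp idea, doing in Python roughly what
"find -newer stampfile" would do. The function names and the ".new"
suffix are my own invention. The important detail is that the new
stamp is created *before* the scan starts and only replaces the old
one after the run succeeds, so files changed during a run are picked
up by the next one.

```python
import os


def modified_since(root, cutoff):
    """Yield files under root with an mtime later than cutoff (epoch secs)."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.stat(path).st_mtime > cutoff:
                yield path


def run_cycle(root, stamp_path, handle):
    """Call handle(path) for each file changed since the last cycle began.

    The stamp file must live outside root, or it would select itself.
    """
    new_stamp = stamp_path + ".new"
    # Record the start of this cycle before looking at any files.
    open(new_stamp, "w").close()
    # No previous stamp means a first run: take everything.
    cutoff = os.stat(stamp_path).st_mtime if os.path.exists(stamp_path) else 0.0
    for path in modified_since(root, cutoff):
        handle(path)
    # Only now does the new stamp become the reference for the next cycle.
    os.replace(new_stamp, stamp_path)
```

Here handle() would be whatever compress-and-copy step you settle on.
If it raises, the old stamp stays in place and the next run retries
the same files.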

Another downside is that to restore from either the local or remote
copies, decompression would be needed. This is quite fast though. But,
again, some additional hand-crafting would be needed. Thorough testing
too.

> That way when nothing was added/edited on the weekends the backup can catch 
> up.

In conjunction with "lock" files mentioned in another reply, you may be
able to gain something by segmenting the local and remote rsync. This
allows 1) concurrent *local* compression and rsync (if CPU/memory
resources are sufficient to avoid unduly slowing the users' activities -
again "man nice" to reduce the effects on users) and 2) easier
management of the remote rsync start/stop on directory boundaries as the
window is entered/exited. This may not be needed at all or may be of
limited benefit.
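
To make the lock file and directory-boundary idea concrete, here is a
hedged sketch. The 7pm to 7am window comes from your description; the
function names, and passing in the clock and the per-directory sync
step as callables, are just assumptions made to keep it testable.

```python
import os
from datetime import time


def in_window(now, start=time(19, 0), end=time(7, 0)):
    """True if time-of-day `now` is in [start, end), wrapping past midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def acquire_lock(path):
    """Atomically create the lock file; False if another run holds it."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as fh:
        fh.write(str(os.getpid()))
    return True


def sync_segments(segments, sync_one, clock, lock_path):
    """Sync one directory at a time, stopping once the window has closed.

    Returns the segments completed, so the next night can resume with
    the remainder. clock() must return a datetime.time.
    """
    if not acquire_lock(lock_path):
        return []
    done = []
    try:
        for seg in segments:
            if not in_window(clock()):
                break
            sync_one(seg)
            done.append(seg)
    finally:
        os.remove(lock_path)
    return done
```

The window is only checked between directories, so a very large
directory can overrun 7am. A crashed run also leaves a stale lock
behind, which is why the PID is written into it.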

Lastly, see if it's possible to run the rsync during normal hours. If
your site has upload of 750KB/sec and during 90% of the normal workday
only a small percentage is consumed, take advantage by doing some of the
rsync (maybe in small chunks) during these hours at low priority and
throttled appropriately. Presuming that most of your activity is
download, not upload during the normal workday, and knowing that most of
the rsync activity will be upload, not download, there is an opportunity
there.
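
Some back-of-envelope numbers and a throttled command line, as a
sketch. The 2GB/day and 750KB/sec figures are from this thread; the
25% share of the uplink and the host and paths in the usage line are
assumptions for illustration. rsync's --bwlimit takes KB per second.

```python
def transfer_hours(size_gb, rate_kb_per_s):
    """Hours needed to move size_gb at a sustained rate in KB/sec."""
    return size_gb * 1024 * 1024 / rate_kb_per_s / 3600


def throttled_rsync(src, dest, uplink_kb_per_s=750, share=0.25):
    """Build a low-priority rsync command capped at a share of the uplink."""
    limit = int(uplink_kb_per_s * share)
    return ["nice", "-n", "19", "rsync", "-a", "--partial",
            f"--bwlimit={limit}", src, dest]


if __name__ == "__main__":
    print(round(transfer_hours(2, 750), 2), "hours at full rate")
    print(round(transfer_hours(2, 750 * 0.25), 2), "hours at a 25% share")
    # Hypothetical source and destination, for illustration only.
    print(" ".join(throttled_rsync("/backup/staging/",
                                   "backup.example.com:/srv/images/")))
```

At the full rate 2GB is well under an hour, so even a modest daytime
share moves a useful part of the day's changes before the window
opens.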

Testing this scheme before opting for it would be advised.

Finally ...

"Some assembly required".   8-0

> Dan 
> <snip sig stuff>

HTH
-- 
Bill