On Sun, 2008-02-24 at 17:53 -0600, Dan Carl wrote:
----- Original Message ----- From: "William L. Maltby" CentOS4Bill@triad.rr.com
<snip list header and now irrelevant stuff>
In that case, it sounds like you need a local staging that can be quickly done before starting upload sync. Then the upload can run 24/7. How you might want to deal with new updates that happen before the previous upload finishes is going to be an interesting problem.
This is exactly the situation I'm trying to avoid. Right now it's less than 2GB of new/edited images a day, so the rsync backup finishes before the script runs again.
General strategy: 1) maximize local operations to minimize intrusion into the time-constrained resource window by using out-of-band available resources and 2) minimize in-band demands.
But I can't take it for granted that this will always be the case. Any ideas would be appreciated. What do you mean by local staging?
1) E.g. if local HD space is available, do a local rsync from live -> backup copy. This can be done even during normal hours while users are making files (at low priority - see "man nice"), *prior* to the communications window, *if* something like an LVM snapshot is available. That way you can be assured that activities starting in the live environment after the local copy begins don't get included, although partials started prior to the copy can still get in there. But they will be "corrected" on the next cycle.
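A minimal sketch of the low-priority local staging step, in Python. It only builds the command line and does not run anything. The paths are made-up placeholders, and it assumes the snapshot is already mounted read-only somewhere like /mnt/snap.

```python
import shlex

def local_stage_command(src, dest, niceness=19):
    """Build a low-priority local rsync command (nothing is executed here)."""
    # "nice -n 19" asks for the lowest CPU priority so users are disturbed least.
    # "-a" preserves permissions and times; "--delete" mirrors removals into the copy.
    # The trailing slash on src copies the directory's contents, not the directory itself.
    return ["nice", "-n", str(niceness),
            "rsync", "-a", "--delete",
            src.rstrip("/") + "/", dest]

# Hypothetical paths: a mounted snapshot as source, a local staging area as target.
cmd = local_stage_command("/mnt/snap/images", "/backup/stage/images")
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run() from a cron-driven script once the snapshot handling around it is in place.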
In this scenario, it may be easier to use one of the "canned" utilities like amanda or backuppc that have been extensively discussed on this list. I've never used these things, so I don't really know if they are appropriate in this scheme. However, nothing wrong with hand-crafted stuff if you've the inclination and need.
Keep in mind that over time the local rsync will tend to take longer as directory numbers and sizes grow unless there is also a significant amount of file deletion by the users going on. So you may want to schedule several low-priority snapshot/rsync runs throughout the workday.
Don't be afraid to seek/request some kind of raid/NAS/SAN resource if the data is mission-critical, growing constantly and volatile. It may not be needed now, but look down the road so you don't get into a constant cycle of scrambling to keep up with needs.
Ditto for additional bandwidth to the remote. It should be cheaper in the long run if resource demand is certain to grow significantly.
I'd like the backup to run from 7pm to 7am and then if it didn't finish to resume again the next night.
2) You mention images, so I'm not sure much can be gained by compression because many types of image files are already compressed to a great degree. But if there are a large number that can be (further) compressed for significant gain, compress them *prior* to the start of the communication window. You may need to do some testing to tell which file types are suited for further compression.
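One rough way to do that testing is to measure how much a sample of each file type shrinks. The sketch below uses Python's zlib; the 0.9 threshold is an arbitrary assumption, and the random bytes merely stand in for an already-compressed format such as JPEG.

```python
import os
import zlib

def compression_ratio(data, level=6):
    """Return compressed size / original size (1.0 or more means no gain)."""
    if not data:
        return 1.0
    return len(zlib.compress(data, level)) / len(data)

def worth_compressing(data, threshold=0.9):
    """Assumed rule of thumb: bother only if at least 10% is saved."""
    return compression_ratio(data) < threshold

# Repetitive bytes stand in for an uncompressed format (e.g. raw bitmaps).
repetitive = b"scanline " * 10000
# Random bytes stand in for data that is already densely compressed.
dense = os.urandom(90000)

repetitive_ok = worth_compressing(repetitive)
dense_ok = worth_compressing(dense)
print(repetitive_ok, dense_ok)
```

In practice you would read a few real files of each extension and feed their bytes to compression_ratio() rather than use synthetic data.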
The downside to this is that you no longer have an rsync-amenable image on the local backup side. Instead of rsync, hand-crafted copy operations and some additional scripting would be needed. However, this is easily overcome using a time-stamp file in conjunction with find's time tests (e.g. "-newer") to select only things which have been modified since the previous local copy started.
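The time-stamp idea can be sketched in Python roughly as follows. It mimics what find's "-newer" test does; the stamp file name and the demo files are invented for illustration.

```python
import os
import tempfile

def modified_since(root, stamp_path):
    """Return files under root with an mtime newer than the stamp file's."""
    cutoff = os.stat(stamp_path).st_mtime
    changed = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            # Strictly newer, so the stamp file itself is never selected.
            if os.stat(path).st_mtime > cutoff:
                changed.append(path)
    return sorted(changed)

# Demo with fixed mtimes so the result does not depend on timing.
with tempfile.TemporaryDirectory() as root:
    stamp = os.path.join(root, ".last_copy_started")
    old = os.path.join(root, "old.jpg")
    new = os.path.join(root, "new.jpg")
    for path, mtime in ((stamp, 1000), (old, 500), (new, 2000)):
        open(path, "w").close()
        os.utime(path, (mtime, mtime))
    selected = [os.path.basename(p) for p in modified_since(root, stamp)]
print(selected)
```

To match "since the previous local copy started", create the new stamp just before a copy begins and only replace the old stamp with it after the copy succeeds.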
Another downside is that to restore from either the local or remote copies, decompression would be needed. This is quite fast though. But, again, some additional hand-crafting would be needed. Thorough testing too.
That way when nothing was added/edited on the weekends the backup can catch up.
In conjunction with "lock" files mentioned in another reply, you may be able to gain something by segmenting the local and remote rsync. This allows 1) concurrent *local* compression and rsync (if CPU/memory resources are sufficient to avoid unduly slowing the users' activities - again "man nice" to reduce the effects on users) and 2) easier management of the remote rsync start/stop on directory boundaries as the window is entered/exited. This may not be needed at all or may be of limited benefit.
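A sketch of the two checks a per-directory upload loop might make before starting each segment: "are we still inside the window?" and "is another run already active?". The 7pm to 7am hours come from the question; the lock file name and location are assumptions.

```python
import datetime
import os
import tempfile

def in_window(now, start_hour=19, end_hour=7):
    """True if `now` is inside a window that wraps past midnight (19:00 to 07:00)."""
    return now.hour >= start_hour or now.hour < end_hour

def acquire_lock(path):
    """Atomically create a lock file; return False if one already exists."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))  # record the owner to help spot stale locks
    return True

def release_lock(path):
    os.remove(path)

# Demo: a second attempt fails while the first run still holds the lock.
lock = os.path.join(tempfile.mkdtemp(), "upload.lock")
first = acquire_lock(lock)
second = acquire_lock(lock)
release_lock(lock)
print(first, second, in_window(datetime.datetime(2008, 2, 24, 17, 53)))
```

Checking in_window() between directories, rather than killing rsync mid-transfer, is what lets the job stop on a directory boundary and pick up the remaining directories the next night.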
Lastly, see if it's possible to run the rsync during normal hours. If your site has upload of 750KB/sec and during 90% of the normal workday only a small percentage is consumed, take advantage by doing some of the rsync (maybe in small chunks) during these hours at low priority and throttled appropriately. Presuming that most of your activity is download, not upload during the normal workday, and knowing that most of the rsync activity will be upload, not download, there is an opportunity there.
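Some back-of-the-envelope arithmetic for the daytime option, as a Python sketch. The 750KB/sec and 2GB figures are from this thread; the 25% share of the uplink, the paths, and treating 1GB as 1024*1024 KB are assumptions. rsync's --bwlimit takes a rate in KB per second and --partial keeps partly transferred files so an interrupted run can continue.

```python
def transfer_hours(size_gb, rate_kb_per_s):
    """Hours needed to push size_gb at a steady rate (1 GB taken as 1024*1024 KB)."""
    size_kb = size_gb * 1024 * 1024
    return size_kb / rate_kb_per_s / 3600

def throttled_rsync(src, dest, link_kb_per_s=750, fraction=0.25):
    """Build a niced rsync command limited to a fraction of the uplink (not executed)."""
    limit = max(1, int(link_kb_per_s * fraction))
    return ["nice", "-n", "19",
            "rsync", "-a", "--partial", f"--bwlimit={limit}",
            src, dest]

full_speed = transfer_hours(2, 750)   # whole uplink, e.g. at night
daytime = transfer_hours(2, 187)      # about a quarter of the uplink
# Hypothetical staging path and remote target.
cmd = throttled_rsync("/backup/stage/images/", "backuphost:/srv/images/")
print(round(full_speed, 2), round(daytime, 2), " ".join(cmd))
```

At the full 750KB/sec, 2GB takes a little under 47 minutes; at a quarter of that it is a bit over 3 hours, which still fits comfortably inside a workday.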
Testing this scheme before opting for it would be advised.
Finally ...
"Some assembly required". 8-0
Dan
<snip sig stuff>
HTH