Hey, Les,
Thanks for changing the subject to OT.
Les Mikesell wrote:
On Tue, Nov 5, 2013 at 1:28 PM, m.roth@5-cent.us wrote:
As I noted, we make sure rsync uses hard links... but we have a good
number of individual people and projects who *each* have a good number of terabytes of data and generated data. Some of our 2TB drives are
over 90% full, and then there's the honkin' huge RAID, and at least one
14TB partition is over 9TB full....
If you have database dumps or big text files that aren't compressed,
backuppc could be a big win. I think it is the only thing that can keep a compressed copy on the server side and work directly with a stock rsync and uncompressed files on the target hosts (and it can cache the block-checksums so it doesn't have to uncompress and
recompute them every run). While it is 'just a perl script' it's not
quite what you expect from simple scripting...
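For reference, in a stock BackupPC 3.x install those pieces are all plain configuration settings (a sketch; the config path and the exact values shown are assumptions, not a known setup):

    # per-host transfer method, server-side compression level, and rsync
    # arguments all live in config.pl (path assumed; often
    # /etc/BackupPC/config.pl on CentOS)
    grep -E 'XferMethod|CompressLevel|RsyncArgs' /etc/BackupPC/config.pl
    # block-checksum caching is enabled by adding --checksum-seed=32761
    # to $Conf{RsyncArgs} and $Conf{RsyncRestoreArgs}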
We have a *bunch* of d/bs. Oracle. MySQL. Postgresql. All with about a week's dumps from every night, and then backups of them to the b/u servers. I can't imagine how they'd be a win - don't remember just off the top of my head if they're compressed or not.
A *lot* of our data is not huge text files - lots and lots of pure datafiles, output from things like Matlab, R, and some local programs, like the one for modeling protein folding.
mark
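A quick way to check whether the nightly dumps are already compressed (just a sketch; the dump directory here is a placeholder, not the real layout):

    # 'file' reports e.g. "ASCII text" for a plain SQL dump and
    # "gzip compressed data" for a compressed one; path is hypothetical
    file /backup/dumps/* | head
    ls -lh /backup/dumps | head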
On 11/5/2013 12:41 PM, m.roth@5-cent.us wrote:
We have a *bunch* of d/bs. Oracle. MySQL. Postgresql. All with about a week's dumps from every night, and then backups of them to the b/u servers. I can't imagine how they'd be a win - don't remember just off the top of my head if they're compressed or not.
A *lot* of our data is not huge text files - lots and lots of pure datafiles, output from things like Matlab, R, and some local programs, like the one for modeling protein folding.
Lots of binary data files are full of zeros and/or repetitive patterns that compress quite easily.
On Tue, Nov 5, 2013 at 2:41 PM, m.roth@5-cent.us wrote:
Hey, Les,
Thanks for changing the subject to OT.
Errr... I just replied in gmail - I think it has been there all along.
We have a *bunch* of d/bs. Oracle. MySQL. Postgresql. All with about a week's dumps from every night, and then backups of them to the b/u servers. I can't imagine how they'd be a win - don't remember just off the top of my head if they're compressed or not.
If the dumps aren't pre-compressed, they would be compressed on the backuppc side. And if there are unchanged copies on the target hosts (i.e. more than the current night's dumps) that would still be recognized by the rsync run as unchanged, even though backuppc is looking at the compressed copy. If you already compress on the target host, there's not much more you can do.
A *lot* of our data is not huge text files - lots and lots of pure datafiles, output from things like Matlab, R, and some local programs, like the one for modeling protein folding.
Anything that isn't already compressed, encrypted, or intentionally random is likely to compress 2 to 10x. Just poking through the 'compression summary' on my backuppc servers, I don't see anything less than 55%, and most of the bigger targets are closer to 80% compression. One that has 50GB of logfiles is around 90%.
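A rough way to estimate the compression win on one representative data file before committing to anything (a sketch; the sample path is hypothetical):

    # compare raw vs. gzip'd size for a single file
    f=/data/project/sample_output.dat   # hypothetical sample file
    raw=$(wc -c < "$f")
    gz=$(gzip -c "$f" | wc -c)
    awk -v r="$raw" -v g="$gz" 'BEGIN { printf "raw=%d compressed=%d ratio=%.1fx\n", r, g, r/g }'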
Les Mikesell wrote:
On Tue, Nov 5, 2013 at 2:41 PM, m.roth@5-cent.us wrote:
<snip>
We have a *bunch* of d/bs. Oracle. MySQL. Postgresql. All with about a week's dumps from every night, and then backups of them to the b/u servers. I can't imagine how they'd be a win - don't remember just off the top of my head if they're compressed or not.
If the dumps aren't pre-compressed, they would be compressed on the backuppc side. And if there are unchanged copies on the target hosts
Right, but
(i.e. more than the current night's dumps) that would still be recognized by the rsync run as unchanged, even though backuppc is looking at the compressed copy. If you already compress on the target host, there's not much more you can do.
A *lot* of our data is not huge text files - lots and lots of pure datafiles, output from things like Matlab, R, and some local programs, like the one for modeling protein folding.
Anything that isn't already compressed, encrypted, or intentionally random is likely to compress 2 to 10x. Just poking through the 'compression summary' on my backuppc servers, I don't see anything less than 55%, and most of the bigger targets are closer to 80% compression. One that has 50GB of logfiles is around 90%.
Oh, please - I see a filesystem fill up, and I start looking for what did it so suddenly... just the other week, I had one of our interns run Matlab and create a 35G nohup.out in his home directory... which was on the same filesystem as mine, and I was Not Amused when that blew out the filesystem.
Yeah, I know, we're trying to move stuff around, that's not infrequent, given the amount of data my folks generate.
mark
On Tue, Nov 5, 2013 at 3:45 PM, m.roth@5-cent.us wrote:
Yeah, I know, we're trying to move stuff around, that's not infrequent, given the amount of data my folks generate.
And that's the other place that backuppc will help. If you move a file that is already in an existing backup, backuppc's rsync will copy it over the network because it doesn't have a match in that location, but when it goes to add the compressed copy to the pool it will notice that there is already a file with identical content there and use a hardlink instead of needing additional space.
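The pooling idea itself is simple to sketch in shell (this is only a conceptual illustration of content-based pooling via hard links, not BackupPC's actual code, which also compresses pool files and handles hash collisions):

    # one copy per unique content in pool/, named by hash;
    # each backup tree hard-links into it (must be on the same filesystem)
    mkdir -p pool backups/2013-11-05
    store() {
        src=$1; dest=$2
        sum=$(md5sum "$src" | cut -d' ' -f1)
        [ -e "pool/$sum" ] || cp "$src" "pool/$sum"
        ln "pool/$sum" "$dest"
    }
    store incoming/bigfile.dat backups/2013-11-05/bigfile.dat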
Les Mikesell wrote:
On Tue, Nov 5, 2013 at 3:45 PM, m.roth@5-cent.us wrote:
Yeah, I know, we're trying to move stuff around, that's not infrequent, given the amount of data my folks generate.
And that's the other place that backuppc will help. If you move a file that is already in an existing backup, backuppc's rsync will copy it over the network because it doesn't have a match in that location, but when it goes to add the compressed copy to the pool it will notice that there is already a file with identical content there and use a hardlink instead of needing additional space.
Um, but rsync will already do that. Anyway, when I say move things, I mean whole backups to a less-full drive, or the much rarer times that we need to move a user who's using a *large* amount of space.
mark
On Tue, Nov 5, 2013 at 4:25 PM, m.roth@5-cent.us wrote:
Yeah, I know, we're trying to move stuff around, that's not infrequent, given the amount of data my folks generate.
And that's the other place that backuppc will help. If you move a file that is already in an existing backup, backuppc's rsync will copy it over the network because it doesn't have a match in that location, but when it goes to add the compressed copy to the pool it will notice that there is already a file with identical content there and use a hardlink instead of needing additional space.
Um, but rsync will already do that.
No, rsync itself will only do it when the identical file is still in the identical path from the identical host.
Anyway, when I say move things, I mean whole backups to a less-full drive, or the much rarer times that we need to move a user who's using a *large* amount of space.
Backuppc will match up identical content, no matter where it finds it. If it is a different copy or moved to a different location it does have to transfer it to the backuppc server, but then it will be discarded and replaced with a link to the existing pooled copy.
Les Mikesell wrote:
On Tue, Nov 5, 2013 at 4:25 PM, m.roth@5-cent.us wrote:
Yeah, I know, we're trying to move stuff around, that's not infrequent, given the amount of data my folks generate.
And that's the other place that backuppc will help. If you move a file that is already in an existing backup, backuppc's rsync will copy it over the network because it doesn't have a match in that location, but when it goes to add the compressed copy to the pool it will notice that there is already a file with identical content there and use a hardlink instead of needing additional space.
Um, but rsync will already do that.
No, rsync itself will only do it when the identical file is still in the identical path from the identical host.
Unless you tell it a path to compare to - as I said, we point it to <backupdirectory>/<symlink "latest">.
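Something along these lines (a minimal sketch; the host name, paths, and the name of the 'latest' symlink are placeholders rather than the actual script):

    # hard-link files that are unchanged since the previous run,
    # then repoint 'latest' at the new snapshot
    today=$(date +%F)
    rsync -aH --delete \
          --link-dest=/backup/host1/latest \
          host1:/home/ "/backup/host1/$today/" \
      && ln -snf "/backup/host1/$today" /backup/host1/latest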
Anyway, when I say move things, I mean whole backups to a less-full drive, or the much rarer times that we need to move a user who's using a *large* amount of space.
Backuppc will match up identical content, no matter where it finds it. If it is a different copy or moved to a different location it does have to transfer it to the backuppc server, but then it will be discarded and replaced with a link to the existing pooled copy.
Right. Moving things, though, for us is manual, esp. since it can sometimes take days (like the 700+G I've been trying to copy from a 3TB drive that was defective to another that seems ok...)
mark
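For a multi-day copy like that, driving it with rsync instead of cp makes it restartable if the flaky drive acts up (a sketch; the mount points are placeholders):

    # -aH preserves attributes and hard links; --partial keeps partially
    # transferred files so an interrupted run resumes instead of starting over
    rsync -aH --partial --progress /mnt/old_3tb/ /mnt/new_3tb/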
On Tue, Nov 5, 2013 at 4:42 PM, m.roth@5-cent.us wrote:
Backuppc will match up identical content, no matter where it finds it. If it is a different copy or moved to a different location it does have to transfer it to the backuppc server, but then it will be discarded and replaced with a link to the existing pooled copy.
Right. Moving things, though, for us is manual, esp. since it can sometimes take days (like the 700+G I've been trying to copy from a 3TB drive that was defective to another that seems ok...)
But even little automated things like logfile rotation can add up when you catch it across a bunch of noisy hosts. You don't really need to store the whole contents of yesterday's messages.1 and today's messages.2 separately when they are the same thing, just renamed.
Les Mikesell wrote:
On Tue, Nov 5, 2013 at 4:42 PM, m.roth@5-cent.us wrote:
Backuppc will match up identical content, no matter where it finds it. If it is a different copy or moved to a different location it does have to transfer it to the backuppc server, but then it will be discarded and replaced with a link to the existing pooled copy.
Right. Moving things, though, for us is manual, esp. since it can sometimes take days (like the 700+G I've been trying to copy from a 3TB drive that was defective to another that seems ok...)
But even little automated things like logfile rotation can add up when you catch it across a bunch of noisy hosts. You don't really need to store the whole contents of yesterday's messages.1 and today's messages.2 separately when they are the same thing, just renamed.
We don't back them up, except for /var/log on the central logging host.
But to return to the first para, there's no identical content. There's similar content on development and prod servers for each team, but that's not identical, so it's really not an issue.
mark
On Wed, Nov 6, 2013 at 8:34 AM, m.roth@5-cent.us wrote:
But even little automated things like logfile rotation can add up when you catch it across a bunch of noisy hosts. You don't really need to store the whole contents of yesterday's messages.1 and today's messages.2 separately when they are the same thing, just renamed.
We don't back them up, except for /var/log on the central logging host.
Are they rotated by renaming there?
But to return to the first para, there's no identical content. There's similar content on development and prod servers for each team, but that's not identical, so it's really not an issue.
If the data is compressible, you'd still likely get 2x+ space savings from compression on the backup server side. If the data sets are something like time-series data that just change as additional samples are added, it might be worth working out a scheme to chunk it up so that only the 'current' time range changes and all of the historic instances stay identical.
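Such a scheme can be as simple as having the data producers append to month-named chunks, so only the current chunk ever differs between backup runs (a sketch; 'collect_samples' and the layout are hypothetical):

    # older chunks never change, so the backup pool keeps a single copy of each
    outdir=/data/project/timeseries
    mkdir -p "$outdir"
    collect_samples >> "$outdir/samples-$(date +%Y-%m).csv"   # hypothetical producer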
-----Original Message-----
From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On Behalf Of Les Mikesell
Sent: 5 November 2013 22:10
To: CentOS mailing list
Subject: Re: [CentOS] [OT] Building a new backup server
Thanks for changing the subject to OT.
Errr... I just replied in gmail - I think it has been there all along.
I did it from the beginning - I wasn't sure if this topic was strictly CentOS-related.
-- //Sorin