On Tue, 2006-01-03 at 18:49 -0600, Johnny Hughes wrote:
On Fri, 2005-12-30 at 00:00 +0100, Maciej Żenczykowski wrote:
e) why aren't identical files between the two trees hardlinked?
$ ls -ali os/*/CentOS/RPMS/yum*noarch*
 278532 -rw-rw-r-- 1 maze maze 395922 Sep  4 19:48 i386/CentOS/RPMS/yum-2.4.0-1.centos4.noarch.rpm
1165388 -rw-rw-r-- 1 maze maze 395922 Oct 10 22:20 x86_64/CentOS/RPMS/yum-2.4.0-1.centos4.noarch.rpm

$ md5sum os/*/CentOS/RPMS/yum*noarch*
371d55a19f8e4ca13d22974128ab4671  i386/CentOS/RPMS/yum-2.4.0-1.centos4.noarch.rpm
371d55a19f8e4ca13d22974128ab4671  x86_64/CentOS/RPMS/yum-2.4.0-1.centos4.noarch.rpm
Just an example of two identical files from my mirror, one of which is wasting space even though contents are identical. I expect we have this situation for almost _all_ i386 packages from the x86_64 distribution...
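For anyone who wants to check this on their own mirror without reading inode numbers by eye, here is a minimal Python sketch. The function name same_inode is made up for illustration; it only uses os.stat from the standard library and assumes both paths exist.

```python
import os

def same_inode(path_a, path_b):
    """Return True if both paths are hard links to the same file on disk."""
    a = os.stat(path_a)
    b = os.stat(path_b)
    # Same device plus same inode number means there is only one copy on disk.
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)
```

For the two yum packages listed above this would return False, since ls -i reports inodes 278532 and 1165388.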
We run a program called hardlink++ on the master mirror that should hard link files that are identical. If it is not hardlinking those, it should be.
Are you using the -H option when you rsync down?
$ pwd
/opt/mirrors/centos/4.2/os/x86_64
$ find|grep "i386.rpm"|while read i;do diff -qr "$i" "../i386/$i";done
$ find|grep "i386.rpm"|while read i;do cat "$i";done|wc -c
440745010
$ find|grep "noarch.rpm"|while read i;do diff -qr "$i" "../i386/$i";done
$ find|grep "noarch.rpm"|while read i;do cat "$i";done|wc -c
426816227

$ pwd
/opt/mirrors/centos/4.2/updates/x86_64
$ find|grep "i386.rpm"|while read i;do diff -qr "$i" "../i386/$i";done
$ find|grep "i386.rpm"|while read i;do cat "$i";done|wc -c
12819616
$ find|grep "noarch.rpm"|while read i;do diff -qr "$i" "../i386/$i";done
diff: ../i386/./RPMS/createrepo-0.4.3-1.noarch.rpm: No such file or directory
$ find|grep "noarch.rpm"|while read i;do cat "$i";done|wc -c
2164495

$ ls RPMS/createrepo-0.4.3-1.noarch.rpm -al
-rw-rw-r-- 2 maze maze 18284 Sep  5 13:59 RPMS/createrepo-0.4.3-1.noarch.rpm
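The loops above count every matching file, including any that are already hardlinked. As a rough sketch of a stricter measurement, the Python below sums only files that have a byte-identical twin at the same relative path in the other tree and do not already share an inode with it. The function name reclaimable_bytes and its parameters are hypothetical; only standard library calls are used.

```python
import filecmp
import os

def reclaimable_bytes(tree, other_tree, suffix):
    """Sum the sizes of files under `tree` ending in `suffix` that have a
    byte-identical, not yet hardlinked twin at the same relative path
    under `other_tree`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(tree):
        for name in filenames:
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            twin = os.path.join(other_tree, os.path.relpath(path, tree))
            if not os.path.isfile(twin):
                continue  # no counterpart in the other tree
            if os.path.samefile(path, twin):
                continue  # already one copy on disk
            if filecmp.cmp(path, twin, shallow=False):  # compare contents
                total += os.path.getsize(path)
    return total
```

Run from /opt/mirrors/centos/4.2/os, a call such as reclaimable_bytes("x86_64", "i386", "noarch.rpm") would correspond to the noarch loop above.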
That looks to me like an 880 MB saving in mirror space to be made there... Considering the i386/x86_64 mirror takes up 7.7 GB (without ISOs), that's quite a bit...
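The 880 MB figure can be reproduced by adding up the four wc -c results quoted above. A small Python check, assuming decimal units (1 GB = 10^9 bytes) for the 7.7 GB mirror size, which the thread does not specify:

```python
# Byte counts copied from the `wc -c` output quoted in this thread.
os_i386 = 440745010
os_noarch = 426816227
updates_i386 = 12819616
updates_noarch = 2164495

total = os_i386 + os_noarch + updates_i386 + updates_noarch
print(total)  # 882545348

# Assumption: "7.7GB" is taken as 7.7 * 10**9 bytes.
share = total / 7.7e9
print(f"{share:.1%} of the mirror")
```

The total slightly overstates the saving, because createrepo-0.4.3-1.noarch.rpm (18284 bytes) already shows a link count of 2.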
I also imagine the noarch files are shared with most of the other architectures... so I'd assume roughly another 400 MB can be saved for each additional arch...
One thing to please remember is that we develop these files from separate locations on separate machines, so they have to be stand alone on those machines initially ... we then combine them together on the mirror and run hardlink++. That SHOULD hardlink all the files that are the same.
OK ... I have done some specific testing and found out this about hardlink++:
It only links files that have the same date/time stamp ... which means if a file has the same size and MD5 sum but a different date, it will not get linked. This is not what I thought it did.
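To make the difference concrete, here is a minimal Python sketch of linking on content alone and ignoring timestamps. It is not how hardlink++ works internally; the function name link_identical and the temporary file suffix are invented for this example, and it assumes both files live on the same filesystem.

```python
import filecmp
import os

def link_identical(keep, duplicate):
    """Replace `duplicate` with a hard link to `keep` if their contents
    match, regardless of timestamps. Returns True if a link was made."""
    if os.path.samefile(keep, duplicate):
        return False  # already the same inode
    if os.path.getsize(keep) != os.path.getsize(duplicate):
        return False  # different sizes can never be identical
    if not filecmp.cmp(keep, duplicate, shallow=False):
        return False  # same size, different bytes
    tmp = duplicate + ".hardlink-tmp"  # hypothetical temporary name
    os.link(keep, tmp)          # create the new link next to the duplicate
    os.replace(tmp, duplicate)  # swap it into place in one rename
    return True
```

One side effect to keep in mind: hard links share one inode, so after linking, the former duplicate reports the date, owner and permissions of the file that was kept.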
I will try to get the arches I control (i386 / x86_64) better hardlinked in the future and try to maintain them that way, since hardlink++ is not doing what I thought it was. However, there are only so many hours in the day.