Hi folks,
I've just finished rsyncing/downloading/jigdoizing the entire i386/x86_64 CentOS 4.2 distribution.
If anyone is interested, go to
http://mirror.tcs.ii.uj.edu.pl/jigdo/
You'll need to edit the .jigdo file by hand to change the server section
[Servers]
CentOS42=file:/opt/mirrors/centos/4.2/
to point to a local mirror (file, http or ftp), e.g. to use kernel.org:
[Servers]
CentOS42=http://mirrors.kernel.org/centos/4.2/
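The hand edit described above could also be scripted. The following is only a minimal sketch: the function name `set_jigdo_server` and the example file name are made up for illustration, and it assumes the .jigdo file is uncompressed text with the `CentOS42=` entry at the start of a line (a gzip-compressed .jigdo file would have to be decompressed first).

```shell
#!/bin/sh
# Hypothetical helper: point the CentOS42 entry of a .jigdo file at another mirror.
# Assumes an uncompressed .jigdo file and a URI without '|' or '&' characters.
set_jigdo_server() {
    jigdo_file=$1
    new_uri=$2
    # '|' is the sed delimiter here because URIs are full of '/'.
    # Write to a temporary file and rename, instead of relying on sed -i.
    sed "s|^CentOS42=.*|CentOS42=${new_uri}|" "$jigdo_file" > "$jigdo_file.new" &&
        mv "$jigdo_file.new" "$jigdo_file"
}

# Example call (the file name is a placeholder, not taken from the mirror):
#   set_jigdo_server some-image.jigdo http://mirrors.kernel.org/centos/4.2/
```

Only the `CentOS42=` line is touched; the `[Servers]` header and every other line pass through unchanged.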
While doing this I have come upon a few questions:
a) it seems the server CDs have a lot of stuff not present in the normal directory mirror. I guess this is an artifact of the build process? [the template files for the server CDs are ~120MB]
b) what are the .newheaders and .repodata directories on i386 CD1?
c) why do the mirror repodata/*.xml.gz files match neither the CD nor the DVD versions for i386?
d) why does the i386 DVD not match perfectly, while the x86_64 DVD matches for _all_ files? The x86_64 CD1 also matches _much_ better than the i386 CD1...
e) why aren't identical files between the two trees hardlinked?
$ ls -ali os/*/CentOS/RPMS/yum*noarch*
278532 -rw-rw-r-- 1 maze maze 395922 Sep 4 19:48 i386/CentOS/RPMS/yum-2.4.0-1.centos4.noarch.rpm
1165388 -rw-rw-r-- 1 maze maze 395922 Oct 10 22:20 x86_64/CentOS/RPMS/yum-2.4.0-1.centos4.noarch.rpm
$ md5sum os/*/CentOS/RPMS/yum*noarch*
371d55a19f8e4ca13d22974128ab4671 i386/CentOS/RPMS/yum-2.4.0-1.centos4.noarch.rpm
371d55a19f8e4ca13d22974128ab4671 x86_64/CentOS/RPMS/yum-2.4.0-1.centos4.noarch.rpm
Just an example of two identical files from my mirror, one of which is wasting space even though contents are identical. I expect we have this situation for almost _all_ i386 packages from the x86_64 distribution...
$ pwd
/opt/mirrors/centos/4.2/os/x86_64
$ find|grep "i386.rpm"|while read i;do diff -qr "$i" "../i386/$i";done
$ find|grep "i386.rpm"|while read i;do cat "$i";done|wc -c
440745010
$ find|grep "noarch.rpm"|while read i;do diff -qr "$i" "../i386/$i";done
$ find|grep "noarch.rpm"|while read i;do cat "$i";done|wc -c
426816227

$ pwd
/opt/mirrors/centos/4.2/updates/x86_64
$ find|grep "i386.rpm"|while read i;do diff -qr "$i" "../i386/$i";done
$ find|grep "i386.rpm"|while read i;do cat "$i";done|wc -c
12819616
$ find|grep "noarch.rpm"|while read i;do diff -qr "$i" "../i386/$i";done
diff: ../i386/./RPMS/createrepo-0.4.3-1.noarch.rpm: No such file or directory
$ find|grep "noarch.rpm"|while read i;do cat "$i";done|wc -c
2164495

$ ls RPMS/createrepo-0.4.3-1.noarch.rpm -al
-rw-rw-r-- 2 maze maze 18284 Sep 5 13:59 RPMS/createrepo-0.4.3-1.noarch.rpm
That looks to me like an 880 MB saving in mirror space... Considering the i386/x86_64 mirror takes up 7.7GB (without ISOs), that's quite a bit...
I also imagine the noarch files are shared with most of the other architectures... so I'd assume another 400MB can be saved for each additional arch...
f) Why aren't jigdo files available on the site? They'd really come in useful, especially in the situation I was in: I already had a complete mirror of all the files, but I still had to bittorrent the CDs/DVDs even though I had 99% of the required data on disk!
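For question (e), the per-directory loops can be folded into one helper that counts only the bytes that are truly duplicated. This is a sketch rather than the commands used in the thread: the name `dup_bytes` is invented here, the second tree must be given as an absolute path, and file names containing newlines are not handled.

```shell
#!/bin/sh
# Hypothetical helper: print the total size in bytes of files under tree $1
# that match pattern $3 and have a byte-identical twin at the same relative
# path under tree $2 (which must be an absolute path).
dup_bytes() {
    (
        cd "$1" || exit 1
        find . -type f -name "$3" | while IFS= read -r f; do
            # Count the file only if the twin exists and compares equal.
            if [ -f "$2/$f" ] && cmp -s "$f" "$2/$f"; then
                wc -c < "$f"
            fi
        done | awk '{ total += $1 } END { printf "%.0f\n", total }'
    )
}

# Example calls against the layout used in the thread:
#   dup_bytes /opt/mirrors/centos/4.2/os/x86_64 /opt/mirrors/centos/4.2/os/i386 '*.i386.rpm'
#   dup_bytes /opt/mirrors/centos/4.2/os/x86_64 /opt/mirrors/centos/4.2/os/i386 '*.noarch.rpm'

# The four byte counts reported in the thread add up as follows:
echo $((440745010 + 426816227 + 12819616 + 2164495))   # prints 882545348
```

The sum is where the figure of roughly 880 MB comes from. Unlike the `cat | wc -c` loops, the helper skips files that are missing or different in the other tree, so its totals can come out slightly lower.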
Cheers, MaZe.
On Fri, 2005-12-30 at 00:00 +0100, Maciej Żenczykowski wrote:
a) it seems the server CDs have a lot of stuff not present in the normal directory mirror. I guess this is an artifact of the build process? [the template files for the server CDs are ~120MB]
b) what are the .newheaders and .repodata directories on i386 CD1?
c) why do the mirror repodata/*.xml.gz files match neither the CD nor the DVD versions for i386?
There was an issue after tree dissemination that required yum-arch and createrepo to be run again on the main tree. This may happen from time to time due to mirror rsync issues.
d) why does the i386 DVD not match perfectly, while the x86_64 DVD matches for _all_ files? The x86_64 CD1 also matches _much_ better than the i386 CD1...
We needed to rerun yum-arch and createrepo on the tree after the ISOs were released ... that may or may not be the cause of the differences. However, from a yum and up2date perspective, the i386 tree, DVD, and CD set are the same.
Did I mention that we don't have 5 million dollars or 500 programmers to produce CentOS? All the trees and mirrors are donated ... and all the developers donate their time and machines to make this happen.
I do the best job I can to make this a good and FREE distro, as do all the other devels.
e) why aren't identical files between the two trees hardlinked?
We run a program called hardlink++ on the master mirror that should hard link files that are identical. If it is not hardlinking those, it should be.
Are you using the -H option when you rsync down?
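The effect of `-H` can be checked locally before blaming the master mirror. The sketch below uses throwaway directories and a made-up package name; it assumes rsync is installed and that the temporary directories sit on one filesystem.

```shell
#!/bin/sh
# Local demonstration: rsync -a alone copies hard-linked files as separate
# files, while adding -H recreates the links on the receiving side.
src=$(mktemp -d); with_h=$(mktemp -d); without_h=$(mktemp -d)
mkdir -p "$src/i386" "$src/x86_64"
echo "same package" > "$src/i386/pkg.noarch.rpm"
ln "$src/i386/pkg.noarch.rpm" "$src/x86_64/pkg.noarch.rpm"

rsync -a  "$src/" "$without_h/"
rsync -aH "$src/" "$with_h/"

# The link count is the second field of ls -l.
links_without=$(ls -l "$without_h/i386/pkg.noarch.rpm" | awk '{print $2}')
links_with=$(ls -l "$with_h/i386/pkg.noarch.rpm" | awk '{print $2}')
echo "without -H: $links_without link(s); with -H: $links_with link(s)"
```

Note that `-H` only preserves links between files that are part of the same transfer. If the i386 and x86_64 trees are pulled in separate rsync runs, links between the two trees are lost even with `-H`.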
Please remember that we develop these files in separate locations on separate machines, so they have to stand alone on those machines initially ... we then combine them on the mirror and run hardlink++. That SHOULD hardlink all the files that are the same.
f) Why aren't jigdo files available on the site? They'd really come in useful, especially in the situation I was in: I already had a complete mirror of all the files, but I still had to bittorrent the CDs/DVDs even though I had 99% of the required data on disk!
I don't know how to do jigdo files ... however, I am willing to learn.
Fedora and Red Hat don't, to my knowledge, create or distribute jigdo files ... so this is not something that we would normally do.
There are lots of things that we don't do ... maybe we need 48 hour days :)
I am willing to learn what jigdo is all about ... but for now I am totally ignorant.
On Tue, 2006-01-03 at 18:49 -0600, Johnny Hughes wrote:
On Fri, 2005-12-30 at 00:00 +0100, Maciej Żenczykowski wrote:
e) why aren't identical files between the two trees hardlinked?
We run a program called hardlink++ on the master mirror that should hard link files that are identical. If it is not hardlinking those, it should be.
Are you using the -H option when you rsync down?
Please remember that we develop these files in separate locations on separate machines, so they have to stand alone on those machines initially ... we then combine them on the mirror and run hardlink++. That SHOULD hardlink all the files that are the same.
OK ... I have done some specific testing and found out the following about hardlink++:
It only links files that have the same date/time stamp ... which means that if two files have the same size and MD5 sum but different dates, they will not get linked. This is not what I thought it did.
I will try to get the arches I control (i386 / x86_64) better hardlinked in the future and to maintain them that way, since hardlink++ is not doing what I thought it was. However, there are only so many hours in the day.
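One way around the timestamp restriction is to decide on content alone. The following is a minimal sketch, not how hardlink++ works: `link_if_identical` and `inode_of` are hypothetical names, the demo files stand in for the yum noarch package from the start of the thread (the year in the timestamps is assumed), and both files must live on the same filesystem.

```shell
#!/bin/sh
# Hypothetical helpers: hard link $2 to $1 when their contents are identical,
# regardless of modification time.
inode_of() { ls -i "$1" | awk '{print $1}'; }

link_if_identical() {
    keep=$1
    dup=$2
    # Already the same inode: nothing to do.
    [ "$(inode_of "$keep")" = "$(inode_of "$dup")" ] && return 0
    # cmp -s compares bytes only; timestamps play no part.
    cmp -s "$keep" "$dup" || return 1
    # Replace the duplicate with a link to the kept file.
    ln -f "$keep" "$dup"
}

# Demo: same content, different dates.
demo=$(mktemp -d)
mkdir "$demo/i386" "$demo/x86_64"
printf 'same payload' > "$demo/i386/pkg.noarch.rpm"
printf 'same payload' > "$demo/x86_64/pkg.noarch.rpm"
touch -t 200509041948 "$demo/i386/pkg.noarch.rpm"
touch -t 200510102220 "$demo/x86_64/pkg.noarch.rpm"
link_if_identical "$demo/i386/pkg.noarch.rpm" "$demo/x86_64/pkg.noarch.rpm"
ls -li "$demo"/*/pkg.noarch.rpm   # both lines now show the same inode number
```

After linking, both names share a single modification time (that of the kept file), so the later date is lost. That may matter to tools that compare timestamps.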