Hi,
I'm prepping a new backend for our mirror host, and just found that the CentOS mirror could benefit a little from hardlinking. After running `hardlink -cvvn` on our copy of the CentOS repo, I got these results:
    Directories 774
    Objects 220535
    IFREG 219740
    Comparisons 4839
    Would link 903
    Would save 2951557120
This means that 903 files are exact duplicates of other files (ignoring metadata such as dates, permissions, etc.), so more than 2.9 GB could be saved. That's not much in a 207 GB repo, but it's a saving anyway. It also means that the local file system cache would be used more efficiently.
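To put those numbers in proportion, here is a quick back-of-the-envelope calculation in Python. It is only a sketch: it assumes the "Would save" figure is in bytes and that "207GB" means decimal gigabytes, neither of which is stated explicitly above.

```python
# Figures taken from the hardlink report above.
would_save_bytes = 2_951_557_120
would_link_files = 903

# Assumption: the repo size of "207GB" is read as decimal gigabytes.
repo_size_bytes = 207 * 10**9

saved_gb = would_save_bytes / 10**9
saved_fraction = would_save_bytes / repo_size_bytes
# Average size of one redundant copy that would be replaced by a link.
avg_file_mb = would_save_bytes / would_link_files / 10**6

print(f"{saved_gb:.2f} GB saved, {saved_fraction:.1%} of the repo, "
      f"about {avg_file_mb:.1f} MB per linked file")
```

Under those assumptions the saving comes to roughly 1.4% of the repo.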
Problem is, every time I resync my mirror, these hardlinks are lost. So the hardlinking would have to be done in the master repo.
Is there anything that I'm not seeing that prevents this optimization?
Regards,
Jonny
--
João Carlos Mendes Luís
Senior DevOps Engineer, globo.com
jonny@corp.globo.com
+55-21-2483-6893 / +55-21-99218-1222
On Wed, 21 Aug 2019 at 19:26, João Carlos Mendes Luís <jonny@corp.globo.com> wrote:
> Hi,
>
> I'm prepping a new backend for our mirror host, and just found that the CentOS mirror could benefit a little from hardlinking. After running `hardlink -cvvn` on our copy of the CentOS repo, I got these results:
>
>     Directories 774
>     Objects 220535
>     IFREG 219740
>     Comparisons 4839
>     Would link 903
>     Would save 2951557120
>
> This means that 903 files are exact duplicates of other files (ignoring metadata such as dates, permissions, etc.), so more than 2.9 GB could be saved. That's not much in a 207 GB repo, but it's a saving anyway. It also means that the local file system cache would be used more efficiently.
It might be, but it also depends on what the files are. Could you list exactly which files are affected? It may be that the other data is very important for some reason and a hardlink won't be possible.
> Problem is, every time I resync my mirror, these hardlinks are lost. So the hardlinking would have to be done in the master repo.
>
> Is there anything that I'm not seeing that prevents this optimization?
>
> Regards,
> Jonny
CentOS-mirror mailing list CentOS-mirror@centos.org https://lists.centos.org/mailman/listinfo/centos-mirror
On 21/08/2019 20:34, Stephen John Smoogen wrote:
> It might be, but it also depends on what the files are. Could you list exactly which files are affected? It may be that the other data is very important for some reason and a hardlink won't be possible.
Of these 903 files, 859 are drpms, 1 is an rpm (storhaug-nfs-1.0-1.el7.noarch.rpm), 10 are RPM-GPG-KEYs, 2 are HTML files (header and notes), 1 is the GPL, and the rest are some isolinux config files and many repodata files (contrib, cr, extras).
Some examples:
centos/6.10/centosplus/x86_64/drpms/kernel-firmware-2.6.32-696.30.1.el6.centos.plus_2.6.32-754.6.3.el6.centos.plus.noarch.drpm
centos/6.10/centosplus/i386/drpms/kernel-firmware-2.6.32-696.30.1.el6.centos.plus_2.6.32-754.6.3.el6.centos.plus.noarch.drpm

centos/7.6.1810/storage/x86_64/gluster-4.1/storhaug-nfs-1.0-1.el7.noarch.rpm
centos/7.6.1810/storage/x86_64/gluster-4.0/storhaug-nfs-1.0-1.el7.noarch.rpm

centos/RPM-GPG-KEY-CentOS-Testing-7
centos/7.6.1810/os/x86_64/RPM-GPG-KEY-CentOS-Testing-7

centos/6.10/os/x86_64/isolinux/boot.msg
centos/7.6.1810/os/x86_64/isolinux/boot.msg

centos/6.10/cr/x86_64/repodata/dabe2ce5481d23de1f4f52bdcfee0f9af98316c9e0de2ce8123adeefa0dd08b9-primary.xml.gz
centos/7.6.1810/cr/x86_64/repodata/dabe2ce5481d23de1f4f52bdcfee0f9af98316c9e0de2ce8123adeefa0dd08b9-primary.xml.gz
You can easily check this on your own repo by running `hardlink -cvvn centos`. It will NOT make any changes; it just compares the files and prints the list and the report.
Hi everyone. Now that we're on the subject of hard linked files, here's my regular reminder to all the mirror admins to use the -H flag when rsyncing from msync.centos.org. The -H flag preserves hard links, which are already extensively used for CentOS content. One way to verify that your hard links are OK is to run "stat 6.10/os/x86_64/images/boot.iso". There should be "Links: 2" (or more) in the output, because boot.iso is the same file as 6.10/isos/x86_64/CentOS-6.10-x86_64-netinstall.iso. Using hard links makes syncs faster and saves hard disk space.
If you just added -H to your rsync command line, rsync will take care of deleting the unneeded copies of hard linked files automatically the next time you sync.
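For admins who would rather script the check than read `stat` output by eye, here is a minimal Python sketch of the same idea. The function names are made up for illustration; the underlying calls (`os.stat(...).st_nlink` and `os.path.samefile`) are standard library.

```python
import os
import tempfile


def link_count(path):
    """Return how many directory entries (hard links) point at path's inode."""
    return os.stat(path).st_nlink


def are_hard_linked(path_a, path_b):
    """Return True if both paths refer to the same inode on the same device."""
    return os.path.samefile(path_a, path_b)


if __name__ == "__main__":
    # Self-contained demo in a scratch directory; the file names are placeholders.
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "boot.iso")
        alias = os.path.join(tmp, "netinstall.iso")
        with open(original, "wb") as fh:
            fh.write(b"demo")
        os.link(original, alias)  # create a second name for the same inode
        print(link_count(original), are_hard_linked(original, alias))
```

On a real mirror you would call `link_count` on 6.10/os/x86_64/images/boot.iso and expect 2 or more, and `are_hard_linked` on boot.iso and the netinstall ISO should return True if -H did its job.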
And now on to João's specific concerns.
We do run "hardlink" regularly on the master server, but we do so without the -c flag which "Disregards permission, ownership and other differences" [such as modification time].
The repodata files need to preserve their modification times, because the timestamp is included in repomd.xml. If hardlink changes the modification time of a file mentioned in repomd.xml, it may cause odd problems.
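To make the difference concrete, here is a rough Python sketch of the two matching policies. It is not how the `hardlink` tool is implemented, and `link_candidates` and its `ignore_metadata` parameter are hypothetical names. In strict mode two files only count as linkable if their content and their modification time, mode and ownership all match; with `ignore_metadata=True` only the content has to match, which is roughly the -c behaviour discussed above.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's content, read in chunks to keep memory use flat."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def link_candidates(root, ignore_metadata=False):
    """Group regular files under root that could share a single inode.

    Files that are already hard linked to one another are reported once.
    Hashing every file is simple but slow; a real tool would compare sizes first.
    """
    groups = defaultdict(list)
    seen_inodes = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            st = os.stat(path)
            inode = (st.st_dev, st.st_ino)
            if inode in seen_inodes:
                continue  # already a hard link to a file we have seen
            seen_inodes.add(inode)
            key = (st.st_size, file_digest(path))
            if not ignore_metadata:
                # Strict mode: timestamps, permissions and ownership must match too.
                key += (st.st_mtime_ns, st.st_mode, st.st_uid, st.st_gid)
            groups[key].append(path)
    return sorted(sorted(paths) for paths in groups.values() if len(paths) > 1)
```

Two repodata files with identical content but different modification times would show up only with `ignore_metadata=True`, and linking them would force one timestamp onto both names, which is exactly the case we want to avoid.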
The drpms are easier in this regard and yes, it might make sense to run hardlink on those because the exact timestamp is less important for drpms (as far as I'm aware). I don't have the authority to do so, however, so it would need to be someone else.
But the drpm issue may soon be a moot point. There are plans to drop drpms altogether [1] and without drpms, there won't be a need to hard link them either.
[1] https://lists.centos.org/pipermail/centos-devel/2019-June/017433.html