On Fri, Aug 22, 2014 at 3:55 AM, Karanbir Singh mail-lists@karan.org wrote:
On 08/22/2014 03:13 AM, Nico Kadel-Garcia wrote:
Besides the potential corrupt snapshot problem, there are the inevitable discrepancies between the mirrors and git.centos.org itself. Content is likely to differ in small ways among the mirrors, due to the rsync based snapshot being in the past. I assume that some of the individual repos are changing during the overall rsync update period, unless they're all done in parallel, which would be *really* nasty.
it does not matter... the content on the binary cache is hash'd, have you looked at how things are set up?
Forgive me, please, if I wander among different expertise levels and seem to be teaching my granny to suck eggs. It's hard to aim analyses at people with different levels of expertise and experience, and it can be hard to balance completeness versus clarity.
In this case: Yes, I've looked, and transient inconsistencies break the mirrored repository. "git clone" operations against the broken repository fail and report errors, and most clients will be quite out of luck. It's like when you rsync an RPM repository before the repodata has been updated: it can get messy and broken.
And I assume you're not doing "git gc" on the upstream repositories, or not doing it often. Do git pushes ever trigger a repacking on the upstream repository? It's an interesting question, and is another factor that could trigger broken mirrors.
Unless.... Is there a top level directory to use for an rsync mirror? That's going to be a pretty bulky rsync operation, with over 6000 subdirectories and the amount of churn in any of the modified git repos.
I don't understand that statement, are you questioning rsync's ability to handle 6k dirs?
Sorry if I was unclear. As git.centos.org is configured, each git repository is distinct and unique. Out here in userland, we have no visibility into the layout of its back end filesystem. So if it's set up as "/mountpoint/gitrepos/k/kernel", "/mountpoint/gitrepos/s/sendmail", cool. You can set up one rsync daemon sharing "/mountpoint/gitrepos". If they're scattered all over your filesystems and you're publishing each of them as a different rsync target, that makes configuring an rsync daemon and the relevant rsync targets quite awkward.
Yes, it may sound like I'm teaching my granny to suck eggs. Not everyone is as expert with setting up rsync daemons as some of us, so please forgive me for perhaps getting into too much detail.
Anyway, so far, so good. The intriguing problems happen when mirror sites have to traverse that in a single operation, and potentially commit the entire environment with a '--delay-updates'. RPM based mirrors have the advantage that the number of files being changed is usually quite small, maybe a few dozen RPMs on a busy day plus the repodata transactions. Large operations are usually tied to specific directories, such as when CentOS 7 was first published.
Git repos.... are going to be more intriguing to merge and parse if and when CentOS 6 source material is also merged into the primary repos.
Anyway, verification of the consistency of all the mirrored repositories becomes awkward. There's also the lack of site
to be clear, I don't think the aim here is to set up content mirrors for general consumption, the aim is to have an rsync target that lets people run their own mirrors. And we don't need any real sync between git and binary sources - since they are tracked in git as hash'd objects. Something missing ( or corrupt ) will get flagged up right away
I'm trying to suggest that the mirrors would be safer, and more robust, and have better provenance for their content, if you'd publish signed GPG tags in the repos and support git clones, rather than rsync mirrors, for offsite mirrors.
I realise we have an issue where some of the hashes are sha1's and others are sha256's and the checking code, client side, needs to check length and use the right algo - but that's something which should get fixed as we all end up using the same tools and convention.
Does this cause a problem? I thought the clients were quite robust about it when they make their local "git clone" of the upstream repository.
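To make the length check described above concrete, here is a minimal sketch in Python. It assumes the recorded checksums are plain hex strings, so a sha1 digest is 40 characters and a sha256 digest is 64. The names ALGO_BY_HEX_LENGTH, pick_algorithm and digest_matches are made up for illustration and are not taken from the actual CentOS client tooling.

```python
import hashlib

# Assumption: checksums are stored as hex strings, so the length alone
# distinguishes sha1 (40 hex chars) from sha256 (64 hex chars).
ALGO_BY_HEX_LENGTH = {40: "sha1", 64: "sha256"}


def pick_algorithm(expected_hex):
    """Return the hashlib algorithm name implied by the digest length."""
    try:
        return ALGO_BY_HEX_LENGTH[len(expected_hex)]
    except KeyError:
        raise ValueError(
            "unrecognised digest length: %d" % len(expected_hex)
        )


def digest_matches(data, expected_hex):
    """Hash data with the algorithm implied by expected_hex and compare."""
    expected = expected_hex.strip().lower()
    algo = pick_algorithm(expected)
    actual = hashlib.new(algo, data).hexdigest()
    return actual == expected
```

A digest of any other length raises ValueError rather than being silently accepted, which seems the safer default when two conventions are in circulation.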
verification in the unencrypted and unsigned rsync protocol, which I'd not even thought about for git.centos.org. That puts it right into the "people cloning from each other's unsecured repos locally" world, in this case cloning from the rsync mirrors. And it directly brings up the "verify the provenance of local repos" problem that was discounted by some when I brought up the problem earlier.
Several folks did bring up the point of "git.centos.org has an SSL key, what's not secure about it?" If we're using rsync mirrors, we're relying on someone else's mirror site to be secure, as well. And we're probably relying on unencrypted rsync to git.centos.org, itself, to support those mirrors. And we're once again open to someone polluting the data stream with a fake repo.
right, so the confusion comes from other-mirrors, that's certainly not the aim here. it's all for local consumption. And I don't know what is involved in getting rsync around an ssl wrapper. But the fact that metadata in the git repos has the corresponding hashes should be good enough for validating per file. Doing this for the entire tree, every possible piece would be quite hard, admittedly.
Not when the metadata is poisoned by a trojaned merge. Git logs can be edited. Without the GPG sums, it's like a web mirror that has a pack of RPM's with a pack of checksums alongside them. The owner of the mirror, or a cracker attacking the host, can corrupt *both*, and without the GPG tag, it's hard to get provenance.
And *that* is one of the points where having a GPG signed tag, especially one tied to the contents of the SRPM builds, becomes a useful tool for verifying provenance of the tree. You can't rely on a binary comparison: there's likely to be frequent skew between the rsync mirrors and the main repo as a matter of course.
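As a rough illustration of what checking a signed tag on a local clone could look like, here is a small Python sketch that shells out to "git verify-tag". It assumes git and gpg are installed and that the signer's public key is already in the local keyring. The function names, and the repository path and tag name in the usage line, are hypothetical and not something git.centos.org publishes today.

```python
import subprocess


def verify_tag_command(repo_dir, tag):
    """Build the git command that checks the GPG signature on a tag."""
    return ["git", "-C", repo_dir, "verify-tag", tag]


def tag_is_verified(repo_dir, tag):
    """Return True only if git reports a good signature for the tag.

    "git verify-tag" exits non-zero when the tag is unsigned, the
    signature is bad, or the signing key is not in the local keyring.
    """
    result = subprocess.run(
        verify_tag_command(repo_dir, tag),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

Usage would be along the lines of tag_is_verified("/srv/mirror/kernel.git", "some-signed-tag"). The check only says that the tag was signed by a key you already trust, so the keyring itself still has to come from somewhere other than the mirror.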