[CentOS-devel] Importing CentOS-6 Sources into git.centos.org

Wed Aug 27 11:28:54 UTC 2014
Johnny Hughes <johnny at centos.org>

On 08/25/2014 10:59 PM, Nico Kadel-Garcia wrote:
> On Fri, Aug 22, 2014 at 3:55 AM, Karanbir Singh <mail-lists at karan.org> wrote:
>> On 08/22/2014 03:13 AM, Nico Kadel-Garcia wrote:
>>> Besides the potential corrupt snapshot problem, there's the inevitable
>>> discrepancies between the mirrors and git.centos.org itself. Content
>>> is likely to differ in small ways among the mirrors, due to the rsync
>>> based snapshot being in the past.  I assume that some of the
>>> individual repos are changing during the overall rsync update period,
>>> unless they're all done in parallel, which would be *really* nasty.
>>
>> it does not matter... the content on the binary cache is hashed; have
>> you looked at how things are set up?
> 
> Forgive me, please, if I wander among different expertise levels and
> seem to be teaching my granny to suck eggs. It's hard to aim analyses
> at people with different levels of expertise and experience, and it
> can be hard to balance completeness versus clarity.
> 
> In this case: Yes, I've looked, and transient inconsistencies break
> the mirrored repository. "git clone" operations against the broken
> repository report errors and fail, and most clients will be quite out
> of luck. It's like when you rsync an RPM repository before the repodata
> has been updated: it can get messy and broken.
> 
> And I assume you're not doing "git gc" on the upstream repositories,
> or not doing it often. Do git pushes ever trigger a repacking on the
> upstream repository? It's an interesting question, and is another
> factor that could trigger broken mirrors.
> 
>>> Unless.... Is there a top level directory to use for an rsync mirror?
>>> That's going to be a pretty bulky rsync operation, with over 6000
>>> subdirectories and the amount of churn in any of the modified git
>>> repos.
>>
>> I don't understand that statement; are you questioning rsync's ability to
>> handle 6k dirs?
> 
> Sorry if I was unclear. As git.centos.org is configured, each git
> repository is distinct and unique. Out here in userland, we have no
> visibility into its back-end filesystem layout. So if it's set up
> as "/mountpoint/gitrepos/k/kernel", "/mountpoint/gitrepos/s/sendmail",
> cool. You can set up one rsync daemon sharing "/mountpoint/gitrepos".
> If they're scattered all over your filesystems and you're publishing
> each of them as a different rsync target, that makes configuring an
> rsync daemon and relevant rsync targets quite awkward.
> 
> Yes, it may sound like I'm teaching my granny to suck eggs. Not
> everyone is as expert with setting up rsync daemons as some of us, so
> please forgive me for perhaps getting into too much detail.
> 
> Anyway so far, so good. The intriguing problems happen when mirror
> sites have to traverse that in a single operation, and potentially
> commit the entire environment with a '--delay-updates'. RPM based
> mirrors have the advantage that the number of files being changed is
> usually quite small, maybe a few dozen RPMs on a busy day and the
> repodata transactions. Large operations are usually tied to specific
> directories, such as when CentOS 7 was first published.
> 
> Git repos.... are going to be more intriguing to merge and parse if
> and when CentOS 6 source material is also merged into the primary
> repos.
> 
>>> Anyway, verification of the consistency of all the mirrored
>>> repositories becomes awkward. There's also the lack of site
>>
>> to be clear, I don't think the aim here is to set up content mirrors for
>> general consumption; the aim is to have an rsync target that lets people
>> run their own mirrors. And we don't need any real sync between git and
>> binary sources, since they are tracked in git as hashed objects.
>> Something missing ( or corrupt ) will get flagged up right away.
> 
> I'm trying to suggest that the mirrors would be safer, and more
> robust, and have better provenance for their content, if you'd publish
> signed GPG tags in the repos and support git clones, rather than rsync
> mirrors, for offsite mirrors.
> 
>> I realise we have an issue where some of the hashes are sha1's and
>> others are sha256's and the checking code, client side, needs to check
>> length and use the right algo - but that's something which should get
>> fixed as we all end up using the same tools and convention.
> 
> Does this cause a problem? I thought the clients were quite robust
> about it when they make their local "git clone" of the upstream
> repository.
> 
>>> verification in the unencrypted and unsigned rsync protocol, which I'd
>>> not even thought about for git.centos.org. That puts it right into the
>>> "people cloning from each other's unsecured repos locally" world, in
>>> this case cloning from the rsync mirrors. And it directly brings up
>>> the "verify the provenance of local repos" problem that was discounted
>>> by some when I brought up the problem earlier.
>>>
>>> Several folks did bring up the point of "git.centos.org has an SSL
>>> key, what's not secure about it?"  If we're using rsync mirrors, we're
>>> relying on someone else's mirror site to be secure, as well. And we're
>>> probably relying on unencrypted rsync to git.centos.org, itself, to
>>> support those mirrors. And we're once again open to someone polluting
>>> the data stream with a fake repo.
>>
>> right, so the confusion comes from other-mirrors; that's certainly not
>> the aim here. It's all for local consumption. And I don't know what is
>> involved in getting rsync around an ssl wrapper. But the fact that
>> metadata in the git repos has the corresponding hashes should be good
>> enough for validating per file. Doing this for the entire tree, every
>> possible piece would be quite hard, admittedly.
> 
> Not when the metadata is poisoned by a trojaned merge. Git logs can be
> edited. Without the GPG sums, it's like a web mirror that has a pack
> of RPMs with a pack of checksums alongside them. The owner of the
> mirror, or a cracker attacking the host, can corrupt *both*, and
> without the GPG tag, it's hard to get provenance.
> 
> And *that* is one of the points where having a GPG signed tag,
> especially one tied to the contents of the SRPM builds, becomes a
> useful tool for verifying provenance of the tree. You can't rely on a
> binary comparison, there's likely to be frequent skew between the
> rsync mirrors and the main repo as a matter of course.

Red Hat does not want to provide us a gpg signed tag, so therefore we
will not be getting one.  No reason to keep bringing it up.  It's not
happening any time soon.

We are not providing mirrors of this all over the place, we are quite
happy with one location and backups/failover.  What we are trying to
provide is the ability for people who want a local mirror of this to be
able to get it another way.  This is a convenience only, not something
that is required.

I am producing CentOS-7 directly with the git repo as it is RIGHT NOW,
using absolutely nothing but the tools also provided in this repo and
calls to mock.  Fermi Scientific Linux is also producing their SL7 from
this same git.centos.org repo, so this is not a blocker to being able to
mirror this to get the source code or produce binaries.  All the tools
are being provided or updated by the community and everything is open.
It all works right now.

So, if we can create a mechanism to mirror the content as well, other
than just a script to do it via the JSON API, then we will. This is not
critical, obviously, as both CentOS and Scientific Linux are tracking
EL7 and doing updates from git.centos.org just fine right now.
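
For anyone running a local mirror, the length-based hash check mentioned
in the quoted discussion (sha1 vs sha256) could be sketched roughly as
below. This is only an illustration, not the code the CentOS tools
actually use. The function names are made up, and the metadata layout
(one "<hexdigest> <relative path>" pair per line) is an assumption.

```python
import hashlib
from pathlib import Path
from typing import List

# Hex digest length -> hashlib algorithm name (sha1 = 40, sha256 = 64).
ALGO_BY_HEX_LENGTH = {40: "sha1", 64: "sha256"}


def pick_algorithm(hexdigest: str) -> str:
    """Choose the hash algorithm from the length of the recorded digest."""
    try:
        return ALGO_BY_HEX_LENGTH[len(hexdigest)]
    except KeyError:
        raise ValueError(f"unrecognised digest length: {len(hexdigest)}")


def file_matches(path: Path, expected_hex: str) -> bool:
    """Hash the file in 1 MiB chunks and compare with the recorded digest."""
    digest = hashlib.new(pick_algorithm(expected_hex))
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.lower()


def verify_metadata(metadata_file: Path, root: Path) -> List[str]:
    """Return the relative paths that are missing or do not match.

    ASSUMPTION: each non-empty line is "<hexdigest> <relative path>".
    """
    bad = []
    for line in metadata_file.read_text().splitlines():
        if not line.strip():
            continue
        expected, relpath = line.split(None, 1)
        target = root / relpath
        if not target.is_file() or not file_matches(target, expected):
            bad.append(relpath)
    return bad
```

Calling verify_metadata() on a checked-out package directory would list
any binary source that is missing or corrupt. Note that it only checks
files against the recorded hashes; it says nothing about the provenance
of the metadata itself.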

