[CentOS-devel] Importing CentOS-6 Sources into git.centos.org

Tue Aug 26 03:59:36 UTC 2014
Nico Kadel-Garcia <nkadel at gmail.com>

On Fri, Aug 22, 2014 at 3:55 AM, Karanbir Singh <mail-lists at karan.org> wrote:
> On 08/22/2014 03:13 AM, Nico Kadel-Garcia wrote:
>> Besides the potential corrupt snapshot problem, there's the inevitable
>> discrepancies between the mirrors and git.centos.org itself. Content
>> is likely to differ in small ways among the mirrors, due to the rsync
>> based snapshot being in the past.  I assume that some of the
>> individual repos are changing during the overall rsync update period,
>> unless they're all done in parallel, which would be *really* nasty.
>
> it does not matter... the content on the binary cache is hash'd, have
> you looked at how things are setup ?

Forgive me, please, if I wander among different expertise levels and
seem to be teaching my granny to suck eggs. It's hard to aim analyses
at people with different levels of expertise and experience, and it
can be hard to balance completeness versus clarity.

In this case: Yes, I've looked, and transient inconsistencies break
the mirrored repository. "git clone" operations against the broken
repository report errors and fail, and most clients will be quite out
of luck. It's like rsyncing an RPM repository before the repodata has
been updated: it can get messy and broken.

And I assume you're not doing "git gc" on the upstream repositories,
or not doing it often. Do git pushes ever trigger a repacking on the
upstream repository? It's an interesting question, and is another
factor that could trigger broken mirrors.

>> Unless.... Is there a top level directory to use for an rsync mirror?
>> That's going to be a pretty bulky rsync operation, with over 6000
>> subdirectories and the amount of churn in any of the modified git
>> repos.
>
> I dont understand that statement, are you questioning rsync's ability to
> handle 6k dirs ?

Sorry if I was unclear. As git.centos.org is configured, each git
repository is distinct and unique. Out here in userland, we have no
visibility into the layout of its back end filesystem. So if it's set up
as "/mountpoint/gitrepos/k/kernel", "/mountpoint/gitrepos/s/sendmail",
cool. You can set up one rsync daemon sharing "/mountpoint/gitrepos".
If they're scattered all over your filesystems and you're publishing
each of them as a different rsync target, that makes configuring an
rsync daemon and the relevant rsync targets quite awkward.
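
To make the single-target case concrete, here is a minimal sketch. It
assumes the hypothetical "/mountpoint/gitrepos/<first letter>/<package>"
layout from the paragraph above (the real layout on git.centos.org is
not known out here), and the module name "gitrepos" is made up for
illustration. It computes where a package would live and renders one
rsyncd.conf module that exports the whole tree read-only.

```python
import posixpath


def repo_path(root: str, package: str) -> str:
    """Return the assumed on-disk location of a package, e.g. <root>/k/kernel."""
    if not package:
        raise ValueError("package name must not be empty")
    # Assumption: repositories are bucketed by the first letter of the name.
    return posixpath.join(root, package[0].lower(), package)


def rsyncd_module(name: str, root: str) -> str:
    """Render one rsyncd.conf module that shares the whole tree read-only."""
    return (
        f"[{name}]\n"
        f"    path = {root}\n"
        f"    comment = all git repositories under one module\n"
        f"    read only = yes\n"
    )


if __name__ == "__main__":
    print(repo_path("/mountpoint/gitrepos", "kernel"))
    print(rsyncd_module("gitrepos", "/mountpoint/gitrepos"))
```

With one module, a mirror needs only one rsync target. With scattered
repositories, the same function would have to be called once per
repository, which is the awkward case.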

Yes, it may sound like I'm teaching my granny to suck eggs. Not
everyone is as expert with setting up rsync daemons as some of us, so
please forgive me for perhaps getting into too much detail.

Anyway, so far, so good. The intriguing problems happen when mirror
sites have to traverse that in a single operation, and potentially
commit the entire environment with '--delay-updates'. RPM based
mirrors have the advantage that the number of files being changed is
usually quite small, maybe a few dozen RPMs on a busy day plus the
repodata transactions. Large operations are usually tied to specific
directories, such as when CentOS 7 was first published.
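
As a rough sketch of what such a mirror pull might look like, the
snippet below builds an rsync command line using '--delay-updates'. The
source URL and destination path are hypothetical, and the choice of the
other flags is my assumption, not anything git.centos.org documents.
'--delay-updates' holds each updated file in a holding directory and
renames them all into place at the end of the transfer, which narrows
the window of inconsistency but does not make the update atomic.

```python
import subprocess


def mirror_command(source: str, dest: str) -> list[str]:
    """Build the rsync argument list for one whole-tree mirror pass."""
    return [
        "rsync",
        "--archive",        # preserve permissions, times, symlinks
        "--delete-delay",   # remove vanished files only after the transfer
        "--delay-updates",  # rename updated files into place at the end
        source,
        dest,
    ]


def run_mirror(source: str, dest: str) -> int:
    """Run the mirror pass and return rsync's exit status."""
    return subprocess.run(mirror_command(source, dest), check=False).returncode


if __name__ == "__main__":
    # Hypothetical endpoints, printed rather than executed.
    print(" ".join(mirror_command("rsync://mirror.example.org/gitrepos/",
                                  "/srv/mirror/gitrepos/")))
```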

Git repos.... are going to be more intriguing to merge and parse if
and when CentOS 6 source material is also merged into the primary
repos.

>> Anyway, verification of the consistency of all the mirrored
>> repositories becomes awkward. There's also the lack of site
>
> to be clear, I dont think the aim here is to setup content mirrors for
> general consumption, the aim is to have a rsync target that lets people
> run their own mirrors. And we dont need any real sync between git and
> binary sources - since they are tracked in git as hash'd objects.
> Something missing will get flagged up right away ( or corrupt )

I'm trying to suggest that the mirrors would be safer, and more
robust, and have better provenance for their content, if you'd publish
signed GPG tags in the repos and support git clones, rather than rsync
mirrors, for offsite mirrors.

> I realise we have an issue where some of the hash's are sha1's and
> others are sha256's and the checking code, client side, needs to check
> length and use the right algo - but thats something which should get
> fixed as we all end up using the same tools and convention.

Does this cause a problem? I thought the clients were quite robust
about it when they make their local "git clone" of the upstream
repository.
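
For reference, the length check described in the quoted text can be
sketched in a few lines. This is not the actual CentOS client code,
which I have not reproduced here; the function names are made up. It
relies only on the fact that a SHA-1 digest is 40 hex characters and a
SHA-256 digest is 64.

```python
import hashlib

# Hex digest length -> hashlib algorithm name.
_ALGO_BY_HEX_LENGTH = {40: "sha1", 64: "sha256"}


def algorithm_for(digest_hex: str) -> str:
    """Pick the hash algorithm from the length of the recorded digest."""
    try:
        return _ALGO_BY_HEX_LENGTH[len(digest_hex)]
    except KeyError:
        raise ValueError(
            f"unrecognised digest length: {len(digest_hex)}"
        ) from None


def matches(data: bytes, expected_hex: str) -> bool:
    """Return True if data hashes to expected_hex under the inferred algorithm."""
    algo = algorithm_for(expected_hex)
    actual = hashlib.new(algo, data).hexdigest()
    return actual == expected_hex.lower()


if __name__ == "__main__":
    blob = b"example source tarball contents"
    print(matches(blob, hashlib.sha1(blob).hexdigest()))
    print(matches(blob, hashlib.sha256(blob).hexdigest()))
```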

>> verification in the unencrypted and unsigned rsync protocol, which I'd
>> not even thought about for git.centos.org. That puts it right into the
>> "people cloning from each other's unsecured repos locally" world, in
>> this case cloning from the rsync mirrors. And it directly brings up
>> the "verify the provenance of local repos" problem that was discounted
>> by some when I brought up the problem earlier.
>>
>> Several folks did bring up the point of "git.centos.org has an SSL
>> key, what's not secure about it?"  If we're using rsync mirrors, we're
>> relying on someone else's mirror site to be secure, as well. And we're
>> probably relying on unencrypted rsync to git.centos.org, itself, to
>> support those mirrors. And we're once again open to someone polluting
>> the data stream with a fake repo.
>
> right, so the confusion comes from other-mirrors, thats certainly not
> the aim here. its all for local consumption. And I dont know what is
> involved in getting rsync around a ssl wrapper. But the fact that
> metadata in the git repos' has the corresponding hash's should be good
> enough for validating per file. Doing this for the entire tree, every
> possible piece would be quite hard, admittedly.

Not when the metadata is poisoned by a trojaned merge. Git logs can be
edited. Without GPG signatures, it's like a web mirror that has a pack
of RPMs with a pack of checksums alongside them. The owner of the
mirror, or a cracker attacking the host, can corrupt *both*, and
without the GPG signed tag, it's hard to establish provenance.

And *that* is one of the points where having a GPG signed tag,
especially one tied to the contents of the SRPM builds, becomes a
useful tool for verifying provenance of the tree. You can't rely on a
binary comparison: there's likely to be frequent skew between the
rsync mirrors and the main repo as a matter of course.
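
To illustrate the kind of check a signed tag would allow on a local
clone, here is a minimal sketch wrapping "git verify-tag". The
repository path and tag name are hypothetical, since no signed tags are
published today. A zero exit status only means git found a good
signature from a key in the local keyring, so the signing key still has
to be obtained and trusted out of band.

```python
import subprocess


def verify_tag_command(repo_dir: str, tag: str) -> list[str]:
    """Build the git command that checks the GPG signature on a tag."""
    return ["git", "-C", repo_dir, "verify-tag", tag]


def tag_signature_ok(repo_dir: str, tag: str) -> bool:
    """Return True if git reports a good signature for the tag."""
    result = subprocess.run(
        verify_tag_command(repo_dir, tag),
        capture_output=True,
        text=True,
        check=False,
    )
    # git verify-tag exits non-zero for unsigned tags, bad signatures,
    # or signatures made by a key that is not in the keyring.
    return result.returncode == 0


if __name__ == "__main__":
    # Hypothetical clone location and tag name, printed rather than executed.
    print(" ".join(verify_tag_command("/srv/mirror/gitrepos/k/kernel",
                                      "example-signed-tag")))
```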