hi guys,
With an increase in the number of slaves, we've noticed that the RDO jobs are deploying machines at a much higher velocity than before - as a result, the ready pool is consistently hitting the low water mark.
Rather than set an overall quota limit, I'm looking at limiting the number of Duffy deploys per 10 min cycle. Rather than propose something myself, I'd like to hear what folks think is a reasonable number to start from.
regards,
On 08/08/16 15:30, Karanbir Singh wrote:
<snip>
So do you mean limiting the number of "given" nodes per API key at the Duffy level? I guess that would be the only real way to control how many nodes are handed out, as limiting the number of slaves and/or executors per slave wouldn't really reflect that (one job can by itself consume 20 nodes if it asks for them, while multiple jobs/executors in parallel can, on the other hand, each ask for only one node per job ...)
On 08/08/16 14:36, Fabian Arrotin wrote:
<snip>
So do you mean limiting the number of "given" nodes per API key at the Duffy level?
yes
Using the new slaves has a couple of objectives:
1) Test the new OpenStack cloud
2) Increase redundancy (given the instability of our existing slave the past few weeks)
3) Increase concurrency/capacity
We had 16 threads on a single slave before (down from 24) and that single slave was struggling to cope when all 16 threads were actually busy. The new slaves have 8 threads each and we lowered the number of threads on the original slave back to 10 so it isn't loaded (and isn't crashing) as much.
So we're now at 34 threads total, and I can indeed tell from our consumption logging that usage has increased and peaks higher than before. We'll scale the threads back down to 24 total - can you tell us if you see improvements?
We're also waiting for the feature in Duffy that'll enable us to track which node is associated with which job so we can hunt jobs that are potentially not being very good citizens.
David Moreau Simard Senior Software Engineer | Openstack RDO
dmsimard = [irc, github, twitter]
hi David,
So is 2 machines every 10 min a good rate to start from?
regards,
In our case, 2 might be a bit limiting and would push us towards using virtualization on the boxes instead of treating them as bare metal deployments. A "basic" Foreman/Katello topology would be a server and a capsule with 1 or more clients (with 1 being a generally OK test case). The difference is that it currently takes ~1-1.5 hours to do a full successful deployment, installation and test run. I don't mind switching to virtualization, as this is how we test locally and do development; however, it does change the workflow and scripts, which would need to be accounted for.
Do you have an idea of the average time to live for provisioned boxes? And how many at any given time tend to be provisioned from the pool?
On 08/08/16 17:52, Eric D Helms wrote:
<snip>
This is great feedback - we don't want you to switch to virt at all. So in your above model, 3 machines in one deployment / run is what you are looking at?
I was talking to David on IRC a few minutes back, and he too thinks that 2 per 10 min cycle might be too low to start from.
Maybe we can start with 4 per 10 min - that allows you to get a new job off every 10 min (you might still hit the total deployment quota, which we usually set to 10 machines per project and tweak up as needed).
The key thing is that I don't want to get in the way; we encourage folks to run and test as they would in user scenarios. So I just need a conservative number to start from, then tweak it as needed while protecting jobs against a runaway script or a mass queue run - i.e. let Jenkins handle the backlog queue, rather than have jobs fail due to the machine pool being depleted.
Do you have an idea of the average time to live for provisioned boxes? And how many at any given time tend to be provisioned from the pool?
At the moment, we recommend folks plan on clearing out / tearing down within 6 hrs of deployment (although we don't reap the machines until well past 24 hrs; we might need to reduce that once we start hitting load / capacity).
Regards,
I'm having a hard time understanding what "X machines every 10 minutes" means.
Our jobs are long-lived and tend to be launched simultaneously in a pipeline. For example, we have this pipeline where 8 jobs will launch at once but then 6 of those will complete in ~45 minutes and then the other two take around ~90 minutes.
Is the quota represented as a number of nodes per 10 minutes or an absolute cap on the concurrent nodes? I think a cap on the concurrent nodes would make more sense.
i.e., a tenant would not be able to request more than 30 nodes. If a tenant already has 30 active nodes and requests another, the request is refused with an error similar to the one we bump into when Duffy is out of inventory.
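To illustrate what I mean (purely a sketch, not actual Duffy code - the cap value, the counter and the exception name are all made up), the check on the Duffy side could look something like:

    # Illustrative only: ACTIVE_NODE_CAP and the bookkeeping here are
    # assumptions, not existing Duffy internals.
    ACTIVE_NODE_CAP = 30   # per API key, tweaked per project

    class QuotaExceeded(Exception):
        """Refused in the same spirit as the 'out of inventory' error."""

    def request_nodes(api_key, count, active_nodes_by_key):
        active = active_nodes_by_key.get(api_key, 0)
        if active + count > ACTIVE_NODE_CAP:
            raise QuotaExceeded(
                "%s has %d active nodes; %d more would exceed the cap of %d"
                % (api_key, active, count, ACTIVE_NODE_CAP))
        active_nodes_by_key[api_key] = active + count
        return active + count   # Duffy itself would hand back node records here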
On 08/08/16 17:59, David Moreau Simard wrote:
I'm having a hard time understanding what "X machines every 10 minutes" means.
Our jobs are long-lived and tend to be launched simultaneously in a pipeline. For example, we have this pipeline where 8 jobs will launch at once but then 6 of those will complete in ~45 minutes and then the other two take around ~90 minutes.
Is the quota represented as a number of nodes per 10 minutes or an absolute cap on the concurrent nodes?
That would be the number of machines that can be allocated per 10 minutes. They can then run through to the reap limits (ideally keep it under 6 hrs); if there are jobs that need more than 1 machine, we'd need to tweak it further.
I think a cap on the concurrent nodes would make more sense ?
We do have that, but tweak it up as needed - e.g. RDO is limited to 100 physical nodes at any one point; most projects start at 10 and we tweak up as needed (again, we're not trying to get in the way, just protecting machine stock against runaway scripts etc.).
i.e., a tenant would not be able to request more than 30 nodes. If a tenant already has 30 active nodes and requests another, the request is refused with an error similar to the one we bump into when Duffy is out of inventory.
The rate at which we allocate machines is the rate at which we need to install and provision machines from the unused machine stock - and the bootup + anaconda run + reboot and contextualisation take time, so we need to tweak the flow in order to best optimise the allocation rate.
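To make the "per 10 min cycle" idea concrete, here is a rough, hypothetical sketch of the kind of fixed-window check I have in mind - the names and numbers are illustrative, not what will actually land in Duffy:

    import time

    CYCLE_SECONDS = 600        # the 10 min cycle
    NODES_PER_CYCLE = 4        # whatever starting rate we agree on

    class AllocationRateLimiter(object):
        """Caps how many nodes a key can be *allocated* per cycle;
        it says nothing about nodes that are already deployed."""

        def __init__(self, per_cycle=NODES_PER_CYCLE, cycle=CYCLE_SECONDS):
            self.per_cycle = per_cycle
            self.cycle = cycle
            self.windows = {}   # api_key -> (window_start, allocated_in_window)

        def try_allocate(self, api_key, count=1, now=None):
            now = time.time() if now is None else now
            start, used = self.windows.get(api_key, (now, 0))
            if now - start >= self.cycle:   # a new 10 min window has started
                start, used = now, 0
            if used + count > self.per_cycle:
                return False                # caller should queue / retry later
            self.windows[api_key] = (start, used + count)
            return True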
Did we land on an acceptable rate for this? We've exhausted the ready pool completely a few times today.
One thing that might help is to work on the backoff in cicoclient. @David do your jobs sit in a holding pattern if the API returns 'out of nodes'? If so, we should peg the scheduled retry interval to our estimate of how long it takes the workers to fill out some machines.
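Roughly what I'm picturing on the client side (a sketch only - request_nodes and OutOfNodes here are stand-ins, not the real cicoclient API):

    import random
    import time

    class OutOfNodes(Exception):
        pass

    def get_nodes_with_backoff(request_nodes, count, refill_estimate=600,
                               max_attempts=12):
        # Hold in a pattern while the pool is empty, retrying roughly once per
        # estimated refill interval, with jitter so waiting jobs don't stampede.
        for attempt in range(1, max_attempts + 1):
            try:
                return request_nodes(count)
            except OutOfNodes:
                if attempt == max_attempts:
                    raise
                time.sleep(refill_estimate * (0.5 + random.random()))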
--Brian
On 12/08/16 21:34, Brian Stinson wrote: <snip>
Another (possible) option is also to either:
- increase the number of workers on the CI infra side
- have the worker jobs deploy multiple nodes in parallel instead of just one (the Ansible job can already deploy multiple nodes in parallel, but Duffy limits the call to only one specific node per job) - a rough sketch of that is below
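Something along these lines for the second option (just a sketch; provision_node() stands in for whatever the Ansible-driven worker does per node today):

    from concurrent.futures import ThreadPoolExecutor, as_completed

    def provision_batch(node_names, provision_node, max_parallel=4):
        # Provision several nodes from one worker run instead of one per job.
        results = {}
        with ThreadPoolExecutor(max_workers=max_parallel) as pool:
            futures = {pool.submit(provision_node, name): name
                       for name in node_names}
            for future in as_completed(futures):
                name = futures[future]
                results[name] = future.result()   # re-raises on failure
        return results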
On Aug 14 09:07, Fabian Arrotin wrote:
Another (possible) option is also to either:
- increase the number of workers on the CI infra side
Did this temporarily yesterday.
- have the worker jobs deploy multiple nodes in parallel instead of just one (the Ansible job can already deploy multiple nodes in parallel, but Duffy limits the call to only one specific node per job)
This is in the works, but it will take some time to work it into Duffy.
--Brian
On 14/08/16 17:17, Brian Stinson wrote:
On Aug 14 09:07, Fabian Arrotin wrote:
Another (possible) option is also to either:
- increase the number of workers on the CI infra side
Did this temporarily yesterday.
We should really not need to do this - it just increases the number of wasted nodes and the quantity of hardware that won't get used. In an ideal world we want to be at a point where hardware is deployed just in time to get consumed, i.e. near zero ready nodes in the pool. Increasing the pool size just masks the real problem. So it's OK to do on a temporary basis for a day, or to work around a genuine spike, but we should not let this go past 20 as a regular thing.
- have the worker jobs deploy multiple nodes in parallel instead of just one (the Ansible job can already deploy multiple nodes in parallel, but Duffy limits the call to only one specific node per job)
This is in the works, but it will take some time to work it into Duffy.
The limits we have are more or less enforced from the hardware side, are they not? The API call rate into the firmware starts taking quite a hit once you go past a certain (fairly low, per chassis?) number. The only way around this would be to remove the Ansible abstraction and just call the API directly from the Python side.
However, I don't think this is really a problem we have at the moment. Folks who need a very high density of instances per minute can fall back to just using the cloud infra, or use the Jenkins queue management and serialise better.
Regards,
On 12/08/16 20:34, Brian Stinson wrote:
Did we land on an acceptable rate for this? We've exhausted the ready pool completely a few times today.
I sent in a patch last Friday night that will start with 5 nodes per 10 minutes; we should test this for the next day or so and then deploy to prod.
We should then work back with the projects that see this as an issue, and try to work out what the best solution / limits are for them.
Note that this only impacts the number of new sessions that can be started and nodes that can be requested per 10 min cycle; it does not have any bearing on the number of nodes already deployed.
regards,
On Mon, Aug 8, 2016, at 11:08 AM, David Moreau Simard wrote:
We're also waiting for the feature in Duffy that'll enable us to track which node is associated with which job so we can hunt jobs that are potentially not being very good citizens.
This is one of the things https://github.com/cgwalters/centos-ci-skeleton does, but I would indeed like to push this logic down into Duffy, as discussed earlier on the list.
All we need is a way to inject a little bit of metadata (e.g. < 1k) associated with a request and be able to get it back out when doing an inventory.
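Roughly the usage I have in mind (hypothetical - the "meta" parameter and these exact endpoints/URLs are assumptions for illustration, not the current Duffy API):

    import json
    import requests

    DUFFY = "http://duffy.example.org"   # placeholder for the real endpoint

    def get_nodes_with_metadata(api_key, count, job_url):
        meta = json.dumps({"job": job_url})   # small blob, well under 1k
        resp = requests.get(DUFFY + "/Node/get",
                            params={"key": api_key, "count": count,
                                    "meta": meta})
        resp.raise_for_status()
        return resp.json()

    def inventory(api_key):
        # the same metadata would come back with each session/node here
        resp = requests.get(DUFFY + "/Inventory", params={"key": api_key})
        resp.raise_for_status()
        return resp.json()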
On 08/08/16 18:54, Colin Walters wrote:
<snip>
We should have this done in a day or so; you would be able to set some text when requesting a new session, and then edit / retrieve it via a call or just by doing an inventory for your own keys.
regards,