On 08/08/16 17:59, David Moreau Simard wrote:
I'm having a hard time understanding what "X machines every 10 minutes" means.
Our jobs are long-lived and tend to be launched simultaneously in a pipeline. For example, we have a pipeline where 8 jobs launch at once; 6 of those complete in ~45 minutes and the other two take ~90 minutes.
Is the quota represented as a number of nodes per 10 minutes or an absolute cap on the concurrent nodes?
That would be the number of machines that can be allocated per 10 minutes. Once allocated, they can run until the reap limit (ideally keep it under 6 hrs). If there are jobs that need more than 1 machine, we'd need to tweak it further.
I think a cap on the concurrent nodes would make more sense?
We do have that, but tweak it up as needed, e.g. RDO is limited to 100 physical nodes at any one point; most projects start at 10 and we tweak up as needed (again, we're not trying to get in the way, just protecting machine stock against runaway scripts etc.).
i.e., a tenant would not be able to request more than 30 nodes. If the tenant requests a node and already has 30 active nodes, the request is refused with an error similar to the one we bump into when duffy is out of inventory.
The rate at which we allocate machines is the rate at which we need to install and provision machines from the unused machine stock. The bootup + anaconda run + reboot and contextualisation takes time, so we need to tweak the flow to best optimise the allocation rate.
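
To make the difference between the two limits concrete, here is a rough sketch of how a per-tenant check could combine them. This is not how duffy actually implements it; the class name, method names, error type and the 600 second window default are all made up for illustration. It only shows that a request can be refused either because too many machines were handed out in the last 10 minutes, or because the tenant already holds its maximum number of concurrent nodes.

```python
import time
from collections import deque


class QuotaError(Exception):
    """Raised when a request would exceed one of the tenant's limits."""


class TenantQuota:
    """Hypothetical per-tenant gate: a rate limit plus a concurrent cap."""

    def __init__(self, per_window, max_concurrent,
                 window_seconds=600, clock=time.monotonic):
        self.per_window = per_window          # machines allowed per window
        self.max_concurrent = max_concurrent  # machines held at any one point
        self.window_seconds = window_seconds  # 600 s = 10 minutes (assumed)
        self.clock = clock                    # injectable so it can be tested
        self.active = 0
        self.recent = deque()                 # allocation times in the window

    def allocate(self):
        now = self.clock()
        # Forget allocations that have aged out of the window.
        while self.recent and now - self.recent[0] >= self.window_seconds:
            self.recent.popleft()
        if self.active >= self.max_concurrent:
            raise QuotaError("concurrent node cap reached")
        if len(self.recent) >= self.per_window:
            raise QuotaError("allocation rate limit reached")
        self.recent.append(now)
        self.active += 1

    def release(self):
        # Called when a node is returned or reaped.
        if self.active > 0:
            self.active -= 1
```

With a pipeline that launches 8 jobs at once, the rate limit is the one that matters at launch time: if per_window is below 8, some of those requests are refused until the window moves on, even though the tenant is nowhere near its concurrent cap.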