On 11/04/2009 10:05 PM, nate wrote:
Our monitoring is primarily nagios+cacti which are maintained by hand currently. Myself I have literally tens of thousands of hours invested in monitoring scripts mostly integrating with RRDTool for performance and trending analysis.
You say a lot with just that statement there. It's a place we've all been, and it's the one issue that some of the tools around *today* help with.
Essentially, every admin has been down the route of setting up a bunch of machines and then working away at them, investing large portions of time in regular admin tasks: writing scripts to manage small bits of state, writing some sort of config rollout, doing some post-install tests, and so on. The list goes on and on. The important thing here is that we've *all* done that, and a *large* portion of what we were trying to do was common to most scenarios. But there was never really any traction around a single community that would encourage people to come together, talk about these things, and then move on to creating tool sets that work for people.
To me, this is a major contribution of some of these tools today. spacewalk, puppet, cfengine, chef, bcfg2, slack: all are becoming focal groups, even if they only address specific use-cases or only suit certain mindsets / thought processes. The main thing is that people are talking, and what's coming out of those talks is more capable and better written tools, which means it may no longer be necessary to spend hours and hours working in a silo, doing the sort of work that we were doing in the past. On the flip side, you could argue that for the same level of work, under the same conditions, people today are producing a much better management system for their own use and for their users.
For example, if the monitoring tool is unable to accept tasks and report progress from a tool, which in turn can be connected up to what the machine is actually supposed to be doing, it's a monitoring tool that I don't even want to consider using. I'd rather have something which lets me write a snippet like:
-------------
Machine of type webserver needs:
 - packages httpd, mod_ssl
 - monitoring for port :80 and :443
   + if not working, run scriptX, if still not working, notify remote monitoring, and remove from production pool
 - dir /var/www/html should exist and if file /var/www/app/.TAG does not exist : notify {deploymentmachine} that {thismachine} needs app rollout
 - if all is good, run pre-production tests, if all pass, get us / keep us in the production pool

Make machine1,machine2,machine3 a webserver
------------
The advantage of this is that the various bits of descriptive code can be reused in different combinations and scenarios. Compare that to having to go around to each machine and do things on each box, manually, every time.
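
To make the idea a bit more concrete, here is a rough Python sketch of the "describe the role once, apply it to many machines" approach. It is not how any of the tools above actually work; the `Role` class, the `plan` function and the made-up `observed` state are all hypothetical names invented for illustration, and the "actions" are just strings rather than real package installs or monitoring calls.

```python
from dataclasses import dataclass, field


@dataclass
class Role:
    """Declarative description of what a machine type needs (hypothetical model)."""
    name: str
    packages: list = field(default_factory=list)
    monitored_ports: list = field(default_factory=list)
    required_dirs: list = field(default_factory=list)
    tag_file: str = ""


def plan(role, machine, state):
    """Compare the desired role with the observed state and return actions.

    `state` is a plain dict standing in for facts gathered from the machine.
    """
    actions = []
    for pkg in role.packages:
        if pkg not in state.get("installed", []):
            actions.append(f"{machine}: install {pkg}")
    for path in role.required_dirs:
        if path not in state.get("dirs", []):
            actions.append(f"{machine}: create dir {path}")
    if role.tag_file and role.tag_file not in state.get("files", []):
        actions.append(f"{machine}: request app rollout")
    for port in role.monitored_ports:
        if port not in state.get("open_ports", []):
            actions.append(f"{machine}: run scriptX for port {port}")
    if not actions:
        # Nothing to fix, so the machine is a candidate for the production pool.
        actions.append(f"{machine}: run pre-production tests")
    return actions


# The role is written once...
webserver = Role(
    name="webserver",
    packages=["httpd", "mod_ssl"],
    monitored_ports=[80, 443],
    required_dirs=["/var/www/html"],
    tag_file="/var/www/app/.TAG",
)

# ...and applied to any number of machines. This state is made up.
observed = {
    "machine1": {"installed": ["httpd", "mod_ssl"], "dirs": ["/var/www/html"],
                 "files": ["/var/www/app/.TAG"], "open_ports": [80, 443]},
    "machine2": {"installed": ["httpd"], "dirs": [], "files": [],
                 "open_ports": [80]},
}

for machine, state in observed.items():
    for action in plan(webserver, machine, state):
        print(action)
```

The point of the sketch is only that the description (`webserver`) is separate from the machines it is applied to, so the same description could drive config rollout, monitoring and pool membership without anyone logging in to each box.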