Karanbir Singh wrote:
To me, this is a major contribution by some of these tools today - spacewalk, puppet, cfengine, chef, bcfg2, slack : all becoming focal groups - even if they only address specific use-cases or only address certain mindsets / thought processes.
Another aspect of the management tools I forgot to mention is their own complexity. Most of the enterprise-grade stuff is so complicated you need dedicated, trained people to work on it as their sole task, whether it's something like BMC or HP OpenView/SiteScope type stuff.
I can only speak for what I've used, but at my current company I was hired after the previous guy left. One of the reasons my company wanted me was my knowledge of cfengine, which they had deployed but which really nobody knew how to use.
It wasn't easy going into a new environment with a five-nines SLA and digging into their systems when pretty much the only documentation I had was a poorly set up cfengine implementation. I mean, their install process involved *manually* defining classes on the command line and running the agent to pick up the associated configurations for those classes, whereas in my setup the classes are always defined and no manual intervention is needed. The internal infrastructure was a total mess, and some of it still is. The older DNS systems run on Windows and weren't set up properly; the master DNS failed a couple of days ago, causing havoc on the back end. I worked around it for a little while since they couldn't get the system back up, then yesterday the main zone for the back end expired, wreaking havoc again, and we did an emergency migration to Linux at that point. The guys doing the Windows stuff don't know what they are doing; it's all legacy and hasn't been touched in years.
So I spent a few months slowly re-writing the entire system, along with re-doing the entire build system, which was just as bad, but fortunately kickstart is a lot easier to work with.
Even today, more than a year later, there are probably 45-50 systems out there running on the "old" stuff. There is no good/safe/clean way to migrate to the new w/o re-installing, so I've been doing that as opportunities present themselves. Probably 40 of those systems will be replaced in the next 3-5 months, so that will allow me to almost complete the project.
Maybe when I leave my current company the next person will come in and re-write everything again, or switch to puppet or something..
Funny, I just realized that when I left my previous company I left them with another guy who had been there the whole time and knew cfengine as well. He tried to train another guy there before he left, but all of them have since left. They hired some other guy, but I'm not sure if he's still there or if he ever learned how things worked; the company has mostly collapsed under multiple failed business models.
I feel for the folks who are so overworked and stressed out that they don't have the ability to learn better ways of doing things; I've been there too. Fortunately I'm now in a position where I have the luxury of refusing such positions and tasks because they're not worth my time.
As for monitoring automation, I hear ya, that would be cool to have. Right now for us it's not much of an issue since our growth rate is pretty small (roughly 400 systems). We have a tier 1 team that handles most of the monitoring setup, so I just give them a ticket and tell them to do it, and they do.
I'll be deploying another new edge data center location in a couple of weeks: 14 servers, 10 bare metal and 4 running ESXi, about 50 instances of CentOS total. With kickstarting over the WAN I can usually get most everything up in about a day, which sadly is faster than the network guy takes to set up his two switches, two load balancers, and two firewalls. Despite the small server count, the hardware is benchmarked to run our app at about 40,000 transactions a second as a whole, which is the fastest app by several orders of magnitude that I've ever worked on anyways. With such a small server count, we have more capacity issues on the load balancers than we have with the servers.
It certainly was an interesting experience the first time deploying a data center remotely. We just had 1 really basic server config used to seed the rest of the network, with everything done via remote management cards in the servers and remote installations over the WAN, with the exception of the network stuff, which the network guy was on site for a couple of days to configure.
I have 10x the equipment to manage and can still get things done faster than him. I could manage all of the network stuff too without much effort.
If I spent some time I'm sure I could automate some Nagios integration, but forget about Cacti, lost cause. Maybe OpenNMS at some point, who knows.. At this time automating monitoring integration isn't a pressing issue. I spend more time writing custom scripts to query things in the most scalable way I can than we do adding new things to the monitoring stuff.
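To give a rough idea of what that kind of Nagios integration could look like, here is a minimal sketch in Python that turns a host inventory into Nagios host object definitions. Everything specific in it is an assumption for illustration: the inventory dict, the host names and addresses, the function names, and the "generic-host" template (which needs to exist in whatever config you actually run). It is not what we run here, just the general shape.

```python
# Sketch: turn a host inventory into Nagios host object definitions.
# The inventory, host names, addresses, and template name are made-up
# placeholders, not taken from any real environment.

HOST_TEMPLATE = "generic-host"  # assumed template name; use whatever your config defines


def render_host(name, address, template=HOST_TEMPLATE):
    """Return one Nagios 'define host' block as a string."""
    lines = [
        "define host {",
        f"    use        {template}",
        f"    host_name  {name}",
        f"    address    {address}",
        "}",
    ]
    return "\n".join(lines)


def render_hosts(inventory):
    """Render every name/address pair, sorted by name so diffs stay stable."""
    blocks = [render_host(name, addr) for name, addr in sorted(inventory.items())]
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    # Hypothetical inventory; in practice this could come from the same
    # data that drives the kickstart or cfengine setup.
    inventory = {"web02": "192.0.2.12", "web01": "192.0.2.11"}
    print(render_hosts(inventory), end="")
```

The output would be written to a file under the Nagios config directory and picked up on the next reload. Sorting by host name is only there so that regenerating the file doesn't produce noisy diffs.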
Myself, I am not holding my breath on any movement or product to come around and make managing systems, especially cross-platform, simpler and cheaper. The task is just too complex. The biggest companies such as Amazon, Google, MS etc. have all realized there is little point in even trying such a thing.
It would only really benefit small companies that are at a growth point where they don't have enough business to hire people to standardize on something (or have teams for each thing). And those companies can't afford the costs involved with some big new fancy tool to make their lives easier. The big guys don't care since they have the teams and stuff to handle it.
Though that won't stop companies from trying. The latest push of course is to the magical cloud, where you only care about your apps and no longer care about the infrastructure.
Maybe someday that will work. I talked to a small company recently that is investing nearly $1M to insource their application from the cloud to their own gear (they have never hosted it themselves) because of issues with the cloud that they couldn't work around.
Good talks though, the most interesting thread I've seen here in I don't know how long. Even though it was soooooooo off topic (from what the list is about anyways) :)
nate