On Fri, Dec 16, 2011 at 12:02 PM, Alan McKay alan.mckay@gmail.com wrote:
Thoughts from anyone on any of this?
Network monitoring is not trivial no matter what tool you use. Pick something that you trust to scale to the proportions you will need so you don't do a lot of work and then hit a wall. And if you have a lot of systems, avoid anything that needs per-system configuration or agent installation.
Agreed. I'm definitely not looking for trivial - just trying to make sure I understand the strengths and weaknesses of each system to help me make the right decision. Because once I've made that decision, I have to live with it :-) Our environment is relatively small: about 80 servers, mostly grouped into 3 compute clusters for the scientists I support. A few switches, and no routers under my direct control (though a few Linux boxes route between NICs, since some of the environment is on our own private LAN behind those boxes, cut off from the Hospital's network).
You may not need 'direct' control of the routers - just read access for SNMP to monitor them. And if the switches have SNMP you can get per-interface traffic, which will obviously match whatever is on the other end of the wire. Does the cluster software have its own close-coupled monitor like Ganglia? One thing I haven't found in any of the frameworks I've seen, and that everybody is likely to need, is a good concept of aggregates. That is, you will have some level of redundancy in fail-over sets and some level of group capacity in load-balanced sets. While you may want to be alerted about individual failures, what you really need to track is how close you are to capacity across the working group members - and nothing does that very well.
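To make the aggregate idea a bit more concrete, here is a rough sketch in Python of the kind of check I mean. It is not taken from any monitoring framework; the names (Member, group_status, spare_members) and the load/capacity units are all made up for illustration, and it assumes you already collect a load figure and a rated capacity for each member of a load-balanced set.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Member:
    """One node in a load-balanced or fail-over set (hypothetical model)."""
    name: str
    up: bool
    capacity: float  # most load this member can carry, in any consistent unit
    load: float      # current load, same unit as capacity


def group_status(members: List[Member], spare_members: int = 1) -> dict:
    """Summarise a group by its working members rather than node by node."""
    working = [m for m in members if m.up]
    total_capacity = sum(m.capacity for m in working)
    total_load = sum(m.load for m in working)
    utilization = total_load / total_capacity if total_capacity else float("inf")

    # Worst case: the `spare_members` largest working members fail next.
    keep = max(len(working) - spare_members, 0)
    survivors = sorted(working, key=lambda m: m.capacity)[:keep]
    surviving_capacity = sum(m.capacity for m in survivors)

    return {
        "working": len(working),
        "utilization": utilization,
        "survives_next_failure": 0 < surviving_capacity >= total_load,
    }


# Three equal nodes, one already down, the other two at 60% each.
cluster = [
    Member("node1", True, 100.0, 60.0),
    Member("node2", True, 100.0, 60.0),
    Member("node3", False, 100.0, 0.0),
]
status = group_status(cluster)
```

In that example no single node looks alarming, but the group cannot absorb one more failure, which is the condition worth alerting on. A per-host threshold would stay quiet here.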