On Fri, Dec 16, 2011 at 12:02 PM, Alan McKay <alan.mckay at gmail.com> wrote:
>> > Thoughts from anyone on any of this?
>>
>> Network monitoring is not trivial no matter what tool you use.  Pick
>> something that you trust to scale to the proportions you will need so
>> you don't do a lot of work and then hit a wall.  And if you have a
>> lot of systems, avoid anything that needs per-system configuration or
>> agent installation.
>>
>
> Agreed.  I'm definitely not looking for trivial - just trying to make sure
> I understand the strengths and weaknesses of each system to help me make
> the right decision.  Because once I've made that decision, I have to live
> with it :-)  Our environment is relatively small.  About 80 servers that
> are mostly grouped into 3 compute clusters for the scientists I support.  A
> few switches, and no routers under my direct control (though a few Linux
> boxes routing between NICs since some of the environment is on our own
> private LAN behind said Linux box, cut off from the Hospital's network)

You may not need 'direct' control of the routers - read-only SNMP
access is enough to monitor them.  And if the switches speak SNMP you
can get per-interface traffic counters, which will match whatever is
on the other end of the wire.  Does the cluster software have its own
close-coupled monitor like Ganglia?

One thing I haven't found in any of the frameworks I've seen, and that
everybody is likely to need, is a good concept of aggregates.  That
is, you will have some level of redundancy in fail-over sets and some
level of group capacity in load-balanced sets.  While you may want to
be alerted about individual failures, what you really need to track is
how close you are to capacity across the working group members - and
nothing does that very well.

-- 
   Les Mikesell
     lesmikesell at gmail.com
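
To make the "aggregates" idea above concrete, here is a rough sketch of
what tracking capacity across the working members of a group might look
like.  It is not taken from any monitoring framework: the Member class,
the function names, and the capacity/load figures are all hypothetical,
and it assumes each member's capacity and current load are already
being collected in the same units.

```python
from dataclasses import dataclass


@dataclass
class Member:
    """One host in a fail-over or load-balanced group (hypothetical model)."""
    name: str
    up: bool
    capacity: float  # assumed maximum load, e.g. requests per second
    load: float      # current load, same units as capacity


def group_utilization(members):
    """Fraction of the surviving capacity currently in use."""
    working = [m for m in members if m.up]
    capacity = sum(m.capacity for m in working)
    if capacity == 0:
        return float("inf")  # nothing left to carry the load
    return sum(m.load for m in working) / capacity


def survives_one_failure(members):
    """True if the group could lose its largest working member
    and still carry the current total load."""
    working = [m for m in members if m.up]
    if not working:
        return False
    total_load = sum(m.load for m in working)
    remaining = sum(m.capacity for m in working) - max(m.capacity for m in working)
    return total_load <= remaining


# Made-up example: three equal members, one already down.
cluster = [
    Member("node1", up=True, capacity=100.0, load=60.0),
    Member("node2", up=True, capacity=100.0, load=60.0),
    Member("node3", up=False, capacity=100.0, load=0.0),
]

print(group_utilization(cluster))
print(survives_one_failure(cluster))
```

In this made-up case a per-host check only reports node3 as down, while
the group-level check shows the two survivors could not absorb one more
failure.  That group-level number is the thing worth alerting on.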