Feb 28 2008

Zenoss System Monitoring

Tag: Open Source | Symon Rottem @ 2:10 pm

At the moment I’m working on a system of about 45 machines that work together to provide a range of services. All of them need to be monitored, both for availability and to confirm that everything is running within measured thresholds. It’s important that if any of these machines or services fails, someone is notified, and that if the issue isn’t resolved in a reasonable time frame, some kind of escalation process ensures others are notified until something is done.
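To make the escalation requirement concrete, here is a minimal sketch in Python of the behaviour I’m after. It is only an illustration of the idea, not how any particular monitoring product implements it; the tier delays and contact names are made up.

```python
from datetime import timedelta

# Hypothetical escalation tiers: (how long the issue has been open, who to notify).
# Both the delays and the contacts are assumptions for illustration only.
ESCALATION_TIERS = [
    (timedelta(minutes=0), "on-call admin"),
    (timedelta(minutes=30), "team lead"),
    (timedelta(hours=2), "operations manager"),
]


def contacts_to_notify(unresolved_for):
    """Return everyone who should have been notified after this much time."""
    return [who for delay, who in ESCALATION_TIERS if unresolved_for >= delay]


# An outage that has been open for 45 minutes has reached the second tier.
print(contacts_to_notify(timedelta(minutes=45)))
```

The longer an issue stays unresolved, the further down the list the notifications reach, which is exactly the "until something is done" behaviour described above.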

Having no available budget when I began investigating options, I started looking at FR/OSS solutions. After a couple of false starts with Nagios and Hyperic HQ I took a look at Zenoss Core and had a nice surprise. (Hyperic HQ only missed the cut because its free version lacked one particular feature I needed: the ability to schedule recurring maintenance periods. That’s a pity, because otherwise it looks like an excellent FR/OSS product.)

Zenoss does a spectacular job of simplifying the process of adding new machines to be monitored. Once the hosts have been SNMP enabled (and WMI enabled, for Windows boxes), you can set up a network in Zenoss and tell it to scan that network for new devices, after which it will dutifully add every host it detects with SNMP.

What’s really nice is that you can simply switch a device from the basic detected profile type to a more specific one (e.g. Windows host, Linux host, router or switch) and Zenoss will investigate the device further based on its type. For example, if you designate the device as a Windows host it will query for other Windows-related information, including the software and services installed on the box. Similarly, if you select Linux or Solaris it will perform other OS-specific checks.

Also interesting is that it establishes an inventory of the software installed on all the devices so you can determine which machines are running which software. By default each host is re-profiled every 6 hours and if any changes are detected the database is updated and you can be notified of the changes.
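The change detection boils down to comparing the latest profile of a host against the previous one. Here is a rough Python sketch of that idea; it is not Zenoss code, and the package names and versions are invented for the example.

```python
def diff_inventory(previous, current):
    """Compare two sets of installed software names collected for one host."""
    return {
        "added": sorted(current - previous),
        "removed": sorted(previous - current),
    }


# Hypothetical snapshots of the same host taken six hours apart.
before = {"openssh 4.7", "apache 2.2.6", "mysql 5.0.45"}
after = {"openssh 4.7", "apache 2.2.8", "mysql 5.0.45"}

changes = diff_inventory(before, after)
if changes["added"] or changes["removed"]:
    # In a real system this is where the database update and notification would happen.
    print(changes)
```

An upgrade shows up as one removal plus one addition, which is usually enough to tell at a glance what changed on a box.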

Once the hosts have been added, Zenoss dutifully harvests issues from system logs, checks for availability of designated processes or services, and tracks values like available memory and processor usage over time (and yes, even custom data can be collected). Hell, there are even pretty graphs for you to look at. Once the data is coming in, you can configure notifications that are triggered by outages or by data exceeding thresholds. They can be sent by a variety of methods: email is one, and is certainly the easiest to get running, but there are SMS/paging options amongst others.
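For anyone unfamiliar with threshold-based alerting, the basic idea looks something like the Python sketch below. It is a generic illustration under my own assumptions (a 90% CPU limit and three consecutive samples), not the way Zenoss evaluates thresholds internally.

```python
def breached(samples, maximum, min_consecutive=3):
    """Return True if the last `min_consecutive` samples all exceed `maximum`."""
    recent = samples[-min_consecutive:]
    return len(recent) == min_consecutive and all(value > maximum for value in recent)


# Hypothetical CPU usage readings (percent), oldest first.
cpu_percent = [42.0, 55.5, 91.2, 93.8, 97.1]

if breached(cpu_percent, maximum=90.0):
    # A real system would send an email or page here instead of printing.
    print("notify: CPU above 90% for 3 consecutive samples")
```

Requiring several consecutive samples over the limit is one common way to avoid being paged for a single momentary spike.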

Zenoss isn’t perfect, however. It can perform very slowly at times, apparently because of the way it caches data: it looks like it uses all available system memory, to the point where it eats into the swap space as well. Navigation can also be a pain, as you have to move through multiple menus to get to sub-groupings of machines unless you’re prepared to type the URL of the group by hand.

Regardless, overall I’ve been pretty impressed with what Zenoss can do. There are some features missing from Zenoss Core, but their enterprise version seems to address most of these, and since it’s OSS there’s nothing you can’t choose to do yourself…if you can find the time.