Nagios

Issue stopping Production

Consider the simple diagram below, which shows a corporate network with a web server, DB servers (primary and secondary), an email server and a DHCP server
Many things can fail in this sample network. Let's assume alerts arrive saying that the DB servers and the web server are down. That sounds like a serious error
- In this case the admin has to rush into the lab, identify the failure and fix it
- This is a time-consuming activity
- The root cause could be that a switch on the production network is down, so the admins receive alerts saying the web server and DB servers are down even though the servers themselves may be fine
Admins need a monitoring system that helps them identify the actual failure easily
Nagios can help diagnose these kinds of issues very easily, so let's get started on this journey of system monitoring using Nagios.

Nagios is an open source tool for system monitoring.
It watches servers, services and other devices on your network and informs you if they are not working as expected
Monitoring in Nagios is split into two main categories
- hosts: Physical or virtual devices on the network (servers, routers, switches, printers etc)
- services: Particular functionality running on a host (SSH, email services, web servers and databases)
Hosts in Nagios can be grouped into host groups for convenience
Considering the above image, we can organize the hosts, host groups and services as follows
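
One possible way to picture this organization is sketched below in Python. It is only an illustration of the idea, not Nagios configuration syntax, and every host name, group name and service name in it is hypothetical, chosen to match the sample network (web server, primary and secondary DB servers, email and DHCP servers).

```python
# Hypothetical layout of the sample network; all names are illustrative only.
# Each host maps to the services we might monitor on it.
hosts = {
    "web01": ["HTTP", "SSH"],
    "db-primary": ["Database", "SSH"],
    "db-secondary": ["Database", "SSH"],
    "mail01": ["Email", "SSH"],
    "dhcp01": ["DHCP"],
}

# A host may appear in more than one host group.
host_groups = {
    "web-servers": ["web01"],
    "db-servers": ["db-primary", "db-secondary"],
    "infrastructure": ["mail01", "dhcp01"],
    "production": ["web01", "db-primary", "db-secondary"],
}


def services_in_group(group):
    """Return (host, service) pairs for every host in a host group."""
    return [(host, service)
            for host in host_groups[group]
            for service in hosts[host]]


def groups_of(host):
    """Return the sorted names of all host groups that contain a host."""
    return sorted(name for name, members in host_groups.items()
                  if host in members)
```

For example, `groups_of("web01")` lists both `production` and `web-servers`, which shows one host belonging to several groups.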

Nagios reports each check result as one of 4 states
- OK
- Warning
- Critical
- Unknown
These work much like simple traffic signals that describe the health of a host or service. This is much simpler than reading graphs or analysing trends

Nagios performs all of its checks using plugins. Nagios tells each plugin what should be checked and what the warning and critical limits are.
Nagios comes with a standard set of plugins that let you check most of the services commonly used in enterprises.
Nagios also provides an easy way to write our own plugins
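
To make the plugin idea more concrete, here is a minimal sketch of what a custom check could look like in Python. It relies on the standard plugin convention that the exit code carries the state (0 for OK, 1 for Warning, 2 for Critical, 3 for Unknown) and that the plugin prints a single status line. The function names, the default path and the default limits are assumptions made for this example.

```python
import shutil

# Exit codes that Nagios plugins use to report a state
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(used_percent, warning, critical):
    """Map a measured value to a state using the warning and critical limits."""
    if used_percent >= critical:
        return CRITICAL
    if used_percent >= warning:
        return WARNING
    return OK


def check_disk(path="/", warning=80.0, critical=90.0):
    """Return (exit_code, status_line) for disk usage on `path`.

    The path and limits are hypothetical defaults for this sketch.
    """
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # Anything that prevents the measurement is reported as UNKNOWN
        return UNKNOWN, f"DISK UNKNOWN - cannot read {path}: {exc}"
    used_percent = 100.0 * usage.used / usage.total
    state = evaluate(used_percent, warning, critical)
    return state, f"DISK {LABELS[state]} - {used_percent:.1f}% used on {path}"
```

To use something like this as a real plugin, the script would print the status line and then call `sys.exit()` with the returned code, so that Nagios can read the state from the exit status.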

Commands:
- These define how Nagios should perform checks
- Act as an abstraction over the actual plugins, which allow you to perform checks
Time Periods:
- Date and time periods during which the operations should or should not be performed
- Eg: Monday-Friday, 10:00 AM – 06:00 PM
Hosts and Host groups:
- Already defined above; an individual device or machine (physical or virtual) is generally a host
- Hosts are grouped into host groups
- One host might be part of multiple groups
Services:
- Functionality to monitor
- Eg: CPU Utilization, Storage Space or Web Server
Contacts and contact groups:
- People who should be notified, along with the information about how to contact them
- Just like hosts are grouped into host groups, contacts are grouped into contact groups
Notifications:
- These define who should be notified of what
- Eg: Report all server failures to the admins during working hours, and notify the lead admins outside working hours
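
The way time periods, contact groups and notifications fit together can be sketched roughly as below. This is plain Python written for illustration, not how Nagios implements it. The working-hours period follows the example above (Monday-Friday, 10:00 AM to 06:00 PM), while the contact names and group names are hypothetical.

```python
from datetime import datetime, time

# Time period from the example above: Monday-Friday, 10:00 AM to 06:00 PM
WORK_DAYS = range(0, 5)  # datetime.weekday(): Monday=0 ... Friday=4
WORK_START, WORK_END = time(10, 0), time(18, 0)

# Hypothetical contact groups and members
CONTACT_GROUPS = {
    "admins": ["alice", "bob"],
    "lead-admins": ["carol"],
}


def in_working_hours(moment):
    """Return True if `moment` falls inside the working-hours time period."""
    return (moment.weekday() in WORK_DAYS
            and WORK_START <= moment.time() < WORK_END)


def contacts_to_notify(moment):
    """Pick the contact group for a failure that is reported at `moment`."""
    group = "admins" if in_working_hours(moment) else "lead-admins"
    return CONTACT_GROUPS[group]
```

With these assumptions, a failure on a Monday at 11:00 goes to the `admins` group, while the same failure at 22:00 or on a Saturday goes to `lead-admins`.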

Some failures are temporary and correct themselves. For example, restarting the web server makes some pages unavailable for a few seconds, after which users no longer see pages failing to load.
To make it easier to tell whether a problem is temporary or permanent, Nagios uses soft states
A soft state is generally a temporary state, and a hard state is a permanent one
Let's assume we are monitoring a web server and its current state is OK (up and running)
Now let's assume an admin has restarted the server, so the next check fails and the current state becomes a soft problem state
Nagios is configured with a number of check attempts to perform before declaring a hard state. Let's assume this number is 3 and the check is repeated every 5 seconds. If the web server comes back up within this time, the status returns to OK
If the web server still fails to come up after the 3 configured attempts, Nagios declares a hard problem state
This concept protects admins from unnecessary alerts and noise
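
The soft and hard state behaviour described above can be simulated with a few lines of Python. This is a simplified sketch rather than the actual Nagios logic: the function name and the tuple layout are made up for this example, and every OK result is simply labelled "OK" instead of tracking recoveries separately. The limit of 3 attempts matches the example above.

```python
def track_states(results, max_attempts=3):
    """Classify a sequence of check results as soft or hard states.

    Returns a list of (result, state_type, notify) tuples, where notify is
    True only on the check that first turns a problem into a hard state.
    """
    timeline = []
    failures = 0  # consecutive non-OK results seen so far
    for result in results:
        if result == "OK":
            failures = 0
            timeline.append((result, "OK", False))
            continue
        failures += 1
        if failures < max_attempts:
            # Still retrying: the problem may be temporary, so stay quiet
            timeline.append((result, "SOFT", False))
        else:
            # Attempts exhausted: hard state, alert once when it is reached
            timeline.append((result, "HARD", failures == max_attempts))
    return timeline


# A quick restart: two failed checks, then the web server is back
restart = track_states(["OK", "CRITICAL", "CRITICAL", "OK"])

# A real outage: the web server never comes back
outage = track_states(["OK", "CRITICAL", "CRITICAL", "CRITICAL", "CRITICAL"])
```

In the `restart` case no entry asks for a notification, while in the `outage` case only the third failed check does, which is the noise reduction the soft state concept is meant to provide.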
In the next post of this series we will install Nagios