My class, "Technology Manager's Survival Guide" is on the Friday afternoon training schedule at LOPSA-East on May 2 in New Brunswick, NJ.
And I'll be presenting a two-part class, "Leader's Survival Guide," in Jacksonville, FL on July 16 and 23.
Companion blog for Scott Cromar's Solaris Troubleshooting Handbook and Solaris Troubleshooting web site.
My class, "Technology Manager's Survival Guide" is on the Friday afternoon training schedule at LOPSA-East on May 2 in New Brunswick, NJ.
And I'll be presenting a two-part class, "Leader's Survival Guide," in Jacksonville, FL on July 16 and 23.
A common mistake by new monitoring administrators is to alert on everything. This is an ineffective strategy for several reasons. For starters, it may result in higher telecom charges for passing large numbers of alerts. A flood of irrelevant alerts also wears down team morale. And, no matter how dedicated your team is, you are guaranteed to reach a state where alerts start being ignored because "they're all garbage anyway."
For example, it is common for non-technical managers to want to send alerts to the systems team when system CPU hits 100%. But, from a technical perspective, this is absurd: a system running at 100% CPU may simply be making full use of the hardware. The real warning sign is saturation, such as a run queue that stays consistently longer than the number of available CPUs, not utilization by itself.
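If you do want an automated CPU check, a more meaningful signal is sustained saturation rather than raw utilization. The fragment below is only a minimal sketch under assumed conditions: it uses the 1-minute load average as a rough proxy for the run queue, an arbitrary threshold of twice the CPU count, and an example mail address.

    #!/bin/sh
    # Sketch: alert on CPU saturation (work queuing up), not on utilization.
    # The 2x multiplier and the mail address are arbitrary examples.

    NCPU=`getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1`

    # 1-minute load average, e.g. "... load average: 3.02, 2.71, 2.50"
    LOAD1=`uptime | sed 's/.*load average[s]*: //' | cut -d, -f1`

    # Flag only if the run queue is more than twice the number of CPUs
    ALERT=`echo "$LOAD1 $NCPU" | awk '{ print ($1 > 2 * $2) ? 1 : 0 }'`

    if [ "$ALERT" -eq 1 ]; then
        echo "Load $LOAD1 exceeds 2x CPU count ($NCPU) on `hostname`" |
            mailx -s "CPU saturation on `hostname`" oncall@example.com
    fi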
In order to be effective, a monitoring strategy needs to be thought through. You may end up monitoring a lot of things just to establish baselines or to track growth over time. Some of the things you monitor will need to be looked at right away; others will not. It is important to know which is which.
Historical information should be logged and retained for examination on an as-needed basis. It is wise to set up automated regular reports (distributed via email or web) to keep an eye on historical system trends, but there is no reason to send alerts on this sort of information.
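As a sketch of this kind of regular report, the fragment below assumes that sar data collection (sa1/sadc) is already running and that mailx can deliver mail off the host; the schedule, script path, and address are examples only.

    # Example crontab entry: mail a disk-activity summary at 18:00 each day.
    0 18 * * * /usr/local/bin/sar_report.sh

    # /usr/local/bin/sar_report.sh -- minimal daily trend report
    #!/bin/sh
    {
        echo "Disk activity summary for `hostname`, `date`"
        sar -d          # today's collected samples; no thresholds, no alerts
    } | mailx -s "Daily sar -d report: `hostname`" sysadmins@example.com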
Availability information should be characterized and handled in an appropriate way, probably through a tiered system of notifications. Depending on the urgency, it may show up on a monitoring console, be rolled up in a daily summary report, or be paged out to the on-call person. Some common types of information in this category include failed service and connectivity checks, faulted hardware components, and filesystems that are approaching capacity.
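One lightweight way to implement such tiers is to route every event through a single dispatch script. The sketch below is only illustrative: the tier names, log file paths, and pager address are assumptions, not part of any particular monitoring package.

    #!/bin/sh
    # notify.sh SEVERITY MESSAGE -- route an event according to its urgency.
    SEVERITY=$1
    MSG=$2

    case "$SEVERITY" in
        info)      # low urgency: roll it up into the daily summary report
                   echo "`date` $MSG" >> /var/log/monitor/daily_summary.log ;;
        warning)   # medium urgency: surface it on the monitoring console
                   echo "`date` $MSG" >> /var/log/monitor/console.log ;;
        critical)  # high urgency: page the on-call person immediately
                   echo "$MSG" | mailx -s "PAGE: $MSG" oncall-pager@example.com ;;
        *)         echo "usage: notify.sh {info|warning|critical} message" >&2
                   exit 1 ;;
    esac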
(If possible, configure escalations into your alerting system, so that you are not dependent on a single person's cell phone for the availability of your entire enterprise. A typical escalation procedure sends an unacknowledged alert up a defined chain of escalation. For example, if the on-call person does not respond within 15 minutes, the alert may go to the entire group; if it is not acknowledged 15 minutes after that, it may go to the manager.)
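A bare-bones version of that escalation chain might look like the following; the fifteen-minute waits, the acknowledgement file, and the addresses are all assumptions for illustration. A real alerting system would track acknowledgements for you.

    #!/bin/sh
    # escalate.sh ALERT_ID MESSAGE -- resend an alert up the chain until
    # someone creates the acknowledgement file for it.
    ALERT_ID=$1
    MSG=$2
    ACK=/var/tmp/ack.$ALERT_ID

    page() {
        echo "$MSG" | mailx -s "ALERT $ALERT_ID: $MSG" "$1"
    }

    page oncall@example.com            # first notification: on-call person
    sleep 900                          # wait 15 minutes for an acknowledgement
    [ -f "$ACK" ] && exit 0
    page sysadmins@example.com         # still unacknowledged: whole group
    sleep 900
    [ -f "$ACK" ] && exit 0
    page manager@example.com           # final escalation: the manager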
In some environments, alerts are handled by a round-the-clock team that is sometimes called the Network Operations Center (NOC). The NOC will coordinate response to the issue, including an evaluation of the alert and any necessary escalations.
Before an alert is configured, the monitoring group should first make sure that the alert meets three important criteria: it should reflect a real problem, it should be actionable by the person who receives it, and it should be urgent enough to justify an interruption. For example:
- By verifying that the uptime command reports a time larger than the interval between monitoring sweeps, you can keep an eye on sudden, unexpected reboots.
- avserv (in Solaris; svctm in Linux) > 20 ms for disk devices with more than 100 (r+w)/s, including NFS disk devices. This is a measure of I/O channel exhaustion. 20 ms is a very long time, so you will also want to keep an eye on trends in regular summary reports of sar -d data; a minimal check along these lines is sketched below.
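As a concrete illustration of the avserv check, here is a minimal sketch written against Solaris sar -d output. The sampling interval, field positions, and mail address are assumptions; Linux sysstat labels the column svctm and arranges its fields differently, so the awk would need adjusting there.

    #!/bin/sh
    # Flag disks doing more than 100 (r+w)/s with avserv above 20 ms.
    # Solaris "sar -d" data rows end with: ... r+w/s blks/s avwait avserv,
    # so fields are addressed relative to the end of the line.

    HOT=`sar -d 5 3 | awk '
        NF >= 7 && $NF + 0 > 20 && $(NF-3) + 0 > 100 {
            printf "%s: %s r+w/s, avserv %s ms\n", $(NF-6), $(NF-3), $NF
        }'`

    if [ -n "$HOT" ]; then
        echo "$HOT" | mailx -s "Disk I/O saturation on `hostname`" oncall@example.com
    fi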