==Service level priorities and actions==
<table class="w3-table w3-bordered w3-centered" style="width:100%" align="center">
    <tr><th>Level</th><th>Display</th><th>Alert</th><th>Notes</th><th>Example</th></tr>
    <tr><td style="background-color:#E45959">Major</td><td>Customer and Burconix Support Dashboard</td><td>Alert Customer and Burconix Support</td><td>Current major interruption to service</td><td>Core storage offline</td></tr>
    <tr><td style="background-color:#E97659">High</td><td>Customer and Burconix Support Dashboard</td><td>Alert Customer and Burconix Support</td><td>Potential for future interruption to service</td><td>Single disk/PSU failure in a redundant configuration</td></tr>
    <tr><td style="background-color:#FFA059">Concern</td><td>Customer Dashboard</td><td>Alert Customer</td><td>Potential interruption to service</td><td>HTTPS service offline</td></tr>
    <tr><td style="background-color:#FFC859">Warning</td><td>Customer Dashboard</td><td>No Alert</td><td>Warning of potential issue</td><td>Free disk space less than 10GB</td></tr>
    <tr><td style="background-color:#7499FF">Information</td><td>Customer Dashboard</td><td>No Alert</td><td>Possible future action may be required</td><td>Toner less than 10%</td></tr>
</table>
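<p>As a rough illustration, the level table above can be read as a lookup from level to the dashboards and alert recipients involved. This is only a sketch: the names <code>LevelPolicy</code>, <code>POLICIES</code> and <code>alert_recipients</code> are hypothetical and do not describe how the monitoring platform is actually configured.</p>

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LevelPolicy:
    """Who sees an event and who is alerted (hypothetical structure)."""
    dashboards: Tuple[str, ...]
    alerts: Tuple[str, ...]  # empty tuple means "No Alert"


BOTH = ("Customer", "Burconix Support")
CUSTOMER_ONLY = ("Customer",)

# Transcribed from the level table; the dictionary layout is an assumption.
POLICIES = {
    "Major": LevelPolicy(dashboards=BOTH, alerts=BOTH),
    "High": LevelPolicy(dashboards=BOTH, alerts=BOTH),
    "Concern": LevelPolicy(dashboards=CUSTOMER_ONLY, alerts=CUSTOMER_ONLY),
    "Warning": LevelPolicy(dashboards=CUSTOMER_ONLY, alerts=()),
    "Information": LevelPolicy(dashboards=CUSTOMER_ONLY, alerts=()),
}


def alert_recipients(level: str) -> Tuple[str, ...]:
    """Return the parties alerted for a level; raises KeyError if unknown."""
    return POLICIES[level].alerts
```

<p>Read this way, Warning and Information events appear on the customer dashboard but return an empty recipient list, which matches the "No Alert" entries in the table.</p>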
==Device Types==

<p>The tables below summarize the core events we monitor for each device type, together with the notification level each event raises.</p>
<div id="upspower" class="servicelayer" style="padding-top:0px;max-width:1000px;margin-right:auto;margin-left:auto">
    <p><b>UPS and Power Protection</b></p>
    <p>We monitor the UPS runtime status including load, battery and self-diagnostic status.</p>
    <table class="w3-table w3-bordered w3-centered" style="width:800px" align="center">
        <tr><th style="width:180px">Level</th><th>Trigger</th></tr>
        <tr><td style="background-color:#E97659">High</td><td>Battery needs replacing</td></tr>
        <tr><td style="background-color:#E97659">High</td><td>UPS on battery</td></tr>
        <tr><td style="background-color:#E97659">High</td><td>Runtime less than 10 mins</td></tr>
        <tr><td style="background-color:#E97659">High</td><td>Load is critical (90%)</td></tr>
        <tr><td style="background-color:#FFA059">Concern</td><td>Battery temperature is too high</td></tr>
        <tr><td style="background-color:#FFA059">Concern</td><td>No SNMP data received for 3 mins</td></tr>
        <tr><td style="background-color:#FFA059">Concern</td><td>Load is too high (80%)</td></tr>
        <tr><td style="background-color:#FFA059">Concern</td><td>UPS has been restarted</td></tr>
        <tr><td style="background-color:#FFC859">Warning</td><td>Battery power currently too low to support load</td></tr>
        <tr><td style="background-color:#FFC859">Warning</td><td>Last diagnostic test failed</td></tr>
    </table>
</div>
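<p>The load and runtime triggers in the UPS table could be evaluated along the following lines. This is a minimal sketch, not the actual trigger definition: the function name <code>classify_ups</code> is hypothetical, and it assumes the load thresholds apply at or above the stated percentage, which the table does not specify.</p>

```python
from typing import Optional

# Thresholds taken from the UPS trigger table.
CRITICAL_LOAD_PCT = 90
HIGH_LOAD_PCT = 80
LOW_RUNTIME_MINS = 10


def classify_ups(load_pct: float, runtime_mins: float) -> Optional[str]:
    """Return the most severe level raised by load or runtime, or None.

    Assumption: load thresholds are inclusive (>=); runtime is strict (<).
    """
    if load_pct >= CRITICAL_LOAD_PCT or runtime_mins < LOW_RUNTIME_MINS:
        return "High"
    if load_pct >= HIGH_LOAD_PCT:
        return "Concern"
    return None
```

<p>Checking the most severe condition first means a UPS at 92% load is reported once as High rather than as both High and Concern.</p>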
<div id="sanstorage" class="servicelayer" style="padding-top:50px;max-width:1000px;margin-right:auto;margin-left:auto">
    <p><b>SAN and Storage</b></p>
    <p>We monitor all aspects of the storage hardware and volume availability.</p>
    <table class="w3-table w3-bordered w3-centered" style="width:800px" align="center">
        <tr><th style="width:180px">Level</th><th>Trigger</th></tr>
        <tr><td style="background-color:#E45959">Major</td><td>Storage array is offline</td></tr>
        <tr><td style="background-color:#E97659">High</td><td>No SNMP/API data received for 3 mins</td></tr>
        <tr><td style="background-color:#E97659">High</td><td>Physical disk failed</td></tr>
        <tr><td style="background-color:#E97659">High</td><td>Controller health degraded</td></tr>
        <tr><td style="background-color:#E97659">High</td><td>Virtual disk health degraded</td></tr>
        <tr><td style="background-color:#E97659">High</td><td>Enclosure health degraded</td></tr>
        <tr><td style="background-color:#FFA059">Concern</td><td>Virtual disk is not fault tolerant</td></tr>
        <tr><td style="background-color:#FFA059">Concern</td><td>SAN has been restarted</td></tr>
        <tr><td style="background-color:#FFC859">Warning</td><td>Controller redundancy lost</td></tr>
        <tr><td style="background-color:#FFC859">Warning</td><td>Controller not responding to ping</td></tr>
    </table>
</div>
<div id="physerverhardware" class="servicelayer" style="padding-top:50px;max-width:1000px;margin-right:auto;margin-left:auto">
    <p><b>Physical Server Hardware</b></p>
    <p>We monitor all aspects of the server's physical hardware sensors.</p>
    <table class="w3-table w3-bordered w3-centered" style="width:800px" align="center">
        <tr><th style="width:180px">Level</th><th>Trigger</th></tr>
        <tr><td style="background-color:#E97659">High</td><td>System status is in warning or critical state</td></tr>
        <tr><td style="background-color:#E97659">High</td><td>Power supply is in warning or critical state</td></tr>
        <tr><td style="background-color:#E97659">High</td><td>Disk array controller is in warning or critical state</td></tr>
        <tr><td style="background-color:#E97659">High</td><td>Disk array cache controller battery is in warning or critical state</td></tr>
        <tr><td style="background-color:#E97659">High</td><td>Disk array cache controller is in warning or critical state</td></tr>
        <tr><td style="background-color:#E97659">High</td><td>Physical disk failed</td></tr>
        <tr><td style="background-color:#E97659">High</td><td>Virtual disk offline</td></tr>
        <tr><td style="background-color:#E97659">High</td><td>Fan is in critical state</td></tr>
        <tr><td style="background-color:#E97659">High</td><td>Ambient temperature is above critical threshold</td></tr>
        <tr><td style="background-color:#E97659">High</td><td>No SNMP data received for 3 mins</td></tr>
        <tr><td style="background-color:#FFA059">Concern</td><td>Ambient temperature is above warning threshold</td></tr>
        <tr><td style="background-color:#FFA059">Concern</td><td>Ambient temperature is too low</td></tr>
        <tr><td style="background-color:#FFC859">Warning</td><td>Disk array cache controller non-optimal</td></tr>
        <tr><td style="background-color:#FFC859">Warning</td><td>Physical disk is in warning state</td></tr>
        <tr><td style="background-color:#FFC859">Warning</td><td>Virtual disk is in warning state</td></tr>
        <tr><td style="background-color:#FFC859">Warning</td><td>Fan is in warning state</td></tr>
        <tr><td style="background-color:#FFC859">Warning</td><td>System has been restarted</td></tr>
    </table>
</div>
<div id="winlinagent" class="servicelayer" style="padding-top:50px;max-width:1000px;margin-right:auto;margin-left:auto">
    <p><b>Windows/Linux Server Agent</b></p>
    <p>We monitor availability and metrics including processor, memory, network and disk utilization, as well as the running status of core Windows services.</p>
    <table class="w3-table w3-bordered w3-centered" style="width:800px" align="center">
        <tr><th style="width:180px">Level</th><th>Trigger</th></tr>
        <tr><td style="background-color:#E97659">High</td><td>Free disk space is less than 500MB on volume</td></tr>
        <tr><td style="background-color:#FFA059">Concern</td><td>Free disk space is less than 5% and under 5GB on volume</td></tr>
        <tr><td style="background-color:#FFA059">Concern</td><td>Agent is unreachable for 10 mins</td></tr>
        <tr><td style="background-color:#FFA059">Concern</td><td>Monitored Windows service is not running</td></tr>
        <tr><td style="background-color:#FFC859">Warning</td><td>Free disk space is less than 10% and under 10GB on volume</td></tr>
        <tr><td style="background-color:#FFC859">Warning</td><td>Agent is unreachable for 3 mins</td></tr>
        <tr><td style="background-color:#FFC859">Warning</td><td>Server has been restarted</td></tr>
    </table>
</div>
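<p>The free disk space triggers combine a percentage with an absolute size, which is easy to misread. The sketch below shows one way the three disk triggers could be evaluated for a single volume. The function name <code>classify_free_space</code> is hypothetical, and the use of binary units (1 GB = 1024<sup>3</sup> bytes) is an assumption, since the table does not say which convention applies.</p>

```python
from typing import Optional

# Assumption: binary units; the source table does not state the convention.
MB = 1024 ** 2
GB = 1024 ** 3


def classify_free_space(free_bytes: int, total_bytes: int) -> Optional[str]:
    """Return the level raised for one volume, or None if no trigger applies.

    total_bytes must be greater than zero.
    """
    free_pct = 100.0 * free_bytes / total_bytes
    if free_bytes < 500 * MB:
        return "High"
    # Both the percentage and the absolute condition must hold.
    if free_pct < 5 and free_bytes < 5 * GB:
        return "Concern"
    if free_pct < 10 and free_bytes < 10 * GB:
        return "Warning"
    return None
```

<p>Because both conditions must hold, a 1000 GB volume with 50 GB free (5%) raises nothing, and a 4 GB volume with 1 GB free (25%) raises nothing either; only volumes that are low in both relative and absolute terms are flagged.</p>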
<div id="networkswitch" class="servicelayer" style="padding-top:50px;max-width:1000px;margin-right:auto;margin-left:auto">
    <p><b>Network Switch</b></p>
    <p>We monitor core switch traffic utilization on each interface and trigger a warning if a core link goes down.<br />On edge switch devices we monitor the hardware status and availability.</p>
    <table class="w3-table w3-bordered w3-centered" style="width:800px" align="center">
        <tr><th style="width:180px">Level</th><th>Trigger</th></tr>
        <tr><td style="background-color:#E97659">High</td><td>Temperature is above critical threshold</td></tr>
        <tr><td style="background-color:#E97659">High</td><td>Power supply status is in warning or critical state</td></tr>
        <tr><td style="background-color:#FFA059">Concern</td><td>Temperature is above warning threshold</td></tr>
        <tr><td style="background-color:#FFA059">Concern</td><td>Temperature is too low</td></tr>
        <tr><td style="background-color:#FFA059">Concern</td><td>Fan is in critical state</td></tr>
        <tr><td style="background-color:#FFA059">Concern</td><td>High memory utilization</td></tr>
        <tr><td style="background-color:#FFA059">Concern</td><td>No SNMP data received for 3 mins</td></tr>
        <tr><td style="background-color:#FFA059">Concern</td><td>Core switch has been restarted</td></tr>
        <tr><td style="background-color:#FFC859">Warning</td><td>Core switch link down</td></tr>
        <tr><td style="background-color:#FFC859">Warning</td><td>Fan is in warning state</td></tr>
        <tr><td style="background-color:#FFC859">Warning</td><td>Edge switch has been restarted</td></tr>
    </table>
</div>
<div id="firewall" class="servicelayer" style="padding-top:50px;max-width:1000px;margin-right:auto;margin-left:auto">
    <p><b>Firewalls and Routers</b></p>
    <p>We monitor traffic utilization, as well as service availability and interface link status.</p>
    <table class="w3-table w3-bordered w3-centered" style="width:800px" align="center">
        <tr><th style="width:180px">Level</th><th>Trigger</th></tr>
        <tr><td style="background-color:#E97659">High</td><td>Interface down</td></tr>
        <tr><td style="background-color:#FFA059">Concern</td><td>No SNMP data received for 3 mins</td></tr>
        <tr><td style="background-color:#FFC859">Warning</td><td>Device has been restarted</td></tr>
    </table>
</div>
<div id="netattcheddevice" class="servicelayer" style="padding-top:50px;max-width:1000px;margin-right:auto;margin-left:auto">
    <p><b>Network Attached Device (Ping)</b></p>
    <p>We monitor service availability and verify ping response times.</p>
    <table class="w3-table w3-bordered w3-centered" style="width:800px" align="center">
        <tr><th style="width:180px">Level</th><th>Trigger</th></tr>
        <tr><td style="background-color:#FFA059">Concern</td><td>Unavailable by ICMP ping for 3 mins</td></tr>
        <tr><td style="background-color:#FFC859">Warning</td><td>High ICMP ping loss</td></tr>
        <tr><td style="background-color:#FFC859">Warning</td><td>High ICMP ping response time</td></tr>
    </table>
</div>
<div id="webservices" class="servicelayer" style="padding-top:50px;max-width:1000px;margin-right:auto;margin-left:auto">
    <p><b>Web Services</b></p>
    <p>We monitor SSL certificates and the availability of other web services, including HTTP, HTTPS, SMTP and TCP port 443.<br />These can be monitored from within your network, or from a remote location based in Nottingham.</p>
    <table class="w3-table w3-bordered w3-centered" style="width:800px" align="center">
        <tr><th style="width:180px">Level</th><th>Trigger</th></tr>
        <tr><td style="background-color:#E97659">High</td><td>SSL certificate has expired</td></tr>
        <tr><td style="background-color:#E97659">High</td><td>SSL certificate expires in less than 7 days</td></tr>
        <tr><td style="background-color:#FFA059">Concern</td><td>Web service has been down for 3 mins</td></tr>
        <tr><td style="background-color:#FFA059">Concern</td><td>SSL certificate expires in less than 14 days</td></tr>
        <tr><td style="background-color:#FFC859">Warning</td><td>SSL certificate expires in less than 30 days</td></tr>
        <tr><td style="background-color:#7499FF">Information</td><td>SSL certificate expires in less than 60 days</td></tr>
    </table>
</div>
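<p>The SSL certificate triggers form an escalation ladder as the expiry date approaches. A minimal sketch of that ladder follows. The function name <code>classify_cert_expiry</code> is hypothetical, and it assumes the certificate's expiry time is already available as a timezone-aware <code>datetime</code>; how the platform actually reads the certificate is not covered here.</p>

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

SECONDS_PER_DAY = 86400


def classify_cert_expiry(not_after: datetime, now: datetime) -> Optional[str]:
    """Return the level for a certificate expiring at not_after, or None."""
    remaining_days = (not_after - now).total_seconds() / SECONDS_PER_DAY
    if remaining_days <= 0:
        return "High"         # certificate has expired
    if remaining_days < 7:
        return "High"
    if remaining_days < 14:
        return "Concern"
    if remaining_days < 30:
        return "Warning"
    if remaining_days < 60:
        return "Information"
    return None


# Example: a certificate with 10 days left falls in the 7 to 14 day band.
now = datetime(2020, 4, 23, tzinfo=timezone.utc)
print(classify_cert_expiry(now + timedelta(days=10), now))  # Concern
```

<p>Testing the shortest window first ensures each certificate is reported at exactly one level, the most severe one that applies.</p>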
<div id="printers" class="servicelayer" style="padding-top:50px;max-width:1000px;margin-right:auto;margin-left:auto">
    <p><b>Printers</b></p>
    <p>We monitor printers for status and toner levels.</p>
    <table class="w3-table w3-bordered w3-centered" style="width:800px" align="center">
        <tr><th style="width:180px">Level</th><th>Trigger</th></tr>
        <tr><td style="background-color:#FFC859">Warning</td><td>No SNMP data received for 3 mins</td></tr>
        <tr><td style="background-color:#FFC859">Warning</td><td>Printer is in error state</td></tr>
        <tr><td style="background-color:#FFC859">Warning</td><td>Consumable on printer is empty</td></tr>
        <tr><td style="background-color:#7499FF">Information</td><td>Consumable on printer is under 10%</td></tr>
    </table>
</div>
 
[[Category:BCX Network Monitoring]]

Latest revision as of 09:04, 23 April 2020
