Difference between revisions of "Service level priorities and actions"
Revision as of 14:34, 22 April 2020
Service level priorities and actions
Level | Display | Alert | Notes | Example |
---|---|---|---|---|
Major | Customer and Burconix Support Dashboard | Alert Customer and Burconix Support | Current major interruption to service | Core storage offline |
High | Customer and Burconix Support Dashboard | Alert Customer and Burconix Support | Potential for future interruption to service | Single disk/PSU failure in redundant configuration |
Concern | Customer Dashboard | Alert Customer | Potential interruption to service | HTTPS service offline |
Warning | Customer Dashboard | No Alert | Warning of potential issue | Free disk space less than 10GB |
Information | Customer Dashboard | No Alert | Possible future action may be required | Toner less than 10% |
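
As an illustration only, the table above can be read as a lookup from level to the dashboards that display an event and the parties that receive an alert. The sketch below is an assumption about how such a mapping could be encoded; the names `LevelPolicy`, `POLICIES` and `alert_recipients` are hypothetical and are not part of the monitoring platform.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class LevelPolicy:
    """Where an event is displayed and who is alerted (hypothetical type)."""
    dashboards: Tuple[str, ...]
    alerts: Tuple[str, ...]


BOTH = ("Customer", "Burconix Support")
CUSTOMER_ONLY = ("Customer",)

# One entry per row of the service level table.
POLICIES: Dict[str, LevelPolicy] = {
    "Major": LevelPolicy(dashboards=BOTH, alerts=BOTH),
    "High": LevelPolicy(dashboards=BOTH, alerts=BOTH),
    "Concern": LevelPolicy(dashboards=CUSTOMER_ONLY, alerts=CUSTOMER_ONLY),
    "Warning": LevelPolicy(dashboards=CUSTOMER_ONLY, alerts=()),
    "Information": LevelPolicy(dashboards=CUSTOMER_ONLY, alerts=()),
}


def alert_recipients(level: str) -> Tuple[str, ...]:
    """Return the parties alerted for a level; an empty tuple means no alert."""
    return POLICIES[level].alerts
```

For example, `alert_recipients("Warning")` returns an empty tuple, because Warning events appear on the Customer Dashboard without raising an alert.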
The tables below summarize the core events we monitor for each device type, together with the notification level each event raises.
Device Types
UPS and Power Protection
We monitor the UPS runtime status including load, battery and self-diagnostic status.
Level | Trigger |
---|---|
High | Battery needs replacing |
High | UPS on battery |
High | Runtime less than 10 mins |
High | Load is critical (90%) |
Concern | Battery temperature is too high |
Concern | No SNMP data received for 3 mins |
Concern | Load is too high (80%) |
Concern | UPS has been restarted |
Warning | Battery power currently too low to support load |
Warning | Last diagnostic test failed |
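
The two load triggers above can be sketched as a simple threshold check. This is a minimal illustration, not the platform's actual rule: the function name `ups_load_level` is hypothetical, and treating the thresholds as inclusive (at or above 80% and 90%) is an assumption, since the table only gives the figures.

```python
from typing import Optional


def ups_load_level(load_percent: float) -> Optional[str]:
    """Map UPS load to a notification level, or None if no trigger fires.

    Assumption: thresholds are inclusive; the table states only "90%" and "80%".
    """
    if load_percent >= 90:
        return "High"      # Load is critical
    if load_percent >= 80:
        return "Concern"   # Load is too high
    return None
```

The higher threshold is checked first so that a load of 95% is reported once, as High, rather than also as Concern.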
SAN and Storage
We monitor all aspects of the storage hardware and volume availability.
Level | Trigger |
---|---|
Major | Storage array is offline |
High | No SNMP/API data received for 3 mins |
High | Physical disk failed |
High | Controller health degraded |
High | Virtual disk health degraded |
High | Enclosure health degraded |
Concern | Virtual disk is not fault tolerant |
Concern | SAN has been restarted |
Warning | Controller redundancy lost |
Warning | Controller not responding to ping |
Physical Server Hardware
We monitor all aspects of the server's physical hardware sensors.
Level | Trigger |
---|---|
High | System status is in warning or critical state |
High | Power supply is in warning or critical state |
High | Disk array controller is in warning or critical state |
High | Disk array cache controller battery is in warning or critical state |
High | Disk array cache controller is in warning or critical state |
High | Physical disk failed |
High | Virtual disk offline |
High | Fan is in critical state |
High | Ambient temperature is above critical threshold |
High | No SNMP data received for 3 mins |
Concern | Ambient temperature is above warning threshold |
Concern | Ambient temperature is too low |
Warning | Disk array cache controller non-optimal |
Warning | Physical disk is in warning state |
Warning | Virtual disk is in warning state |
Warning | Fan is in warning state |
Warning | System has been restarted |
Windows/Linux Server Agent
We monitor availability and metrics including processor, memory, network and disk utilization, as well as the running status of core Windows services.
Level | Trigger |
---|---|
High | Free disk space is less than 500MB on volume |
Concern | Free disk space is less than 5% and under 5GB on volume |
Concern | Agent is unreachable for 10 mins |
Concern | Monitored Windows service is not running |
Warning | Free disk space is less than 10% and under 10GB on volume |
Warning | Agent is unreachable for 3 mins |
Warning | Server has been restarted |
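
The free disk space triggers combine a percentage and an absolute size, so that a very large volume does not alert just because its free percentage is low. A minimal sketch of that logic follows; the function name `disk_space_level` is hypothetical, and the use of binary units (1 GB = 1024 MB) is an assumption, as the table does not say which convention applies.

```python
from typing import Optional

MB = 1024 ** 2  # assumption: binary units; the source does not specify
GB = 1024 ** 3


def disk_space_level(free_bytes: int, total_bytes: int) -> Optional[str]:
    """Map a volume's free space to a notification level, or None."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    free_pct = 100 * free_bytes / total_bytes

    if free_bytes < 500 * MB:
        return "High"
    # Both conditions must hold: low percentage AND low absolute free space.
    if free_pct < 5 and free_bytes < 5 * GB:
        return "Concern"
    if free_pct < 10 and free_bytes < 10 * GB:
        return "Warning"
    return None
```

Under these assumptions, a 1000 GB volume with 40 GB free (4%) raises nothing, because 40 GB is above both absolute limits, while a 100 GB volume with 4 GB free raises a Concern.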
Network Switch
We monitor core switch traffic utilization on each interface and trigger a warning if a core link goes down.
On edge switch devices we monitor the hardware status and availability.
Level | Trigger |
---|---|
High | Temperature is above critical threshold |
High | Power supply status is in warning or critical state |
Concern | Temperature above warning threshold |
Concern | Temperature is too low |
Concern | Fan is in critical state |
Concern | High memory utilization |
Concern | No SNMP data received for 3 mins |
Concern | Core switch has been restarted |
Warning | Core switch link down |
Warning | Fan is in warning state |
Warning | Edge switch has been restarted |
Firewalls and Routers
We monitor traffic utilization, as well as service availability and interface link status.
Level | Trigger |
---|---|
High | Interface down |
Concern | No SNMP data received for 3 mins |
Warning | Device has been restarted |
Network Attached Device (Ping)
We monitor service availability and verify ping response times.
Level | Trigger |
---|---|
Concern | Unavailable by ICMP ping for 3 mins |
Warning | High ICMP ping loss |
Warning | High ICMP ping response time |
Web Services
We monitor SSL certificates and the availability of other web services, including HTTP, HTTPS, SMTP and TCP port 443.
These can be monitored from within your network or from a remote location in Nottingham.
Level | Trigger |
---|---|
High | SSL certificate has expired |
High | SSL certificate expires in less than 7 days |
Concern | Web service has been down for 3 mins |
Concern | SSL certificate expires in less than 14 days |
Warning | SSL certificate expires in less than 30 days |
Information | SSL certificate expires in less than 60 days |
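
The certificate triggers above form a descending ladder of remaining validity. The sketch below shows one way to express it; the function name `cert_expiry_level` is hypothetical, and treating a certificate as expired once its remaining time is zero or negative is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def cert_expiry_level(not_after: datetime, now: datetime) -> Optional[str]:
    """Map a certificate's expiry time to a notification level, or None."""
    remaining = not_after - now
    if remaining <= timedelta(0):
        return "High"          # certificate has expired (assumed: zero or less)
    if remaining < timedelta(days=7):
        return "High"
    if remaining < timedelta(days=14):
        return "Concern"
    if remaining < timedelta(days=30):
        return "Warning"
    if remaining < timedelta(days=60):
        return "Information"
    return None


# Usage: a certificate with 10 days left falls in the "less than 14 days" band.
now = datetime(2020, 4, 22, tzinfo=timezone.utc)
level = cert_expiry_level(now + timedelta(days=10), now)
```

Checking the shortest window first means each certificate is reported at the most severe level that applies, so `level` above is `"Concern"`.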
Printers
We monitor printers for status and toner levels.
Level | Trigger |
---|---|
Warning | No SNMP data received for 3 mins |
Warning | Printer is in error state |
Warning | Consumable on printer is empty |
Information | Consumable on printer is under 10% |