Latest revision as of 09:04, 23 April 2020
Service level priorities and actions
Level | Display | Alert | Notes | Example |
---|---|---|---|---|
Major | Customer and Burconix Support Dashboard | Alert Customer and Burconix Support | Current major interruption to service | Core storage offline |
High | Customer and Burconix Support Dashboard | Alert Customer and Burconix Support | Potential for future interruption to service | Single disk/PSU failure in redundant configuration |
Concern | Customer Dashboard | Alert Customer | Potential interruption to service | HTTPS service offline |
Warning | Customer Dashboard | No Alert | Warning of potential issue | Free disk space less than 10GB |
Information | Customer Dashboard | No Alert | Possible future action may be required | Toner less than 10% |
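The routing in the table above can be expressed as a small lookup. This is only an illustrative sketch: the names `LevelPolicy`, `POLICIES` and `alert_recipients` are hypothetical and do not describe how the monitoring platform is actually configured.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

CUSTOMER = "Customer"
SUPPORT = "Burconix Support"


@dataclass(frozen=True)
class LevelPolicy:
    """Who sees an event on a dashboard and who is alerted (hypothetical type)."""
    dashboards: Tuple[str, ...]
    alerts: Tuple[str, ...]


# One entry per row of the service level table above.
POLICIES: Dict[str, LevelPolicy] = {
    "Major": LevelPolicy((CUSTOMER, SUPPORT), (CUSTOMER, SUPPORT)),
    "High": LevelPolicy((CUSTOMER, SUPPORT), (CUSTOMER, SUPPORT)),
    "Concern": LevelPolicy((CUSTOMER,), (CUSTOMER,)),
    "Warning": LevelPolicy((CUSTOMER,), ()),
    "Information": LevelPolicy((CUSTOMER,), ()),
}


def alert_recipients(level: str) -> Tuple[str, ...]:
    """Return the parties alerted for a level; an empty tuple means "No Alert"."""
    return POLICIES[level].alerts
```

For example, `alert_recipients("Warning")` returns an empty tuple, because Warning events are shown on the Customer Dashboard without raising an alert.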
Device Types
The tables below summarize the core events we monitor for each device type, along with the notification level each event triggers.
UPS and Power Protection
We monitor the UPS runtime status including load, battery and self-diagnostic status.
Level | Trigger |
---|---|
High | Battery needs replacing |
High | UPS on battery |
High | Runtime less than 10 mins |
High | Load is critical (90%) |
Concern | Battery temperature is too high |
Concern | No SNMP data received for 3 mins |
Concern | Load is too high (80%) |
Concern | UPS has been restarted |
Warning | Battery power currently too low to support load |
Warning | Last diagnostic test failed |
SAN and Storage
We monitor all aspects of the storage hardware and volume availability.
Level | Trigger |
---|---|
Major | Storage array is offline |
High | No SNMP/API data received for 3 mins |
High | Physical disk failed |
High | Controller health degraded |
High | Virtual disk health degraded |
High | Enclosure health degraded |
Concern | Virtual disk is not fault tolerant |
Concern | SAN has been restarted |
Warning | Controller redundancy lost |
Warning | Controller not responding to ping |
Physical Server Hardware
We monitor all aspects of the server's physical hardware sensors.
Level | Trigger |
---|---|
High | System status is in warning or critical state |
High | Power supply is in warning or critical state |
High | Disk array controller is in warning or critical state |
High | Disk array cache controller battery is in warning or critical state |
High | Disk array cache controller is in warning or critical state |
High | Physical disk failed |
High | Virtual disk offline |
High | Fan is in critical state |
High | Ambient temperature is above critical threshold |
High | No SNMP data received for 3 mins |
Concern | Ambient temperature is above warning threshold |
Concern | Ambient temperature is too low |
Warning | Disk array cache controller non-optimal |
Warning | Physical disk is in warning state |
Warning | Virtual disk is in warning state |
Warning | Fan is in warning state |
Warning | System has been restarted |
Windows/Linux Server Agent
We monitor availability and metrics including processor, memory, network and disk utilization, as well as the running status of core Windows services.
Level | Trigger |
---|---|
High | Free disk space is less than 500MB on volume |
Concern | Free disk space is less than 5% and under 5GB on volume |
Concern | Agent is unreachable for 10 mins |
Concern | Monitored Windows service is not running |
Warning | Free disk space is less than 10% and under 10GB on volume |
Warning | Agent is unreachable for 3 mins |
Warning | Server has been restarted |
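The free disk space triggers above combine a percentage and an absolute size, which is easiest to read as a rule. The sketch below is an illustration only: the function name `disk_space_level` is hypothetical, it assumes binary units (1 GB = 1024³ bytes), and it assumes the most severe matching level wins. None of these details are confirmed by the table.

```python
from typing import Optional

MB = 1024 ** 2  # assumption: binary megabytes
GB = 1024 ** 3  # assumption: binary gigabytes


def disk_space_level(free_bytes: int, total_bytes: int) -> Optional[str]:
    """Classify a volume's free space; None means no trigger fires."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    free_pct = 100.0 * free_bytes / total_bytes

    # Check the most severe level first so it takes precedence.
    if free_bytes < 500 * MB:
        return "High"
    if free_pct < 5 and free_bytes < 5 * GB:
        return "Concern"
    if free_pct < 10 and free_bytes < 10 * GB:
        return "Warning"
    return None
```

Because both conditions must hold, a 2 TB volume with 50 GB free (2.5%) raises nothing: the percentage is low, but the absolute free space is still above the 5 GB and 10 GB limits.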
Network Switch
We monitor core switch traffic utilization on each interface and trigger a warning if a core link goes down.
On edge switch devices we monitor the hardware status and availability.
Level | Trigger |
---|---|
High | Temperature is above critical threshold |
High | Power supply status is in warning or critical state |
Concern | Temperature is above warning threshold |
Concern | Temperature is too low |
Concern | Fan is in critical state |
Concern | High memory utilization |
Concern | No SNMP data received for 3 mins |
Concern | Core switch has been restarted |
Warning | Core switch link down |
Warning | Fan is in warning state |
Warning | Edge switch has been restarted |
Firewalls and Routers
We monitor traffic utilization, as well as service availability and interface link status.
Level | Trigger |
---|---|
High | Interface down |
Concern | No SNMP data received for 3 mins |
Warning | Device has been restarted |
Network Attached Device (Ping)
We monitor service availability and verify ping response times.
Level | Trigger |
---|---|
Concern | Unavailable by ICMP ping for 3 mins |
Warning | High ICMP ping loss |
Warning | High ICMP ping response time |
Web Services
We monitor SSL certificates and other web service availability including HTTP, HTTPS, SMTP and TCP port 443.
These checks can be run from within your network, or from a remote location based in Nottingham.
Level | Trigger |
---|---|
High | SSL certificate has expired |
High | SSL certificate expires in less than 7 days |
Concern | Web service has been down for 3 mins |
Concern | SSL certificate expires in less than 14 days |
Warning | SSL certificate expires in less than 30 days |
Information | SSL certificate expires in less than 60 days |
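The certificate expiry tiers above can be read as a descending ladder of thresholds. The following sketch shows one way to evaluate them; `certificate_level` is a hypothetical helper, and it assumes the most severe matching tier is the one reported.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def certificate_level(not_after: datetime, now: datetime) -> Optional[str]:
    """Map a certificate's expiry time to a notification level.

    Returns None when the certificate has 60 days or more remaining.
    """
    remaining = not_after - now
    if remaining <= timedelta(0):
        return "High"  # certificate has expired
    if remaining < timedelta(days=7):
        return "High"
    if remaining < timedelta(days=14):
        return "Concern"
    if remaining < timedelta(days=30):
        return "Warning"
    if remaining < timedelta(days=60):
        return "Information"
    return None


# Example: a certificate expiring in 10 days falls in the Concern tier.
now = datetime(2020, 4, 23, tzinfo=timezone.utc)
level = certificate_level(now + timedelta(days=10), now)
```

Passing `now` explicitly keeps the function deterministic; in practice the caller would supply the current UTC time and the certificate's expiry date.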
Printers
We monitor printers for status and toner levels.
Level | Trigger |
---|---|
Warning | No SNMP data received for 3 mins |
Warning | Printer is in error state |
Warning | Consumable on printer is empty |
Information | Consumable on printer is under 10% |