Latest revision as of 09:04, 23 April 2020
Service level priorities and actions
Level | Display | Alert | Notes | Example |
---|---|---|---|---|
Major | Customer and Burconix Support Dashboard | Alert Customer and Burconix Support | Current major interruption to service | Core storage offline |
High | Customer and Burconix Support Dashboard | Alert Customer and Burconix Support | Potential for future interruption to service | Single disk/PSU failure in a redundant configuration |
Concern | Customer Dashboard | Alert Customer | Potential interruption to service | HTTPS service offline |
Warning | Customer Dashboard | No Alert | Warning of potential issue | Free disk space less than 10GB |
Information | Customer Dashboard | No Alert | Possible future action may be required | Toner less than 10% |
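The table above can be read as a lookup from level to handling. The following is a minimal sketch of that lookup, not the actual Burconix implementation; the names `Level`, `Handling`, `HANDLING` and `handling_for` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    MAJOR = "Major"
    HIGH = "High"
    CONCERN = "Concern"
    WARNING = "Warning"
    INFORMATION = "Information"


@dataclass(frozen=True)
class Handling:
    dashboards: tuple  # dashboards on which the event is displayed
    alerts: tuple      # parties actively alerted (empty tuple = no alert)


CUSTOMER = "Customer"
SUPPORT = "Burconix Support"

# Transcribed from the Display and Alert columns of the table above.
HANDLING = {
    Level.MAJOR: Handling((CUSTOMER, SUPPORT), (CUSTOMER, SUPPORT)),
    Level.HIGH: Handling((CUSTOMER, SUPPORT), (CUSTOMER, SUPPORT)),
    Level.CONCERN: Handling((CUSTOMER,), (CUSTOMER,)),
    Level.WARNING: Handling((CUSTOMER,), ()),
    Level.INFORMATION: Handling((CUSTOMER,), ()),
}


def handling_for(level: Level) -> Handling:
    """Return where an event of this level is shown and who is alerted."""
    return HANDLING[level]
```

For example, `handling_for(Level.CONCERN).alerts` contains only the customer, while `handling_for(Level.WARNING).alerts` is empty because Warning events are displayed without an alert.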
Device Types
The tables below summarize the core events we monitor for each device type, together with the notification level for each event.
UPS and Power Protection
We monitor the UPS runtime status including load, battery and self-diagnostic status.
Level | Trigger |
---|---|
High | Battery needs replacing |
High | UPS on battery |
High | Runtime less than 10 mins |
High | Load is critical (90%) |
Concern | Battery temperature is too high |
Concern | No SNMP data received for 3 mins |
Concern | Load is too high (80%) |
Concern | UPS has been restarted |
Warning | Battery power currently too low to support load |
Warning | Last diagnostic test failed |
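As an illustration of how the numeric UPS triggers above combine, the sketch below classifies load and runtime readings. It assumes that the 90% and 80% load figures are inclusive lower bounds, which the table does not state; the function and constant names are hypothetical.

```python
from typing import Optional

# Thresholds taken from the table above. Treating the load figures as
# "at or above" is an assumption; the table only gives the percentages.
CRITICAL_LOAD_PERCENT = 90
HIGH_LOAD_PERCENT = 80
MIN_RUNTIME_MINUTES = 10


def classify_ups_load(load_percent: float) -> Optional[str]:
    """Return the notification level for a UPS load reading, or None."""
    if load_percent >= CRITICAL_LOAD_PERCENT:
        return "High"
    if load_percent >= HIGH_LOAD_PERCENT:
        return "Concern"
    return None


def classify_ups_runtime(runtime_minutes: float) -> Optional[str]:
    """Return "High" when remaining runtime is under 10 minutes."""
    if runtime_minutes < MIN_RUNTIME_MINUTES:
        return "High"
    return None
```

The most severe threshold is checked first so that a load of 95% is reported once, as High, rather than also matching the 80% trigger.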
SAN and Storage
We monitor all aspects of the storage hardware and volume availability.
Level | Trigger |
---|---|
Major | Storage array is offline |
High | No SNMP/API data received for 3 mins |
High | Physical disk failed |
High | Controller health degraded |
High | Virtual disk health degraded |
High | Enclosure health degraded |
Concern | Virtual disk is not fault tolerant |
Concern | SAN has been restarted |
Warning | Controller redundancy lost |
Warning | Controller not responding to ping |
Physical Server Hardware
We monitor all aspects of the server's physical hardware sensors.
Level | Trigger |
---|---|
High | System status is in warning or critical state |
High | Power supply is in warning or critical state |
High | Disk array controller is in warning or critical state |
High | Disk array cache controller battery is in warning or critical state |
High | Disk array cache controller is in warning or critical state |
High | Physical disk failed |
High | Virtual disk offline |
High | Fan is in critical state |
High | Ambient temperature is above critical threshold |
High | No SNMP data received for 3 mins |
Concern | Ambient temperature is above warning threshold |
Concern | Ambient temperature is too low |
Warning | Disk array cache controller non-optimal |
Warning | Physical disk is in warning state |
Warning | Virtual disk is in warning state |
Warning | Fan is in warning state |
Warning | System has been restarted |
Windows/Linux Server Agent
We monitor both availability and metrics, including processor, memory, network and disk utilization, as well as the running status of core Windows services.
Level | Trigger |
---|---|
High | Free disk space is less than 500MB on volume |
Concern | Free disk space is less than 5% and under 5GB on volume |
Concern | Agent is unreachable for 10 mins |
Concern | Monitored Windows service is not running |
Warning | Free disk space is less than 10% and under 10GB on volume |
Warning | Agent is unreachable for 3 mins |
Warning | Server has been restarted |
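The free disk space triggers above combine a percentage with an absolute size, which is easy to misread. The sketch below shows one way to evaluate them for a single volume. It assumes binary units (1 GB = 1024³ bytes), which the table does not specify, and `classify_free_space` is a hypothetical name.

```python
from typing import Optional

MB = 1024 ** 2  # assumption: binary units; the table does not say
GB = 1024 ** 3


def classify_free_space(free_bytes: int, total_bytes: int) -> Optional[str]:
    """Return the notification level for a volume, or None if healthy."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    free_percent = 100 * free_bytes / total_bytes

    # Most severe trigger first: an absolute floor of 500MB.
    if free_bytes < 500 * MB:
        return "High"
    # The lower levels need BOTH conditions: low percentage and low size.
    if free_percent < 5 and free_bytes < 5 * GB:
        return "Concern"
    if free_percent < 10 and free_bytes < 10 * GB:
        return "Warning"
    return None
```

Requiring both conditions means a very large volume with 5% free but 50GB remaining raises nothing, while a small volume with 8GB free out of 100GB raises a Warning.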
Network Switch
We monitor core switch traffic utilization on each interface and trigger a warning if a core link goes down.
On edge switch devices we monitor the hardware status and availability.
Level | Trigger |
---|---|
High | Temperature is above critical threshold |
High | Power supply status is in warning or critical state |
Concern | Temperature above warning threshold |
Concern | Temperature is too low |
Concern | Fan is in critical state |
Concern | High memory utilization |
Concern | No SNMP data received for 3 mins |
Concern | Core switch has been restarted |
Warning | Core switch link down |
Warning | Fan is in warning state |
Warning | Edge switch has been restarted |
Firewalls and Routers
We monitor traffic utilization, as well as service availability and interface link status.
Level | Trigger |
---|---|
High | Interface down |
Concern | No SNMP data received for 3 mins |
Warning | Device has been restarted |
Network Attached Device (Ping)
We monitor service availability and verify ping response times.
Level | Trigger |
---|---|
Concern | Unavailable by ICMP ping for 3 mins |
Warning | High ICMP ping loss |
Warning | High ICMP ping response time |
Web Services
We monitor SSL certificate expiry and the availability of web services, including HTTP, HTTPS, SMTP and TCP port 443.
These services can be monitored from within your network, or from a remote location based in Nottingham.
Level | Trigger |
---|---|
High | SSL certificate has expired |
High | SSL certificate expires in less than 7 days |
Concern | Web service has been down for 3 mins |
Concern | SSL certificate expires in less than 14 days |
Warning | SSL certificate expires in less than 30 days |
Information | SSL certificate expires in less than 60 days |
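The certificate expiry triggers above form a ladder of thresholds. A minimal sketch of that ladder follows; `classify_certificate` is a hypothetical helper, and the caller is assumed to supply the certificate's expiry time and the current time as timezone-aware datetimes.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# (days remaining, level) pairs from the table above, most urgent first.
EXPIRY_LADDER = (
    (7, "High"),
    (14, "Concern"),
    (30, "Warning"),
    (60, "Information"),
)


def classify_certificate(not_after: datetime, now: datetime) -> Optional[str]:
    """Return the notification level for a certificate, or None."""
    remaining = not_after - now
    if remaining <= timedelta(0):
        return "High"  # certificate has already expired
    for days, level in EXPIRY_LADDER:
        if remaining < timedelta(days=days):
            return level
    return None


now = datetime(2020, 4, 23, tzinfo=timezone.utc)
level = classify_certificate(now + timedelta(days=10), now)  # "Concern"
```

Because the ladder is ordered from the shortest window to the longest, the first match is always the most severe level that applies.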
Printers
We monitor printers for status and toner levels.
Level | Trigger |
---|---|
Warning | No SNMP data received for 3 mins |
Warning | Printer is in error state |
Warning | Consumable on printer is empty |
Information | Consumable on printer is under 10% |
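The same trigger, no SNMP data received for 3 mins, appears in several of the tables above at different levels depending on the device type. The sketch below collects those rows into one lookup for comparison. The mapping is transcribed from the tables; the function name is hypothetical, and treating "for 3 mins" as "3 minutes or more" is an assumption.

```python
from typing import Optional

SNMP_SILENCE_MINUTES = 3

# Level raised when no SNMP data arrives, per device type (from the tables).
# For SAN and Storage the trigger covers SNMP/API data.
SNMP_SILENCE_LEVEL = {
    "UPS and Power Protection": "Concern",
    "SAN and Storage": "High",
    "Physical Server Hardware": "High",
    "Network Switch": "Concern",
    "Firewalls and Routers": "Concern",
    "Printers": "Warning",
}


def classify_snmp_silence(device_type: str, minutes_silent: float) -> Optional[str]:
    """Return the level for a device that has stopped reporting, or None."""
    if minutes_silent < SNMP_SILENCE_MINUTES:
        return None
    return SNMP_SILENCE_LEVEL[device_type]
```

A silent storage array or server is therefore raised as High, while a silent printer is only a Warning.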