Service level priorities and actions

From BCX Media Wiki

Level | Display | Alert | Notes | Example
Major | Customer and Burconix Support Dashboard | Alert Customer and Burconix Support | Current major interruption to service | Core storage offline
High | Customer and Burconix Support Dashboard | Alert Customer and Burconix Support | Potential for future interruption to service | Single disk/PSU failure in redundant configuration
Concern | Customer Dashboard | Alert Customer | Potential interruption to service | HTTPS service offline
Warning | Customer Dashboard | No Alert | Warning of potential issue | Free disk space less than 10GB
Information | Customer Dashboard | No Alert | Possible future action may be required | Toner less than 10%
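
As an illustration only, the levels above could be encoded as a simple lookup from level to dashboard visibility and alert recipients. The names `Level`, `Policy`, `POLICIES` and `alert_recipients` are hypothetical and do not describe how the monitoring platform is actually implemented.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Level(Enum):
    MAJOR = "Major"
    HIGH = "High"
    CONCERN = "Concern"
    WARNING = "Warning"
    INFORMATION = "Information"


@dataclass(frozen=True)
class Policy:
    dashboards: Tuple[str, ...]  # who sees the event on a dashboard
    alerts: Tuple[str, ...]      # who receives an alert (empty means no alert)


CUSTOMER = "Customer"
SUPPORT = "Burconix Support"

# Hypothetical encoding of the service level table.
POLICIES = {
    Level.MAJOR: Policy((CUSTOMER, SUPPORT), (CUSTOMER, SUPPORT)),
    Level.HIGH: Policy((CUSTOMER, SUPPORT), (CUSTOMER, SUPPORT)),
    Level.CONCERN: Policy((CUSTOMER,), (CUSTOMER,)),
    Level.WARNING: Policy((CUSTOMER,), ()),
    Level.INFORMATION: Policy((CUSTOMER,), ()),
}


def alert_recipients(level: Level) -> Tuple[str, ...]:
    """Return who is alerted for an event at the given level."""
    return POLICIES[level].alerts
```

With this sketch, `alert_recipients(Level.CONCERN)` returns only the customer, while Warning and Information events return an empty tuple because they appear on the dashboard without raising an alert.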


Device Types

Below is a summary of the core events we monitor for each device type, together with the notification level each event raises.


UPS and Power Protection

We monitor the UPS runtime status including load, battery and self-diagnostic status.

Level | Trigger
High | Battery needs replacing
High | UPS on battery
High | Runtime less than 10 mins
High | Load is critical (90%)
Concern | Battery temperature is too high
Concern | No SNMP data received for 3 mins
Concern | Load is too high (80%)
Concern | UPS has been restarted
Warning | Battery power currently too low to support load
Warning | Last diagnostic test failed
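
The numeric UPS triggers above can be sketched as simple threshold checks. This is a minimal illustration: the function names are hypothetical, and treating the 90% and 80% figures as "at or above" is an assumption, since the table does not say whether the boundary value itself triggers.

```python
from typing import Optional


def classify_ups_load(load_percent: float) -> Optional[str]:
    """Map UPS load to a level; None means no event is raised."""
    if load_percent >= 90:  # assumption: boundary value is included
        return "High"
    if load_percent >= 80:  # assumption: boundary value is included
        return "Concern"
    return None


def classify_ups_runtime(runtime_minutes: float) -> Optional[str]:
    """Runtime of less than 10 minutes is a High event."""
    if runtime_minutes < 10:
        return "High"
    return None
```

The checks run from most to least severe, so a load of 95% is reported once as High rather than as both High and Concern.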

SAN and Storage

We monitor all aspects of the storage hardware and volume availability.

Level | Trigger
Major | Storage array is offline
High | No SNMP/API data received for 3 mins
High | Physical disk failed
High | Controller health degraded
High | Virtual disk health degraded
High | Enclosure health degraded
Concern | Virtual disk is not fault tolerant
Concern | SAN has been restarted
Warning | Controller redundancy lost
Warning | Controller not responding to ping

Physical Server Hardware

We monitor all aspects of the server's physical hardware sensors.

Level | Trigger
High | System status is in warning or critical state
High | Power supply is in warning or critical state
High | Disk array controller is in warning or critical state
High | Disk array cache controller battery is in warning or critical state
High | Disk array cache controller is in warning or critical state
High | Physical disk failed
High | Virtual disk offline
High | Fan is in critical state
High | Ambient temperature is above critical threshold
High | No SNMP data received for 3 mins
Concern | Ambient temperature is above warning threshold
Concern | Ambient temperature is too low
Warning | Disk array cache controller non-optimal
Warning | Physical disk is in warning state
Warning | Virtual disk is in warning state
Warning | Fan is in warning state
Warning | System has been restarted

Windows/Linux Server Agent

We monitor availability and performance metrics, including processor, memory, network and disk utilization, as well as the running status of core Windows services.

Level | Trigger
High | Free disk space is less than 500MB on volume
Concern | Free disk space is less than 5% and under 5GB on volume
Concern | Agent is unreachable for 10 mins
Concern | Monitored Windows service is not running
Warning | Free disk space is less than 10% and under 10GB on volume
Warning | Agent is unreachable for 3 mins
Warning | Server has been restarted
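
The disk space and agent triggers above combine a percentage with an absolute size, which is easy to misread. The following sketch shows one way to evaluate them. The function names are hypothetical, and interpreting MB and GB as binary units (1024-based) and "for N mins" as "N minutes or more" are assumptions not stated in the table.

```python
from typing import Optional

MB = 1024 ** 2  # assumption: binary units
GB = 1024 ** 3


def classify_free_space(free_bytes: int, total_bytes: int) -> Optional[str]:
    """Map free space on a volume to a level; None means no event."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    free_pct = 100.0 * free_bytes / total_bytes
    if free_bytes < 500 * MB:
        return "High"
    # Both the percentage and the absolute condition must hold.
    if free_pct < 5 and free_bytes < 5 * GB:
        return "Concern"
    if free_pct < 10 and free_bytes < 10 * GB:
        return "Warning"
    return None


def classify_agent_silence(minutes_unreachable: float) -> Optional[str]:
    """Map how long the agent has been unreachable to a level."""
    if minutes_unreachable >= 10:
        return "Concern"
    if minutes_unreachable >= 3:
        return "Warning"
    return None
```

Requiring both conditions means a large volume with 4% free but 40GB still available raises no event, while a small volume at the same percentage does.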

Network Switch

We monitor core switch traffic utilization on each interface and trigger a warning if a core link goes down.
On edge switch devices we monitor the hardware status and availability.

Level | Trigger
High | Temperature is above critical threshold
High | Power supply status is in warning or critical state
Concern | Temperature above warning threshold
Concern | Temperature is too low
Concern | Fan is in critical state
Concern | High memory utilization
Concern | No SNMP data received for 3 mins
Concern | Core switch has been restarted
Warning | Core switch link down
Warning | Fan is in warning state
Warning | Edge switch has been restarted

Firewalls and Routers

We monitor traffic utilization, as well as service availability and interface link status.

Level | Trigger
High | Interface down
Concern | No SNMP data received for 3 mins
Warning | Device has been restarted

Network Attached Device (Ping)

We monitor service availability and verify ping response times.

Level | Trigger
Concern | Unavailable by ICMP ping for 3 mins
Warning | High ICMP ping loss
Warning | High ICMP ping response time

Web Services

We monitor SSL certificate expiry and the availability of web services, including HTTP, HTTPS, SMTP and TCP port 443.
These can be monitored from within your network or from a remote location in Nottingham.

Level | Trigger
High | SSL certificate has expired
High | SSL certificate expires in less than 7 days
Concern | Web service has been down for 3 mins
Concern | SSL certificate expires in less than 14 days
Warning | SSL certificate expires in less than 30 days
Information | SSL certificate expires in less than 60 days
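
The certificate expiry triggers above form a ladder of thresholds, which can be sketched as follows. This is only an illustration: `classify_certificate` is a hypothetical name, and the expiry time is passed in directly rather than read from a live certificate.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def classify_certificate(not_after: datetime, now: datetime) -> Optional[str]:
    """Map the time left before a certificate expires to a level."""
    remaining = not_after - now
    if remaining < timedelta(days=7):  # also covers already expired
        return "High"
    if remaining < timedelta(days=14):
        return "Concern"
    if remaining < timedelta(days=30):
        return "Warning"
    if remaining < timedelta(days=60):
        return "Information"
    return None


# Example: a certificate with 10 days left, at a fixed reference time.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(classify_certificate(now + timedelta(days=10), now))  # Concern
```

An expired certificate yields a negative remaining time, so it falls into the first check and is reported as High along with certificates that have less than 7 days left.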

Printers

We monitor printers for status and toner levels.

Level | Trigger
Warning | No SNMP data received for 3 mins
Warning | Printer is in error state
Warning | Consumable on printer is empty
Information | Consumable on printer is under 10%