20 terms explained

Monitoring Glossary

Plain-English definitions of uptime monitoring, incident management, and site reliability concepts. Built for developers and ops teams.

Alert Fatigue

When too many false alerts cause teams to ignore real incidents.

DNS Propagation

How DNS record changes spread across DNS servers worldwide as cached copies of the old records expire.

Downtime

Any period when a service is unavailable to its users.

Error Budget

The amount of downtime a service can accumulate before its availability target (an SLO or SLA) is violated.
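As a rough sketch of the definition above, the remaining downtime allowance can be tracked over a rolling window. The function name, the 30-day window, and the sample figures are assumptions chosen for illustration, not part of any particular product.

```python
def remaining_budget_minutes(target_pct: float,
                             downtime_so_far_minutes: float,
                             window_days: float = 30.0) -> float:
    """Minutes of downtime still allowed in the window (negative means the target is breached)."""
    if not 0.0 <= target_pct <= 100.0:
        raise ValueError("target_pct must be between 0 and 100")
    window_minutes = window_days * 24 * 60
    total_allowed = window_minutes * (1 - target_pct / 100)
    return total_allowed - downtime_so_far_minutes

# Hypothetical case: a 99.9% target over 30 days with 13.2 minutes already lost.
remaining = remaining_budget_minutes(99.9, downtime_so_far_minutes=13.2)
print(f"{remaining:.1f} minutes left")
```

A negative result means the service has already used more downtime than the target permits for that window.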

Escalation Policy

Rules for escalating unacknowledged incidents to additional responders.
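One way to picture such rules is as an ordered list of time thresholds. The sketch below is illustrative only: the step delays, responder labels, and function name are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes without acknowledgement before this step fires
    responder: str      # hypothetical responder label

# Hypothetical three-step policy.
POLICY = [
    EscalationStep(0, "primary on-call"),
    EscalationStep(10, "secondary on-call"),
    EscalationStep(30, "engineering manager"),
]

def responders_to_notify(minutes_unacknowledged: int, policy=POLICY) -> list[str]:
    """Everyone who should have been notified by now for an unacknowledged incident."""
    return [step.responder for step in policy
            if minutes_unacknowledged >= step.after_minutes]

print(responders_to_notify(15))
```

Acknowledging the incident would normally stop further escalation; that part is left out of the sketch.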

Five Nines

99.999% uptime, which allows just 5.26 minutes of downtime per year.
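The 5.26-minute figure can be checked with a few lines of arithmetic. This sketch assumes an average year of 365.25 days; the function name is made up for the example.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes, assuming an average year length

def downtime_minutes_per_year(availability_pct: float) -> float:
    """Yearly downtime implied by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):.2f} minutes per year")
```

Each extra nine cuts the permitted downtime by a factor of ten.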

Heartbeat Monitoring

Passive monitoring where the service pings the monitor on a schedule.
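On the monitor's side, this usually comes down to comparing the time of the last ping against the expected schedule. The sketch below assumes a simple interval plus a grace period; the function name and the one-minute default are illustrative choices.

```python
from datetime import datetime, timedelta

def is_heartbeat_late(last_ping: datetime, now: datetime,
                      expected_interval: timedelta,
                      grace: timedelta = timedelta(minutes=1)) -> bool:
    """Return True when the service has missed its expected check-in."""
    return now - last_ping > expected_interval + grace

# Hypothetical nightly job that should ping every 24 hours.
last_ping = datetime(2024, 1, 1, 2, 0)
every_day = timedelta(hours=24)
print(is_heartbeat_late(last_ping, datetime(2024, 1, 2, 2, 0, 30), every_day))  # within the grace period
print(is_heartbeat_late(last_ping, datetime(2024, 1, 2, 3, 0), every_day))      # overdue
```

The grace period keeps a job that runs a few seconds late from raising an alert.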

Incident Management

The process of identifying, analyzing, and resolving service disruptions.

Latency

The time delay between sending a request and receiving a response.
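A minimal way to measure this delay is to time the call with a monotonic clock. In the sketch below, `send_request` is a hypothetical callable standing in for a real network request, and a short sleep simulates the round trip.

```python
import time

def measure_latency_ms(send_request) -> float:
    """Time a single request/response round trip in milliseconds."""
    start = time.perf_counter()
    send_request()  # blocks until the response has arrived
    return (time.perf_counter() - start) * 1000

# Stand-in for a real request: sleep for roughly 50 ms.
latency = measure_latency_ms(lambda: time.sleep(0.05))
print(f"{latency:.0f} ms")
```

`time.perf_counter` is used rather than wall-clock time because it is not affected by system clock adjustments.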

MTBF

Mean Time Between Failures: the average time from one failure to the next.

MTTD

Mean Time To Detect: how long it takes, on average, before a failure is noticed.

MTTF

Mean Time To Failure: the average operating time before a failure occurs.

MTTR

Mean Time To Repair: the average time to restore service after a failure.
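Three of the averages above (between failures, to failure, and to repair) can be computed from a simple incident log. The log format and the sample values below are invented for illustration; detection time is omitted because it would need a separate "noticed at" timestamp.

```python
from statistics import mean

# Hypothetical incident log: (failure_start, service_restored), in hours since an arbitrary epoch.
incidents = [(100.0, 101.0), (300.0, 300.5), (700.0, 702.0)]

def mttr(log):
    """Average time from failure to restoration."""
    return mean(end - start for start, end in log)

def mttf(log):
    """Average operating time from one restoration to the next failure."""
    return mean(next_start - prev_end
                for (_, prev_end), (next_start, _) in zip(log, log[1:]))

def mtbf(log):
    """Average time from the start of one failure to the start of the next."""
    return mean(next_start - prev_start
                for (prev_start, _), (next_start, _) in zip(log, log[1:]))

print(f"MTTR {mttr(incidents):.2f} h, MTTF {mttf(incidents):.2f} h, MTBF {mtbf(incidents):.2f} h")
```

The sketch ignores the operating time before the first recorded failure, so MTTF here covers only the gaps between incidents.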

On-Call

A rotation system that determines who responds to incidents outside working hours.

Real User Monitoring

Collecting performance data from actual user sessions.

SLA

Service Level Agreement: a contract that defines expected availability and the consequences for breaches.

SSL Certificate

A digital certificate that enables encrypted HTTPS communication.

Status Page

A public page showing the current health and history of your services.

Synthetic Monitoring

Simulating user requests to proactively test service availability.

Uptime

The percentage of time a service is operational and accessible.
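In practice this percentage is often estimated from the results of periodic checks. The sketch below assumes evenly spaced checks, each recorded as a simple pass or fail; the function name and sample data are hypothetical.

```python
def uptime_percentage(check_results) -> float:
    """Share of checks that passed, as a percentage. Assumes evenly spaced checks."""
    results = list(check_results)
    if not results:
        raise ValueError("at least one check result is required")
    return 100 * sum(results) / len(results)  # True counts as 1, False as 0

# Hypothetical day of checks every 30 seconds (2,880 checks) with 3 failures.
day = [True] * 2877 + [False] * 3
print(f"{uptime_percentage(day):.3f}%")
```

Shorter check intervals give a finer-grained estimate, since outages that fall entirely between two checks go unrecorded.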
