The observability glossary

Precise, plain-language definitions of monitoring and observability terms, so you can understand them fast and use them with confidence.

Observability

Observability is the ability to understand a system's internal state by examining the data it produces externally — primarily logs, metrics, and traces — so you can answer new questions about its behavior without shipping new code.

APM

APM (application performance monitoring) is the practice of measuring an application's performance and reliability, primarily its latency, throughput, and error rate, at the level of individual transactions, so you can find and fix slow or failing code paths in production.

Error tracking

Error tracking is the practice of automatically capturing application exceptions, grouping occurrences of the same error into a single issue, and alerting your team with the stack trace, breadcrumbs, and release context needed to reproduce and fix each one.
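The grouping step can be illustrated with a small sketch. It assumes that two events belong to the same issue when their exception type and stack frames match; real tools use their own, more elaborate grouping rules, and the `fingerprint` helper and the event fields here are hypothetical.

```python
import hashlib
from collections import defaultdict


def fingerprint(exc_type: str, frames: list[str]) -> str:
    """Derive a stable issue key from the exception type and stack frames.

    The error message is deliberately left out, since it often contains
    variable data (IDs, keys) that would split one bug into many issues.
    """
    key = exc_type + "|" + "|".join(frames)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]


# Hypothetical captured events: two KeyErrors from the same code path
# (with different messages) and one unrelated TimeoutError.
events = [
    {"type": "KeyError", "frames": ["cart.py:total", "views.py:checkout"], "message": "'sku-1'"},
    {"type": "KeyError", "frames": ["cart.py:total", "views.py:checkout"], "message": "'sku-2'"},
    {"type": "TimeoutError", "frames": ["http.py:get"], "message": "read timed out"},
]

# Group events into issues by fingerprint.
issues = defaultdict(list)
for event in events:
    issues[fingerprint(event["type"], event["frames"])].append(event)

issue_sizes = sorted(len(group) for group in issues.values())
```

Here the two `KeyError` events collapse into one issue even though their messages differ, while the `TimeoutError` becomes a second issue.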

Distributed tracing

Distributed tracing is a technique for following a single request as it travels across multiple services, linking each step into one timeline of spans through a shared trace ID and propagated context.
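As a rough illustration of a shared trace ID and propagated context, the sketch below builds header values in the layout of the W3C Trace Context `traceparent` header (`version-traceid-parentid-flags`). The function names are hypothetical, and a real service would normally rely on a tracing SDK rather than hand-rolled strings.

```python
import secrets


def new_traceparent() -> str:
    """Start a new trace with a random 16-byte trace ID and 8-byte span ID."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"  # "01" marks the trace as sampled


def child_traceparent(parent: str) -> str:
    """Continue a trace downstream: keep the trace ID, mint a new span ID."""
    version, trace_id, _parent_span_id, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


# Service A starts the trace and sends `incoming` with its outbound request.
incoming = new_traceparent()
# Service B receives it and derives the header for its own outbound calls.
outgoing = child_traceparent(incoming)
```

Both headers carry the same trace ID, which is what lets a backend stitch the spans from each service into one timeline.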

Log management

Log management is the practice of collecting, parsing, indexing, searching, and retaining log data from across your applications and infrastructure so you can investigate behavior and diagnose problems from a single searchable place.

Session replay

Session replay is a technique that reconstructs a user's browser session as a replayable recording of DOM changes and interactions, letting you watch exactly what the user saw and did when a bug or confusing experience occurred.

OpenTelemetry

OpenTelemetry (OTel) is an open, vendor-neutral CNCF project that defines a set of APIs, SDKs, and the OTLP wire protocol for generating and exporting telemetry (traces, metrics, and logs) from your software to any compatible backend.

Real user monitoring

Real user monitoring (RUM) is the practice of measuring the actual experience of real visitors as they use your site or app in their own browsers — page load times, Core Web Vitals, interactions, and errors — rather than from a simulated test.

Structured logging

Structured logging is the practice of writing log entries as machine-parseable key-value data — typically JSON — instead of free-form text, so that fields can be reliably searched, filtered, and aggregated by a log management system.
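A minimal sketch of the idea using Python's standard `logging` module. The `JsonFormatter` class, the `fields` key, and the example field names are assumptions made for illustration, not a prescribed schema.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra={"fields": {...}}` become an attribute
        # on the record; merge them in as top-level keys.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


# Write to an in-memory stream so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.error(
    "payment failed",
    extra={"fields": {"order_id": "A-1001", "amount_cents": 4599}},
)

line = stream.getvalue().strip()
parsed = json.loads(line)
```

Because `order_id` and `amount_cents` are real fields rather than text embedded in a message, a log system can filter or aggregate on them directly.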

MTTR

MTTR is the average time it takes a team to recover from a failure, most often expanded as Mean Time to Recovery but also used for Repair, Resolve, or Respond.
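As a worked example of the average, the sketch below uses the "recovery" reading of MTTR, measured from when a failure starts to when service is restored. The incident timestamps are invented for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical incidents as (failure started, service recovered) pairs.
incidents = [
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 30)),   # 30 min
    (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 15, 30)),   # 90 min
    (datetime(2024, 1, 20, 2, 0), datetime(2024, 1, 20, 3, 0)),    # 60 min
]


def mean_time_to_recovery(pairs) -> timedelta:
    """Average the recovery durations across all incidents."""
    total = sum((end - start for start, end in pairs), timedelta())
    return total / len(pairs)


mttr = mean_time_to_recovery(incidents)  # 180 minutes / 3 incidents
```

Whichever expansion of the acronym a team uses, the start and end events being measured should be stated explicitly, since they change the number.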

SLO vs SLA vs SLI

An SLI is a measured indicator of service health, an SLO is the internal target you set on that indicator, and an SLA is the external contract that promises a level of service and defines consequences if you miss it.
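The relationship between the three can be sketched with a request-success SLI. All numbers here are hypothetical, and the sketch assumes the common arrangement in which the internal SLO is stricter than the contractual SLA.

```python
# Hypothetical request counts over the measurement window.
good_requests = 998_700
total_requests = 1_000_000

# SLI: the measured indicator (fraction of requests that succeeded).
sli = good_requests / total_requests

# SLO: the internal target set on that indicator.
slo = 0.999
# SLA: the externally promised level (assumed looser than the SLO).
sla = 0.995

meets_slo = sli >= slo  # internal target missed
meets_sla = sli >= sla  # contractual promise still kept
```

In this window the service misses its own target without breaching the contract, which is the early warning an SLO is meant to provide.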

Error budget

An error budget is the allowable amount of unreliability over a window — 100% minus your SLO — that a team can "spend" before reliability work must take priority over new features.
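The "100% minus your SLO" arithmetic can be sketched as follows, expressing the budget as minutes of downtime. The helper name, the 30-day window, and the downtime figure are assumptions chosen for illustration.

```python
def error_budget_minutes(slo_percent: float, window_days: int) -> float:
    """Minutes of unreliability allowed in the window for a given SLO."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


# A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
budget = error_budget_minutes(99.9, 30)

# Hypothetical downtime already incurred in this window.
downtime_so_far = 12.0
remaining = budget - downtime_so_far

# When the remaining budget reaches zero, reliability work takes priority.
budget_exhausted = remaining <= 0
```

The results are floating-point values, so they are best compared with a small tolerance rather than for exact equality.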

Status page

A status page is a dedicated web page that publicly communicates the real-time health of a service and the status of any ongoing incidents to its users.

Uptime monitoring

Uptime monitoring is the practice of repeatedly checking a service's availability and response from outside the system, so you learn it's down the moment your users would.

Latency percentiles

A latency percentile such as p99 is the response time that 99% of requests complete at or under, which exposes the slow tail that averages hide.
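To see how a percentile exposes what an average hides, here is a minimal sketch using the nearest-rank method. That is one of several percentile definitions, and monitoring tools may interpolate or estimate differently. The sample latencies are invented.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# 100 hypothetical request latencies in milliseconds: mostly fast,
# with a handful of slow outliers.
latencies_ms = [20] * 95 + [25, 30, 40, 900, 2000]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
```

With this sample the median is 20 ms and the mean is just under 49 ms, yet p99 is 900 ms: one request in a hundred is far slower than either summary suggests.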

Incident management

Incident management is the structured process a team uses to detect, respond to, resolve, and learn from unplanned disruptions to a service.

Three pillars of observability

The three pillars of observability are metrics, logs, and traces — three complementary data types that together let you understand a system's internal state from its outputs.

Error tracking vs APM vs logging

Error tracking, APM, and logging are overlapping but distinct categories: error tracking captures exceptions, APM measures performance, and logging records the detailed event stream.