Observability (software)
In distributed systems, observability is the ability to collect data about programs' execution, modules' internal states, and the communication among components.[1][2] To improve observability, software engineers use a wide range of logging and tracing techniques to gather telemetry information, and tools to analyze and use it.
Etymology, terminology and definition
The term is borrowed from control theory, where the "observability" of a system measures how well its state can be determined from its outputs. Similarly, software observability measures how well a system's state can be understood from the obtained telemetry (metrics, logs, traces, profiling).
The definition of observability varies by vendor:
a measure of how well you can understand and explain any state your system can get into, no matter how novel or bizarre [...] without needing to ship new code
— "Observability Engineering" (2022)[3]
software tools and practices for aggregating, correlating and analyzing a steady stream of performance data from a distributed application along with the hardware and network it runs on
— IBM[4]
the ability to measure a system’s current state based on the data it generates, such as logs, metrics, and traces
— Dynatrace[5]
Observability is tooling or a technical solution that allows teams to actively debug their system. Observability is based on exploring properties and patterns not defined in advance.
— Google[6]
proactively collecting, visualizing, and applying intelligence to all of your metrics, events, logs, and traces—so you can understand the behavior of your complex digital system
— New Relic[7]
The term is frequently abbreviated to the numeronym o11y (where 11 stands for the number of letters between the first and last letters of the word). This is similar to other computer science abbreviations such as i18n and L10n.[8]
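The counting rule behind such numeronyms can be illustrated with a short Python sketch. The function name `numeronym` and the choice to leave very short words unabbreviated are assumptions made for this example, not part of any standard.

```python
def numeronym(word: str) -> str:
    """Abbreviate a word as first letter + number of inner letters + last letter."""
    if len(word) < 4:
        # Assumption for this sketch: words this short are left unchanged.
        return word
    return f"{word[0]}{len(word) - 2}{word[-1]}"


for term in ("observability", "internationalization", "localization"):
    print(term, "->", numeronym(term))
```

The function preserves the case of the input, so capitalised forms such as L10n arise only from how the original word is written.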
Observability vs. monitoring
Observability and monitoring are sometimes used interchangeably.[9] As tooling, commercial offerings and practices evolved in complexity, "monitoring" was re-branded as observability in order to differentiate new tools from the old.
The two terms are commonly contrasted: monitoring relies on predefined sets of telemetry,[6] and a monitored system may also be observable.[10]
Majors et al. suggest that engineering teams that only have monitoring tools end up relying on expert foreknowledge (seniority), whereas teams that have observability tools rely on exploratory analysis (curiosity).[3]
Telemetry types
Observability relies on three main types of telemetry data: metrics, logs and traces.[5][6][11] These are often referred to as the "pillars of observability".[12]
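As a rough illustration of how the three types differ, the following Python sketch produces one of each for a single request, using only the standard library. The function `handle_request`, the field names, and the dictionary layout are hypothetical; production systems typically use a dedicated telemetry library rather than hand-built records.

```python
import json
import time
import uuid

# Metric: a number aggregated across many requests.
request_count = 0


def handle_request(path: str) -> dict:
    """Handle one hypothetical request and return the telemetry it produced."""
    global request_count
    trace_id = uuid.uuid4().hex  # identifier shared by the log line and the span
    start = time.perf_counter()
    # ... the actual request handling would happen here ...
    duration = time.perf_counter() - start

    request_count += 1
    # Log: a timestamped record of a single event, here as structured JSON.
    log_line = json.dumps(
        {"ts": time.time(), "level": "info", "msg": "request handled",
         "path": path, "trace_id": trace_id}
    )
    # Trace span: one timed operation belonging to a trace.
    span = {"trace_id": trace_id, "name": "handle_request", "duration_s": duration}
    return {
        "metric": {"http_requests_total": request_count},
        "log": log_line,
        "span": span,
    }


telemetry = handle_request("/home")
print(telemetry["metric"])
print(telemetry["log"])
print(telemetry["span"])
```

Sharing the same `trace_id` between the log line and the span is one common way to correlate the different telemetry types after the fact.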
Metrics
Application developers choose what kind of metrics to instrument their software with before it is released. Examples of common metrics include:
- number of HTTP requests per second;
- total number of query failures;
- database size in bytes;
- time in seconds since last garbage collection.
Monitoring tools are typically configured to emit alerts when certain metric values exceed set thresholds. Thresholds are set based on knowledge about normal operating conditions and experience.
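Threshold-based alerting of this kind can be sketched in a few lines of Python. The `Threshold` class, the `check_thresholds` function, the metric names, and the limit values are all hypothetical and chosen only for illustration; real monitoring tools usually add conditions such as how long a value must stay above its limit.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str   # name of the metric to watch
    limit: float  # alert when the sampled value exceeds this limit


def check_thresholds(samples: dict[str, float], thresholds: list[Threshold]) -> list[str]:
    """Return one alert message per metric whose sampled value exceeds its limit."""
    alerts = []
    for t in thresholds:
        value = samples.get(t.metric)
        # Metrics that were not sampled are skipped rather than treated as errors.
        if value is not None and value > t.limit:
            alerts.append(f"ALERT: {t.metric} is {value}, above limit {t.limit}")
    return alerts


# Hypothetical samples, loosely following the example metrics listed above.
samples = {
    "http_requests_per_second": 1250.0,
    "query_failures_total": 3.0,
    "seconds_since_last_gc": 12.5,
}
thresholds = [
    Threshold("http_requests_per_second", 1000.0),
    Threshold("query_failures_total", 10.0),
]

for alert in check_thresholds(samples, thresholds):
    print(alert)
```

With these sample values only the request-rate threshold is exceeded, so a single alert is emitted.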
Metrics have limitations: when a previously unknown issue is encountered, new metrics cannot be added without shipping new code. Furthermore, high-cardinality metrics can quickly inflate the volume of telemetry data.
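The effect of cardinality on data volume can be estimated with simple arithmetic. The sketch below assumes a labelled metric in which every combination of label values produces its own time series, so the product of the per-label value counts is an upper bound. The label names and counts are invented for illustration.

```python
from math import prod


def series_count(label_values: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_values.values())


# Hypothetical labels on a single request counter.
base = {"method": 5, "status_code": 10, "endpoint": 50}
print(series_count(base))       # 5 * 10 * 50 = 2500

# Adding one high-cardinality label multiplies the total.
with_user = {**base, "user_id": 100_000}
print(series_count(with_user))  # 2500 * 100000 = 250000000
```

In practice not every combination occurs, so real totals are usually lower, but the multiplicative growth is what makes unbounded labels such as user identifiers expensive.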
Logs
Traces
Continuous profiling
Continuous profiling is another telemetry type used to precisely determine how an application consumes resources.[13]
"Pillars of observability"
Metrics, logs and traces are most commonly listed as the pillars of observability.[12] Majors et al. suggest that the pillars of observability are high cardinality, high dimensionality, and explorability.[3]
See also
Bibliography
- Boten, Alex; Majors, Charity (2022). Cloud-Native Observability with OpenTelemetry. Packt Publishing. ISBN 978-1-80107-190-1. OCLC 1314053525.
- Majors, Charity (2022). Observability engineering : achieving production excellence. Liz Fong-Jones, George Miranda (1st ed.). Sebastopol, CA: O'Reilly Media, Inc. ISBN 9781492076445. OCLC 1315555871.
- Sridharan, Cindy (2018). Distributed systems observability : a guide to building robust systems (1st ed.). Sebastopol, CA: O'Reilly Media, Inc. ISBN 978-1-4920-3342-4. OCLC 1044741317.
- Hausenblas, Michael (2023). Cloud Observability in Action. Manning. ISBN 9781633439597. OCLC 1359045370.
References
- Fellows, Geoff (1998). "High-Performance Client/Server: A Guide to Building and Managing Robust Distributed Systems". Internet Research. 8 (5). doi:10.1108/intr.1998.17208eaf.007. ISSN 1066-2243.
- Cantrill, Bryan (2006). "Hidden in Plain Sight: Improvements in the observability of software can help you diagnose your most crippling performance problems". Queue. 4 (1): 26–36. doi:10.1145/1117389.1117401. ISSN 1542-7730. S2CID 14505819.
- Majors, Charity (2022). Observability engineering : achieving production excellence. Liz Fong-Jones, George Miranda (1st ed.). Sebastopol, CA: O'Reilly Media, Inc. ISBN 9781492076445. OCLC 1315555871.
- "What is observability". IBM. Retrieved 9 March 2023.
- Livens, Jay (October 2021). "What is observability?". Dynatrace. Retrieved 9 March 2023.
- "DevOps measurement: Monitoring and observability". Google Cloud. Retrieved 9 March 2023.
- Reinholds, Amy. "What is observability?". New Relic. Retrieved 9 March 2023.
- "How Are Structured Logs Different from Events?". 26 June 2018.
- Hadfield, Ally (29 June 2022). "Observability vs. Monitoring: What's The Difference in DevOps?". Instana. Retrieved 15 March 2023.
- Kidd, Chrissy. "Monitoring, Observability & Telemetry: Everything You Need To Know for Observable Work". Retrieved 15 March 2023.
- "What is Observability? A Beginner's Guide". Splunk. Retrieved 9 March 2023.
- Sridharan, Cindy (2018). "Chapter 4. The Three Pillars of Observability". Distributed systems observability : a guide to building robust systems (1st ed.). Sebastopol, CA: O'Reilly Media, Inc. ISBN 978-1-4920-3342-4. OCLC 1044741317.
- "What is continuous profiling?". Cloud Native Computing Foundation. 31 May 2022. Retrieved 9 March 2023.