For some time, observability in IT operations has been associated with three data types that monitoring systems must ingest to be even somewhat effective: logs, metrics, and traces. Restricting the data consumed to these three types falls far short of the true needs of a present-day IT operations practice.

Because observability is deeply connected to causality, monitoring systems must be equipped to provide a causal analysis of the observed system. As we lean into true observability, IT operations practitioners charged with ensuring the wellbeing of enterprise systems need to observe those systems accurately through causal analysis.

IT System State Changes Make Up Digital Business Processes

Industry visionaries have identified the need for a change in the goal of monitoring IT systems — moving from monitoring to observing — as apprehensions rise around the limitations of traditional monitoring technologies. While many of these industry leaders believe the key to observability lies in logs, metrics, and traces, harnessing this broad array of data does not automatically lead to a true understanding of the IT system state changes that make up digital business processes.

Derived from mathematical control theory, observability hinges on the idea that we want to determine the actual sequence of state changes that a deterministic or stochastic mechanism goes through during a given time period. The difficulty is that we do not always have direct access to the mechanism to observe the state changes and document the sequence. Instead, we have to tap data or signals produced by the system and its surrounding environment and infer the state-change sequence through well-defined procedures. This means we can observe a mechanism accurately only if it and its environment produce a data set for which procedures exist that allow a precise conclusion about the state-change sequence.
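To make that condition concrete, here is a minimal sketch in Python. The three-state mechanism, its transition rules, and both signal functions are invented for illustration and do not come from any real monitoring product. The procedure simply enumerates every state sequence that obeys the transition rules and would have emitted the signals we saw; the mechanism is observable from those signals only when exactly one sequence survives.

```python
from itertools import product

# Hypothetical mechanism: allowed next states for each state (assumed for illustration).
TRANSITIONS = {
    "idle": ["idle", "busy"],
    "busy": ["busy", "degraded", "idle"],
    "degraded": ["degraded", "idle"],
}


def consistent_sequences(signals, emit):
    """Return every state sequence that obeys TRANSITIONS and emits `signals`."""
    states = list(TRANSITIONS)
    matches = []
    for candidate in product(states, repeat=len(signals)):
        follows_transitions = all(
            nxt in TRANSITIONS[cur] for cur, nxt in zip(candidate, candidate[1:])
        )
        emits_signals = all(emit(s) == sig for s, sig in zip(candidate, signals))
        if follows_transitions and emits_signals:
            matches.append(candidate)
    return matches


def coarse_signal(state):
    """A coarse signal that cannot tell 'busy' apart from 'degraded'."""
    return "ok" if state == "idle" else "loaded"


def rich_signal(state):
    """A richer signal with one distinct reading per state."""
    return {"idle": "low", "busy": "high-cpu", "degraded": "high-latency"}[state]


# The same underlying behaviour, seen through the two different signals.
coarse_matches = consistent_sequences(["ok", "loaded", "loaded", "ok"], coarse_signal)
rich_matches = consistent_sequences(
    ["low", "high-cpu", "high-latency", "low"], rich_signal
)

print(len(coarse_matches), "sequences fit the coarse signal")
print(len(rich_matches), "sequence fits the rich signal")
```

With the coarse signal, more than one state sequence fits the data, so the sequence cannot be pinned down no matter how much of that data is collected. With the richer signal, only one sequence remains. Brute-force enumeration is used here only because the example is tiny.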

In the past, monitoring systems were not built for observability, but rather for capturing, storing, and presenting data generated by the underlying IT systems. This meant that human operators were responsible for analyzing the data set and drawing conclusions about the IT system. While topology diagrams, CMDB data models, and other representations of the IT system aided in this process, they stood independent from the actual data ingested. At best, these models supplied context for the data produced by the monitoring system, and that context could inform modifications to the system.

Even with new technology that allows us to both ingest data and proactively identify patterns and anomalies in the data being produced, we still lack true observability into the systems at hand. This is because the patterns and anomalies are derived from the data sets themselves, rather than from insight into the system that generated the data. In short, these patterns and anomalies focus on correlation and departures from normality, and not on causal relationships around the actual state changes within the system.

Causality Vs. Correlation

Let’s take some time to look at the difference between causality and correlational normality through this example: Two events are captured as two data items: CPU utilization standing at 90 percent and end-user response time for a given application clocking in at three seconds. Whenever one occurs, so does the other.

The fact that the two events occur together shows correlational normality. It does not mean that they have a causal relationship.

For the two events to have a causal relationship, we would also have to show that an intervention lowering CPU usage to, let’s say, 80 percent in turn shortened the response time. We show causality by demonstrating that an intervention influencing one event results in a change in the other event without a second intervention.
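The distinction can be sketched with a toy simulation in Python. Everything in it is an assumption made for illustration: the two model names, the coefficients, and the hidden "load" variable are invented, not measured from any real system. In one model CPU usage actually drives response time; in the other, both are driven by a shared hidden load. The observed correlation is high in both, but pinning CPU to a chosen level (the intervention) changes response time only in the first.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def sample(model, rng, cpu_override=None):
    """Draw one (cpu_percent, response_seconds) pair from a toy system."""
    load = rng.uniform(0.0, 1.0)   # hidden request load
    noise = rng.gauss(0.0, 0.1)    # measurement noise on response time
    # Without an intervention, CPU follows the load; with one, it is pinned.
    cpu = 40 + 50 * load if cpu_override is None else cpu_override
    if model == "cpu_causes_latency":
        response = 0.5 + 0.05 * (cpu - 40) + noise   # response depends on CPU
    elif model == "shared_cause":
        response = 0.5 + 2.5 * load + noise          # response depends on load only
    else:
        raise ValueError(f"unknown model: {model}")
    return cpu, response


def observed_correlation(model, n=2000, seed=7):
    """Correlation between CPU and response time when we only watch the system."""
    rng = random.Random(seed)
    data = [sample(model, rng) for _ in range(n)]
    return pearson([c for c, _ in data], [r for _, r in data])


def intervention_effect(model, n=2000, seed=7):
    """Mean response time with CPU pinned at 90 minus mean with CPU pinned at 80."""
    def mean_response(cpu_level):
        rng = random.Random(seed)  # same seed, so both arms see identical load and noise
        total = sum(sample(model, rng, cpu_override=cpu_level)[1] for _ in range(n))
        return total / n
    return mean_response(90) - mean_response(80)


for model in ("cpu_causes_latency", "shared_cause"):
    print(model, observed_correlation(model), intervention_effect(model))
```

Passive observation cannot separate the two models, since both show CPU and response time moving together. Only the intervention reveals that lowering CPU shortens response time in one model and has no effect in the other.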

While most businesses will reject the idea of “conducting an experiment” on an IT system to prove causality, there is a way to gain insight into causality from the data generated by the system. In fact, system state changes are events that produce a causal relationship when a sequence of those events occurs. In turn, with the establishment of causality, we bring about a true understanding of system state changes. This is observability.
