What is "Observability" why is it necessary and what are the considerations in monitoring in cloud natives? - vol.2

What is "Observability"

Why is it necessary and what are the considerations in cloud-native monitoring? - vol.2

In the previous article, "What is "Observability" - Why is it necessary and what are the considerations in cloud-native monitoring? – vol.1", we described the challenges of monitoring cloud-native systems and introduced "Metrics" and "Logs" among the components of Observability.

In this article, we continue with "Traces", other Observability signals, and recommended OSS combinations.

What is tracing?

Tracing is the visualization of an entire request flow across multiple components, taking call dependencies into account. It is a concept and technique that becomes necessary in microservice architectures, where calls between components occur frequently.

The figure below shows how an invocation of service "A" triggers a chain of invocations of the other components "B," "C," "D," and "E," on which A depends internally.

Inclusion relationships of calls and processing times between components in microservices

When there is such a calling relationship, the processing time of the request forms the inclusion relationship shown on the right side of the figure. Tracing is the technique that visualizes this. Specifically, when a request is invoked, a unique ID called a "trace ID" is propagated between services, and the invocation information for each call is sent to a backend database.
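
As a concrete illustration of trace ID propagation, here is a minimal sketch using the OpenTelemetry Python SDK (the opentelemetry-sdk package); the service and span names are illustrative, and the exporter simply prints spans instead of sending them to a tracing backend.

    # Minimal sketch: propagating a trace ID between services with the OpenTelemetry Python SDK.
    # The downstream call is only shown as a comment; any HTTP client could carry the headers.
    from opentelemetry import trace
    from opentelemetry.propagate import inject
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

    trace.set_tracer_provider(TracerProvider())
    trace.get_tracer_provider().add_span_processor(
        BatchSpanProcessor(ConsoleSpanExporter())  # print spans locally; a real setup exports to a backend
    )
    tracer = trace.get_tracer("service-a")

    def call_service_b():
        with tracer.start_as_current_span("call-service-b"):
            headers = {}
            inject(headers)  # adds the W3C "traceparent" header carrying the current trace ID
            # e.g. requests.get("http://service-b.example/orders", headers=headers)
            print(headers)

    if __name__ == "__main__":
        with tracer.start_as_current_span("handle-request-a"):
            call_service_b()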

For example, there are limits to how much monitoring and logging alone can address the following issues:

  • It took time to isolate which of the multiple microservices (or DBs, or external connection services) was the root cause of the delay.
  • When orchestrating many containers, the number of components is large and multi-layered, making it difficult to keep track of resource status such as CPU utilization for individual components.

For these, a latency-based approach with tracing is a concrete solution.

When considering tracing, there are two main perspectives: visualization through graphs of call dependencies and distributed tracing.

Graphical visualization of call dependencies

In complex architectures, there is often no single source or destination of calls, making it difficult to organize information such as which service is calling which service.

Complex invocation dependencies in microservices

An effective approach to this is to visualize the call dependencies between services as a graph built from traces, and to map the number of requests, errors, and response times onto each call.

For example, the OSS "Kiali" can map RED-method metrics (request count, errors, and response time) aggregated from each Envoy proxy on the service mesh onto a service dependency graph, as shown in the figure below.

Visualization of Call Dependencies and Metrics Mapping of RED Methods with Kiali

Distributed Tracing

Distributed tracing is a technique for collecting a series of calls processed across distributed services and stitching them together into a single trace for visualization. The figure below illustrates the idea.

Example of identifying services where processing time is dominant using tracing

Assume a case in which a call to service "A" internally calls "B" and "E" in sequence, and B in turn calls "C" and "D" in sequence. In this case, the processing time of A encompasses the processing time of each downstream call, as shown on the right side of the figure. Even when the call to service A is slow, distributed tracing may show that the processing time of A itself is almost negligible, in which case the next step is to analyze which of the downstream calls B through E dominates the total time.

Distributed tracing makes it possible to isolate such bottlenecks on a processing time basis based on call dependencies.
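
The inclusion relationship in the figure can be sketched with nested spans. The example below again uses the OpenTelemetry Python SDK; the span names A through E and the sleep durations are illustrative stand-ins for real service calls.

    # Sketch of the call chain from the figure: A calls B then E, and B calls C then D.
    # In a real system each span is created in its own service and joined via the propagated trace ID.
    import time
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

    trace.set_tracer_provider(TracerProvider())
    trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    tracer = trace.get_tracer("demo")

    with tracer.start_as_current_span("A"):          # outermost span: total request time
        with tracer.start_as_current_span("B"):
            with tracer.start_as_current_span("C"):
                time.sleep(0.05)                     # pretend C does 50 ms of work
            with tracer.start_as_current_span("D"):
                time.sleep(0.30)                     # D dominates; tracing makes this visible
        with tracer.start_as_current_span("E"):
            time.sleep(0.02)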

For example, the OSS "Jaeger" can display the processing time of each service invoked in a transaction with distributed tracing, as shown in the figure below.

Distributed tracing with Jaeger

Other Observability signals

Metrics, logs, and traces were introduced as components of Observability, but other components may also be included.

The whitepaper also lists "Dump" and "Profile" as additional signals for acquiring information to analyze in more detail. In NewRelic's definition, "Event" is included to express what happened in the system or application and when (events are more summarized than logs; together with metrics, logs, and traces they are referred to as "MELT"). As you can see, this area is still being organized, and we will keep an eye on future developments.

Naturally, it is difficult to obtain all of this as a single signal. Overloading metrics with detailed information is costly, and tracing every operation with near-real-time latency is also expensive. Therefore, after acquiring each signal separately, it is necessary to analyze the signals in relation to each other based on metadata such as timestamps and the component from which they were acquired.
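
One common way to relate signals is to stamp each log line with the trace ID of the request being handled, so that a slow trace can later be matched to its logs. Below is a minimal sketch combining the OpenTelemetry Python SDK with the standard logging module; the logger name and log format are illustrative.

    # Sketch: attaching the current trace ID to log records so logs and traces can be correlated later.
    import logging
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider

    trace.set_tracer_provider(TracerProvider())
    tracer = trace.get_tracer("service-a")
    logging.basicConfig(format="%(asctime)s %(levelname)s trace_id=%(trace_id)s %(message)s",
                        level=logging.INFO)
    log = logging.getLogger("service-a")

    def handle_order():
        with tracer.start_as_current_span("handle-order"):
            ctx = trace.get_current_span().get_span_context()
            # render the 128-bit trace ID as the 32-character hex string that tracing backends display
            log.info("order accepted", extra={"trace_id": format(ctx.trace_id, "032x")})

    handle_order()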

Recommended OSS combination patterns for obtaining each signal of Observability

The author has picked several OSS combination patterns that make it easy to obtain each Observability signal and summarized them in the table below.

Recommended implementation patterns for Observability by OSS (* colored cells indicate the parts changed between Configurations 1 and 2)

The following are brief notes on the author's reasoning behind the points picked up for each signal.

Metrics

  • Recommended Configuration 1: Prometheus and Grafana, the de facto OSS configuration for metrics collection (a sketch of exposing application metrics for Prometheus to scrape follows this list).
  • Recommended Configuration 2: Using "OpenTelemetry" for collection makes it possible to switch to a backend other than Prometheus*, with the added advantage that metrics collection is lighter weight than Prometheus itself.
  • * For example, when Prometheus is run on its own and storing large volumes of metrics for long periods strains its performance, or when you want to improve availability, and you therefore want to send metrics to a highly available, horizontally scaling Prometheus-compatible OSS backend or SaaS.
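
To make the application side of these configurations concrete, here is a minimal, hypothetical sketch that exposes RED-style metrics with the prometheus_client Python library; the metric names, label values, and port are illustrative, and in Recommended Configuration 2 an OpenTelemetry collector could gather equivalent data instead.

    # Sketch: exposing RED-style metrics for Prometheus to scrape (names and port are illustrative).
    import random
    import time
    from prometheus_client import Counter, Histogram, start_http_server

    REQUESTS = Counter("app_requests_total", "Total requests", ["status"])
    LATENCY = Histogram("app_request_duration_seconds", "Request duration in seconds")

    def handle_request():
        with LATENCY.time():                        # records request duration
            time.sleep(random.uniform(0.01, 0.2))   # pretend work
            status = "500" if random.random() < 0.05 else "200"
            REQUESTS.labels(status=status).inc()

    if __name__ == "__main__":
        start_http_server(8000)   # metrics served at http://localhost:8000/metrics
        while True:               # loop forever so the endpoint keeps updating
            handle_request()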

Incidentally, besides OSS there are managed monitoring services provided by cloud vendors ("AWS CloudWatch," "Google Cloud Monitoring," etc.) and paid APM SaaS offerings (NewRelic, DataDog, Dynatrace, etc.). Among the monitoring services from cloud vendors, not only fully managed proprietary services but also managed Prometheus services are available (Amazon Managed Service for Prometheus, Google Cloud Managed Service for Prometheus).

These managed Prometheus services free you from the tedious operation and management of Prometheus: operational tasks such as upgrades and backups become easy, and they can scale out and increase availability as the monitored system grows.

Log

  • Recommended Configuration 1: Elastic Stack configuration. Full-text search enables multifaceted aggregation and filtering. This is a good choice if you are already familiar with "Elasticsearch" and "Kibana".
  • Recommended Configuration 2: Loki stack configuration. Lightweight and able to handle large volumes of logs. Developed by Grafana Labs, so it integrates seamlessly with Grafana and Tempo (e.g., viewing Prometheus metrics, Jaeger traces, and logs together in Grafana). A short sketch of structured log output that either stack can ingest follows this list.
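
Regardless of which stack is chosen, structured (JSON) log lines are easy for Elasticsearch or Loki pipelines to parse. Below is a minimal sketch using only the Python standard library; the field names and service label are illustrative.

    # Sketch: emitting one JSON object per log line, which Elasticsearch/Loki pipelines can parse directly.
    import json
    import logging
    import sys
    import time

    class JsonFormatter(logging.Formatter):
        def format(self, record):
            return json.dumps({
                "ts": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(record.created)),
                "level": record.levelname,
                "service": "service-a",          # illustrative component label
                "message": record.getMessage(),
            })

    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("service-a")
    log.addHandler(handler)
    log.setLevel(logging.INFO)

    log.info("order accepted")
    log.error("payment gateway timed out")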

Trace

As mentioned earlier, tracing has two types of representations: dependency graphs and distributed tracing.

The OSS that displays the dependency graph is Kiali; the information visualized by Kiali is based on RED metrics published by Envoy in the mesh.

The perspectives on distributed tracing are:

  • Recommended Configuration 1: Jaeger with Elasticsearch as its storage backend. In Jaeger's FAQ, the Jaeger team recommends Elasticsearch for storage in terms of functionality and performance.
  • Recommended Configuration 2: Because operating Elasticsearch or Cassandra as trace storage is burdensome, Tempo can be installed instead, so that object storage such as Amazon S3 or Google Cloud Storage can be used as the backend.

Challenges facing Observability at this time and approaches to addressing them

The whitepaper describes the challenges of implementing Observability and the approaches taken to address them. The following is a selection of some of them.

Difficulties that require handling multiple signals

As mentioned above regarding the signals that make up Observability, it is very difficult to collect them as a single signal, so multiple signals must be collected. Setting up and operating each of the OSS tools mentioned so far, one per Observability signal, involves a certain amount of difficulty, since each has a different technology stack, storage system, and installation method.

Larger organizations may have dedicated teams to install, manage, and maintain the OSS for each Observability signal, or they may turn to APM SaaS. OpenTelemetry is currently advancing an approach to this challenge.

    OpenTelemetry: standardization and integration of collection

    • A specification and implementation for instrumenting, converting, and transferring multiple signals, including metrics and traces.
    • An open specification that standardizes how telemetry data is collected and transmitted to backends.
    • Compared with using Prometheus alone, the following advantages can be expected for inputs and outputs other than Prometheus:

      • Support for multiple telemetry data formats, including Jaeger as well as Prometheus (so telemetry data can be collected by a single collector).
      • Data can be sent to multiple OSS/commercial backends, not just Prometheus (AWS CloudWatch, NewRelic, DataDog, "Splunk", etc.); see the sketch after this list.
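
As an illustration of how OpenTelemetry decouples instrumentation from the backend, below is a minimal sketch using the OpenTelemetry Python SDK. It assumes the opentelemetry-sdk package; the commented-out production path assumes the OTLP exporter package and a collector endpoint, both placeholders here. Only the exporter changes when the backend changes.

    # Sketch: the backend-specific part is the exporter; swapping it redirects telemetry
    # without touching the instrumentation. The collector endpoint below is a placeholder.
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
    # from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

    provider = TracerProvider()
    # Local development: print spans to stdout.
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    # Production (assumption: an OpenTelemetry Collector listening for OTLP/gRPC at localhost:4317):
    # provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True)))
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer("service-a")
    with tracer.start_as_current_span("checkout"):
        pass  # instrumented code stays the same regardless of the exporter chosen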

Generation of metrics from logs and traces

Another approach is to generate metrics from logs and traces. Some OSS already implement such mechanisms; the following are examples.

  • Generating metrics from traces
    Generate RED metrics from trace information in the "Processor" layer of OpenTelemetry, the layer that can process acquired data.
  • Generating metrics from logs
    Aggregate logs into metrics with a Loki query (Reference)

For example, logs could be used to generate metrics on the number and rate of errors at regular intervals. Since alerting on every single error that appears in the logs can be overwhelming, it is more meaningful to alert on a windowed target such as a 10-minute error rate.
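
As a language-level illustration of the same idea (independent of Loki), the sketch below folds a window of raw log lines into an error rate and applies a threshold; the log format, window contents, and 10% threshold are illustrative.

    # Sketch: deriving an error-rate metric from raw log lines over a fixed window,
    # the same idea a Loki query or an OpenTelemetry processor applies at the pipeline level.
    from collections import Counter

    WINDOW_LOGS = [
        "2024-01-01T10:00:01 INFO order accepted",
        "2024-01-01T10:00:02 ERROR payment gateway timed out",
        "2024-01-01T10:00:03 INFO order accepted",
        "2024-01-01T10:00:04 INFO order accepted",
    ]

    def error_rate(lines):
        counts = Counter("ERROR" if " ERROR " in line else "OTHER" for line in lines)
        total = sum(counts.values())
        return counts["ERROR"] / total if total else 0.0

    rate = error_rate(WINDOW_LOGS)
    print(f"errors={rate:.1%} of {len(WINDOW_LOGS)} log lines in this window")
    if rate > 0.10:  # alert on a windowed rate rather than on every single error line
        print("ALERT: error rate above 10% for this window")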

Summary

In this blog, we provided an overview of Observability, its components, and considerations.

The concept and implementation of Observability, keeping "what is happening, when, and where" observable, is very important when building a distributed, scaled-out cloud-native architecture. On the other hand, implementing it without omissions requires design and implementation from a variety of perspectives.

There are several OSS projects around Observability, and it is an area that will continue to be updated, so keep an eye on it.

We hope these articles will help you become more aware of Observability. Please stay tuned as we continue to delve deeper into each of the elements that make up Observability.

Related Links

Original Article