Chapter 6.  OpenShift Monitoring

Abstract

Goal

Troubleshoot performance and availability issues with applications and clusters.

Sections
  • Cluster Monitoring (and Guided Exercise)

  • Alerts and Notifications (and Guided Exercise)

Lab
  • OpenShift Monitoring

Cluster Monitoring

Objectives

  • Describe the architecture of OpenShift Monitoring and query the information in its dashboards.

OpenShift Observability

In the cloud computing era, systems are becoming ever more complex. All organizations aim for reliable, efficient, and secure applications and infrastructure. Observability plays a key role in achieving this goal. Observability is the ability to understand a system's or an application's state by collecting and analyzing its output and logs.

Red Hat OpenShift Observability collects logs, traces, events, and system metrics to provide real-time monitoring. Real-time monitoring helps to identify and troubleshoot issues for OpenShift applications and clusters.

Some key features of the Red Hat OpenShift Observability portfolio are as follows:

OpenShift Logging

Red Hat OpenShift Logging aggregates all the logs from the pods and nodes of an OpenShift cluster to a centralized location. Centralized logging improves searching, visualizing, and reporting of data.

OpenShift Monitoring

Red Hat OpenShift Monitoring provides monitoring for core platform components.

Network Observability

Network observability monitors and analyzes network traffic and helps to resolve connectivity issues.

Distributed Tracing

Distributed tracing collects observability data in distributed systems. Distributed tracing is based on the OpenTelemetry project.

For more details about network observability and distributed tracing, refer to the References section.

You can use these features with any stand-alone OpenShift cluster. Red Hat Advanced Cluster Management for Kubernetes (RHACM) provides multicluster observability features.

OpenShift Monitoring

Red Hat OpenShift Container Platform comes with a monitoring stack. The cluster monitoring operator manages the monitoring components and ensures that they are always available and updated. The default monitoring stack collects metrics and generates alerts for the cluster, which includes core platform components and all projects. However, the default monitoring stack does not support custom metrics for user-defined projects.

You can enable monitoring for user-defined projects. You can collect custom metrics and generate alerts by using the monitoring stack for user-defined projects.

You can configure persistent storage for OpenShift monitoring. With this configuration, you can keep a record of the past cluster status, to investigate and correlate current and past issues within the cluster. The monitoring dashboard provides visuals for cluster metrics.

OpenShift Monitoring Stack

The OpenShift monitoring stack is based on the Prometheus open source project. The following figure shows the components of the stack, and the sections after it explain them:

Figure 6.1: OpenShift monitoring architecture

Components for Default Monitoring Stack

The default monitoring stack is included with OpenShift Container Platform. All components of the default monitoring stack are installed in the openshift-monitoring project and provide monitoring features for core platform components.

Modifying existing resources or creating additional ServiceMonitor, PodMonitor, or PrometheusRule resources in the openshift-monitoring project is not supported. The monitoring stack resets modified resources to ensure that its resources always remain in the expected state.

The monitoring stack deploys the following components in the environment for monitoring the infrastructure, receiving alerts, and consulting performance graphs.

Cluster Monitoring Operator

The cluster monitoring operator is the central component of the monitoring stack. The cluster monitoring operator controls the deployed monitoring components and ensures that they are always in sync with the latest version of the cluster monitoring operator.

Prometheus Operator

The Prometheus operator deploys and configures both Prometheus and Alertmanager. The operator also manages the generation of configuration targets (service monitors and pod monitors).

Prometheus

Prometheus is the monitoring server.

Prometheus Adapter

The Prometheus adapter exposes cluster resource metrics, such as CPU and memory usage, through an API that Horizontal Pod Autoscaling (HPA) can consume.

Prometheus Alertmanager

Alertmanager handles alerts from the Prometheus server. An alert is a rule that evaluates to true or false and is often based on cluster observations, such as cluster CPU utilisation. An alert fires when its rule evaluates to true. You can configure Alertmanager to group alerts and route them to receivers.

Kube state metrics

The kube-state-metrics converter agent exports Kubernetes objects to metrics that Prometheus can parse.

OpenShift state metrics

The openshift-state-metrics agent is based on the kube-state-metrics agent and adds monitoring for OpenShift-specific resources (such as image registry metrics).

Node exporter

The node-exporter agent exports low-level metrics for compute nodes.

Thanos Querier

Thanos Querier is a single, multitenant interface that enables aggregating and deduplicating cluster and user workload metrics.

Telemeter Client

Telemeter Client sends a subset of the data from Prometheus instances to Red Hat for remote health monitoring.

The monitoring web console

The OpenShift Container Platform web console provides the Observe section to access and manage monitoring features. In the Observe section, you can access monitoring dashboards, metrics, alerts, and metrics targets.

Components for Monitoring User-defined Projects

You can enable monitoring for user-defined projects. You can collect application-specific custom metrics and also create alerts by using custom metrics with this monitoring stack. The monitoring components for user-defined projects are installed in the openshift-user-workload-monitoring project.

The monitoring stack for user-defined projects deploys the following components:

  • Prometheus Operator

  • Prometheus

  • Thanos Ruler

  • Alertmanager

For more details about the OpenShift monitoring stack for user-defined projects, see the References section.

Prometheus

Prometheus is an open source project for system monitoring and alerting.

Both Red Hat OpenShift Container Platform and Kubernetes integrate Prometheus to enable cluster metrics, monitoring, and alerting capabilities.

Prometheus gathers and stores streams of data from the cluster as time-series data. Time-series data consists of a sequence of samples, where each sample contains the following elements:

  • A timestamp

  • A numeric value (such as an integer, float, or Boolean)

  • A set of labels in the form of key/value pairs. The key/value pairs isolate groups of related values for filtering.

For example, the machine_cpu_cores metric in Prometheus contains a sequence of measurement samples of the number of CPU cores for each machine.
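
The sample structure described above can be pictured with a minimal Python sketch. The Sample class, the select helper, the label names, and all values are illustrative assumptions. They are not Prometheus internals or real cluster output.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    """One point in a time series: a timestamp, a numeric value, and labels."""
    timestamp: float                              # seconds since the Unix epoch
    value: float                                  # numeric measurement
    labels: dict = field(default_factory=dict)    # key/value pairs


# Hypothetical samples for the machine_cpu_cores metric (values are invented).
samples = [
    Sample(1700000000.0, 8.0, {"__name__": "machine_cpu_cores", "node": "worker01"}),
    Sample(1700000000.0, 4.0, {"__name__": "machine_cpu_cores", "node": "worker02"}),
    Sample(1700000030.0, 8.0, {"__name__": "machine_cpu_cores", "node": "worker01"}),
]


def select(all_samples, **wanted):
    """Return the samples whose labels contain every wanted key/value pair."""
    return [
        s for s in all_samples
        if all(s.labels.get(key) == val for key, val in wanted.items())
    ]


worker01 = select(samples, node="worker01")
print([(s.timestamp, s.value) for s in worker01])
```

Filtering on the node label returns only the samples that belong to one series, which is the role that the key/value pairs play in a query.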

OpenShift Monitoring Web Console

You can access and manage the monitoring features by using the web console. The OpenShift Container Platform web console provides the Observe section, with the following subsections:

  • Alerting

  • Metrics

  • Dashboards

  • Targets

Cluster Monitoring Alerting

You access cluster alerts from the OpenShift web console at Observe → Alerting. For each alert, the Alerting page displays a brief description, the state, and the severity. You can view alert details by clicking the name of the alert. The Alert details page also displays a time-series graph.

Figure 6.2: Cluster monitoring alerting

For more details about forwarding alerts to other systems, see the next section.
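
As described earlier, an alert is a rule that evaluates to true or false, and Alertmanager can group the alerts that fire. The following Python sketch illustrates only that idea. The alert name, severity, threshold, and utilisation values are hypothetical, and the code does not reproduce how Prometheus or Alertmanager are implemented.

```python
from collections import defaultdict

# Hypothetical per-node CPU utilisation observations (fractions of 1.0).
observations = {"worker01": 0.42, "worker02": 0.97, "worker03": 0.93}

# Hypothetical alert rule: true when CPU utilisation exceeds the threshold.
THRESHOLD = 0.90


def rule_is_true(cpu_utilisation):
    """Evaluate the alert rule for one observation."""
    return cpu_utilisation > THRESHOLD


# An alert fires for every observation where the rule evaluates to true.
firing = [
    {"alertname": "HighCpuUtilisation", "severity": "warning", "node": node}
    for node, cpu in sorted(observations.items())
    if rule_is_true(cpu)
]

# Group firing alerts by a label so that one notification covers related alerts.
groups = defaultdict(list)
for alert in firing:
    groups[alert["alertname"]].append(alert["node"])

for name, nodes in groups.items():
    print(f"{name}: {nodes}")
```

Grouping by a shared label is one way to avoid sending a separate notification for every affected node.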

Cluster Monitoring Metrics

OpenShift integrates Prometheus metrics at Observe → Metrics.

From the Metrics page, enter an expression, such as a metric name, and then click Run Queries to retrieve the most recent sample for the metric.

The following example displays the instance:node_cpu_utilisation:rate1m metric over time.

Figure 6.3: Monitoring metrics

The metric contains data for each node instance in the cluster.
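
The same kind of expression can also be sent to the Prometheus HTTP API, which provides an instant-query endpoint at /api/v1/query. The following Python sketch only builds such a request and does not send it. The host name and token are placeholders. How the API is exposed on a given cluster, and which credentials it needs, are assumptions to verify against the product documentation.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumptions: the host name and token below are placeholders, not real values.
BASE_URL = "https://thanos-querier.example.com"
TOKEN = "REPLACE_WITH_BEARER_TOKEN"


def build_instant_query(expression):
    """Build (but do not send) a request for the instant-query endpoint."""
    url = f"{BASE_URL}/api/v1/query?{urlencode({'query': expression})}"
    return Request(url, headers={"Authorization": f"Bearer {TOKEN}"})


request = build_instant_query("instance:node_cpu_utilisation:rate1m")
print(request.full_url)
```

The query parameter is URL-encoded because metric names and label matchers contain characters, such as colons and braces, that are not safe in a URL.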

OpenShift has three monitoring stack components that gather these metrics: the kube-state-metrics and openshift-state-metrics agents, which generate metrics from API objects, and the node-exporter agent, which exports node-level metrics.

The dashboards in OpenShift cluster monitoring combine metrics from the three agents.

See the References section to learn about the complete list of exposed metrics.

Prometheus provides a query language, PromQL, to select and aggregate time-series data.

You can filter a metric to include only certain key/value pairs. For example, you can modify the previous query to show only metrics for the worker02 node by using the following expression:

instance:node_cpu_utilisation:rate1m{instance="worker02"}
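
To illustrate what a label matcher does, the following Python sketch applies PromQL-style matchers to a small set of invented samples. The matches helper and the sample values are hypothetical. PromQL uses RE2 regular expressions, so Python's re module is only an approximation for simple patterns.

```python
import re

# Hypothetical instant-vector samples: (labels, value). Values are invented.
vector = [
    ({"instance": "master01"}, 0.31),
    ({"instance": "worker01"}, 0.12),
    ({"instance": "worker02"}, 0.58),
]


def matches(labels, name, op, pattern):
    """Apply one PromQL-style label matcher to a label set."""
    value = labels.get(name, "")
    if op == "=":
        return value == pattern
    if op == "!=":
        return value != pattern
    # Regex matchers are fully anchored: the pattern must match the whole value.
    found = re.fullmatch(pattern, value) is not None
    if op == "=~":
        return found
    if op == "!~":
        return not found
    raise ValueError(f"unknown matcher operator: {op}")


only_worker02 = [(l, v) for l, v in vector if matches(l, "instance", "=", "worker02")]
all_workers = [(l, v) for l, v in vector if matches(l, "instance", "=~", "worker.*")]
print(only_worker02)
print(len(all_workers))
```

The equality matcher keeps a single series, whereas the regular expression matcher keeps every series whose label value matches the whole pattern.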

Prometheus Query Language provides several operators to compute new time-series metrics. PromQL contains arithmetic operators, including addition, subtraction, multiplication, and division operators. PromQL contains comparison operators, including equality, greater-than, and less-than operators.

PromQL contains built-in functions, including the following ones, that you can include in PromQL expressions:

sum()

Adds the value of all sample entries at a given time.

rate()

Computes the per-second average rate of increase of a time series over a given time range.

count()

Counts the number of sample entries at a given time.

max()

Selects the maximum value out of the sample entries.
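
The following Python sketch mirrors these four functions on invented data. It is a simplification: the real rate() function also handles counter resets and extrapolates to the boundaries of the time range, which the hypothetical simple_rate helper does not.

```python
# Hypothetical instant vector: one current value per node (values are invented).
instant = {"worker01": 3.0, "worker02": 5.0, "worker03": 4.0}

total = sum(instant.values())      # like sum(): adds all sample values
entries = len(instant)             # like count(): number of sample entries
largest = max(instant.values())    # like max(): largest sample value
print(total, entries, largest)


def simple_rate(points):
    """Per-second average increase between the first and last point.

    Simplification: no counter-reset handling and no extrapolation.
    """
    (t_first, v_first), (t_last, v_last) = points[0], points[-1]
    return (v_last - v_first) / (t_last - t_first)


# Hypothetical counter samples over a 60-second range: (timestamp, value).
window = [(0.0, 100.0), (30.0, 160.0), (60.0, 220.0)]
per_second = simple_rate(window)
print(per_second)
```

The first three functions aggregate across series at one point in time, whereas rate() works along one series over a range of time.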

The following examples of Prometheus Query Language expressions use one metric from the node-exporter agent, and another metric from the kube-state-metrics agent:

node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes*100<50

Shows nodes with less than 50% of available memory.

kube_persistentvolumeclaim_status_phase{phase="Pending"} == 1

Shows persistent volume claims in the pending state.
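
The arithmetic in the first expression can be traced with a short Python sketch. The node names and byte values are invented for illustration. The sketch assumes that both metrics carry the same labels, so each available-memory sample is paired with the total-memory sample of the same node.

```python
GIB = 1024 ** 3

# Hypothetical per-node values in bytes (invented for illustration).
mem_available = {"worker01": 12 * GIB, "worker02": 3 * GIB, "worker03": 6 * GIB}
mem_total = {"worker01": 16 * GIB, "worker02": 16 * GIB, "worker03": 16 * GIB}

# Division pairs samples whose labels match (here, the node name), then * 100.
percent_available = {
    node: mem_available[node] / mem_total[node] * 100
    for node in mem_available
    if node in mem_total
}

# The comparison operator acts as a filter: only matching samples remain.
low_memory = {node: pct for node, pct in percent_available.items() if pct < 50}
print(low_memory)
```

The comparison does not return true or false for each node. It removes the samples that do not satisfy the condition, which is why the query result lists only the affected nodes.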

The Red Hat add-on operators can define extra metrics and alerts. For example, the compliance operator exposes additional metrics to Prometheus. You can get a list of exposed metrics by using the following query:

{__name__=~"compliance.*"}

Cluster Monitoring Dashboards

The OpenShift console integrates dashboards based on the gathered metrics at Observe → Dashboards. These dashboards refresh periodically to display current summary metrics and graphs.

The graphs are interactive, so you can further explore the data features and characteristics that you observe.

The cluster monitoring dashboards serve as a good starting point for near real-time observability of cluster metrics and health.

After receiving an alert, an administrator might use the dashboards to investigate the problem. This investigation might include determining whether a specific node or project has a problem. Additionally, cluster monitoring dashboards can help identify whether a problem was temporary or appears to be persistent.

OpenShift cluster monitoring includes several default dashboards.

Figure 6.4: Monitoring dashboards

Some of the default monitoring dashboards are as follows:

Kubernetes / Compute Resources / Cluster

This dashboard displays a high-level view of cluster resources. The Kubernetes / Compute Resources / Cluster dashboard page shows percentage values for CPU such as CPU Utilisation, CPU Requests Commitment, and CPU Limits Commitment. Similar values are also available for memory.

Figure 6.5: Kubernetes / Compute Resources / Cluster dashboard

You can see metrics by clicking Inspect for each parameter. Clicking Inspect shows the Metrics page where you can see metrics and a related graph.

For example, clicking Inspect for CPU Utilisation shows a graph and values for the following metrics:

cluster:node_cpu:ratio_rate5m{cluster=""}

Figure 6.6: Metrics CPU utilisation

The Kubernetes / Compute Resources / Cluster dashboard page also shows graphs for CPU, memory, and network, such as CPU Usage, CPU Quota, Memory Usage, and Memory Quota.

These graphs are common to some dashboard pages. The only difference is data filtration. For example, the Kubernetes / Compute Resources / Namespace (Workloads) dashboard filters resource usage, first by namespace and then by workload type, such as by deployment, daemon set, and stateful set.

USE Method / Cluster

USE stands for Utilisation, Saturation, and Errors. This dashboard displays several graphs to identify whether the cluster is overutilised, oversaturated, or experiencing many errors. Because the dashboard displays all nodes in the cluster, you might be able to identify a node that is not behaving in the same way as the other nodes in the cluster.

The following graphic indicates that the worker03 node is experiencing higher memory saturation than other nodes in the cluster.

Figure 6.7: USE Method / Cluster - Memory Saturation
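
The visual comparison between nodes can also be expressed as a small calculation. The following Python sketch flags nodes whose memory saturation is well above the cluster median. The readings and the margin are invented, and the median-plus-margin test is only one possible heuristic, not a method that the dashboard itself uses.

```python
from statistics import median

# Hypothetical memory-saturation readings per node (invented values).
saturation = {"worker01": 0.04, "worker02": 0.05, "worker03": 0.62}

# Assumption: a node is "unusual" when it exceeds the cluster median by a
# fixed margin. The margin is arbitrary and only for illustration.
MARGIN = 0.25

typical = median(saturation.values())
outliers = sorted(
    node for node, value in saturation.items() if value - typical > MARGIN
)
print(outliers)
```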

References

Red Hat OpenShift Observability

Network Observability

Observability Across OpenShift Cluster Boundaries with Distributed Data Collection

For more information about OpenShift monitoring, refer to the Monitoring chapter in the Red Hat OpenShift Container Platform 4.14 Observability documentation at https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/html-single/monitoring/index#monitoring-overview

For more information about enabling monitoring for user-defined projects, refer to the Enabling Monitoring for User-defined Projects chapter in the Red Hat OpenShift Container Platform 4.14 Observability documentation at https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/html-single/monitoring/index#enabling-monitoring-for-user-defined-projects

Prometheus Overview

Prometheus Data Model

Node Exporter

Exposed Metrics - kube-state-metrics

Exposed Metrics - openshift-state-metrics

Querying Prometheus

Revision: do380-4.14-397a507