Abstract
Goal: Troubleshoot performance and availability issues with applications and clusters.
Describe the architecture of OpenShift Monitoring and query the information in its dashboards.
In the cloud computing era, systems are becoming ever more complex. All organizations aim for reliable, efficient, and secure applications and infrastructure. Observability plays a key role in achieving this goal. Observability is the ability to understand a system's or an application's state by collecting and analyzing its output and logs.
Red Hat OpenShift Observability collects logs, traces, events, and system metrics to provide real-time monitoring. Real-time monitoring helps to identify and troubleshoot issues for OpenShift applications and clusters.
Some key features of the Red Hat OpenShift Observability portfolio are as follows:
Red Hat OpenShift Logging aggregates all the logs from the pods and nodes of an OpenShift cluster to a centralized location. Centralized logging improves searching, visualizing, and reporting of data.
Red Hat OpenShift Monitoring provides monitoring for core platform components.
Network observability monitors and analyzes network traffic and helps to resolve connectivity issues.
Distributed tracing collects observability data in distributed systems. Distributed tracing is based on the OpenTelemetry project.
For more details about network observability and distributed tracing, refer to the References section.
You can use these features with any stand-alone OpenShift cluster. Red Hat Advanced Cluster Management for Kubernetes (RHACM) provides multicluster observability features.
Red Hat OpenShift Container Platform comes with a monitoring stack. The cluster monitoring operator manages the monitoring components and ensures that they are always available and updated. The default monitoring stack collects metrics and generates alerts for the cluster, which includes core platform components and all projects. However, the default monitoring stack does not support custom metrics for user-defined projects.
You can enable monitoring for user-defined projects. You can collect custom metrics and generate alerts by using the monitoring stack for user-defined projects.
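Enabling this feature is a configuration change on the cluster monitoring ConfigMap. A minimal sketch, following the documented approach, looks like the following (applied in the openshift-monitoring project):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    # Deploys the monitoring stack for user-defined projects in the
    # openshift-user-workload-monitoring project.
    enableUserWorkload: true
```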
You can configure persistent storage for OpenShift monitoring. With this configuration, you can keep a record of the past cluster status, to investigate and correlate current and past issues within the cluster. The monitoring dashboard provides visuals for cluster metrics.
The OpenShift monitoring stack is based on the Prometheus open source project. The stack includes the components that are shown in the following figure and are then explained in the following section:
The default monitoring stack is included with OpenShift Container Platform.
All components of the default monitoring stack are installed in the openshift-monitoring project and provide monitoring features for core platform components.
Modifying any existing resource and creating additional ServiceMonitor, PodMonitor, or PrometheusRule resources in the openshift-monitoring project is not supported.
The monitoring stack resets modified resources to ensure that its resources always remain in the expected state.
The monitoring stack deploys the following components in the environment for monitoring the infrastructure, receiving alerts, and consulting performance graphs.
The cluster monitoring operator is the central component of the monitoring stack. The cluster monitoring operator controls the deployed monitoring components and ensures that they are always up to date and in the expected state.
The Prometheus operator deploys and configures both Prometheus and Alertmanager. The operator also manages the generation of configuration targets (service monitors and pod monitors).
Prometheus is the monitoring server.
The Prometheus adapter exposes cluster resources for Horizontal Pod Autoscaling (HPA).
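As an illustration of how these exposed resource metrics are consumed, the following HorizontalPodAutoscaler sketch scales a hypothetical example-app deployment on CPU utilization (the workload name and thresholds are assumptions for the example):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app            # hypothetical workload name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 75 # add pods when average CPU use exceeds 75%
```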
Alertmanager handles alerts from the Prometheus server.
An alert is a rule that evaluates to true or false, often based on cluster observations such as cluster CPU utilization.
An alert fires when its rule evaluates to true.
You can configure Prometheus Alertmanager to group and route the alerts to the receiver.
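A minimal sketch of such grouping and routing in the Alertmanager configuration might look like this (the receiver names and webhook URL are hypothetical):

```yaml
route:
  group_by: ['alertname', 'namespace']  # combine related alerts into one notification
  receiver: default
  routes:
  - matchers:
    - severity = "critical"             # send critical alerts to a dedicated receiver
    receiver: ops-pager
receivers:
- name: default
- name: ops-pager
  webhook_configs:
  - url: https://pager.example.com/hook # hypothetical notification endpoint
```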
The kube-state-metrics converter agent exports Kubernetes objects to metrics that Prometheus can parse.
The openshift-state-metrics agent is based on the kube-state-metrics agent and adds monitoring for OpenShift-specific resources (such as image registry metrics).
The node-exporter agent exports low-level metrics for compute nodes.
Thanos Querier is a single, multitenant interface that enables aggregating and deduplicating cluster and user workload metrics.
Telemeter Client sends a subset of the data from the Prometheus instances to Red Hat for remote health monitoring.
The OpenShift Container Platform web console provides the Observe section to access and manage monitoring features. In the Observe section, you can access monitoring dashboards, metrics, alerts, and metrics targets.
You can enable monitoring for user-defined projects.
You can collect application-specific custom metrics and also create alerts by using custom metrics with this monitoring stack.
The monitoring components for user-defined projects are installed in the openshift-user-workload-monitoring project.
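An alerting rule for a user-defined project is expressed as a PrometheusRule resource. The following is a hedged sketch; the project, the rule name, and the http_requests_errors_total metric are hypothetical:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-alerts       # hypothetical rule name
  namespace: example-app     # a user-defined project
spec:
  groups:
  - name: example
    rules:
    - alert: HighErrorRate
      # Fires when the expression stays true for 5 minutes.
      expr: rate(http_requests_errors_total[5m]) > 0.1
      for: 5m
      labels:
        severity: warning
```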
The monitoring stack for user-defined projects deploys the following components:
Prometheus Operator
Prometheus
Thanos Ruler
Alertmanager
For more details about the OpenShift monitoring stack for user-defined projects, see the References section.
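To have this stack scrape an application's custom metrics, you create a ServiceMonitor resource in the application's project. A minimal sketch, with hypothetical application and port names, follows:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app          # hypothetical application name
  namespace: example-app
spec:
  selector:
    matchLabels:
      app: example-app       # must match the labels on the application's service
  endpoints:
  - port: web                # named service port that exposes /metrics
    interval: 30s
```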
Prometheus is an open source project for system monitoring and alerting.
Both Red Hat OpenShift Container Platform and Kubernetes integrate Prometheus to enable cluster metrics, monitoring, and alerting capabilities.
Prometheus gathers and stores streams of data from the cluster as time-series data. Time-series data consists of a sequence of samples, where each sample contains the following elements:
A timestamp
A numeric value (such as an integer, float, or Boolean)
A set of labels in the form of key/value pairs. The key/value pairs isolate groups of related values for filtering.
For example, the machine_cpu_cores metric in Prometheus contains a sequence of measurement samples of the number of CPU cores for each machine.
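The sample structure above can be modeled in a few lines of Python. This is a simplified sketch, not part of Prometheus itself; the Sample class, instance names, and timestamps are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Sample:
    """One Prometheus-style sample: a timestamp, a numeric value, and labels."""
    timestamp: float
    value: float
    labels: dict = field(default_factory=dict)

# A tiny in-memory series for the machine_cpu_cores metric,
# with one sample per machine (instance names are hypothetical).
series = [
    Sample(1700000000.0, 8.0, {"__name__": "machine_cpu_cores", "instance": "worker01"}),
    Sample(1700000000.0, 16.0, {"__name__": "machine_cpu_cores", "instance": "worker02"}),
]

def select(samples, **matchers):
    """Keep only samples whose labels match every key/value pair,
    analogous to a PromQL selector such as {instance="worker02"}."""
    return [s for s in samples if all(s.labels.get(k) == v for k, v in matchers.items())]

print([s.value for s in select(series, instance="worker02")])  # [16.0]
```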
You can access and manage the monitoring features by using the web console. The OpenShift Container Platform web console provides the Observe section, with the following subsections:
Alerting
Metrics
Dashboards
Targets
You access cluster alerts from the OpenShift web console at Observe → Alerting. For each alert, the page displays a brief description, the state, and the severity. You can view alert details by clicking the name of the alert. The alert details page also displays a time-series graphic.
For more details about forwarding alerts to other systems, see the next section.
OpenShift integrates Prometheus metrics at Observe → Metrics.
From the Metrics page, enter an expression, such as a metric name, and then run the query to retrieve the most recent sample for the metric.
The following example displays the instance:node_cpu_utilisation:rate1m metric over time.
The metric contains data for each node instance in the cluster.
OpenShift has three monitoring stack components that gather metrics from the Kubernetes API: the kube-state-metrics, openshift-state-metrics, and node-exporter agents.
The dashboards in OpenShift cluster monitoring combine metrics from the three agents.
See the References section to learn about the complete list of exposed metrics.
Prometheus provides a query language, PromQL, to select and aggregate time-series data.
You can filter a metric to include only certain key/value pairs.
For example, you can modify the previous query to show only metrics for the worker02 node by using the following expression:
instance:node_cpu_utilisation:rate1m{instance="worker02"}

Prometheus Query Language provides several operators to compute new time-series metrics. PromQL contains arithmetic operators, including addition, subtraction, multiplication, and division operators. PromQL contains comparison operators, including equality, greater-than, and less-than operators.
PromQL contains built-in functions, including the following ones, that you can include in PromQL expressions:
sum()
Adds the value of all sample entries at a given time.
rate()
Computes the per-second average of a time series for a given time range.
count()
Counts the number of sample entries at a given time.
max()
Selects the maximum value out of the sample entries.
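To make the behavior of these functions concrete, the following Python sketch mimics sum(), count(), max(), and rate() over plain in-memory samples. Real PromQL evaluates these functions against the Prometheus time-series database, so this is an illustration of the semantics, not the actual implementation:

```python
def prom_sum(values):
    """sum(): add the value of all sample entries at a given time."""
    return sum(values)

def prom_count(values):
    """count(): count the number of sample entries at a given time."""
    return len(values)

def prom_max(values):
    """max(): select the maximum value out of the sample entries."""
    return max(values)

def prom_rate(samples, range_seconds):
    """rate(): per-second average increase of a counter over a time range,
    approximated here as (last value - first value) / range."""
    first, last = samples[0], samples[-1]
    return (last[1] - first[1]) / range_seconds

# A counter sampled over 60 seconds as (timestamp, cumulative value) pairs.
cpu_seconds_total = [(0, 100.0), (30, 115.0), (60, 130.0)]

print(prom_rate(cpu_seconds_total, 60))   # 0.5 (CPU-seconds per second)
print(prom_sum([0.5, 0.25, 0.25]))        # 1.0
```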
The following examples of Prometheus Query Language expressions use one metric from the node-exporter agent, and another metric from the kube-state-metrics agent:
node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes*100<50
Shows nodes with less than 50% of available memory.
kube_persistentvolumeclaim_status_phase{phase="Pending"} == 1
Shows persistent volume claims in the pending state.
The Red Hat add-on operators can define extra metrics and alerts.
For example, the compliance operator exposes additional metrics to Prometheus.
You can get a list of exposed metrics by using the following query:
{__name__=~"compliance.*"}

The OpenShift console integrates dashboards based on the gathered metrics at Observe → Dashboards. These dashboards refresh periodically to display current summary metrics and graphs.
In the graphs, which are interactive, you can further explore data features and characteristics that you observe.
The cluster monitoring dashboards serve as a good starting point for near real-time observability of cluster metrics and health.
After receiving an alert, an administrator might use the dashboards to investigate the problem. This investigation might include determining whether a specific node or project has a problem. Additionally, cluster monitoring dashboards can help identify whether a problem was temporary or appears to be persistent.
OpenShift cluster monitoring includes several default dashboards.
Some of the default monitoring dashboards are as follows:
This dashboard displays a high-level view of cluster resources. The dashboard page shows several percentage values for CPU. Similar values are also available for memory.
You can inspect the underlying metrics for each parameter. Clicking a value opens the Metrics page, where you can see the metrics and a related graph.
For example, the cluster CPU utilization value shows a graph and values for the following query:
cluster:node_cpu:ratio_rate5m{cluster=""}
The dashboard page also shows usage graphs for CPU, memory, and network.
These graphs are common to several dashboard pages. The only difference is how the data is filtered. For example, the dashboard filters resource usage, first by namespace and then by workload type, such as by deployment, daemon set, and stateful set.
USE stands for Utilization, Saturation, and Errors. This dashboard displays several graphs to identify whether the cluster is overutilized, oversaturated, or experiencing many errors. Because the dashboard displays all nodes in the cluster, you might be able to identify a node that is not behaving in the same way as the other nodes in the cluster.
The following graphic indicates that the worker03 node is experiencing higher memory saturation than other nodes in the cluster.
Red Hat OpenShift Observability
Observability Across OpenShift Cluster Boundaries with Distributed Data Collection
For more information about OpenShift monitoring, refer to the Monitoring chapter in the Red Hat OpenShift Container Platform 4.14 Observability documentation at https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/html-single/monitoring/index#monitoring-overview
For more information about enabling monitoring for user-defined projects, refer to the Enabling Monitoring for User-defined Projects chapter in the Red Hat OpenShift Container Platform 4.14 Observability documentation at https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/html-single/monitoring/index#enabling-monitoring-for-user-defined-projects
Exposed Metrics - kube-state-metrics