Extract specific insights from monitoring dashboards and metrics queries to troubleshoot performance and availability issues with cluster nodes.
Outcomes
Use the OpenShift monitoring default stack to identify a deployment that consumes excessive CPU and memory.
Use the OpenShift monitoring default stack to find the cause of an availability issue.
As the student user on the workstation machine, use the lab command to prepare your environment for this exercise, and to ensure that all required resources are available.
[student@workstation ~]$ lab start monitoring-cluster
Instructions
Your company has an OpenShift cluster for development and testing environments with three compute nodes. The cluster has two node pools.
The first node pool is for the development environment and uses the env=dev label.
The applications are deployed in the dev-monitor and dev-finance namespaces.
The latest performance testing for each application suggests that with the given load, the applications use minimal resources and do not exceed 100% of the CPU and memory requests.
Use the OpenShift monitoring default stack to identify whether a user workload consumes excessive CPU and memory.
The load generator traffic-simulator.sh script is available at the ~/DO380/labs/monitoring-cluster location.
In this classroom environment, the node pool for the development environment has one worker03 compute node.
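If you want to confirm the node pool assignment, you can optionally list the compute nodes by their label from a terminal. This check assumes that you are already logged in to the cluster from the command line.
[student@workstation ~]$ oc get nodes -l env=dev
...output omitted...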
The second node pool is for the testing environment and uses the env=test label.
The frontend-test and budget-test test applications are deployed in the test-monitor namespace.
The URLs for the applications are as follows:
| Application | URL |
|---|---|
| frontend-test | http://frontend-test-monitor.apps.ocp4.example.com |
| budget-test | http://budget-test-monitor.apps.ocp4.example.com |
Both applications are giving an Application Not Available error.
Although the developers are trying to deploy more applications for testing, the pods are stuck in the Pending state.
Use the OpenShift monitoring default stack to find the cause of this availability issue.
In this classroom environment, the node pool for the testing environment has one worker02 compute node.
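You can optionally confirm the reported symptom from a terminal by listing the pods in the test-monitor namespace; pods that cannot be scheduled remain in the Pending state. This check assumes that you are logged in to the cluster from the command line.
[student@workstation ~]$ oc get pods -n test-monitor
...output omitted...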
Run the load generator for applications that are deployed in the development environment.
The load generator traffic-simulator.sh script is available at the ~/DO380/labs/monitoring-cluster location.
Open a terminal window and navigate to the ~/DO380/labs/monitoring-cluster directory.
[student@workstation ~]$ cd ~/DO380/labs/monitoring-cluster

Execute the traffic-simulator.sh script to generate the load for applications that are deployed in the development environment.
[student@workstation monitoring-cluster]$ ./traffic-simulator.sh
...output omitted...

Use the cluster monitoring dashboards to display cluster resource usage information within the OpenShift web console.
Log in as the admin user to the OpenShift web console.
To do so, open a Firefox window and navigate to https://console-openshift-console.apps.ocp4.example.com.
When prompted, select the identity provider, and then log in as the admin user with redhatocp as the password.
Navigate to Observe → Dashboards.
Select the Kubernetes / Compute Resources / Cluster dashboard.
Scroll to the CPU Quota section. Click the CPU Usage column header, if necessary more than once, until it turns blue with a downward arrow. Rows are then sorted by namespace in descending order of CPU usage.
The dev-monitor and dev-finance namespaces are likely listed among the five namespaces that use the most CPU resources.
Although the dev-monitor namespace requests only 0.65 CPU resources, it uses considerably more.
The result is that the CPU Requests % column reports more than 100%.
If you do not see the dev-monitor namespace with high resource consumption, then reload the page in your browser.
Scroll to the Memory Requests section. Click the Memory Usage column header, if necessary more than once, until it turns blue with a downward arrow. Rows are then sorted by namespace in descending order of memory usage.
The dev-monitor namespace is likely listed among the five namespaces that use the most memory resources.
As with CPU usage, the dev-monitor namespace uses considerably more memory than the 350 MiB that the pods in the namespace request.
The result is that the Memory Requests % column reports more than 100%.
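You can optionally cross-check the dashboard values from another terminal window by querying the metrics API that backs the monitoring stack. The exact figures differ between runs, and this check assumes that you are logged in to the cluster from the command line.
[student@workstation ~]$ oc adm top pods -n dev-monitor
...output omitted...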
Use the cluster monitoring dashboards to identify the workload and pod that uses the most cluster resources in the dev-monitor namespace.
Select the Kubernetes / Compute Resources / Namespace (Workloads) dashboard.
Select the dev-monitor namespace and leave deployment as the workload type.
Scroll to the CPU Quota section.
The section shows that the frontend deployment requests a total of 0.5 CPU resources and currently uses 0.001 CPU resources.
The exoplanets-app and exoplanets-db deployments request a total of 0.05 CPU resources, and currently use 0.007 and 0.003 CPU resources respectively.
The python-app deployment requests a total of 0.1 CPU resources, but the one pod in the deployment consumes more than 2 CPU resources.
Although the actual CPU usage in your cluster might differ, it is clear that the python-app deployment uses the most CPU resources.
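To confirm the CPU request that the python-app deployment defines, you can optionally inspect the resources section of its pod template from a terminal. The exact output depends on how the deployment was created.
[student@workstation ~]$ oc get deployment python-app -n dev-monitor \
  -o jsonpath='{.spec.template.spec.containers[*].resources}{"\n"}'
...output omitted...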
Scroll to the Memory Usage section.
The graph shows that the python-app deployment has notable increases and decreases in memory usage.
However, the ten pods in the frontend deployment consistently use a total of about 63 MiB of memory.
One pod in the exoplanets-app deployment uses a total of about 15.02 MiB of memory, and one pod in the exoplanets-db deployment uses a total of about 129.2 MiB of memory.
You identify a deployment that appears to be behaving erratically and that uses more resources than intended.
The next step might be to identify the owner of the python-app deployment to verify the application.
At minimum, the owner should add resource limits for both CPU and memory to the python-app deployment.
Although a cluster administrator can add limits, setting incorrect limits might cause the pod to enter a CrashLoopBackOff state.
Similarly, a cluster administrator can implement a quota for the dev-monitor namespace to limit the total CPU and memory resources that the project uses.
Because the python-app deployment does not specify limits for either CPU or memory, restricting CPU and memory resources with a quota would prevent the python-app pod from running: when a quota restricts a resource, every new pod in the namespace must explicitly specify that resource.
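For illustration only, the following commands sketch how a cluster administrator could add limits to the deployment or create a quota for the namespace. The limit values and the quota name are placeholders; choose real values only after profiling the application with its owner.
[student@workstation ~]$ oc set resources deployment/python-app -n dev-monitor \
  --limits=cpu=500m,memory=512Mi    # placeholder values
[student@workstation ~]$ oc create quota dev-monitor-quota -n dev-monitor \
  --hard=limits.cpu=2,limits.memory=2Gi    # example quota name and values
...output omitted...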
Return to the terminal where the script is running. Press Ctrl+C to stop the script, and then close the terminal window.
Verify that the frontend-test and budget-test test applications are giving an Application Not Available error.
Open a Firefox window and navigate to the URL for the budget-test application, http://budget-test-monitor.apps.ocp4.example.com.
The web page shows the Application Not Available error.
Open a Firefox tab and navigate to the URL for the frontend-test application, http://frontend-test-monitor.apps.ocp4.example.com.
The web page shows the Application Not Available error.
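You can also verify the error from the command line. A route whose service has no available endpoints typically returns an HTTP 503 response from the router.
[student@workstation ~]$ curl -sI http://budget-test-monitor.apps.ocp4.example.com
...output omitted...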
Use the cluster monitoring dashboards to monitor the workloads and pods in the test-monitor namespace.
Switch to the OpenShift web console Firefox tab and select the dashboard.
Select the test-monitor namespace and leave deployment as the workload type.
Scroll to the CPU Quota section.
The section shows that the frontend-test deployment requests a total of 1 CPU resource and currently uses almost no CPU resources.
The budget-test deployment requests a total of 1 CPU and currently uses almost no CPU resources.
Scroll to the Memory Usage section.
The section shows that the budget-test and frontend-test deployments currently use no memory.
Similarly, neither deployment currently shows any network usage.
The applications are showing no CPU, memory, or network usage, and developers cannot deploy applications in the test environment.
The testing environment has one worker02 compute node.
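If you list the same pods from a terminal with the -o wide option, the STATUS and NODE columns show whether the pods are running and where they are scheduled, which is consistent with the missing usage data.
[student@workstation ~]$ oc get pods -n test-monitor -o wide
...output omitted...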
Use the cluster monitoring dashboards to monitor the state of compute node usage for the test environment.
Select the dashboard. The section shows that one node is in the not ready state.
Click in the section.
The kube_node_status_condition metric shows that the sum of nodes with an unknown or false status is 1.
This metric value implies that one node is not ready.
Use the following metric query to identify the node with an unknown or false status.
kube_node_status_condition{condition="Ready", status=~"unknown|false"} == 1

The status=~"unknown|false" matcher selects the Ready condition entries with an unknown or false status, and the == 1 comparison filters the results to show only the entries that currently apply.
Click Run queries.
The output shows that the worker02 node has an unknown status.
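You can optionally run the same query from a terminal against the Thanos Querier that fronts the monitoring stack. The host shown here is the default route in the openshift-monitoring project for this classroom domain; adjust it if your cluster uses a different route, and run the query as a user with permission to view cluster monitoring data, such as the admin user.
[student@workstation ~]$ TOKEN=$(oc whoami -t)
[student@workstation ~]$ curl -sk -H "Authorization: Bearer $TOKEN" \
  --data-urlencode 'query=kube_node_status_condition{condition="Ready",status=~"unknown|false"} == 1' \
  https://thanos-querier-openshift-monitoring.apps.ocp4.example.com/api/v1/query    # default Thanos Querier route
...output omitted...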
You identify that the worker02 compute node is not ready.
The test applications are giving an Application Not Available error, because the test environment node pool has only one compute node, which is not ready.
The cluster administrator can add a healthy node to the test node pool to restore application availability, and can troubleshoot the worker02 compute node.
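To start troubleshooting the node from a terminal, you can review its conditions and recent events. The describe output typically reports why the kubelet stopped posting a Ready status.
[student@workstation ~]$ oc get nodes
...output omitted...
[student@workstation ~]$ oc describe node worker02
...output omitted...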
Close the web browser.