
Guided Exercise: Cluster Monitoring

Extract specific insights from monitoring dashboards and metrics queries to troubleshoot performance and availability issues with cluster nodes.

Outcomes

  • Use the OpenShift monitoring default stack to identify a deployment that consumes excessive CPU and memory.

  • Use the OpenShift monitoring default stack to find the cause of an availability issue.

As the student user on the workstation machine, use the lab command to prepare your environment for this exercise, and to ensure that all required resources are available.

[student@workstation ~]$ lab start monitoring-cluster

Instructions

Your company has an OpenShift cluster for development and testing environments with three compute nodes. The cluster has two node pools.

The first node pool is for the development environment and uses the env=dev label. The applications are deployed in the dev-monitor and dev-finance namespaces. The latest performance testing for each application suggests that with the given load, the applications use minimal resources and do not exceed 100% of the CPU and memory requests. Use the OpenShift monitoring default stack to identify whether a user workload consumes excessive CPU and memory. The load generator traffic-simulator.sh script is available at the ~/DO380/labs/monitoring-cluster location. In this classroom environment, the node pool for the development environment has one worker03 compute node.
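
You can optionally confirm the node pool from the command line. The following is a minimal check, assuming that the oc client on the workstation machine is logged in to the cluster, for example as the admin user:

[student@workstation ~]$ oc get nodes -l env=dev
...output omitted...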

The second node pool is for the testing environment and uses the env=test label. The frontend-test and budget-test test applications are deployed in the test-monitor namespace. The URL for both applications is as follows:

Application       URL
frontend-test     http://frontend-test-monitor.apps.ocp4.example.com
budget-test       http://budget-test-monitor.apps.ocp4.example.com

Both applications are giving an Application Not Available error. Although the developers are trying to deploy more applications for testing, the pods are stuck in the Pending state. Use the OpenShift monitoring default stack to find the cause of this availability issue. In this classroom environment, the node pool for the testing environment has one worker02 compute node.
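
Optionally, you can list the pods that are stuck in the Pending state from the command line. This is a quick sketch, assuming a logged-in oc client and that the new test applications are also deployed in the test-monitor namespace:

[student@workstation ~]$ oc get pods -n test-monitor --field-selector=status.phase=Pending
...output omitted...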

  1. Run the load generator for applications that are deployed in the development environment. The load generator traffic-simulator.sh script is available at the ~/DO380/labs/monitoring-cluster location.

    1. Open a terminal window and navigate to the ~/DO380/labs/monitoring-cluster directory.

      [student@workstation ~]$ cd ~/DO380/labs/monitoring-cluster
    2. Execute the traffic-simulator.sh script to generate the load for applications that are deployed in the development environment.

      [student@workstation monitoring-cluster]$ ./traffic-simulator.sh
      ...output omitted...
  2. Use the cluster monitoring dashboards to display cluster resource usage information within the OpenShift web console.

    1. Log in as the admin user to the OpenShift web console. To do so, open a Firefox window and navigate to https://console-openshift-console.apps.ocp4.example.com. Click Red Hat Identity Management.

    2. Log in as the admin user with redhatocp as the password.

    3. Navigate to Observe → Dashboards.

    4. Select the Kubernetes / Compute Resources / Cluster dashboard.

    5. Scroll to the CPU Quota section. Click the CPU Usage column header, more than once if necessary, until it turns blue with a downward arrow. The rows are then sorted in descending order of CPU usage, with the namespaces that use the most CPU at the top.

      Figure 6.8: Namespaces with the highest CPU usage

      The dev-monitor and dev-finance namespaces are likely listed among the five namespaces that use the most CPU resources. Although the dev-monitor namespace requests only 0.65 CPU resources, it uses considerably more. The result is that the CPU Requests % column reports more than 100%.

      If you do not see the dev-monitor namespace with high resource consumption, then reload the page in your browser.

    6. Scroll to the Memory Requests section. Click the Memory Usage column header, more than once if necessary, until it turns blue with a downward arrow. The rows are then sorted in descending order of memory usage, with the namespaces that use the most memory at the top.

      Figure 6.9: Namespaces with the highest memory usage

      The dev-monitor namespace is likely listed among the five namespaces that use the most memory resources. As with CPU usage, the dev-monitor namespace uses considerably more memory than the 350 MiB that the pods in the namespace request. The result is that the Memory Requests % column reports more than 100%.
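
    As an optional cross-check of the dashboard values in this step, you can query the metrics API from the command line. This is a minimal sketch, assuming that the oc client is logged in as the admin user; the values that it reports differ from the figures:

      [student@workstation ~]$ oc adm top pods -A
      ...output omitted...

    Pods in the dev-monitor namespace report CPU and memory usage that is well above what the namespace requests.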

  3. Use the cluster monitoring dashboards to identify the workload and pod that uses the most cluster resources in the dev-monitor namespace.

    1. Select the Kubernetes / Compute Resources / Namespace (Workloads) dashboard. Select the dev-monitor namespace and leave deployment as the workload type.

    2. Scroll to the CPU Quota section. The CPU Quota section shows that the frontend deployment requests a total of 0.5 CPU resources and currently uses 0.001 CPU resources. The exoplanets-app and exoplanets-db deployments request a total of 0.05 CPU resources, and currently use 0.007 and 0.003 CPU resources respectively. The python-app deployment requests a total of 0.1 CPU resources, but the one pod in the deployment consumes more than 2 CPU resources.

      Figure 6.10: Deployment CPU usage

      Although the actual CPU usage in your cluster might differ, it is clear that the python-app deployment uses the most CPU resources.

    3. Scroll to the Memory Usage section. The Memory Usage graph shows that the python-app deployment has notable increases and decreases in memory usage. However, the ten pods in the frontend deployment consistently use a total of about 63 MiB of memory. One pod in the exoplanets-app deployment uses a total of about 15.02 MiB of memory, and one pod in the exoplanets-db deployment uses a total of about 129.2 MiB of memory.

      Figure 6.11: Deployment memory usage
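
    If you prefer the command line, a minimal cross-check of the per-pod usage in the dev-monitor namespace, assuming a logged-in oc client, is the following command. The pod from the python-app deployment reports by far the highest CPU usage, although the exact values differ from the figures:

      [student@workstation ~]$ oc adm top pods -n dev-monitor
      ...output omitted...
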
  4. You identify a deployment that appears to be behaving erratically and that uses more resources than intended. The next step might be to identify the owner of the python-app deployment to verify the application.

    At minimum, the owner should add resource limits for both CPU and memory to the python-app deployment. Although a cluster administrator can add limits, setting incorrect limits might cause the pod to enter a CrashLoopBackOff state.

    Similarly, a cluster administrator can implement a quota for the dev-monitor namespace to limit the total CPU and memory resources that the project uses. Because the python-app deployment does not specify limits for either CPU or memory, restricting CPU and memory resources with a quota would prevent the python-app pod from running.
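
    The following commands are a sketch of such a remediation, shown for illustration only and not part of this exercise. The limit and quota values are assumptions that the application owner must validate, and, as noted above, a quota that restricts CPU and memory limits only works as intended if the workloads in the namespace also define limits:

      [student@workstation ~]$ oc set resources deployment/python-app -n dev-monitor --limits=cpu=500m,memory=512Mi
      [student@workstation ~]$ oc create quota dev-monitor-quota -n dev-monitor --hard=limits.cpu=4,limits.memory=4Gi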

  5. Return to the terminal where the script is running. Press Ctrl+C to stop the script, and then close the terminal window.

  6. Verify that the frontend-test and budget-test test applications are giving an Application Not Available error.

    1. Open a Firefox window and navigate to the URL for the budget-test application, http://budget-test-monitor.apps.ocp4.example.com. The web page shows the Application Not Available error.

      Figure 6.12: Application not available
    2. Open a Firefox tab and navigate to the URL for the frontend-test application, http://frontend-test-monitor.apps.ocp4.example.com. The web page shows the Application Not Available error.
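
    Optionally, you can verify the same behavior from the command line. The OpenShift router typically returns an HTTP 503 status code when a route has no available endpoints, so a quick check is:

      [student@workstation ~]$ curl -s -o /dev/null -w '%{http_code}\n' http://budget-test-monitor.apps.ocp4.example.com
      503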

  7. Use the cluster monitoring dashboards to monitor the workloads and pods in the test-monitor namespace.

    1. Switch to the OpenShift web console Firefox tab and select the Kubernetes / Compute Resources / Namespace (Workloads) dashboard. Select the test-monitor namespace and leave deployment as the workload type.

    2. Scroll to the CPU Quota section. The CPU Quota section shows that the frontend-test deployment requests a total of 1 CPU resource and currently uses almost no CPU resources. The budget-test deployment requests a total of 1 CPU and currently uses almost no CPU resources.

      Figure 6.13: Deployment CPU usage for the test environment
    3. Scroll to the Memory Quota section. The Memory Quota section shows that the budget-test and frontend-test deployments currently use no memory.

      Figure 6.14: Deployment memory usage for the test environment

      Similarly, neither deployment currently shows any network usage.

      The applications are showing no CPU, memory, or network usage, and developers cannot deploy applications in the test environment. The testing environment has one worker02 compute node.
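
    A quick command-line view of the same situation, assuming a logged-in oc client, is to list the pods in the namespace together with the nodes where they are scheduled. Pods that are not in the Running state explain why the deployments report no CPU, memory, or network usage:

      [student@workstation ~]$ oc get pods -n test-monitor -o wide
      ...output omitted...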

  8. Use the cluster monitoring dashboards to monitor the state of compute node usage for the test environment.

    1. Select the Node Cluster dashboard. The NotReadyNodesCount section shows that one node is in the not ready state.

    2. Click Inspect in the NotReadyNodesCount section. The metric shows that the sum of nodes with an unknown or false status is 1. This metric value implies that one node is not ready.

    3. Use the following metric query to filter the results and identify the node with an unknown or false status.

      kube_node_status_condition{condition="Ready", status=~"unknown|false"} == 1

      The metric exposes an entry for each node and status combination. The label matcher selects the entries with an unknown or false Ready status, and the == 1 comparison keeps only the entries that are currently active.

      Click Run queries.

      Figure 6.15: Node status metric

      The output shows that the worker02 node has unknown status.
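
    You can confirm the same finding from the command line. A minimal check, assuming the env=test label from the scenario and a logged-in oc client, is the following command, where the worker02 node reports a NotReady status:

      [student@workstation ~]$ oc get nodes -l env=test
      ...output omitted...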

  9. You identify that the worker02 compute node is not ready. The test applications are giving an Application Not Available error, because the test environment node pool has only one compute node, and that node is not ready.

    The cluster administrator can add a healthy node to the test node pool to restore application availability, and can then troubleshoot the worker02 compute node.
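
    To start troubleshooting, a cluster administrator could inspect the node conditions and events from the command line. This is a sketch only and is not part of this exercise; substitute the node name that oc get nodes reports:

      [student@workstation ~]$ oc describe node worker02
      ...output omitted...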

  10. Close the web browser.

Finish

On the workstation machine, use the lab command to complete this exercise. This step is important to ensure that resources from previous exercises do not impact upcoming exercises.

[student@workstation ~]$ lab finish monitoring-cluster
