Lab: OpenShift Monitoring

Configure alerts, and extract specific insights from monitoring dashboards to troubleshoot performance issues with cluster nodes.

Outcomes

  • Configure the OpenShift alert manager to send email notifications for the NodeCPUOvercommit alert rule.

  • Review the alert email.

  • Use the OpenShift monitoring dashboards to identify issues with cluster nodes.

  • Fix the problematic deployment by reducing its number of pods to zero.

As the student user on the workstation machine, use the lab command to prepare your environment for this exercise, and to ensure that all required resources are available.

[student@workstation ~]$ lab start monitoring-review

Instructions

Your company has a node pool for the stage environment in the OpenShift cluster. This node pool uses the env=stage label and includes the worker03 node.

The lab script deploys an alert rule for the stage environment that triggers the NodeCPUOvercommit alert. This alert fires when applications request more than the available CPU.
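
For reference, an alert rule of this kind is typically defined as a PrometheusRule resource. The following is a minimal sketch only; the metric names, threshold, and labels are illustrative and might differ from the rule that the lab script deploys:

  apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    name: stage-cpu-overcommit        # illustrative name
    namespace: openshift-monitoring
  spec:
    groups:
    - name: stage-node-resources
      rules:
      - alert: NodeCPUOvercommit
        expr: |
          # Fires when the CPU that pods request exceeds the CPU that
          # the nodes can allocate (illustrative expression).
          sum(kube_pod_container_resource_requests{resource="cpu"})
            > sum(kube_node_status_allocatable{resource="cpu"})
        for: 5m
        labels:
          severity: critical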

The lab script deploys three applications, called budget-app, frontend, and python-app, for the stage environment. These applications are deployed in the monitoring-review namespace. One of the applications requests more vCPU than is available. Find that application and scale it to zero pods, so that its logs and events are preserved for developers to fix it.

Use the admin user with redhatocp as the password. For the resources that you must create, you can use the incomplete YAML files in the ~/DO380/labs/monitoring-review directory.

  1. As the admin user, connect to the OpenShift cluster and configure the OpenShift alert manager to send all notifications for alerts with the NodeCPUOvercommit name to the ocp-admins@example.com email address. Use the email configuration information from the following table:

    Parameter   Value
    Host        192.168.50.254:25
    Username    smtp_training
    Password    Red_H4T@!
    Sender      alerts@ocp4.example.com
    TLS         False

    Use one minute for the group_interval, group_wait, and repeat_interval parameters. Use the preconfigured cpu-overcommit receiver. You can find an incomplete example for the alert manager configuration in the ~/DO380/labs/monitoring-review/alertmanager.yaml file.

    1. Open a terminal and connect to the OpenShift cluster as the admin user with redhatocp as the password.

      [student@workstation ~]$ oc login -u admin -p redhatocp \
        https://api.ocp4.example.com:6443
      Login successful.
      ...output omitted...
    2. Change to the ~/DO380/labs/monitoring-review directory.

      [student@workstation ~]$ cd ~/DO380/labs/monitoring-review
    3. Create the alert manager configuration file. You can find an incomplete example in the ~/DO380/labs/monitoring-review/alertmanager.yaml file.

      global:
        resolve_timeout: 5m
        smtp_smarthost: 192.168.50.254:25
        smtp_hello: localhost
        smtp_auth_username: smtp_training
        smtp_auth_password: Red_H4T@!
        smtp_from: alerts@ocp4.example.com
        smtp_require_tls: false
      ...output omitted...
      receivers:
      - name: Default
      - name: Watchdog
      - name: Critical
      - name: 'null'
      - name: cpu-overcommit
        email_configs:
          - to: ocp-admins@example.com
      ...output omitted...
      route:
        group_by:
        - namespace
        group_interval: 1m
        group_wait: 1m
        receiver: Default
        repeat_interval: 1m
        routes:
        - matchers:
          - alertname = Watchdog
          receiver: Watchdog
        - matchers:
          - alertname = InfoInhibitor
          receiver: 'null'
        - matchers:
          - severity = critical
          receiver: Critical
          continue: true
        - matchers:
          - alertname = NodeCPUOvercommit
          receiver: cpu-overcommit
    4. Apply the alert manager configuration.

      [student@workstation monitoring-review]$ oc set data secret/alertmanager-main \
        -n openshift-monitoring --from-file alertmanager.yaml
      secret/alertmanager-main data updated
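
      Optionally, verify that the secret now contains your updated configuration by extracting it back to standard output. This check is not part of the graded steps:

      [student@workstation monitoring-review]$ oc extract secret/alertmanager-main \
        -n openshift-monitoring --keys=alertmanager.yaml --to=-

      The output should match the contents of your alertmanager.yaml file.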
    5. Follow the alertmanager container logs and verify that the alert manager configuration is applied successfully. New log messages can take a few minutes to display. Press Ctrl+C to exit the oc logs command.

      [student@workstation monitoring-review]$ oc logs -f alertmanager-main-0 \
        -c alertmanager -n openshift-monitoring
      ...output omitted...
      ts=2024-02-12T15:41:38.303Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
      ts=2024-02-12T15:41:38.305Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
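
      If the new configuration does not appear in the logs, you can also inspect the sidecar container that reloads the configuration. The config-reloader container name is an assumption based on the default Alertmanager pod layout:

      [student@workstation monitoring-review]$ oc logs alertmanager-main-0 \
        -c config-reloader -n openshift-monitoring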
  2. Verify that you receive email alerts from the OpenShift alert manager. The lab user on the utility.lab.example.com host receives emails that are sent to the ocp-admins@example.com email address.

    Use the mutt command to access the mail messages. The mutt command updates in real time, so it automatically displays any new mail messages.

    It can take a few minutes for OpenShift to fire the alert and to start sending emails.

    Note

    The alert manager sends alerts as emails in HTML format. The lab script configures mutt to display HTML messages.

    1. Connect to the utility.lab.example.com host as the lab user.

      [student@workstation monitoring-review]$ ssh lab@utility.lab.example.com
      ...output omitted...
    2. Use the mutt command to access the mail messages.

      [lab@utility ~]$ mutt
    3. The existing emails from alerts@ocp4.example.com demonstrate that the alert manager sends email notifications for node CPU overcommitment.

      q:Quit    d:Del    u:Undel    s:Save    m:Mail    r:Reply    g:Group    ?:Help
      
         1 N   Feb 12 alerts@ocp4.exa ( 217) [FIRING:1]  (NodeCPUOvercommit platform openshift-monitoring/k8s critical)
      ...output omitted...
    4. Select the first alert message and press Enter to open the email.

      i:Exit  -:PrevPg  <Space>:NextPg  v:View Attach…  d:Del  r:Reply  j:Next  ?:Help
      Date: Mon, 12 Feb 2024 11:44:44 +0000
      From: alerts@ocp4.example.com
      To: ocp-admins@example.com
      Subject: [FIRING:1]  (NodeCPUOvercommit platform openshift-monitoring/k8s critical)
      ...output omitted...
    5. Press q to close the message, and then press q again to exit mutt.

    6. Exit the SSH session.

      [lab@utility ~]$ exit
  3. Use the cluster alerting dashboards to verify the firing alert.

    1. Open a Firefox window and navigate to https://console-openshift-console.apps.ocp4.example.com. Click Red Hat Identity Management. Log in as the admin user with redhatocp as the password.

    2. Navigate to Observe → Alerting.

    3. Verify that the NodeCPUOvercommit alert is in the firing state.
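
      Optionally, you can also list the firing alerts from the command line by querying the Alertmanager API through its route. The following commands are a sketch; they assume that your user is allowed to query the Alertmanager API and that the jq utility is available on workstation:

      [student@workstation monitoring-review]$ ALERT_HOST=$(oc get route alertmanager-main \
        -n openshift-monitoring -o jsonpath='{.spec.host}')
      [student@workstation monitoring-review]$ curl -sk \
        -H "Authorization: Bearer $(oc whoami -t)" \
        https://${ALERT_HOST}/api/v2/alerts | jq -r '.[].labels.alertname'

      The NodeCPUOvercommit alert name should appear in the output.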

  4. Use the cluster monitoring dashboards to display cluster CPU and memory resource usage information within the OpenShift web console. The compute nodes have four vCPUs. Identify the namespace that requests more than the available vCPU.

    1. Change to the web browser window, and navigate to Observe → Dashboards.

    2. Select the Kubernetes / Compute Resources / Cluster dashboard.

    3. Scroll to the CPU Quota section. Click the CPU Requests column header, if necessary more than once, until it turns blue with a downward arrow. Rows are sorted by namespace in descending CPU order. The monitoring-review namespace is requesting more than the available vCPU from the stage node.

      If you do not see the monitoring-review namespace with high resource consumption, then reload the page in your browser.

    4. Scroll to the Memory Requests section. Click the Memory Requests column header, if necessary more than once, until it turns blue with a downward arrow. Rows are sorted by namespace in descending memory order. The monitoring-review namespace is likely to be listed among the five namespaces that request the most memory resources.
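
      The dashboard panels are built from Prometheus queries, so you can retrieve similar information in Observe → Metrics. The following query is a sketch that sums CPU requests by namespace; the metric and label names are assumptions based on the kube-state-metrics metrics in the default cluster monitoring stack:

      sort_desc(sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"}))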

  5. Use the cluster monitoring dashboard with the namespace that was identified in the previous step to investigate further and find the application that is responsible for requesting more than the available vCPU.

    1. Select the Kubernetes / Compute Resources / Namespace (Workloads) dashboard.

    2. Select the monitoring-review namespace in the Namespace drop-down menu.

    3. Scroll to the CPU Quota section. Verify that the python-app deployment requests more vCPU than is available.

    4. Scroll to the Memory Requests section. Verify that the python-app deployment requests the most memory resources.
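
      As in the previous step, you can run a similar query in Observe → Metrics that is scoped to the namespace and grouped by pod. This query is also a sketch with assumed metric and label names:

      sort_desc(sum by (pod) (kube_pod_container_resource_requests{namespace="monitoring-review", resource="cpu"}))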

  6. Use the information from the previous step to fix the problematic deployment by reducing its number of pods to zero. You scale the number of pods to zero to preserve application logs and events for troubleshooting. After fixing the deployment, verify that the NodeCPUOvercommit alert is no longer firing.

    1. Change to the terminal window and scale down the python-app deployment to zero replicas.

      [student@workstation monitoring-review]$ oc scale deployment/python-app \
        -n monitoring-review --replicas 0
      deployment.apps/python-app scaled
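
      Optionally, confirm that the deployment now has zero replicas:

      [student@workstation monitoring-review]$ oc get deployment/python-app \
        -n monitoring-review

      The READY column in the output should show 0/0.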
    2. Change to the /home/student directory.

      [student@workstation monitoring-review]$ cd
    3. Change to the web browser window, and navigate to Observe → Alerting.

    4. After a few minutes, verify that the NodeCPUOvercommit alert is no longer firing.

Evaluation

As the student user on the workstation machine, use the lab command to grade your work. Correct any reported failures and rerun the command until successful.

[student@workstation ~]$ lab grade monitoring-review

Finish

As the student user on the workstation machine, use the lab command to complete this exercise. This step is important to ensure that resources from previous exercises do not impact upcoming exercises.

[student@workstation ~]$ lab finish monitoring-review
