Lab: OpenShift Monitoring

Configure alerts, and extract specific insights from monitoring dashboards to troubleshoot performance issues with cluster nodes.

Outcomes

  • Configure the OpenShift alert manager to send email notifications for the NodeCPUOvercommit alert rule.

  • Review the alert email.

  • Use the OpenShift monitoring dashboards to identify issues with cluster nodes.

  • Fix the problematic deployment by reducing its number of pods to zero.

As the student user on the workstation machine, use the lab command to prepare your environment for this exercise, and to ensure that all required resources are available.

[student@workstation ~]$ lab start monitoring-review

Instructions

Your company has a node pool for the stage environment in the OpenShift cluster. This node pool uses the env=stage label and includes the worker03 node.

The lab script deploys an alert rule for the stage environment that triggers the NodeCPUOvercommit alert. This alert fires when applications request more than the available CPU.
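
For reference, an alert rule of this kind is typically defined as a PrometheusRule resource. The following is a minimal sketch only; the metric names, threshold, and labels are illustrative and might differ from the rule that the lab script deploys:

  apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    name: stage-cpu-overcommit        # illustrative name
    namespace: openshift-monitoring
  spec:
    groups:
    - name: stage-node-resources
      rules:
      - alert: NodeCPUOvercommit
        expr: |
          # Fires when the CPU that pods request exceeds the CPU that
          # the nodes can allocate (illustrative expression).
          sum(kube_pod_container_resource_requests{resource="cpu"})
            > sum(kube_node_status_allocatable{resource="cpu"})
        for: 5m
        labels:
          severity: critical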

The lab script deploys three applications, called budget-app, frontend, and python-app, for the stage environment. These applications are deployed in the monitoring-review namespace. One of the applications requests more vCPU than is available. Find that application and scale it to zero pods, so that its logs and events are preserved for developers to fix it.

Use the admin user with redhatocp as the password. For the resources that you must create, you can use the incomplete YAML files in the ~/DO380/labs/monitoring-review directory.

  1. As the admin user, connect to the OpenShift cluster and configure the OpenShift alert manager to send all notifications for alerts with the NodeCPUOvercommit name to the ocp-admins@example.com email address. Use the email configuration information from the following table:

    Parameter   Value
    Host        192.168.50.254:25
    Username    smtp_training
    Password    Red_H4T@!
    Sender      alerts@ocp4.example.com
    TLS         False

    Use one minute for the group_interval, group_wait, and repeat_interval parameters. Use the preconfigured cpu-overcommit receiver. You can find an incomplete example for the alert manager configuration in the ~/DO380/labs/monitoring-review/alertmanager.yaml file.

    1. Open a terminal and connect to the OpenShift cluster as the admin user with redhatocp as the password.

      [student@workstation ~]$ oc login -u admin -p redhatocp \
        https://api.ocp4.example.com:6443
      Login successful.
      ...output omitted...
    2. Change to the ~/DO380/labs/monitoring-review directory.

      [student@workstation ~]$ cd ~/DO380/labs/monitoring-review
    3. Create the alert manager configuration file. You can find an incomplete example in the ~/DO380/labs/monitoring-review/alertmanager.yaml file.

      global:
        resolve_timeout: 5m
        smtp_smarthost: 192.168.50.254:25
        smtp_hello: localhost
        smtp_auth_username: smtp_training
        smtp_auth_password: Red_H4T@!
        smtp_from: alerts@ocp4.example.com
        smtp_require_tls: false
      ...output omitted...
      receivers:
      - name: Default
      - name: Watchdog
      - name: Critical
      - name: 'null'
      - name: cpu-overcommit
        email_configs:
          - to: ocp-admins@example.com
      ...output omitted...
      route:
        group_by:
        - namespace
        group_interval: 1m
        group_wait: 1m
        receiver: Default
        repeat_interval: 1m
        routes:
        - matchers:
          - alertname = Watchdog
          receiver: Watchdog
        - matchers:
          - alertname = InfoInhibitor
          receiver: 'null'
        - matchers:
          - severity = critical
          receiver: Critical
          continue: true
        - matchers:
          - alertname = NodeCPUOvercommit
          receiver: cpu-overcommit
    4. Apply the alert manager configuration.

      [student@workstation monitoring-review]$ oc set data secret/alertmanager-main \
        -n openshift-monitoring --from-file alertmanager.yaml
      secret/alertmanager-main data updated
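
      Optionally, verify that the secret now contains your updated configuration by extracting it back to standard output. This check is not part of the graded steps:

      [student@workstation monitoring-review]$ oc extract secret/alertmanager-main \
        -n openshift-monitoring --keys=alertmanager.yaml --to=-

      The output should match the contents of your alertmanager.yaml file.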
    5. Follow the alertmanager container logs and verify that the alert manager configuration is applied successfully. New log messages can take a few minutes to display. Press Ctrl+C to exit the oc logs command.

      [student@workstation monitoring-review]$ oc logs -f alertmanager-main-0 \
        -c alertmanager -n openshift-monitoring
      ...output omitted...
      ts=2024-02-12T15:41:38.303Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
      ts=2024-02-12T15:41:38.305Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
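
      If the new configuration does not appear in the logs, you can also inspect the sidecar container that reloads the configuration. The config-reloader container name is an assumption based on the default Alertmanager pod layout:

      [student@workstation monitoring-review]$ oc logs alertmanager-main-0 \
        -c config-reloader -n openshift-monitoring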
  2. Verify that you receive email alerts from the OpenShift alert manager. The lab user on the utility.lab.example.com host receives emails that are sent to the ocp-admins@example.com email address.

    Use the mutt command to access the mail messages. The mutt command updates in real time, so it automatically displays any new mail messages.

    It can take a few minutes for OpenShift to fire the alert and to start sending emails.

    Note

    The alert manager sends alerts as emails in HTML format. The lab script configures mutt to display HTML messages.

    1. Connect to the utility.lab.example.com host as the lab user.

      [student@workstation monitoring-review]$ ssh lab@utility.lab.example.com
      ...output omitted...
    2. Use the mutt command to access the mail messages.

      [lab@utility ~]$ mutt
    3. The existing emails from alerts@ocp4.example.com demonstrate that the alert manager sends email notifications for node CPU overcommitment.

      q:Quit    d:Del    u:Undel    s:Save    m:Mail    r:Reply    g:Group    ?:Help
      
         1 N   Feb 12 alerts@ocp4.exa ( 217) [FIRING:1]  (NodeCPUOvercommit platform openshift-monitoring/k8s critical)
      ...output omitted...
    4. Select the first alert message and press Enter to open the email.

      i:Exit  -:PrevPg  <Space>:NextPg  v:View Attach…  d:Del  r:Reply  j:Next  ?:Help
      Date: Mon, 12 Feb 2024 11:44:44 +0000
      From: alerts@ocp4.example.com
      To: ocp-admins@example.com
      Subject: [FIRING:1]  (NodeCPUOvercommit platform openshift-monitoring/k8s critical)
      ...output omitted...
    5. Press q to close the message, and then press q again to exit mutt.

    6. Exit the SSH session.

      [lab@utility ~]$ exit
  3. Use the cluster alerting dashboards to verify the firing alert.

    1. Open a Firefox window and navigate to https://console-openshift-console.apps.ocp4.example.com. Click Red Hat Identity Management. Log in as the admin user with redhatocp as the password.

    2. Navigate to Observe → Alerting.

    3. Verify that the NodeCPUOvercommit alert is in the firing state.
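
      Optionally, you can also list the firing alerts from the command line by querying the Alertmanager API through its route. The following commands are a sketch; they assume that your user is allowed to query the Alertmanager API and that the jq utility is available on workstation:

      [student@workstation monitoring-review]$ ALERT_HOST=$(oc get route alertmanager-main \
        -n openshift-monitoring -o jsonpath='{.spec.host}')
      [student@workstation monitoring-review]$ curl -sk \
        -H "Authorization: Bearer $(oc whoami -t)" \
        https://${ALERT_HOST}/api/v2/alerts | jq -r '.[].labels.alertname'

      The NodeCPUOvercommit alert name should appear in the output.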

  4. Use the cluster monitoring dashboards to display cluster CPU and memory resource usage information within the OpenShift web console. The compute nodes have four vCPUs. Identify the namespace that requests more than the available vCPU.

    1. Change to the web browser window, and navigate to Observe → Dashboards.

    2. Select the Kubernetes / Compute Resources / Cluster dashboard.

    3. Scroll to the CPU Quota section. Click the CPU Requests column header, if necessary more than once, until it turns blue with a downward arrow. Rows are sorted by namespace in descending CPU order. The monitoring-review namespace is requesting more than the available vCPU from the stage node.

      If you do not see the monitoring-review namespace with high resource consumption, then reload the page in your browser.

    4. Scroll to the Memory Requests section. Click the Memory Requests column header, if necessary more than once, until it turns blue with a downward arrow. Rows are sorted by namespace in descending memory order. The monitoring-review namespace is likely to be listed among the five namespaces that request the most memory resources.
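
      The dashboard panels are built from Prometheus queries, so you can retrieve similar information in Observe → Metrics. The following query is a sketch that sums CPU requests by namespace; the metric and label names are assumptions based on the kube-state-metrics metrics in the default cluster monitoring stack:

      sort_desc(sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"}))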

  5. Use the cluster monitoring dashboard with the namespace that was identified in the previous step to investigate further and find the application that is responsible for requesting more than the available vCPU.

    1. Select the Kubernetes / Compute Resources / Namespace (Workloads) dashboard.

    2. Select the monitoring-review namespace in the Namespace drop-down menu.

    3. Scroll to the CPU Quota section. Verify that the python-app deployment requests more vCPU than is available.

    4. Scroll to the Memory Requests section. Verify that the python-app deployment requests the most memory resources.
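
      As in the previous step, you can run a similar query in Observe → Metrics that is scoped to the namespace and grouped by pod. This query is also a sketch with assumed metric and label names:

      sort_desc(sum by (pod) (kube_pod_container_resource_requests{namespace="monitoring-review", resource="cpu"}))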

  6. Use the information from the previous step to fix the problematic deployment by reducing its number of pods to zero. You scale the number of pods to zero to preserve application logs and events for troubleshooting. After fixing the deployment, verify that the NodeCPUOvercommit alert is no longer firing.

    1. Change to the terminal window and scale down the python-app deployment to zero replicas.

      [student@workstation monitoring-review]$ oc scale deployment/python-app \
        -n monitoring-review --replicas 0
      deployment.apps/python-app scaled
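
      Optionally, confirm that the deployment now has zero replicas:

      [student@workstation monitoring-review]$ oc get deployment/python-app \
        -n monitoring-review

      The READY column in the output should show 0/0.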
    2. Change to the /home/student directory.

      [student@workstation monitoring-review]$ cd
    3. Change to the web browser window, and navigate to Observe → Alerting.

    4. After a few minutes, verify that the NodeCPUOvercommit alert is no longer firing.

Evaluation

As the student user on the workstation machine, use the lab command to grade your work. Correct any reported failures and rerun the command until successful.

[student@workstation ~]$ lab grade monitoring-review

Finish

As the student user on the workstation machine, use the lab command to complete this exercise. This step is important to ensure that resources from previous exercises do not impact upcoming exercises.

[student@workstation ~]$ lab finish monitoring-review
