Verify the health of an OpenShift cluster by querying the status of its cluster operators, nodes, pods, and systemd services. Also verify cluster events and alerts.
Outcomes
View the status and get information about cluster operators.
Retrieve information about cluster pods and nodes.
Retrieve the status of a node's systemd services.
View cluster events and alerts.
Retrieve debugging information for the cluster.
As the student user on the workstation machine, use the lab command to prepare your system for this exercise.
This command ensures that all resources are available for this exercise.
[student@workstation ~]$ lab start cli-health
Instructions
Retrieve the status and view information about cluster operators.
Log in to the OpenShift cluster as the admin user with the redhatocp password.
[student@workstation ~]$ oc login -u admin -p redhatocp \
https://api.ocp4.example.com:6443
Login successful
...output omitted...
List the operators that users installed in the OpenShift cluster.
[student@workstation ~]$ oc get operators
NAME AGE
lvms-operator.openshift-storage 27d
metallb-operator.metallb-system 27d
List the cluster operators that are installed by default in the OpenShift cluster.
[student@workstation ~]$ oc get clusteroperators
NAME VERSION AVAILABLE PROGRESSING DEGRADED ...
authentication 4.14.0 True False False ...
baremetal 4.14.0 True False False ...
cloud-controller-manager 4.14.0 True False False ...
cloud-credential 4.14.0 True False False ...
cluster-autoscaler 4.14.0 True False False ...
config-operator 4.14.0 True False False ...
console 4.14.0 True False False ...
control-plane-machine-set 4.14.0 True False False ...
csi-snapshot-controller 4.14.0 True False False ...
dns 4.14.0 True False False ...
etcd 4.14.0 True False False ...
...output omitted...
Use the describe command to view detailed information about the openshift-apiserver cluster operator, such as related objects, events, and version.
[student@workstation ~]$ oc describe clusteroperators openshift-apiserver
Name: openshift-apiserver
Namespace:
Labels: <none>
Annotations: exclude.release.openshift.io/internal-openshift-hosted: true
include.release.openshift.io/self-managed-high-availability: true
include.release.openshift.io/single-node-developer: true
API Version: config.openshift.io/v1
Kind: ClusterOperator
Metadata:
...output omitted...
Spec:
Status:
Conditions:
Last Transition Time: 2023-02-09T22:41:08Z
Message: All is well
Reason: AsExpected
Status: False
Type: Degraded
...output omitted...
Extension: <nil>
Related Objects:
Group: operator.openshift.io
Name: cluster
Resource: openshiftapiservers
Group:
Name: openshift-config
Resource: namespaces
Group:
Name: openshift-config-managed
Resource: namespaces
Group:
Name: openshift-apiserver-operator
Resource: namespaces
Group:
Name: openshift-apiserver
Resource: namespaces
...output omitted...
Versions:
Name: operator
Version: 4.14.0
Name: openshift-apiserver
Version: 4.14.0
Events: <none>
The Related Objects attribute lists the name, resource type, and group of each object that is related to the operator.
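If you only want the list of related objects, you can extract them with a jq filter instead of reading the full describe output. The following command is an optional sketch and is not part of the exercise; its output is omitted here.
[student@workstation ~]$ oc get clusteroperator openshift-apiserver -o json | \
jq '.status.relatedObjects[] | {group, resource, name}'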
List the pods in the openshift-apiserver-operator namespace.
Then, view the detailed status of an openshift-apiserver-operator pod by using the JSON format and the jq command.
Your pod names might differ.
[student@workstation ~]$ oc get pods -n openshift-apiserver-operator
NAME READY STATUS RESTARTS AGE
openshift-apiserver-operator-7ddc8958fb-7m2kr 1/1 Running 11 27d
[student@workstation ~]$ oc get pod -n openshift-apiserver-operator \
openshift-apiserver-operator-7ddc8958fb-7m2kr \
-o json | jq .status
{
"conditions": [
...output omitted...
{
"lastProbeTime": null,
"lastTransitionTime": "2023-03-08T15:41:34Z",
"status": "True",
"type": "Ready"
},
...output omitted...
],
"containerStatuses": [
{
...output omitted...
"name": "openshift-apiserver-operator",
"ready": true,
"restartCount": 11,
"started": true,
"state": {
"running": {
"startedAt": "2023-03-08T15:41:34Z"
}
}
}
],
"hostIP": "192.168.50.10",
"phase": "Running",
"podIP": "10.8.0.5",
...output omitted...
}
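The jq command accepts more specific filters as well. For example, the following optional sketch prints only the name, readiness, and restart count of each container in the operator pod. It is not required for the exercise, and it reuses the pod name from the previous command; your pod name might differ.
[student@workstation ~]$ oc get pod -n openshift-apiserver-operator \
openshift-apiserver-operator-7ddc8958fb-7m2kr \
-o json | jq '.status.containerStatuses[] | {name, ready, restartCount}'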
Retrieve the status, resource consumption, and events of cluster pods.
List the memory and CPU usage of all pods in the cluster.
Use the --sum option to print the sum of the resource usage.
The resource usage on your system probably differs.
[student@workstation ~]$ oc adm top pods -A --sum
NAMESPACE NAME CPU(cores) MEMORY(bytes)
metallb-system controller-5f6dfd8c4f-ddr8v 0m 39Mi
metallb-system metallb-operator-controller-manager-... 1m 38Mi
metallb-system metallb-operator-webhook-server-... 1m 18Mi
metallb-system speaker-2dds4 10m 94Mi
...output omitted...
505m 8982Mi
List the pods and their labels in the openshift-etcd namespace.
[student@workstation ~]$ oc get pods -n openshift-etcd --show-labels
NAME READY STATUS RESTARTS AGE LABELS
etcd-master01 4/4 Running 40 27d app=etcd,etcd=true,k8s-app=etcd,revision=3
installer-3-master01 0/1 Completed 0 27d app=installer
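Because the etcd pod carries the app=etcd label, you can also select it with a label selector instead of listing the whole namespace. This optional command is a sketch; its output is omitted here.
[student@workstation ~]$ oc get pods -n openshift-etcd -l app=etcd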
List the resource usage of the containers in the etcd-master01 pod in the openshift-etcd namespace.
The resource usage on your system probably differs.
[student@workstation ~]$ oc adm top pods etcd-master01 \
-n openshift-etcd --containers
POD NAME CPU(cores) MEMORY(bytes)
etcd-master01 POD 0m 0Mi
etcd-master01 etcd 57m 1096Mi
etcd-master01 etcd-metrics 7m 20Mi
etcd-master01 etcd-readyz 4m 40Mi
etcd-master01 etcdctl 0m 0Mi
Display a list of all resources, their status, and their types in the openshift-monitoring namespace.
[student@workstation ~]$ oc get all -n openshift-monitoring --show-kind
NAME READY STATUS ...
pod/alertmanager-main-0 6/6 Running ...
pod/cluster-monitoring-operator-56b769b58f-dtmqj 2/2 Running ...
pod/kube-state-metrics-75455b796c-8q28d 3/3 Running ...
...output omitted...
NAME TYPE CLUSTER-IP ...
service/alertmanager-main ClusterIP 172.30.85.183 ...
service/alertmanager-operated ClusterIP None ...
service/cluster-monitoring-operator ClusterIP None ...
service/kube-state-metrics ClusterIP None ...
...output omitted...
View the logs of the alertmanager-main-0 pod in the openshift-monitoring namespace.
The logs might differ on your system.
[student@workstation ~]$ oc logs alertmanager-main-0 -n openshift-monitoring
...output omitted...
ts=2023-03-09T14:57:11.850Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
ts=2023-03-09T14:57:11.850Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
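The alertmanager-main-0 pod runs several containers (6/6 in the earlier listing), so you can narrow the logs to a single container with the -c option. The following optional sketch assumes that the pod contains a container named alertmanager; adjust the container name and the number of lines to suit your needs.
[student@workstation ~]$ oc logs alertmanager-main-0 -n openshift-monitoring \
-c alertmanager --tail 10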
Retrieve the events for the openshift-kube-controller-manager namespace.
[student@workstation ~]$ oc get events -n openshift-kube-controller-manager
LAST SEEN TYPE REASON OBJECT ...
175m Normal CreatedSCCRanges pod/kube-controller-manager-master01...
11m Normal CreatedSCCRanges pod/kube-controller-manager-master01...
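Events are not guaranteed to print in chronological order. To sort them by time, you can add the --sort-by option, as in this optional sketch; output is omitted here.
[student@workstation ~]$ oc get events -n openshift-kube-controller-manager \
--sort-by='.lastTimestamp'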
Retrieve information about cluster nodes.
View the status of the nodes in the cluster.
[student@workstation ~]$ oc get nodes
NAME STATUS ROLES AGE VERSION
master01 Ready control-plane,master,worker 27d v1.27.6+f67aeb3
Retrieve the resource consumption of the master01 node.
The resource usage on your system probably differs.
[student@workstation ~]$ oc adm top node
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
master01 781m 10% 11455Mi 60%
Use a JSONPath filter to determine the capacity and allocatable CPU for the master01 node.
The values might differ on your system.
[student@workstation ~]$ oc get node master01 -o jsonpath=\
'Allocatable: {.status.allocatable.cpu}{"\n"}'\
'Capacity: {.status.capacity.cpu}{"\n"}'
Allocatable: 7500m
Capacity: 8
Determine the number of allocatable pods for the node.
[student@workstation ~]$ oc get node master01 -o jsonpath=\
'{.status.allocatable.pods}{"\n"}'
250
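The same JSONPath technique applies to other fields of the node status. For example, this optional sketch prints the allocatable memory of the node; the value is omitted here and might differ on your system.
[student@workstation ~]$ oc get node master01 -o jsonpath=\
'{.status.allocatable.memory}{"\n"}'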
Use the describe command to view the events, resource requests, and resource limits for the node.
The output might differ on your system.
[student@workstation ~]$ oc describe node master01
...output omitted...
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 3158m (42%) 980m (13%)
memory 12667Mi (66%) 1250Mi (6%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 106m kubelet Starting kubelet.
Normal NodeHasSufficientMemory 106m (x9 over 106m) kubelet Node master01 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 106m (x7 over 106m) kubelet Node master01 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 106m (x7 over 106m) kubelet Node master01 status is now: NodeHasSufficientPID
...output omitted...
Retrieve the logs and status of the systemd services on the master01 node.
Display the logs of the node.
Filter the logs to show the most recent log for the crio service.
The logs might differ on your system.
[student@workstation ~]$ oc adm node-logs master01 -u crio --tail 1
-- Logs begin at Thu 2023-02-09 21:19:09 UTC, end at Thu 2023-03-09 16:57:00 UTC. --
Mar 09 02:39:29.158989 master01 crio[3201]: time="2023-03-09 02:39:29.158737393Z" level=info msg="Image status: &ImageStatusResponse
...output omitted...
Display the two most recent logs of the kubelet service on the node.
The logs might differ on your system.
[student@workstation ~]$ oc adm node-logs master01 -u kubelet --tail 2
-- Logs begin at Thu 2023-02-09 21:19:09 UTC, end at Thu 2023-03-09 16:59:16 UTC. --
Mar 09 02:40:57.466711 master01 systemd[1]: Stopped Kubernetes Kubelet.
Mar 09 02:40:57.466835 master01 systemd[1]: kubelet.service: Consumed 1h 27min 8.069s CPU time
-- Logs begin at Thu 2023-02-09 21:19:09 UTC, end at Thu 2023-03-09 16:59:16 UTC. --
Mar 09 16:58:52.133046 master01 kubenswrapper[3195]: I0309 16:58:52.132866 3195 kubelet_getters.go:182] "Pod status updated" pod="openshift-etcd/etcd-master01" status=Running
Mar 09 16:58:52.133046 master01 kubenswrapper[3195]: I0309 16:58:52.132882 3195 kubelet_getters.go:182] "Pod status updated" pod="openshift-kube-apiserver/kube-apiserver-master01" status=Running
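On clusters with more than one control plane node, you can query the same systemd unit on all of those nodes at once by selecting them by role instead of by name. This optional sketch uses the --role option; output is omitted here.
[student@workstation ~]$ oc adm node-logs --role=master -u crio --tail 1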
Create a debug session for the node.
Then, use the chroot /host command to access the host binaries.
[student@workstation ~]$ oc debug node/master01
Starting pod/master01-debug-khltm ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.50.10
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1#
Verify the status of the kubelet service.
sh-5.1# systemctl status kubelet
● kubelet.service - Kubernetes Kubelet
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; preset: disabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─01-kubens.conf, 10-mco-default-madv.conf, 20-logging.conf, 20-nodenet.conf
Active: active (running) since Thu 2023-03-09 14:54:51 UTC; 2h 8min ago
Main PID: 3195 (kubelet)
Tasks: 28 (limit: 127707)
Memory: 540.7M
CPU: 18min 32.117s
...output omitted...
Press Ctrl+C to quit the command.
Confirm that the crio service is active.
sh-5.1# systemctl is-active crio
active
Exit the debug pod.
sh-5.1# exit
exit
sh-4.4# exit
exit
Removing debug pod ...
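You can also run a single command on the node without opening an interactive session by appending the command to oc debug. The following optional sketch checks the crio service in one step; the debug pod is created and removed automatically.
[student@workstation ~]$ oc debug node/master01 -- chroot /host systemctl is-active crio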
Retrieve debugging information for the cluster.
Retrieve debugging information of the cluster by using the oc adm must-gather command.
Specify the /home/student/must-gather directory as the destination directory.
This command might take several minutes to complete.
[student@workstation ~]$ oc adm must-gather --dest-dir /home/student/must-gather
[must-gather ] OUT Using must-gather plug-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:07d3...e94c
...output omitted...
Reprinting Cluster State:
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
ClusterID: 94ff22c1-88a0-44cf-90f6-0b7b8b545434
ClusterVersion: Stable at "4.14.0"
ClusterOperators:
All healthy and stable
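The must-gather tooling can also run a specific gather script instead of the default collection. The following optional sketch assumes that the default must-gather image provides the /usr/bin/gather_audit_logs script for collecting audit logs; it is not required for this exercise, and without --dest-dir the data is written to a must-gather.local.* directory in the current working directory.
[student@workstation ~]$ oc adm must-gather -- /usr/bin/gather_audit_logs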
Verify that the debugging information exists in the destination directory.
List the last five kubelet service logs, and confirm that an error occurred while proxying data from the 192.168.50.10 IP address.
Replace quay-io… with the generated directory name.
[student@workstation ~]$ ls ~/must-gather
event-filter.html  quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-07d3...e94c  timestamp
[student@workstation ~]$ tail -5 \
~/must-gather/quay-io.../host_service_logs/masters/kubelet_service.log
...output omitted...
Mar 09 01:12:09.680445 master01 kubenswrapper[3275]: I1206 01:12:09.680399 3275 logs.go:323] "Finished parsing log file" path="/var/log/pods/openshift-service-ca_service-ca-5d96446959-69jq8_c9800778-c955-4b89-9bce-9f043237c986/service-ca-controller/9.log"
Mar 09 01:12:12.771111 master01 kubenswrapper[3275]: E1206 01:12:12.770971 3275 upgradeaware.go:426] Error proxying data from client to backend: readfrom tcp 192.168.50.10:44410->192.168.50.10:10010: write tcp 192.168.50.10:44410->192.168.50.10:10010: write: broken pipe
Generate debugging information for the openshift-apiserver cluster operator.
Specify the /home/student/inspect directory as the destination directory.
Limit the debugging information to the last five minutes.
[student@workstation ~]$ oc adm inspect clusteroperator/openshift-apiserver \
--dest-dir /home/student/inspect --since 5m
Gathering data for ns/openshift-config...
Gathering data for ns/openshift-config-managed...
Gathering data for ns/openshift-kube-apiserver-operator...
Gathering data for ns/openshift-kube-apiserver...
Gathering data for ns/openshift-etcd-operator...
Wrote inspect data to /home/student/inspect.
...output omitted...
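The oc adm inspect command accepts other resource types as well. For example, this optional sketch gathers debugging information for an entire namespace; the destination directory in this example is hypothetical, and output is omitted here.
[student@workstation ~]$ oc adm inspect ns/openshift-apiserver \
--dest-dir /home/student/inspect-ns --since 5m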
Verify that the debugging information exists in the destination directory, and review the cluster.yaml file from the ~/inspect/cluster-scoped-resources/operator.openshift.io/openshiftapiservers directory.
[student@workstation ~]$ ls inspect/
cluster-scoped-resources  event-filter.html  namespaces  timestamp
[student@workstation ~]$ cat \
~/inspect/cluster-scoped-resources/operator.openshift.io/\
openshiftapiservers/cluster.yaml
apiVersion: operator.openshift.io/v1
kind: OpenShiftAPIServer
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
    release.openshift.io/create-only: "true"
  creationTimestamp: "2023-12-12T16:03:42Z"
  generation: 3
  managedFields:
  - apiVersion: operator.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
...output omitted...
Delete the debugging information from your system.
[student@workstation ~]$ rm -rf must-gather inspect