In this exercise, you will configure tuning parameters and diagnose common problems for various Red Hat Ceph Storage services.
Outcomes
You should be able to identify the health error code for each affected Ceph component and resolve the issues.
As the student user on the workstation machine, use the lab command to prepare your system for this exercise.
[student@workstation ~]$ lab start tuning-troubleshoot
Procedure 12.3. Instructions
Log in to clienta as the admin user and use sudo to run the cephadm shell.
Verify the health of the Ceph storage cluster.
Two separate issues need troubleshooting. The first issue is a clock skew error, and the second is a down OSD that is degrading placement groups (PGs).
[student@workstation ~]$ ssh admin@clienta
[admin@clienta ~]$ sudo cephadm shell
[ceph: root@clienta /]# ceph health detail
HEALTH_WARN clock skew detected on mon.serverd; 1 osds down; Degraded data redundancy: 63/567 objects degraded (11.111%), 14 pgs degraded, 10 pgs undersized; 278 slow ops, oldest one blocked for 170 sec, mon.serverd has slow ops
[WRN] MON_CLOCK_SKEW: clock skew detected on mon.serverd
    mon.serverd clock skew 299.103s > max 0.05s (latency 0.0204872s)
[WRN] OSD_DOWN: 1 osds down
    osd.0 (root=default,host=serverc) is down
[WRN] PG_DEGRADED: Degraded data redundancy: 63/567 objects degraded (11.111%), 14 pgs degraded, 10 pgs undersized
    pg 2.5 is stuck undersized for 2m, current state active+undersized, last acting [8,7]
    pg 2.c is stuck undersized for 2m, current state active+undersized, last acting [6,5]
...output omitted...
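The max 0.05s value in the MON_CLOCK_SKEW warning is the monitors' allowed clock drift. As an optional check, not part of the graded steps, you can confirm the configured threshold with the ceph config command; the value shown here is the default:
[ceph: root@clienta /]# ceph config get mon mon_clock_drift_allowed
0.050000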
The lab uses chronyd for time synchronization with the classroom server.
First, troubleshoot the clock skew issue.
Log in to serverd as the admin user.
The previous health detail output stated that the clock on the serverd system differs from the clocks on the other servers by almost 300 seconds.
Viewing the chronyd service status on the serverd system should identify the problem.
Exit the cephadm shell.
On the serverd system, view the chronyd service status.
The chronyd service is inactive on the serverd system.
[ceph: root@clienta /]# exit
[admin@clienta ~]$ ssh admin@serverd
admin@serverd's password: redhat
[admin@serverd ~]$ systemctl status chronyd
● chronyd.service - NTP client/server
   Loaded: loaded (/usr/lib/systemd/system/chronyd.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Wed 2021-10-20 08:49:21 EDT; 13min ago
     Docs: man:chronyd(8)
           man:chrony.conf(5)
 Main PID: 876 (code=exited, status=0/SUCCESS)
Start the chronyd service.
[admin@serverd ~]$ sudo systemctl start chronyd
Verify that the chronyd service is active.
[admin@serverd ~]$ systemctl status chronyd
● chronyd.service - NTP client/server
   Loaded: loaded (/usr/lib/systemd/system/chronyd.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2021-10-20 09:04:01 EDT; 3min 10s ago
     Docs: man:chronyd(8)
           man:chrony.conf(5)
  Process: 15221 ExecStartPost=/usr/libexec/chrony-helper update-daemon (code=exited, status=0/SUCCESS)
  Process: 15218 ExecStart=/usr/sbin/chronyd $OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 15220 (chronyd)
    Tasks: 1 (limit: 36236)
   Memory: 360.0K
   CGroup: /system.slice/chronyd.service
           └─15220 /usr/sbin/chronyd
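As an optional sanity check, you can confirm that chronyd is actually synchronizing with the classroom time source. The chronyc tracking subcommand reports the current offset, and chronyc sources lists the configured servers; the output varies by environment:
[admin@serverd ~]$ chronyc tracking
...output omitted...
[admin@serverd ~]$ chronyc sources
...output omitted...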
Return to the clienta system and use sudo to run the cephadm shell.
[admin@serverd ~]$ exit
Connection to serverd closed.
[admin@clienta ~]$ sudo cephadm shell
[ceph: root@clienta /]#
Verify the health of the storage cluster.
The clock skew message might still display if the monitors have not yet observed the corrected time. Allow the cluster sufficient time for its services to pick up the corrected time. Continue with these exercise steps, but verify that the skew issue is resolved before finishing the exercise.
[ceph: root@clienta /]# ceph health detail
HEALTH_WARN Degraded data redundancy: 40/567 objects degraded (7.055%), 8 pgs degraded, 20 pgs undersized; 1099 slow ops, oldest one blocked for 580 sec, mon.serverd has slow ops
[WRN] PG_DEGRADED: Degraded data redundancy: 40/567 objects degraded (7.055%), 8 pgs degraded, 20 pgs undersized
pg 2.0 is stuck undersized for 8m, current state active+undersized, last acting [3,6]
pg 2.8 is stuck undersized for 8m, current state active+undersized, last acting [3,8]
...output omitted...
The health detail output might show the cluster state as HEALTH_OK.
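If the skew warning lingers, note that chrony corrects the clock gradually by default. As an optional shortcut, not required by this exercise, you can tell chronyd on serverd to step the clock immediately; makestep is a standard chronyc subcommand:
[admin@serverd ~]$ sudo chronyc makestep
200 OK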
When an OSD is down, the cluster migrates its PGs to other OSDs to return the cluster to a healthy state.
However, the down OSD still requires troubleshooting.
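If you want to watch the recovery progress while you troubleshoot, the following read-only commands are safe to run at any time from the cephadm shell; the object and PG counts in the output change as recovery proceeds:
[ceph: root@clienta /]# ceph status
...output omitted...
[ceph: root@clienta /]# ceph pg stat
...output omitted...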
Locate the down OSD.
The osd.0 service on the serverc system is reported as down.
[ceph: root@clienta /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME         STATUS  REWEIGHT  PRI-AFF
-1         0.08817  root default
-3         0.02939      host serverc
 0    hdd  0.00980          osd.0       down         0  1.00000
 1    hdd  0.00980          osd.1         up   1.00000  1.00000
 2    hdd  0.00980          osd.2         up   1.00000  1.00000
-5         0.02939      host serverd
 3    hdd  0.00980          osd.3         up   1.00000  1.00000
 5    hdd  0.00980          osd.5         up   1.00000  1.00000
 7    hdd  0.00980          osd.7         up   1.00000  1.00000
-7         0.02939      host servere
 4    hdd  0.00980          osd.4         up   1.00000  1.00000
 6    hdd  0.00980          osd.6         up   1.00000  1.00000
 8    hdd  0.00980          osd.8         up   1.00000  1.00000
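Another optional way to locate a daemon is the ceph osd find command, which reports the host and CRUSH location for an OSD ID. The host name shown here is illustrative of this lab environment and might differ in yours:
[ceph: root@clienta /]# ceph osd find 0
{
    "osd": 0,
    ...output omitted...
    "host": "serverc.lab.example.com",
    ...output omitted...
}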
Exit the cephadm shell.
On the serverc system, list the Ceph service units.
[ceph: root@clienta /]# exit
exit
[admin@clienta ~]$ ssh admin@serverc
admin@serverc's password: redhat
[admin@serverc ~]$ systemctl list-units --all 'ceph*'
UNIT                                            LOAD   ACTIVE   SUB     DESCRIPTION
...output omitted...
ceph-2ae6...fa0c@node-exporter.serverc.service  loaded active   running Ceph node-exporter.serverc for 2ae6...fa0c
ceph-2ae6...fa0c@osd.0.service                  loaded inactive dead    Ceph osd.0 for 2ae6...fa0c
ceph-2ae6...fa0c@osd.1.service                  loaded active   running Ceph osd.1 for 2ae6...fa0c
ceph-2ae6...fa0c@osd.2.service                  loaded active   running Ceph osd.2 for 2ae6...fa0c
ceph-2ae6...fa0c@prometheus.serverc.service     loaded active   running Ceph prometheus.serverc for 2ae6...fa0c
...output omitted...
Restart the OSD 0 service.
The fsid, and therefore the exact OSD 0 service unit name, differs in your lab environment.
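If you need to determine the fsid for your cluster, two optional approaches are to query it from the cephadm shell or to list the deployed daemons on the host; the elided fsid shown here matches this example:
[ceph: root@clienta /]# ceph fsid
2ae6...fa0c
[admin@serverc ~]$ sudo cephadm ls
...output omitted...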
[admin@serverc ~]$ sudo systemctl start ceph-2ae6...fa0c@osd.0.service
Verify the status of the OSD 0 service.
[admin@serverc ~]$ systemctl status ceph-2ae6...fa0c@osd.0.service
● ceph-2ae6...fa0c@osd.0.service - Ceph osd.0 for 2ae6...fa0c
   Loaded: loaded (/etc/systemd/system/ceph-2ae6...fa0c@.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2021-10-20 09:16:45 EDT; 1min 7s ago
  Process: 14368 ExecStopPost=/bin/rm -f //run/ceph-2ae6...fa0c@osd.0.service-pid //run/ceph-2ae6...fa0c@osd.0.service-cid (code=exited, st>
  Process: 14175 ExecStopPost=/bin/bash /var/lib/ceph/2ae6...fa0c/osd.0/unit.poststop (code=exited, status=0/SUCCESS)
...output omitted...
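If the OSD service had failed to start, its journal would usually explain why. As an optional diagnostic, you can query the unit's recent log entries with standard journalctl options; substitute your cluster's fsid in the unit name:
[admin@serverc ~]$ sudo journalctl -u ceph-2ae6...fa0c@osd.0.service --since "30 minutes ago"
...output omitted...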
Return to the clienta system and use sudo to run the cephadm shell.
[admin@serverc ~]$ exit
Connection to serverc closed.
[admin@clienta ~]$ sudo cephadm shell
[ceph: root@clienta /]#
Verify the OSD health. The OSD is now up.
[ceph: root@clienta /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME         STATUS  REWEIGHT  PRI-AFF
-1         0.08817  root default
-3         0.02939      host serverc
 0    hdd  0.00980          osd.0         up   1.00000  1.00000
 1    hdd  0.00980          osd.1         up   1.00000  1.00000
 2    hdd  0.00980          osd.2         up   1.00000  1.00000
-5         0.02939      host serverd
 3    hdd  0.00980          osd.3         up   1.00000  1.00000
 5    hdd  0.00980          osd.5         up   1.00000  1.00000
 7    hdd  0.00980          osd.7         up   1.00000  1.00000
-7         0.02939      host servere
 4    hdd  0.00980          osd.4         up   1.00000  1.00000
 6    hdd  0.00980          osd.6         up   1.00000  1.00000
 8    hdd  0.00980          osd.8         up   1.00000  1.00000
Verify the health of the storage cluster.
If the status is HEALTH_WARN and you have resolved the time skew and OSD issues, then wait until the cluster status is HEALTH_OK before continuing.
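Instead of rerunning ceph health manually, you can optionally stream the cluster status and log messages until recovery completes; press Ctrl+C to stop watching:
[ceph: root@clienta /]# ceph -w
...output omitted...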
[ceph: root@clienta /]# ceph health
HEALTH_OK
Return to workstation as the student user.
[ceph: root@clienta /]# exit
[admin@clienta ~]$ exit
[student@workstation ~]$
This concludes the guided exercise.