In this lab, you will tune and troubleshoot Red Hat Ceph Storage.
Outcomes
You should be able to:
Troubleshoot Ceph clients and Ceph performance issues.
Modify the logging options for Ceph clients.
Tune the recovery and backfill processes to preserve cluster performance.
As the student user on the workstation machine, use the lab command to prepare your system for this lab.
This command ensures that the lab environment is available for the exercise.
[student@workstation ~]$ lab start tuning-review
Procedure 12.4. Instructions
Log in to clienta as the admin user.
Verify the health of the Ceph storage cluster.
Hypothesize possible causes for the displayed issues.
Log in to clienta as the admin user and use sudo to run the cephadm shell.
[student@workstation ~]$ ssh admin@clienta
[admin@clienta ~]$ sudo cephadm shell
[ceph: root@clienta /]#
Verify the status of the storage cluster.
[ceph: root@clienta /]# ceph status
cluster:
id: 2ae6d05a-229a-11ec-925e-52540000fa0c
health: HEALTH_WARN
1 failed cephadm daemon(s)
clock skew detected on mon.serverd
1 osds down
Reduced data availability: 4 pgs inactive, 19 pgs peering
Degraded data redundancy: 35/567 objects degraded (6.173%), 7 pgs degraded
services:
mon: 4 daemons, quorum serverc.lab.example.com,clienta,serverd,servere (age 33s)
mgr: serverc.lab.example.com.aiqepd(active, since 9m), standbys: clienta.nncugs, servere.kjwyko, serverd.klrkci
osd: 9 osds: 8 up (since 32s), 9 in (since 8m)
rgw: 2 daemons active (2 hosts, 1 zones)
data:
pools: 5 pools, 105 pgs
objects: 189 objects, 4.9 KiB
usage: 105 MiB used, 90 GiB / 90 GiB avail
pgs: 18.095% pgs not active
35/567 objects degraded (6.173%)
68 active+clean
19 peering
11 active+undersized
7 active+undersized+degraded
Verify the health details of the storage cluster.
Two separate issues need troubleshooting. The first issue is a clock skew error, and the second issue is a down OSD that is degrading the PGs.
[ceph: root@clienta /]# ceph health detail
HEALTH_WARN 1 failed cephadm daemon(s); clock skew detected on mon.serverd; 1 osds down; Degraded data redundancy: 63/567 objects degraded (11.111%), 15 pgs degraded; 44 slow ops, oldest one blocked for 79 sec, mon.serverd has slow ops
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
daemon osd.4 on servere.lab.example.com is in unknown state
[WRN] MON_CLOCK_SKEW: clock skew detected on mon.serverd
mon.serverd clock skew 299.62s > max 0.05s (latency 0.0155988s)
[WRN] OSD_DOWN: 1 osds down
osd.4 (root=default,host=servere) is down
[WRN] PG_DEGRADED: Degraded data redundancy: 63/567 objects degraded (11.111%), 15 pgs degraded
pg 3.1 is active+undersized+degraded, acting [0,7]
...output omitted...
pg 4.15 is active+undersized+degraded, acting [0,7]
First, troubleshoot the clock skew issue.
Open a second terminal and log in to serverd as the admin user.
The previous health detail output stated that the time on the serverd system is 300 seconds different from the other servers.
View the chronyd service status on the serverd system to identify the problem.
The chronyd service is inactive on the serverd system.
[student@workstation ~]$ ssh admin@serverd
[admin@serverd ~]$ systemctl status chronyd
● chronyd.service - NTP client/server
   Loaded: loaded (/usr/lib/systemd/system/chronyd.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Mon 2021-10-25 03:51:44 EDT; 8min ago
     Docs: man:chronyd(8)
           man:chrony.conf(5)
 Main PID: 867 (code=exited, status=0/SUCCESS)
Start the chronyd service.
[admin@serverd ~]$ sudo systemctl start chronyd
Verify that the chronyd service is active.
[admin@serverd ~]$ systemctl status chronyd
● chronyd.service - NTP client/server
Loaded: loaded (/usr/lib/systemd/system/chronyd.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2021-10-25 04:00:54 EDT; 4min 53s left
Docs: man:chronyd(8)
man:chrony.conf(5)
Process: 14604 ExecStartPost=/usr/libexec/chrony-helper update-daemon (code=exited, status=0/SUCCESS)
Process: 14601 ExecStart=/usr/sbin/chronyd $OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 14603 (chronyd)
Tasks: 1 (limit: 36236)
Memory: 360.0K
CGroup: /system.slice/chronyd.service
└─14603 /usr/sbin/chronyd
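Optionally, while you are still logged in to serverd, you can check how far the local clock is from its NTP sources. This check is not part of the exercise, but the reported system time offset should shrink as chronyd resynchronizes.

[admin@serverd ~]$ chronyc tracking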
Return to workstation as the student user and close the second terminal.
[admin@serverd ~]$ exit
Connection to serverd closed.
[student@workstation ~]$ exit
In the first terminal, verify the health of the storage cluster.
The time skew message might still display if the monitoring service has not yet updated the time. Allow the cluster sufficient time for services to obtain the corrected time. Continue with these exercise steps, but verify that the skew issue is resolved before finishing the exercise.
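If you prefer not to wait, the monitors also report their own view of clock synchronization. This optional check, run from the cephadm shell, shows the skew and latency that each monitor currently measures; the exact output format depends on your Ceph version.

[ceph: root@clienta /]# ceph time-sync-status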
[ceph: root@clienta /]# ceph health detail
HEALTH_WARN 1 failed cephadm daemon(s); 1 osds down; Degraded data redundancy: 63/567 objects degraded (11.111%), 15 pgs degraded, 16 pgs undersized; 45 slow ops, oldest one blocked for 215 sec, mon.serverd has slow ops
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
daemon osd.4 on servere.lab.example.com is in error state
[WRN] OSD_DOWN: 1 osds down
osd.4 (root=default,host=servere) is down
[WRN] PG_DEGRADED: Degraded data redundancy: 63/567 objects degraded (11.111%), 15 pgs degraded, 16 pgs undersized
pg 2.4 is stuck undersized for 4m, current state active+undersized, last acting [1,3]
...output omitted...
pg 5.6 is stuck undersized for 4m, current state active+undersized, last acting [2,5]
Troubleshoot the down OSD issue. Use diagnostic logging to find and correct a non-working configuration.
Locate the down OSD.
[ceph: root@clienta /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME         STATUS  REWEIGHT  PRI-AFF
-1         0.08817  root default
-3         0.02939      host serverc
 0    hdd  0.00980          osd.0         up   1.00000  1.00000
 1    hdd  0.00980          osd.1         up   1.00000  1.00000
 2    hdd  0.00980          osd.2         up   1.00000  1.00000
-5         0.02939      host serverd
 3    hdd  0.00980          osd.3         up   1.00000  1.00000
 5    hdd  0.00980          osd.5         up   1.00000  1.00000
 7    hdd  0.00980          osd.7         up   1.00000  1.00000
-7         0.02939      host servere
 4    hdd  0.00980          osd.4       down   1.00000  1.00000
 6    hdd  0.00980          osd.6         up   1.00000  1.00000
 8    hdd  0.00980          osd.8         up   1.00000  1.00000
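As an optional cross-check, you can ask the orchestrator how it sees the daemon. An error or stopped state in this output confirms that the daemon itself is failing, rather than the OSD merely being marked down in the CRUSH map.

[ceph: root@clienta /]# ceph orch ps | grep osd.4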
Attempt to restart the OSD 4 service with the ceph orch command.
OSD 4 remains down even after you wait a sufficient amount of time.
[ceph: root@clienta /]# ceph orch daemon restart osd.4
Scheduled to restart osd.4 on host 'servere.lab.example.com'
Attempt to retrieve the configuration from the OSD 4 service.
An error message indicates a communication problem with the OSD 4 service.
[ceph: root@clienta /]# ceph tell osd.4 config show
Error ENXIO: problem getting command descriptions from osd.4
Open a second terminal and log in to servere as the admin user.
On the servere system, list the Ceph units.
[student@workstation ~]$ ssh admin@servere
[admin@servere ~]$ systemctl list-units --all 'ceph*'
UNIT                                            LOAD   ACTIVE SUB     DESCRIPTION
ceph-2ae6...fa0c@crash.servere.service          loaded active running Ceph crash.servere for 2ae6...fa0c
ceph-2ae6...fa0c@mgr.servere.kjwyko.service     loaded active running Ceph mgr.servere.kjwyko for 2ae6...fa0c
ceph-2ae6...fa0c@mon.servere.service            loaded active running Ceph mon.servere for 2ae6...fa0c
ceph-2ae6...fa0c@node-exporter.servere.service  loaded active running Ceph node-exporter.servere for 2ae6...fa0c
● ceph-2ae6...fa0c@osd.4.service                loaded failed failed  Ceph osd.4 for 2ae6...fa0c
ceph-2ae6...fa0c@osd.6.service                  loaded active running Ceph osd.6 for 2ae6...fa0c
ceph-2ae6...fa0c@osd.8.service                  loaded active running Ceph osd.8 for 2ae6...fa0c
...output omitted...
The OSD 4 service might not yet be listed as failed if the orchestrator is still attempting to restart it.
Wait until the service lists as failed before continuing this exercise.
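One way to poll for this state without rerunning the full list-units command is to query just that unit; the command returns failed when the unit has given up. The unit name below reuses the fsid from this example output and is different in your lab environment.

[admin@servere ~]$ systemctl is-failed ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@osd.4.service
failed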
Restart the OSD 4 service.
The fsid and the OSD 4 service name might be different in your lab environment.
The OSD 4 service still fails to start.
[admin@servere ~]$ sudo systemctl restart \
ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@osd.4.service
Job for ceph-2ae6...fa0c@osd.4.service failed because the control process exited with error code.
See "systemctl status ceph-2ae6...fa0c@osd.4.service" and "journalctl -xe" for details.In the first terminal, modify the OSD 4 logging configuration to write to the /var/log/ceph/myosd4.log file and increase the logging level for OSD 4.
In the first terminal, modify the OSD 4 logging configuration to write to the /var/log/ceph/myosd4.log file and increase the logging level for OSD 4. Then attempt to restart the OSD 4 service with the ceph orch command.
[ceph: root@clienta ~]# ceph config set osd.4 log_file /var/log/ceph/myosd4.log
[ceph: root@clienta ~]# ceph config set osd.4 log_to_file true
[ceph: root@clienta ~]# ceph config set osd.4 debug_ms 1
[ceph: root@clienta /]# ceph orch daemon restart osd.4
Scheduled to restart osd.4 on host 'servere.lab.example.com'
In the second terminal window, view the myosd4.log file to discover the issue.
Error messages indicate an incorrect cluster network address configuration.
[admin@servere ~]$ sudo cat \
/var/log/ceph/2ae6d05a-229a-11ec-925e-52540000fa0c/myosd4.log
...output omitted...
2021-10-25T08:26:22.617+0000 7f86c1f9e080  1 bdev(0x561a77661000 /var/lib/ceph/osd/ceph-4/block) close
2021-10-25T08:26:22.877+0000 7f86c1f9e080  1 bdev(0x561a77660c00 /var/lib/ceph/osd/ceph-4/block) close
2021-10-25T08:26:23.125+0000 7f86c1f9e080  0 starting osd.4 osd_data /var/lib/ceph/osd/ceph-4 /var/lib/ceph/osd/ceph-4/journal
2021-10-25T08:26:23.125+0000 7f86c1f9e080 -1 unable to find any IPv4 address in networks '192.168.0.0/24' interfaces ''
2021-10-25T08:26:23.125+0000 7f86c1f9e080 -1 Failed to pick cluster address.
Return to workstation as the student user and close the second terminal.
[admin@servere ~]$ exit
[student@workstation ~]$ exit
In the first terminal, compare the cluster network addresses for the OSD 0 and OSD 4 services.
[ceph: root@clienta /]# ceph config get osd.0 cluster_network
172.25.249.0/24
[ceph: root@clienta /]# ceph config get osd.4 cluster_network
192.168.0.0/24
Modify the cluster network value for the OSD 4 service. Attempt to restart the OSD 4 service.
[ceph: root@clienta /]# ceph config set osd.4 cluster_network 172.25.249.0/24
[ceph: root@clienta /]# ceph orch daemon restart osd.4
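If you want to confirm that the new value was stored before checking the daemon, you can read the setting back; it should now match the network that the other OSDs use.

[ceph: root@clienta /]# ceph config get osd.4 cluster_network
172.25.249.0/24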
Verify that the OSD 4 service is now up.
Verify the health of the storage cluster.
[ceph: root@clienta /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME         STATUS  REWEIGHT  PRI-AFF
-1         0.08817  root default
-3         0.02939      host serverc
 0    hdd  0.00980          osd.0         up   1.00000  1.00000
 1    hdd  0.00980          osd.1         up   1.00000  1.00000
 2    hdd  0.00980          osd.2         up   1.00000  1.00000
-7         0.02939      host serverd
 3    hdd  0.00980          osd.3         up   1.00000  1.00000
 5    hdd  0.00980          osd.5         up   1.00000  1.00000
 7    hdd  0.00980          osd.7         up   1.00000  1.00000
-5         0.02939      host servere
 4    hdd  0.00980          osd.4         up   1.00000  1.00000
 6    hdd  0.00980          osd.6         up   1.00000  1.00000
 8    hdd  0.00980          osd.8         up   1.00000  1.00000
[ceph: root@clienta /]# ceph health detail
HEALTH_OK
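Now that OSD 4 is healthy, you can optionally remove the extra logging settings that you added while troubleshooting, so that the OSD returns to its default log destination and debug level. This cleanup is not required by the exercise.

[ceph: root@clienta /]# ceph config rm osd.4 debug_ms
[ceph: root@clienta /]# ceph config rm osd.4 log_to_file
[ceph: root@clienta /]# ceph config rm osd.4 log_file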
For the OSD 5 service, set the operations history size to track 40 completed operations and the operations history duration to 700 seconds.
Modify the osd_op_history_size and osd_op_history_duration parameters.
Set the history size parameter to 40 and the history duration parameter to 700.
Use the ceph daemon command to verify the changed values.
[ceph: root@clienta /]# ceph tell osd.5 config set osd_op_history_size 40
{
    "success": "osd_op_history_size = '40' "
}
[ceph: root@clienta /]# ceph tell osd.5 config set osd_op_history_duration 700
{
    "success": "osd_op_history_duration = '700' "
}
[ceph: root@clienta /]# ceph tell osd.5 dump_historic_ops | head -n 3
{
    "size": 40,
    "duration": 700,
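Note that ceph tell changes only the runtime value in the running daemon; if OSD 5 restarts, the setting reverts to whatever is in the configuration database. If you wanted the values to persist, you would store them centrally instead, for example:

[ceph: root@clienta /]# ceph config set osd.5 osd_op_history_size 40
[ceph: root@clienta /]# ceph config set osd.5 osd_op_history_duration 700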
For all OSDs, modify the current runtime value for the maximum concurrent backfills to 3 and for the maximum active recovery operations to 1.
Set the value of the osd_max_backfills parameter to 3.
[ceph: root@clienta /]# ceph tell osd.* config set osd_max_backfills 3
osd.0: {
"success": "osd_max_backfills = '3' "
}
...output omitted...
osd.8: {
"success": "osd_max_backfills = '3' "
}
Set the value of the osd_recovery_max_active parameter to 1.
[ceph: root@clienta /]# ceph tell osd.* config set osd_recovery_max_active 1
osd.0: {
"success": "osd_recovery_max_active = '1' "
}
...output omitted...
osd.8: {
"success": "osd_recovery_max_active = '1' "
}
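Optionally, before leaving the cephadm shell, you can confirm the new runtime values on one of the OSDs; both queries should report the values that you just set.

[ceph: root@clienta /]# ceph tell osd.0 config get osd_max_backfills
[ceph: root@clienta /]# ceph tell osd.0 config get osd_recovery_max_active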
Return to workstation as the student user.
This concludes the lab.