In this exercise, you will perform maintenance activities on an operational Red Hat Ceph Storage cluster.
Outcomes
You should be able to add, replace, and remove components in an operational Red Hat Ceph Storage cluster.
As the student user on the workstation machine, use the lab command to prepare your system for this exercise.
[student@workstation ~]$ lab start cluster-maint
This command confirms that the hosts required for this exercise are accessible and stops the osd.3 daemon to simulate an OSD failure.
Procedure 11.2. Instructions
Log in to clienta as the admin user and use sudo to run the cephadm shell.
[student@workstation ~]$ ssh admin@clienta
[admin@clienta ~]$ sudo cephadm shell
[ceph: root@clienta /]#
Set the noscrub and nodeep-scrub flags to prevent the cluster from starting scrubbing operations temporarily.
[ceph: root@clienta /]# ceph osd set noscrub
noscrub is set
[ceph: root@clienta /]# ceph osd set nodeep-scrub
nodeep-scrub is set
Verify the Ceph cluster status.
The cluster will transition to the HEALTH_WARN status after some time.
[ceph: root@clienta /]# ceph health detail
HEALTH_WARN noscrub,nodeep-scrub flag(s) set; 1 osds down; Degraded data redundancy: 82/663 objects degraded (12.368%), 14 pgs degraded
[WRN] OSDMAP_FLAGS: noscrub,nodeep-scrub flag(s) set
[WRN] OSD_DOWN: 1 osds down
    osd.3 (root=default,host=serverd) is down
[WRN] PG_DEGRADED: Degraded data redundancy: 82/663 objects degraded (12.368%), 14 pgs degraded
    pg 2.f is active+undersized+degraded, acting [8,0]
    pg 2.19 is active+undersized+degraded, acting [0,8]
    pg 3.0 is active+undersized+degraded, acting [8,1]
...output omitted...
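When the health report is long, it can help to filter just the warning summary lines. A minimal sketch, using a here-document as a stand-in sample for the live `ceph health detail` output (on the cluster you would pipe the real command into grep instead):

```shell
# Sample health output standing in for: ceph health detail | grep '^\[WRN\]'
cat <<'EOF' | grep '^\[WRN\]'
HEALTH_WARN noscrub,nodeep-scrub flag(s) set; 1 osds down
[WRN] OSDMAP_FLAGS: noscrub,nodeep-scrub flag(s) set
[WRN] OSD_DOWN: 1 osds down
[WRN] PG_DEGRADED: Degraded data redundancy: 14 pgs degraded
EOF
```

This prints only the three `[WRN]` summary lines, which is often enough to decide which subsystem to investigate first.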
Identify the failed OSD device for replacement.
Identify which OSD is down.
[ceph: root@clienta /]# ceph osd tree | grep -i down
3   hdd  0.00980          osd.3       down  1.00000  1.00000
Identify which host the OSD is on.
[ceph: root@clienta /]# ceph osd find osd.3
{
    "osd": 3,
    ...output omitted...
    "host": "serverd.lab.example.com",
    "crush_location": {
        "host": "serverd",
        "root": "default"
    }
}
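If you want the host name alone, for example to feed into a script, you can extract it from the JSON with standard text tools. A sketch using an abridged, hypothetical copy of the `ceph osd find` output (on the cluster you would pipe the real command in; a JSON-aware tool such as jq is preferable when available):

```shell
# Abridged sample of `ceph osd find osd.3` JSON output
sample='{"osd": 3, "host": "serverd.lab.example.com", "crush_location": {"host": "serverd", "root": "default"}}'

# Pull the first "host" value: match the key/value pair, keep the value field
host=$(printf '%s\n' "$sample" | grep -o '"host": "[^"]*"' | head -n1 | cut -d'"' -f4)
echo "$host"   # serverd.lab.example.com
```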
Log in to the serverd node and use sudo to run the cephadm shell.
Identify the device name for the failed OSD.
[ceph: root@clienta /]# ssh admin@serverd
admin@serverd's password: redhat
[admin@serverd ~]$ sudo cephadm shell
[ceph: root@serverd /]# ceph-volume lvm list
====== osd.3 =======
...output omitted...
      devices      /dev/vdb
...output omitted...
You can also identify the device name of an OSD by using the ceph osd metadata OSD_ID command from the admin node.
Exit the cephadm shell.
Identify the service name of the osd.3 daemon running on the serverd node.
The service name will be different in your lab environment.
[ceph: root@serverd /]# exit
exit
[admin@serverd ~]$ sudo systemctl list-units --all "ceph*"
UNIT                                                                     LOAD   ACTIVE   SUB     DESCRIPTION
ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@crash.serverd.service          loaded active   running Ceph crash.serverd for 2a>
ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@mgr.serverd.klrkci.service     loaded active   running Ceph mgr.serverd.klr>
ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@mon.serverd.service            loaded active   running Ceph mon.serverd for 2ae6>
ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@node-exporter.serverd.service  loaded active   running Ceph node-exporte>
ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@osd.3.service                  loaded inactive dead    Ceph osd.3 for 2ae6d05a-2>
ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@osd.5.service                  loaded active   running Ceph osd.5 for 2ae6d05a-2>
...output omitted...
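Rather than scanning the full unit list, you can construct the unit name directly: cephadm names each daemon's systemd unit ceph-FSID@DAEMON.ID.service. A sketch using this lab's example cluster fsid:

```shell
# Build the systemd unit name for an OSD from the cluster fsid and OSD id.
# FSID below is the example cluster ID used throughout this lab.
FSID=2ae6d05a-229a-11ec-925e-52540000fa0c
OSD_ID=3
UNIT="ceph-${FSID}@osd.${OSD_ID}.service"
echo "$UNIT"   # ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@osd.3.service
```

You can then pass `$UNIT` straight to systemctl or journalctl instead of copying the long name by hand.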
Check the recent log entries for the osd.3 service.
[admin@serverd ~]$ sudo journalctl -ru \
ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@osd.3.service
...output omitted...
Oct 06 22:25:49 serverd.lab.example.com systemd[1]: Stopped Ceph osd.3 for 2ae6d05a-229a-11ec-925e-52540000fa0c.
...output omitted...
On the serverd node, start the osd.3 service.
On the admin node, verify that the OSD has started.
[admin@serverd ~]$ sudo systemctl start \
ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@osd.3.service
[admin@serverd ~]$ exit
logout
Connection to serverd closed.
[ceph: root@clienta /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME         STATUS  REWEIGHT  PRI-AFF
-1         0.09796  root default
-3         0.03918      host serverc
 0    hdd  0.00980          osd.0         up   1.00000  1.00000
 1    hdd  0.00980          osd.1         up   1.00000  1.00000
 2    hdd  0.00980          osd.2         up   1.00000  1.00000
-5         0.02939      host serverd
 3    hdd  0.00980          osd.3         up   1.00000  1.00000
 5    hdd  0.00980          osd.5         up   1.00000  1.00000
 7    hdd  0.00980          osd.7         up   1.00000  1.00000
-7         0.02939      host servere
 4    hdd  0.00980          osd.4         up   1.00000  1.00000
 6    hdd  0.00980          osd.6         up   1.00000  1.00000
 8    hdd  0.00980          osd.8         up   1.00000  1.00000
Clear the noscrub and nodeep-scrub flags.
Verify that the cluster health status returns to HEALTH_OK.
Press Ctrl+C to exit the ceph -w command.
[ceph: root@clienta /]# ceph osd unset noscrub
noscrub is unset
[ceph: root@clienta /]# ceph osd unset nodeep-scrub
nodeep-scrub is unset
[ceph: root@clienta /]# ceph -w
...output omitted...
    health: HEALTH_OK
...output omitted...
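Since the two flags are always set and cleared as a pair in this workflow, a small loop avoids repeating the command. The sketch below is a dry run that only echoes each command; on the live cluster you would drop the leading echo:

```shell
# Dry run: print the unset command for each scrub flag.
# Remove the "echo" to actually clear the flags on a cluster.
for flag in noscrub nodeep-scrub; do
    echo ceph osd unset "$flag"
done
```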
Adjust the number of MONs and their placement in the cluster.
View the number of running MONs and their placement.
[ceph: root@clienta /]# ceph orch ls --service_type=mon
NAME  RUNNING  REFRESHED  AGE  PLACEMENT
mon   4/4      6m ago     5d   clienta.lab.example.com;serverc.lab.example.com;serverd.lab.example.com;servere.lab.example.com
Add the serverg node to the cluster.
[ceph: root@clienta /]# ceph cephadm get-pub-key > ~/ceph.pub
[ceph: root@clienta /]# ssh-copy-id -f -i ~/ceph.pub root@serverg
root@serverg's password: redhat
...output omitted...
[ceph: root@clienta /]# ceph orch host add serverg.lab.example.com
Added host 'serverg.lab.example.com' with addr '172.25.250.16'
Add a MON and place it on the serverg node.
[ceph: root@clienta /]# ceph orch apply mon \
--placement="clienta.lab.example.com serverc.lab.example.com \
serverd.lab.example.com servere.lab.example.com \
serverg.lab.example.com"
Scheduled mon update...
Verify that the MONs are active and correctly placed.
[ceph: root@clienta /]# ceph orch ls --service-type=mon
NAME  RUNNING  REFRESHED  AGE  PLACEMENT
mon   5/5      -          58s  clienta.lab.example.com;serverc.lab.example.com;serverd.lab.example.com;servere.lab.example.com;serverg.lab.example.com
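Because the placement list must name every MON host each time it changes, keeping the list in a shell variable makes additions and removals a one-line edit. A dry-run sketch using this lab's hostnames (the echo only prints the command it would run):

```shell
# Maintain the MON placement list in one variable; hostnames are the
# lab's examples. The echo is a dry run -- remove it to apply.
hosts="clienta.lab.example.com serverc.lab.example.com serverd.lab.example.com servere.lab.example.com serverg.lab.example.com"
echo ceph orch apply mon --placement="$hosts"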
Remove the MON service from serverg node, remove its OSDs, and then remove serverg from the cluster.
Verify that the serverg node is removed.
Remove the MON service from the serverg node.
[ceph: root@clienta /]# ceph orch apply mon \
--placement="clienta.lab.example.com serverc.lab.example.com \
serverd.lab.example.com servere.lab.example.com"
Scheduled mon update...
[ceph: root@clienta /]# ceph mon stat
e6: 4 mons at {clienta=[v2:172.25.250.10:3300/0,v1:172.25.250.10:6789/0], serverc.lab.example.com=[v2:172.25.250.12:3300/0,v1:172.25.250.12:6789/0], serverd=[v2:172.25.250.13:3300/0,v1:172.25.250.13:6789/0], servere=[v2:172.25.250.14:3300/0,v1:172.25.250.14:6789/0]}, election epoch 46, leader 0 serverc.lab.example.com, quorum 0,1,2,3 serverc.lab.example.com,clienta,serverd,servere
Always keep at least three MONs running in a production cluster.
Remove the serverg node's OSDs.
[ceph: root@clienta /]# ceph orch ps serverg.lab.example.com
NAME                   HOST                     STATUS         REFRESHED  AGE  PORTS   VERSION           IMAGE ID      CONTAINER ID
crash.serverg          serverg.lab.example.com  running (3m)   35s ago    3m   -       16.2.0-117.el8cp  2142b60d7974  db0eb4d442b2
node-exporter.serverg  serverg.lab.example.com  running (3m)   35s ago    3m   *:9100  0.18.1            68b1be7484d4  982fc365dc88
osd.10                 serverg.lab.example.com  running (2m)   35s ago    2m   -       16.2.0-117.el8cp  2142b60d7974  c503c770f6ef
osd.11                 serverg.lab.example.com  running (2m)   35s ago    2m   -       16.2.0-117.el8cp  2142b60d7974  3e4f85ad8384
osd.9                  serverg.lab.example.com  running (2m)   35s ago    2m   -       16.2.0-117.el8cp  2142b60d7974  ab9563910c19
[ceph: root@clienta /]# ceph osd stop 9 10 11
stop down osd.9.
stop down osd.10.
stop down osd.11.
[ceph: root@clienta /]# ceph osd out 9 10 11
marked out osd.9.
marked out osd.10.
marked out osd.11.
[ceph: root@clienta /]# ceph osd crush remove osd.9
removed item id 9 name 'osd.9' from crush map
[ceph: root@clienta /]# ceph osd crush remove osd.10
removed item id 10 name 'osd.10' from crush map
[ceph: root@clienta /]# ceph osd crush remove osd.11
removed item id 11 name 'osd.11' from crush map
[ceph: root@clienta /]# ceph osd rm 9 10 11
removed osd.9, osd.10, osd.11
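The ceph osd crush remove command takes one OSD at a time, so a loop over the IDs keeps the removal sequence compact. The sketch below is a dry run that only echoes each command; drop the echoes to execute on a live cluster:

```shell
# Dry run of the per-OSD CRUSH removal, followed by the batch rm.
# Remove each "echo" to actually run the commands.
for id in 9 10 11; do
    echo ceph osd crush remove "osd.${id}"
done
echo ceph osd rm 9 10 11
```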
Remove the serverg node from the cluster.
Verify that the serverg node has been removed.
[ceph: root@clienta /]# ceph orch host rm serverg.lab.example.com
Removed host 'serverg.lab.example.com'
[ceph: root@clienta /]# ceph orch host ls
HOST                     ADDR           LABELS  STATUS
clienta.lab.example.com  172.25.250.10  _admin
serverc.lab.example.com  172.25.250.12
serverd.lab.example.com  172.25.250.13
servere.lab.example.com  172.25.250.14
You receive an alert that there is an issue on the servere node.
Put the servere node into maintenance mode, reboot the host, and then exit maintenance mode.
Put the servere node into maintenance mode, and then verify that it has a maintenance status.
[ceph: root@clienta /]# ceph orch host maintenance enter servere.lab.example.com
Ceph cluster 2ae6d05a-229a-11ec-925e-52540000fa0c on servere.lab.example.com moved to maintenance
[ceph: root@clienta /]# ceph orch host ls
HOST                     ADDR           LABELS  STATUS
clienta.lab.example.com  172.25.250.10  _admin
serverc.lab.example.com  172.25.250.12
serverd.lab.example.com  172.25.250.13
servere.lab.example.com  172.25.250.14          Maintenance
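If you script maintenance windows, you can pick out the hosts currently in maintenance from the host listing. A sketch using a here-document as a stand-in sample for the live `ceph orch host ls` output (on the cluster you would pipe the real command into awk):

```shell
# Sample `ceph orch host ls` output; awk prints the hostname of any
# row whose last column is Maintenance.
cat <<'EOF' | awk '$NF == "Maintenance" {print $1}'
HOST                     ADDR           LABELS  STATUS
clienta.lab.example.com  172.25.250.10  _admin
servere.lab.example.com  172.25.250.14          Maintenance
EOF
```

This prints servere.lab.example.com, the only host in the sample with a Maintenance status.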
Reboot the servere node.
[ceph: root@clienta /]# ssh admin@servere sudo reboot
admin@servere's password: redhat
Connection to servere closed by remote host.
After the servere node reboots, exit maintenance mode.
[ceph: root@clienta /]# ceph orch host maintenance exit servere.lab.example.com
Ceph cluster 2ae6d05a-229a-11ec-925e-52540000fa0c on servere.lab.example.com has exited maintenance mode
Return to workstation as the student user.
[ceph: root@clienta /]# exit
[admin@clienta ~]$ exit
[student@workstation ~]$
This concludes the guided exercise.