Guided Exercise: Performing Cluster Maintenance Operations

In this exercise, you will perform maintenance activities on an operational Red Hat Ceph Storage cluster.

Outcomes

You should be able to add, replace, and remove components in an operational Red Hat Ceph Storage cluster.

As the student user on the workstation machine, use the lab command to prepare your system for this exercise.

[student@workstation ~]$ lab start cluster-maint

This command confirms that the hosts required for this exercise are accessible and stops the osd.3 daemon to simulate an OSD failure.

Procedure 11.2. Instructions

  1. Log in to clienta as the admin user and use sudo to run the cephadm shell.

    [student@workstation ~]$ ssh admin@clienta
    [admin@clienta ~]$ sudo cephadm shell
    [ceph: root@clienta /]#
  2. Set the noscrub and nodeep-scrub flags to temporarily prevent the cluster from starting scrubbing operations.

    [ceph: root@clienta /]# ceph osd set noscrub
    noscrub is set
    [ceph: root@clienta /]# ceph osd set nodeep-scrub
    nodeep-scrub is set
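
    Note

    If you want to confirm which cluster-wide flags are currently set, one optional check (not part of this exercise; the full flag list in the output varies by cluster) is to filter the OSD map summary:

      [ceph: root@clienta /]# ceph osd dump | grep ^flags

    The flags line should now include noscrub and nodeep-scrub.
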
  3. Verify the Ceph cluster status. After a short time, the cluster transitions to the HEALTH_WARN status because the osd.3 daemon is down.

    [ceph: root@clienta /]# ceph health detail
    HEALTH_WARN noscrub,nodeep-scrub flag(s) set; 1 osds down; Degraded data redundancy: 82/663 objects degraded (12.368%), 14 pgs degraded
    [WRN] OSDMAP_FLAGS: noscrub,nodeep-scrub flag(s) set
    [WRN] OSD_DOWN: 1 osds down
        osd.3 (root=default,host=serverd) is down
    [WRN] PG_DEGRADED: Degraded data redundancy: 82/663 objects degraded (12.368%), 14 pgs degraded
        pg 2.f is active+undersized+degraded, acting [8,0]
        pg 2.19 is active+undersized+degraded, acting [0,8]
        pg 3.0 is active+undersized+degraded, acting [8,1]
    ...output omitted...
  4. Identify the failed OSD device for replacement.

    1. Identify which OSD is down.

      [ceph: root@clienta /]# ceph osd tree | grep -i down
       3   hdd 0.00980         osd.3      down  1.00000  1.00000
    2. Identify which host the OSD is on.

      [ceph: root@clienta /]# ceph osd find osd.3
      {
          "osd": 3,
      ...output omitted...
          "host": "serverd.lab.example.com",
          "crush_location": {
              "host": "serverd",
              "root": "default"
          }
      }
    3. Log in to the serverd node and use sudo to run the cephadm shell. Identify the device name for the failed OSD.

      [ceph: root@clienta /]# ssh admin@serverd
      admin@serverd's password: redhat
      [admin@serverd ~]$ sudo cephadm shell
      [ceph: root@serverd /]# ceph-volume lvm list
      
      ====== osd.3 =======
      ...output omitted...
            devices                   /dev/vdb
      ...output omitted...

      Note

      You can also identify the device name of an OSD by using the ceph osd metadata OSD_ID command from the admin node.
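
      For example, a minimal check (assuming osd.3, and that you only want the host and device fields from the JSON output) is to filter the metadata:

        [ceph: root@clienta /]# ceph osd metadata 3 | grep -E '"hostname"|"devices"'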

  5. Exit the cephadm shell. Identify the systemd unit name of the osd.3 daemon on the serverd node. The unit name includes the cluster FSID, so it is different in your lab environment.

    [ceph: root@serverd /]# exit
    exit
    [admin@serverd ~]$ sudo systemctl list-units --all "ceph*"
    UNIT                                                            LOAD   ACTIVE SUB     DESCRIPTION
    ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@crash.serverd.service loaded active running Ceph crash.serverd for 2a>
    ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@mgr.serverd.klrkci.service loaded active running Ceph mgr.serverd.klr>
    ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@mon.serverd.service   loaded active running Ceph mon.serverd for 2ae6>
    ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@node-exporter.serverd.service loaded active running Ceph node-exporte>
    ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@osd.3.service         loaded inactive dead Ceph osd.3 for 2ae6d05a-2>
    ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@osd.5.service         loaded active running Ceph osd.5 for 2ae6d05a-2>
    ...output omitted...
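
    Note

    If you already know the unit name, a more direct check (shown here with the FSID from the example output above; yours is different) is to query that unit alone:

      [admin@serverd ~]$ sudo systemctl status \
      ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@osd.3.service

    The unit should report inactive (dead) while the OSD is down.
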
  6. Check the recent log entries for the osd.3 service.

    [admin@serverd ~]$ sudo journalctl -ru \
    ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@osd.3.service
    ...output omitted...
    Oct 06 22:25:49 serverd.lab.example.com systemd[1]: Stopped Ceph osd.3 for 2ae6d05a-229a-11ec-925e-52540000fa0c.
    ...output omitted...
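
    Note

    To watch the daemon's log in real time while troubleshooting, you can follow the unit instead of reading it in reverse order (again using the FSID from this example; adjust it for your environment):

      [admin@serverd ~]$ sudo journalctl -fu \
      ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@osd.3.service
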
  7. On the serverd node, start the osd.3 service. On the admin node, verify that the OSD has started.

    [admin@serverd ~]$ sudo systemctl start \
    ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@osd.3.service
    [admin@serverd ~]$ exit
    logout
    Connection to serverd closed.
    [ceph: root@clienta /]# ceph osd tree
    ID  CLASS  WEIGHT   TYPE NAME         STATUS     REWEIGHT  PRI-AFF
    -1         0.09796  root default
    -3         0.03918      host serverc
     0    hdd  0.00980          osd.0            up   1.00000  1.00000
     1    hdd  0.00980          osd.1            up   1.00000  1.00000
     2    hdd  0.00980          osd.2            up   1.00000  1.00000
    -5         0.02939      host serverd
     3    hdd  0.00980          osd.3            up   1.00000  1.00000
     5    hdd  0.00980          osd.5            up   1.00000  1.00000
     7    hdd  0.00980          osd.7            up   1.00000  1.00000
    -7         0.02939      host servere
     4    hdd  0.00980          osd.4            up   1.00000  1.00000
     6    hdd  0.00980          osd.6            up   1.00000  1.00000
     8    hdd  0.00980          osd.8            up   1.00000  1.00000
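
    Note

    For a one-line confirmation that all OSDs are up and in, you can print the OSD map summary instead of reading the full tree (the counts and epoch in the output depend on your cluster):

      [ceph: root@clienta /]# ceph osd stat
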
  8. Clear the noscrub and nodeep-scrub flags. Verify that the cluster health status returns to HEALTH_OK. Press Ctrl+C to exit the ceph -w command.

    [ceph: root@clienta /]# ceph osd unset noscrub
    noscrub is unset
    [ceph: root@clienta /]# ceph osd unset nodeep-scrub
    nodeep-scrub is unset
    [ceph: root@clienta /]# ceph -w
    ...output omitted...
    health: HEALTH_OK
    ...output omitted...
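
    Note

    The ceph -w command blocks and streams cluster events until you interrupt it. If you only need the current status once, ceph health returns immediately:

      [ceph: root@clienta /]# ceph health
      HEALTH_OK
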
  9. Adjust the number of MONs and their placement in the cluster.

    1. View the number of running MONs and their placement.

      [ceph: root@clienta /]# ceph orch ls --service_type=mon
      NAME  RUNNING  REFRESHED  AGE  PLACEMENT
      mon       4/4  6m ago     5d   clienta.lab.example.com;serverc.lab.example.com;serverd.lab.example.com;servere.lab.example.com
    2. Add the serverg node to the cluster.

      [ceph: root@clienta /]# ceph cephadm get-pub-key > ~/ceph.pub
      [ceph: root@clienta /]# ssh-copy-id -f -i ~/ceph.pub root@serverg
      root@serverg's password: redhat
      ...output omitted...
      [ceph: root@clienta /]# ceph orch host add serverg.lab.example.com
      Added host 'serverg.lab.example.com' with addr '172.25.250.16'
    3. Add a MON and place it on the serverg node.

      [ceph: root@clienta /]# ceph orch apply mon \
      --placement="clienta.lab.example.com serverc.lab.example.com \
      serverd.lab.example.com servere.lab.example.com \
      serverg.lab.example.com"
      Scheduled mon update...
    4. Verify that the MONs are active and correctly placed.

      [ceph: root@clienta /]# ceph orch ls --service-type=mon
      NAME  RUNNING  REFRESHED  AGE  PLACEMENT
      mon       5/5  -          58s  clienta.lab.example.com;serverc.lab.example.com;serverd.lab.example.com;servere.lab.example.com;serverg.lab.example.com
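
      Note

      Placement can also be expressed as a count instead of an explicit host list, in which case the orchestrator chooses the hosts. For example, the following command (not used in this exercise) would request five MON daemons on hosts of the orchestrator's choosing:

        [ceph: root@clienta /]# ceph orch apply mon 5
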
  10. Remove the MON service from the serverg node, remove its OSDs, and then remove the serverg node from the cluster. Verify that the serverg node is removed.

    1. Remove the MON daemon from the serverg node by reapplying the MON placement without that host.

      [ceph: root@clienta /]# ceph orch apply mon \
      --placement="clienta.lab.example.com serverc.lab.example.com \
      serverd.lab.example.com servere.lab.example.com"
      Scheduled mon update...
      [ceph: root@clienta /]# ceph mon stat
      e6: 4 mons at {clienta=[v2:172.25.250.10:3300/0,v1:172.25.250.10:6789/0], serverc.lab.example.com=[v2:172.25.250.12:3300/0,v1:172.25.250.12:6789/0], serverd=[v2:172.25.250.13:3300/0,v1:172.25.250.13:6789/0], servere=[v2:172.25.250.14:3300/0,v1:172.25.250.14:6789/0]}, election epoch 46, leader 0 serverc.lab.example.com, quorum 0,1,2,3 serverc.lab.example.com,clienta,serverd,servere

      Important

      Always keep at least three MONs running in a production cluster.

    2. Remove the serverg node's OSDs.

      [ceph: root@clienta /]# ceph orch ps serverg.lab.example.com
      NAME                   HOST                     STATUS        REFRESHED  AGE  PORTS   VERSION           IMAGE ID      CONTAINER ID
      crash.serverg          serverg.lab.example.com  running (3m)  35s ago    3m   -       16.2.0-117.el8cp  2142b60d7974  db0eb4d442b2
      node-exporter.serverg  serverg.lab.example.com  running (3m)  35s ago    3m   *:9100  0.18.1            68b1be7484d4  982fc365dc88
      osd.10                 serverg.lab.example.com  running (2m)  35s ago    2m   -       16.2.0-117.el8cp  2142b60d7974  c503c770f6ef
      osd.11                 serverg.lab.example.com  running (2m)  35s ago    2m   -       16.2.0-117.el8cp  2142b60d7974  3e4f85ad8384
      osd.9                  serverg.lab.example.com  running (2m)  35s ago    2m   -       16.2.0-117.el8cp  2142b60d7974  ab9563910c19
      [ceph: root@clienta /]# ceph osd stop 9 10 11
      stop down osd.9. stop down osd.10. stop down osd.11.
      [ceph: root@clienta /]# ceph osd out 9 10 11
      marked out osd.9. marked out osd.10. marked out osd.11.
      [ceph: root@clienta /]# ceph osd crush remove osd.9
      removed item id 9 name 'osd.9' from crush map
      [ceph: root@clienta /]# ceph osd crush remove osd.10
      removed item id 10 name 'osd.10' from crush map
      [ceph: root@clienta /]# ceph osd crush remove osd.11
      removed item id 11 name 'osd.11' from crush map
      [ceph: root@clienta /]# ceph osd rm 9 10 11
      removed osd.9, osd.10, osd.11
    3. Remove the serverg node from the cluster. Verify that the serverg node has been removed.

      [ceph: root@clienta /]# ceph orch host rm serverg.lab.example.com
      Removed host 'serverg.lab.example.com'
      [ceph: root@clienta /]# ceph orch host ls
      HOST                     ADDR           LABELS  STATUS
      clienta.lab.example.com  172.25.250.10  _admin
      serverc.lab.example.com  172.25.250.12
      serverd.lab.example.com  172.25.250.13
      servere.lab.example.com  172.25.250.14
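
      Note

      In a cephadm-managed cluster, the orchestrator can also drain and remove OSDs for you. As a sketch of that alternative workflow (not used in this exercise), you would schedule the removal and then monitor its progress:

        [ceph: root@clienta /]# ceph orch osd rm 9 10 11
        [ceph: root@clienta /]# ceph orch osd rm status
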
  11. You receive an alert that there is an issue on the servere node. Put the servere node into maintenance mode, reboot the host, and then exit maintenance mode.

    1. Put the servere node into maintenance mode, and then verify that it has a maintenance status.

      [ceph: root@clienta /]# ceph orch host maintenance enter servere.lab.example.com
      Ceph cluster 2ae6d05a-229a-11ec-925e-52540000fa0c on servere.lab.example.com moved to maintenance
      [ceph: root@clienta /]# ceph orch host ls
      HOST                     ADDR           LABELS  STATUS
      clienta.lab.example.com  172.25.250.10  _admin
      serverc.lab.example.com  172.25.250.12
      serverd.lab.example.com  172.25.250.13
      servere.lab.example.com  172.25.250.14          Maintenance
    2. Reboot the servere node.

      [ceph: root@clienta /]# ssh admin@servere sudo reboot
      admin@servere's password: redhat
      Connection to servere closed by remote host.
    3. After the servere node reboots, exit maintenance mode.

      [ceph: root@clienta /]# ceph orch host maintenance exit servere.lab.example.com
      Ceph cluster 2ae6d05a-229a-11ec-925e-52540000fa0c on servere.lab.example.com has exited maintenance mode
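
      Note

      To confirm that the host has left maintenance and that its daemons are running again, you can recheck the host list and the daemons placed on servere with commands used earlier in this exercise:

        [ceph: root@clienta /]# ceph orch host ls
        [ceph: root@clienta /]# ceph orch ps servere.lab.example.com
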
  12. Return to the workstation machine as the student user.

    [ceph: root@clienta /]# exit
    [admin@clienta ~]$ exit
    [student@workstation ~]$

Finish

On the workstation machine, use the lab command to complete this exercise. This is important to ensure that resources from previous exercises do not impact upcoming exercises.

[student@workstation ~]$ lab finish cluster-maint

This concludes the guided exercise.
