Guided Exercise: Performing Cluster Maintenance Operations

In this exercise, you will perform maintenance activities on an operational Red Hat Ceph Storage cluster.

Outcomes

You should be able to add, replace, and remove components in an operational Red Hat Ceph Storage cluster.

As the student user on the workstation machine, use the lab command to prepare your system for this exercise.

[student@workstation ~]$ lab start cluster-maint

This command confirms that the hosts required for this exercise are accessible and stops the osd.3 daemon to simulate an OSD failure.

Procedure 11.2. Instructions

  1. Log in to clienta as the admin user and use sudo to run the cephadm shell.

    [student@workstation ~]$ ssh admin@clienta
    [admin@clienta ~]$ sudo cephadm shell
    [ceph: root@clienta /]#
  2. Set the noscrub and nodeep-scrub flags to temporarily prevent the cluster from starting scrubbing operations.

    [ceph: root@clienta /]# ceph osd set noscrub
    noscrub is set
    [ceph: root@clienta /]# ceph osd set nodeep-scrub
    nodeep-scrub is set
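
    Note

    If you want to confirm which cluster-wide flags are currently set, one optional check (not part of this exercise; the full flag list in the output varies by cluster) is to filter the OSD map summary:

      [ceph: root@clienta /]# ceph osd dump | grep ^flags

    The flags line should now include noscrub and nodeep-scrub.
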
  3. Verify the Ceph cluster status. After a short time, the cluster transitions to the HEALTH_WARN status because the osd.3 daemon is down.

    [ceph: root@clienta /]# ceph health detail
    HEALTH_WARN noscrub,nodeep-scrub flag(s) set; 1 osds down; Degraded data redundancy: 82/663 objects degraded (12.368%), 14 pgs degraded
    [WRN] OSDMAP_FLAGS: noscrub,nodeep-scrub flag(s) set
    [WRN] OSD_DOWN: 1 osds down
        osd.3 (root=default,host=serverd) is down
    [WRN] PG_DEGRADED: Degraded data redundancy: 82/663 objects degraded (12.368%), 14 pgs degraded
        pg 2.f is active+undersized+degraded, acting [8,0]
        pg 2.19 is active+undersized+degraded, acting [0,8]
        pg 3.0 is active+undersized+degraded, acting [8,1]
    ...output omitted...
  4. Identify the failed OSD device for replacement.

    1. Identify which OSD is down.

      [ceph: root@clienta /]# ceph osd tree | grep -i down
       3   hdd 0.00980         osd.3      down  1.00000  1.00000
    2. Identify which host the OSD is on.

      [ceph: root@clienta /]# ceph osd find osd.3
      {
          "osd": 3,
      ...output omitted...
          "host": "serverd.lab.example.com",
          "crush_location": {
              "host": "serverd",
              "root": "default"
          }
      }
    3. Log in to the serverd node and use sudo to run the cephadm shell. Identify the device name for the failed OSD.

      [ceph: root@clienta /]# ssh admin@serverd
      admin@serverd's password: redhat
      [admin@serverd ~]$ sudo cephadm shell
      [ceph: root@serverd /]# ceph-volume lvm list
      
      ====== osd.3 =======
      ...output omitted...
            devices                   /dev/vdb
      ...output omitted...

      Note

      You can also identify the device name of an OSD by using the ceph osd metadata OSD_ID command from the admin node.
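
      For example, a minimal check (assuming osd.3, and that you only want the host and device fields from the JSON output) is to filter the metadata:

        [ceph: root@clienta /]# ceph osd metadata 3 | grep -E '"hostname"|"devices"'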

  5. Exit the cephadm shell. Identify the systemd unit name of the osd.3 daemon on the serverd node. The unit name includes the cluster FSID, so it is different in your lab environment.

    [ceph: root@serverd /]# exit
    exit
    [admin@serverd ~]$ sudo systemctl list-units --all "ceph*"
    UNIT                                                            LOAD   ACTIVE SUB     DESCRIPTION
    ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@crash.serverd.service loaded active running Ceph crash.serverd for 2a>
    ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@mgr.serverd.klrkci.service loaded active running Ceph mgr.serverd.klr>
    ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@mon.serverd.service   loaded active running Ceph mon.serverd for 2ae6>
    ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@node-exporter.serverd.service loaded active running Ceph node-exporte>
    ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@osd.3.service         loaded inactive dead Ceph osd.3 for 2ae6d05a-2>
    ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@osd.5.service         loaded active running Ceph osd.5 for 2ae6d05a-2>
    ...output omitted...
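
    Note

    If you already know the unit name, a more direct check (shown here with the FSID from the example output above; yours is different) is to query that unit alone:

      [admin@serverd ~]$ sudo systemctl status \
      ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@osd.3.service

    The unit should report inactive (dead) while the OSD is down.
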
  6. Check the recent log entries for the osd.3 service.

    [admin@serverd ~]$ sudo journalctl -ru \
    ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@osd.3.service
    ...output omitted...
    Oct 06 22:25:49 serverd.lab.example.com systemd[1]: Stopped Ceph osd.3 for 2ae6d05a-229a-11ec-925e-52540000fa0c.
    ...output omitted...
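
    Note

    To watch the daemon's log in real time while troubleshooting, you can follow the unit instead of reading it in reverse order (again using the FSID from this example; adjust it for your environment):

      [admin@serverd ~]$ sudo journalctl -fu \
      ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@osd.3.service
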
  7. On the serverd node, start the osd.3 service. On the admin node, verify that the OSD has started.

    [admin@serverd ~]$ sudo systemctl start \
    ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@osd.3.service
    [admin@serverd ~]$ exit
    logout
    Connection to serverd closed.
    [ceph: root@clienta /]# ceph osd tree
    ID  CLASS  WEIGHT   TYPE NAME         STATUS     REWEIGHT  PRI-AFF
    -1         0.09796  root default
    -3         0.03918      host serverc
     0    hdd  0.00980          osd.0            up   1.00000  1.00000
     1    hdd  0.00980          osd.1            up   1.00000  1.00000
     2    hdd  0.00980          osd.2            up   1.00000  1.00000
    -5         0.02939      host serverd
     3    hdd  0.00980          osd.3            up   1.00000  1.00000
     5    hdd  0.00980          osd.5            up   1.00000  1.00000
     7    hdd  0.00980          osd.7            up   1.00000  1.00000
    -7         0.02939      host servere
     4    hdd  0.00980          osd.4            up   1.00000  1.00000
     6    hdd  0.00980          osd.6            up   1.00000  1.00000
     8    hdd  0.00980          osd.8            up   1.00000  1.00000
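
    Note

    For a one-line confirmation that all OSDs are up and in, you can print the OSD map summary instead of reading the full tree (the counts and epoch in the output depend on your cluster):

      [ceph: root@clienta /]# ceph osd stat
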
  8. Clear the noscrub and nodeep-scrub flags. Verify that the cluster health status returns to HEALTH_OK. Press Ctrl+C to exit the ceph -w command.

    [ceph: root@clienta /]# ceph osd unset noscrub
    noscrub is unset
    [ceph: root@clienta /]# ceph osd unset nodeep-scrub
    nodeep-scrub is unset
    [ceph: root@clienta /]# ceph -w
    ...output omitted...
    health: HEALTH_OK
    ...output omitted...
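
    Note

    The ceph -w command blocks and streams cluster events until you interrupt it. If you only need the current status once, ceph health returns immediately:

      [ceph: root@clienta /]# ceph health
      HEALTH_OK
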
  9. Adjust the number of MONs and their placement in the cluster.

    1. View the number of running MONs and their placement.

      [ceph: root@clienta /]# ceph orch ls --service_type=mon
      NAME  RUNNING  REFRESHED  AGE  PLACEMENT
      mon       4/4  6m ago     5d   clienta.lab.example.com;serverc.lab.example.com;serverd.lab.example.com;servere.lab.example.com
    2. Add the serverg node to the cluster.

      [ceph: root@clienta /]# ceph cephadm get-pub-key > ~/ceph.pub
      [ceph: root@clienta /]# ssh-copy-id -f -i ~/ceph.pub root@serverg
      root@serverg's password: redhat
      ...output omitted...
      [ceph: root@clienta /]# ceph orch host add serverg.lab.example.com
      Added host 'serverg.lab.example.com' with addr '172.25.250.16'
    3. Add a MON and place it on the serverg node.

      [ceph: root@clienta /]# ceph orch apply mon \
      --placement="clienta.lab.example.com serverc.lab.example.com \
      serverd.lab.example.com servere.lab.example.com \
      serverg.lab.example.com"
      Scheduled mon update...
    4. Verify that the MONs are active and correctly placed.

      [ceph: root@clienta /]# ceph orch ls --service-type=mon
      NAME  RUNNING  REFRESHED  AGE  PLACEMENT
      mon       5/5  -          58s  clienta.lab.example.com;serverc.lab.example.com;serverd.lab.example.com;servere.lab.example.com;serverg.lab.example.com
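
      Note

      Placement can also be expressed as a count instead of an explicit host list, in which case the orchestrator chooses the hosts. For example, the following command (not used in this exercise) would request five MON daemons on hosts of the orchestrator's choosing:

        [ceph: root@clienta /]# ceph orch apply mon 5
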
  10. Remove the MON service from the serverg node, remove its OSDs, and then remove the serverg node from the cluster. Verify that the serverg node is removed.

    1. Remove the MON daemon from the serverg node by reapplying the MON placement without that host.

      [ceph: root@clienta /]# ceph orch apply mon \
      --placement="clienta.lab.example.com serverc.lab.example.com \
      serverd.lab.example.com servere.lab.example.com"
      Scheduled mon update...
      [ceph: root@clienta /]# ceph mon stat
      e6: 4 mons at {clienta=[v2:172.25.250.10:3300/0,v1:172.25.250.10:6789/0], serverc.lab.example.com=[v2:172.25.250.12:3300/0,v1:172.25.250.12:6789/0], serverd=[v2:172.25.250.13:3300/0,v1:172.25.250.13:6789/0], servere=[v2:172.25.250.14:3300/0,v1:172.25.250.14:6789/0]}, election epoch 46, leader 0 serverc.lab.example.com, quorum 0,1,2,3 serverc.lab.example.com,clienta,serverd,servere

      Important

      Always keep at least three MONs running in a production cluster.

    2. Remove the serverg node's OSDs.

      [ceph: root@clienta /]# ceph orch ps serverg.lab.example.com
      NAME                   HOST                     STATUS        REFRESHED  AGE  PORTS   VERSION           IMAGE ID      CONTAINER ID
      crash.serverg          serverg.lab.example.com  running (3m)  35s ago    3m   -       16.2.0-117.el8cp  2142b60d7974  db0eb4d442b2
      node-exporter.serverg  serverg.lab.example.com  running (3m)  35s ago    3m   *:9100  0.18.1            68b1be7484d4  982fc365dc88
      osd.10                 serverg.lab.example.com  running (2m)  35s ago    2m   -       16.2.0-117.el8cp  2142b60d7974  c503c770f6ef
      osd.11                 serverg.lab.example.com  running (2m)  35s ago    2m   -       16.2.0-117.el8cp  2142b60d7974  3e4f85ad8384
      osd.9                  serverg.lab.example.com  running (2m)  35s ago    2m   -       16.2.0-117.el8cp  2142b60d7974  ab9563910c19
      [ceph: root@clienta /]# ceph osd stop 9 10 11
      stop down osd.9. stop down osd.10. stop down osd.11.
      [ceph: root@clienta /]# ceph osd out 9 10 11
      marked out osd.9. marked out osd.10. marked out osd.11.
      [ceph: root@clienta /]# ceph osd crush remove osd.9
      removed item id 9 name 'osd.9' from crush map
      [ceph: root@clienta /]# ceph osd crush remove osd.10
      removed item id 10 name 'osd.10' from crush map
      [ceph: root@clienta /]# ceph osd crush remove osd.11
      removed item id 11 name 'osd.11' from crush map
      [ceph: root@clienta /]# ceph osd rm 9 10 11
      removed osd.9, osd.10, osd.11
    3. Remove the serverg node from the cluster. Verify that the serverg node has been removed.

      [ceph: root@clienta /]# ceph orch host rm serverg.lab.example.com
      Removed host 'serverg.lab.example.com'
      [ceph: root@clienta /]# ceph orch host ls
      HOST                     ADDR           LABELS  STATUS
      clienta.lab.example.com  172.25.250.10  _admin
      serverc.lab.example.com  172.25.250.12
      serverd.lab.example.com  172.25.250.13
      servere.lab.example.com  172.25.250.14
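
      Note

      In a cephadm-managed cluster, the orchestrator can also drain and remove OSDs for you. As a sketch of that alternative workflow (not used in this exercise), you would schedule the removal and then monitor its progress:

        [ceph: root@clienta /]# ceph orch osd rm 9 10 11
        [ceph: root@clienta /]# ceph orch osd rm status
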
  11. You receive an alert that there is an issue on the servere node. Put the servere node into maintenance mode, reboot the host, and then exit maintenance mode.

    1. Put the servere node into maintenance mode, and then verify that it has a maintenance status.

      [ceph: root@clienta /]# ceph orch host maintenance enter servere.lab.example.com
      Ceph cluster 2ae6d05a-229a-11ec-925e-52540000fa0c on servere.lab.example.com moved to maintenance
      [ceph: root@clienta /]# ceph orch host ls
      HOST                     ADDR           LABELS  STATUS
      clienta.lab.example.com  172.25.250.10  _admin
      serverc.lab.example.com  172.25.250.12
      serverd.lab.example.com  172.25.250.13
      servere.lab.example.com  172.25.250.14          Maintenance
    2. Reboot the servere node.

      [ceph: root@clienta /]# ssh admin@servere sudo reboot
      admin@servere's password: redhat
      Connection to servere closed by remote host.
    3. After the servere node reboots, exit maintenance mode.

      [ceph: root@clienta /]# ceph orch host maintenance exit servere.lab.example.com
      Ceph cluster 2ae6d05a-229a-11ec-925e-52540000fa0c on servere.lab.example.com has exited maintenance mode
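
      Note

      To confirm that the host has left maintenance and that its daemons are running again, you can recheck the host list and the daemons placed on servere with commands used earlier in this exercise:

        [ceph: root@clienta /]# ceph orch host ls
        [ceph: root@clienta /]# ceph orch ps servere.lab.example.com
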
  12. Return to the workstation machine as the student user.

    [ceph: root@clienta /]# exit
    [admin@clienta ~]$ exit
    [student@workstation ~]$

Finish

On the workstation machine, use the lab command to complete this exercise. This is important to ensure that resources from previous exercises do not impact upcoming exercises.

[student@workstation ~]$ lab finish cluster-maint

This concludes the guided exercise.
