Lab: Managing a Red Hat Ceph Storage Cluster

In this lab, you will perform common administration and maintenance operations on a Red Hat Ceph Storage cluster.

Outcomes

You should be able to locate the Ceph Dashboard URL, set an OSD out and in, watch cluster events, find and start a down OSD, find an object's PG location and state, and view the balancer status.

As the student user on the workstation machine, use the lab command to prepare your system for this lab.

[student@workstation ~]$ lab start cluster-review

This command confirms that the hosts required for this exercise are accessible.

Procedure 11.3. Instructions

  1. Log in to clienta as the admin user. Verify that the dashboard module is enabled. Find the dashboard URL of the active MGR.

    1. Log in to clienta as the admin user and use sudo to run the cephadm shell.

      [student@workstation ~]$ ssh admin@clienta
      [admin@clienta ~]$ sudo cephadm shell
      [ceph: root@clienta /]#
    2. Verify that the dashboard module is enabled.

      [ceph: root@clienta /]# ceph mgr module ls | more
      {
      ...output omitted...
          "enabled_modules": [
              "cephadm",
              "dashboard",
              "iostat",
              "prometheus",
              "restful"
          ],
      ...output omitted...
    3. Find the dashboard URL of the active MGR.

      [ceph: root@clienta /]# ceph mgr services
      {
          "dashboard": "https://172.25.250.12:8443/",
          "prometheus": "http://172.25.250.12:9283/"
      }

      Note

      Your output might be different depending on which MGR node is active in your lab environment.
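
      Note

      The dashboard module is already enabled in this lab. If "dashboard" were missing from the "enabled_modules" list in your environment, you could enable it with the standard manager module command before continuing. This is an optional, hedged reference and is shown without output:

      [ceph: root@clienta /]# ceph mgr module enable dashboard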

  2. You receive an alert that an OSD is down. Identify which OSD is down and on which node it runs, then start the OSD.

    1. Verify cluster health.

      [ceph: root@clienta /]# ceph health detail
      HEALTH_WARN 1 osds down; Degraded data redundancy: 72/666 objects degraded (10.811%), 14 pgs degraded, 50 pgs undersized
      [WRN] OSD_DOWN: 1 osds down
          osd.6 (root=default,host=servere) is down
      [WRN] PG_DEGRADED: Degraded data redundancy: 72/666 objects degraded (10.811%), 14 pgs degraded, 50 pgs undersized
          pg 2.0 is stuck undersized for 61s, current state active+undersized, last acting [3,0]
          pg 2.1 is stuck undersized for 61s, current state active+undersized, last acting [2,3]
          pg 2.6 is stuck undersized for 61s, current state active+undersized, last acting [1,3]
          pg 2.7 is stuck undersized for 61s, current state active+undersized, last acting [3,2]
      ...output omitted...
    2. Identify which OSD is down.

      [ceph: root@clienta /]# ceph osd tree | grep -i down
       6    hdd  0.00980          osd.6       down   1.00000  1.00000
    3. Identify on which host the down OSD runs.

      [ceph: root@clienta /]# ceph osd find osd.6 | grep host
          "host": "servere.lab.example.com",
              "host": "servere",
    4. Start the OSD.

      [ceph: root@clienta /]# ceph orch daemon start osd.6
      Scheduled to start osd.6 on host 'servere.lab.example.com'
    5. Verify that the OSD is up.

      [ceph: root@clienta /]# ceph osd tree | grep osd.6
       6    hdd  0.00980          osd.6         up   1.00000  1.00000
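
      Note

      As an optional check, you can confirm that the orchestrator also reports the daemon as running. The ceph orch ps command lists the daemons that cephadm manages; filtering for osd.6 is shown here as a hedged example without output:

      [ceph: root@clienta /]# ceph orch ps | grep osd.6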
  3. Set the OSD 5 daemon to the out state and verify that all data has been migrated off the OSD.

    1. Set the OSD 5 daemon to the out state.

      [ceph: root@clienta /]# ceph osd out 5
      marked out osd.5.
    2. Verify that all PGs have been migrated off the OSD 5 daemon. The data migration takes some time to finish. Press Ctrl+C to exit the command.

      [ceph: root@clienta /]# ceph -w
        cluster:
          id:     2ae6d05a-229a-11ec-925e-52540000fa0c
          health: HEALTH_WARN
                  Reduced data availability: 5 pgs peering
                  Degraded data redundancy: 1/663 objects degraded (0.151%), 1 pg degraded
      
        services:
          mon: 4 daemons, quorum serverc.lab.example.com,clienta,serverd,servere (age 9h)
          mgr: serverc.lab.example.com.aiqepd(active, since 9h), standbys: serverd.klrkci, servere.kjwyko, clienta.nncugs
          osd: 9 osds: 9 up (since 46s), 8 in (since 7s); 4 remapped pgs
          rgw: 2 daemons active (2 hosts, 1 zones)
      
        data:
          pools:   5 pools, 105 pgs
          objects: 221 objects, 4.9 KiB
          usage:   235 MiB used, 80 GiB / 80 GiB avail
          pgs:     12.381% pgs not active
                   1/663 objects degraded (0.151%)
                   92 active+clean
                   10 remapped+peering
                   2  activating
                   1  activating+degraded
      
        io:
          recovery: 199 B/s, 0 objects/s
      
        progress:
          Global Recovery Event (2s)
            [............................]
      
      2021-03-28 21:23:25.557849 mon.serverc [WRN] Health check failed: Reduced data availability: 1 pg inactive, 1 pg peering (PG_AVAILABILITY)
      2021-03-28 21:23:25.557884 mon.serverc [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 36/2163 objects degraded (1.664%), 5 pgs degraded)
      2021-03-28 21:23:31.741476 mon.serverc [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg inactive, 1 pg peering)
      2021-03-28 21:23:31.741495 mon.serverc [INF] Cluster is now healthy
      ...output omitted...
      
      [ceph: root@clienta /]# ceph osd df
      ID  CLASS  ...output omitted...    AVAIL   %USE  VAR   PGS  STATUS
       0    hdd  ...output omitted...    10 GiB  0.38  1.29   34      up
       1    hdd  ...output omitted...    10 GiB  0.33  1.13   42      up
       2    hdd  ...output omitted...    10 GiB  0.30  1.02   29      up
       3    hdd  ...output omitted...    10 GiB  0.28  0.97   58      up
       5    hdd  ...output omitted...     0 B     0     0    0      up
       7    hdd  ...output omitted...    10 GiB  0.29  0.99   47      up
       4    hdd  ...output omitted...    10 GiB  0.33  1.13   34      up
       6    hdd  ...output omitted...    10 GiB  0.10  0.36   39      up
       8    hdd  ...output omitted...    10 GiB  0.32  1.12   32      up
          TOTAL  ...output omitted...    80 GiB  0.29
      
      MIN/MAX VAR: 0.36/1.29  STDDEV: 0.08
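
      Note

      Marking an OSD out removes it from data placement while the daemon stays running, which is why osd.5 still shows up in the STATUS column with 0 PGs. As an optional, hedged check before taking an OSD out of service in a production cluster, Ceph can report whether stopping the daemon would leave any PGs without enough replicas, if the command is available in your release:

      [ceph: root@clienta /]# ceph osd ok-to-stop osd.5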
  4. Set the OSD 5 daemon to the in state and verify that PGs have been placed onto it.

    1. Set the OSD 5 daemon to the in state.

      [ceph: root@clienta /]# ceph osd in 5
      marked in osd.5.
    2. Verify that PGs have been placed onto the OSD 5 daemon.

      [ceph: root@clienta /]# ceph osd df
      ID  CLASS  ...output omitted...    AVAIL   %USE  VAR   PGS  STATUS
       0    hdd  ...output omitted...    10 GiB  0.23  0.76   34      up
       1    hdd  ...output omitted...    10 GiB  0.37  1.26   42      up
       2    hdd  ...output omitted...    10 GiB  0.34  1.15   29      up
       3    hdd  ...output omitted...    10 GiB  0.29  0.99   39      up
       5    hdd  ...output omitted...    10 GiB  0.37  1.24   31      up
       7    hdd  ...output omitted...    10 GiB  0.30  1.00   35      up
       4    hdd  ...output omitted...    10 GiB  0.33  1.12   34      up
       6    hdd  ...output omitted...    10 GiB  0.11  0.37   39      up
       8    hdd  ...output omitted...    10 GiB  0.33  1.11   32      up
          TOTAL  ...output omitted...    90 GiB  0.30
      
      MIN/MAX VAR: 0.37/1.26  STDDEV: 0.08
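
      Note

      Backfill onto osd.5 can take a short time, so the PGS count for OSD 5 might keep changing for a while. As an optional check, re-run ceph osd df, or view the cluster status until it reports HEALTH_OK, to confirm that placement has settled:

      [ceph: root@clienta /]# ceph -s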
  5. Display the balancer status.

    [ceph: root@clienta /]# ceph balancer status
    {
        "active": true,
        "last_optimize_duration": "0:00:00.000647",
        "last_optimize_started": "Thu Oct 14 01:38:13 2021",
        "mode": "upmap",
        "optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect",
        "plans": []
    }
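
    Note

    The output shows that the balancer is active and running in upmap mode. If you ever need to change this, the ceph balancer CLI also provides mode and on/off subcommands. They are listed here without output as a hedged reference and are not needed in this lab:

    [ceph: root@clienta /]# ceph balancer mode upmap
    [ceph: root@clienta /]# ceph balancer off
    [ceph: root@clienta /]# ceph balancer on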
  6. Identify the PG for object data1 in the pool1 pool. Query the PG and find its state.

    1. Identify the PG for object data1 in the pool1 pool.

      [ceph: root@clienta /]# ceph osd map pool1 data1
      osdmap e218 pool 'pool1' (6) object 'data1' -> pg 6.d4f4553c (6.1c) -> up ([8,2,3], p8) acting ([8,2,3], p8)

      Note

      In this example, the PG is 6.1c. Use the PG value in the output displayed in your lab environment.
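
      The prefix of the PG ID is the pool ID, so 6.1c belongs to pool ID 6, which the ceph osd map output also shows as pool 'pool1' (6). If you want to confirm pool IDs independently, ceph osd lspools lists each pool with its ID; shown here without output as an optional check:

      [ceph: root@clienta /]# ceph osd lspools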

    2. Query the PG and view its state and primary OSD.

      [ceph: root@clienta /]# ceph pg 6.1c query
      {
          "snap_trimq": "[]",
          "snap_trimq_len": 0,
          "state": "active+clean",
          "epoch": 218,
          "up": [
              8,
              2,
              3
          ],
          "acting": [
              8,
              2,
              3
          ],
          "acting_recovery_backfill": [
              "2",
              "3",
              "8"
          ],
          "info": {
              "pgid": "6.1c",
      
      ...output omitted...
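
      Note

      The full query output is detailed. For a quicker view, summary commands are also available: ceph pg stat prints a one-line summary of cluster-wide PG states, and ceph pg ls-by-pool lists the PGs of a single pool with their state and acting set. Both are shown here without output as a hedged reference:

      [ceph: root@clienta /]# ceph pg stat
      [ceph: root@clienta /]# ceph pg ls-by-pool pool1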
  7. Return to workstation as the student user.

    [ceph: root@clienta /]# exit
    [admin@clienta ~]$ exit
    [student@workstation ~]$

Evaluation

Grade your work by running the lab grade cluster-review command from your workstation machine. Correct any reported failures and rerun the script until successful.

[student@workstation ~]$ lab grade cluster-review

Finish

On the workstation machine, use the lab command to complete this exercise. This is important to ensure that resources from previous exercises do not impact upcoming exercises.

[student@workstation ~]$ lab finish cluster-review

This concludes the lab.
