Guided Exercise: Performing Cluster Administration and Monitoring

In this exercise, you will perform common administration operations on a Red Hat Ceph Storage cluster.

Outcomes

You should be able to administer and monitor the cluster, including starting and stopping specific services, analyzing placement groups, setting OSD primary affinity, verifying daemon versions, and querying cluster health and utilization.

As the student user on the workstation machine, use the lab command to prepare your system for this exercise.

[student@workstation ~]$ lab start cluster-admin

This command confirms that the hosts required for this exercise are accessible.

Procedure 11.1. Instructions

  1. Log in to clienta as the admin user and use sudo to run the cephadm shell.

    [student@workstation ~]$ ssh admin@clienta
    [admin@clienta ~]$ sudo cephadm shell
    [ceph: root@clienta /]#
  2. View the enabled MGR modules. Verify that the dashboard module is enabled.

    [ceph: root@clienta /]# ceph mgr module ls | more
    {
        "always_on_modules": [
            "balancer",
            "crash",
            "devicehealth",
            "orchestrator",
            "pg_autoscaler",
            "progress",
            "rbd_support",
            "status",
            "telemetry",
            "volumes"
        ],
        "enabled_modules": [
            "cephadm",
            "dashboard",
            "insights",
            "iostat",
            "prometheus",
            "restful"
        ],
        "disabled_modules": [
            {
                "name": "alerts",
                "can_run": true,
                "error_string": "",
                "module_options": {
    ...output omitted...
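    Only modules outside the always_on list can be toggled. As a sketch (the alerts module is just an example), the enable and disable subcommands follow the pattern in the comments below; the runnable part lists enabled module names from an abbreviated copy of the JSON output above:

```shell
# Toggling an MGR module (run inside the cephadm shell):
#   ceph mgr module enable alerts
#   ceph mgr module disable alerts
# Listing just the enabled module names from saved output:
echo '{"enabled_modules": ["cephadm", "dashboard", "prometheus"]}' |
  python3 -c 'import json, sys; print("\n".join(json.load(sys.stdin)["enabled_modules"]))'
```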
  3. Obtain the dashboard URL for the active MGR node.

    [ceph: root@clienta /]# ceph mgr services
    {
        "dashboard": "https://172.25.250.12:8443/",
        "prometheus": "http://172.25.250.12:9283/"
    }
  4. View the status of the Monitors on the Ceph Dashboard page.

    1. Using a web browser, navigate to the dashboard URL obtained in the previous step. Log in as the admin user with the redhat password.

    2. On the Dashboard page, click Monitors to view the status of the Monitor nodes and quorum.

  5. View the status of all OSDs in the cluster.

    [ceph: root@clienta /]# ceph osd stat
    9 osds: 9 up (since 38m), 9 in (since 38m); epoch: e294
  6. Find the location of the osd.2 daemon, stop the OSD, and view the cluster OSD status.

    1. Find the location of the osd.2 daemon.

      [ceph: root@clienta /]# ceph osd find 2
      {
          "osd": 2,
          "addrs": {
              "addrvec": [
                  {
                      "type": "v2",
                      "addr": "172.25.250.12:6808",
                      "nonce": 2361545815
                  },
                  {
                      "type": "v1",
                      "addr": "172.25.250.12:6809",
                      "nonce": 2361545815
                  }
              ]
          },
          "osd_fsid": "1163a19e-e580-40e0-918f-25fd94e97b86",
          "host": "serverc.lab.example.com",
          "crush_location": {
              "host": "serverc",
              "root": "default"
          }
      }
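      The host field in this JSON is what the next sub-step relies on. A sketch of extracting it with python3 (the JSON literal is an abbreviated copy of the output above):

```shell
# Pull the host that runs osd.2 out of saved `ceph osd find 2` output:
echo '{"osd": 2, "host": "serverc.lab.example.com"}' |
  python3 -c 'import json, sys; print(json.load(sys.stdin)["host"])'
```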
    2. Log in to the serverc node. Stop the osd.2 daemon.

      [ceph: root@clienta /]# ssh admin@serverc
      admin@serverc's password: redhat
      [admin@serverc ~]$ sudo systemctl list-units "ceph*"
      UNIT                                                        LOAD   ACTIVE SUB     DESCRIPTION
      ...output omitted...
      ceph-ff97a876-1fd2-11ec-8258-52540000fa0c@osd.0.service     loaded active running Ceph osd.0 for ff97a876-1fd2-11ec-8258-52540000fa0c
      ceph-ff97a876-1fd2-11ec-8258-52540000fa0c@osd.1.service     loaded active running Ceph osd.1 for ff97a876-1fd2-11ec-8258-52540000fa0c
      ceph-ff97a876-1fd2-11ec-8258-52540000fa0c@osd.2.service     loaded active running Ceph osd.2 for ff97a876-1fd2-11ec-8258-52540000fa0c
      ...output omitted...
      [admin@serverc ~]$ sudo systemctl stop \
      ceph-ff97a876-1fd2-11ec-8258-52540000fa0c@osd.2.service
    3. Exit the serverc node. View the cluster OSD status.

      [admin@serverc ~]$ exit
      [ceph: root@clienta /]# ceph osd stat
      9 osds: 8 up (since 24s), 9 in (since 45m); epoch: e296
  7. Start osd.2 on the serverc node, and then view the cluster OSD status.

    [ceph: root@clienta /]# ssh admin@serverc sudo systemctl start \
      ceph-ff97a876-1fd2-11ec-8258-52540000fa0c@osd.2.service
    admin@serverc's password: redhat
    [ceph: root@clienta /]# ceph osd stat
    9 osds: 9 up (since 6s), 9 in (since 47m); epoch: e298
  8. View the log files for the osd.2 daemon. Filter the output to view only systemd events.

    [ceph: root@clienta /]# ssh admin@serverc sudo journalctl \
      -u ceph-ff97a876-1fd2-11ec-8258-52540000fa0c@osd.2.service | grep systemd
    admin@serverc's password: redhat
    ...output omitted...
    Sep 30 01:57:36 serverc.lab.example.com systemd[1]: Stopping Ceph osd.2 for ff97a876-1fd2-11ec-8258-52540000fa0c...
    Sep 30 01:57:37 serverc.lab.example.com systemd[1]: ceph-ff97a876-1fd2-11ec-8258-52540000fa0c@osd.2.service: Succeeded.
    Sep 30 01:57:37 serverc.lab.example.com systemd[1]: Stopped Ceph osd.2 for ff97a876-1fd2-11ec-8258-52540000fa0c.
    Sep 30 02:00:12 serverc.lab.example.com systemd[1]: Starting Ceph osd.2 for ff97a876-1fd2-11ec-8258-52540000fa0c...
    Sep 30 02:00:13 serverc.lab.example.com systemd[1]: Started Ceph osd.2 for ff97a876-1fd2-11ec-8258-52540000fa0c.
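    The same grep filter works on any saved log snippet; on the host, journalctl also accepts standard options such as --since and -f (follow) to narrow or stream the output. Here the filter is applied to a two-line sample, keeping only the systemd event:

```shell
# Keep only the systemd-generated lines from a saved log sample:
grep systemd <<'EOF'
Sep 30 01:57:36 serverc.lab.example.com systemd[1]: Stopping Ceph osd.2...
Sep 30 01:57:37 serverc.lab.example.com bash[4077]: osd.2 shutdown complete
EOF
```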
  9. Mark the osd.4 daemon as being out of the cluster and observe how it affects the cluster status. Then, mark the osd.4 daemon as being in the cluster again.

    1. Mark the osd.4 daemon as being out of the cluster. Verify that the osd.4 daemon is marked out of the cluster and notice that the OSD's weight is now 0.

      [ceph: root@clienta /]# ceph osd out 4
      marked out osd.4.
      [ceph: root@clienta /]# ceph osd stat
      9 osds: 9 up (since 2m), 8 in (since 3s); epoch: e312
      [ceph: root@clienta /]# ceph osd tree
      ID  CLASS  WEIGHT   TYPE NAME         STATUS  REWEIGHT  PRI-AFF
      -1         0.08817  root default
      -3         0.02939      host serverc
       0    hdd  0.00980          osd.0         up   1.00000  1.00000
       1    hdd  0.00980          osd.1         up   1.00000  1.00000
       2    hdd  0.00980          osd.2         up   1.00000  1.00000
      -7         0.02939      host serverd
       3    hdd  0.00980          osd.3         up   1.00000  1.00000
       5    hdd  0.00980          osd.5         up   1.00000  1.00000
       7    hdd  0.00980          osd.7         up   1.00000  1.00000
      -5         0.02939      host servere
       4    hdd  0.00980          osd.4         up         0  1.00000
       6    hdd  0.00980          osd.6         up   1.00000  1.00000
       8    hdd  0.00980          osd.8         up   1.00000  1.00000

      Note

      Ceph re-creates, on other OSDs, the replicas of the objects that were stored on the osd.4 daemon. You can trace the recovery of these objects with the ceph status or ceph -w commands.

    2. Mark the osd.4 daemon as being in again.

      [ceph: root@clienta /]# ceph osd in 4
      marked in osd.4.

      Note

      You can mark an OSD out even while it is still running (up). The in or out status is independent of the OSD's up or down running state.
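      The Outcomes also mention setting OSD primary affinity, which this exercise does not otherwise demonstrate. As a sketch, the ceph osd primary-affinity command takes an OSD id and a value between 0 and 1 (the 0.5 below is an arbitrary example), and the new value then shows in the PRI-AFF column of ceph osd tree; the runnable part only parses a saved tree line:

```shell
# Lower the probability that osd.4 is selected as a PG's primary OSD:
#   ceph osd primary-affinity 4 0.5
# The change appears in the PRI-AFF column of `ceph osd tree`; here awk
# pulls NAME, STATUS, REWEIGHT, and PRI-AFF from a saved tree line:
echo ' 4    hdd  0.00980          osd.4         up   1.00000  0.50000' |
  awk '{print $4, $5, $6, $7}'
```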

  10. Analyze the current utilization and the number of PGs on the osd.2 daemon.

    [ceph: root@clienta /]# ceph osd df tree
    ID  CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA     OMAP    META     AVAIL   %USE  VAR   PGS  STATUS  TYPE NAME
    -1         0.08817         -  90 GiB  256 MiB   36 MiB  56 KiB  220 MiB  90 GiB  0.28  1.00    -          root default
    -3         0.02939         -  30 GiB   71 MiB   12 MiB  20 KiB   59 MiB  30 GiB  0.23  0.83    -              host serverc
     0    hdd  0.00980   1.00000  10 GiB   26 MiB  4.0 MiB  11 KiB   22 MiB  10 GiB  0.25  0.91   68      up          osd.0
     1    hdd  0.00980   1.00000  10 GiB   29 MiB  4.0 MiB   6 KiB   25 MiB  10 GiB  0.28  1.01   74      up          osd.1
     2    hdd  0.00980   1.00000  10 GiB   16 MiB  3.9 MiB   3 KiB   12 MiB  10 GiB  0.16  0.57   59      up          osd.2
    ...output omitted...
                           TOTAL  90 GiB  256 MiB   36 MiB  61 KiB  220 MiB  90 GiB  0.28
    MIN/MAX VAR: 0.57/1.48  STDDEV: 0.06
  11. View the placement group status for the cluster. Create a test pool and a test object. Find the placement group to which the test object belongs and analyze that placement group's status.

    1. View the placement group status for the cluster and examine the PG states. Your output might differ in your lab environment.

      [ceph: root@clienta /]# ceph pg stat
      201 pgs: 201 active+clean; 8.6 KiB data, 261 MiB used, 90 GiB / 90 GiB avail; 511 B/s rd, 0 op/s
    2. Create a pool called testpool and an object called testobject containing the /etc/ceph/ceph.conf file.

      [ceph: root@clienta /]# ceph osd pool create testpool 32 32
      pool 'testpool' created
      [ceph: root@clienta /]# rados -p testpool put testobject /etc/ceph/ceph.conf
    3. Find the placement group of the testobject object in the testpool pool and analyze its status. Use the placement group information from your lab environment in the query.

      [ceph: root@clienta /]# ceph osd map testpool testobject
      osdmap e332 pool 'testpool' (9) object 'testobject' -> pg 9.98824931 (9.11) -> up ([8,2,5], p8) acting ([8,2,5], p8)
      [ceph: root@clienta /]# ceph pg 9.11 query
      {
          "snap_trimq": "[]",
          "snap_trimq_len": 0,
          "state": "active+clean",
          "epoch": 334,
          "up": [
              8,
              2,
              5
          ],
          "acting": [
              8,
              2,
              5
          ],
          "acting_recovery_backfill": [
              "2",
              "5",
              "8"
          ],
          "info": {
              "pgid": "9.11",
      ...output omitted...
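      The pgid passed to ceph pg query is the parenthesized value after the raw hash in the ceph osd map output (9.11 above). A sketch of extracting it with sed from a saved copy of that line, so it can be fed into the query without retyping:

```shell
# Extract the PG id (the value in parentheses after "pg") from a saved
# `ceph osd map` line:
line="osdmap e332 pool 'testpool' (9) object 'testobject' -> pg 9.98824931 (9.11) -> up ([8,2,5], p8) acting ([8,2,5], p8)"
pgid=$(echo "$line" | sed -n 's/.*pg [0-9a-f.]* (\([0-9a-f.]*\)).*/\1/p')
echo "$pgid"
```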
  12. List the OSD and cluster daemon versions. These commands are useful after a cluster upgrade to verify that all daemons are running the expected versions.

    1. List all cluster daemon versions.

      [ceph: root@clienta /]# ceph versions
      {
          "mon": {
              "ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)": 4
          },
          "mgr": {
              "ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)": 4
          },
          "osd": {
              "ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)": 9
          },
          "mds": {
              "ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)": 3
          },
          "rgw": {
              "ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)": 2
          },
          "overall": {
              "ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)": 22
          }
      }
    2. List all OSD versions.

      [ceph: root@clienta /]# ceph tell osd.* version
      osd.0: {
          "version": "16.2.0-117.el8cp",
          "release": "pacific",
          "release_type": "stable"
      }
      osd.1: {
          "version": "16.2.0-117.el8cp",
          "release": "pacific",
          "release_type": "stable"
      }
      osd.2: {
          "version": "16.2.0-117.el8cp",
          "release": "pacific",
          "release_type": "stable"
      }
      ...output omitted...
  13. View the balancer status.

    [ceph: root@clienta /]# ceph balancer status
    {
        "active": true,
        "last_optimize_duration": "0:00:00.001072",
        "last_optimize_started": "Thu Sep 30 06:07:53 2021",
        "mode": "upmap",
        "optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect",
        "plans": []
    }
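    The balancer can be switched off or moved between modes with the ceph balancer subcommands sketched in the comments below; the runnable part reads the active flag and mode from an abbreviated copy of the status output above:

```shell
# Toggling the balancer and changing its mode:
#   ceph balancer off
#   ceph balancer mode crush-compat
#   ceph balancer on
# Reading the active flag and mode from saved status JSON:
echo '{"active": true, "mode": "upmap"}' |
  python3 -c 'import json, sys; d = json.load(sys.stdin); print(d["mode"] if d["active"] else "inactive")'
```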
  14. Return to workstation as the student user.

    [ceph: root@clienta /]# exit
    [admin@clienta ~]$ exit
    [student@workstation ~]$

Finish

On the workstation machine, use the lab command to complete this exercise. This is important to ensure that resources from previous exercises do not impact upcoming exercises.

[student@workstation ~]$ lab finish cluster-admin

This concludes the guided exercise.

Revision: cl260-5.0-29d2128