Guided Exercise: Troubleshooting Clusters and Clients

In this exercise, you will configure tuning parameters and diagnose common problems for various Red Hat Ceph Storage services.

Outcomes

You should be able to identify the health check code reported for each failing Ceph component and resolve the underlying issues.

As the student user on the workstation machine, use the lab command to prepare your system for this exercise.

[student@workstation ~]$ lab start tuning-troubleshoot

Procedure 12.3. Instructions

  1. Log in to clienta as the admin user and use sudo to run the cephadm shell. Verify the health of the Ceph storage cluster.

    Two separate issues need troubleshooting. The first is a clock skew error, and the second is a down OSD that is causing degraded placement groups (PGs).

    [student@workstation ~]$ ssh admin@clienta
    [admin@clienta ~]$ sudo cephadm shell
    [ceph: root@clienta /]# ceph health detail
    HEALTH_WARN clock skew detected on mon.serverd; 1 osds down; Degraded data redundancy: 63/567 objects degraded (11.111%), 14 pgs degraded, 10 pgs undersized; 278 slow ops, oldest one blocked for 170 sec, mon.serverd has slow ops
    [WRN] MON_CLOCK_SKEW: clock skew detected on mon.serverd
        mon.serverd clock skew 299.103s > max 0.05s (latency 0.0204872s)
    [WRN] OSD_DOWN: 1 osds down
        osd.0 (root=default,host=serverc) is down
    [WRN] PG_DEGRADED: Degraded data redundancy: 63/567 objects degraded (11.111%), 14 pgs degraded, 10 pgs undersized
        pg 2.5 is stuck undersized for 2m, current state active+undersized, last acting [8,7]
        pg 2.c is stuck undersized for 2m, current state active+undersized, last acting [6,5]
    ...output omitted...

    Note

    The lab uses chronyd for time synchronization with the classroom server.
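    The monitors also report their own measurement of time synchronization. As an optional check from the cephadm shell, the ceph time-sync-status command shows the skew and latency that the monitors measure for each peer:

    [ceph: root@clienta /]# ceph time-sync-status
    ...output omitted...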

  2. First, troubleshoot the clock skew issue.

    Log in to serverd as the admin user. The previous health detail output showed that the clock on the serverd system differs by about 300 seconds from the rest of the cluster. Viewing the chronyd service status on the serverd system should identify the problem.

    1. Exit the cephadm shell. On the serverd system, view the chronyd service status.

      The chronyd service is inactive on the serverd system.

      [ceph: root@clienta /]# exit
      [admin@clienta ~]$ ssh admin@serverd
      admin@serverd's password: redhat
      [admin@serverd ~]$ systemctl status chronyd
      ● chronyd.service - NTP client/server
         Loaded: loaded (/usr/lib/systemd/system/chronyd.service; enabled; vendor preset: enabled)
         Active: inactive (dead) since Wed 2021-10-20 08:49:21 EDT; 13min ago
           Docs: man:chronyd(8)
                 man:chrony.conf(5)
       Main PID: 876 (code=exited, status=0/SUCCESS)
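
      Because chronyd is stopped, nothing is correcting the system clock on serverd. As an optional check, the timedatectl command reports the local time and whether the system clock is currently NTP-synchronized (the exact output depends on your environment):

      [admin@serverd ~]$ timedatectl
      ...output omitted...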
    2. Start the chronyd service.

      [admin@serverd ~]$ sudo systemctl start chronyd
    3. Verify that the chronyd service is active.

      [admin@serverd ~]$ systemctl status chronyd
      ● chronyd.service - NTP client/server
         Loaded: loaded (/usr/lib/systemd/system/chronyd.service; enabled; vendor preset: enabled)
         Active: active (running) since Wed 2021-10-20 09:04:01 EDT; 3min 10s left
           Docs: man:chronyd(8)
                 man:chrony.conf(5)
        Process: 15221 ExecStartPost=/usr/libexec/chrony-helper update-daemon (code=exited, status=0/SUCCESS)
        Process: 15218 ExecStart=/usr/sbin/chronyd $OPTIONS (code=exited, status=0/SUCCESS)
       Main PID: 15220 (chronyd)
          Tasks: 1 (limit: 36236)
         Memory: 360.0K
         CGroup: /system.slice/chronyd.service
                 └─15220 /usr/sbin/chronyd
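
      Depending on the chrony configuration, a large offset might be corrected immediately or slewed gradually. If the skew warning persists, an optional extra step, not required by this exercise, is to force an immediate correction and then inspect the remaining offset with chronyc:

      [admin@serverd ~]$ sudo chronyc makestep
      200 OK
      [admin@serverd ~]$ chronyc tracking
      ...output omitted...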
    4. Return to the clienta system and use sudo to run the cephadm shell.

      [admin@serverd ~]$ exit
      Connection to serverd closed.
      [admin@clienta ~]$ sudo cephadm shell
      [ceph: root@clienta /]#
    5. Verify the health of the storage cluster.

      The clock skew message might still be displayed if the monitors have not yet observed the corrected time. Allow the cluster a few minutes for the services to pick up the corrected time. Continue with these exercise steps, but verify that the skew issue is resolved before finishing the exercise.

      [ceph: root@clienta /]# ceph health detail
      HEALTH_WARN Degraded data redundancy: 40/567 objects degraded (7.055%), 8 pgs degraded, 20 pgs undersized; 1099 slow ops, oldest one blocked for 580 sec, mon.serverd has slow ops
      [WRN] PG_DEGRADED: Degraded data redundancy: 40/567 objects degraded (7.055%), 8 pgs degraded, 20 pgs undersized
          pg 2.0 is stuck undersized for 8m, current state active+undersized, last acting [3,6]
          pg 2.8 is stuck undersized for 8m, current state active+undersized, last acting [3,8]
      ...output omitted...

      Note

      The health detail output might show the cluster state as HEALTH_OK. When an OSD is down, the cluster migrates its PGs to other OSDs to return the cluster to a healthy state. However, the down OSD still requires troubleshooting.

  3. Locate the down OSD.

    The osd.0 service on the serverc system is reported as down.

    [ceph: root@clienta /]# ceph osd tree
    ID  CLASS  WEIGHT   TYPE NAME         STATUS  REWEIGHT  PRI-AFF
    -1         0.08817  root default
    -3         0.02939      host serverc
     0    hdd  0.00980          osd.0       down         0  1.00000
     1    hdd  0.00980          osd.1         up   1.00000  1.00000
     2    hdd  0.00980          osd.2         up   1.00000  1.00000
    -5         0.02939      host serverd
     3    hdd  0.00980          osd.3         up   1.00000  1.00000
     5    hdd  0.00980          osd.5         up   1.00000  1.00000
     7    hdd  0.00980          osd.7         up   1.00000  1.00000
    -7         0.02939      host servere
     4    hdd  0.00980          osd.4         up   1.00000  1.00000
     6    hdd  0.00980          osd.6         up   1.00000  1.00000
     8    hdd  0.00980          osd.8         up   1.00000  1.00000
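
    To narrow down the problem, the osd tree listing can optionally be filtered to show only down OSDs, and the ceph osd find command reports the host where a given OSD runs. These commands confirm that osd.0 is hosted on serverc:

    [ceph: root@clienta /]# ceph osd tree down
    ...output omitted...
    [ceph: root@clienta /]# ceph osd find 0
    ...output omitted...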
    1. Exit the cephadm shell. On the serverc system, list the Ceph service units.

      [ceph: root@clienta /]# exit
      exit
      [admin@clienta ~]$ ssh admin@serverc
      admin@serverc's password: redhat
      [admin@serverc ~]$ systemctl list-units --all 'ceph*'
      UNIT                                                                                 LOAD   ACTIVE   SUB     DESCRIPTION
      ...output omitted...
      ceph-2ae6...fa0c@node-exporter.serverc.service              loaded active   running Ceph node-exporter.serverc for 2ae6...fa0c
      ceph-2ae6...fa0c@osd.0.service                              loaded inactive dead    Ceph osd.0 for 2ae6...fa0c
      ceph-2ae6...fa0c@osd.1.service                              loaded active   running Ceph osd.1 for 2ae6...fa0c
      ceph-2ae6...fa0c@osd.2.service                              loaded active   running Ceph osd.2 for 2ae6...fa0c
      ceph-2ae6...fa0c@prometheus.serverc.service                 loaded active   running Ceph prometheus.serverc for 2ae6...fa0c
      ...output omitted...
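
      Before restarting the daemon, it can be useful to check why it stopped. An optional check is to review the journal for the OSD unit on serverc (the fsid in the unit name differs in your lab environment):

      [admin@serverc ~]$ sudo journalctl -u ceph-2ae6...fa0c@osd.0.service --since "30 min ago"
      ...output omitted...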
    2. Start the OSD 0 service. The fsid and the OSD 0 service name are different in your lab environment.

      [admin@serverc ~]$ sudo systemctl start ceph-2ae6...fa0c@osd.0.service
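
      An alternative approach, assuming the orchestrator is responsive, is to restart the daemon from the cephadm shell on clienta instead of using systemctl on the OSD host:

      [ceph: root@clienta /]# ceph orch daemon restart osd.0

      This exercise uses systemctl directly so that you can see how cephadm maps each daemon to a systemd unit on its host.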
    3. Verify the status of the OSD 0 service.

      [admin@serverc ~]$ systemctl status ceph-2ae6...fa0c@osd.0.service
      ● ceph-2ae6...fa0c@osd.0.service - Ceph osd.0 for 2ae6...fa0c
         Loaded: loaded (/etc/systemd/system/ceph-2ae6...fa0c@.service; enabled; vendor preset: disabled)
         Active: active (running) since Wed 2021-10-20 09:16:45 EDT; 1min 7s ago
        Process: 14368 ExecStopPost=/bin/rm -f //run/ceph-2ae6...fa0c@osd.0.service-pid //run/ceph-2ae6...fa0c@osd.0.service-cid (code=exited, st>
        Process: 14175 ExecStopPost=/bin/bash /var/lib/ceph/2ae6...fa0c/osd.0/unit.poststop (code=exited, status=0/SUCCESS)
      ...output omitted...
    4. Return to the clienta system and use sudo to run the cephadm shell.

      [admin@serverc ~]$ exit
      Connection to serverc closed.
      [admin@clienta ~]$ sudo cephadm shell
      [ceph: root@clienta /]#
    5. Verify the OSD health. The OSD is now up.

      [ceph: root@clienta /]# ceph osd tree
      ID  CLASS  WEIGHT   TYPE NAME         STATUS  REWEIGHT  PRI-AFF
      -1         0.08817  root default
      -3         0.02939      host serverc
       0    hdd  0.00980          osd.0         up   1.00000  1.00000
       1    hdd  0.00980          osd.1         up   1.00000  1.00000
       2    hdd  0.00980          osd.2         up   1.00000  1.00000
      -5         0.02939      host serverd
       3    hdd  0.00980          osd.3         up   1.00000  1.00000
       5    hdd  0.00980          osd.5         up   1.00000  1.00000
       7    hdd  0.00980          osd.7         up   1.00000  1.00000
      -7         0.02939      host servere
       4    hdd  0.00980          osd.4         up   1.00000  1.00000
       6    hdd  0.00980          osd.6         up   1.00000  1.00000
       8    hdd  0.00980          osd.8         up   1.00000  1.00000
    6. Verify the health of the storage cluster. If the status is HEALTH_WARN and you have resolved the time skew and OSD issues, then wait until the cluster status is HEALTH_OK before continuing.

      [ceph: root@clienta /]# ceph health
      HEALTH_OK
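
      While you wait, you can optionally follow the recovery progress from the cephadm shell. The ceph -s command summarizes degraded objects and recovery activity, and ceph pg stat prints a one-line placement group summary:

      [ceph: root@clienta /]# ceph -s
      ...output omitted...
      [ceph: root@clienta /]# ceph pg stat
      ...output omitted...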
  4. Return to workstation as the student user.

    [ceph: root@clienta /]# exit
    [admin@clienta ~]$ exit
    [student@workstation ~]$

Finish

On the workstation machine, use the lab command to complete this exercise. This is important to ensure that resources from previous exercises do not impact upcoming exercises.

[student@workstation ~]$ lab finish tuning-troubleshoot

This concludes the guided exercise.

Revision: cl260-5.0-29d2128