Guided Exercise: Troubleshooting Clusters and Clients

In this exercise, you will configure tuning parameters and diagnose common problems for various Red Hat Ceph Storage services.

Outcomes

You should be able to identify the health check code reported for each failing Ceph component and resolve the underlying issues.

As the student user on the workstation machine, use the lab command to prepare your system for this exercise.

[student@workstation ~]$ lab start tuning-troubleshoot

Procedure 12.3. Instructions

  1. Log in to clienta as the admin user and use sudo to run the cephadm shell. Verify the health of the Ceph storage cluster.

    Two separate issues need troubleshooting. The first is a clock skew error, and the second is a down OSD that is causing degraded placement groups (PGs).

    [student@workstation ~]$ ssh admin@clienta
    [admin@clienta ~]$ sudo cephadm shell
    [ceph: root@clienta /]# ceph health detail
    HEALTH_WARN clock skew detected on mon.serverd; 1 osds down; Degraded data redundancy: 63/567 objects degraded (11.111%), 14 pgs degraded, 10 pgs undersized; 278 slow ops, oldest one blocked for 170 sec, mon.serverd has slow ops
    [WRN] MON_CLOCK_SKEW: clock skew detected on mon.serverd
        mon.serverd clock skew 299.103s > max 0.05s (latency 0.0204872s)
    [WRN] OSD_DOWN: 1 osds down
        osd.0 (root=default,host=serverc) is down
    [WRN] PG_DEGRADED: Degraded data redundancy: 63/567 objects degraded (11.111%), 14 pgs degraded, 10 pgs undersized
        pg 2.5 is stuck undersized for 2m, current state active+undersized, last acting [8,7]
        pg 2.c is stuck undersized for 2m, current state active+undersized, last acting [6,5]
    ...output omitted...

    Note

    The lab uses chronyd for time synchronization with the classroom server.
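    The monitors also report their own measurement of time synchronization. As an optional check from the cephadm shell, the ceph time-sync-status command shows the skew and latency that the monitors measure for each peer:

    [ceph: root@clienta /]# ceph time-sync-status
    ...output omitted...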

  2. First, troubleshoot the clock skew issue.

    Log in to serverd as the admin user. The previous health detail output showed that the clock on the serverd system differs by about 300 seconds from the rest of the cluster. Viewing the chronyd service status on the serverd system should identify the problem.

    1. Exit the cephadm shell. On the serverd system, view the chronyd service status.

      The chronyd service is inactive on the serverd system.

      [ceph: root@clienta /]# exit
      [admin@clienta ~]$ ssh admin@serverd
      admin@serverd's password: redhat
      [admin@serverd ~]$ systemctl status chronyd
      ● chronyd.service - NTP client/server
         Loaded: loaded (/usr/lib/systemd/system/chronyd.service; enabled; vendor preset: enabled)
         Active: inactive (dead) since Wed 2021-10-20 08:49:21 EDT; 13min ago
           Docs: man:chronyd(8)
                 man:chrony.conf(5)
       Main PID: 876 (code=exited, status=0/SUCCESS)
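
      Because chronyd is stopped, nothing is correcting the system clock on serverd. As an optional check, the timedatectl command reports the local time and whether the system clock is currently NTP-synchronized (the exact output depends on your environment):

      [admin@serverd ~]$ timedatectl
      ...output omitted...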
    2. Start the chronyd service.

      [admin@serverd ~]$ sudo systemctl start chronyd
    3. Verify that the chronyd service is active.

      [admin@serverd ~]$ systemctl status chronyd
      ● chronyd.service - NTP client/server
         Loaded: loaded (/usr/lib/systemd/system/chronyd.service; enabled; vendor preset: enabled)
         Active: active (running) since Wed 2021-10-20 09:04:01 EDT; 3min 10s left
           Docs: man:chronyd(8)
                 man:chrony.conf(5)
        Process: 15221 ExecStartPost=/usr/libexec/chrony-helper update-daemon (code=exited, status=0/SUCCESS)
        Process: 15218 ExecStart=/usr/sbin/chronyd $OPTIONS (code=exited, status=0/SUCCESS)
       Main PID: 15220 (chronyd)
          Tasks: 1 (limit: 36236)
         Memory: 360.0K
         CGroup: /system.slice/chronyd.service
                 └─15220 /usr/sbin/chronyd
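
      Depending on the chrony configuration, a large offset might be corrected immediately or slewed gradually. If the skew warning persists, an optional extra step, not required by this exercise, is to force an immediate correction and then inspect the remaining offset with chronyc:

      [admin@serverd ~]$ sudo chronyc makestep
      200 OK
      [admin@serverd ~]$ chronyc tracking
      ...output omitted...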
    4. Return to the clienta system and use sudo to run the cephadm shell.

      [admin@serverd ~]$ exit
      Connection to serverd closed.
      [admin@clienta ~]$ sudo cephadm shell
      [ceph: root@clienta /]#
    5. Verify the health of the storage cluster.

      The clock skew message might still be displayed if the monitors have not yet observed the corrected time. Allow the cluster a few minutes for the services to pick up the corrected time. Continue with these exercise steps, but verify that the skew issue is resolved before finishing the exercise.

      [ceph: root@clienta /]# ceph health detail
      HEALTH_WARN Degraded data redundancy: 40/567 objects degraded (7.055%), 8 pgs degraded, 20 pgs undersized; 1099 slow ops, oldest one blocked for 580 sec, mon.serverd has slow ops
      [WRN] PG_DEGRADED: Degraded data redundancy: 40/567 objects degraded (7.055%), 8 pgs degraded, 20 pgs undersized
          pg 2.0 is stuck undersized for 8m, current state active+undersized, last acting [3,6]
          pg 2.8 is stuck undersized for 8m, current state active+undersized, last acting [3,8]
      ...output omitted...

      Note

      The health detail output might show the cluster state as HEALTH_OK. When an OSD is down, the cluster migrates its PGs to other OSDs to return the cluster to a healthy state. However, the down OSD still requires troubleshooting.

  3. Locate the down OSD.

    The osd.0 service on the serverc system is reported as down.

    [ceph: root@clienta /]# ceph osd tree
    ID  CLASS  WEIGHT   TYPE NAME         STATUS  REWEIGHT  PRI-AFF
    -1         0.08817  root default
    -3         0.02939      host serverc
     0    hdd  0.00980          osd.0       down         0  1.00000
     1    hdd  0.00980          osd.1         up   1.00000  1.00000
     2    hdd  0.00980          osd.2         up   1.00000  1.00000
    -5         0.02939      host serverd
     3    hdd  0.00980          osd.3         up   1.00000  1.00000
     5    hdd  0.00980          osd.5         up   1.00000  1.00000
     7    hdd  0.00980          osd.7         up   1.00000  1.00000
    -7         0.02939      host servere
     4    hdd  0.00980          osd.4         up   1.00000  1.00000
     6    hdd  0.00980          osd.6         up   1.00000  1.00000
     8    hdd  0.00980          osd.8         up   1.00000  1.00000
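
    To narrow down the problem, the osd tree listing can optionally be filtered to show only down OSDs, and the ceph osd find command reports the host where a given OSD runs. These commands confirm that osd.0 is hosted on serverc:

    [ceph: root@clienta /]# ceph osd tree down
    ...output omitted...
    [ceph: root@clienta /]# ceph osd find 0
    ...output omitted...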
    1. Exit the cephadm shell. On the serverc system, list the Ceph service units.

      [ceph: root@clienta /]# exit
      exit
      [admin@clienta ~]$ ssh admin@serverc
      admin@serverc's password: redhat
      [admin@serverc ~]$ systemctl list-units --all 'ceph*'
      UNIT                                                                                 LOAD   ACTIVE   SUB     DESCRIPTION
      ...output omitted...
      ceph-2ae6...fa0c@node-exporter.serverc.service              loaded active   running Ceph node-exporter.serverc for 2ae6...fa0c
      ceph-2ae6...fa0c@osd.0.service                              loaded inactive dead    Ceph osd.0 for 2ae6...fa0c
      ceph-2ae6...fa0c@osd.1.service                              loaded active   running Ceph osd.1 for 2ae6...fa0c
      ceph-2ae6...fa0c@osd.2.service                              loaded active   running Ceph osd.2 for 2ae6...fa0c
      ceph-2ae6...fa0c@prometheus.serverc.service                 loaded active   running Ceph prometheus.serverc for 2ae6...fa0c
      ...output omitted...
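
      Before restarting the daemon, it can be useful to check why it stopped. An optional check is to review the journal for the OSD unit on serverc (the fsid in the unit name differs in your lab environment):

      [admin@serverc ~]$ sudo journalctl -u ceph-2ae6...fa0c@osd.0.service --since "30 min ago"
      ...output omitted...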
    2. Start the OSD 0 service. The fsid and the OSD 0 service name are different in your lab environment.

      [admin@serverc ~]$ sudo systemctl start ceph-2ae6...fa0c@osd.0.service
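
      An alternative approach, assuming the orchestrator is responsive, is to restart the daemon from the cephadm shell on clienta instead of using systemctl on the OSD host:

      [ceph: root@clienta /]# ceph orch daemon restart osd.0

      This exercise uses systemctl directly so that you can see how cephadm maps each daemon to a systemd unit on its host.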
    3. Verify the status of the OSD 0 service.

      [admin@serverc ~]$ systemctl status ceph-2ae6...fa0c@osd.0.service
      ● ceph-2ae6...fa0c@osd.0.service - Ceph osd.0 for 2ae6...fa0c
         Loaded: loaded (/etc/systemd/system/ceph-2ae6...fa0c@.service; enabled; vendor preset: disabled)
         Active: active (running) since Wed 2021-10-20 09:16:45 EDT; 1min 7s ago
        Process: 14368 ExecStopPost=/bin/rm -f //run/ceph-2ae6...fa0c@osd.0.service-pid //run/ceph-2ae6...fa0c@osd.0.service-cid (code=exited, st>
        Process: 14175 ExecStopPost=/bin/bash /var/lib/ceph/2ae6...fa0c/osd.0/unit.poststop (code=exited, status=0/SUCCESS)
      ...output omitted...
    4. Return to the clienta system and use sudo to run the cephadm shell.

      [admin@serverc ~]$ exit
      Connection to serverc closed.
      [admin@clienta ~]$ sudo cephadm shell
      [ceph: root@clienta /]#
    5. Verify the OSD health. The OSD is now up.

      [ceph: root@clienta /]# ceph osd tree
      ID  CLASS  WEIGHT   TYPE NAME         STATUS  REWEIGHT  PRI-AFF
      -1         0.08817  root default
      -3         0.02939      host serverc
       0    hdd  0.00980          osd.0         up   1.00000  1.00000
       1    hdd  0.00980          osd.1         up   1.00000  1.00000
       2    hdd  0.00980          osd.2         up   1.00000  1.00000
      -5         0.02939      host serverd
       3    hdd  0.00980          osd.3         up   1.00000  1.00000
       5    hdd  0.00980          osd.5         up   1.00000  1.00000
       7    hdd  0.00980          osd.7         up   1.00000  1.00000
      -7         0.02939      host servere
       4    hdd  0.00980          osd.4         up   1.00000  1.00000
       6    hdd  0.00980          osd.6         up   1.00000  1.00000
       8    hdd  0.00980          osd.8         up   1.00000  1.00000
    6. Verify the health of the storage cluster. If the status is HEALTH_WARN and you have resolved the time skew and OSD issues, then wait until the cluster status is HEALTH_OK before continuing.

      [ceph: root@clienta /]# ceph health
      HEALTH_OK
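
      While you wait, you can optionally follow the recovery progress from the cephadm shell. The ceph -s command summarizes degraded objects and recovery activity, and ceph pg stat prints a one-line placement group summary:

      [ceph: root@clienta /]# ceph -s
      ...output omitted...
      [ceph: root@clienta /]# ceph pg stat
      ...output omitted...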
  4. Return to workstation as the student user.

    [ceph: root@clienta /]# exit
    [admin@clienta ~]$ exit
    [student@workstation ~]$

Finish

On the workstation machine, use the lab command to complete this exercise. This is important to ensure that resources from previous exercises do not impact upcoming exercises.

[student@workstation ~]$ lab finish tuning-troubleshoot

This concludes the guided exercise.

Revision: cl260-5.0-29d2128