Lab: Tuning and Troubleshooting Red Hat Ceph Storage

In this lab, you will tune and troubleshoot Red Hat Ceph Storage.

Outcomes

You should be able to:

  • Troubleshoot Ceph clients and Ceph performance issues.

  • Modify the logging options for Ceph clients.

  • Tune the recovery and backfill processes to preserve cluster performance.

As the student user on the workstation machine, use the lab command to prepare your system for this lab.

This command ensures that the lab environment is available for the exercise.

[student@workstation ~]$ lab start tuning-review

Procedure 12.4. Instructions

  1. Log in to clienta as the admin user. Verify the health of the Ceph storage cluster. Hypothesize possible causes for the displayed issues.

    1. Log in to clienta as the admin user and use sudo to run the cephadm shell.

      [student@workstation ~]$ ssh admin@clienta
      [admin@clienta ~]$ sudo cephadm shell
      [ceph: root@clienta /]#
    2. Verify the status of the storage cluster.

      [ceph: root@clienta /]# ceph status
        cluster:
          id:     2ae6d05a-229a-11ec-925e-52540000fa0c
          health: HEALTH_WARN
                  1 failed cephadm daemon(s)
                  clock skew detected on mon.serverd
                  1 osds down
                  Reduced data availability: 4 pgs inactive, 19 pgs peering
                  Degraded data redundancy: 35/567 objects degraded (6.173%), 7 pgs degraded
      
        services:
          mon: 4 daemons, quorum serverc.lab.example.com,clienta,serverd,servere (age 33s)
          mgr: serverc.lab.example.com.aiqepd(active, since 9m), standbys: clienta.nncugs, servere.kjwyko, serverd.klrkci
          osd: 9 osds: 8 up (since 32s), 9 in (since 8m)
          rgw: 2 daemons active (2 hosts, 1 zones)
      
        data:
          pools:   5 pools, 105 pgs
          objects: 189 objects, 4.9 KiB
          usage:   105 MiB used, 90 GiB / 90 GiB avail
          pgs:     18.095% pgs not active
                   35/567 objects degraded (6.173%)
                   68 active+clean
                   19 peering
                   11 active+undersized
                   7  active+undersized+degraded
    3. Verify the health details of the storage cluster.

      Two separate issues need troubleshooting. The first issue is a clock skew error, and the second issue is a down OSD that is degrading the PGs.

      [ceph: root@clienta /]# ceph health detail
      HEALTH_WARN 1 failed cephadm daemon(s); clock skew detected on mon.serverd; 1 osds down; Degraded data redundancy: 63/567 objects degraded (11.111%), 15 pgs degraded; 44 slow ops, oldest one blocked for 79 sec, mon.serverd has slow ops
      [WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
          daemon osd.4 on servere.lab.example.com is in unknown state
      [WRN] MON_CLOCK_SKEW: clock skew detected on mon.serverd
          mon.serverd clock skew 299.62s > max 0.05s (latency 0.0155988s)
      [WRN] OSD_DOWN: 1 osds down
          osd.4 (root=default,host=servere) is down
      [WRN] PG_DEGRADED: Degraded data redundancy: 63/567 objects degraded (11.111%), 15 pgs degraded
          pg 3.1 is active+undersized+degraded, acting [0,7]
      ...output omitted...
          pg 4.15 is active+undersized+degraded, acting [0,7]
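
      Optionally, to see how cephadm itself reports each daemon, the orchestrator can list daemon status directly. The --daemon-type filter narrows the listing to OSDs; if your release does not accept that option, omit it to list every daemon:

      [ceph: root@clienta /]# ceph orch ps --daemon-type osd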
  2. First, troubleshoot the clock skew issue.

    1. Open a second terminal and log in to serverd as the admin user. The previous health detail output showed that the clock on the serverd system differs from the clocks on the other servers by approximately 300 seconds. View the chronyd service status on the serverd system to identify the problem.

      The chronyd service is inactive on the serverd system.

      [student@workstation ~]$ ssh admin@serverd
      [admin@serverd ~]$ systemctl status chronyd
      ● chronyd.service - NTP client/server
         Loaded: loaded (/usr/lib/systemd/system/chronyd.service; enabled; vendor preset: enabled)
         Active: inactive (dead) since Mon 2021-10-25 03:51:44 EDT; 8min ago
           Docs: man:chronyd(8)
                 man:chrony.conf(5)
       Main PID: 867 (code=exited, status=0/SUCCESS)
    2. Restart the chronyd service.

      [admin@serverd ~]$ sudo systemctl start chronyd
    3. Verify that the chronyd service is active.

      [admin@serverd ~]$ systemctl status chronyd
      ● chronyd.service - NTP client/server
         Loaded: loaded (/usr/lib/systemd/system/chronyd.service; enabled; vendor preset: enabled)
         Active: active (running) since Mon 2021-10-25 04:00:54 EDT; 4min 53s left
           Docs: man:chronyd(8)
                 man:chrony.conf(5)
        Process: 14604 ExecStartPost=/usr/libexec/chrony-helper update-daemon (code=exited, status=0/SUCCESS)
        Process: 14601 ExecStart=/usr/sbin/chronyd $OPTIONS (code=exited, status=0/SUCCESS)
       Main PID: 14603 (chronyd)
          Tasks: 1 (limit: 36236)
         Memory: 360.0K
         CGroup: /system.slice/chronyd.service
                 └─14603 /usr/sbin/chronyd
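
      Optionally, to confirm that serverd is actually synchronizing with its time sources, and not only that the daemon is running, you can query chrony directly on serverd:

      [admin@serverd ~]$ chronyc tracking
      [admin@serverd ~]$ chronyc sources -v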
    4. Return to workstation as the student user and close the second terminal.

      [admin@serverd ~]$ exit
      Connection to serverd closed.
      [student@workstation ~]$ exit
    5. In the first terminal, verify the health of the storage cluster.

      The time skew message might still display if the monitor service has not yet refreshed its time measurements. Allow the cluster sufficient time for its services to obtain the corrected time. Continue with these exercise steps, but verify that the skew issue is resolved before finishing the exercise.

      [ceph: root@clienta /]# ceph health detail
      HEALTH_WARN 1 failed cephadm daemon(s); 1 osds down; Degraded data redundancy: 63/567 objects degraded (11.111%), 15 pgs degraded, 16 pgs undersized; 45 slow ops, oldest one blocked for 215 sec, mon.serverd has slow ops
      [WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
          daemon osd.4 on servere.lab.example.com is in error state
      [WRN] OSD_DOWN: 1 osds down
          osd.4 (root=default,host=servere) is down
      [WRN] PG_DEGRADED: Degraded data redundancy: 63/567 objects degraded (11.111%), 15 pgs degraded, 16 pgs undersized
          pg 2.4 is stuck undersized for 4m, current state active+undersized, last acting [1,3]
      ...output omitted...
          pg 5.6 is stuck undersized for 4m, current state active+undersized, last acting [2,5]
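
      The monitors keep their own estimate of clock skew. If the MON_CLOCK_SKEW warning persists longer than expected, you can ask the monitors for their current skew measurements:

      [ceph: root@clienta /]# ceph time-sync-status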
  3. Troubleshoot the down OSD issue. Use diagnostic logging to find and correct a non-working configuration.

    1. Locate the down OSD.

      [ceph: root@clienta /]# ceph osd tree
      ID  CLASS  WEIGHT   TYPE NAME         STATUS  REWEIGHT  PRI-AFF
      -1         0.08817  root default
      -3         0.02939      host serverc
       0    hdd  0.00980          osd.0         up   1.00000  1.00000
       1    hdd  0.00980          osd.1         up   1.00000  1.00000
       2    hdd  0.00980          osd.2         up   1.00000  1.00000
      -5         0.02939      host serverd
       3    hdd  0.00980          osd.3         up   1.00000  1.00000
       5    hdd  0.00980          osd.5         up   1.00000  1.00000
       7    hdd  0.00980          osd.7         up   1.00000  1.00000
      -7         0.02939      host servere
       4    hdd  0.00980          osd.4       down   1.00000  1.00000
       6    hdd  0.00980          osd.6         up   1.00000  1.00000
       8    hdd  0.00980          osd.8         up   1.00000  1.00000
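
      The tree shows that osd.4 on the servere host is down. To query a single OSD instead of reading the whole tree, you can also use ceph osd find, which reports the host and addresses that the cluster has recorded for that OSD:

      [ceph: root@clienta /]# ceph osd find 4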
    2. Attempt to restart the OSD 4 service with the ceph orch command.

      OSD 4 remains down even after waiting a sufficient time for the restart to complete.

      [ceph: root@clienta /]# ceph orch daemon restart osd.4
      Scheduled to restart osd.4 on host 'servere.lab.example.com'
    3. Attempt to retrieve the configuration from the OSD 4 service.

      An error message indicates a communication problem with the OSD 4 service.

      [ceph: root@clienta /]# ceph tell osd.4 config show
      Error ENXIO: problem getting command descriptions from osd.4
    4. Open a second terminal and log in to servere as the admin user. On the servere system, list the Ceph units.

      [student@workstation ~]$ ssh admin@servere
      [admin@servere ~]$ systemctl list-units --all 'ceph*'
        UNIT                                                                    LOAD   ACTIVE SUB     DESCRIPTION
        ceph-2ae6...fa0c@crash.servere.service         loaded active running Ceph crash.servere for 2ae6...fa0c
        ceph-2ae6...fa0c@mgr.servere.kjwyko.service    loaded active running Ceph mgr.servere.kjwyko for 2ae6...fa0c
        ceph-2ae6...fa0c@mon.servere.service           loaded active running Ceph mon.servere for 2ae6...fa0c
        ceph-2ae6...fa0c@node-exporter.servere.service loaded active running Ceph node-exporter.servere for 2ae6...fa0c
      ● ceph-2ae6...fa0c@osd.4.service                 loaded failed failed  Ceph osd.4 for 2ae6...fa0c
        ceph-2ae6...fa0c@osd.6.service                 loaded active running Ceph osd.6 for 2ae6...fa0c
        ceph-2ae6...fa0c@osd.8.service                 loaded active running Ceph osd.8 for 2ae6...fa0c
      ...output omitted...

      The OSD 4 service might not yet be listed as failed if the orchestrator is still attempting to restart it. Wait until the service is listed as failed before continuing with this exercise.

    5. Restart the OSD 4 service. The cluster FSID, and therefore the full name of the OSD 4 service, is different in your lab environment.

      The OSD 4 service still fails to start.

      [admin@servere ~]$ sudo systemctl restart \
      ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@osd.4.service
      Job for ceph-2ae6...fa0c@osd.4.service failed because the control process exited with error code.
      See "systemctl status ceph-2ae6...fa0c@osd.4.service" and "journalctl -xe" for details.
    6. In the first terminal, modify the OSD 4 logging configuration to write to the /var/log/ceph/myosd4.log file and increase the logging level for OSD 4. Attempt to restart the OSD 4 service with the ceph orch command.

      [ceph: root@clienta /]# ceph config set osd.4 log_file /var/log/ceph/myosd4.log
      [ceph: root@clienta /]# ceph config set osd.4 log_to_file true
      [ceph: root@clienta /]# ceph config set osd.4 debug_ms 1
      [ceph: root@clienta /]# ceph orch daemon restart osd.4
      Scheduled to restart osd.4 on host 'servere.lab.example.com'
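
      The log_file, log_to_file, and debug_ms settings are only needed for this troubleshooting session. After the root cause is found, you can remove them from the central configuration database so that osd.4 returns to its default logging behavior, for example:

      [ceph: root@clienta /]# ceph config rm osd.4 debug_ms
      [ceph: root@clienta /]# ceph config rm osd.4 log_to_file
      [ceph: root@clienta /]# ceph config rm osd.4 log_file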
    7. In the second terminal window, view the myosd4.log file to discover the issue. Because cephadm bind mounts the host's /var/log/ceph/<fsid>/ directory into the container as /var/log/ceph/, the file appears on the host under /var/log/ceph/<fsid>/.

      Error messages indicate an incorrect cluster network address configuration.

      [admin@servere ~]$ sudo cat \
      /var/log/ceph/2ae6d05a-229a-11ec-925e-52540000fa0c/myosd4.log
      ...output omitted...
      2021-10-25T08:26:22.617+0000 7f86c1f9e080  1 bdev(0x561a77661000 /var/lib/ceph/osd/ceph-4/block) close
      2021-10-25T08:26:22.877+0000 7f86c1f9e080  1 bdev(0x561a77660c00 /var/lib/ceph/osd/ceph-4/block) close
      2021-10-25T08:26:23.125+0000 7f86c1f9e080  0 starting osd.4 osd_data /var/lib/ceph/osd/ceph-4 /var/lib/ceph/osd/ceph-4/journal
      2021-10-25T08:26:23.125+0000 7f86c1f9e080 -1 unable to find any IPv4 address in networks '192.168.0.0/24' interfaces ''
      2021-10-25T08:26:23.125+0000 7f86c1f9e080 -1 Failed to pick cluster address.
    8. Return to workstation as the student user and close the second terminal.

      [admin@servere ~]$ exit
      [student@workstation ~]$ exit
    9. In the first terminal, compare the cluster network addresses for the OSD 0 and OSD 4 services.

      [ceph: root@clienta /]# ceph config get osd.0 cluster_network
      172.25.249.0/24
      [ceph: root@clienta /]# ceph config get osd.4 cluster_network
      192.168.0.0/24
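
      If you do not already know which option differs, ceph config dump lists every option stored in the central configuration database, including per-daemon overrides such as this osd.4 entry:

      [ceph: root@clienta /]# ceph config dump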
    10. Modify the cluster network value for the OSD 4 service. Attempt to restart the OSD 4 service.

      [ceph: root@clienta /]# ceph config set osd.4 cluster_network 172.25.249.0/24
      [ceph: root@clienta /]# ceph orch daemon restart osd.4
    11. Verify that the OSD 4 service is now up. Verify the health of the storage cluster.

      [ceph: root@clienta /]# ceph osd tree
      ID  CLASS  WEIGHT   TYPE NAME         STATUS  REWEIGHT  PRI-AFF
      -1         0.08817  root default
      -3         0.02939      host serverc
       0    hdd  0.00980          osd.0         up   1.00000  1.00000
       1    hdd  0.00980          osd.1         up   1.00000  1.00000
       2    hdd  0.00980          osd.2         up   1.00000  1.00000
      -7         0.02939      host serverd
       3    hdd  0.00980          osd.3         up   1.00000  1.00000
       5    hdd  0.00980          osd.5         up   1.00000  1.00000
       7    hdd  0.00980          osd.7         up   1.00000  1.00000
      -5         0.02939      host servere
       4    hdd  0.00980          osd.4         up   1.00000  1.00000
       6    hdd  0.00980          osd.6         up   1.00000  1.00000
       8    hdd  0.00980          osd.8         up   1.00000  1.00000
      [ceph: root@clienta /]# ceph health detail
      HEALTH_OK
  4. For the OSD 5 service, set the operations history size to track 40 completed operations and the operations history duration to 700 seconds.

    1. Modify the osd_op_history_size and osd_op_history_duration parameters. Set the history size parameter to 40 and the history duration parameter to 700. Use the ceph tell command to verify the changed values.

      [ceph: root@clienta /]# ceph tell osd.5 config set osd_op_history_size 40
      {
          "success": "osd_op_history_size = '40' "
      }
      [ceph: root@clienta /]# ceph tell osd.5 config set osd_op_history_duration 700
      {
          "success": "osd_op_history_duration = '700' "
      }
      [ceph: root@clienta /]# ceph tell osd.5 dump_historic_ops | head -n 3
      {
          "size": 40,
          "duration": 700,
  5. For all OSDs, modify the current runtime value for the maximum concurrent backfills to 3 and for the maximum active recovery operations to 1.

    1. Set the value of the osd_max_backfills parameter to 3.

      [ceph: root@clienta /]# ceph tell osd.* config set osd_max_backfills 3
      osd.0: {
          "success": "osd_max_backfills = '3' "
      }
      ...output omitted...
      osd.8: {
          "success": "osd_max_backfills = '3' "
      }
    2. Set the value of the osd_recovery_max_active parameter to 1.

      [ceph: root@clienta /]# ceph tell osd.* config set osd_recovery_max_active 1
      osd.0: {
          "success": "osd_recovery_max_active = '1' "
      }
      ...output omitted...
      osd.8: {
          "success": "osd_recovery_max_active = '1' "
      }
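
      Values set with ceph tell are runtime-only and are lost when an OSD restarts. If you also want these limits to persist across restarts, you can store them in the central configuration database for all OSDs; this optional step is beyond the exercise instructions:

      [ceph: root@clienta /]# ceph config set osd osd_max_backfills 3
      [ceph: root@clienta /]# ceph config set osd osd_recovery_max_active 1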
  6. Return to workstation as the student user.

    [ceph: root@clienta /]# exit
    [admin@clienta ~]$ exit
    [student@workstation ~]$

Evaluation

Grade your work by running the lab grade tuning-review command from your workstation machine. Correct any reported failures and rerun the script until successful.

[student@workstation ~]$ lab grade tuning-review

Finish

On the workstation machine, use the lab command to complete this exercise. This is important to ensure that resources from previous exercises do not impact upcoming exercises.

[student@workstation ~]$ lab finish tuning-review

This concludes the lab.

Revision: cl260-5.0-29d2128