Abstract
| Goal | Manage an operational Ceph cluster using tools to check status, monitor services, and properly start and stop all or part of the cluster. Perform cluster maintenance by replacing or repairing cluster components, including MONs, OSDs, and PGs. |
| Objectives | |
| Sections | |
| Lab | |
Managing a Red Hat Ceph Storage Cluster |
After completing this section, you should be able to administer and monitor a Red Hat Ceph Storage cluster, including starting and stopping specific services or the full cluster, and querying cluster health and utilization.
The role of the Red Hat Ceph Storage Manager (MGR) is to collect cluster statistics.
Client I/O operations continue normally while MGR nodes are down, but queries for cluster statistics fail. Deploy at least two MGRs for each cluster to provide high availability. MGRs are typically run on the same hosts as MON nodes, but it is not required.
The first MGR daemon that is started in a cluster becomes the active MGR and all other MGRs are on standby.
If the active MGR does not send a beacon within the configured time interval, a standby MGR takes over.
You can configure the mon_mgr_beacon_grace setting to change the beacon time interval if needed.
The default value is 30 seconds.
Use the ceph mgr fail MGR_NAME command to manually fail over from the active MGR to a standby MGR.
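For example, the following commands adjust the beacon grace period and then trigger a manual failover. The 45-second value and the mgr1 daemon name are illustrative only; substitute values that suit your cluster.
[ceph: root@node /]# ceph config set mon mon_mgr_beacon_grace 45
[ceph: root@node /]# ceph mgr fail mgr1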
Use the ceph mgr stat command to view the status of the MGRs.
[ceph: root@node /]# ceph mgr stat
{
"epoch": 32,
"available": true,
"active_name": "mgr1",
"num_standby": 3
}
The Ceph MGR has a modular architecture. You can enable or disable modules as needed. The MGR collects cluster statistical data and can send the data to external monitoring and management systems.
View the modules that are available and enabled by using the ceph mgr module ls command.
View published addresses for specific modules, such as the Dashboard module URL, by using the ceph mgr services command.
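For example, the following commands list the modules, enable the prometheus module, and then display the addresses that enabled modules publish. The prometheus module is used here only as an illustration; enable whichever modules your cluster needs.
[ceph: root@node /]# ceph mgr module ls
[ceph: root@node /]# ceph mgr module enable prometheus
[ceph: root@node /]# ceph mgr services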
The Ceph Dashboard provides cluster management and monitoring through a browser-based user interface. The Dashboard enables viewing cluster statistics and alerts, and performing selected cluster management tasks. The Ceph Dashboard requires an active MGR daemon with the Dashboard MGR module enabled.
The Dashboard relies on the Prometheus and Grafana services to display collected monitoring data and to generate alerts. Prometheus is an open source monitoring and alerting tool. Grafana is an open source statistical graphing tool.
The Dashboard supports alerts based on Ceph metrics and configured thresholds. The Prometheus AlertManager component configures, gathers, and triggers the alerts. Alerts are displayed in the Dashboard as notifications. You can view details of recent alerts and mute alerts.
You can use the ceph health command to quickly verify the state of the cluster.
This command returns one of the following states:
HEALTH_OK indicates that the cluster is operating normally.
HEALTH_WARN indicates that the cluster is in a warning condition.
For example, an OSD is down, but there are enough OSDs working properly for the cluster to function.
HEALTH_ERR indicates that the cluster is in an error condition.
For example, a full OSD could have an impact on the functionality of the cluster.
If the Ceph cluster is in a warning or an error state, the ceph health detail command provides additional details.
[ceph: root@node /]# ceph health detail
The ceph -w command displays additional real-time monitoring information about the events happening in the Ceph cluster.
[ceph: root@node /]# ceph -w
This command provides the status of cluster activities, such as the following details:
Data rebalancing across the cluster
Replica recovery across the cluster
Scrubbing activity
OSDs starting and stopping
To monitor the cephadm log, use the ceph -W cephadm command.
Use the ceph log last cephadm command to view the most recent log entries.
[ceph: root@node /]# ceph -W cephadm
Containerized services are controlled by systemd on the container host system.
Run systemctl commands on the container host system to start, stop, or restart cluster daemons.
Cluster daemons are identified by their daemon type ($daemon) and daemon ID ($id).
The $daemon type is mon, mgr, mds, osd, rgw, rbd-mirror, crash, or cephfs-mirror.
The daemon $id for MON, MGR, and RGW is the host name.
The daemon $id for OSD is the OSD ID.
The daemon $id for MDS is the file system name followed by the host name.
Use the ceph orch ps command to list all cluster daemons.
Use the --daemon_type=DAEMON option to filter for a specific daemon type.
[ceph: root@node /]# ceph orch ps --daemon_type=osd
NAME HOST STATUS REFRESHED AGE PORTS VERSION IMAGE ID CONTAINER ID
osd.0 node1 running (14h) 2m ago 2d - 16.2.0-117.el8cp 2142b60d7974 7b1e76ef06d1
osd.1 node1 running (14h) 2m ago 2d - 16.2.0-117.el8cp 2142b60d7974 85fb30af4ec2
osd.2 node1 running (13h) 2m ago 2d - 16.2.0-117.el8cp 2142b60d7974 bb66b3b6107c
osd.3 node2 running (14h) 2m ago 2d - 16.2.0-117.el8cp 2142b60d7974 1f63f7fb88f4
osd.4 node2 running (14h) 2m ago 2d - 16.2.0-117.el8cp 2142b60d7974 3f1c54eee927
osd.5 node2 running (14h) 2m ago 2d - 16.2.0-117.el8cp 2142b60d7974 366d5208c73f
osd.6 node3 running (14h) 2m ago 2d - 16.2.0-117.el8cp 2142b60d7974 5e5f9cde6c55
osd.7 node3 running (14h) 2m ago 2d - 16.2.0-117.el8cp 2142b60d7974 0824d7e78aa0
osd.8 node3 running (14h) 2m ago 2d - 16.2.0-117.el8cp 2142b60d7974 f85c8af8996d
To stop, start, or restart a daemon on a host, use systemctl commands and the daemon name.
To list the names of all daemons on a cluster host, run the systemctl list-units command and search for ceph.
The cluster fsid is in the daemon name.
Some service names end in a random six-character string to distinguish individual services of the same type on the same host.
[root@node ~]# systemctl list-units 'ceph*'
UNIT LOAD ACTIVE SUB DESCRIPTION
ceph-ff97a876-1fd2-11ec-8258-52540000fa0c@crash.clienta.service loaded active running Ceph crash.clienta...
ceph-ff97a876-1fd2-11ec-8258-52540000fa0c@mgr.clienta.soxncl.service loaded active running Ceph mgr.clienta.soxncl...
ceph-ff97a876-1fd2-11ec-8258-52540000fa0c@mon.clienta.service loaded active running Ceph mon.clienta...
ceph-ff97a876-1fd2-11ec-8258-52540000fa0c@node-exporter.clienta.service loaded active running Ceph node-exporter...
ceph-ff97a876-1fd2-11ec-8258-52540000fa0c.target loaded active active Ceph cluster...
ceph.target loaded active active All Ceph...
...output omitted...
Use the ceph.target systemd target to manage all the daemons on a cluster node.
[root@node ~]# systemctl restart ceph.target
You can also use the ceph orch command to manage cluster services.
First, obtain the service name by using the ceph orch ls command.
For example, find the service name for cluster OSDs and restart the service.
[ceph: root@node /]# ceph orch ls
NAME                     RUNNING  REFRESHED  AGE   PLACEMENT
alertmanager             1/1      2s ago     2d    count:1
crash                    4/4      3s ago     2d    *
grafana                  1/1      2s ago     2d    count:1
mds.fs1                  3/3      3s ago     114m  node1;node2;node3;count:3
mgr                      4/4      3s ago     2d    node1;node2;node3;node4
mon                      4/4      3s ago     2d    node1;node2;node3;node4
node-exporter            4/4      3s ago     2d    *
osd.default_drive_group  8/12     3s ago     2d    server*
prometheus               1/1      2s ago     2d    count:1
rgw.realm.zone           2/2      3s ago     2d    node3;node4
[ceph: root@node /]# ceph orch restart osd.default_drive_group
Scheduled to restart osd.0 on host 'node1'
Scheduled to restart osd.1 on host 'node1'
Scheduled to restart osd.2 on host 'node1'
Scheduled to restart osd.3 on host 'node2'
Scheduled to restart osd.5 on host 'node2'
Scheduled to restart osd.7 on host 'node2'
Scheduled to restart osd.4 on host 'node3'
Scheduled to restart osd.6 on host 'node3'
Scheduled to restart osd.8 on host 'node3'
You can manage an individual cluster daemon by using the ceph orch daemon command.
[ceph: root@node /]# ceph orch daemon restart osd.1
Ceph supports cluster flags to control the behavior of the cluster. You must set some flags when restarting the cluster or performing cluster maintenance. You can use cluster flags to limit the impact of a failed cluster component or to prevent cluster performance issues.
Use the ceph osd set and ceph osd unset commands to manage these flags, as shown in the example after the following list:
noup
Do not automatically mark a starting OSD as up.
If the cluster network is experiencing latency issues, OSDs can mark each other down on the MON, then mark themselves up.
This scenario is called flapping. Set the noup and nodown flags to prevent flapping.
nodown
The nodown flag tells the Ceph MON not to mark a stopping OSD with the down state.
Use the nodown flag when performing maintenance or a cluster shutdown. Set the nodown flag to prevent flapping.
noout
The noout flag tells the Ceph MON not to remove any OSDs from the CRUSH map, which prevents CRUSH from automatically rebalancing the cluster when OSDs are stopped.
Use the noout flag when performing maintenance on a subset of the cluster.
Clear the flag after the OSDs are restarted.
noin
The noin flag prevents booting OSDs from being marked with the in state.
The flag prevents data from being automatically allocated to that specific OSD.
norecover
The norecover flag prevents recovery operations from running.
Use the norecover flag when performing maintenance or a cluster shutdown.
nobackfill
The nobackfill flag prevents backfill operations from running.
Use the nobackfill flag when performing maintenance or a cluster shutdown.
Backfilling is discussed later in this section.
norebalance
The norebalance flag prevents rebalancing operations from running.
Use the norebalance flag when performing maintenance or a cluster shutdown.
noscrub
The noscrub flag prevents scrubbing operations from running.
Scrubbing will be discussed later in this section.
nodeep-scrub
The nodeep-scrub flag prevents any deep-scrubbing operation from running.
Deep-scrubbing is discussed later in this section.
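As a minimal example of how these flags are managed, the following commands set and then clear the noout flag:
[ceph: root@node /]# ceph osd set noout
[ceph: root@node /]# ceph osd unset noout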
Perform the following steps to shut down the entire cluster:
Prevent clients from accessing the cluster.
Ensure that the cluster is in a healthy state (HEALTH_OK) and that all PGs are in an active+clean state before proceeding.
Bring down CephFS.
Set the noout, norecover, norebalance, nobackfill, nodown, and pause flags (see the example after this procedure).
Shut down all Ceph Object Gateways (RGW) and iSCSI Gateways.
Shut down OSD nodes one by one.
Shut down MON and MGR nodes one by one.
Shut down the admin node.
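The flag-setting step of the shutdown procedure looks like the following sketch when run from a node with an admin keyring:
[ceph: root@node /]# ceph osd set noout
[ceph: root@node /]# ceph osd set norecover
[ceph: root@node /]# ceph osd set norebalance
[ceph: root@node /]# ceph osd set nobackfill
[ceph: root@node /]# ceph osd set nodown
[ceph: root@node /]# ceph osd set pause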
Perform the following steps to power on the cluster:
Power up cluster nodes in the following order: admin node, MON and MGR nodes, OSD nodes, MDS nodes.
Clear the noout, norecover, norebalance, nobackfill, nodown, and pause flags (see the example after this procedure).
Bring up Ceph Object Gateways and iSCSI Gateways.
Bring up CephFS.
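The flag-clearing step can be performed with individual ceph osd unset commands, or with a short shell loop such as the following sketch:
[ceph: root@node /]# for flag in noout norecover norebalance nobackfill nodown pause; do ceph osd unset $flag; done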
View the MON quorum status with the ceph mon stat or the ceph quorum_status -f json-pretty commands.
[ceph: root@node /]# ceph mon stat
[ceph: root@node /]# ceph quorum_status -f json-pretty
You can also view the status of MONs in the Dashboard.
To view daemon logs, use the journalctl -u command with the daemon's systemd unit name.
To show only the most recent journal entries and follow new messages, use the -f option.
For example, the following command views the logs for that host's OSD 10 daemon.
[root@node ~]# journalctl -u \
ceph-ff97a876-1fd2-11ec-8258-52540000fa0c@osd.10.service
Ceph containers write to individual log files for each daemon.
Enable logging for each specific Ceph daemon by configuring the daemon's log_to_file setting to true.
This example enables logging for MON nodes.
[ceph: root@node /]# ceph config set mon log_to_file true
If the cluster is not healthy, Ceph displays a detailed status report containing the following information:
Current status of the OSDs (up/down/out/in)
OSD near capacity limit information (nearfull/full)
Current status of the placement groups (PGs)
The ceph status and ceph health commands report space-related warning or error conditions.
The various ceph osd subcommands report OSD usage details, status, and location information.
The ceph osd df command displays OSD usage statistics.
Use the ceph osd df tree command to display the CRUSH tree in the command output.
[ceph: root@node /]# ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
0 hdd 0.00980 1.00000 10 GiB 1.0 GiB 28 MiB 20 KiB 1024 MiB 9.0 GiB 10.28 1.00 41 up
1 hdd 0.00980 1.00000 10 GiB 1.0 GiB 29 MiB 40 KiB 1024 MiB 9.0 GiB 10.29 1.00 58 up
2 hdd 0.00980 1.00000 10 GiB 1.0 GiB 28 MiB 20 KiB 1024 MiB 9.0 GiB 10.28 1.00 30 up
3 hdd 0.00980 1.00000 10 GiB 1.0 GiB 28 MiB 20 KiB 1024 MiB 9.0 GiB 10.28 1.00 43 up
4 hdd 0.00980 1.00000 10 GiB 1.0 GiB 28 MiB 20 KiB 1024 MiB 9.0 GiB 10.28 1.00 46 up
5 hdd 0.00980 1.00000 10 GiB 1.0 GiB 28 MiB 20 KiB 1024 MiB 9.0 GiB 10.28 1.00 40 up
6 hdd 0.00980 1.00000 10 GiB 1.0 GiB 29 MiB 44 KiB 1024 MiB 9.0 GiB 10.28 1.00 44 up
7 hdd 0.00980 1.00000 10 GiB 1.0 GiB 28 MiB 44 KiB 1024 MiB 9.0 GiB 10.28 1.00 38 up
8 hdd 0.00980 1.00000 10 GiB 1.0 GiB 28 MiB 44 KiB 1024 MiB 9.0 GiB 10.28 1.00 47 up
TOTAL 90 GiB 9.2 GiB 255 MiB 274 KiB 9.0 GiB 81 GiB 10.28
The following table describes each column of the command output:
| Output column | Description |
|---|---|
| ID | The OSD ID. |
| CLASS | The type of devices that the OSD uses (HDD, SSD, or NVMe). |
| WEIGHT | The weight of the OSD in the CRUSH map. By default, this is set to the OSD capacity in TB and is changed by using the ceph osd crush reweight command. The weight determines how much data CRUSH places onto the OSD relative to other OSDs. For example, two OSDs with the same weight receive roughly the same number of I/O requests and store approximately the same amount of data. |
| REWEIGHT | Either the default reweight value or the actual value set by the ceph osd reweight command. You can reweight an OSD to temporarily override the CRUSH weight. |
| SIZE | The total OSD storage capacity. |
| RAW USE | The utilized OSD storage capacity. |
| DATA | OSD capacity used by user data. |
| OMAP | The BlueFS storage that is used to store object map (OMAP) data, which are the key-value pairs stored in RocksDB. |
| META | The total BlueFS space allocated, or the value of the bluestore_bluefs_min setting, whichever is larger. This is the internal BlueStore metadata, which is calculated as the total space allocated to BlueFS minus the estimated OMAP data size. |
| AVAIL | Free space available on the OSD. |
| %USE | The percentage of storage capacity used on the OSD. |
| VAR | The variation above or below the average OSD utilization. |
| PGS | The number of placement groups on the OSD. |
| STATUS | The status of the OSD. |
Use the ceph osd perf command to view OSD performance statistics.
[ceph: root@node /]# ceph osd perf
An OSD daemon can be in one of four states, based on the combination of these two flags:
down or up - indicating whether the daemon is running and communicating with the MONs.
out or in - indicating whether the OSD is participating in cluster data placement.
The state of an OSD in normal operation is up and in.
If an OSD fails and the daemon goes offline, the cluster might report it as down and in for a short period of time.
This is intended to give the OSD a chance to recover on its own and rejoin the cluster, avoiding unnecessary recovery traffic.
For example, a brief network interruption might cause the OSD to lose communication with the cluster and be temporarily reported as down.
After a short interval controlled by the mon_osd_down_out_interval configuration option (five minutes by default), the cluster reports the OSD as down and out.
At this point, the placement groups assigned to the failed OSD are migrated to other OSDs.
If the failed OSD then returns to the up and in states, the cluster reassigns placement groups based on the new set of OSDs and by rebalancing the objects in the cluster.
Use the ceph osd set noout and ceph osd unset noout commands to enable or disable the noout flag on the cluster.
In contrast, the ceph osd out OSD_ID command tells the Ceph cluster to ignore an OSD for data placement and marks that OSD with the out state.
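For example, the following commands take OSD 3 out of data placement before maintenance and return it afterward. The OSD ID 3 is a placeholder:
[ceph: root@node /]# ceph osd out 3
[ceph: root@node /]# ceph osd in 3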
OSDs verify each other's status at regular time intervals (six seconds by default).
They report their status to the MONs every 120 seconds, by default.
If an OSD is down, the other OSDs or the MONs do not receive heartbeat responses from that down OSD.
The following configuration settings manage OSD heartbeats (see the example after the table):
| Configuration option | Description |
|---|---|
| osd_heartbeat_interval | Number of seconds between OSD peer checks. |
| osd_heartbeat_grace | Number of seconds before an unresponsive OSD moves to the down state. |
| mon_osd_min_down_reporters | Number of peers reporting that an OSD is down before a MON considers it to be down. |
| mon_osd_min_down_reports | Number of times an OSD is reported to be down before a MON considers it to be down. |
| mon_osd_down_out_subtree_limit | Prevents a CRUSH unit type (such as a host) from being automatically marked as out when it fails. |
| osd_mon_report_interval_min | A newly booted OSD must report to a MON within this number of seconds. |
| osd_mon_report_interval_max | Maximum number of seconds between reports from an OSD to a MON. |
| osd_mon_heartbeat_interval | Ceph Monitor heartbeat interval. |
| mon_osd_report_timeout | The time-out (in seconds) before the MON marks an OSD as down if it does not report. |
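For example, the following commands change and then verify the heartbeat grace period through the centralized configuration database. The 30-second value is illustrative, not a recommendation:
[ceph: root@node /]# ceph config set osd osd_heartbeat_grace 30
[ceph: root@node /]# ceph config get osd osd_heartbeat_grace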
Red Hat Ceph Storage provides configuration parameters to help prevent data loss due to a lack of storage space in the cluster. You can set these parameters to provide an alert when OSDs are low on storage space.
When the value of the mon_osd_full_ratio setting is reached or exceeded, the cluster stops accepting write requests from clients and enters the HEALTH_ERR state.
The default full ratio is 0.95 (95%) of the available storage space in the cluster.
Use the full ratio to reserve enough space so that if OSDs fail, there is enough space left that automatic recovery succeeds without running out of space.
The mon_osd_nearfull_ratio setting is a more conservative limit.
When the value of the mon_osd_nearfull_ratio limit is reached or exceeded, the cluster enters the HEALTH_WARN state.
This is intended to alert you to the need to add OSDs to the cluster or fix issues before you reach the full ratio.
The default near full ratio is 0.85 (85%) of the available storage space in the cluster.
The mon_osd_backfillfull_ratio setting is the threshold at which cluster OSDs are considered too full to begin a backfill operation.
The default backfill full ratio is 0.90 (90%) of the available storage space in the cluster.
Use the ceph osd set-full-ratio, ceph osd set-nearfull-ratio, and ceph osd set-backfillfull-ratio commands to configure these settings.
[ceph: root@node /]# ceph osd set-full-ratio .85
[ceph: root@node /]# ceph osd set-nearfull-ratio .75
[ceph: root@node /]# ceph osd set-backfillfull-ratio .80
The default ratio settings are appropriate for small clusters, such as the one used in this lab environment. Production clusters typically require lower ratios.
Different OSDs might be at full or nearfull depending on exactly what objects are stored in which placement groups.
If you have some OSDs full or nearfull and others with plenty of space remaining, analyze your placement group distribution and CRUSH map weights.
Every placement group (PG) has a status string assigned to it that indicates its health state.
When all placement groups are in the active+clean state, the cluster is healthy.
A PG status of scrubbing or deep-scrubbing can also occur in a healthy cluster and does not indicate a problem.
Placement group scrubbing is a background process that verifies data consistency by comparing an object's size and other metadata with its replicas on other OSDs and reporting inconsistencies. Deep scrubbing is a resource-intensive process that compares the contents of data objects by using a bitwise comparison and recalculates checksums to identify bad sectors on the drive.
Although scrubbing operations are critical to maintain a healthy cluster, they have a performance impact, particularly deep scrubbing.
Schedule scrubbing to avoid peak I/O times.
Temporarily prevent scrub operations with the noscrub and nodeep-scrub cluster flags.
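For example, you might restrict scrubbing to an off-peak window with the osd_scrub_begin_hour and osd_scrub_end_hour settings, or block it entirely during maintenance with the cluster flags. The 22:00 to 07:00 window shown here is illustrative only:
[ceph: root@node /]# ceph config set osd osd_scrub_begin_hour 22
[ceph: root@node /]# ceph config set osd osd_scrub_end_hour 7
[ceph: root@node /]# ceph osd set noscrub
[ceph: root@node /]# ceph osd set nodeep-scrub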
Placement groups can have the following states:
| PG state | Description |
|---|---|
| creating | PG creation is in progress. |
| peering | The OSDs are being brought into agreement about the current state of the objects in the PG. |
| active | Peering is complete. The PG is available for read and write requests. |
| clean | The PG has the correct number of replicas and there are no stray replicas. |
| degraded | The PG has objects with an incorrect number of replicas. |
| recovering | Objects are being migrated or synchronized with replicas. |
| recovery_wait | The PG is waiting for local or remote reservations. |
| undersized | The PG is configured to store more replicas than there are OSDs available to the placement group. |
| inconsistent | Replicas of this PG are not consistent. One or more replicas in the PG are different, indicating some form of corruption of the PG. |
| replay | The PG is waiting for clients to replay operations from a log after an OSD crash. |
| repair | The PG is scheduled for repair. |
| backfill, backfill_wait, backfill_toofull | A backfill operation is waiting, occurring, or unable to complete due to insufficient storage. |
| incomplete | The PG is missing information from its history log about writes that might have occurred. This could indicate that an OSD has failed or is not started. |
| stale | The PG is in an unknown state (OSD report time-out). |
| inactive | The PG has been inactive for too long. |
| unclean | The PG has been unclean for too long. |
| remapped | The acting set has changed, and the PG is temporarily remapped to a different set of OSDs while the primary OSD recovers or backfills. |
| down | The PG is offline. |
| splitting | The PG is being split; the number of PGs is being increased. |
| scrubbing, deep-scrubbing | A PG scrub or deep-scrub operation is in progress. |
When an OSD is added to a placement group, the PG enters the peering state to ensure that all nodes agree about the state of the PG.
If the PG can handle read and write requests after peering completes, then it reports an active state.
If the PG also has the correct number of replicas for all of its objects, then it reports a clean state.
The normal PG operating state after writes are complete is active+clean.
When an object is written to the PG's primary OSD, the PG reports a degraded state until all replica OSDs acknowledge that they have also written the object.
The backfill state means that data is being copied or migrated to rebalance PGs across OSDs.
If a new OSD is added to the PG, it is gradually backfilled with objects to avoid excessive network traffic.
Backfilling occurs in the background to minimize the performance impact on the cluster.
The backfill_wait state indicates that a backfill operation is pending.
The backfill state indicates that a backfill operation is in progress.
The backfill_toofull state indicates that a backfill operation was requested, but could not be completed due to insufficient storage capacity.
A PG marked as inconsistent might have replicas that are different from the others, detected as a different data checksum or metadata size on one or more replicas.
A clock skew in the Ceph cluster and corrupted object content can also cause an inconsistent PG state.
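As a sketch of handling an inconsistent PG, you can locate it with the rados list-inconsistent-pg command and then schedule a repair. The pool name testpool and the PG ID 2.5 are placeholders:
[ceph: root@node /]# rados list-inconsistent-pg testpool
[ceph: root@node /]# ceph pg repair 2.5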
The placement groups transition into degraded or peering states after a failure.
If a placement group remains in one of these states for a long period, then the MON marks the placement group as stuck.
A stuck PG might be in one or more of the following states:
An inactive PG might be having a peering problem.
An unclean PG might be having problems recovering after a failure.
A stale PG has no OSDs reporting, which might indicate that all OSDs are down and out.
An undersized PG does not have enough OSDs to store the configured number of replicas.
The MONs use the mon_pg_stuck_threshold parameter to decide if a PG has been in an error state for too long.
The default value for the threshold is 300 seconds.
Ceph marks a PG as stale when all OSDs that have copies of a specific PG are in down and out states.
To return from a stale state, an OSD must be revived to have a PG copy available and for PG recovery to begin.
If the situation remains unresolved, the PG is inaccessible and I/O requests to the PG hang.
By default, Ceph performs an automatic recovery.
If recovery fails for any PGs, the cluster status continues to display HEALTH_ERR.
Ceph can declare that an OSD or a PG is lost, which might result in data loss.
To determine the affected OSDs, first retrieve an overview of cluster status with the ceph health detail command.
Then, use the ceph pg dump_stuck command to inspect the state of PGs.
If many PGs remain in the peering state, the ceph osd blocked-by command displays the OSD that is preventing OSD peering.
Inspect the PG by using either the ceph pg dump | grep PG_ID or the ceph pg PG_ID query command.
The OSDs hosting the PG are displayed in square brackets ([]).
To mark a PG as lost, use the ceph pg PG_ID mark_unfound_lost revert|delete command.
To mark an OSD as lost, use the ceph osd lost OSD_ID --yes-i-really-mean-it command.
The state of the OSD must be down and out.
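A brief sketch of this inspection workflow follows; the PG ID 2.5 is a placeholder:
[ceph: root@node /]# ceph pg dump_stuck inactive
[ceph: root@node /]# ceph osd blocked-by
[ceph: root@node /]# ceph pg 2.5 query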
Use the ceph orch upgrade command to upgrade your Red Hat Ceph Storage 5 cluster.
First, update cephadm by running the cephadm-ansible preflight playbook with the upgrade_ceph_packages option set to true.
[root@node ~]# ansible-playbook -i /etc/ansible/hosts/ cephadm-preflight.yml \
--extra-vars "ceph_origin=rhcs upgrade_ceph_packages=true"
Then run the ceph orch upgrade start --ceph-version VERSION command using the name of the new version.
[ceph: root@node /]# ceph orch upgrade start --ceph-version 16.2.0-117.el8cp
Run the ceph status command to view the progress of the upgrade.
[ceph: root@node /]# ceph status
...output omitted...
progress:
Upgrade to 16.2.0-115.el8cp (1s)
[............................]
Do not mix clients and cluster nodes that use different versions of Red Hat Ceph Storage in the same cluster.
Clients include RADOS gateways, iSCSI gateways, and other applications that use librados, librbd, or libceph.
Use the ceph versions command after a cluster upgrade to verify that matching versions are installed.
[ceph: root@node /]# ceph versions
{
"mon": {
"ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)": 4
},
"mgr": {
"ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)": 4
},
"osd": {
"ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)": 9
},
"mds": {
"ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)": 3
},
"rgw": {
"ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)": 2
},
"overall": {
"ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)": 22
}
}
Red Hat Ceph Storage provides an MGR module called balancer that automatically optimizes the placement of PGs across OSDs to achieve a balanced distribution.
This module can also be run manually.
The balancer module does not run if the cluster is not in the HEALTH_OK state.
When the cluster is healthy, it throttles its changes so that it keeps the number of PGs that need to be moved under a 5% threshold. Configure the target_max_misplaced_ratio MGR setting to adjust this threshold:
[ceph: root@node /]# ceph config set mgr.* target_max_misplaced_ratio .10
The balancer module is enabled by default.
Use the ceph balancer on and ceph balancer off commands to enable or disable the balancer.
Use the ceph balancer status command to display the balancer status.
[ceph: root@node /]# ceph balancer status
Automated balancing uses one of the following modes:
crush-compat
This mode uses the compat weight-set feature to calculate and manage an alternative set of weights for devices in the CRUSH hierarchy. The balancer optimizes these weight-set values, adjusting them up or down in small increments to achieve a distribution that matches the target distribution as closely as possible.
This mode is fully backwards compatible with older clients.
upmap
The PG upmap mode enables storing explicit PG mappings for individual OSDs in the OSD map as exceptions to the normal CRUSH placement calculation. The upmap mode analyzes PG placement, and then runs the required pg-upmap-items commands to optimize PG placement and achieve a balanced distribution.
Because these upmap entries provide fine-grained control over the PG mapping, the upmap mode is usually able to distribute PGs evenly among OSDs, or +/-1 PG if there is an odd number of PGs.
Setting the mode to upmap requires that all clients be Luminous or newer. Use the ceph osd set-require-min-compat-client luminous command to set the required minimum client version.
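For example, before switching to the upmap mode:
[ceph: root@node /]# ceph osd set-require-min-compat-client luminous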
Use the ceph balancer mode upmap command to set the balancer mode to upmap.
[ceph: root@node /]# ceph balancer mode upmap
Use the ceph balancer mode crush-compat command to set the balancer mode to crush-compat.
[ceph: root@node /]# ceph balancer mode crush-compat
You can run the balancer manually to control when balancing occurs and to evaluate the balancer plan before executing it.
To run the balancer manually, use the following commands to disable automatic balancing, and then generate and execute a plan.
Evaluate and score the current distribution for the cluster.
[ceph: root@node /]# ceph balancer eval
Evaluate and score the current distribution for a specific pool.
[ceph: root@node /]# ceph balancer eval POOL_NAME
Generate a PG optimization plan and give it a name.
[ceph: root@node /]# ceph balancer optimize PLAN_NAME
Display the contents of the plan.
[ceph: root@node /]# ceph balancer show PLAN_NAME
Analyze the predicted results of executing the plan.
[ceph: root@node /]# ceph balancer eval PLAN_NAME
If you approve of the predicted results, then execute the plan.
[ceph: root@node /]# ceph balancer execute PLAN_NAME
Only execute the plan if you expect it to improve the distribution. The plan is discarded after execution.
Use the ceph balancer ls command to show currently recorded plans.
[ceph: root@node /]# ceph balancer ls
Use the ceph balancer rm command to remove a plan.
[ceph: root@node /]# ceph balancer rm PLAN_NAME
For more information, refer to the Red Hat Ceph Storage 5 Administration Guide at https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-single/administration_guide/index