Abstract
| Goal | Manage an operational Ceph cluster using tools to check status, monitor services, and properly start and stop all or part of the cluster. Perform cluster maintenance by replacing or repairing cluster components, including MONs, OSDs, and PGs. |
| Objectives | |
| Sections | |
| Lab | |
Managing a Red Hat Ceph Storage Cluster |
After completing this section, you should be able to administer and monitor a Red Hat Ceph Storage cluster, including starting and stopping specific services or the full cluster, and querying cluster health and utilization.
The role of the Red Hat Ceph Storage Manager (MGR) is to collect cluster statistics.
Client I/O operations continue normally while MGR nodes are down, but queries for cluster statistics fail. Deploy at least two MGRs for each cluster to provide high availability. MGRs are typically run on the same hosts as MON nodes, but it is not required.
The first MGR daemon that is started in a cluster becomes the active MGR and all other MGRs are on standby.
If the active MGR does not send a beacon within the configured time interval, a standby MGR takes over.
You can configure the mon_mgr_beacon_grace setting to change the beacon time interval if needed.
The default value is 30 seconds.
Use the ceph mgr fail MGR_NAME command to manually fail over from the active MGR to a standby MGR.
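For example, the following commands adjust the beacon grace period and then trigger a manual failover. The 45-second value and the mgr1 daemon name are illustrative only; substitute values that suit your cluster.
[ceph: root@node /]# ceph config set mon mon_mgr_beacon_grace 45
[ceph: root@node /]# ceph mgr fail mgr1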
Use the ceph mgr stat command to view the status of the MGRs.
[ceph: root@node /]# ceph mgr stat
{
"epoch": 32,
"available": true,
"active_name": "mgr1",
"num_standby": 3
}
The Ceph MGR has a modular architecture. You can enable or disable modules as needed. The MGR collects cluster statistical data and can send the data to external monitoring and management systems.
View the modules that are available and enabled by using the ceph mgr module ls command.
View published addresses for specific modules, such as the Dashboard module URL, by using the ceph mgr services command.
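For example, the following commands list the modules, enable the prometheus module, and then display the addresses that enabled modules publish. The prometheus module is used here only as an illustration; enable whichever modules your cluster needs.
[ceph: root@node /]# ceph mgr module ls
[ceph: root@node /]# ceph mgr module enable prometheus
[ceph: root@node /]# ceph mgr services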
The Ceph Dashboard provides cluster management and monitoring through a browser-based user interface. The Dashboard enables viewing cluster statistics and alerts, and performing selected cluster management tasks. The Ceph Dashboard requires an active MGR daemon with the Dashboard MGR module enabled.
The Dashboard relies on the Prometheus and Grafana services to display collected monitoring data and to generate alerts. Prometheus is an open source monitoring and alerting tool. Grafana is an open source statistical graphing tool.
The Dashboard supports alerts based on Ceph metrics and configured thresholds. The Prometheus AlertManager component configures, gathers, and triggers the alerts. Alerts are displayed in the Dashboard as notifications. You can view details of recent alerts and mute alerts.
You can use the ceph health command to quickly verify the state of the cluster.
This command returns one of the following states:
HEALTH_OK indicates that the cluster is operating normally.
HEALTH_WARN indicates that the cluster is in a warning condition.
For example, an OSD is down, but there are enough OSDs working properly for the cluster to function.
HEALTH_ERR indicates that the cluster is in an error condition.
For example, a full OSD could have an impact on the functionality of the cluster.
If the Ceph cluster is in a warning or an error state, the ceph health detail command provides additional details.
[ceph: root@node /]# ceph health detail
The ceph -w command displays additional real-time monitoring information about the events happening in the Ceph cluster.
[ceph: root@node /]# ceph -w
This command provides the status of cluster activities, such as the following details:
Data rebalancing across the cluster
Replica recovery across the cluster
Scrubbing activity
OSDs starting and stopping
To monitor the cephadm log, use the ceph -W cephadm command.
Use the ceph log last cephadm command to view the most recent log entries.
[ceph: root@node /]# ceph -W cephadm
Containerized services are controlled by systemd on the container host system.
Run systemctl commands on the container host system to start, stop, or restart cluster daemons.
Cluster daemons are identified by their daemon type ($daemon) and daemon ID ($id).
The $daemon type is mon, mgr, mds, osd, rgw, rbd-mirror, crash, or cephfs-mirror.
The daemon $id for MON, MGR, and RGW is the host name.
The daemon $id for OSD is the OSD ID.
The daemon $id for MDS is the file system name followed by the host name.
Use the ceph orch ps command to list all cluster daemons.
Use the --daemon_type=DAEMON option to filter for a specific daemon type.
[ceph: root@node /]# ceph orch ps --daemon_type=osd
NAME HOST STATUS REFRESHED AGE PORTS VERSION IMAGE ID CONTAINER ID
osd.0 node1 running (14h) 2m ago 2d - 16.2.0-117.el8cp 2142b60d7974 7b1e76ef06d1
osd.1 node1 running (14h) 2m ago 2d - 16.2.0-117.el8cp 2142b60d7974 85fb30af4ec2
osd.2 node1 running (13h) 2m ago 2d - 16.2.0-117.el8cp 2142b60d7974 bb66b3b6107c
osd.3 node2 running (14h) 2m ago 2d - 16.2.0-117.el8cp 2142b60d7974 1f63f7fb88f4
osd.4 node2 running (14h) 2m ago 2d - 16.2.0-117.el8cp 2142b60d7974 3f1c54eee927
osd.5 node2 running (14h) 2m ago 2d - 16.2.0-117.el8cp 2142b60d7974 366d5208c73f
osd.6 node3 running (14h) 2m ago 2d - 16.2.0-117.el8cp 2142b60d7974 5e5f9cde6c55
osd.7 node3 running (14h) 2m ago 2d - 16.2.0-117.el8cp 2142b60d7974 0824d7e78aa0
osd.8 node3 running (14h) 2m ago 2d - 16.2.0-117.el8cp 2142b60d7974 f85c8af8996d
To stop, start, or restart a daemon on a host, use systemctl commands and the daemon name.
To list the names of all daemons on a cluster host, run the systemctl list-units command and search for ceph.
The cluster fsid is in the daemon name.
Some service names end in a random six-character string to distinguish individual services of the same type on the same host.
[root@node ~]# systemctl list-units 'ceph*'
UNIT LOAD ACTIVE SUB DESCRIPTION
ceph-ff97a876-1fd2-11ec-8258-52540000fa0c@crash.clienta.service loaded active running Ceph crash.clienta...
ceph-ff97a876-1fd2-11ec-8258-52540000fa0c@mgr.clienta.soxncl.service loaded active running Ceph mgr.clienta.soxncl...
ceph-ff97a876-1fd2-11ec-8258-52540000fa0c@mon.clienta.service loaded active running Ceph mon.clienta...
ceph-ff97a876-1fd2-11ec-8258-52540000fa0c@node-exporter.clienta.service loaded active running Ceph node-exporter...
ceph-ff97a876-1fd2-11ec-8258-52540000fa0c.target loaded active active Ceph cluster...
ceph.target loaded active active All Ceph...
...output omitted...
Use the ceph.target systemd target to manage all the daemons on a cluster node.
[root@node ~]# systemctl restart ceph.target
You can also use the ceph orch command to manage cluster services.
First, obtain the service name by using the ceph orch ls command.
For example, find the service name for cluster OSDs and restart the service.
[ceph: root@node /]# ceph orch ls
NAME                     RUNNING  REFRESHED  AGE   PLACEMENT
alertmanager             1/1      2s ago     2d    count:1
crash                    4/4      3s ago     2d    *
grafana                  1/1      2s ago     2d    count:1
mds.fs1                  3/3      3s ago     114m  node1;node2;node3;count:3
mgr                      4/4      3s ago     2d    node1;node2;node3;node4
mon                      4/4      3s ago     2d    node1;node2;node3;node4
node-exporter            4/4      3s ago     2d    *
osd.default_drive_group  8/12     3s ago     2d    server*
prometheus               1/1      2s ago     2d    count:1
rgw.realm.zone           2/2      3s ago     2d    node3;node4
[ceph: root@node /]# ceph orch restart osd.default_drive_group
Scheduled to restart osd.0 on host 'node1'
Scheduled to restart osd.1 on host 'node1'
Scheduled to restart osd.2 on host 'node1'
Scheduled to restart osd.3 on host 'node2'
Scheduled to restart osd.5 on host 'node2'
Scheduled to restart osd.7 on host 'node2'
Scheduled to restart osd.4 on host 'node3'
Scheduled to restart osd.6 on host 'node3'
Scheduled to restart osd.8 on host 'node3'
You can manage an individual cluster daemon by using the ceph orch daemon command.
[ceph: root@node /]# ceph orch daemon restart osd.1
Ceph supports cluster flags to control the behavior of the cluster. You must set some flags when restarting the cluster or performing cluster maintenance. You can use cluster flags to limit the impact of a failed cluster component or to prevent cluster performance issues.
Use the ceph osd set and ceph osd unset commands to manage these flags, as shown in the example after the following list:
noup
Do not automatically mark a starting OSD as up.
If the cluster network is experiencing latency issues, OSDs can mark each other down on the MON, then mark themselves up.
This scenario is called flapping. Set the noup and nodown flags to prevent flapping.
nodown
The nodown flag tells the Ceph MON not to mark a stopping OSD with the down state.
Use the nodown flag when performing maintenance or a cluster shutdown. Set the nodown flag to prevent flapping.
noout
The noout flag tells the Ceph MON not to remove any OSDs from the CRUSH map, which prevents CRUSH from automatically rebalancing the cluster when OSDs are stopped.
Use the noout flag when performing maintenance on a subset of the cluster.
Clear the flag after the OSDs are restarted.
noin
The noin flag prevents booting OSDs from being marked with the in state.
The flag prevents data from being automatically allocated to that specific OSD.
norecover
The norecover flag prevents recovery operations from running.
Use the norecover flag when performing maintenance or a cluster shutdown.
nobackfill
The nobackfill flag prevents backfill operations from running.
Use the nobackfill flag when performing maintenance or a cluster shutdown.
Backfilling is discussed later in this section.
norebalance
The norebalance flag prevents rebalancing operations from running.
Use the norebalance flag when performing maintenance or a cluster shutdown.
noscrub
The noscrub flag prevents scrubbing operations from running.
Scrubbing will be discussed later in this section.
nodeep-scrub
The nodeep-scrub flag prevents any deep-scrubbing operation from running.
Deep-scrubbing is discussed later in this section.
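As a minimal example of how these flags are managed, the following commands set and then clear the noout flag:
[ceph: root@node /]# ceph osd set noout
[ceph: root@node /]# ceph osd unset noout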
Perform the following steps to shut down the entire cluster:
Prevent clients from accessing the cluster.
Ensure that the cluster is in a healthy state (HEALTH_OK) and that all PGs are in an active+clean state before proceeding.
Bring down CephFS.
Set the noout, norecover, norebalance, nobackfill, nodown, and pause flags (see the example after this procedure).
Shut down all Ceph Object Gateways (RGW) and iSCSI Gateways.
Shut down OSD nodes one by one.
Shut down MON and MGR nodes one by one.
Shut down the admin node.
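The flag-setting step of the shutdown procedure looks like the following sketch when run from a node with an admin keyring:
[ceph: root@node /]# ceph osd set noout
[ceph: root@node /]# ceph osd set norecover
[ceph: root@node /]# ceph osd set norebalance
[ceph: root@node /]# ceph osd set nobackfill
[ceph: root@node /]# ceph osd set nodown
[ceph: root@node /]# ceph osd set pause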
Perform the following steps to power on the cluster:
Power up cluster nodes in the following order: admin node, MON and MGR nodes, OSD nodes, MDS nodes.
Clear the noout, norecover, norebalance, nobackfill, nodown, and pause flags (see the example after this procedure).
Bring up Ceph Object Gateways and iSCSI Gateways.
Bring up CephFS.
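The flag-clearing step can be performed with individual ceph osd unset commands, or with a short shell loop such as the following sketch:
[ceph: root@node /]# for flag in noout norecover norebalance nobackfill nodown pause; do ceph osd unset $flag; done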
View the MON quorum status with the ceph mon stat or the ceph quorum_status -f json-pretty commands.
[ceph: root@node /]# ceph mon stat
[ceph: root@node /]# ceph quorum_status -f json-pretty
You can also view the status of MONs in the Dashboard.
To view daemon logs, use the journalctl -u command with the daemon's systemd unit name.
To show only the most recent journal entries and follow new messages, use the -f option.
For example, the following command views the logs for that host's OSD 10 daemon.
[root@node ~]# journalctl -u \
ceph-ff97a876-1fd2-11ec-8258-52540000fa0c@osd.10.service
Ceph containers write to individual log files for each daemon.
Enable logging for each specific Ceph daemon by configuring the daemon's log_to_file setting to true.
This example enables logging for MON nodes.
[ceph: root@node /]# ceph config set mon log_to_file true
If the cluster is not healthy, Ceph displays a detailed status report containing the following information:
Current status of the OSDs (up/down/out/in)
OSD near capacity limit information (nearfull/full)
Current status of the placement groups (PGs)
The ceph status and ceph health commands report space-related warning or error conditions.
The various ceph osd subcommands report OSD usage details, status, and location information.
The ceph osd df command displays OSD usage statistics.
Use the ceph osd df tree command to display the CRUSH tree in the command output.
[ceph: root@node /]# ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
0 hdd 0.00980 1.00000 10 GiB 1.0 GiB 28 MiB 20 KiB 1024 MiB 9.0 GiB 10.28 1.00 41 up
1 hdd 0.00980 1.00000 10 GiB 1.0 GiB 29 MiB 40 KiB 1024 MiB 9.0 GiB 10.29 1.00 58 up
2 hdd 0.00980 1.00000 10 GiB 1.0 GiB 28 MiB 20 KiB 1024 MiB 9.0 GiB 10.28 1.00 30 up
3 hdd 0.00980 1.00000 10 GiB 1.0 GiB 28 MiB 20 KiB 1024 MiB 9.0 GiB 10.28 1.00 43 up
4 hdd 0.00980 1.00000 10 GiB 1.0 GiB 28 MiB 20 KiB 1024 MiB 9.0 GiB 10.28 1.00 46 up
5 hdd 0.00980 1.00000 10 GiB 1.0 GiB 28 MiB 20 KiB 1024 MiB 9.0 GiB 10.28 1.00 40 up
6 hdd 0.00980 1.00000 10 GiB 1.0 GiB 29 MiB 44 KiB 1024 MiB 9.0 GiB 10.28 1.00 44 up
7 hdd 0.00980 1.00000 10 GiB 1.0 GiB 28 MiB 44 KiB 1024 MiB 9.0 GiB 10.28 1.00 38 up
8 hdd 0.00980 1.00000 10 GiB 1.0 GiB 28 MiB 44 KiB 1024 MiB 9.0 GiB 10.28 1.00 47 up
TOTAL 90 GiB 9.2 GiB 255 MiB 274 KiB 9.0 GiB 81 GiB 10.28
The following table describes each column of the command output:
| Output column | Description |
|---|---|
| ID | The OSD ID. |
| CLASS | The type of devices that the OSD uses (HDD, SSD, or NVMe). |
| WEIGHT | The weight of the OSD in the CRUSH map. By default, this is set to the OSD capacity in TB and is changed by using the ceph osd crush reweight command. The weight determines how much data CRUSH places onto the OSD relative to other OSDs. For example, two OSDs with the same weight receive roughly the same number of I/O requests and store approximately the same amount of data. |
| REWEIGHT | Either the default reweight value or the actual value set by the ceph osd reweight command. You can reweight an OSD to temporarily override the CRUSH weight. |
| SIZE | The total OSD storage capacity. |
| RAW USE | The utilized OSD storage capacity. |
| DATA | OSD capacity used by user data. |
| OMAP | The BlueFS storage that is used to store object map (OMAP) data, which are the key-value pairs stored in RocksDB. |
| META | The total BlueFS space allocated, or the value of the bluestore_bluefs_min setting, whichever is larger. This is the internal BlueStore metadata, which is calculated as the total space allocated to BlueFS minus the estimated OMAP data size. |
| AVAIL | Free space available on the OSD. |
| %USE | The percentage of storage capacity used on the OSD. |
| VAR | The variation above or below the average OSD utilization. |
| PGS | The number of placement groups on the OSD. |
| STATUS | The status of the OSD. |
Use the ceph osd perf command to view OSD performance statistics.
[ceph: root@node /]# ceph osd perf
An OSD daemon can be in one of four states, based on the combination of these two flags:
down or up - indicating whether the daemon is running and communicating with the MONs.
out or in - indicating whether the OSD is participating in cluster data placement.
The state of an OSD in normal operation is up and in.
If an OSD fails and the daemon goes offline, the cluster might report it as down and in for a short period of time.
This is intended to give the OSD a chance to recover on its own and rejoin the cluster, avoiding unnecessary recovery traffic.
For example, a brief network interruption might cause the OSD to lose communication with the cluster and be temporarily reported as down.
After a short interval controlled by the mon_osd_down_out_interval configuration option (five minutes by default), the cluster reports the OSD as down and out.
At this point, the placement groups assigned to the failed OSD are migrated to other OSDs.
If the failed OSD then returns to the up and in states, the cluster reassigns placement groups based on the new set of OSDs and by rebalancing the objects in the cluster.
Use the ceph osd set noout and ceph osd unset noout commands to enable or disable the noout flag on the cluster.
In contrast, the ceph osd out OSD_ID command tells the Ceph cluster to ignore an OSD for data placement and marks that OSD with the out state.
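For example, the following commands take OSD 3 out of data placement before maintenance and return it afterward. The OSD ID 3 is a placeholder:
[ceph: root@node /]# ceph osd out 3
[ceph: root@node /]# ceph osd in 3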
OSDs verify each other's status at regular time intervals (six seconds by default).
They report their status to the MONs every 120 seconds, by default.
If an OSD is down, the other OSDs or the MONs do not receive heartbeat responses from that down OSD.
The following configuration settings manage OSD heartbeats (see the example after the table):
| Configuration option | Description |
|---|---|
| osd_heartbeat_interval | Number of seconds between OSD peer checks. |
| osd_heartbeat_grace | Number of seconds before an unresponsive OSD moves to the down state. |
| mon_osd_min_down_reporters | Number of peers reporting that an OSD is down before a MON considers it to be down. |
| mon_osd_min_down_reports | Number of times an OSD is reported to be down before a MON considers it to be down. |
| mon_osd_down_out_subtree_limit | Prevents a CRUSH unit type (such as a host) from being automatically marked as out when it fails. |
| osd_mon_report_interval_min | A newly booted OSD must report to a MON within this number of seconds. |
| osd_mon_report_interval_max | Maximum number of seconds between reports from an OSD to a MON. |
| osd_mon_heartbeat_interval | Ceph Monitor heartbeat interval. |
| mon_osd_report_timeout | The time-out (in seconds) before the MON marks an OSD as down if it does not report. |
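For example, the following commands change and then verify the heartbeat grace period through the centralized configuration database. The 30-second value is illustrative, not a recommendation:
[ceph: root@node /]# ceph config set osd osd_heartbeat_grace 30
[ceph: root@node /]# ceph config get osd osd_heartbeat_grace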
Red Hat Ceph Storage provides configuration parameters to help prevent data loss due to a lack of storage space in the cluster. You can set these parameters to provide an alert when OSDs are low on storage space.
When the value of the mon_osd_full_ratio setting is reached or exceeded, the cluster stops accepting write requests from clients and enters the HEALTH_ERR state.
The default full ratio is 0.95 (95%) of the available storage space in the cluster.
Use the full ratio to reserve enough space so that if OSDs fail, there is enough space left that automatic recovery succeeds without running out of space.
The mon_osd_nearfull_ratio setting is a more conservative limit.
When the value of the mon_osd_nearfull_ratio limit is reached or exceeded, the cluster enters the HEALTH_WARN state.
This is intended to alert you to the need to add OSDs to the cluster or fix issues before you reach the full ratio.
The default near full ratio is 0.85 (85%) of the available storage space in the cluster.
The mon_osd_backfillfull_ratio setting is the threshold at which cluster OSDs are considered too full to begin a backfill operation.
The default backfill full ratio is 0.90 (90%) of the available storage space in the cluster.
Use the ceph osd set-full-ratio, ceph osd set-nearfull-ratio, and ceph osd set-backfillfull-ratio commands to configure these settings.
[ceph: root@node /]# ceph osd set-full-ratio .85
[ceph: root@node /]# ceph osd set-nearfull-ratio .75
[ceph: root@node /]# ceph osd set-backfillfull-ratio .80
The default ratio settings are appropriate for small clusters, such as the one used in this lab environment. Production clusters typically require lower ratios.
Different OSDs might be at full or nearfull depending on exactly what objects are stored in which placement groups.
If you have some OSDs full or nearfull and others with plenty of space remaining, analyze your placement group distribution and CRUSH map weights.
Every placement group (PG) has a status string assigned to it that indicates its health state.
When all placement groups are in the active+clean state, the cluster is healthy.
A PG status of scrubbing or deep-scrubbing can also occur in a healthy cluster and does not indicate a problem.
Placement group scrubbing is a background process that verifies data consistency by comparing an object's size and other metadata with its replicas on other OSDs and reporting inconsistencies. Deep scrubbing is a resource-intensive process that compares the contents of data objects by using a bitwise comparison and recalculates checksums to identify bad sectors on the drive.
Although scrubbing operations are critical to maintain a healthy cluster, they have a performance impact, particularly deep scrubbing.
Schedule scrubbing to avoid peak I/O times.
Temporarily prevent scrub operations with the noscrub and nodeep-scrub cluster flags.
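For example, you might restrict scrubbing to an off-peak window with the osd_scrub_begin_hour and osd_scrub_end_hour settings, or block it entirely during maintenance with the cluster flags. The 22:00 to 07:00 window shown here is illustrative only:
[ceph: root@node /]# ceph config set osd osd_scrub_begin_hour 22
[ceph: root@node /]# ceph config set osd osd_scrub_end_hour 7
[ceph: root@node /]# ceph osd set noscrub
[ceph: root@node /]# ceph osd set nodeep-scrub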
Placement groups can have the following states:
| PG state | Description |
|---|---|
| creating | PG creation is in progress. |
| peering | The OSDs are being brought into agreement about the current state of the objects in the PG. |
| active | Peering is complete. The PG is available for read and write requests. |
| clean | The PG has the correct number of replicas and there are no stray replicas. |
| degraded | The PG has objects with an incorrect number of replicas. |
| recovering | Objects are being migrated or synchronized with replicas. |
| recovery_wait | The PG is waiting for local or remote reservations. |
| undersized | The PG is configured to store more replicas than there are OSDs available to the placement group. |
| inconsistent | Replicas of this PG are not consistent. One or more replicas in the PG are different, indicating some form of corruption of the PG. |
| replay | The PG is waiting for clients to replay operations from a log after an OSD crash. |
| repair | The PG is scheduled for repair. |
| backfill, backfill_wait, backfill_toofull | A backfill operation is waiting, occurring, or unable to complete due to insufficient storage. |
| incomplete | The PG is missing information from its history log about writes that might have occurred. This could indicate that an OSD has failed or is not started. |
| stale | The PG is in an unknown state (OSD report time-out). |
| inactive | The PG has been inactive for too long. |
| unclean | The PG has been unclean for too long. |
| remapped | The acting set has changed, and the PG is temporarily remapped to a different set of OSDs while the primary OSD recovers or backfills. |
| down | The PG is offline. |
| splitting | The PG is being split; the number of PGs is being increased. |
| scrubbing, deep-scrubbing | A PG scrub or deep-scrub operation is in progress. |
When an OSD is added to a placement group, the PG enters the peering state to ensure that all nodes agree about the state of the PG.
If the PG can handle read and write requests after peering completes, then it reports an active state.
If the PG also has the correct number of replicas for all of its objects, then it reports a clean state.
The normal PG operating state after writes are complete is active+clean.
When an object is written to the PG's primary OSD, the PG reports a degraded state until all replica OSDs acknowledge that they have also written the object.
The backfill state means that data is being copied or migrated to rebalance PGs across OSDs.
If a new OSD is added to the PG, it is gradually backfilled with objects to avoid excessive network traffic.
Backfilling occurs in the background to minimize the performance impact on the cluster.
The backfill_wait state indicates that a backfill operation is pending.
The backfill state indicates that a backfill operation is in progress.
The backfill_toofull state indicates that a backfill operation was requested, but could not be completed due to insufficient storage capacity.
A PG marked as inconsistent might have replicas that are different from the others, detected as a different data checksum or metadata size on one or more replicas.
A clock skew in the Ceph cluster and corrupted object content can also cause an inconsistent PG state.
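As a sketch of handling an inconsistent PG, you can locate it with the rados list-inconsistent-pg command and then schedule a repair. The pool name testpool and the PG ID 2.5 are placeholders:
[ceph: root@node /]# rados list-inconsistent-pg testpool
[ceph: root@node /]# ceph pg repair 2.5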
The placement groups transition into degraded or peering states after a failure.
If a placement group remains in one of these states for a long period, then the MON marks the placement group as stuck.
A stuck PG might be in one or more of the following states:
An inactive PG might be having a peering problem.
An unclean PG might be having problems recovering after a failure.
A stale PG has no OSDs reporting, which might indicate that all OSDs are down and out.
An undersized PG does not have enough OSDs to store the configured number of replicas.
The MONs use the mon_pg_stuck_threshold parameter to decide if a PG has been in an error state for too long.
The default value for the threshold is 300 seconds.
Ceph marks a PG as stale when all OSDs that have copies of a specific PG are in down and out states.
To return from a stale state, an OSD must be revived to have a PG copy available and for PG recovery to begin.
If the situation remains unresolved, the PG is inaccessible and I/O requests to the PG hang.
By default, Ceph performs an automatic recovery.
If recovery fails for any PGs, the cluster status continues to display HEALTH_ERR.
Ceph can declare that an OSD or a PG is lost, which might result in data loss.
To determine the affected OSDs, first retrieve an overview of cluster status with the ceph health detail command.
Then, use the ceph pg dump_stuck command to inspect the state of PGs.
If many PGs remain in the peering state, the ceph osd blocked-by command displays the OSD that is preventing OSD peering.
Inspect the PG by using either the ceph pg dump | grep PG_ID or the ceph pg PG_ID query command.
The OSDs hosting the PG are displayed in square brackets ([]).
To mark a PG as lost, use the ceph pg PG_ID mark_unfound_lost revert|delete command.
To mark an OSD as lost, use the ceph osd lost OSD_ID --yes-i-really-mean-it command.
The state of the OSD must be down and out.
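A brief sketch of this inspection workflow follows; the PG ID 2.5 is a placeholder:
[ceph: root@node /]# ceph pg dump_stuck inactive
[ceph: root@node /]# ceph osd blocked-by
[ceph: root@node /]# ceph pg 2.5 query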
Use the ceph orch upgrade command to upgrade your Red Hat Ceph Storage 5 cluster.
First, update cephadm by running the cephadm-ansible preflight playbook with the upgrade_ceph_packages option set to true.
[root@node ~]# ansible-playbook -i /etc/ansible/hosts/ cephadm-preflight.yml \
--extra-vars "ceph_origin=rhcs upgrade_ceph_packages=true"
Then run the ceph orch upgrade start --ceph-version VERSION command using the name of the new version.
[ceph: root@node /]# ceph orch upgrade start --ceph-version 16.2.0-117.el8cp
Run the ceph status command to view the progress of the upgrade.
[ceph: root@node /]# ceph status
...output omitted...
progress:
Upgrade to 16.2.0-115.el8cp (1s)
[............................]
Do not mix clients and cluster nodes that use different versions of Red Hat Ceph Storage in the same cluster.
Clients include RADOS gateways, iSCSI gateways, and other applications that use librados, librbd, or libceph.
Use the ceph versions command after a cluster upgrade to verify that matching versions are installed.
[ceph: root@node /]# ceph versions
{
"mon": {
"ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)": 4
},
"mgr": {
"ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)": 4
},
"osd": {
"ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)": 9
},
"mds": {
"ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)": 3
},
"rgw": {
"ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)": 2
},
"overall": {
"ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)": 22
}
}
Red Hat Ceph Storage provides an MGR module called balancer that automatically optimizes the placement of PGs across OSDs to achieve a balanced distribution.
This module can also be run manually.
The balancer module does not run if the cluster is not in the HEALTH_OK state.
When the cluster is healthy, it throttles its changes so that it keeps the number of PGs that need to be moved under a 5% threshold. Configure the target_max_misplaced_ratio MGR setting to adjust this threshold:
[ceph: root@node /]# ceph config set mgr.* target_max_misplaced_ratio .10
The balancer module is enabled by default.
Use the ceph balancer on and ceph balancer off commands to enable or disable the balancer.
Use the ceph balancer status command to display the balancer status.
[ceph: root@node /]# ceph balancer status
Automated balancing uses one of the following modes:
crush-compat
This mode uses the compat weight-set feature to calculate and manage an alternative set of weights for devices in the CRUSH hierarchy. The balancer optimizes these weight-set values, adjusting them up or down in small increments to achieve a distribution that matches the target distribution as closely as possible.
This mode is fully backwards compatible with older clients.
upmap
The PG upmap mode enables storing explicit PG mappings for individual OSDs in the OSD map as exceptions to the normal CRUSH placement calculation. The upmap mode analyzes PG placement, and then runs the required pg-upmap-items commands to optimize PG placement and achieve a balanced distribution.
Because these upmap entries provide fine-grained control over the PG mapping, the upmap mode is usually able to distribute PGs evenly among OSDs, or +/-1 PG if there is an odd number of PGs.
Setting the mode to upmap requires that all clients be Luminous or newer. Use the ceph osd set-require-min-compat-client luminous command to set the required minimum client version.
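For example, before switching to the upmap mode:
[ceph: root@node /]# ceph osd set-require-min-compat-client luminous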
Use the ceph balancer mode upmap command to set the balancer mode to upmap.
[ceph: root@node /]# ceph balancer mode upmap
Use the ceph balancer mode crush-compat command to set the balancer mode to crush-compat.
[ceph: root@node /]# ceph balancer mode crush-compat
You can run the balancer manually to control when balancing occurs and to evaluate the balancer plan before executing it.
To run the balancer manually, use the following commands to disable automatic balancing, and then generate and execute a plan.
Evaluate and score the current distribution for the cluster.
[ceph: root@node /]# ceph balancer eval
Evaluate and score the current distribution for a specific pool.
[ceph: root@node /]# ceph balancer eval POOL_NAME
Generate a PG optimization plan and give it a name.
[ceph: root@node /]# ceph balancer optimize PLAN_NAME
Display the contents of the plan.
[ceph: root@node /]# ceph balancer show PLAN_NAME
Analyze the predicted results of executing the plan.
[ceph: root@node /]# ceph balancer eval PLAN_NAME
If you approve of the predicted results, then execute the plan.
[ceph: root@node /]# ceph balancer execute PLAN_NAME
Only execute the plan if you expect it to improve the distribution. The plan is discarded after execution.
Use the ceph balancer ls command to show currently recorded plans.
[ceph: root@node /]# ceph balancer ls
Use the ceph balancer rm command to remove a plan.
[ceph: root@node /]# ceph balancer rm PLAN_NAME
For more information, refer to the Red Hat Ceph Storage 5 Administration Guide at https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-single/administration_guide/index