
Managing the OSD Map

Objectives

After completing this section, you should be able to describe the purpose of the OSD map and how to view and modify it.

Describing the OSD Map

The cluster OSD map contains the address and status of each OSD, the list of pools and their details, and other information such as the OSD near-capacity limits. Ceph uses these capacity limits to send warnings and to stop accepting write requests when an OSD reaches full capacity.
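
The near-capacity thresholds appear in the OSD map as nearfull_ratio, backfillfull_ratio, and full_ratio, as shown in the dump later in this section. The following is a minimal sketch of viewing and adjusting them; the 0.80 and 0.85 values are only illustrative, and you should choose thresholds appropriate for your cluster:

[ceph: root@serverc /]# ceph osd dump | grep ratio           # display the current thresholds
[ceph: root@serverc /]# ceph osd set-nearfull-ratio 0.80     # warn earlier (illustrative value)
[ceph: root@serverc /]# ceph osd set-backfillfull-ratio 0.85 # illustrative value
[ceph: root@serverc /]# ceph osd set-full-ratio 0.95         # stop accepting writes at this usage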

When a change occurs in the cluster's infrastructure, such as OSDs joining or leaving the cluster, the MONs update the corresponding map accordingly. The MONs maintain a history of map revisions. Ceph identifies each version of each map with an ordered, incrementing integer called an epoch.
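
Because the MONs retain a history of map revisions, you can dump a previous OSD map revision by passing its epoch number to ceph osd dump. This is a short sketch; epoch 477 is an illustrative value and must fall within the history that the MONs still retain:

[ceph: root@serverc /]# ceph osd dump 477    # dump a specific, earlier epoch of the OSD map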

The ceph status -f json-pretty command displays the epoch of each map. Use the dump subcommand of each map type to display an individual map, for example ceph osd dump.

[ceph: root@serverc /]# ceph status -f json-pretty
...output omitted...
    "osdmap": {
        "epoch": 478,
        "num_osds": 15,
        "num_up_osds": 15,
        "osd_up_since": 1632743988,
        "num_in_osds": 15,
        "osd_in_since": 1631712883,
        "num_remapped_pgs": 0
    },
...output omitted...
[ceph: root@serverc /]# ceph osd dump
epoch 478
fsid 11839bde-156b-11ec-bb71-52540000fa0c
created 2021-09-14T14:50:39.401260+0000
modified 2021-09-27T12:04:26.832212+0000
flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
crush_version 69
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client luminous
min_compat_client luminous
require_osd_release pacific
stretch_mode_enabled false
pool 1 'device_health_metrics' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 475 flags hashpspool stripe_width 0 pg_num_min 1 application mgr_devicehealth
...output omitted...
osd.0 up   in  weight 1 up_from 471 up_thru 471 down_at 470 last_clean_interval [457,466) [v2:172.25.250.12:6801/1228351148,v1:172.25.250.12:6802/1228351148] [v2:172.25.249.12:6803/1228351148,v1:172.25.249.12:6804/1228351148] exists,up cfe311b0-dea9-4c0c-a1ea-42aaac4cb160
...output omitted...

Analyzing OSD Map Updates

Ceph updates the OSD map every time an OSD joins or leaves the cluster. An OSD can leave the cluster because its daemon fails or because of an underlying hardware failure.
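
You can watch the epoch increment when cluster membership changes. This sketch assumes a lab cluster and that osd.5 exists; marking an OSD out triggers data movement, so avoid running it on a production cluster:

[ceph: root@serverc /]# ceph osd dump | head -1    # note the current epoch
[ceph: root@serverc /]# ceph osd out 5             # osd.5 leaves the set of in OSDs
[ceph: root@serverc /]# ceph osd dump | head -1    # the epoch has increased
[ceph: root@serverc /]# ceph osd in 5              # bring osd.5 back in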

Even though the cluster map as a whole is maintained by the MONs, OSDs do not use a leader to manage the OSD map; they propagate the map among themselves. OSDs tag every message they exchange with the OSD map epoch. When an OSD detects that it is lagging behind, it performs a map update with its peer OSD.

In large clusters, where OSD map updates are frequent, it is not practical to always distribute the full map. Instead, receiving OSD nodes perform incremental map updates.

Ceph also tags the messages between OSDs and clients with the epoch. Whenever a client connects to an OSD, the OSD inspects the epoch. If the epochs do not match, then the OSD responds with the correct increment so that the client can update its OSD map. This removes the need for aggressive propagation, because clients learn about map updates the next time they contact the cluster.
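
You can inspect the range of map epochs that a given OSD daemon currently holds through its admin socket. This is a sketch that assumes you run the command on the node hosting osd.0; the daemon status output includes oldest_map and newest_map fields:

[ceph: root@serverc /]# ceph daemon osd.0 status    # reports oldest_map and newest_map epochs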

Updating Cluster Maps with Paxos

To access a Ceph cluster, a client first retrieves a copy of the cluster map from the MONs. All MONs must have the same cluster map for the cluster to function correctly.

MONs use the Paxos algorithm as a mechanism to ensure that they agree on the cluster state. Paxos is a distributed consensus algorithm. Every time a MON modifies a map, it sends the update to the other monitors through Paxos. Ceph only commits the new version of the map after a majority of monitors agree on the update.

The MON submits a map update to Paxos and only writes the new version to the local key-value store after Paxos acknowledges the update. The read operations directly access the key-value store.
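
To confirm that the monitors have formed a quorum, and to see which monitor currently acts as the leader, you can query the quorum status:

[ceph: root@serverc /]# ceph mon stat                        # compact summary of monitors and quorum
[ceph: root@serverc /]# ceph quorum_status -f json-pretty    # detailed quorum membership and leader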

Figure 5.2: Cluster map consistency using Paxos

Propagating the OSD Map

OSDs regularly report their status to the monitors. In addition, OSDs exchange heartbeats so that an OSD can detect the failure of a peer and report that event to the monitors.

When a leader monitor learns of an OSD failure, it updates the map, increments the epoch, and uses the Paxos update protocol to notify the other monitors, at the same time revoking their leases. After a majority of monitors acknowledge the update, and the cluster has a quorum, the leader monitor issues a new lease so that the monitors can distribute the updated OSD map. This method ensures that the map epoch never moves backward anywhere in the cluster and that no stale leases remain valid.
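
The reporting and failure-detection behavior is controlled by configuration options such as osd_heartbeat_interval and mon_osd_min_down_reporters. The following is a brief sketch of inspecting them with the centralized configuration database; verify the option names against your release:

[ceph: root@serverc /]# ceph config get osd osd_heartbeat_interval      # seconds between OSD heartbeats
[ceph: root@serverc /]# ceph config get mon mon_osd_min_down_reporters  # reporters required to mark an OSD down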

OSD Map Commands

Use the following commands to manage the OSD map as an administrator:

Command                                           Action
ceph osd dump                                     Dump the OSD map to standard output.
ceph osd getmap -o binfile                        Export a binary copy of the current map.
osdmaptool --print binfile                        Display a human-readable copy of the map to standard output.
osdmaptool --export-crush crushbinfile binfile    Extract the CRUSH map from the OSD map.
osdmaptool --import-crush crushbinfile binfile    Embed a new CRUSH map into the OSD map.
osdmaptool --test-map-pg pgid binfile             Verify the mapping of a given PG.
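
The following sketch chains several of these commands: it exports the current OSD map to a binary file, prints a human-readable copy, and tests the mapping of one PG. The file name /tmp/osdmap.bin and the PG ID 1.0 are illustrative:

[ceph: root@serverc /]# ceph osd getmap -o /tmp/osdmap.bin             # export the current map
[ceph: root@serverc /]# osdmaptool --print /tmp/osdmap.bin             # display it in human-readable form
[ceph: root@serverc /]# osdmaptool --test-map-pg 1.0 /tmp/osdmap.bin   # show which OSDs serve PG 1.0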

 

References

For more information, refer to the Red Hat Ceph Storage 5 Storage Strategies Guide at https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-single/storage_strategies_guide
