Chapter 5. Creating and Customizing Storage Maps

Abstract

Goal

Manage and adjust the CRUSH and OSD maps to optimize data placement to meet the performance and redundancy requirements of cloud applications.

Objectives
  • Administer and update the cluster CRUSH map used by the Ceph cluster.

  • Describe the purpose and modification of the OSD maps.

Sections
  • Managing and Customizing the CRUSH Map (and Guided Exercise)

  • Managing the OSD Map (and Guided Exercise)

Lab

Creating and Customizing Storage Maps

Managing and Customizing the CRUSH Map

Objectives

After completing this section, you should be able to administer and update the cluster CRUSH map used by the Ceph cluster.

CRUSH and Object Placement Strategies

Ceph calculates which OSDs should hold which objects by using a placement algorithm called CRUSH (Controlled Replication Under Scalable Hashing). Objects are assigned to placement groups (PGs) and CRUSH determines which OSDs those placement groups should use to store their objects.

The CRUSH Algorithm

The CRUSH algorithm enables Ceph clients to directly communicate with OSDs; this avoids a centralized service bottleneck. Ceph clients and OSDs use the CRUSH algorithm to efficiently compute information about object locations, instead of having to depend on a central lookup table. Ceph clients retrieve the cluster maps and use the CRUSH map to algorithmically determine how to store and retrieve data. This enables massive scalability for the Ceph cluster by avoiding a single point of failure and a performance bottleneck.

The CRUSH algorithm works to uniformly distribute the data in the object store, manage replication, and respond to system growth and hardware failures. When new OSDs are added or an existing OSD or OSD host fails, Ceph uses CRUSH to rebalance the objects in the cluster among the active OSDs.
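
You can ask the cluster to run this CRUSH calculation for a single object, which is a convenient way to see where that object would be placed. The pool and object names below are only examples, and the output is illustrative:

[ceph: root@node /]# ceph osd map mypool myobject
osdmap e384 pool 'mypool' (5) object 'myobject' -> pg 5.d383fb61 (5.21) -> up ([2,1,0], p2) acting ([2,1,0], p2)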

CRUSH Map Components

Conceptually, a CRUSH map contains two major components:

A CRUSH hierarchy

This lists all available OSDs and organizes them into a treelike structure of buckets.

The CRUSH hierarchy is often used to represent where OSDs are located. By default, there is a root bucket representing the whole hierarchy, which contains a host bucket for each OSD host.

The OSDs are the leaves of the tree, and by default all OSDs on the same OSD host are placed in that host's bucket. You can customize the tree structure to rearrange it, add more levels, and group OSD hosts into buckets representing their location in different server racks or data centers.

At least one CRUSH rule

CRUSH rules determine how placement groups are assigned OSDs from those buckets. This determines where objects for those placement groups are stored. Different pools might use different CRUSH rules from the CRUSH map.

CRUSH Bucket Types

The CRUSH hierarchy organizes OSDs into a tree of different containers, called buckets. For a large installation, you can create a specific hierarchy to describe your storage infrastructure: data centers with rows of racks, racks, hosts, and OSD devices on those hosts. By creating a CRUSH map rule, you can cause Ceph to place an object's replicas on OSDs on separate servers, on servers in different racks, or even on servers in different data centers.

To summarize, buckets are the containers or branches in the CRUSH hierarchy. Devices are OSDs, and are leaves in the CRUSH hierarchy.

Some of the most important bucket attributes are:

  • The ID of the bucket. These IDs are negative numbers to distinguish them from storage device IDs.

  • The name of the bucket.

  • The type of the bucket. The default map defines several types that you can retrieve with the ceph osd crush dump command, as shown in the example after this list.

    Bucket types include root, region, datacenter, room, pod, pdu, row, rack, chassis, and host, but you can also add your own types. The bucket at the root of the hierarchy is of the root type.

  • The algorithm that Ceph uses to select items inside the bucket when mapping PG replicas to OSDs. Several algorithms are available: uniform, list, tree, and straw2. Each algorithm represents a trade-off between performance and reorganization efficiency. The default algorithm is straw2.
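
As mentioned above, the ceph osd crush dump command lists the defined bucket types in the types array of its JSON output. The jq filter used here to display only that array is an assumption and might not be installed in your container image:

[ceph: root@node /]# ceph osd crush dump | jq '.types'
...output omitted...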

Figure 5.1: CRUSH map default hierarchy example

Customizing Failure and Performance Domains

The CRUSH map is the central configuration mechanism for the CRUSH algorithm. You can edit this map to influence data placement and customize the CRUSH algorithm.

Configuring the CRUSH map and creating separate failure domains allows OSDs and cluster nodes to fail without any data loss occurring. The cluster simply operates in a degraded state until the problem is fixed.

Configuring the CRUSH map and creating separate performance domains can reduce performance bottlenecks for clients and applications that use the cluster to store and retrieve data. For example, CRUSH can create one hierarchy for HDDs and another hierarchy for SSDs.

A typical use case for customizing the CRUSH map is to provide additional protection against hardware failures. You can configure the CRUSH map to match the underlying physical infrastructure, which helps mitigate the impact of hardware failures.

By default, the CRUSH algorithm places replicated objects on OSDs on different hosts. You can customize the CRUSH map so that object replicas are placed across OSDs in different shelves, or on hosts in different rooms, or in different racks with distinct power sources.

Another use case is to allocate OSDs with SSD drives to pools used by applications requiring very fast storage, and OSDs with traditional HDDs to pools supporting less demanding workloads.

The CRUSH map can contain multiple hierarchies that you can select through different CRUSH rules. By using separate CRUSH hierarchies, you can establish separate performance domains. Use case examples for configuring separate performance domains are:

  • To separate block storage used by VMs from object storage used by applications.

  • To separate "cold" storage, containing infrequently accessed data, from "hot" storage, containing frequently accessed data.

If you examine an actual CRUSH map definition, it contains:

  • A list of all available physical storage devices.

  • A list of all the infrastructure buckets and the IDs of the storage devices or other buckets in each of them. Remember that a bucket is a container, or a branch, in the infrastructure tree. For example, it might represent a location or a piece of physical hardware.

  • A list of CRUSH rules to map PGs to OSDs.

  • A list of other CRUSH tunables and their settings.

The cluster installation process deploys a default CRUSH map. You can use the ceph osd crush dump command to print the CRUSH map in JSON format. You can also export a binary copy of the map and decompile it into a text file:

[ceph: root@node /]# ceph osd getcrushmap -o ./map.bin
[ceph: root@node /]# crushtool -d ./map.bin -o ./map.txt

Customizing OSD CRUSH Settings

The CRUSH map contains a list of all the storage devices in the cluster. For each storage device the following information is available:

  • The ID of the storage device.

  • The name of the storage device.

  • The weight of the storage device, normally based on its capacity in terabytes. For example, a 4 TB storage device has a weight of about 4.0. This is the relative amount of data the device can store, which the CRUSH algorithm uses to help ensure uniform object distribution.

    You can set the weight of an OSD with the ceph osd crush reweight command. CRUSH tree bucket weights should equal the sum of their leaf weights. If you manually edit the CRUSH map weights, then you should execute the following command to ensure that the CRUSH tree bucket weights accurately reflect the sum of the weights of the leaf OSDs within each bucket.

    [ceph: root@node /]# ceph osd crush reweight-all
    reweighted crush hierarchy
  • The class of the storage device. Multiple types of storage devices can be used in a storage cluster, such as HDDs, SSDs, or NVMe SSDs. A storage device's class reflects this information, and you can use it to create pools optimized for different application workloads. OSDs automatically detect and set their device class. You can explicitly set the device class of an OSD with the ceph osd crush set-device-class command, and remove it with the ceph osd crush rm-device-class command, as in the example that follows.
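
For example, if an OSD's class had been detected incorrectly, you could clear it and then set it explicitly. The following commands are only a sketch that resets osd.3, from the hierarchy shown below, to the ssd class; the output lines are illustrative:

[ceph: root@node /]# ceph osd crush rm-device-class osd.3
done removing class of osd(s): 3
[ceph: root@node /]# ceph osd crush set-device-class ssd osd.3
set osd(s) 3 to class 'ssd'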

The ceph osd crush tree command shows the CRUSH map's current CRUSH hierarchy:

[ceph: root@node /]# ceph osd crush tree
  ID CLASS WEIGHT  TYPE NAME
  -1       1.52031 root default
  -3       0.48828     host serverg
   0   hdd 0.48828         osd.0 
  -5       0.48828     host serverh
   1   hdd 0.48828         osd.1 
  -7       0.48828     host serveri
   2   hdd 0.48828         osd.2 
  -9       0.01849     host serverj
   3   ssd 0.01849         osd.3 
  -11      0.01849     host serverk
   4   ssd 0.01849         osd.4 
  -13      0.01849     host serverl
   5   ssd 0.01849         osd.5 

Device classes are implemented by creating a “shadow” CRUSH hierarchy for each device class in use that contains only devices of that class. CRUSH rules can then distribute data over the shadow hierarchy. You can view the CRUSH hierarchy with shadow items with the ceph osd crush tree --show-shadow command.

Create a new device class by using the ceph osd crush class create command. Remove a device class using the ceph osd crush class rm command.

List configured device classes with the ceph osd crush class ls command.
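
For example, on the cluster shown above, listing the configured classes returns both device classes in use (output illustrative):

[ceph: root@node /]# ceph osd crush class ls
[
    "hdd",
    "ssd"
]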

Using CRUSH Rules

The CRUSH map also contains the data placement rules that determine how PGs are mapped to OSDs to store object replicas or erasure coded chunks.

The ceph osd crush rule ls command lists the existing rules and the ceph osd crush rule dump rule_name command prints the details of a rule.

The decompiled CRUSH map also contains the rules and might be easier to read:

[ceph: root@node /]# ceph osd getcrushmap -o ./map.bin
[ceph: root@node /]# crushtool -d ./map.bin -o ./map.txt
[ceph: root@node /]# cat ./map.txt
...output omitted...
rule replicated_rule {  1
      id 0  2
      type replicated
      min_size 1   3
      max_size 10  4
      step take default  5
      step chooseleaf firstn 0 type host  6
      step emit 7
}
...output omitted...

1

The name of the rule. Use this name to select the rule when creating a pool with the ceph osd pool create command.

2

The ID of the rule. Some commands and outputs use the rule ID instead of the rule name. For example, the ceph osd dump command reports the rule assigned to each pool by its ID in the crush_rule field, whereas the ceph osd pool set pool-name crush_rule command, which sets the rule for an existing pool, accepts the rule name.

3

If a pool makes fewer replicas than this number, then CRUSH does not select this rule.

4

If a pool makes more replicas than this number, then CRUSH does not select this rule.

5

Takes a bucket name, and begins iterating down the tree. In this example, the iterations start at the bucket called default, which is the root of the default CRUSH hierarchy. With a complex hierarchy composed of multiple data centers, you could create a rule for a data center designed to force objects in specific pools to be stored in OSDs in that data center. In that situation, this step could start iterating at the data center bucket.

6

Selects a set of buckets of the given type (host) and chooses a leaf (OSD) from the subtree of each bucket in the set. In this example, the rule selects an OSD from each host bucket in the set, ensuring that the OSDs come from different hosts. The number of buckets in the set is usually the same as the number of replicas in the pool (the pool size):

  • If the number after firstn is 0, choose as many buckets as there are replicas in the pool.

  • If the number is greater than zero, and less than the number of replicas in the pool, choose that many buckets. In that case, the rule needs another step to draw buckets for the remaining replicas. You can use this mechanism to force the location of a subset of the object replicas.

  • If the number is less than zero, subtract its absolute value from the number of replicas and choose that many buckets.

7

Output the results of the rule.

For example, you could create the following rule to select as many OSDs as needed on separate racks, but only from the DC1 data center:

rule myrackruleinDC1 {
      id 2
      type replicated
      min_size 1
      max_size 10
      step take DC1
      step chooseleaf firstn 0 type rack
      step emit
}
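
The firstn mechanism also lets a rule split placement across several steps. The following sketch assumes that DC1 and DC2 data center buckets exist and that the rule ID is unused; it places the first replica on a host in DC1 and all remaining replicas on separate hosts in DC2:

rule primaryinDC1 {
      id 3
      type replicated
      min_size 1
      max_size 10
      step take DC1
      step chooseleaf firstn 1 type host
      step emit
      step take DC2
      step chooseleaf firstn -1 type host
      step emit
}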

Using CRUSH Tunables

You can also modify the CRUSH algorithm's behavior using tunables. Tunables adjust, disable, or enable features of the CRUSH algorithm. Ceph defines the tunables at the beginning of the decompiled CRUSH map, and you can view their current values with the following command:

[ceph: root@node /]# ceph osd crush show-tunables
{
  "choose_local_tries": 0,
  "choose_local_fallback_tries": 0,
  "choose_total_tries": 50,
  "chooseleaf_descend_once": 1,
  "chooseleaf_vary_r": 1,
  "chooseleaf_stable": 1,
  "straw_calc_version": 1,
  "allowed_bucket_algs": 54,
  "profile": "jewel",
  "optimal_tunables": 1,
  "legacy_tunables": 0,
  "minimum_required_version": "jewel",
  "require_feature_tunables": 1,
  "require_feature_tunables2": 1,
  "has_v2_rules": 1,
  "require_feature_tunables3": 1,
  "has_v3_rules": 0,
  "has_v4_buckets": 1,
  "require_feature_tunables5": 1,
  "has_v5_rules": 0
}

Important

Adjusting CRUSH tunables will probably change how CRUSH maps placement groups to OSDs. When that happens, the cluster needs to move objects to different OSDs in the cluster to reflect the recalculated mappings. Cluster performance could degrade during this process.

Rather than modifying individual tunables, you can select a predefined profile with the ceph osd crush tunables profile command. Set the value of profile to optimal to enable the best (optimal) values for the current version of Red Hat Ceph Storage.
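
For example, the following command applies the optimal profile; the output line is illustrative:

[ceph: root@node /]# ceph osd crush tunables optimal
adjusted crush tunables profile to optimal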

Important

Red Hat recommends that all cluster daemons and clients use the same release version.

CRUSH Map Management

The cluster keeps a compiled binary representation of the CRUSH map. You can modify it by:

  • Using the ceph osd crush command.

  • Extracting and decompiling the binary CRUSH map to plain text, editing the text file, recompiling it to binary format, and importing it back into the cluster.

It is usually easier to update the CRUSH map with the ceph osd crush command. However, there are less common scenarios which can only be implemented by using the second method.

Customizing the CRUSH Map Using Ceph Commands

This example creates a new bucket:

[ceph: root@node /]# ceph osd crush add-bucket name type

For example, these commands create three new buckets, one of the datacenter type and two of the rack type:

[ceph: root@node /]# ceph osd crush add-bucket DC1 datacenter
added bucket DC1 type datacenter to crush map
[ceph: root@node /]# ceph osd crush add-bucket rackA1 rack
added bucket rackA1 type rack to crush map
[ceph: root@node /]# ceph osd crush add-bucket rackB1 rack
added bucket rackB1 type rack to crush map

You can then organize the new buckets in a hierarchy with the following command:

[ceph: root@node /]# ceph osd crush move name type=parent

You also use this command to reorganize the tree. For example, the following commands attach the two rack buckets from the previous example to the data center bucket, and attach the data center bucket to the default root bucket:

[ceph: root@node /]# ceph osd crush move rackA1 datacenter=DC1
moved item id -10 name 'rackA1' to location {datacenter=DC1} in crush map
[ceph: root@node /]# ceph osd crush move rackB1 datacenter=DC1
moved item id -11 name 'rackB1' to location {datacenter=DC1} in crush map
[ceph: root@node /]# ceph osd crush move DC1 root=default
moved item id -9 name 'DC1' to location {root=default} in crush map
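
You can verify the resulting hierarchy with the ceph osd crush tree command. The layout below is illustrative; the new buckets report a weight of 0 until OSDs are placed under them:

[ceph: root@node /]# ceph osd crush tree
ID   CLASS  WEIGHT   TYPE NAME
 -1         1.52031  root default
 -9               0      datacenter DC1
-10               0          rack rackA1
-11               0          rack rackB1
...output omitted...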

Setting the Location of OSDs

After you have created your custom bucket hierarchy, place the OSDs as leaves on this tree. Each OSD has a location, which is a string defining the full path to the OSD from the root of the tree. For example, the location of an OSD attached to the rackA1 bucket is:

root=default datacenter=DC1 rack=rackA1
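
Rather than moving OSDs one at a time, you often attach an entire host bucket, with its OSDs, to the new location. For example, assuming that the serverg host from the earlier listing belongs in rackA1 (output illustrative):

[ceph: root@node /]# ceph osd crush move serverg rack=rackA1
moved item id -3 name 'serverg' to location {rack=rackA1} in crush map

Its OSDs then report the location root=default datacenter=DC1 rack=rackA1 host=serverg.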

When Ceph starts, it uses the ceph-crush-location utility to automatically verify that each OSD is in the correct CRUSH location. If the OSD is not in the expected location in the CRUSH map, it is automatically moved. By default, the expected location is root=default host=hostname.

You can replace the ceph-crush-location utility with your own script to change where OSDs are placed in the CRUSH map. To do this, specify the crush_location_hook parameter in the /etc/ceph/ceph.conf configuration file.

...output omitted...
[osd]
crush_location_hook = /path/to/your/script
...output omitted...

Ceph executes the script with these arguments: --cluster cluster-name --id osd-id --type osd. The script must print the location as a single line on its standard output. The upstream Ceph documentation has an example of a custom script that assumes each system has an /etc/rack file containing the name of its rack:

#!/bin/sh
echo "root=default rack=$(cat /etc/rack) host=$(hostname -s)"

You can set the crush_location parameter in the /etc/ceph/ceph.conf configuration file to redefine the location for particular OSDs. For example, to set the location for osd.0 and osd.1, add the crush_location parameter inside their respective sections in that file:

[osd.0]
crush_location = root=default datacenter=DC1 rack=rackA1

[osd.1]
crush_location = root=default datacenter=DC1 rack=rackB1

Adding CRUSH Map Rules

This example creates a rule that Ceph can use for replicated pools:

[ceph: root@node /]# ceph osd crush rule create-replicated name root \
failure-domain-type [class]

  • name is the name of the rule.

  • root is the starting node in the CRUSH map hierarchy.

  • failure-domain-type is the bucket type for replication.

  • class is the class of the devices to use, such as ssd or hdd. This parameter is optional.

The following example creates the new inDC2 rule to store replicas in the DC2 data center, and distributes the replicas across racks:

[ceph: root@node /]# ceph osd crush rule create-replicated inDC2 DC2 rack
[ceph: root@node /]# ceph osd crush rule ls
replicated_rule
erasure-code
inDC2

After you have defined the rule, use it when creating a replicated pool:

[ceph: root@node /]# ceph osd pool create myfirstpool 50 50 inDC2
pool 'myfirstpool' created
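
You can also assign the rule to an existing pool with the crush_rule pool parameter. The pool name and the pool ID in the output are hypothetical:

[ceph: root@node /]# ceph osd pool set myexistingpool crush_rule inDC2
set pool 6 crush_rule to inDC2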

For erasure coding, Ceph automatically creates rules for each erasure coded pool you create. The name of the rule is the name of the new pool. Ceph uses the rule parameters you define in the erasure code profile that you specify when you create the pool.

The following example first creates the new myprofile erasure code profile, then creates the myecpool pool based on this profile:

[ceph: root@node /]# ceph osd erasure-code-profile set myprofile k=2 m=1 \
crush-root=DC2 crush-failure-domain=rack crush-device-class=ssd
[ceph: root@node /]# ceph osd pool create myecpool 50 50 erasure myprofile
pool 'myecpool' created
[ceph: root@node /]# ceph osd crush rule ls
replicated_rule
erasure-code
myecpool

Customizing the CRUSH Map by Decompiling the Binary Version

You can decompile and manually edit the CRUSH map with the following commands:

Command                                 Action
ceph osd getcrushmap -o binfile         Export a binary copy of the current map.
crushtool -d binfile -o textfilepath    Decompile a CRUSH map binary into a text file.
crushtool -c textfilepath -o binfile    Compile a CRUSH map from text.
crushtool -i binfile --test             Perform dry runs on a binary CRUSH map and simulate placement group creation.
ceph osd setcrushmap -i binfile         Import a binary CRUSH map into the cluster.
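
A complete edit cycle, reusing the file names from the earlier example, looks like the following sketch. The crushtool --test options shown simulate the mappings produced by rule 0 with three replicas before the modified map is imported:

[ceph: root@node /]# ceph osd getcrushmap -o ./map.bin
[ceph: root@node /]# crushtool -d ./map.bin -o ./map.txt
[ceph: root@node /]# vi ./map.txt
[ceph: root@node /]# crushtool -c ./map.txt -o ./newmap.bin
[ceph: root@node /]# crushtool -i ./newmap.bin --test --show-mappings --rule 0 --num-rep 3
...output omitted...
[ceph: root@node /]# ceph osd setcrushmap -i ./newmap.bin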

Note

The ceph osd getcrushmap and ceph osd setcrushmap commands provide a useful way to back up and restore the CRUSH map for your cluster.

Optimizing Placement Groups

Placement groups (PGs) allow the cluster to store millions of objects in a scalable way by aggregating them into groups. Objects are organized into placement groups based on the object's ID, the ID of the pool, and the number of placement groups in the pool.

During the cluster life cycle, the number of PGs must be adjusted as the cluster layout changes. CRUSH attempts to ensure a uniform distribution of objects among OSDs in the pool, but there are scenarios where the PGs become unbalanced. The placement group autoscaler can be used to optimize PG distribution, and is on by default. You can also manually set the number of PGs per pool, if required.
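
For example, you can review the autoscaler's recommendations and override them for a specific pool. The pool name and the pool ID in the output are illustrative:

[ceph: root@node /]# ceph osd pool autoscale-status
...output omitted...
[ceph: root@node /]# ceph osd pool set myfirstpool pg_autoscale_mode off
set pool 5 pg_autoscale_mode to off
[ceph: root@node /]# ceph osd pool set myfirstpool pg_num 64
set pool 5 pg_num to 64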

Objects are typically distributed uniformly, provided that there are one or two orders of magnitude (factors of ten) more placement groups than OSDs in the pool. If there are not enough PGs, then objects might be distributed unevenly. If there is a small number of very large objects stored in the pool, then object distribution might become unbalanced.

Note

PGs should be configured so that there are enough to evenly distribute objects across the cluster. If the number of PGs is set too high, then it increases CPU and memory use significantly. Red Hat recommends approximately 100 to 200 placement groups per OSD to balance these factors.

Calculating the Number of Placement Groups

For a cluster with a single pool, you can use the following formula, with 100 placement groups per OSD:

Total PGs = (OSDs * 100)/Number of replicas
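
For example, a hypothetical cluster with 100 OSDs and a replica count of 3 gives:

Total PGs = (100 * 100) / 3 ≈ 3333

In practice, this value is commonly rounded to a nearby power of two, such as 4096 in this example, before being applied to the pool.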

Red Hat recommends the use of the Ceph Placement Groups per Pool Calculator, https://access.redhat.com/labs/cephpgc/, from the Red Hat Customer Portal Labs.

Mapping PGs Manually

Use the ceph osd pg-upmap-items command to manually map PGs to specific OSDs. Because older Ceph clients do not support it, you must set the minimum required client version with the ceph osd set-require-min-compat-client command before you can use pg-upmap entries.

[ceph: root@node /]# ceph osd set-require-min-compat-client luminous
set require_min_compat_client to luminous

The following example remaps PG 3.25 from OSDs 2 and 0 to OSDs 1 and 0:

[ceph: root@node /]# ceph pg map 3.25
osdmap e384 pg 3.25 (3.25) -> up [2,0] acting [2,0]
[ceph: root@node /]# ceph osd pg-upmap-items 3.25 2 1
set 3.25 pg_upmap_items mapping to [2->1]
[ceph: root@node /]# ceph pg map 3.25
osdmap e387 pg 3.25 (3.25) -> up [1,0] acting [1,0]

Remapping hundreds of PGs this way is not practical. The osdmaptool command is useful here: it takes the actual cluster map, analyzes it, and generates the ceph osd pg-upmap-items commands to run for an optimal distribution:

  1. Export the map to a file. The following command saves the map to the ./om file:

    [ceph: root@node /]# ceph osd getmap -o ./om
    got osdmap epoch 387
  2. Use the --test-map-pgs option of the osdmaptool command to display the actual distribution of PGs. The following command prints the distribution for the pool with the ID of 3:

    [ceph: root@node /]# osdmaptool ./om --test-map-pgs --pool 3
    osdmaptool: osdmap file './om'
    pool 3 pg_num 50
    #osd  count first	primary	c wt	wt
    osd.0    34    19	19	0.0184937	1
    osd.1    39    14	14	0.0184937	1
    osd.2    27    17	17	0.0184937	1
    ...output omitted...

    This output shows that osd.2 has only 27 PGs but osd.1 has 39.

  3. Generate the commands to rebalance the PGs. Use the --upmap option of the osdmaptool command to store the commands in a file:

    [ceph: root@node /]# osdmaptool ./om --upmap ./cmds.txt --pool 3
    osdmaptool: osdmap file './om'
    writing upmap command output to: ./cmds.txt
    checking for upmap cleanups
    upmap, max-count 100, max deviation 0.01
    [ceph: root@node /]# cat ./cmds.txt
    ceph osd pg-upmap-items 3.1 0 2
    ceph osd pg-upmap-items 3.3 1 2
    ceph osd pg-upmap-items 3.6 0 2
    ...output omitted...
  4. Execute the commands:

    [ceph: root@node /]# bash ./cmds.txt
    set 3.1 pg_upmap_items mapping to [0->2]
    set 3.3 pg_upmap_items mapping to [1->2]
    set 3.6 pg_upmap_items mapping to [0->2]
    ...output omitted...
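
  5. Optionally, export a fresh map and rerun the analysis to confirm that the PG counts are now more even. The epoch in the output and the file name are illustrative:

    [ceph: root@node /]# ceph osd getmap -o ./om2
    got osdmap epoch 400
    [ceph: root@node /]# osdmaptool ./om2 --test-map-pgs --pool 3
    ...output omitted...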

 

References

For more information, refer to Red Hat Ceph Storage 5 Strategies Guide at https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-single/storage_strategies_guide/

Revision: cl260-5.0-29d2128