Abstract
| Goal | Manage and adjust the CRUSH and OSD maps to optimize data placement to meet the performance and redundancy requirements of cloud applications. |
Creating and Customizing Storage Maps |
After completing this section, you should be able to administer and update the CRUSH map used by the Ceph cluster.
Ceph calculates which OSDs should hold which objects by using a placement algorithm called CRUSH (Controlled Replication Under Scalable Hashing). Objects are assigned to placement groups (PGs) and CRUSH determines which OSDs those placement groups should use to store their objects.
The CRUSH algorithm enables Ceph clients to directly communicate with OSDs; this avoids a centralized service bottleneck. Ceph clients and OSDs use the CRUSH algorithm to efficiently compute information about object locations, instead of having to depend on a central lookup table. Ceph clients retrieve the cluster maps and use the CRUSH map to algorithmically determine how to store and retrieve data. This enables massive scalability for the Ceph cluster by avoiding a single point of failure and a performance bottleneck.
Conceptually, a CRUSH map contains two major components:
The CRUSH hierarchy, which lists all available OSDs and organizes them into a treelike structure of buckets.
The CRUSH hierarchy is often used to represent where OSDs are located. By default, there is a root bucket representing the whole hierarchy, which contains a host bucket for each OSD host.
The OSDs are the leaves of the tree, and by default all OSDs on the same OSD host are placed in that host's bucket. You can customize the tree structure to rearrange it, add more levels, and group OSD hosts into buckets representing their location in different server racks or data centers.
CRUSH rules determine how placement groups are assigned OSDs from those buckets. This determines where objects for those placement groups are stored. Different pools might use different CRUSH rules from the CRUSH map.
The CRUSH hierarchy organizes OSDs into a tree of different containers, called buckets. For a large installation, you can create a specific hierarchy to describe your storage infrastructure: data centers with rows of racks, racks, hosts, and OSD devices on those hosts. By creating a CRUSH map rule, you can cause Ceph to place an object's replicas on OSDs on separate servers, on servers in different racks, or even on servers in different data centers.
To summarize, buckets are the containers or branches in the CRUSH hierarchy. Devices are OSDs, and are leaves in the CRUSH hierarchy.
Some of the most important bucket attributes are:
- The ID of the bucket. These IDs are negative numbers to distinguish them from storage device IDs.
- The name of the bucket.
- The type of the bucket. The default map defines several types that you can retrieve with the ceph osd crush dump command. Bucket types include root, region, datacenter, room, pod, pdu, row, rack, chassis, and host, but you can also add your own types. The bucket at the root of the hierarchy is of the root type.
- The algorithm that Ceph uses to select items inside the bucket when mapping PG replicas to OSDs. Several algorithms are available: uniform, list, tree, and straw2. Each algorithm represents a trade-off between performance and reorganization efficiency. The default algorithm is straw2.
The CRUSH map is the central configuration mechanism for the CRUSH algorithm. You can edit this map to influence data placement and customize the CRUSH algorithm.
Configuring the CRUSH map and creating separate failure domains allows OSDs and cluster nodes to fail without any data loss occurring. The cluster simply operates in a degraded state until the problem is fixed.
Configuring the CRUSH map and creating separate performance domains can reduce performance bottlenecks for clients and applications that use the cluster to store and retrieve data. For example, CRUSH can create one hierarchy for HDDs and another hierarchy for SSDs.
By default, the CRUSH algorithm places replicated objects on OSDs on different hosts. You can customize the CRUSH map so that object replicas are placed across OSDs in different shelves, or on hosts in different rooms, or in different racks with distinct power sources.
The CRUSH map can contain multiple hierarchies that you can select through different CRUSH rules. By using separate CRUSH hierarchies, you can establish separate performance domains. Use case examples for configuring separate performance domains are:
- To separate block storage used by VMs from object storage used by applications.
- To separate "cold" storage, containing infrequently accessed data, from "hot" storage, containing frequently accessed data.
If you examine an actual CRUSH map definition, it contains:
- A list of all available physical storage devices.
- A list of all the infrastructure buckets and the IDs of the storage devices or other buckets in each of them. Remember that a bucket is a container, or a branch, in the infrastructure tree. For example, it might represent a location or a piece of physical hardware.
- A list of CRUSH rules to map PGs to OSDs.
- A list of other CRUSH tunables and their settings.
The cluster installation process deploys a default CRUSH map.
You can use the ceph osd crush dump command to print the CRUSH map in JSON format.
You can also export a binary copy of the map and decompile it into a text file:
[ceph: root@node /]# ceph osd getcrushmap -o ./map.bin
[ceph: root@node /]# crushtool -d ./map.bin -o ./map.txt
The CRUSH map contains a list of all the storage devices in the cluster. For each storage device the following information is available:
The ID of the storage device.
The name of the storage device.
The weight of the storage device, normally based on its capacity in terabytes. For example, a 4 TB storage device has a weight of about 4.0. This is the relative amount of data the device can store, which the CRUSH algorithm uses to help ensure uniform object distribution.
You can set the weight of an OSD with the ceph osd crush reweight command.
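For example, as a minimal sketch (the OSD name and weight value are illustrative), you could set the weight of osd.0 to match a 4 TB device:

[ceph: root@node /]# ceph osd crush reweight osd.0 4.0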
CRUSH tree bucket weights should equal the sum of their leaf weights.
If you manually edit the CRUSH map weights, then you should execute the following command to ensure that the CRUSH tree bucket weights accurately reflect the sum of the leaf OSDs within the bucket.
[ceph: root@node /]# ceph osd crush reweight-all
reweighted crush hierarchy

The class of the storage device.
Multiple types of storage devices can be used in a storage cluster, such as HDDs, SSDs, or NVMe SSDs.
A storage device's class reflects this information and you can use that to create pools optimized for different application workloads.
OSDs automatically detect and set their device class.
You can explicitly set the device class of an OSD with the ceph osd crush set-device-class command.
Use the ceph osd crush rm-device-class command to remove a device class from an OSD.
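For example, assuming an OSD named osd.5 whose class was detected incorrectly (the OSD ID and class are illustrative), you could clear and then set its device class. The existing class is removed first because Ceph does not overwrite a device class that is already set:

[ceph: root@node /]# ceph osd crush rm-device-class osd.5
[ceph: root@node /]# ceph osd crush set-device-class ssd osd.5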
The ceph osd crush tree command shows the CRUSH map's current CRUSH hierarchy:
[ceph: root@node /]# ceph osd crush tree
ID   CLASS  WEIGHT   TYPE NAME
-1          1.52031  root default
-3          0.48828      host serverg
 0     hdd  0.48828          osd.0
-5          0.48828      host serverh
 1     hdd  0.48828          osd.1
-7          0.48828      host serveri
 2     hdd  0.48828          osd.2
-9          0.01849      host serverj
 3     ssd  0.01849          osd.3
-11         0.01849      host serverk
 4     ssd  0.01849          osd.4
-13         0.01849      host serverl
 5     ssd  0.01849          osd.5
Device classes are implemented by creating a “shadow” CRUSH hierarchy for each device class in use that contains only devices of that class.
CRUSH rules can then distribute data over the shadow hierarchy.
You can view the CRUSH hierarchy with shadow items with the ceph osd crush tree --show-shadow command.
Create a new device class by using the ceph osd crush class create command.
Remove a device class using the ceph osd crush class rm command.
List configured device classes with the ceph osd crush class ls command.
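For example, the following sketch lists the existing classes and then creates a hypothetical nvme class (the class name is illustrative):

[ceph: root@node /]# ceph osd crush class ls
[ceph: root@node /]# ceph osd crush class create nvme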
The CRUSH map also contains the data placement rules that determine how PGs are mapped to OSDs to store object replicas or erasure coded chunks.
The decompiled CRUSH map also contains the rules and might be easier to read:
[ceph: root@node /]# ceph osd getcrushmap -o ./map.bin
[ceph: root@node /]# crushtool -d ./map.bin -o ./map.txt
[ceph: root@node /]# cat ./map.txt
...output omitted...
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
...output omitted...
The fields in this rule definition have the following meanings:

- The rule name (replicated_rule in this example). Use this name to select the rule when creating a pool.
- id: The ID of the rule. Some commands use the rule ID instead of the rule name.
- min_size: If a pool makes fewer replicas than this number, then CRUSH does not select this rule.
- max_size: If a pool makes more replicas than this number, then CRUSH does not select this rule.
- step take: Takes a bucket name, and begins iterating down the tree. In this example, the iteration starts at the bucket called default.
- step chooseleaf firstn: Selects a set of buckets of the given type (host in this example), and chooses a leaf, that is an OSD, from the subtree of each bucket. With firstn 0, the number of buckets selected matches the number of replicas in the pool.
- step emit: Outputs the results of the rule.
For example, you could create the following rule to select as many OSDs as needed on separate racks, but only from the DC1 data center:
rule myrackruleinDC1 {
id 2
type replicated
min_size 1
max_size 10
step take DC1
step chooseleaf firstn 0 type rack
step emit
}

You can also modify the CRUSH algorithm's behavior using tunables. Tunables adjust, disable, or enable features of the CRUSH algorithm. Ceph defines the tunables at the beginning of the decompiled CRUSH map, and you can get their current values by using the following command:
[ceph: root@node /]# ceph osd crush show-tunables
{
    "choose_local_tries": 0,
    "choose_local_fallback_tries": 0,
    "choose_total_tries": 50,
    "chooseleaf_descend_once": 1,
    "chooseleaf_vary_r": 1,
    "chooseleaf_stable": 1,
    "straw_calc_version": 1,
    "allowed_bucket_algs": 54,
    "profile": "jewel",
    "optimal_tunables": 1,
    "legacy_tunables": 0,
    "minimum_required_version": "jewel",
    "require_feature_tunables": 1,
    "require_feature_tunables2": 1,
    "has_v2_rules": 1,
    "require_feature_tunables3": 1,
    "has_v3_rules": 0,
    "has_v4_buckets": 1,
    "require_feature_tunables5": 1,
    "has_v5_rules": 0
}
Adjusting CRUSH tunables will probably change how CRUSH maps placement groups to OSDs. When that happens, the cluster needs to move objects to different OSDs in the cluster to reflect the recalculated mappings. Cluster performance could degrade during this process.
Red Hat recommends that all cluster daemons and clients use the same release version.
The cluster keeps a compiled binary representation of the CRUSH map. You can modify it by:
Using the ceph osd crush command.
Extracting and decompiling the binary CRUSH map to plain text, editing the text file, recompiling it to binary format, and importing it back into the cluster.
It is usually easier to update the CRUSH map with the ceph osd crush command.
However, there are less common scenarios which can only be implemented by using the second method.
This command creates a new bucket:

[ceph: root@node /]# ceph osd crush add-bucket name type

For example, these commands create three new buckets, one of the datacenter type and two of the rack type:
[ceph: root@node /]# ceph osd crush add-bucket DC1 datacenter
added bucket DC1 type datacenter to crush map
[ceph: root@node /]# ceph osd crush add-bucket rackA1 rack
added bucket rackA1 type rack to crush map
[ceph: root@node /]# ceph osd crush add-bucket rackB1 rack
added bucket rackB1 type rack to crush map
You can then organize the new buckets in a hierarchy with the following command:
[ceph: root@node /]# ceph osd crush move name type=parent

You also use this command to reorganize the tree. For example, the following commands attach the two rack buckets from the previous example to the data center bucket, and attach the data center bucket to the default root bucket:
[ceph: root@node /]# ceph osd crush move rackA1 datacenter=DC1
moved item id -10 name 'rackA1' to location {datacenter=DC1} in crush map
[ceph: root@node /]# ceph osd crush move rackB1 datacenter=DC1
moved item id -11 name 'rackB1' to location {datacenter=DC1} in crush map
[ceph: root@node /]# ceph osd crush move DC1 root=default
moved item id -9 name 'DC1' to location {root=default} in crush map
After you have created your custom bucket hierarchy, place the OSDs as leaves on this tree.
Each OSD has a location, which is a string defining the full path to the OSD from the root of the tree.
For example, the location of an OSD attached to the rackA1 bucket is:
root=default datacenter=DC1 rack=rackA1
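As a sketch of placing an OSD at that location, assuming an OSD named osd.2 with a weight of 1.0 and a host bucket named hosta (all illustrative), you might use the ceph osd crush set command:

[ceph: root@node /]# ceph osd crush set osd.2 1.0 root=default datacenter=DC1 rack=rackA1 host=hosta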
When Ceph starts, it uses the ceph-crush-location utility to automatically verify that each OSD is in the correct CRUSH location.
If the OSD is not in the expected location in the CRUSH map, it is automatically moved. By default, the expected location is root=default host=hostname, where hostname is the short host name of the OSD's node.
You can replace the ceph-crush-location utility with your own script to change where OSDs are placed in the CRUSH map.
To do this, specify the crush_location_hook parameter in the /etc/ceph/ceph.conf configuration file.
...output omitted...
[osd]
crush_location_hook = /path/to/your/script
...output omitted...

Ceph executes the script with these arguments: --cluster cluster-name --id osd-id --type osd.
The script must print the location as a single line on its standard output.
The upstream Ceph documentation has an example of a custom script that assumes each system has an /etc/rack file containing the name of its rack:
#!/bin/sh
echo "root=default rack=$(cat /etc/rack) host=$(hostname -s)"
You can set the crush_location parameter in the /etc/ceph/ceph.conf configuration file to redefine the location for particular OSDs.
For example, to set the location for osd.0 and osd.1, add the crush_location parameter inside their respective sections in that file:
[osd.0]
crush_location = root=default datacenter=DC1 rack=rackA1

[osd.1]
crush_location = root=default datacenter=DC1 rack=rackB1
The following command creates a rule that Ceph can use for replicated pools:
[ceph: root@node /]# ceph osd crush rule create-replicated name root \
failure-domain-type [class]

- name is the name of the rule.
- root is the starting node in the CRUSH map hierarchy.
- failure-domain-type is the bucket type for replication.
- class is the class of the devices to use, such as ssd or hdd. This parameter is optional.
The following example creates the new inDC2 rule to store replicas in the DC2 data center, and distributes the replicas across racks:
[ceph: root@node /]# ceph osd crush rule create-replicated inDC2 DC2 rack
[ceph: root@node /]# ceph osd crush rule ls
replicated_rule
erasure-code
inDC2
After you have defined the rule, use it when creating a replicated pool:
[ceph: root@node /]# ceph osd pool create myfirstpool 50 50 inDC2
pool 'myfirstpool' created

For erasure coding, Ceph automatically creates rules for each erasure coded pool you create. The name of the rule is the name of the new pool. Ceph uses the rule parameters you define in the erasure code profile that you specify when you create the pool.
The following example first creates the new myprofile erasure code profile, then creates the myecpool pool based on this profile:
[ceph: root@node /]# ceph osd erasure-code-profile set myprofile k=2 m=1 \
crush-root=DC2 crush-failure-domain=rack crush-device-class=ssd
[ceph: root@node /]# ceph osd pool create myecpool 50 50 erasure myprofile
pool 'myecpool' created
[ceph: root@node /]# ceph osd crush rule ls
replicated_rule
erasure-code
myecpool
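You can also assign a rule to a pool that already exists. As a sketch, assuming an existing replicated pool named mypool (illustrative), the following command would switch it to the inDC2 rule:

[ceph: root@node /]# ceph osd pool set mypool crush_rule inDC2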
You can decompile and manually edit the CRUSH map with the following commands:
| Command | Action |
|---|---|
| ceph osd getcrushmap -o binfile | Export a binary copy of the current map. |
| crushtool -d binfile -o textfile | Decompile a CRUSH map binary into a text file. |
| crushtool -c textfile -o binfile | Compile a CRUSH map from text. |
| crushtool -i binfile --test | Perform dry runs on a binary CRUSH map and simulate placement group creation. |
| ceph osd setcrushmap -i binfile | Import a binary CRUSH map into the cluster. |
The ceph osd getcrushmap and ceph osd setcrushmap commands provide a useful way to back up and restore the CRUSH map for your cluster.
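As a sketch, assuming a backup file named ./crushmap-backup.bin (illustrative), a backup followed by a later restore might look like this:

[ceph: root@node /]# ceph osd getcrushmap -o ./crushmap-backup.bin
[ceph: root@node /]# ceph osd setcrushmap -i ./crushmap-backup.bin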
Placement groups (PGs) allow the cluster to store millions of objects in a scalable way by aggregating them into groups. Objects are organized into placement groups based on the object's ID, the ID of the pool, and the number of placement groups in the pool.
During the cluster life cycle, the number of PGs must be adjusted as the cluster layout changes. CRUSH attempts to ensure a uniform distribution of objects among OSDs in the pool, but there are scenarios where the PGs become unbalanced. The placement group autoscaler can be used to optimize PG distribution, and is on by default. You can also manually set the number of PGs per pool, if required.
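For example, assuming a pool named mypool (illustrative), you could review the autoscaler's view of the cluster, enable autoscaling for the pool, or set a PG count manually:

[ceph: root@node /]# ceph osd pool autoscale-status
[ceph: root@node /]# ceph osd pool set mypool pg_autoscale_mode on
[ceph: root@node /]# ceph osd pool set mypool pg_num 128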
Objects are typically distributed uniformly, provided that there are one or two orders of magnitude (factors of ten) more placement groups than OSDs in the pool. If there are not enough PGs, then objects might be distributed unevenly. If there is a small number of very large objects stored in the pool, then object distribution might become unbalanced.
PGs should be configured so that there are enough to evenly distribute objects across the cluster. If the number of PGs is set too high, then it increases CPU and memory use significantly. Red Hat recommends approximately 100 to 200 placement groups per OSD to balance these factors.
For a cluster with a single pool, you can use the following formula, with 100 placement groups per OSD:
Total PGs = (OSDs * 100)/Number of replicas
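For example, applying this formula to a hypothetical cluster with 100 OSDs and a replica count of 3:

Total PGs = (100 * 100) / 3 ≈ 3333

You would then typically round this value to a nearby power of two, such as 4096.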
Red Hat recommends the use of the Ceph Placement Groups per Pool Calculator, https://access.redhat.com/labs/cephpgc/, from the Red Hat Customer Portal Labs.
Use the ceph osd pg-upmap-items command to manually map PGs to specific OSDs.
Because older Ceph clients do not support it, you must run the ceph osd set-require-min-compat-client command before you can use pg-upmap.
[ceph: root@node /]# ceph osd set-require-min-compat-client luminous
set require_min_compat_client to luminous

The following example remaps PG 3.25 from OSDs 2 and 0 to OSDs 1 and 0:
[ceph: root@node /]# ceph pg map 3.25
osdmap e384 pg 3.25 (3.25) -> up [2,0] acting [2,0]
[ceph: root@node /]# ceph osd pg-upmap-items 3.25 2 1
set 3.25 pg_upmap_items mapping to [2->1]
[ceph: root@node /]# ceph pg map 3.25
osdmap e387 pg 3.25 (3.25) -> up [1,0] acting [1,0]
Remapping hundreds of PGs this way is not practical.
The osdmaptool command is useful here.
It takes the actual map for a pool, analyses it, and generates the ceph osd pg-upmap-items commands to run for an optimal distribution:
Export the map to a file. The following command saves the map to the ./om file:
[ceph: root@node /]# ceph osd getmap -o ./om
got osdmap epoch 387

Use the --test-map-pgs option of the osdmaptool command to display the actual distribution of PGs.
The following command prints the distribution for the pool with the ID of 3:
[ceph: root@node /]# osdmaptool ./om --test-map-pgs --pool 3
osdmaptool: osdmap file './om'
pool 3 pg_num 50
#osd    count   first   primary c wt    wt
osd.0   34      19      19      0.0184937       1
osd.1   39      14      14      0.0184937       1
osd.2   27      17      17      0.0184937       1
...output omitted...
This output shows that osd.2 has only 27 PGs but osd.1 has 39.
Generate the commands to rebalance the PGs. Use the --upmap option of the osdmaptool command to store the commands in a file:
[ceph: root@node /]# osdmaptool ./om --upmap ./cmds.txt --pool 3
osdmaptool: osdmap file './om'
writing upmap command output to: ./cmds.txt
checking for upmap cleanups
upmap, max-count 100, max deviation 0.01
[ceph: root@node /]# cat ./cmds.txt
ceph osd pg-upmap-items 3.1 0 2
ceph osd pg-upmap-items 3.3 1 2
ceph osd pg-upmap-items 3.6 0 2
...output omitted...
Execute the commands:
[ceph: root@node /]# bash ./cmds.txt
set 3.1 pg_upmap_items mapping to [0->2]
set 3.3 pg_upmap_items mapping to [1->2]
set 3.6 pg_upmap_items mapping to [0->2]
...output omitted...

For more information, refer to the Red Hat Ceph Storage 5 Storage Strategies Guide at https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-single/storage_strategies_guide/