After completing this section, you should be able to describe and compare replicated and erasure coded pools, and create and configure each pool type.
Pools are logical partitions for storing objects. Ceph clients write objects to pools.
Ceph clients require the cluster name (ceph by default) and a monitor address to connect to the cluster.
Ceph clients usually obtain this information from a Ceph configuration file, or specify it as command-line parameters.
The Ceph client creates an input/output context to a specific pool and the Ceph cluster uses the CRUSH algorithm to map these pools to placement groups, which are then mapped to specific OSDs.
The available pool types are replicated and erasure coded.
You decide which pool type to use based on your production use case and the type of workload.
A pool's type cannot be changed after creating the pool.
You must specify certain attributes when you create a pool:
The pool name, which must be unique in the cluster.
The pool type, which determines the protection mechanism the pool uses to ensure data durability.
The replicated type distributes multiple copies of each object across the cluster.
The erasure coded type splits each object into chunks, and distributes them along with additional erasure coded chunks to protect objects using an automatic error correction mechanism.
The number of placement groups (PGs) in the pool, which store their objects in a set of OSDs determined by the CRUSH algorithm.
Optionally, a CRUSH rule set that Ceph uses to identify which placement groups to use to store objects for the pool.
Change the osd_pool_default_pg_num and osd_pool_default_pgp_num configuration settings to set the default number of PGs for a pool.
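For example, the following minimal sketch sets a cluster-wide default of 32 PGs for new pools by using the centralized configuration database; the value 32 is only an illustrative assumption:
[ceph: root@node /]# ceph config set global osd_pool_default_pg_num 32
[ceph: root@node /]# ceph config set global osd_pool_default_pgp_num 32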
Ceph protects data within replicated pools by creating multiple copies of each object, called replicas. Ceph uses the CRUSH failure domain to determine the primary OSD of the acting set to store the data. The primary OSD then finds the current replica size for the pool and calculates the secondary OSDs to write the object to. After the primary OSD finishes writing the data and receives write acknowledgments from the secondary OSDs, it acknowledges a successful write operation to the Ceph client. This protects the data in the object if one or more OSDs fail.
Use the following command to create a replicated pool.
[ceph: root@node /]# ceph osd pool create pool-name pg-num pgp-num replicated crush-rule-name
Where:
pool-name is the name of the new pool.
pg-num is the total number of placement groups (PGs) for this pool.
pgp-num is the effective number of placement groups for this pool.
Set this value equal to pg-num.
replicated specifies that this is a replicated pool, and is the default if not included in the command.
crush-rule-name is the name of the CRUSH rule set you want to use for this pool.
The osd_pool_default_crush_rule configuration parameter sets the default value.
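For example, the following sketch creates a hypothetical replicated pool named mypool with 32 placement groups; the pool name and PG count are assumptions for illustration:
[ceph: root@node /]# ceph osd pool create mypool 32 32 replicated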
The number of placement groups in a pool can be adjusted after it is initially configured.
If pg_num and pgp_num are set to the same number, then any future adjustments to pg_num automatically adjusts the value of pgp_num.
The adjustment to pgp_num triggers the movement of PGs across OSDs, if needed, to implement the change.
Define a new number of PGs in a pool by using the following command.
[ceph: root@node /]# ceph osd pool set my_pool pg_num 32
set pool 6 pg_num to 32
When you create a pool with the ceph osd pool create command, you do not specify the number of replicas (size).
The osd_pool_default_size configuration parameter defines the number of replicas, and defaults to a value of 3.
[ceph: root@node /]# ceph config get mon osd_pool_default_size
3
Change the size of a pool with the ceph osd pool set pool-name size number-of-replicas command. Alternatively, update the osd_pool_default_size configuration setting to change the default.
The osd_pool_default_min_size parameter sets the number of copies of an object that must be available to accept I/O requests.
The default value is 2.
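As a sketch, assuming a hypothetical pool named mypool that keeps the default of three replicas, the following command requires at least two available copies before the pool accepts I/O requests:
[ceph: root@node /]# ceph osd pool set mypool min_size 2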
An erasure coded pool uses erasure coding instead of replication to protect object data.
Objects stored in an erasure coded pool are divided into a number of data chunks which are stored in separate OSDs.
Coding chunks are calculated from the data chunks and are stored on different OSDs.
The coding chunks are used to reconstruct the object's data if an OSD fails.
In erasure coded pools, the primary OSD receives the write operation, encodes the payload into k+m chunks, and sends the chunks to the secondary OSDs.
Erasure coded pools use this method to protect their objects and, unlike replicated pools, do not rely on storing multiple copies of each object.
To summarize how erasure coded pools work:
Each object's data is divided into k data chunks.
m coding chunks are calculated.
The coding chunk size is the same as the data chunk size.
The object is stored on a total of k + m OSDs.
Erasure coding uses storage capacity more efficiently than replication. Replicated pools maintain n copies of an object, whereas erasure coding maintains only k + m chunks. For example, replicated pools with 3 copies use 3 times the storage space. Erasure coded pools with k=4 and m=2 use only 1.5 times the storage space.
Red Hat supports the following k+m values which result in the corresponding usable-to-raw ratio:
4+2 (1:1.5 ratio)
8+3 (1:1.375 ratio)
8+4 (1:1.5 ratio)
The formula for calculating the usable capacity of an erasure coded pool is nOSD * k / (k+m) * OSD Size.
For example, if you have 64 OSDs of 4 TB each (256 TB of raw capacity), with k=8 and m=4, then the usable capacity is 64 * 8 / (8+4) * 4 = 170.67 TB.
Divide the raw capacity by the usable capacity to get the raw-to-usable ratio.
256 TB / 170.67 TB equals a ratio of 1.5.
Erasure coded pools require less storage than replicated pools to obtain a similar level of data protection, which can reduce the cost and size of the storage cluster. However, calculating coding chunks adds CPU processing and memory overhead for erasure coded pools, reducing overall performance.
Use the following command to create an erasure coded pool.
[ceph: root@node /]# ceph osd pool create pool-name pg-num pgp-num \
erasure erasure-code-profile crush-rule-name
Where:
pool-name is the name of the new pool.
pg-num is the total number of placement groups (PGs) for this pool.
pgp-num is the effective number of placement groups for this pool.
Normally, this should be equal to the total number of placement groups.
erasure specifies that this is an erasure coded pool.
erasure-code-profile is the name of the profile to use.
You can create new profiles with the ceph osd erasure-code-profile set command.
A profile defines the k and m values and the erasure code plug-in to use.
By default, Ceph uses the default profile.
crush-rule-name is the name of the CRUSH rule set to use for this pool.
If not set, Ceph uses the one defined in the erasure code profile.
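For example, this sketch creates a hypothetical erasure coded pool named myecpool that uses the default profile; the pool name and PG count are assumptions:
[ceph: root@node /]# ceph osd pool create myecpool 32 32 erasure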
You can configure placement group autoscaling on a pool.
Autoscaling allows the cluster to calculate the number of placement groups and to choose appropriate pg_num values automatically.
Autoscaling is enabled by default in Red Hat Ceph Storage 5.
Every pool in the cluster has a pg_autoscale_mode option with a value of on, off, or warn.
on: Enables automated adjustments of the PG count for the pool.
off: Disables PG autoscaling for the pool.
warn: Raises a health alert and changes the cluster health status to HEALTH_WARN when the PG count needs adjustment.
This example enables the pg_autoscaler module on the Ceph MGR nodes and sets the autoscaling mode to on for a pool:
[ceph: root@node /]# ceph mgr module enable pg_autoscaler
module 'pg_autoscaler' is already enabled (always-on)
[ceph: root@node /]# ceph osd pool set pool-name pg_autoscale_mode on
set pool 7 pg_autoscale_mode to on
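You can review the autoscaler's current recommendations and the effective PG counts for each pool with the autoscale status command:
[ceph: root@node /]# ceph osd pool autoscale-status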
Erasure coded pools cannot use the Object Map feature. An object map is an index that tracks where the blocks of an RBD image are allocated. Having an object map improves the performance of resize, export, flatten, and other operations.
An erasure code profile configures the number of data chunks and coding chunks that your erasure-coded pool uses to store objects, and which erasure coding plug-in and algorithm to use.
Create profiles to define different sets of erasure coding parameters.
Ceph automatically creates the default profile during installation.
This profile is configured to divide objects into two data chunks and one coding chunk.
Use the following command to create a new profile.
[ceph: root@node /]# ceph osd erasure-code-profile set profile-name arguments
The following arguments are available:
k
The number of data chunks that are split across OSDs. The default value is 2.
m
The number of OSDs that can fail before the data becomes unavailable. The default value is 1.
directory
This optional parameter is the location of the plug-in library. The default value is /usr/lib64/ceph/erasure-code.
plugin
This optional parameter defines the erasure coding algorithm to use.
crush-failure-domain
This optional parameter defines the CRUSH failure domain, which controls chunk placement.
By default, it is set to host, which ensures that an object's chunks are placed on OSDs on different hosts.
If set to osd, then an object's chunks can be placed on OSDs on the same host.
Setting the failure domain to osd is less resilient because all OSDs on a host will fail if the host fails.
You can define other failure domains, such as rack, to ensure that chunks are placed on OSDs on hosts in different data center racks.
crush-device-class
This optional parameter selects only OSDs backed by devices of this class for the pool.
Typical classes might include hdd, ssd, or nvme.
crush-root
This optional parameter sets the root node of the CRUSH rule set.
key=value
Plug-ins might have key-value parameters unique to that plug-in.
technique
Each plug-in provides a different set of techniques that implement different algorithms.
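For example, the following sketch creates a hypothetical profile named myprofile with four data chunks, two coding chunks, and a rack failure domain; the profile name and values are assumptions for illustration:
[ceph: root@node /]# ceph osd erasure-code-profile set myprofile k=4 m=2 crush-failure-domain=rack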
You cannot change the erasure code profile of an existing pool.
Use the ceph osd erasure-code-profile ls command to list existing profiles.
Use the ceph osd erasure-code-profile get command to view the details of an existing profile.
Use the ceph osd erasure-code-profile rm command to remove an existing profile.
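A short sketch of these commands, assuming the hypothetical myprofile from the earlier example:
[ceph: root@node /]# ceph osd erasure-code-profile ls
[ceph: root@node /]# ceph osd erasure-code-profile get default
[ceph: root@node /]# ceph osd erasure-code-profile rm myprofile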
You can view and modify existing pools and change pool configuration settings.
Rename a pool by using the ceph osd pool rename command.
This does not affect the data stored in the pool.
If you rename a pool and you have per-pool capabilities for an authenticated user, you must update the user's capabilities with the new pool name.
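For example, a sketch renaming a hypothetical pool; both pool names are assumptions:
[ceph: root@node /]# ceph osd pool rename mypool newpool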
Delete a pool by using the ceph osd pool delete command.
Deleting a pool removes all data in the pool and is not reversible.
You must set mon_allow_pool_delete to true to enable pool deletion.
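A sketch of the full sequence, assuming a hypothetical pool named newpool; note that the delete command requires the pool name twice plus a confirmation flag:
[ceph: root@node /]# ceph config set mon mon_allow_pool_delete true
[ceph: root@node /]# ceph osd pool delete newpool newpool --yes-i-really-really-mean-it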
Prevent the deletion of a specific pool by using the ceph osd pool set pool-name nodelete true command. Set nodelete back to false to allow deletion of the pool.
View and modify pool configuration settings by using the ceph osd pool set and ceph osd pool get commands.
List pools and pool configuration settings by using the ceph osd lspools and ceph osd pool ls detail commands.
List pools usage and performance statistics by using the ceph df and ceph osd pool stats commands.
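For example, the following sketch inspects a hypothetical pool named mypool and the cluster's overall pool usage; the pool name is an assumption:
[ceph: root@node /]# ceph osd pool get mypool size
[ceph: root@node /]# ceph osd pool ls detail
[ceph: root@node /]# ceph df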
Enable Ceph applications for a pool by using the ceph osd pool application enable command.
Application types are cephfs for Ceph File System, rbd for Ceph Block Device, and rgw for RADOS Gateway.
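For example, a sketch enabling the rbd application on a hypothetical pool named mypool:
[ceph: root@node /]# ceph osd pool application enable mypool rbd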
Set pool quotas to limit the maximum number of bytes or the maximum number of objects that can be stored in the pool by using the ceph osd pool set-quota command.
When a pool reaches the configured quota, operations are blocked. You can remove a quota by setting its value to 0.
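For example, a sketch that limits a hypothetical pool named mypool to 10,000 objects and 10 GiB (10737418240 bytes), and then clears the object quota by setting it to 0; the pool name and values are assumptions:
[ceph: root@node /]# ceph osd pool set-quota mypool max_objects 10000
[ceph: root@node /]# ceph osd pool set-quota mypool max_bytes 10737418240
[ceph: root@node /]# ceph osd pool set-quota mypool max_objects 0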
Configure these settings to protect pools against accidental reconfiguration:
osd_pool_default_flag_nodelete
Sets the default value of the nodelete flag on pools.
Set the value to true to prevent pool deletion.
osd_pool_default_flag_nopgchange
Sets the default value of the nopgchange flag on pools.
Set the value to true to prevent changes to pg_num and pgp_num.
osd_pool_default_flag_nosizechange
Sets the default value of the nosizechange flag on pools.
Set the value to true to prevent pool size changes.
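For example, a sketch that sets one of these defaults in the centralized configuration database:
[ceph: root@node /]# ceph config set global osd_pool_default_flag_nodelete true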
A namespace is a logical group of objects in a pool. Access to a pool can be limited so that a user can only store or retrieve objects in a particular namespace. One advantage of namespaces is to restrict user access to part of a pool.
Namespaces are currently only supported for applications that directly use librados.
RBD and Ceph Object Gateway clients do not currently support this feature.
Use the rados command to store and retrieve objects from a pool.
Use the -p pool option to specify the pool, and the -N name or --namespace=name option to specify the namespace to use.
The following example stores the /etc/services file as the srv object in the mytestpool pool, under the system namespace.
[ceph: root@node /]# rados -p mytestpool -N system put srv /etc/services
[ceph: root@node /]# rados -p mytestpool -N system ls
srv
List all the objects in all namespaces in a pool by using the --all option.
To obtain JSON formatted output, add the --format=json-pretty option.
The following example lists the objects in the mytestpool pool.
The mytest object has an empty namespace.
The other objects belong to the system or the flowers namespaces.
[ceph: root@node /]# rados -p mytestpool --all ls
system	srv
flowers	anemone
flowers	iris
system	magic
flowers	rose
	mytest
system	networks
[ceph: root@node /]# rados -p mytestpool --all ls --format=json-pretty
[
    {
        "name": "srv",
        "namespace": "system"
    },
    {
        "name": "anemone",
        "namespace": "flowers"
    },
    {
        "name": "iris",
        "namespace": "flowers"
    },
    {
        "name": "magic",
        "namespace": "system"
    },
    {
        "name": "rose",
        "namespace": "flowers"
    },
    {
        "name": "mytest",
        "namespace": ""
    },
    {
        "name": "networks",
        "namespace": "system"
    }
]
For more information, refer to the Pools and Erasure Code Pools chapters in the Red Hat Ceph Storage 5 Storage Strategies Guide at https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-single/storage_strategies_guide