Chapter 4. Creating Object Storage Cluster Components

Abstract

Goal: Create and manage the components that comprise the object storage cluster, including OSDs, pools, and the cluster authorization method.
Objectives
  • Describe OSD configuration scenarios and create BlueStore OSDs using ceph-volume.

  • Describe and compare replicated and erasure coded pools, and create and configure each pool type.

  • Describe Cephx and configure user authentication and authorization for Ceph clients.

Sections
  • Creating BlueStore OSDs Using Logical Volumes (and Guided Exercise)

  • Creating and Configuring Pools (and Guided Exercise)

  • Managing Ceph Authentication (and Guided Exercise)

Lab

Creating Object Storage Cluster Components

Creating BlueStore OSDs Using Logical Volumes

Objectives

After completing this section, you should be able to describe OSD configuration scenarios and create BlueStore OSDs by using cephadm and ceph-volume.

Introducing BlueStore

BlueStore replaced FileStore as the default storage back end for OSDs. FileStore stores objects as files in a file system (Red Hat recommends XFS) on top of a block device. BlueStore stores objects directly on raw block devices and eliminates the file-system layer, which improves read and write operation speeds.

FileStore is deprecated. Continued use of FileStore in RHCS 5 requires a Red Hat support exception. Newly created OSDs, whether by cluster growth or disk replacement, use BlueStore by default.

BlueStore Architecture

Objects that are stored in a Ceph cluster have a cluster-wide unique identifier, binary object data, and object metadata. BlueStore stores the object metadata in the block database. The block database stores metadata as key-value pairs in a RocksDB database, which is a high-performing key-value store.

The block database resides on a small BlueFS partition on the storage device. BlueFS is a minimal file system that is designed to hold the RocksDB files.

BlueStore writes data to block storage devices by using a write-ahead log (WAL). The write-ahead log performs a journaling function and records all transactions.

Figure 4.1: BlueStore OSD layout

BlueStore Performance

FileStore writes first to a journal and then from the journal to the block device. BlueStore avoids this double-write penalty by writing data directly to the block device while logging transactions to the write-ahead log in a separate stream. BlueStore write operations are approximately twice as fast as FileStore under similar workloads.

When using a mix of different cluster storage devices, customize BlueStore OSDs to improve performance. When you create a BlueStore OSD, the default is to place the data, block database, and write-ahead log all on the same block device. Many of the performance advantages come from the block database and the write-ahead log, so placing those components on separate, faster devices might improve performance.

Note

Moving the block database or write-ahead log to a separate device improves performance only if that device is faster than the primary storage device. For example, if object data is on HDD devices, you can improve performance by placing the block database on SSD devices and the write-ahead log on NVMe devices.

Use service specification files to define the location of the BlueStore data, block database, and write-ahead log devices. The following example specifies the BlueStore devices for an OSD service.

service_type: osd
service_id: osd_example
placement:
  host_pattern: '*'
data_devices:
  paths:
    - /dev/vda
db_devices:
  paths:
    - /dev/nvme0
wal_devices:
  paths:
    - /dev/nvme1

The BlueStore storage back end provides the following features:

  • Allows use of separate devices for the data, block database, and write-ahead log (WAL).

  • Supports use of virtually any combination of HDD, SSD, and NVMe devices.

  • Operates over raw devices or partitions, eliminating double writes to storage devices, with increased metadata efficiency.

  • Writes all data and metadata with checksums. All read operations are verified with their corresponding checksums before returning to the client.

The following graphs show the performance of BlueStore versus the deprecated FileStore architecture.

Figure 4.2: FileStore versus BlueStore write throughput
Figure 4.3: FileStore versus BlueStore read throughput

BlueStore runs in user space, manages its own cache and database, and can have a lower memory footprint than FileStore. BlueStore uses RocksDB to store key-value metadata. BlueStore is self-tuning by default, but you can manually tune BlueStore parameters if required.

BlueStore writes data in chunks whose size is defined by the bluestore_min_alloc_size parameter, with a default value of 4 KiB. If the data to write is smaller than the chunk size, BlueStore fills the remaining space in the chunk with zeroes. As a recommended practice, set the parameter to the size of the smallest typical write on the raw partition.
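
For example, you can query or set the allocation size with the ceph config command. The following lines are a minimal sketch: the osd.1 target and the 8 KiB value are illustrations only, the device-type-specific variant bluestore_min_alloc_size_hdd is assumed here, and a changed value applies only to OSDs that are provisioned after the change.

[ceph: root@node /]# ceph config get osd.1 bluestore_min_alloc_size_hdd
[ceph: root@node /]# ceph config set osd bluestore_min_alloc_size_hdd 8192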

It is recommended to re-create FileStore OSDs as BlueStore to take advantage of the performance improvements and to maintain Red Hat support.

Introducing BlueStore Database Sharding

BlueStore can limit the size of large omap objects stored in RocksDB and distribute them into multiple column families. This process is known as sharding. When using sharding, the cluster groups keys that have similar access and modification frequency to improve performance and to save disk space. Sharding can alleviate the impacts of RocksDB compaction. RocksDB must reach a certain level of used space before compacting the database, which can affect OSD performance. With sharding, these operations are independent from the used space level, allowing a more precise compaction and minimizing the effect on OSD performance.

Note

Red Hat recommends that the configured space for RocksDB is at least 4% of the data device size.

In Red Hat Ceph Storage 5, sharding is enabled by default. Sharding is not enabled in OSDs from clusters that are migrated from earlier versions.

Use ceph config get to verify whether sharding is enabled for an OSD and to view the current definition.

[ceph: root@node /]# ceph config get osd.1 bluestore_rocksdb_cf
true
[ceph: root@node /]# ceph config get osd.1 bluestore_rocksdb_cfs
m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P

The default values provide good performance in most Ceph use cases. The optimal sharding definition for your production cluster depends on several factors, so Red Hat recommends keeping the default values unless you face significant performance issues. In an upgraded production cluster, weigh the performance benefits of enabling RocksDB sharding against the maintenance effort in a large environment.

You can use the BlueStore administrative tool, ceph-bluestore-tool, to reshard the RocksDB database without reprovisioning OSDs. To reshard an OSD, stop the daemon and pass the new sharding definition with the --sharding option. The --path option refers to the OSD data location, which defaults to /var/lib/ceph/$fsid/osd.$ID/.

[ceph: root@node /]# ceph-bluestore-tool --path <data path> \
  --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" \
  reshard
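
After resharding, you can confirm the definition now in use; for example, the ceph-bluestore-tool show-sharding command prints the sharding definition of the OSD at the given data path (a brief sketch, assuming the same <data path> as above and a stopped OSD daemon).

[ceph: root@node /]# ceph-bluestore-tool --path <data path> show-sharding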

Provisioning BlueStore OSDs

As a storage administrator, you can use the Ceph Orchestrator service to add or remove OSDs in a cluster. To add an OSD, the device must meet the following conditions:

  • The device must not have partitions.

  • The device must not be mounted.

  • The device must have at least 5 GB of space.

  • The device must not contain a Ceph BlueStore OSD.

Use the ceph orch device ls command to list devices across the hosts in the cluster.

[ceph: root@node /]# ceph orch device ls
Hostname  Path      Type  Serial                Size  Health   Ident  Fault  Available
nodea     /dev/vda  hdd   00000000-0000-0000-a  20G   Unknown  N/A    N/A    Yes
nodea     /dev/vdb  hdd   00000000-0000-0000-b  20G   Unknown  N/A    N/A    Yes
nodeb     /dev/vda  hdd   00000000-0000-0001-a  20G   Unknown  N/A    N/A    Yes
nodeb     /dev/vdb  hdd   00000000-0000-0001-b  20G   Unknown  N/A    N/A    Yes

Devices with Yes in the Available column are candidates for OSD provisioning. To view only in-use storage devices, use the ceph device ls command.

Note

Some devices might not be eligible for OSD provisioning. Use the --wide option to view the details of why the cluster rejects the device.
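
For example, the extended listing includes the reasons why each ineligible device was rejected:

[ceph: root@node /]# ceph orch device ls --wide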

To prepare a device for provisioning, use the ceph orch device zap command. This command removes all partitions and purges the data on the device so that it can be used for provisioning. Use the --force option to ensure the removal of any partitions that a previous OSD might have created.

[ceph: root@node /]# ceph orch device zap node /dev/vda --force

Reviewing BlueStore Provisioning Methods

In RHCS 5, cephadm is the recommended tool to provision and manage OSDs. It uses the ceph-volume utility in the background for OSD operations. The cephadm tool might not see manual operations that use ceph-volume. It is recommended to limit manual ceph-volume OSD use cases to troubleshooting.

There are multiple ways to provision OSDs with cephadm. Choose the method that matches the intended cluster behavior.

Orchestrator-Managed Provisioning

The Orchestrator service can discover available devices among cluster hosts, add the devices, and create the OSD daemons. The Orchestrator handles placement of the new OSDs, balancing them across hosts, and also selects the BlueStore devices.

Use the ceph orch apply osd --all-available-devices command to provision all available, unused devices.

[ceph: root@node /]# ceph orch apply osd --all-available-devices

This command creates an OSD service called osd.all-available-devices and enables the Orchestrator service to manage all OSD provisioning. The Orchestrator automatically creates OSDs from both new disk devices in the cluster and from existing devices that are prepared with the ceph orch device zap command.
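
To verify the result, you can list the OSD services that the Orchestrator manages and review where the OSD daemons were placed; a minimal sketch follows, and the actual output depends on your cluster.

[ceph: root@node /]# ceph orch ls osd
[ceph: root@node /]# ceph osd tree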

To disable the Orchestrator from automatically provisioning OSDs, set the unmanaged flag to true.

[ceph: root@node /]# ceph orch apply osd --all-available-devices --unmanaged=true
Scheduled osd.all-available-devices update...

Note

You can also update the unmanaged flag with a service specification file.
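
The following is a minimal sketch of such a specification; the filter values are assumptions and must match your existing osd.all-available-devices service definition. Apply the file with the ceph orch apply -i command.

service_type: osd
service_id: all-available-devices
placement:
  host_pattern: '*'
data_devices:
  all: true
unmanaged: true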

Specific Target Provisioning

You can create OSD daemons by using a specific device and host. To create a single OSD daemon with a specific host and storage device, use the ceph orch daemon add command.

[ceph: root@node /]# ceph orch daemon add osd node:/dev/vdb
Created osd(s) 12 on host 'node'

To stop an OSD daemon, use the ceph orch daemon stop command with the OSD ID.

[ceph: root@node /]# ceph orch daemon stop osd.12

To remove an OSD daemon, use the ceph orch daemon rm command with the OSD ID.

[ceph: root@node /]# ceph orch daemon rm osd.12
Removed osd.12 from host 'node'

To release an OSD ID, use the ceph osd rm command.

[ceph: root@node /]# ceph osd rm 12
removed osd.12
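
To confirm that the OSD was removed and its ID released, you can review the OSD map; for example (the exact output depends on your cluster):

[ceph: root@node /]# ceph osd tree
[ceph: root@node /]# ceph osd stat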

Service Specification Provisioning

Use service specification files to describe the cluster layout for OSD services. You can customize the service provisioning with filters. With filters, you can configure the OSD service without knowing the specific hardware architecture. This method is useful when automating cluster bootstrap and maintenance windows.

The following is an example service specification YAML file that defines two OSD services, each using different filters for placement and BlueStore device location.

service_type: osd
service_id: osd_size_and_model
placement:
  host_pattern: '*'
data_devices:
  size: '100G:'
db_devices:
  model: My-Disk
wal_devices:
  size: '10G:20G'
unmanaged: true
---
service_type: osd
service_id: osd_host_and_path
placement:
  host_pattern: 'node[6-10]'
data_devices:
  paths:
    - /dev/sdb
db_devices:
  paths:
    - /dev/sdc
wal_devices:
  paths:
    - /dev/sdd
encrypted: true

The osd_size_and_model service specifies that any host can be used for placement and that the service is managed by the storage administrator rather than the Orchestrator. The data device must be 100 GB or larger, the write-ahead log device must be between 10 GB and 20 GB, and the database device must match the My-Disk model.

The osd_host_and_path service specifies that OSDs are provisioned on hosts node6 through node10 and that the service is managed by the Orchestrator service. The device paths for data, database, and write-ahead log must be /dev/sdb, /dev/sdc, and /dev/sdd, respectively. The devices in this service are encrypted.

Run the ceph orch apply command to apply the service specification.

[ceph: root@node /]# ceph orch apply -i service_spec.yaml
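
To preview the outcome without deploying anything, you can add the --dry-run option, which reports the OSDs that the specification would create; a brief sketch, assuming the same service_spec.yaml file.

[ceph: root@node /]# ceph orch apply -i service_spec.yaml --dry-run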

Other OSD Utilities

The ceph-volume command is a modular tool to deploy logical volumes as OSDs. It uses a plug-in type framework. The ceph-volume utility supports the lvm plug-in and raw physical disks. It can also manage devices that are provisioned with the legacy ceph-disk utility.

Use the ceph-volume lvm command to manually create and delete BlueStore OSDs. The following command creates a new BlueStore OSD on block storage device /dev/vdc:

[ceph: root@node /]# ceph-volume lvm create --bluestore --data /dev/vdc

An alternative to the create subcommand is to use the ceph-volume lvm prepare and ceph-volume lvm activate subcommands. With this method, OSDs are introduced into the cluster gradually. You control when the new OSDs become up and in, so you can ensure that large amounts of data are not unexpectedly rebalanced across OSDs.

The prepare subcommand configures logical volumes for the OSD to use. You can specify a logical volume or a device name. If you specify a device name, then a logical volume is automatically created.

[ceph: root@node /]# ceph-volume lvm prepare --bluestore --data /dev/vdc

The activate subcommand enables a systemd unit for the OSD so that it starts at boot time. You need the OSD fsid (UUID) from the output of the ceph-volume lvm list command to use the activate subcommand. Providing the unique identifier ensures that the correct OSD is activated, because OSD IDs can be reused.

[ceph: root@node /]# ceph-volume lvm activate <osd-fsid>
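
To find the OSD fsid, list the logical volumes that ceph-volume manages; the output includes the osd id and osd fsid for each OSD on the node.

[ceph: root@node /]# ceph-volume lvm list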

When the OSD is created, use the systemctl start ceph-osd@$id command to start the OSD so it has the up and in state in the cluster.

The batch subcommand creates multiple OSDs at the same time.

[ceph: root@node /]# ceph-volume lvm batch \
--bluestore /dev/vdc /dev/vdd /dev/nvme0n1

The inventory subcommand provides information about all physical storage devices on a node.

[ceph: root@node /]# ceph-volume inventory

 

References

For more information, refer to the Red Hat Ceph Storage 5 Architecture Guide at https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-single/architecture_guide/index

For more information, refer to the BlueStore chapter in the Red Hat Ceph Storage 5 Administration Guide at https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-single/administration_guide/osd-bluestore

For more information, refer to the Advanced service specifications and filters for deploying OSDs chapter in the Red Hat Ceph Storage 5 Operations Guide at https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-single/operations_guide/index#advanced-service-specifications-and-filters-for-deploying-osds_ops