Abstract

Goal: Create and manage the components that comprise the object storage cluster, including OSDs, pools, and the cluster authorization method.

Creating Object Storage Cluster Components
After completing this section, you should be able to describe OSD configuration scenarios and create BlueStore OSDs using cephadm.
BlueStore replaced FileStore as the default storage back end for OSDs. FileStore stores objects as files in a file system (Red Hat recommends XFS) on top of a block device. BlueStore stores objects directly on raw block devices and eliminates the file-system layer, which improves read and write operation speeds.
Objects that are stored in a Ceph cluster have a cluster-wide unique identifier, binary object data, and object metadata. BlueStore stores the object metadata in the block database. The block database stores metadata as key-value pairs in a RocksDB database, which is a high-performing key-value store.
The block database resides on a small BlueFS partition on the storage device. BlueFS is a minimal file system that is designed to hold the RocksDB files.
BlueStore writes data to block storage devices by utilizing the write-ahead log (WAL). The write-ahead log performs a journaling function and logs all transactions.
FileStore writes to a journal and then writes from the journal to the block device. BlueStore avoids this double-write performance penalty by writing data directly to the block device while logging transactions to the write-ahead log as a separate stream. BlueStore write operations are approximately twice as fast as FileStore with similar workloads.
When using a mix of different cluster storage devices, customize BlueStore OSDs to improve performance. When you create a BlueStore OSD, the default is to place the data, block database, and write-ahead log all on the same block device. Many of the performance advantages come from the block database and the write-ahead log, so placing those components on separate, faster devices might improve performance.
Moving the block database or write-ahead log improves performance only when the target device is faster than the primary data device. For example, if object data resides on HDD devices, you can improve performance by placing the block database on SSD devices and the write-ahead log on NVMe devices.
Use service specification files to define the location of the BlueStore data, block database, and write-ahead log devices. The following example specifies the BlueStore devices for an OSD service.
service_type: osd
service_id: osd_example
placement:
  host_pattern: '*'
data_devices:
  paths:
    - /dev/vda
db_devices:
  paths:
    - /dev/nvme0
wal_devices:
  paths:
    - /dev/nvme1

The BlueStore storage back end provides the following features:
Allows use of separate devices for the data, block database, and write-ahead log (WAL).
Supports use of virtually any combination of HDD, SSD, and NVMe devices.
Operates over raw devices or partitions, eliminating double writes to storage devices, with increased metadata efficiency.
Writes all data and metadata with checksums. All read operations are verified with their corresponding checksums before returning to the client.
BlueStore runs in user space, manages its own cache and database, and can have a lower memory footprint than FileStore. BlueStore uses RocksDB to store key-value metadata. BlueStore is self-tuning by default, but you can manually tune BlueStore parameters if required.
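For example, you can adjust standard OSD options, such as osd_memory_target and bluestore_cache_autotune, centrally with the ceph config command. The values shown here are only illustrative (6 GiB expressed in bytes), not recommendations:
[ceph: root@node /]# ceph config set osd osd_memory_target 6442450944
[ceph: root@node /]# ceph config set osd bluestore_cache_autotune false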
The BlueStore partition writes data in chunks of the size of the bluestore_min_alloc_size parameter.
The default value is 4 KiB.
If the data to write is less than the size of the chunk, BlueStore fills the remaining space of the chunk with zeroes.
It is a recommended practice to set the parameter to the size of the smallest typical write on the raw partition.
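For example, you can view and change the per-device-class allocation size options, bluestore_min_alloc_size_hdd and bluestore_min_alloc_size_ssd, with the ceph config command. The 8 KiB value shown here is only illustrative, and a new value applies only to OSDs that are created after the change:
[ceph: root@node /]# ceph config get osd bluestore_min_alloc_size_hdd
[ceph: root@node /]# ceph config set osd bluestore_min_alloc_size_hdd 8192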
It is recommended to re-create FileStore OSDs as BlueStore to take advantage of the performance improvements and to maintain Red Hat support.
BlueStore can limit the size of large omap objects stored in RocksDB and distribute them into multiple column families. This process is known as sharding. When using sharding, the cluster groups keys that have similar access and modification frequency to improve performance and to save disk space. Sharding can alleviate the impacts of RocksDB compaction. RocksDB must reach a certain level of used space before compacting the database, which can affect OSD performance. With sharding, these operations are independent from the used space level, allowing a more precise compaction and minimizing the effect on OSD performance.
Red Hat recommends that the configured space for RocksDB is at least 4% of the data device size.
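For example, an OSD with a 4 TB data device would reserve at least 160 GB for its block database.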
In Red Hat Ceph Storage 5, sharding is enabled by default. However, sharding is not enabled on OSDs in clusters that were migrated from earlier versions.
Use ceph config get to verify whether sharding is enabled for an OSD and to view the current definition.
[ceph: root@node /]# ceph config get osd.1 bluestore_rocksdb_cf
true
[ceph: root@node /]# ceph config get osd.1 bluestore_rocksdb_cfs
m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P
The default values result in good performance in most Ceph use cases. The optimal sharding definition for a production cluster depends on several factors, so Red Hat recommends keeping the default values unless you face significant performance issues. In a cluster that was upgraded from a previous release, weigh the performance benefits of enabling RocksDB sharding against the maintenance effort that resharding requires in a large environment.
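If you decide to customize the definition for new OSDs, one approach (a sketch; existing OSDs keep their current layout until you reshard them) is to set the bluestore_rocksdb_cfs option centrally so that newly created OSDs pick it up:
[ceph: root@node /]# ceph config set osd bluestore_rocksdb_cfs "m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P"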
You can use the BlueStore administrative tool, ceph-bluestore-tool, to reshard the RocksDB database without reprovisioning OSDs.
To reshard an OSD, stop the daemon and pass the new sharding definition with the --sharding option. The --path option refers to the OSD data location, which defaults to /var/lib/ceph/$fsid/osd.$ID/.
[ceph: root@node /]# ceph-bluestore-tool --path <data path> \
--sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" \
reshard

As a storage administrator, you can use the Ceph Orchestrator service to add or remove OSDs in a cluster. To add an OSD, the device must meet the following conditions:
The device must not have partitions.
The device must not be mounted.
The device must have at least 5 GB of space.
The device must not contain a Ceph BlueStore OSD.
Use the ceph orch device ls command to list devices across the hosts in the cluster.
[ceph: root@node /]# ceph orch device ls
Hostname Path Type Serial Size Health Ident Fault Available
nodea /dev/vda hdd 00000000-0000-0000-a 20G Unknown N/A N/A Yes
nodea /dev/vdb hdd 00000000-0000-0000-b 20G Unknown N/A N/A Yes
nodeb /dev/vda hdd 00000000-0000-0001-a 20G Unknown N/A N/A Yes
nodeb /dev/vdb hdd 00000000-0000-0001-b 20G Unknown N/A N/A Yes

Devices with Yes in the Available column are candidates for OSD provisioning.
To view only in-use storage devices, use the ceph device ls command.
Some devices might not be eligible for OSD provisioning.
Use the --wide option to view the details of why the cluster rejects the device.
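For example, the following commands (output omitted) list the devices that daemons currently use and display the reasons why the cluster rejects unavailable devices:
[ceph: root@node /]# ceph device ls
[ceph: root@node /]# ceph orch device ls --wide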
To prepare a device for provisioning, use the ceph orch device zap command.
This command removes all partitions and purges the data in the device so it can be used for provisioning.
Use the --force option to ensure the removal of any partition that a previous OSD might have created.
[ceph: root@node /]# ceph orch device zap node /dev/vda --force

In RHCS 5, cephadm is the recommended tool to provision and manage OSDs.
It uses the ceph-volume utility in the background for OSD operations.
The cephadm tool might not see manual operations that use ceph-volume.
It is recommended to limit manual ceph-volume OSD use cases to troubleshooting.
There are multiple ways to provision OSDs with cephadm.
Choose the method that matches the intended cluster behavior.
Orchestrator-Managed Provisioning
The Orchestrator service can discover available devices among cluster hosts, add the devices, and create the OSD daemons. The Orchestrator balances the placement of new OSDs across the hosts and handles the BlueStore device selection.
Use the ceph orch apply osd --all-available-devices command to provision all available, unused devices.
[ceph: root@node /]# ceph orch apply osd --all-available-devices

This command creates an OSD service called osd.all-available-devices and enables the Orchestrator service to manage all OSD provisioning.
The Orchestrator automatically creates OSDs from both new disk devices in the cluster and from existing devices that are prepared with the ceph orch device zap command.
To disable the Orchestrator from automatically provisioning OSDs, set the unmanaged flag to true.
[ceph: root@node /]# ceph orch apply osd --all-available-devices --unmanaged=true
Scheduled osd.all-available-devices update...

You can also update the unmanaged flag with a service specification file.
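For example, a specification similar to the following sketch sets the flag for the osd.all-available-devices service; the file name is arbitrary, and you apply the file with ceph orch apply -i:
service_type: osd
service_id: all-available-devices
placement:
  host_pattern: '*'
data_devices:
  all: true
unmanaged: true

[ceph: root@node /]# ceph orch apply -i osd_unmanaged.yaml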
Specific Target Provisioning
You can create OSD daemons by using a specific device and host.
To create a single OSD daemon with a specific host and storage device, use the ceph orch daemon add command.
[ceph: root@node /]# ceph orch daemon add osd node:/dev/vdb
Created osd(s) 12 on host 'node'

To stop an OSD daemon, use the ceph orch daemon stop command with the OSD ID.
[ceph: root@node /]# ceph orch daemon stop osd.12

To remove an OSD daemon, use the ceph orch daemon rm command with the OSD ID.
[ceph: root@node /]# ceph orch daemon rm osd.12
Removed osd.12 from host 'node'

To release an OSD ID, use the ceph osd rm command.
[ceph: root@node /]# ceph osd rm 12
removed osd.12

Service Specification Provisioning
Use service specification files to describe the cluster layout for OSD services. You can customize the service provisioning with filters. With filters, you can configure the OSD service without knowing the specific hardware architecture. This method is useful when automating cluster bootstrap and maintenance windows.
The following is an example service specification YAML file that defines two OSD services, each using different filters for placement and BlueStore device location.
service_type: osd
service_id: osd_size_and_model
placement:
  host_pattern: '*'
data_devices:
  size: '100G:'
db_devices:
  model: My-Disk
wal_devices:
  size: '10G:20G'
unmanaged: true
---
service_type: osd
service_id: osd_host_and_path
placement:
  host_pattern: 'node[6-10]'
data_devices:
  paths:
    - /dev/sdb
db_devices:
  paths:
    - /dev/sdc
wal_devices:
  paths:
    - /dev/sdd
encrypted: true
The osd_size_and_model service specifies that any host can be used for placement and the service will be managed by the storage administrator.
The data device must be 100 GB or larger, and the write-ahead log device must be between 10 GB and 20 GB.
The database device must be of the My-Disk model.
The osd_host_and_path service specifies that OSDs must be provisioned on hosts node6 through node10 and that the Orchestrator service manages the service.
The device paths for data, database, and write-ahead log must be /dev/sdb, /dev/sdc, and /dev/sdd.
The devices in this service will be encrypted.
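Before applying a specification, you can preview which OSDs the Orchestrator would create by adding the --dry-run option to ceph orch apply; no changes are made to the cluster:
[ceph: root@node /]# ceph orch apply -i service_spec.yaml --dry-run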
Run the ceph orch apply command to apply the service specification.
[ceph: root@node /]# ceph orch apply -i service_spec.yaml

The ceph-volume command is a modular tool to deploy logical volumes as OSDs.
It uses a plug-in type framework.
The ceph-volume utility supports the lvm plug-in and raw physical disks.
It can also manage devices that are provisioned with the legacy ceph-disk utility.
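For example, if your release includes the simple plug-in, you can scan a legacy ceph-disk OSD data partition and activate the discovered OSDs; the /dev/sdb1 path here is only an illustration:
[ceph: root@node /]# ceph-volume simple scan /dev/sdb1
[ceph: root@node /]# ceph-volume simple activate --all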
Use the ceph-volume lvm command to manually create and delete BlueStore OSDs.
The following command creates a new BlueStore OSD on block storage device /dev/vdc:
[ceph: root@node /]# ceph-volume lvm create --bluestore --data /dev/vdc

An alternative to the create subcommand is to use the ceph-volume lvm prepare and ceph-volume lvm activate subcommands.
With this method, OSDs are gradually introduced into the cluster.
You can control when the new OSDs are in the up or in state, so you can ensure that large amounts of data are not unexpectedly rebalanced across OSDs.
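One way to prevent new OSDs from being marked in automatically (an illustration, not the only approach) is to set the noin cluster flag before activating the OSDs, mark each OSD in when you are ready, and then unset the flag:
[ceph: root@node /]# ceph osd set noin
[ceph: root@node /]# ceph osd in <osd-id>
[ceph: root@node /]# ceph osd unset noin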
The prepare subcommand configures logical volumes for the OSD to use.
You can specify a logical volume or a device name.
If you specify a device name, then a logical volume is automatically created.
[ceph: root@node /]# ceph-volume lvm prepare --bluestore --data /dev/vdc

The activate subcommand enables a systemd unit for the OSD so that it starts at boot time.
You need the OSD fsid (UUID) from the output of the ceph-volume lvm list command to use the activate subcommand.
Providing the unique identifier ensures that the correct OSD is activated, because OSD IDs can be reused.
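For example, run ceph-volume lvm list to display each prepared OSD together with its osd id and osd fsid values:
[ceph: root@node /]# ceph-volume lvm list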
[ceph: root@node /]# ceph-volume lvm activate <osd-fsid>

The batch subcommand creates multiple OSDs at the same time.
[ceph: root@node /]# ceph-volume lvm batch \
--bluestore /dev/vdc /dev/vdd /dev/nvme0n1

The inventory subcommand provides information about all physical storage devices on a node.
[ceph: root@node /]# ceph-volume inventory

For more information, refer to the Red Hat Ceph Storage 5 Architecture Guide at https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-single/architecture_guide/index
For more information, refer to the BlueStore chapter in the Red Hat Ceph Storage 5 Administration Guide at https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-single/administration_guide/osd-bluestore
For more information, refer to the Advanced service specifications and filters for deploying OSDs chapter in the Red Hat Ceph Storage 5 Operations Guide at https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-single/operations_guide/index#advanced-service-specifications-and-filters-for-deploying-osds_ops