Abstract
| Goal | Identify the key Ceph cluster performance metrics, and use them to tune and troubleshoot Ceph operations for optimal performance. |
Tuning and Troubleshooting Red Hat Ceph Storage
After completing this section, you should be able to select appropriate Red Hat Ceph Storage architectures for different scenarios and use Red Hat Ceph Storage-specific performance analysis tools to optimize cluster deployments.
Performance tuning is the process of tailoring system configurations so that specific, critical applications achieve the best possible response time or throughput. Performance tuning for a Ceph cluster focuses on three metrics: latency, IOPS (input/output operations per second), and throughput.
It is a common misconception that disk latency and response time are the same thing. Disk latency is a function of the device, but response time is measured as a function of the entire server.
For hard drives using spinning platters, disk latency has two components:
Seek time: The time it takes to position the drive heads on the correct track on the platter, typically 0.2 to 0.8 ms.
Rotational latency: The additional time it takes for the correct starting sector on that track to pass under the drive heads, typically a few milliseconds.
After the drive has positioned the heads, it can start transferring data from the platter. At that point, the sequential data transfer rate is important.
For solid-state drives (SSDs), the equivalent metric is the random access latency of the storage device, which is typically less than a millisecond. For non-volatile memory express drives (NVMes), the random access latency of the storage drive is typically in microseconds.
The number of read and write requests that the system can process per second depends on the capabilities of the storage device and on the application. When an application issues an I/O request, the operating system transfers the request to the device and waits until the request completes. As a reference, hard drives using spinning platters achieve between 50 and 200 IOPS, SSDs deliver on the order of thousands to hundreds of thousands of IOPS, and NVMe drives achieve hundreds of thousands of IOPS.
Throughput refers to the actual number of bytes per second that the system can read or write. The block size and the data transfer rate affect throughput. The larger the block size, the less the per-request latency affects overall throughput. The higher the data transfer rate, the faster a disk can transfer data from its surface to a buffer.
As reference values, hard drives using spinning platters deliver a throughput of around 150 MB/s, SSDs around 500 MB/s, and NVMe drives on the order of 2,000 MB/s.
You can measure throughput for networks and the whole system, from a remote client to a server.
The hardware you use determines the performance limits of your system and your Ceph cluster. The objective of tuning performance is to use your hardware as efficiently as possible.
It is a common observation that tuning a specific subsystem can adversely affect the performance of another. For example, you can tune your system for low latency at the expense of high throughput. Therefore, before starting to tune, establish your goals to align with the expected workload of your Ceph cluster:
Workloads on block devices are often IOPS intensive, for example, databases running on virtual machines in OpenStack. Typical deployments require high-performance SAS drives for storage and journals placed on SSDs or NVMe devices.
Workloads on a RADOS Gateway are often throughput intensive. Objects can store significant amounts of data, such as audio and video content.
Workloads that require the ability to store a large quantity of data as inexpensively as possible usually trade performance for price. Selecting less-expensive and slower SATA drives is the solution for this kind of workload.
Depending on your workload, tuning objectives should include:
Reduce latency
Increase IOPS at the device
Increase block size
The following section describes recommended practices for tuning Ceph.
It is important to plan a Ceph cluster deployment correctly. MON performance is critical for overall cluster performance. For large deployments, place MONs on dedicated nodes. To maintain a correct quorum, deploy an odd number of MONs.
Designed to handle large quantities of data, Ceph can achieve improved performance if the correct hardware is used and the cluster is tuned correctly.
After the cluster installation, begin continuous monitoring of the cluster to troubleshoot failures and schedule maintenance activities. Although Ceph has significant self-healing abilities, many types of failure events require rapid notification and human intervention. Should performance issues occur, begin troubleshooting at the disk, network, and hardware level. Then, continue with diagnosing RADOS block devices and the Ceph RADOS Gateways.
Use SSDs or NVMes to maximize efficiency when writing to the BlueStore block database and write-ahead log (WAL). An OSD can have its data, block database, and WAL collocated on the same storage device, or non-collocated, with separate devices for each of these components.
In a typical deployment, OSDs use traditional spinning disks with high latency because they provide satisfactory metrics that meet defined goals at a lower cost per megabyte. By default, BlueStore OSDs place the data, block database, and WAL on the same block device. However, you can maximize the efficiency by using separate low latency SSDs or NVMe devices for the block database and WAL. Multiple block databases and WALs can share the same SSD or NVMe device, reducing the cost of the storage infrastructure.
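With cephadm, one way to express this layout is an OSD service specification that sends data to rotational drives and the block database (and WAL) to non-rotational drives. The following is a minimal sketch; the service ID, host pattern, and device selectors are illustrative assumptions, not required values:
[admin@node ~]$ cat > osd-hdd-nvme.yml << 'EOF'
# Illustrative spec: rotational devices hold the data,
# non-rotational (SSD/NVMe) devices hold block.db and the WAL
service_type: osd
service_id: hdd_with_nvme_db
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0
EOF
[admin@node ~]$ ceph orch apply -i osd-hdd-nvme.yml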
Consider the impact of the following SSD specifications against the expected workload:
Mean Time Between Failures (MTBF) for the number of supported writes
IOPS capabilities
Data transfer rate
Bus/SSD couple capabilities
When an SSD or NVMe device that hosts journals fails, every OSD using it to host its journal also becomes unavailable. Consider this when deciding how many block databases or WALs to place on the same storage device.
Workloads on a RADOS Gateway are often throughput-intensive. Audio and video materials being stored as objects can be large. However, the bucket index pool typically displays a more I/O-intensive workload pattern. Store the index pools on SSD devices.
The RADOS Gateway maintains one index per bucket. By default, Ceph stores this index in one RADOS object. When a bucket stores more than 100,000 objects, the index performance degrades because the single index object becomes a bottleneck.
Ceph can keep large indexes in multiple RADOS objects, or shards.
Enable this feature by setting the rgw_override_bucket_index_max_shards parameter.
The recommended value is the number of objects expected in a bucket divided by 100,000.
As the index grows, Ceph must regularly reshard the bucket.
Red Hat Ceph Storage provides a bucket index automatic resharding feature.
The rgw_dynamic_resharding parameter, set to true by default, controls this feature.
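For example, if you expect buckets of roughly 500,000 objects, the guidance above suggests five shards per bucket index. A minimal sketch, assuming the centralized configuration database and a generic client.rgw target (your RADOS Gateway instances might use a more specific target name):
[admin@node ~]$ ceph config set client.rgw rgw_override_bucket_index_max_shards 5
[admin@node ~]$ ceph config set client.rgw rgw_dynamic_resharding true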
The metadata pool, which holds the directory structure and other indexes, can become a CephFS bottleneck. To minimize this limitation, use SSD devices for the metadata pool.
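One way to achieve this is to create a replicated CRUSH rule limited to the ssd device class and assign it to the metadata pool. The rule name and the cephfs_metadata pool name below are assumptions for illustration:
[admin@node ~]$ ceph osd crush rule create-replicated ssd-rule default host ssd
[admin@node ~]$ ceph osd pool set cephfs_metadata crush_rule ssd-rule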
Each MDS maintains a cache in memory for different kinds of items, such as inodes.
Ceph limits the size of this cache with the mds_cache_memory_limit parameter.
Its default value, expressed in absolute bytes, is equal to 4 GB.
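For example, to raise the cache limit to 8 GB for all MDS daemons (the value is illustrative and is given in bytes):
[admin@node ~]$ ceph config set mds mds_cache_memory_limit 8589934592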
The total number of PGs in a cluster can impact overall performance due to unnecessary CPU and RAM activity on some OSD nodes. Red Hat recommends validating PG allocation for each pool before putting a cluster into production. Also consider specific testing of the backfill and recovery impact on client I/O requests.
There are two important values:
The overall number of PGs in the cluster
The number of PGs for a specific pool
Use this formula to estimate how many PGs should be available for a single, specific pool:
Total Placement Groups = (OSDs * 100) / Number of replicas
Apply the formula for each pool to get the total number of PGs for the cluster. Red Hat recommends between 100 and 200 PGs per OSD.
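For example, a pool with three replicas on a cluster of 100 OSDs works out to (100 × 100) / 3 ≈ 3333 PGs, which you would then round to a nearby power of two, such as 4096; these figures are illustrative only.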
Red Hat provides the Ceph Placement Groups (PGs) per Pool Calculator to recommend the number of PGs per pool, at https://access.redhat.com/labs/cephpgc/.
Ceph supports increasing or decreasing the number of PGs in a pool. If you do not specify a value when creating a pool, it is created with a default value of 8 PGs, which is very low.
The pg_autoscale_mode property allows Ceph to make recommendations and automatically adjust the pg_num and pgp_num parameters.
This option is enabled by default when creating a new pool.
The pg_num parameter defines the number of PGs for a specific pool.
The pgp_num parameter defines the number of PGs that the CRUSH algorithm considers for placement.
Red Hat recommends that you make incremental increases in the number of placement groups until you reach the desired number of PGs. Increasing the number of PGs by a significant amount can cause cluster performance degradation, because the expected data relocation and rebalancing is intensive.
Use the ceph osd pool set command to manually increase or decrease the number of PGs by setting the pg_num parameter.
You should only increase the number of PGs in a pool by small increments when doing it manually with the pg_autoscale_mode option disabled.
Setting the total number of placement groups to a number that is a power of 2 provides better distribution of the PGs across the OSDs.
Increasing the pg_num parameter automatically increases the pgp_num parameter, but at a gradual rate to minimize the impact on cluster performance.
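For example, to grow a hypothetical pool named mypool to 64 PGs in a single small step, with autoscaling disabled for that pool:
[admin@node ~]$ ceph osd pool set mypool pg_autoscale_mode off
[admin@node ~]$ ceph osd pool set mypool pg_num 64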
Red Hat Ceph Storage can merge two PGs into a larger PG, reducing the total number of PGs. Merging can be useful when the number of PGs in a pool is too large and performance is degraded. Because merging is a complex process, merge only one PG at a time to minimize the impact on cluster performance.
As discussed, the PG autoscale feature allows Ceph to make recommendations and automatically adjust the number of PGs. This feature is enabled by default when creating a pool. For existing pools, configure autoscaling with this command:
[admin@node ~]$ ceph osd pool set pool-name pg_autoscale_mode mode

Set the mode parameter to off to disable it, on to enable it and allow Ceph to automatically make adjustments in the number of PGs, or warn to raise health alerts when the number of PGs must be adjusted.
View the information provided by the autoscale module:
[admin@node ~]$ ceph osd pool autoscale-status
POOL SIZE TARGET SIZE RATE RAW CAPACITY RATIO
pool1 0 3.0 92124M 0.0000
pool2 1323 3.0 92124M 0.0000
pool3 3702 3.0 92124M 0.0000
TARGET RATIO EFFECTIVE RATIO BIAS PG_NUM NEW PG_NUM AUTOSCALE
1.0 1 on
1.0 32 on
1.0 32 64 warn

The previous table is split into two parts to make it easier to read on this page.
The first two pools have the AUTOSCALE feature set to on, with Ceph automatically adjusting the number of PGs.
The third pool is configured to provide a health alert if the number of PGs needs adjusting.
The PG_NUM parameter is the current number of PGs in each pool or the number of PGs that the pool is working towards.
The NEW PG_NUM parameter is the number of PGs that Ceph recommends to set in the pool.
When designing a Ceph cluster, consider scaling choices to match your future data requirements and to facilitate sufficient throughput with the correct network size and architecture.
You can scale clustered storage in two ways:
Scale out by adding more nodes to a cluster.
Scale up by adding more resources to existing nodes.
Scaling up requires that nodes can accept more CPU and RAM resources to handle an increase in the number of disks and disk size. Scaling out requires adding nodes with similar resources and capacity to match the cluster's existing nodes for balanced operations.
The network interconnecting the nodes in a Ceph cluster is critical to good performance, because all client and cluster I/O operations use it. Red Hat recommends the following practices:
To increase performance and provide better isolation for troubleshooting, use separate networks for OSD traffic and for client traffic.
At a minimum, use 10 Gbps networks or faster for the storage cluster. 1 Gbps networks are not suitable for production environments.
Evaluate network sizing based on both cluster and client traffic, and the amount of data stored.
Network monitoring is highly recommended.
Use separate NICs to connect to the networks where possible, or else use separate ports.
Figure 12.1, Separate networks for OSD and client traffic, is a representation of such a network architecture.
The Ceph daemons automatically bind to the correct interfaces, such as binding MONs to the public network, and binding OSDs to both public and cluster networks.
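A minimal sketch of declaring the two networks in the centralized configuration; the subnets are placeholders for your own public and cluster ranges:
[admin@node ~]$ ceph config set global public_network 172.25.250.0/24
[admin@node ~]$ ceph config set global cluster_network 192.168.50.0/24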
Use the primary affinity setting to influence Ceph's selection of a specific OSD as the primary OSD for a placement group.
A higher setting makes an OSD more likely to be selected as a primary OSD.
You can mitigate issues or bottlenecks by configuring the cluster to avoid using slow disks or controllers for a primary OSD.
Use the ceph osd primary-affinity command to modify the primary affinity for an OSD.
Affinity is a real number between 0 and 1.
[admin@node ~]$ ceph osd primary-affinity osd-number affinity

When Ceph adds or removes an OSD on a cluster, Ceph rebalances the PGs to use the new OSD or to re-create the replicas that were stored on the removed OSD. These backfilling and recovery operations can generate a high load of cluster network traffic, which can impact performance.
To avoid degraded cluster performance, adjust the backfilling and recovery operations to create a balance between rebalancing and normal cluster operations. Ceph provides parameters to limit the backfilling and recovery operations' I/O and network activity.
The following list includes some of those parameters:
| Parameter | Definition |
|---|---|
| osd_recovery_op_priority | Priority for recovery operations |
| osd_recovery_max_active | Maximum number of active recovery requests per OSD, in parallel |
| osd_recovery_threads | Number of threads for data recovery |
| osd_max_backfills | Maximum number of backfills for an OSD |
| osd_backfill_scan_min | Minimum number of objects per backfill scan |
| osd_backfill_scan_max | Maximum number of objects per backfill scan |
| osd_backfill_full_ratio | Threshold above which an OSD refuses backfill requests |
| osd_backfill_retry_interval | Number of seconds to wait before retrying backfill requests |
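For example, to throttle backfill and recovery while client traffic is heavy, you might lower two of these values; the settings shown are illustrative, not Red Hat recommendations, and you would restore the defaults after rebalancing completes:
[admin@node ~]$ ceph config set osd osd_max_backfills 1
[admin@node ~]$ ceph config set osd osd_recovery_max_active 1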
Using realistic metrics for your cluster's expected workload, build the cluster's hardware configuration to provide sufficient performance, but keep the cost as low as possible. Red Hat suggests these hardware configurations for the three performance priorities:
IOPS optimized:
Use two OSDs per NVMe device.
NVMe drives have data, the block database, and the WAL collocated on the same storage device.
Assuming a 2 GHz CPU, use 10 cores per NVMe device or 2 cores per SSD.
Allocate 16 GB of RAM as a baseline, plus 5 GB per OSD.
Use one 10 GbE NIC per 2 OSDs.
Throughput optimized:
Use one OSD per HDD.
Place the block database and WAL on SSDs or NVMes.
Use HDDs of at least 7,200 RPM.
Assuming a 2 GHz CPU, use one-half core per HDD.
Allocate 16 GB of RAM as a baseline, plus 5 GB per OSD.
Use one 10 GbE NIC per 12 OSDs.
Cost and capacity optimized:
Use one OSD per HDD.
HDDs have data, the block database, and the WAL collocated on the same storage device.
Use HDDs of at least 7,200 RPM.
Assuming a 2 GHz CPU, use one-half core per HDD.
Allocate 16 GB of RAM as a baseline, plus 5 GB per OSD.
Use one 10 GbE NIC per 12 OSDs.
Performance tools provide benchmarking metrics to examine clusters for performance issues.
Each Ceph daemon maintains a set of internal counters and gauges. Several tools are available to access these counters:
The Dashboard plug-in exposes a web interface accessible on port 8443.
Its menus provide basic real-time OSD statistics, for example, the number of read bytes, write bytes, read operations, and write operations.
Enable the Dashboard plug-in by using the ceph mgr module enable dashboard command.
If you bootstrap your cluster with the cephadm bootstrap command, then the dashboard is enabled by default.
The Prometheus plug-in exposes the performance metrics on port 9283, for an external Prometheus server to collect. Prometheus is an open source system monitoring and alerting utility.
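If the Prometheus module is not already active in your cluster, enable it in the same way as the dashboard:
[admin@node ~]$ ceph mgr module enable prometheus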
The ceph command has options to view metrics and change daemon parameters.
Red Hat Ceph Storage provides tools to stress test and benchmark a Ceph cluster.
RADOS bench is a simple tool for testing the RADOS Object Store. It executes write and read tests on your cluster and provides statistics. The general syntax of the command is:
[admin@node ~]$ rados -p pool-name bench seconds write|seq|rand \
-b objsize -t concurrency

These are the common parameters for the tool:
The seq and rand tests are sequential and random read benchmarks.
These tests require that a write benchmark is run first with the --no-cleanup option.
By default, RADOS bench removes the objects created for the writing test.
The --no-cleanup option keeps the objects, which can be useful for performing multiple tests on the same objects.
The default object size, objsize, is 4 MB.
The default number of concurrent operations, concurrency, is 16.
With the --no-cleanup option, you must manually remove data that remains in the pool after running the rados bench command.
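For example, assuming the write benchmark used a pool named testbench, the cleanup subcommand removes the leftover benchmark objects:
[admin@node ~]$ rados -p testbench cleanup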
For example, the following information is provided by the rados bench command, including throughput, IOPS, and latency:
[ceph: root@server /]# rados bench -p testbench 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_server.example.com_265
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
    1      16        72        56   223.964       224     0.157623    0.241175
...output omitted...
   10      16       715       699   279.551       328     0.120089    0.226616
Total time run:         10.1406
Total writes made:      715
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     282.035
Stddev Bandwidth:       32.1911
Max bandwidth (MB/sec): 328
Min bandwidth (MB/sec): 224
Average IOPS:           70
Stddev IOPS:            8.04777
Max IOPS:               82
Min IOPS:               56
Average Latency(s):     0.225574
Stddev Latency(s):      0.105867
Max latency(s):         0.746961
Min latency(s):         0.0425166
The RBD bench measures I/O throughput and latency on an existing image, which you created for the test.
These are the default values:
If you do not give a suffix to the size arguments, the command assumes bytes as the unit.
The default pool name is rbd.
The default for --io-size is 4096 bytes.
The default for --io-threads is 16.
The default for --io-total is 1 GB.
The default for --io-pattern is seq for sequential.
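Before running the benchmark, create a disposable test image, and remove it when testing is complete; the 1 GB size and the testbench/testimage name are arbitrary choices that match the example below:
[admin@node ~]$ rbd create --size 1G testbench/testimage
[admin@node ~]$ rbd rm testbench/testimage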
For example, the following information is provided by the rbd bench command, including throughput and latency:
[ceph: root@server /]# rbd bench --io-type write testimage --pool=testbench
bench  type write io_size 4096 io_threads 16 bytes 1073741824 pattern sequential
  SEC       OPS   OPS/SEC   BYTES/SEC
    1     47360   46952.6   183 MiB/s
    2     88864   44461.5   174 MiB/s
    3    138016   46025.2   180 MiB/s
    4    178128   44546.4   174 MiB/s
    5    225264   45064.3   176 MiB/s
elapsed: 5   ops: 262144   ops/sec: 45274.6   bytes/sec: 177 MiB/s
sysctl(8), ceph(8), rados(8), and rbd(8) man pages
For more information, refer to the Ceph performance benchmark chapter in the Red Hat Ceph Administration Guide at https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-single/administration_guide/index#ceph-performance-benchmarking