
Chapter 12. Tuning and Troubleshooting Red Hat Ceph Storage

Abstract

Goal: Identify the key Ceph cluster performance metrics, and use them to tune and troubleshoot Ceph operations for optimal performance.
Objectives
  • Choose Red Hat Ceph Storage architecture scenarios and operate Red Hat Ceph Storage-specific performance analysis tools to optimize cluster deployments.

  • Protect OSD and cluster hardware resources from over-utilization by controlling scrubbing, deep scrubbing, backfill, and recovery processes to balance CPU, RAM, and I/O requirements.

  • Identify key tuning parameters and troubleshoot performance for Ceph clients, including RADOS Gateway, RADOS Block Devices, and CephFS.

Sections
  • Optimizing Red Hat Ceph Storage Performance (and Guided Exercise)

  • Tuning Object Storage Cluster Performance (and Guided Exercise)

  • Troubleshooting Clusters and Clients (and Guided Exercise)

Lab

Tuning and Troubleshooting Red Hat Ceph Storage

Optimizing Red Hat Ceph Storage Performance

Objectives

After completing this section, you should be able to choose Red Hat Ceph Storage architecture scenarios and operate Red Hat Ceph Storage-specific performance analysis tools to optimize cluster deployments.

Defining Performance Tuning

Performance tuning is the process of tailoring system configuration so that specific, critical applications achieve the best possible response time or throughput. Performance tuning for a Ceph cluster focuses on three metrics: latency, IOPS (input/output operations per second), and throughput.

Latency

It is a common misconception that disk latency and response time are the same thing. Disk latency is a function of the device, but response time is measured as a function of the entire server.

For hard drives using spinning platters, disk latency has two components:

  • Seek time: The time it takes to position the drive heads on the correct track on the platter, ranging from a fraction of a millisecond for an adjacent track to several milliseconds for a longer seek.

  • Rotational latency: The additional time it takes for the correct starting sector on that track to pass under the drive heads, typically a few milliseconds.

After the drive has positioned the heads, it can start transferring data from the platter. At that point, the sequential data transfer rate is important.

For solid-state drives (SSDs), the equivalent metric is the random access latency of the storage device, which is typically less than a millisecond. For non-volatile memory express drives (NVMes), the random access latency of the storage drive is typically in microseconds.

I/O Operations Per Second (IOPS)

The number of read and write requests that the system can process per second depends on the storage device capabilities and on the application. When an application issues an I/O request, the operating system transfers the request to the device and waits until the request completes. As a reference, hard drives using spinning platters achieve between 50 and 200 IOPS, SSDs are on the order of thousands to hundreds of thousands of IOPS, and NVMe drives achieve hundreds of thousands of IOPS.

Throughput

Throughput refers to the actual number of bytes per second that the system can read or write. The block size and the data transfer rate affect throughput. The larger the block size, the less the per-operation latency limits the achievable throughput. The higher the data transfer rate, the faster a disk can transfer data from its surface to a buffer.

As a reference value, hard drives using spinning platters have a throughput of around 150 MB/s, SSDs are around 500 MB/s, and NVMe drives are on the order of 2,000 MB/s.

You can measure throughput for networks and the whole system, from a remote client to a server.
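For example, you can validate the raw network throughput between a client and a server with a generic tool such as iperf3, independently of Ceph. This is a minimal sketch; the host name server.example.com is used only for illustration:

[admin@server ~]$ iperf3 -s
[admin@client ~]$ iperf3 -c server.example.com -t 30

The bandwidth that iperf3 reports is an upper bound for the throughput that Ceph traffic can achieve over that link.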

Tuning Objectives

The hardware you use determines the performance limits of your system and your Ceph cluster. The objective of tuning performance is to use your hardware as efficiently as possible.

It is a common observation that tuning a specific subsystem can adversely affect the performance of another. For example, you can tune your system for low latency at the expense of high throughput. Therefore, before starting to tune, establish your goals to align with the expected workload of your Ceph cluster:

IOPS optimized

Workloads on block devices are often IOPS intensive, for example, databases running on virtual machines in OpenStack. Typical deployments require high-performance SAS drives for storage and journals placed on SSDs or NVMe devices.

Throughput optimized

Workloads on a RADOS Gateway are often throughput intensive. Objects can store significant amounts of data, such as audio and video content.

Capacity optimized

Workloads that require the ability to store a large quantity of data as inexpensively as possible usually trade performance for price. Selecting less-expensive and slower SATA drives is the solution for this kind of workload.

Depending on your workload, tuning objectives should include:

  • Reduce latency

  • Increase IOPS at the device

  • Increase block size

Optimizing Ceph Performance

The following section describes recommended practices for tuning Ceph.

Ceph Deployment

It is important to plan a Ceph cluster deployment correctly. MON performance is critical for overall cluster performance, so MONs should run on dedicated nodes in large deployments. To maintain a correct quorum, use an odd number of MONs.

Designed to handle large quantities of data, Ceph can achieve improved performance if the correct hardware is used and the cluster is tuned correctly.

After the cluster installation, begin continuous monitoring of the cluster to troubleshoot failures and schedule maintenance activities. Although Ceph has significant self-healing abilities, many types of failure events require rapid notification and human intervention. Should performance issues occur, begin troubleshooting at the disk, network, and hardware level. Then, continue with diagnosing RADOS block devices and the Ceph RADOS Gateways.
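For example, the following commands are a minimal starting point for routine monitoring and initial troubleshooting. They report detailed cluster health, per-OSD capacity usage, and per-OSD commit and apply latencies:

[admin@node ~]$ ceph health detail
[admin@node ~]$ ceph osd df
[admin@node ~]$ ceph osd perf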

Recommendations for OSDs

Use SSDs or NVMe devices to maximize efficiency when writing to the BlueStore block database and write-ahead log (WAL). An OSD might have its data, block database, and WAL collocated on the same storage device, or non-collocated by using separate devices for each of these components.

In a typical deployment, OSDs use traditional spinning disks, which have a higher latency, because they provide satisfactory metrics that meet the defined goals at a lower cost per megabyte. By default, BlueStore OSDs place the data, block database, and WAL on the same block device. However, you can maximize efficiency by using separate low-latency SSD or NVMe devices for the block database and WAL. Multiple block databases and WALs can share the same SSD or NVMe device, reducing the cost of the storage infrastructure.

Consider the impact of the following SSD specifications against the expected workload:

  • Mean Time Between Failures (MTBF) for the number of supported writes

  • IOPS capabilities

  • Data transfer rate

  • Capabilities of the bus and SSD combination

Warning

When an SSD or NVMe device that hosts block databases or WALs fails, every OSD that uses that device also becomes unavailable. Consider this when deciding how many block databases or WALs to place on the same storage device.

Recommendations for Ceph RADOS Gateways

Workloads on a RADOS Gateway are often throughput-intensive. Audio and video materials being stored as objects can be large. However, the bucket index pool typically displays a more I/O-intensive workload pattern. Store the index pools on SSD devices.

The RADOS Gateway maintains one index per bucket. By default, Ceph stores this index in one RADOS object. When a bucket stores more than 100,000 objects, the index performance degrades because the single index object becomes a bottleneck.

Ceph can keep large indexes in multiple RADOS objects, or shards. Enable this feature by setting the rgw_override_bucket_index_max_shards parameter. The recommended value is the number of objects expected in a bucket divided by 100,000.

As the index grows, Ceph must regularly reshard the bucket. Red Hat Ceph Storage provides a bucket index automatic resharding feature. The rgw_dynamic_resharding parameter, set to true by default, controls this feature.
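For example, the following sketch raises the maximum number of index shards for new buckets and then manually reshards an existing bucket. The bucket name mybucket and the shard counts are illustrative assumptions:

[admin@node ~]$ ceph config set client.rgw rgw_override_bucket_index_max_shards 8
[admin@node ~]$ radosgw-admin bucket reshard --bucket=mybucket --num-shards=8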

Recommendations for CephFS

The metadata pool, which holds the directory structure and other indexes, can become a CephFS bottleneck. To minimize this limitation, use SSD devices for the metadata pool.

Each MDS maintains a cache in memory for different kinds of items, such as inodes. Ceph limits the size of this cache with the mds_cache_memory_limit parameter. Its default value, expressed in absolute bytes, is equal to 4 GB.
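For example, to raise the MDS cache limit to 8 GB, a value chosen here only for illustration, set the parameter in bytes:

[admin@node ~]$ ceph config set mds mds_cache_memory_limit 8589934592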

Placement Group Algebra

The total number of PGs in a cluster can impact overall performance, due to unnecessary CPU and RAM activity on some OSD nodes. Red Hat recommends validating PG allocation for each pool before putting a cluster into production. Also consider specific testing of the backfill and recovery impact on client I/O requests.

There are two important values:

  • The overall number of PGs in the cluster

  • The number of PGs for a specific pool

Use this formula to estimate how many PGs should be available for a single, specific pool:

Total Placement Groups = (OSDs * 100) / Number of replicas

Apply the formula for each pool to get the total number of PGs for the cluster. Red Hat recommends between 100 and 200 PGs per OSD.
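For example, for a hypothetical cluster with 100 OSDs and a replicated pool with three replicas:

Total Placement Groups = (100 * 100) / 3 = 3333

Rounding up to the next power of two gives 4096 PGs for that pool, which provides a better distribution of the PGs across the OSDs.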

Note

Red Hat provides the Ceph Placement Groups (PGs) per Pool Calculator to recommend the number of PGs per pool, at https://access.redhat.com/labs/cephpgc/.

Splitting PGs

Ceph supports increasing or decreasing the number of PGs in a pool. If you do not specify a value when creating a pool, it is created with a default value of 8 PGs, which is very low.

The pg_autoscale_mode property allows Ceph to make recommendations and automatically adjust the pg_num and pgp_num parameters. This option is enabled by default when creating a new pool. The pg_num parameter defines the number of PGs for a specific pool. The pgp_num parameter defines the number of PGs that the CRUSH algorithm considers for placement.

Red Hat recommends that you make incremental increases in the number of placement groups until you reach the desired number of PGs. Increasing the number of PGs by a significant amount can cause cluster performance degradation, because the expected data relocation and rebalancing is intensive.

Use the ceph osd pool set command to manually increase or decrease the number of PGs by setting the pg_num parameter. You should only increase the number of PGs in a pool by small increments when doing it manually with the pg_autoscale_mode option disabled. Setting the total number of placement groups to a number that is a power of 2 provides better distribution of the PGs across the OSDs. Increasing the pg_num parameter automatically increases the pgp_num parameter, but at a gradual rate to minimize the impact on cluster performance.
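For example, this minimal sketch disables autoscaling on a hypothetical pool named mypool, raises its PG count in one small step, and then verifies the new value. The pool name and values are illustrative only:

[admin@node ~]$ ceph osd pool set mypool pg_autoscale_mode off
[admin@node ~]$ ceph osd pool set mypool pg_num 64
[admin@node ~]$ ceph osd pool get mypool pg_num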

Merging PGs

Red Hat Ceph Storage can merge two PGs into a larger PG, reducing the total number of PGs. Merging can be useful when the number of PGs in a pool is too large and performance is degraded. Because merging is a complex process, merge only one PG at a time to minimize the impact on cluster performance.

PG Auto-scaling

As discussed, the PG autoscale feature allows Ceph to make recommendations and automatically adjust the number of PGs. This feature is enabled by default when creating a pool. For existing pools, configure autoscaling with this command:

[admin@node ~]$ ceph osd pool set pool-name pg_autoscale_mode mode

Set the mode parameter to off to disable it, on to enable it and allow Ceph to automatically make adjustments in the number of PGs, or warn to raise health alerts when the number of PGs must be adjusted.
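For example, to only receive health alerts for a hypothetical pool named pool3, matching the third pool in the output below, rather than letting Ceph adjust it automatically:

[admin@node ~]$ ceph osd pool set pool3 pg_autoscale_mode warn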

View the information provided by the autoscale module:

[admin@node ~]$ ceph osd pool autoscale-status
POOL   SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
pool1      0               3.0        92124M  0.0000                                  1.0       1              on
pool2   1323               3.0        92124M  0.0000                                  1.0      32              on
pool3   3702               3.0        92124M  0.0000                                  1.0      32          64  warn

In this example, the first two pools have the AUTOSCALE feature set to on, so Ceph automatically adjusts their number of PGs. The third pool is configured to raise a health alert if its number of PGs needs adjusting. The PG_NUM column shows the current number of PGs in each pool, or the number of PGs that the pool is working towards. The NEW PG_NUM column shows the number of PGs that Ceph recommends setting for the pool.

Designing the Cluster Architecture

When designing a Ceph cluster, consider scaling choices to match your future data requirements and to facilitate sufficient throughput with the correct network size and architecture.

Scalability

You can scale clustered storage in two ways:

  • Scale out by adding more nodes to a cluster.

  • Scale up by adding more resources to existing nodes.

Scaling up requires that nodes can accept more CPU and RAM resources to handle an increase in the number of disks and disk size. Scaling out requires adding nodes with similar resources and capacity to match the cluster's existing nodes for balanced operations.

Networking Best Practices

The network interconnecting the nodes in a Ceph cluster is critical to good performance, because all client and cluster I/O operations use it. Red Hat recommends the following practices:

  • To increase performance and provide better isolation for troubleshooting, use separate networks for OSD traffic and for client traffic.

  • At a minimum, use 10 GbE or faster networks for the storage cluster. 1 GbE networks are not suitable for production environments.

  • Evaluate network sizing based on both cluster and client traffic, and the amount of data stored.

  • Network monitoring is highly recommended.

  • Use separate NICs to connect to the networks where possible, or else use separate ports.

Figure 12.1: Separate networks for OSD and client traffic is a representation of such a network architecture.

Figure 12.1: Separate networks for OSD and client traffic

The Ceph daemons automatically bind to the correct interfaces, such as binding MONs to the public network, and binding OSDs to both public and cluster networks.

Manually Controlling the Primary OSD for a PG

Use the primary affinity setting to influence Ceph's selection of a specific OSD as the primary OSD for a placement group. A higher setting makes an OSD more likely to be selected as a primary OSD. You can mitigate issues or bottlenecks by configuring the cluster to avoid using slow disks or controllers for a primary OSD. Use the ceph osd primary-affinity command to modify the primary affinity for an OSD. Affinity is a real number between 0 and 1.

[admin@node ~]$ ceph osd primary-affinity osd-number affinity
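For example, assuming that osd.2 is backed by a slower disk or controller, the following command lowers its primary affinity so that CRUSH is less likely to select it as the primary OSD. The OSD number and the value 0.5 are illustrative; the default affinity for every OSD is 1:

[admin@node ~]$ ceph osd primary-affinity 2 0.5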

Recovery and Backfilling for OSDs

When an OSD is added to or removed from a cluster, Ceph rebalances the PGs to use the new OSD or to re-create the replicas that were stored on the removed OSD. These backfill and recovery operations can generate a high load of cluster network traffic, which can impact performance.

To avoid degraded cluster performance, adjust the backfilling and recovery operations to create a balance between rebalancing and normal cluster operations. Ceph provides parameters to limit the backfilling and recovery operations' I/O and network activity.

The following list includes some of those parameters:

Parameter                      Definition
osd_recovery_op_priority       Priority for recovery operations
osd_recovery_max_active        Maximum number of active recovery requests per OSD, in parallel
osd_recovery_threads           Number of threads for data recovery
osd_max_backfills              Maximum number of backfills for an OSD
osd_backfill_scan_min          Minimum number of objects per backfill scan
osd_backfill_scan_max          Maximum number of objects per backfill scan
osd_backfill_full_ratio        Threshold for backfill requests to an OSD
osd_backfill_retry_interval    Number of seconds to wait before retrying backfill requests
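For example, this sketch temporarily throttles backfill and recovery activity to protect client I/O during normal operations. The values are illustrative and can be raised again when faster rebalancing is more important than client performance:

[admin@node ~]$ ceph config set osd osd_max_backfills 1
[admin@node ~]$ ceph config set osd osd_recovery_max_active 1
[admin@node ~]$ ceph config set osd osd_recovery_op_priority 1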

Configuring Hardware

Using realistic metrics for your cluster's expected workload, build the cluster's hardware configuration to provide sufficient performance, but keep the cost as low as possible. Red Hat suggests these hardware configurations for the three performance priorities:

IOPS optimized
  • Use two OSDs per NVMe device.

  • NVMe drives have data, the block database, and WAL collocated on the same storage device.

  • Assuming a 2 GHz CPU, use 10 cores per NVMe or 2 cores per SSD.

  • Allocate 16 GB RAM as a baseline, plus 5 GB per OSD.

  • Use 10 GbE NICs per 2 OSDs.

Throughput optimized
  • Use one OSD per HDD.

  • Place the block database and WAL on SSDs or NVMes.

  • Use at least 7,200 RPM HDD drives.

  • Assuming a 2 GHz CPU, use one-half core per HDD.

  • Allocate 16 GB RAM as a baseline, plus 5 GB per OSD.

  • Use 10 GbE NICs per 12 OSDs.

Capacity optimized
  • Use one OSD per HDD.

  • HDDs have data, the block database, and WAL collocated on the same storage device.

  • Use at least 7,200 RPM HDD drives.

  • Assuming a 2 GHz CPU, use one-half core per HDD.

  • Allocate 16 GB RAM as a baseline, plus 5 GB per OSD.

  • Use 10 GbE NICs per 12 OSDs.

Tuning with Ceph Performance Tools

Performance tools provide benchmarking metrics to examine clusters for performance issues.

Performance Counters and Gauges

Each Ceph daemon maintains a set of internal counters and gauges. Several tools are available to access these counters:

The Dashboard plug-in

The Dashboard plug-in exposes a web interface accessible on port 8443. The Cluster > OSDs menu provides basic real-time OSD statistics, for example, the number of read bytes, write bytes, read operations, and write operations. Enable the Dashboard plug-in by using the ceph mgr module enable dashboard command. If you bootstrap your cluster with the cephadm bootstrap command, then the dashboard is enabled by default.

The Manager (MGR) Prometheus plug-in

This plug-in exposes the performance metrics on port 9283, for an external Prometheus server to collect. Prometheus is an open source system monitoring and alerting utility.
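If the module is not already active in your cluster, a single command enables it:

[admin@node ~]$ ceph mgr module enable prometheus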

The ceph command-line tool

The ceph command has options to view metrics and change daemon parameters.
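For example, the following sketch dumps the internal performance counters of an OSD through its admin socket, and then displays one configuration value of the running daemon. The daemon name osd.0 is illustrative, and the ceph daemon command must run on the node that hosts that daemon:

[admin@node ~]$ ceph daemon osd.0 perf dump
[admin@node ~]$ ceph config show osd.0 osd_max_backfills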

Performance Stress Tools

Red Hat Ceph Storage provides tools to stress test and benchmark a Ceph cluster.

The RADOS bench command

RADOS bench is a simple tool for testing the RADOS Object Store. It executes write and read tests on your cluster and provides statistics. The general syntax of the command is:

[admin@node ~]$ rados -p pool-name bench seconds write|seq|rand \
-b objsize -t concurrency

These are the common parameters for the tool:

  • The seq and rand tests are sequential and random read benchmarks. These tests require that a write benchmark is run first with the --no-cleanup option. By default, RADOS bench removes the objects created for the writing test. The --no-cleanup option keeps the objects, which can be useful for performing multiple tests on the same objects.

  • The default object size, objsize, is 4 MB.

  • The default number of concurrent operations, concurrency, is 16.

    With the --no-cleanup option, you must manually remove data that remains in the pool after running the rados bench command.
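    To remove the leftover data, the rados cleanup subcommand deletes the benchmark objects that a previous write test run with --no-cleanup left in the pool, as in this sketch that reuses the testbench pool name from the output below:

    [ceph: root@server /]# rados -p testbench cleanup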

    For example, the following information is provided by the rados bench command, including throughput, IOPS, and latency:

    [ceph: root@server /]# rados bench -p testbench 10 write --no-cleanup
    hints = 1
    Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
    Object prefix: benchmark_data_server.example.com_265
      sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
        0       0         0         0         0         0           -           0
        1      16        72        56   223.964       224    0.157623    0.241175
    ...output omitted...
       10      16       715       699   279.551       328    0.120089    0.226616
    Total time run:         10.1406
    Total writes made:      715
    Write size:             4194304
    Object size:            4194304
    Bandwidth (MB/sec):     282.035
    Stddev Bandwidth:       32.1911
    Max bandwidth (MB/sec): 328
    Min bandwidth (MB/sec): 224
    Average IOPS:           70
    Stddev IOPS:            8.04777
    Max IOPS:               82
    Min IOPS:               56
    Average Latency(s):     0.225574
    Stddev Latency(s):      0.105867
    Max latency(s):         0.746961
    Min latency(s):         0.0425166
The RBD bench command

The rbd bench command measures I/O throughput and latency on an existing image that you create for the test.

These are the default values:

  • If you do not give a suffix to the size arguments, the command assumes bytes as the unit.

  • The default pool name is rbd.

  • The default for --io-size is 4096 bytes.

  • The default for --io-threads is 16.

  • The default for --io-total is 1 GB.

  • The default for --io-pattern is seq for sequential.

    For example, the following information is provided by the rbd bench command, including throughput and latency:

    [ceph: root@server /]# rbd bench --io-type write testimage --pool=testbench
    bench  type write io_size 4096 io_threads 16 bytes 1073741824 pattern sequential
      SEC       OPS   OPS/SEC   BYTES/SEC
        1     47360   46952.6   183 MiB/s
        2     88864   44461.5   174 MiB/s
        3    138016   46025.2   180 MiB/s
        4    178128   44546.4   174 MiB/s
        5    225264   45064.3   176 MiB/s
    elapsed: 5   ops: 262144   ops/sec: 45274.6   bytes/sec: 177 MiB/s

 

References

sysctl(8), ceph(8), rados(8), and rbd(8) man pages

For more information, refer to the Ceph performance benchmark chapter in the Red Hat Ceph Administration Guide at https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-single/administration_guide/index#ceph-performance-benchmarking
