After completing this section, you should be able to describe the Red Hat Ceph Storage architecture, introduce the Object Storage Cluster, and describe the choices in data access methods.
Red Hat Ceph Storage is a distributed data object store. It is an enterprise-ready, software-defined storage solution that scales to thousands of clients who access exabytes of data and beyond. Ceph is designed to provide excellent performance, reliability, and scalability.
Ceph has a modular and distributed architecture that contains the following elements:
An object storage back end that is known as RADOS (Reliable Autonomic Distributed Object Store)
Various access methods to interact with RADOS
The Red Hat Ceph Storage cluster has the following daemons:
Monitors (MONs) maintain maps of the cluster state. They help the other daemons to coordinate with each other.
Object Storage Devices (OSDs) store data and handle data replication, recovery, and rebalancing.
Managers (MGRs) track runtime metrics and expose cluster information through a browser-based dashboard and REST API.
Metadata Servers (MDSes) store the metadata that CephFS uses so that clients can run POSIX commands efficiently. Object storage and block storage do not use the MDS.
These daemons can scale to meet the requirements of a deployed storage cluster.
Ceph Monitors (MONs) are the daemons that maintain the cluster map. The cluster map is a collection of five maps that contain information about the state of the cluster and its configuration. Ceph must handle each cluster event, update the appropriate map, and replicate the updated map to each MON daemon.
To apply updates, the MONs must establish a consensus on the state of the cluster. A majority of the configured monitors must be available and agree on the map update. Configure your Ceph clusters with an odd number of monitors to ensure that the monitors can establish a quorum when they vote on the state of the cluster. More than half of the configured monitors must be functional for the Ceph storage cluster to be operational and accessible.
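The quorum state can be checked with the ceph quorum_status command, or programmatically through librados. The following is a minimal Python sketch, assuming a default /etc/ceph/ceph.conf and a client keyring on the host; the quorum_names field reflects the JSON output of the quorum_status command.

```python
# Minimal sketch: ask the monitors which members are currently in quorum.
# Assumes /etc/ceph/ceph.conf and a client keyring are present.
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

cmd = json.dumps({'prefix': 'quorum_status', 'format': 'json'})
ret, outbuf, errs = cluster.mon_command(cmd, b'')
status = json.loads(outbuf)
print('Monitors in quorum:', status.get('quorum_names'))

cluster.shutdown()
```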
Ceph Object Storage Devices (OSDs) are the building blocks of a Ceph storage cluster. OSDs connect a storage device (such as a hard disk or other block device) to the Ceph storage cluster. An individual storage server can run multiple OSD daemons and provide multiple OSDs to the cluster. Red Hat Ceph Storage 5 supports a feature called BlueStore to store data within RADOS. BlueStore uses the local storage devices in raw mode and is designed for high performance.
CRUSH assigns every object to a Placement Group (PG), which is a single hash bucket. PGs are an abstraction layer between the objects (application layer) and the OSDs (physical layer). CRUSH uses a pseudo-random placement algorithm to distribute the objects across the PGs and uses rules to determine the mapping of the PGs to the OSDs. In the event of failure, Ceph remaps the PGs to different physical devices (OSDs) and synchronizes their content to match the configured data protection rules.
One OSD is the primary OSD for the object's placement group, and Ceph clients always contact the primary OSD in the acting set when they read or write data. Other OSDs are secondary OSDs and play an important role in ensuring the resilience of data in the event of failures in the cluster.
Primary OSD functions:
Serve all I/O requests
Replicate and protect the data
Check data coherence
Rebalance the data
Recover the data
Secondary OSD functions:
Always act under the control of the primary OSD
Can become the primary OSD
A host that runs OSDs must not mount Ceph RBD images or CephFS file systems by using the kernel-based client. Mounted resources can become unresponsive due to memory deadlocks or blocked I/O that is pending on stale sessions.
Ceph Managers (MGRs) collect and expose cluster statistics.
If no MGRs are available in a cluster, client I/O operations are not negatively affected, but attempts to query cluster statistics fail. To avoid this scenario, Red Hat recommends that you deploy at least two Ceph MGRs for each cluster, each running in a separate failure domain.
The MGR daemon centralizes access to all data that is collected from the cluster and provides a simple web dashboard to storage administrators. The MGR daemon can also export status information to an external monitoring server, such as Zabbix.
The Ceph Metadata Server (MDS) manages Ceph File System (CephFS) metadata. It provides POSIX-compliant, shared file-system metadata management, including ownership, time stamps, and mode. The MDS uses RADOS instead of local storage to store its metadata. It has no access to file contents.
The MDS enables CephFS to interact with the Ceph Object Store, mapping an inode to an object and the location where Ceph stores the data within a tree. Clients that access a CephFS file system first send a request to an MDS, which provides the information needed to get file content from the correct OSDs.
Ceph clients and OSDs require knowledge of the cluster topology. Five maps, collectively referred to as the cluster map, represent this topology. The Ceph Monitor daemons maintain the cluster map, and a cluster of Ceph MONs ensures high availability if a monitor daemon fails.
The Monitor Map contains the cluster's fsid; the position, name, address, and port of each monitor; and map time stamps.
The fsid is a unique, auto-generated identifier (UUID) that identifies the Ceph cluster.
View the Monitor Map with the ceph mon dump command.
The OSD Map contains the cluster's fsid, a list of pools, replica sizes, placement group numbers, a list of OSDs and their status, and map time stamps.
View the OSD Map with the ceph osd dump command.
The Placement Group (PG) Map contains the PG version; the full ratios; details on each placement group, such as the PG ID, the Up Set, the Acting Set, and the state of the PG; data usage statistics for each pool; and map time stamps.
View the PG Map statistics with the ceph pg dump command.
The CRUSH Map contains a list of storage devices, the failure domain hierarchy (such as device, host, rack, row, room), and rules for traversing the hierarchy when storing data.
View the CRUSH map with the ceph osd crush dump command.
The Metadata Server (MDS) Map contains the pool for storing metadata, a list of metadata servers and their status, and map time stamps.
View the MDS Map with the ceph fs dump command.
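These maps can also be retrieved programmatically. The following sketch uses the librados Python binding to issue the equivalents of ceph mon dump and ceph osd dump and print a short summary; it assumes a default configuration file and client keyring.

```python
# Sketch: dump the Monitor Map and OSD Map through librados.
# Assumes a default ceph.conf and keyring on the local host.
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

for prefix in ('mon dump', 'osd dump'):
    cmd = json.dumps({'prefix': prefix, 'format': 'json'})
    ret, outbuf, errs = cluster.mon_command(cmd, b'')
    dump = json.loads(outbuf)
    # Both maps carry the cluster fsid and an epoch (the map version).
    print(prefix, '-> fsid:', dump.get('fsid'), 'epoch:', dump.get('epoch'))

cluster.shutdown()
```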
Ceph provides the following methods for accessing a Ceph cluster:
Ceph Native API (librados)
Ceph Block Device (RBD, librbd), also known as a RADOS Block Device (RBD) image
Ceph Object Gateway
Ceph File System (CephFS, libcephfs)
The following diagram depicts the four data access methods of a Ceph cluster, the libraries that support the access methods, and the underlying Ceph components that manage and store the data.
The foundational library that implements the other Ceph interfaces, such as Ceph Block Device and Ceph Object Gateway, is librados.
The librados library is a native C library that allows applications to work directly with RADOS to access objects stored by the Ceph cluster.
Similar libraries are available for C++, Java, Python, Ruby, Erlang, and PHP.
To maximize performance, write your applications to work directly with librados; this approach yields the best storage performance in a Ceph environment.
For easier access to Ceph storage, use the higher-level access methods instead, such as RADOS Block Device (RBD), the Ceph Object Gateway (RADOSGW), and CephFS.
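The following is a minimal sketch of a native librados client in Python. It assumes that a pool named mypool already exists and that a default configuration file and keyring are available on the host.

```python
# Minimal librados sketch: connect to the cluster, write an object to a
# pool, then read it back. Assumes the pool "mypool" already exists.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

ioctx = cluster.open_ioctx('mypool')      # I/O context bound to one pool
ioctx.write_full('hello-object', b'hello from librados')
print(ioctx.read('hello-object'))

ioctx.close()
cluster.shutdown()
```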
A Ceph Block Device (RADOS Block Device or RBD) provides block storage within a Ceph cluster through RBD images. Ceph scatters the individual objects that compose RBD images across different OSDs in the cluster. Because the objects that make up the RBD are on different OSDs, access to the block device is automatically parallelized.
RBD provides the following features:
Storage for virtual disks in the Ceph cluster
Mount support in the Linux kernel
Boot support in QEMU, KVM, and OpenStack Cinder
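RBD images can be managed with the rbd command-line tool or programmatically through the librbd Python binding. The following sketch creates a small image and writes to it; the pool name rbdpool and image name disk1 are placeholders, and the pool is assumed to exist.

```python
# Sketch: create a 1 GiB RBD image and write to it with the librbd
# Python binding. Assumes a pool named "rbdpool" already exists.
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbdpool')

rbd.RBD().create(ioctx, 'disk1', 1024**3)     # image name and size in bytes
with rbd.Image(ioctx, 'disk1') as image:
    image.write(b'example data', 0)           # write at offset 0
    print('image size:', image.size())

ioctx.close()
cluster.shutdown()
```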
Ceph Object Gateway (RADOS Gateway, RADOSGW, or RGW) is an object storage interface that is built with librados.
It uses this library to communicate with the Ceph cluster and writes to OSD processes directly.
It provides applications with a gateway with a RESTful API, and supports two interfaces: Amazon S3 and OpenStack Swift.
To port existing applications that use the Amazon S3 API or the OpenStack Swift API, you need to update only the endpoint.
The Ceph Object Gateway offers scalability support by not limiting the number of deployed gateways and by providing support for standard HTTP load balancers. The Ceph Object Gateway includes the following use cases:
Image storage (for example, SmugMug, Tumblr)
Backup services
File storage and sharing (for example, Dropbox)
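Because the gateway exposes an S3-compatible API, standard S3 client libraries work against it. The following sketch uses the boto3 Python library; the endpoint URL and credentials are placeholders for values that an administrator creates for an RGW user.

```python
# Sketch: access the Ceph Object Gateway through its S3-compatible API
# with boto3. The endpoint and credentials below are placeholders.
import boto3

s3 = boto3.client(
    's3',
    endpoint_url='http://rgw.example.com:8080',   # RGW endpoint (placeholder)
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
)

s3.create_bucket(Bucket='backups')
s3.put_object(Bucket='backups', Key='archive.tar.gz', Body=b'example payload')
print([b['Name'] for b in s3.list_buckets()['Buckets']])
```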
Ceph File System (CephFS) is a parallel file system that provides a scalable, single-hierarchy shared disk. Red Hat Ceph Storage provides production environment support for CephFS, including support for snapshots.
The Ceph Metadata Server (MDS) manages the metadata that is associated with files stored in CephFS, including file access, change, and modification time stamps.
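CephFS is usually mounted with the kernel client or ceph-fuse, but the libcephfs Python binding can also access it directly. The following sketch, assuming a default configuration file, a client keyring, and a single CephFS file system in the cluster, creates a directory and reads its metadata.

```python
# Sketch: connect to CephFS through the libcephfs Python binding and
# create a directory. Assumes a default ceph.conf and keyring and an
# existing CephFS file system.
import cephfs

fs = cephfs.LibCephFS(conffile='/etc/ceph/ceph.conf')
fs.mount()                       # mounts the default file system at "/"
fs.mkdir('/projects', 0o755)     # POSIX-style metadata handled by the MDS
print(fs.stat('/projects'))      # ownership, time stamps, and mode
fs.unmount()
fs.shutdown()
```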
Cloud-aware applications need a simple object storage interface with asynchronous communication capability. The Red Hat Ceph Storage cluster provides such an interface. Clients have direct, parallel access to objects throughout the cluster and can perform the following operations (a librados sketch of several of them follows the list):
Pool Operations
Snapshots
Read/Write Objects
Create or Remove
Entire Object or Byte Range
Append or Truncate
Create/Set/Get/Remove XATTRs
Create/Set/Get/Remove Key/Value Pairs
Compound operations and dual-ack semantics
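The following sketch shows a few of these operations through the librados Python binding; the pool name mypool and the object and attribute names are placeholders.

```python
# Sketch of native object operations: append to an object and manage an
# extended attribute (XATTR). Assumes the pool "mypool" already exists.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('mypool')

ioctx.write_full('log-object', b'first line\n')
ioctx.append('log-object', b'second line\n')       # append to the object
ioctx.set_xattr('log-object', 'owner', b'app1')    # attach an XATTR
print(ioctx.get_xattr('log-object', 'owner'))

ioctx.close()
cluster.shutdown()
```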
The object map feature tracks which backing RADOS objects exist for an RBD image. When a client writes to the image, the write is translated to an offset within a backing RADOS object, and the object map records that the object now exists. The object map is kept in memory on the librbd client so that the client can avoid querying the OSDs for objects that do not exist. A sketch of enabling this feature on an existing image follows the list of operations below.
The object map is beneficial for certain operations, such as:
Resize
Export
Copy
Flatten
Delete
Read
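The object map is enabled as an image feature and depends on the exclusive-lock feature. The following sketch uses the librbd Python binding and assumes an existing image disk1 in a pool named rbdpool.

```python
# Sketch: enable the object-map feature (which requires exclusive-lock)
# on an existing RBD image. Assumes the image "disk1" in pool "rbdpool".
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbdpool')

with rbd.Image(ioctx, 'disk1') as image:
    image.update_features(rbd.RBD_FEATURE_EXCLUSIVE_LOCK, True)
    image.update_features(rbd.RBD_FEATURE_OBJECT_MAP, True)

ioctx.close()
cluster.shutdown()
```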
Storage devices have throughput limitations, which impact performance and scalability.
Storage systems often support striping, which is storing sequential pieces of information across multiple storage devices, to increase throughput and performance.
Ceph clients can use data striping to increase performance when writing data to the cluster.
This section describes the mechanisms that Ceph uses to distribute and organize data across the various storage devices in a cluster.
Ceph OSDs protect and constantly check the integrity of the data that is stored in the cluster. Pools are logical partitions of the Ceph storage cluster that are used to store objects under a common name tag. Ceph assigns each pool a specific number of hash buckets, called Placement Groups (PGs), to group objects for storage.
Each pool has the following adjustable properties:
Immutable ID
Name
Number of PGs to distribute the objects across the OSDs
CRUSH rule to determine the mapping of the PGs for this pool
Protection type (replicated or erasure coding)
Parameters that are associated with the protection type
Various flags to influence the cluster behavior
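A pool with these properties can be created with the ceph osd pool create command or programmatically. The following is a minimal sketch that uses the librados Python binding to create a replicated pool with the cluster's default settings; the pool name app-data is an arbitrary example.

```python
# Sketch: create a pool through librados and list the pools in the
# cluster. The new pool uses the cluster's default PG count and rule.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

if not cluster.pool_exists('app-data'):
    cluster.create_pool('app-data')
print(cluster.list_pools())

cluster.shutdown()
```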
A Placement Group (PG) aggregates a series of objects into a hash bucket, or group. Ceph maps each PG to a set of OSDs. An object belongs to a single PG, and all objects that belong to the same PG return the same hash result.
The CRUSH algorithm maps an object to its PG based on the hashing of the object's name. The placement strategy is also called the CRUSH placement rule. The placement rule identifies the failure domain to choose within the CRUSH topology to receive each replica or erasure code chunk.
When a client writes an object to a pool, it uses the pool's CRUSH placement rule to determine the object's placement group. The client then uses its copy of the cluster map, the placement group, and the CRUSH placement rule to calculate which OSDs to write a copy of the object to (or its erasure-coded chunks).
The layer of indirection that the placement group provides is important when new OSDs become available to the Ceph cluster. When OSDs are added to or removed from a cluster, placement groups are automatically rebalanced between operational OSDs.
A Ceph client gets the latest copy of the cluster map from a MON. The cluster map provides information to the client about all the MONs, OSDs, and MDSs in the cluster. It does not provide the client with the location of the objects; the client must use CRUSH to compute the locations of objects that it needs to access.
To calculate the Placement Group ID (PG ID) for an object, the Ceph client needs the object ID and the name of the object's storage pool. The client calculates the PG ID, which is the hash of the object ID modulo the number of PGs. It then looks up the numeric ID for the pool, based on the pool's name, and prepends the pool ID to the PG ID.
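The following is a simplified, conceptual sketch of that calculation. Real Ceph uses its own hash function (rjenkins) and a stable modulo against the PG count, so the values produced here are only illustrative.

```python
# Conceptual sketch of the PG ID calculation: hash the object name,
# take it modulo the pool's PG count, and prepend the pool ID.
# Real Ceph uses the rjenkins hash and a stable modulo, so actual
# placement group IDs differ from this illustration.
import zlib

def pg_id(pool_id: int, object_name: str, pg_num: int) -> str:
    object_hash = zlib.crc32(object_name.encode())   # stand-in hash function
    pg = object_hash % pg_num
    return f'{pool_id}.{pg:x}'                        # for example, "3.1a"

print(pg_id(pool_id=3, object_name='hello-object', pg_num=64))
```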
The CRUSH algorithm is then used to determine which OSDs are responsible for a placement group (the Acting Set). The OSDs in the Acting Set that are up are in the Up Set. The first OSD in the Up Set is the current primary OSD for the object's placement group, and all other OSDs in the Up Set are secondary OSDs.
The Ceph client can then directly work with the primary OSD to access the object.
Like Ceph clients, OSD daemons use the CRUSH algorithm, but the OSD daemon uses it to compute where to store the object replicas and for rebalancing storage. In a typical write scenario, a Ceph client uses the CRUSH algorithm to compute where to store the original object. The client maps the object to a pool and placement group and then uses the CRUSH map to identify the primary OSD for the mapped placement group. When creating pools, set them as either replicated or erasure coded pools. Red Hat Ceph Storage 5 supports erasure coded pools for Ceph RBD and CephFS.
Erasure coding provides a more cost-efficient way to store data, but with lower performance. For an erasure coded pool, the configuration values determine the number of data chunks (k) and coding chunks (m) to create for each object.
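For example, a profile with k=4 data chunks and m=2 coding chunks splits each object into six chunks, tolerates the loss of any two of them, and consumes 1.5 times the raw capacity, compared with 3 times for three-way replication. A quick sketch of that arithmetic:

```python
# Storage overhead comparison: three-way replication versus a k=4, m=2
# erasure-coded profile (4 data chunks + 2 coding chunks per object).
k, m = 4, 2
replica_overhead = 3.0              # three full copies of every object
ec_overhead = (k + m) / k           # 1.5x raw capacity, tolerates m failures
print(f'replication: {replica_overhead}x, erasure coding: {ec_overhead}x')
```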
The following figure illustrates how data objects are stored in a Ceph cluster. Ceph maps one or more objects in a pool to a single PG, represented by the colored boxes. Each of the PGs in this figure is replicated and stored on separate OSDs within the Ceph cluster.
For more information, refer to the Red Hat Ceph Storage 5 Architecture Guide at https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-single/architecture_guide
For more information, refer to the Red Hat Ceph Storage 5 Hardware Guide at https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-single/hardware_guide
For more information, refer to the Red Hat Ceph Storage 5 Storage Strategies Guide at https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-single/storage_strategies_guide