After completing this section, you should be able to describe the Red Hat Ceph Storage architecture, introduce the Object Storage Cluster, and describe the choices in data access methods.
Red Hat Ceph Storage is a distributed data object store. It is an enterprise-ready, software-defined storage solution that scales to thousands of clients who access exabytes of data and beyond. Ceph is designed to provide excellent performance, reliability, and scalability.
Ceph has a modular and distributed architecture that contains the following elements:
An object storage back end that is known as RADOS (Reliable Autonomic Distributed Object Store)
Various access methods to interact with RADOS
The Red Hat Ceph Storage cluster has the following daemons:
Monitors (MONs) maintain maps of the cluster state. They help the other daemons to coordinate with each other.
Object Storage Devices (OSDs) store data and handle data replication, recovery, and rebalancing.
Managers (MGRs) track runtime metrics and expose cluster information through a browser-based dashboard and REST API.
Metadata Servers (MDSes) store the metadata that CephFS uses so that clients can run POSIX commands efficiently. Object storage and block storage do not use the MDS.
These daemons can scale to meet the requirements of a deployed storage cluster.
Ceph Monitors (MONs) are the daemons that maintain the cluster map. The cluster map is a collection of five maps that contain information about the state of the cluster and its configuration. Ceph must handle each cluster event, update the appropriate map, and replicate the updated map to each MON daemon.
To apply updates, the MONs must establish a consensus on the state of the cluster. A majority of the configured monitors must be available and agree on the map update. Configure your Ceph clusters with an odd number of monitors to ensure that the monitors can establish a quorum when they vote on the state of the cluster. More than half of the configured monitors must be functional for the Ceph storage cluster to be operational and accessible.
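The quorum state can be checked with the ceph quorum_status command, or programmatically through librados. The following is a minimal Python sketch, assuming a default /etc/ceph/ceph.conf and a client keyring on the host; the quorum_names field reflects the JSON output of the quorum_status command.

```python
# Minimal sketch: ask the monitors which members are currently in quorum.
# Assumes /etc/ceph/ceph.conf and a client keyring are present.
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

cmd = json.dumps({'prefix': 'quorum_status', 'format': 'json'})
ret, outbuf, errs = cluster.mon_command(cmd, b'')
status = json.loads(outbuf)
print('Monitors in quorum:', status.get('quorum_names'))

cluster.shutdown()
```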
Ceph Object Storage Devices (OSDs) are the building blocks of a Ceph storage cluster. OSDs connect a storage device (such as a hard disk or other block device) to the Ceph storage cluster. An individual storage server can run multiple OSD daemons and provide multiple OSDs to the cluster. Red Hat Ceph Storage 5 supports a feature called BlueStore to store data within RADOS. BlueStore uses the local storage devices in raw mode and is designed for high performance.
CRUSH assigns every object to a Placement Group (PG), which is a single hash bucket. PGs are an abstraction layer between the objects (application layer) and the OSDs (physical layer). CRUSH uses a pseudo-random placement algorithm to distribute the objects across the PGs and uses rules to determine the mapping of the PGs to the OSDs. In the event of failure, Ceph remaps the PGs to different physical devices (OSDs) and synchronizes their content to match the configured data protection rules.
One OSD is the primary OSD for the object's placement group, and Ceph clients always contact the primary OSD in the acting set when they read or write data. Other OSDs are secondary OSDs and play an important role in ensuring the resilience of data in the event of failures in the cluster.
Primary OSD functions:
Serve all I/O requests
Replicate and protect the data
Check data coherence
Rebalance the data
Recover the data
Secondary OSD functions:
Always act under the control of the primary OSD
Can become the primary OSD
A host that runs OSDs must not mount Ceph RBD images or CephFS file systems by using the kernel-based client. Mounted resources can become unresponsive due to memory deadlocks or blocked I/O that is pending on stale sessions.
Ceph Managers (MGRs) collect and expose cluster statistics.
If no MGRs are available in a cluster, client I/O operations are not negatively affected, but attempts to query cluster statistics fail. To avoid this scenario, Red Hat recommends that you deploy at least two Ceph MGRs for each cluster, each running in a separate failure domain.
The MGR daemon centralizes access to all data that is collected from the cluster and provides a simple web dashboard to storage administrators. The MGR daemon can also export status information to an external monitoring server, such as Zabbix.
The Ceph Metadata Server (MDS) manages Ceph File System (CephFS) metadata. It provides POSIX-compliant, shared file-system metadata management, including ownership, time stamps, and mode. The MDS uses RADOS instead of local storage to store its metadata. It has no access to file contents.
The MDS enables CephFS to interact with the Ceph Object Store, mapping an inode to an object and the location where Ceph stores the data within a tree. Clients that access a CephFS file system first send a request to an MDS, which provides the information needed to get file content from the correct OSDs.
Ceph clients and OSDs require knowledge of the cluster topology. Five maps, collectively referred to as the cluster map, represent this topology. The Ceph Monitor daemons maintain the cluster map, and a cluster of Ceph MONs ensures high availability if a monitor daemon fails.
The Monitor Map contains the cluster's fsid; the position, name, address, and port of each monitor; and map time stamps.
The fsid is a unique, auto-generated identifier (UUID) that identifies the Ceph cluster.
View the Monitor Map with the ceph mon dump command.
The OSD Map contains the cluster's fsid, a list of pools, replica sizes, placement group numbers, a list of OSDs and their status, and map time stamps.
View the OSD Map with the ceph osd dump command.
The Placement Group (PG) Map contains the PG version; the full ratios; details on each placement group, such as the PG ID, the Up Set, the Acting Set, and the state of the PG; data usage statistics for each pool; and map time stamps.
View the PG Map statistics with the ceph pg dump command.
The CRUSH Map contains a list of storage devices, the failure domain hierarchy (such as device, host, rack, row, room), and rules for traversing the hierarchy when storing data.
View the CRUSH map with the ceph osd crush dump command.
The Metadata Server (MDS) Map contains the pool for storing metadata, a list of metadata servers and their status, and map time stamps.
View the MDS Map with the ceph fs dump command.
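These maps can also be retrieved programmatically. The following sketch uses the librados Python binding to issue the equivalents of ceph mon dump and ceph osd dump and print a short summary; it assumes a default configuration file and client keyring.

```python
# Sketch: dump the Monitor Map and OSD Map through librados.
# Assumes a default ceph.conf and keyring on the local host.
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

for prefix in ('mon dump', 'osd dump'):
    cmd = json.dumps({'prefix': prefix, 'format': 'json'})
    ret, outbuf, errs = cluster.mon_command(cmd, b'')
    dump = json.loads(outbuf)
    # Both maps carry the cluster fsid and an epoch (the map version).
    print(prefix, '-> fsid:', dump.get('fsid'), 'epoch:', dump.get('epoch'))

cluster.shutdown()
```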
Ceph provides the following methods for accessing a Ceph cluster:
Ceph Native API (librados)
Ceph Block Device (RBD, librbd), also known as a RADOS Block Device (RBD) image
Ceph Object Gateway
Ceph File System (CephFS, libcephfs)
The following diagram depicts the four data access methods of a Ceph cluster, the libraries that support the access methods, and the underlying Ceph components that manage and store the data.
The foundational library that implements the other Ceph interfaces, such as Ceph Block Device and Ceph Object Gateway, is librados.
The librados library is a native C library that allows applications to work directly with RADOS to access objects stored by the Ceph cluster.
Similar libraries are available for C++, Java, Python, Ruby, Erlang, and PHP.
To maximize performance, write your applications to work directly with librados; this approach yields the best storage performance in a Ceph environment.
For easier access to Ceph storage, use the higher-level access methods instead, such as RADOS Block Device (RBD), the Ceph Object Gateway (RADOSGW), and CephFS.
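The following is a minimal sketch of a native librados client in Python. It assumes that a pool named mypool already exists and that a default configuration file and keyring are available on the host.

```python
# Minimal librados sketch: connect to the cluster, write an object to a
# pool, then read it back. Assumes the pool "mypool" already exists.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

ioctx = cluster.open_ioctx('mypool')      # I/O context bound to one pool
ioctx.write_full('hello-object', b'hello from librados')
print(ioctx.read('hello-object'))

ioctx.close()
cluster.shutdown()
```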
A Ceph Block Device (RADOS Block Device or RBD) provides block storage within a Ceph cluster through RBD images. Ceph scatters the individual objects that compose RBD images across different OSDs in the cluster. Because the objects that make up the RBD are on different OSDs, access to the block device is automatically parallelized.
RBD provides the following features:
Storage for virtual disks in the Ceph cluster
Mount support in the Linux kernel
Boot support in QEMU, KVM, and OpenStack Cinder
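RBD images can be managed with the rbd command-line tool or programmatically through the librbd Python binding. The following sketch creates a small image and writes to it; the pool name rbdpool and image name disk1 are placeholders, and the pool is assumed to exist.

```python
# Sketch: create a 1 GiB RBD image and write to it with the librbd
# Python binding. Assumes a pool named "rbdpool" already exists.
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbdpool')

rbd.RBD().create(ioctx, 'disk1', 1024**3)     # image name and size in bytes
with rbd.Image(ioctx, 'disk1') as image:
    image.write(b'example data', 0)           # write at offset 0
    print('image size:', image.size())

ioctx.close()
cluster.shutdown()
```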
Ceph Object Gateway (RADOS Gateway, RADOSGW, or RGW) is an object storage interface that is built with librados.
It uses this library to communicate with the Ceph cluster and writes to OSD processes directly.
It provides applications with a gateway with a RESTful API, and supports two interfaces: Amazon S3 and OpenStack Swift.
To port existing applications that use the Amazon S3 API or the OpenStack Swift API, you need to update only the endpoint.
The Ceph Object Gateway offers scalability support by not limiting the number of deployed gateways and by providing support for standard HTTP load balancers. The Ceph Object Gateway includes the following use cases:
Image storage (for example, SmugMug, Tumblr)
Backup services
File storage and sharing (for example, Dropbox)
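Because the gateway exposes an S3-compatible API, standard S3 client libraries work against it. The following sketch uses the boto3 Python library; the endpoint URL and credentials are placeholders for values that an administrator creates for an RGW user.

```python
# Sketch: access the Ceph Object Gateway through its S3-compatible API
# with boto3. The endpoint and credentials below are placeholders.
import boto3

s3 = boto3.client(
    's3',
    endpoint_url='http://rgw.example.com:8080',   # RGW endpoint (placeholder)
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
)

s3.create_bucket(Bucket='backups')
s3.put_object(Bucket='backups', Key='archive.tar.gz', Body=b'example payload')
print([b['Name'] for b in s3.list_buckets()['Buckets']])
```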
Ceph File System (CephFS) is a parallel file system that provides a scalable, single-hierarchy shared disk. Red Hat Ceph Storage provides production environment support for CephFS, including support for snapshots.
The Ceph Metadata Server (MDS) manages the metadata that is associated with files stored in CephFS, including file access, change, and modification time stamps.
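CephFS is usually mounted with the kernel client or ceph-fuse, but the libcephfs Python binding can also access it directly. The following sketch, assuming a default configuration file, a client keyring, and a single CephFS file system in the cluster, creates a directory and reads its metadata.

```python
# Sketch: connect to CephFS through the libcephfs Python binding and
# create a directory. Assumes a default ceph.conf and keyring and an
# existing CephFS file system.
import cephfs

fs = cephfs.LibCephFS(conffile='/etc/ceph/ceph.conf')
fs.mount()                       # mounts the default file system at "/"
fs.mkdir('/projects', 0o755)     # POSIX-style metadata handled by the MDS
print(fs.stat('/projects'))      # ownership, time stamps, and mode
fs.unmount()
fs.shutdown()
```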
Cloud-aware applications need a simple object storage interface with asynchronous communication capability. The Red Hat Ceph Storage cluster provides such an interface. Clients have direct, parallel access to objects throughout the cluster and can perform the following operations (a librados sketch of several of them follows the list):
Pool Operations
Snapshots
Read/Write Objects
Create or Remove
Entire Object or Byte Range
Append or Truncate
Create/Set/Get/Remove XATTRs
Create/Set/Get/Remove Key/Value Pairs
Compound operations and dual-ack semantics
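The following sketch shows a few of these operations through the librados Python binding; the pool name mypool and the object and attribute names are placeholders.

```python
# Sketch of native object operations: append to an object and manage an
# extended attribute (XATTR). Assumes the pool "mypool" already exists.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('mypool')

ioctx.write_full('log-object', b'first line\n')
ioctx.append('log-object', b'second line\n')       # append to the object
ioctx.set_xattr('log-object', 'owner', b'app1')    # attach an XATTR
print(ioctx.get_xattr('log-object', 'owner'))

ioctx.close()
cluster.shutdown()
```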
The object map feature tracks which backing RADOS objects exist for an RBD image. When a client writes to the image, the write is translated to an offset within a backing RADOS object, and the object map records that the object now exists. The object map is kept in memory on the librbd client so that the client can avoid querying the OSDs for objects that do not exist. A sketch of enabling this feature on an existing image follows the list of operations below.
The object map is beneficial for certain operations, such as:
Resize
Export
Copy
Flatten
Delete
Read
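The object map is enabled as an image feature and depends on the exclusive-lock feature. The following sketch uses the librbd Python binding and assumes an existing image disk1 in a pool named rbdpool.

```python
# Sketch: enable the object-map feature (which requires exclusive-lock)
# on an existing RBD image. Assumes the image "disk1" in pool "rbdpool".
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbdpool')

with rbd.Image(ioctx, 'disk1') as image:
    image.update_features(rbd.RBD_FEATURE_EXCLUSIVE_LOCK, True)
    image.update_features(rbd.RBD_FEATURE_OBJECT_MAP, True)

ioctx.close()
cluster.shutdown()
```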
Storage devices have throughput limitations, which impact performance and scalability.
Storage systems often support striping, which is storing sequential pieces of information across multiple storage devices, to increase throughput and performance.
Ceph clients can use data striping to increase performance when writing data to the cluster.
This section describes the mechanisms that Ceph uses to distribute and organize data across the various storage devices in a cluster.
Ceph OSDs protect and constantly check the integrity of the data that is stored in the cluster. Pools are logical partitions of the Ceph storage cluster that are used to store objects under a common name tag. Ceph assigns each pool a specific number of hash buckets, called Placement Groups (PGs), to group objects for storage.
Each pool has the following adjustable properties:
Immutable ID
Name
Number of PGs to distribute the objects across the OSDs
CRUSH rule to determine the mapping of the PGs for this pool
Protection type (replicated or erasure coding)
Parameters that are associated with the protection type
Various flags to influence the cluster behavior
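A pool with these properties can be created with the ceph osd pool create command or programmatically. The following is a minimal sketch that uses the librados Python binding to create a replicated pool with the cluster's default settings; the pool name app-data is an arbitrary example.

```python
# Sketch: create a pool through librados and list the pools in the
# cluster. The new pool uses the cluster's default PG count and rule.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

if not cluster.pool_exists('app-data'):
    cluster.create_pool('app-data')
print(cluster.list_pools())

cluster.shutdown()
```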
A Placement Group (PG) aggregates a series of objects into a hash bucket, or group. Ceph maps each PG to a set of OSDs. An object belongs to a single PG, and all objects that belong to the same PG return the same hash result.
The CRUSH algorithm maps an object to its PG based on the hashing of the object's name. The placement strategy is also called the CRUSH placement rule. The placement rule identifies the failure domain to choose within the CRUSH topology to receive each replica or erasure code chunk.
When a client writes an object to a pool, it uses the pool's CRUSH placement rule to determine the object's placement group. The client then uses its copy of the cluster map, the placement group, and the CRUSH placement rule to calculate which OSDs to write a copy of the object to (or its erasure-coded chunks).
The layer of indirection that the placement group provides is important when new OSDs become available to the Ceph cluster. When OSDs are added to or removed from a cluster, placement groups are automatically rebalanced between operational OSDs.
A Ceph client gets the latest copy of the cluster map from a MON. The cluster map provides information to the client about all the MONs, OSDs, and MDSs in the cluster. It does not provide the client with the location of the objects; the client must use CRUSH to compute the locations of objects that it needs to access.
To calculate the Placement Group ID (PG ID) for an object, the Ceph client needs the object ID and the name of the object's storage pool. The client calculates the PG ID, which is the hash of the object ID modulo the number of PGs. It then looks up the numeric ID for the pool, based on the pool's name, and prepends the pool ID to the PG ID.
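The following is a simplified, conceptual sketch of that calculation. Real Ceph uses its own hash function (rjenkins) and a stable modulo against the PG count, so the values produced here are only illustrative.

```python
# Conceptual sketch of the PG ID calculation: hash the object name,
# take it modulo the pool's PG count, and prepend the pool ID.
# Real Ceph uses the rjenkins hash and a stable modulo, so actual
# placement group IDs differ from this illustration.
import zlib

def pg_id(pool_id: int, object_name: str, pg_num: int) -> str:
    object_hash = zlib.crc32(object_name.encode())   # stand-in hash function
    pg = object_hash % pg_num
    return f'{pool_id}.{pg:x}'                        # for example, "3.1a"

print(pg_id(pool_id=3, object_name='hello-object', pg_num=64))
```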
The CRUSH algorithm is then used to determine which OSDs are responsible for a placement group (the Acting Set). The OSDs in the Acting Set that are up are in the Up Set. The first OSD in the Up Set is the current primary OSD for the object's placement group, and all other OSDs in the Up Set are secondary OSDs.
The Ceph client can then directly work with the primary OSD to access the object.
Like Ceph clients, OSD daemons use the CRUSH algorithm, but the OSD daemon uses it to compute where to store the object replicas and for rebalancing storage. In a typical write scenario, a Ceph client uses the CRUSH algorithm to compute where to store the original object. The client maps the object to a pool and placement group and then uses the CRUSH map to identify the primary OSD for the mapped placement group. When creating pools, set them as either replicated or erasure coded pools. Red Hat Ceph Storage 5 supports erasure coded pools for Ceph RBD and CephFS.
Erasure coding provides a more cost-efficient way to store data, but with lower performance. For an erasure coded pool, the configuration values determine the number of data chunks (k) and coding chunks (m) to create for each object.
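For example, a profile with k=4 data chunks and m=2 coding chunks splits each object into six chunks, tolerates the loss of any two of them, and consumes 1.5 times the raw capacity, compared with 3 times for three-way replication. A quick sketch of that arithmetic:

```python
# Storage overhead comparison: three-way replication versus a k=4, m=2
# erasure-coded profile (4 data chunks + 2 coding chunks per object).
k, m = 4, 2
replica_overhead = 3.0              # three full copies of every object
ec_overhead = (k + m) / k           # 1.5x raw capacity, tolerates m failures
print(f'replication: {replica_overhead}x, erasure coding: {ec_overhead}x')
```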
The following figure illustrates how data objects are stored in a Ceph cluster. Ceph maps one or more objects in a pool to a single PG, represented by the colored boxes. Each of the PGs in this figure is replicated and stored on separate OSDs within the Ceph cluster.
For more information, refer to the Red Hat Ceph Storage 5 Architecture Guide at https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-single/architecture_guide
For more information, refer to the Red Hat Ceph Storage 5 Hardware Guide at https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-single/hardware_guide
For more information, refer to the Red Hat Ceph Storage 5 Storage Strategies Guide at https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-single/storage_strategies_guide