Describing the Kafka Ecosystem and Architecture

Objectives

After completing this section, you should be able to describe the architecture of Kafka and AMQ Streams.

Describing the Apache Kafka Architecture

Kafka is often described as a distributed commit log system. The concept of a commit log comes from the database world, where it refers to an ordered record of transactions. By reading this commit log and reapplying the transactions, a system can be recovered to a consistent state after a failure.

The basic idea of a commit log in Kafka is the same: your application should be able to recover its current state by reading the data sent to Kafka. This is closely related to event sourcing and to event-driven architectures in general. For example, suppose that a banking application performs the following operations on a bank account, in sequence:

  1. Deposit 200 USD

  2. Withdraw 50 USD

  3. Withdraw 5 USD

You can see every operation as an event. If you read the events in order and apply the operations, then you can find the current balance of the account, which is 145 USD. This is possible because events (called messages or records in Kafka terminology) are durable: they are kept for as long as you have configured. Therefore, in the event of a failure, or if messages arrive out of order, you can refer to the log to ensure data integrity.
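
As a minimal sketch, assuming the three events have already been read from Kafka and materialized as signed amounts (the class and method names are chosen only for illustration), replaying the log in plain Java recovers the current state:

    import java.util.List;

    public class AccountBalance {

        // Each event is the signed amount of one operation: +200, -50, -5.
        static long replay(List<Long> events) {
            long balance = 0;
            for (long amount : events) {
                balance += amount; // apply each event in order
            }
            return balance;
        }

        public static void main(String[] args) {
            // Deposit 200 USD, withdraw 50 USD, withdraw 5 USD
            System.out.println(replay(List.of(200L, -50L, -5L))); // prints 145
        }
    }

Because the events are durable, the same replay produces the same balance after a crash or restart.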

Describing the Kafka Cluster

Kafka is composed of several servers, which allows horizontal scaling. A Kafka server is called a broker. Several brokers working together form a cluster; a cluster is a logical concept that identifies a group of cooperating brokers.

Among the brokers in a cluster, Apache ZooKeeper selects a controller. The controller broker is responsible for managing partitions and replicas, as well as for performing other administrative tasks. Because brokers are independent instances, they have to communicate over a network to share information about failures or updates. Kafka uses Apache ZooKeeper for operations such as selecting the controller broker, notifying the controller that another broker is joining or leaving the cluster, and handling topic creation and removal.

Note

In future versions of Kafka, Apache ZooKeeper will be removed in favor of a built-in consensus protocol (KRaft). Because ZooKeeper is a separate system that operates outside of the Kafka cluster, it adds management overhead and requires additional resources.

Figure 2.1: Brokers communicating with ZooKeeper

Describing Messages and Topics

A Kafka cluster forms a distributed messaging system. A message (also called a record) is an array of bytes sent over a network. In an event-driven architecture, an event can be represented as a Kafka message. For example, the event UserCreated, which contains information about the creation of a user, would be serialized into an array of bytes and sent as a Kafka message.
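
As a sketch of this idea, the following snippet uses the Kafka Java client to send a UserCreated event as an array of bytes. The broker address, the topic name users, and the string representation of the event are assumptions made for illustration; a real application would typically use a structured format such as JSON or Avro:

    import java.nio.charset.StandardCharsets;
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.ByteArraySerializer;

    public class UserCreatedProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("key.serializer", ByteArraySerializer.class.getName());
            props.put("value.serializer", ByteArraySerializer.class.getName());

            try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
                // Serialize the event into the byte array that travels over the network.
                byte[] event = "UserCreated{id=42,name=Alice}".getBytes(StandardCharsets.UTF_8);
                producer.send(new ProducerRecord<>("users", event)); // topic name is illustrative
            }
        }
    }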

Kafka categorizes messages into topics. A topic is a category that messages are sent to. Messages within the same topic are usually related. For example, in a banking application, a topic named transactions would contain messages regarding transaction operations (sending money or withdrawing money).

Topics are append-only, which means that you cannot remove specific messages. However, you can set a retention policy, which defines the criteria for removing messages from the topic. You can configure Kafka to remove messages after a certain period of time (for example, 5 days) or whenever the topic reaches a particular size (for example, 3 GB).
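
For example, both retention criteria can be set per topic when the topic is created. The following sketch uses the Kafka Java AdminClient; the broker address, topic name, partition count, and replication factor are assumptions for illustration:

    import java.util.Map;
    import java.util.Properties;
    import java.util.Set;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopicWithRetention {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

            try (Admin admin = Admin.create(props)) {
                NewTopic topic = new NewTopic("transactions", 3, (short) 1) // 3 partitions, 1 replica
                        .configs(Map.of(
                                "retention.ms", "432000000",    // 5 days, in milliseconds
                                "retention.bytes", "3221225472" // 3 GB (applied per partition)
                        ));
                admin.createTopics(Set.of(topic)).all().get(); // wait for the topic to be created
            }
        }
    }

Whichever limit is reached first triggers removal of the oldest messages.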

In Kafka, a producer is an application that writes data to topics, and a consumer is an application that reads data from topics.

Every topic is divided into several partitions. A partition is an ordered collection of messages within a topic. When you publish a message to a topic, it is stored in one of the topic's partitions. Message order is guaranteed within a partition, but not across the topic as a whole, because a consumer can read messages from any of the topic's partitions.

Figure 2.2: The structure of a Kafka topic
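
One common way to control which partition a message lands in is to give related records the same key: with the Java client's default partitioner, records that share a key always go to the same partition, so their relative order is preserved for consumers of that partition. A minimal sketch (the broker address, key, and values are assumptions for illustration):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class OrderedTransactions {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // All three records use the key "account-42", so the default
                // partitioner places them in the same partition, in this order.
                producer.send(new ProducerRecord<>("transactions", "account-42", "deposit 200"));
                producer.send(new ProducerRecord<>("transactions", "account-42", "withdraw 50"));
                producer.send(new ProducerRecord<>("transactions", "account-42", "withdraw 5"));
            }
        }
    }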

Describing How Kafka Achieves Scalability and Availability

To achieve scalability, the partitions of a topic are split across the different brokers of the cluster. This enables a topic to scale horizontally: the workload is distributed across several brokers, instead of a single broker being given more resources to cope with the entire workload. Consumers of a topic can read messages from different machines, which avoids sending all requests to the same broker.

At the same time, every partition is replicated to provide data redundancy in case of failure. These replicas are spread across the Kafka brokers in the cluster. Among these replicas located in different brokers, Kafka selects a leader by applying an election algorithm.

Producers write data to, and consumers read data from, only the partition leader; the data is then propagated to the replicas. If the broker that contains the partition leader goes down, Kafka selects a new leader from the remaining replicas. This way, data remains available.

For example, in a cluster with 3 brokers, consider that you create a topic called Topic A with 3 partitions and a replication factor of 3. This means that each partition of Topic A has 3 replicas in total: one leader and 2 followers. The following diagram illustrates a possible layout for the cluster.

Figure 2.3: Distribution of partitions and replicas across a Kafka cluster

Every partition leader is on a different broker, which ensures that consumers read from different machines. At the same time, every broker holds a replica of every partition. If, for example, Broker 0 goes down, Partition 0 needs a new leader, which Kafka elects from the replicas on Broker 1 and Broker 2.
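
Continuing with the Java AdminClient sketch from earlier (the broker address and the lowercase topic name topic-a are assumptions for illustration), a topic with this layout could be requested as follows; Kafka itself decides which brokers host each replica:

    import java.util.Properties;
    import java.util.Set;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopicA {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

            try (Admin admin = Admin.create(props)) {
                // 3 partitions, replication factor 3: each partition gets one leader
                // and two follower replicas, spread across the brokers.
                admin.createTopics(Set.of(new NewTopic("topic-a", 3, (short) 3))).all().get();
            }
        }
    }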

Describing the Apache Kafka Ecosystem

Aside from the Kafka platform itself, there are other applications designed to be used with Kafka. These include clients, Red Hat OpenShift Container Platform (RHOCP) operators, and applications that use the Kafka APIs.

Clients

A Kafka client allows you to interact with a Kafka cluster (producing or consuming messages) by connecting to a broker over a network.

Programming language clients

There are clients available for a variety of programming languages (Java, Python, Go, and others).

You can find a list of available clients on the Apache Kafka project wiki.
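
For example, a minimal consumer using the Java client might look like the following sketch; the broker address, consumer group id, and topic name are assumptions for illustration:

    import java.time.Duration;
    import java.util.Properties;
    import java.util.Set;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class TransactionsConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
            props.put("group.id", "balance-service");          // assumed consumer group
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Set.of("transactions"));
                while (true) {
                    // Fetch whatever records arrived since the last poll.
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                record.partition(), record.offset(), record.value());
                    }
                }
            }
        }
    }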

Built-in shell client

The installation of a Kafka broker also provides some built-in shell scripts (for example kafka-console-producer.sh or kafka-console-consumer.sh).

You can find these scripts in the bin directory of the installed broker. If you are using a cloud environment and you have deployed a broker as a container, then these scripts are in the bin directory of your container.

Kafka Streams

Kafka Streams is a client library for processing data from Kafka in real time. The main difference from the standard client libraries is that Kafka Streams encapsulates reading from a topic in a stream abstraction. You can then perform aggregations on different streams, as well as other data manipulation.
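
As a sketch of the programming model (the application id, broker address, and topic names are assumptions for illustration), the following topology reads the transactions topic as a stream, keeps only withdrawal messages, and writes them to a second topic:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class TransactionsFilter {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "transactions-filter"); // assumed id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> transactions = builder.stream("transactions");
            transactions.filter((key, value) -> value.startsWith("withdraw"))
                        .to("withdrawals"); // assumed output topic

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start(); // the topology keeps processing new records as they arrive
        }
    }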

Kafka Streams is covered in later sections of this course.

Distribution

There are several ways to deploy Kafka. You can install a broker on several bare-metal machines and make them work together to form a cluster, which also involves installing ZooKeeper and configuring everything yourself.

To make this process easier, many people choose a cloud environment such as RHOCP. You can deploy Kafka on RHOCP from scratch (deploying the brokers and configuring everything yourself); however, you can simplify the process by using AMQ Streams on RHOCP. AMQ Streams on RHOCP is based on the Strimzi upstream project.

Describing Strimzi

Strimzi provides a way to run Kafka easily in Kubernetes or RHOCP. Using several RHOCP operators behind the scenes, it sets up a Kafka cluster and a ZooKeeper cluster and configures both of them.

Strimzi uses RHOCP Custom Resources to perform operations such as creating the Kafka cluster or creating a new topic.

Describing the Strimzi Architecture

Strimzi uses several RHOCP operators to manage Kafka when it is running in an RHOCP cluster.

  • Cluster operator

    Manages the deployment of the Kafka cluster itself. Provides a Kafka custom resource that allows you to configure several aspects of the cluster.

  • Topic operator

    Manages the creation, update, and removal of Kafka topics. Provides a KafkaTopic custom resource that allows you to set topic configuration, such as the number of partitions or replicas.

  • User operator

    Manages the creation, update, and removal of Kafka users. Provides a KafkaUser custom resource.

  • Entity operator

    A convenience operator that deploys and manages both the Topic operator and the User operator.

Important

This course uses a Kafka cluster deployed and managed in RHOCP by Strimzi.

Integration with Other Systems

The Kafka platform offers a built-in solution for moving data between Kafka and other platforms.

Kafka Connect allows you to integrate Kafka with other systems such as databases or other messaging platforms. It acts as a bridge between Kafka and other platforms.

Kafka Connect is used in later sections of this course.

 

References

Apache Kafka Official Documentation

Strimzi Official Documentation

Neha Narkhede, Gwen Shapira, Todd Palino. (2017) Kafka: The Definitive Guide. O'Reilly. ISBN 978-1-491-99065-0.
