Developing Event-driven Applications with Apache Kafka and Red Hat AMQ Streams
After completing this section, you should be able to describe the architecture of Kafka and AMQ Streams.
Kafka is referred to as a distributed commit log system. The concept of a commit log is often used in a database context to refer to a set of transactions. By reading from this commit log and applying the transactions, the system can be recovered to a consistent state in case of any failure.
The basic idea of a commit log in Kafka is the same. Your application should be able to recover its current state by reading the data sent to Kafka. This is closely related to event sourcing and event-driven architectures in general. For example, let's say you perform the following operations on a bank account in a banking application in sequence:
Deposit 200 USD
Withdraw 50 USD
Withdraw 5 USD
You can see every operation as an event. If you read the events in order and apply the operations, then you can find the current balance of the account, which is 145 USD. This is possible because events (messages or records in Kafka terminology) are durable: they are kept for the time that you have configured. Therefore, in the event of a failure or out-of-order messages, you can refer to the log to ensure data integrity.
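The following minimal Java sketch illustrates the idea: it replays the three events in order and rebuilds the account balance. The Deposit and Withdraw types are hypothetical and exist only for this example; they are not part of the Kafka APIs.

```java
import java.util.List;

public class AccountReplay {

    // Hypothetical event types for this example only.
    sealed interface Event permits Deposit, Withdraw {}
    record Deposit(long amount) implements Event {}
    record Withdraw(long amount) implements Event {}

    public static void main(String[] args) {
        // The events as they would be read, in order, from the commit log.
        List<Event> events = List.of(
            new Deposit(200),   // Deposit 200 USD
            new Withdraw(50),   // Withdraw 50 USD
            new Withdraw(5)     // Withdraw 5 USD
        );

        // Replaying the events rebuilds the current state of the account.
        long balance = 0;
        for (Event event : events) {
            if (event instanceof Deposit deposit) {
                balance += deposit.amount();
            } else if (event instanceof Withdraw withdrawal) {
                balance -= withdrawal.amount();
            }
        }

        System.out.println("Current balance: " + balance + " USD"); // Prints 145 USD
    }
}
```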
Kafka is composed of several servers, which allows horizontal scaling. A Kafka server is called a broker. When several brokers work together, they form a cluster, which is a logical grouping of brokers working together.
Among the brokers in a cluster, Apache ZooKeeper selects a controller. The controller broker is responsible for managing partitions and replicas as well as performing other administrative tasks. Because brokers are independent instances, they have to communicate over a network to share information about failures or updates. Apache ZooKeeper is used by Kafka to perform operations such as selecting a controller broker, notifying the controller broker that another broker is joining or leaving the cluster, or handling topic creation/removal.
Note
In future versions of Kafka, Apache ZooKeeper will be removed. Because ZooKeeper runs outside of the Kafka cluster, it adds management overhead and requires additional resources.
A Kafka cluster forms a distributed messaging system.
A message (also called record) is an array of bytes sent over a network.
In an event-driven architecture, an event can be represented as a Kafka message.
For example, the event UserCreated, which contains information about the creation of a user, would be converted into an array of bytes to send the data as a Kafka message.
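As an illustration only, the following sketch encodes a hypothetical UserCreated event as JSON and converts it into the byte array that would become the value of a Kafka message. The event type and its fields are invented for this example; any serialization format (JSON, Avro, and so on) works, because Kafka only sees bytes.

```java
import java.nio.charset.StandardCharsets;

public class UserCreatedSerialization {

    // Hypothetical event type for this example only.
    record UserCreated(String userId, String email) {}

    public static void main(String[] args) {
        UserCreated event = new UserCreated("42", "user@example.com");

        // Encode the event as JSON. Kafka does not care about the format:
        // the message value is just an array of bytes.
        String json = String.format(
            "{\"userId\":\"%s\",\"email\":\"%s\"}",
            event.userId(), event.email());

        byte[] messageValue = json.getBytes(StandardCharsets.UTF_8);

        System.out.println("Message value is " + messageValue.length + " bytes");
    }
}
```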
Kafka categorizes messages into topics.
A topic is a category that messages are sent to.
Messages within the same topic are usually related.
For example, in a banking application, a topic named transactions would contain messages regarding transaction operations (sending money or withdrawing money).
Topics are append-only, which means that you cannot remove specific messages. However, you can set a retention policy. A retention policy sets the criteria for messages to be removed from the topic. You can configure Kafka to remove messages after a certain period of time (for example, 5 days) or whenever the topic reaches a particular size (for example, 3 GB).
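As a sketch of how such a policy might be set, the following code uses the Kafka Admin API to create a topic with retention settings. The broker address and topic name are placeholders, and the retention.ms and retention.bytes values correspond to 5 days and roughly 3 GB (note that retention.bytes is enforced per partition).

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class RetentionPolicyExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Create a topic whose messages are removed after 5 days,
            // or when a partition grows beyond roughly 3 GB.
            NewTopic transactions = new NewTopic("transactions", 1, (short) 1)
                .configs(Map.of(
                    "retention.ms", "432000000",      // 5 days in milliseconds
                    "retention.bytes", "3221225472"   // 3 GB, enforced per partition
                ));

            admin.createTopics(List.of(transactions)).all().get();
        }
    }
}
```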
In Kafka, a producer is an application that writes data to topics, and a consumer is an application that reads data from topics.
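The following minimal sketch, using the Java client, shows both roles: a producer that writes one message to a hypothetical transactions topic, and a consumer that reads it back. The broker address, topic name, and consumer group are placeholders.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerConsumerSketch {
    public static void main(String[] args) {
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092"); // placeholder
        producerProps.put("key.serializer", StringSerializer.class.getName());
        producerProps.put("value.serializer", StringSerializer.class.getName());

        // Producer: write a message to the "transactions" topic.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("transactions", "account-42", "Deposit 200 USD"));
        }

        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092"); // placeholder
        consumerProps.put("group.id", "transactions-reader");
        consumerProps.put("auto.offset.reset", "earliest");
        consumerProps.put("key.deserializer", StringDeserializer.class.getName());
        consumerProps.put("value.deserializer", StringDeserializer.class.getName());

        // Consumer: read messages back from the same topic.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of("transactions"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("key=%s value=%s%n", record.key(), record.value());
            }
        }
    }
}
```

Note that the record carries an optional key (account-42 in this sketch); the key becomes relevant for how messages are assigned to partitions, as described next.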
Every topic is divided into several partitions. A partition is an ordered collection of messages within a topic. When you publish a message to a topic, it is stored in one of its partitions. Message order is guaranteed within a partition, but not across the topic, because a consumer can read messages from any of the topic's partitions.
To achieve scalability, partitions of a topic are split across the different brokers of the cluster. This enables a topic to be scaled horizontally, which means that the workload is distributed across different brokers, instead of providing more resources to a single broker to cope with all the workload. Consumers of a topic can read messages from different machines, avoiding all requests being sent to the same broker.
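By default, the Java producer uses the message key to choose the partition, so records that share a key are written to the same partition and keep their relative order for consumers. The following sketch, with placeholder broker address and topic name, sends three records with the same key and prints the partition and offset assigned to each one.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedOrdering {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String[] values = {"Deposit 200 USD", "Withdraw 50 USD", "Withdraw 5 USD"};
            for (String value : values) {
                // Same key ("account-42") => same partition => order preserved.
                RecordMetadata metadata = producer
                    .send(new ProducerRecord<>("transactions", "account-42", value))
                    .get();
                System.out.printf("value=%-16s partition=%d offset=%d%n",
                    value, metadata.partition(), metadata.offset());
            }
        }
    }
}
```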
At the same time, every partition is replicated to provide data redundancy in case of failure.
These replicas are spread across the Kafka brokers in the cluster.
Among these replicas, located on different brokers, Kafka selects a leader by applying an election algorithm.
Producers and consumers write and read data only through the partition leader, and the data is then propagated to the remaining replicas. If the broker that contains the partition leader goes down, Kafka elects a new leader from the remaining replicas. This way, data is always available.
For example, in a cluster with 3 brokers, consider that you create a topic called Topic A with 3 partitions and a replication factor of 3.
This means that each partition of Topic A has one leader and two follower replicas, for a total of three replicas per partition.
The following diagram illustrates a possible layout for the cluster.
Every partition leader is on a different broker, which ensures that consumers read from different machines.
At the same time, every broker holds a replica for every partition.
If, for example, Broker 0 goes down, then Partition 0 needs a new leader, which Kafka elects from the replicas on Broker 1 or Broker 2.
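As a sketch of how you might create and inspect such a topic with the Java Admin API (assuming a recent Kafka client; the broker addresses and the topic-a name are placeholders), the following code creates the topic with 3 partitions and a replication factor of 3, and then prints the leader and replica brokers of each partition.

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.admin.TopicDescription;

public class TopicLayout {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder addresses for the three brokers in the cluster.
        props.put("bootstrap.servers", "broker0:9092,broker1:9092,broker2:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 3.
            admin.createTopics(List.of(new NewTopic("topic-a", 3, (short) 3))).all().get();

            TopicDescription description = admin.describeTopics(List.of("topic-a"))
                .allTopicNames().get().get("topic-a");

            // Print which broker leads each partition and where its replicas live.
            description.partitions().forEach(partition ->
                System.out.printf("Partition %d: leader=%s replicas=%s%n",
                    partition.partition(), partition.leader(), partition.replicas()));
        }
    }
}
```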
Aside from the Kafka platform itself, there are other applications designed to be used with Kafka. These applications include clients, Red Hat OpenShift Container Platform (RHOCP) operators, and applications that use the Kafka APIs.
A Kafka client allows you to interact (produce or consume messages) with a Kafka cluster by connecting to a broker over a network.
- Programming language clients
There are clients available for a variety of programming languages (Java, Python, Go…).
You can find a list of available clients in the Apache Kafka documentation.
- Built-in shell client
The installation of a Kafka broker also provides some built-in shell scripts (for example, kafka-console-producer.sh or kafka-console-consumer.sh). You can find these scripts in the bin directory of the installed broker. If you are using a cloud environment and you have deployed a broker as a container, then these scripts are in the bin directory of your container.
- Kafka Streams
Kafka Streams is a client library that allows you to retrieve and process data from Kafka in real time. The main difference from other standard client libraries is that Kafka Streams encapsulates reading from a topic in a stream. You can perform aggregations on different streams, as well as other data manipulations.
Kafka Streams is covered in later sections of this course.
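As a preview, the following minimal Kafka Streams sketch (with placeholder broker address and topic names) reads the transactions topic as a stream, transforms each value, and writes the results to another topic.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class TransactionsStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "transactions-uppercaser");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Read the "transactions" topic as a stream, transform each value,
        // and write the result to the "transactions-uppercase" topic.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> transactions = builder.stream("transactions");
        transactions
            .mapValues(value -> value.toUpperCase())
            .to("transactions-uppercase");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Stop the streams application cleanly on shutdown.
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```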
There are several ways in which you can deploy Kafka. You can install brokers on several bare-metal machines and then configure them to work together as a cluster (along with installing and configuring ZooKeeper).
To make this process easier, many people choose a cloud environment such as RHOCP. You can deploy Kafka on RHOCP from scratch (deploying the brokers and configuring everything yourself); however, you can simplify the process by using AMQ Streams on RHOCP. AMQ Streams on RHOCP is based on the Strimzi upstream project.
- Describing Strimzi
Strimzi provides a way to run Kafka easily in Kubernetes or RHOCP. Using several RHOCP operators behind the scenes, it sets up a Kafka cluster and a ZooKeeper cluster and configures both of them.
Strimzi uses RHOCP Custom Resources to perform operations such as creating the Kafka cluster or creating a new topic.
- Describing the Strimzi Architecture
Strimzi uses several RHOCP operators to manage Kafka when it is running in an RHOCP cluster.
Cluster operator
Manages the deployment of the Kafka cluster itself. Provides a Kafka custom resource that allows you to configure several aspects of the cluster.
Topic operator
Manages the creation, update, and removal of Kafka topics. Provides a KafkaTopic custom resource that allows you to set topic configurations such as the number of partitions or replicas.
User operator
Manages the creation, update, and removal of Kafka users. Provides a KafkaUser custom resource.
Entity operator
Deploys and manages both the Topic operator and the User operator.
Important
This course uses a Kafka cluster deployed and managed in RHOCP by Strimzi.
The Kafka platform offers a built-in solution for moving data between Kafka and other platforms.
Kafka Connect allows you to integrate Kafka with other systems such as databases or other messaging platforms. It acts as a bridge between Kafka and other platforms.
Kafka Connect is used in later sections of this course.
References
Apache Kafka Official Documentation
Strimzi Official Documentation
Neha Narkhede, Gwen Shapira, Todd Palino. (2017) Kafka: The Definitive Guide. O'Reilly. ISBN 978-1-491-99065-0.