Introduction
Apache Kafka is a publish/subscribe messaging system that was developed at LinkedIn and open sourced in 2010. It is often described as a “distributed commit log” or, more recently, as a “distributed streaming platform.”
A filesystem or database commit log is designed to provide a durable record of all transactions so that they can be replayed to build the state of a system consistently.
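The replay idea can be sketched with a toy in-memory log. This is only an illustration of the commit-log concept, not Kafka code; the names `append` and `replay` and the key/value layout are assumptions made up for this example.

```python
# Toy commit log: every change is appended and nothing is edited in place.
log = []  # append-only list of (key, value) transactions


def append(key, value):
    """Record one transaction at the end of the log."""
    log.append((key, value))


def replay(entries):
    """Rebuild state by applying every transaction in its original order."""
    state = {}
    for key, value in entries:
        state[key] = value
    return state


append("balance:alice", 100)
append("balance:bob", 50)
append("balance:alice", 70)

# Replaying the same log always produces the same state.
print(replay(log))  # {'balance:alice': 70, 'balance:bob': 50}
```

Replaying only a prefix of the log (for example `replay(log[:2])`) gives the state as it was at that earlier point in time.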
The unit of data within Kafka is called a message. In database terms, a message is similar to a row or record.
The message is simply an array of bytes and does not have a specific format or meaning.
A topic is a category or feed name to which messages are published, much like a TV channel or radio station.
Topics are additionally broken down into several partitions.
Messages are written to each partition in an append-only fashion.
Note that as a topic typically has multiple partitions, message time-ordering is not guaranteed across the entire topic, just within a single partition.
Partitions provide redundancy and scalability. Each partition can be hosted on a different server, which lets a single topic scale horizontally across multiple servers.
A message can have an optional bit of metadata, a key. As with the message, the key is also a byte array with no specific meaning to Kafka. Keys are used when messages are to be written to partitions in a more controlled manner.
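As a loose analogy, every MongoDB document has an _id field. A Kafka key plays a similar identifying role, except that it is optional, need not be unique, and is mainly used to route messages to partitions.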
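A TV show uses season and episode numbers to mark a position in the series; an offset marks a position within a partition in a similar way.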
Offset values in Kafka are long integers uniquely identifying each message within a partition. They start at 0 for the first message in a partition and increment by 1 for each subsequent message. For example, if a partition has five messages, the offsets would be 0, 1, 2, 3, and 4.
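The five-message example above can be modeled with a small in-memory partition. This is a sketch only: the `Partition` class is hypothetical and stands in for what a broker does when it assigns offsets.

```python
class Partition:
    """Toy in-memory partition: an append-only list whose index is the offset."""

    def __init__(self):
        self._messages = []

    def append(self, message: bytes) -> int:
        """Append a message and return the offset assigned to it."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset: int) -> bytes:
        """Return the message stored at the given offset."""
        return self._messages[offset]


partition = Partition()
offsets = [partition.append(f"event-{i}".encode()) for i in range(5)]

print(offsets)            # [0, 1, 2, 3, 4]
print(partition.read(3))  # b'event-3'
```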
A batch is a collection of messages on the same topic and partition.
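One way to picture batching is to group pending records by their (topic, partition) pair. The record layout and the `build_batches` helper below are assumptions for illustration and do not mirror Kafka's wire format.

```python
from collections import defaultdict


def build_batches(records):
    """Group (topic, partition, payload) records into one batch per topic-partition."""
    batches = defaultdict(list)
    for topic, partition, payload in records:
        batches[(topic, partition)].append(payload)
    return dict(batches)


pending = [
    ("orders", 0, b"order-1"),
    ("orders", 1, b"order-2"),
    ("orders", 0, b"order-3"),
    ("clicks", 0, b"click-1"),
]

for (topic, partition), payloads in build_batches(pending).items():
    print(topic, partition, payloads)
```

Each resulting batch contains only messages bound for the same topic and partition, so it can be sent in a single request.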
Producers create new messages.
In general, a message will be produced to a specific topic. By default, the producer does not care what partition a particular message is written to and will evenly balance messages over all partitions of a topic.
In some cases, the producer will direct messages to specific partitions. This is typically done using the message key and a partitioner to generate a hash of the key and map it to a particular partition.
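Both behaviors, balancing keyless messages and hashing keyed ones, can be sketched as follows. `ToyPartitioner` is a hypothetical class, and `zlib.crc32` is only a stand-in hash chosen because it ships with Python; Kafka's own clients use a different hash function, so the partition numbers produced here will not match a real cluster.

```python
import itertools
import zlib


class ToyPartitioner:
    """Illustrative partitioner: round-robin without a key, hash-based with one."""

    def __init__(self, num_partitions: int):
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def partition_for(self, key: bytes = None) -> int:
        if key is None:
            # No key: spread messages evenly over all partitions.
            return next(self._round_robin)
        # Key present: hash it and map the hash onto a partition.
        # crc32 is an assumption made for this sketch only.
        return zlib.crc32(key) % self.num_partitions


partitioner = ToyPartitioner(num_partitions=3)

print([partitioner.partition_for() for _ in range(6)])  # [0, 1, 2, 0, 1, 2]

# The same key always lands on the same partition.
first = partitioner.partition_for(b"user-42")
second = partitioner.partition_for(b"user-42")
print(first == second)  # True
```

Because the mapping depends on the number of partitions, changing the partition count would generally move existing keys to different partitions.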
Consumers work as part of a consumer group, one or more consumers working together to consume a topic. The group ensures that each partition is only consumed by one member.
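The "one owner per partition" rule can be sketched with a simple round-robin assignment. This is a simplification: in a real cluster the assignment is negotiated by the group using configurable strategies, and `assign_partitions` is a hypothetical helper written for this note.

```python
def assign_partitions(partitions, consumers):
    """Give every partition exactly one owner, cycling through the group members."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


assignment = assign_partitions(
    partitions=[0, 1, 2, 3],
    consumers=["consumer-a", "consumer-b", "consumer-c"],
)
print(assignment)
# {'consumer-a': [0, 3], 'consumer-b': [1], 'consumer-c': [2]}
```

If the group has more consumers than the topic has partitions, the extra consumers end up with empty lists and sit idle.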
A single Kafka server is called a broker. The broker receives messages from producers, assigns offsets to them, and commits the messages to storage on disk.
Kafka brokers are designed to operate as part of a cluster.
Within a cluster of brokers, one broker also functions as the cluster controller.
A partition may be assigned to multiple brokers, which results in the partition being replicated.
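As a rough sketch of what "assigned to multiple brokers" means, the helper below places each partition on several consecutive brokers. The placement rule and the `assign_replicas` name are assumptions for illustration; Kafka's actual replica placement is more involved.

```python
def assign_replicas(num_partitions, brokers, replication_factor):
    """Place each partition on `replication_factor` distinct brokers."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    return {
        partition: [
            brokers[(partition + replica) % len(brokers)]
            for replica in range(replication_factor)
        ]
        for partition in range(num_partitions)
    }


placement = assign_replicas(num_partitions=3, brokers=[1, 2, 3], replication_factor=2)
print(placement)  # {0: [1, 2], 1: [2, 3], 2: [3, 1]}
```

With two copies of every partition, losing any single broker in this toy layout still leaves one copy of each partition available.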
Automatic offset commits provide "at least once" delivery semantics because if the consumer fails after processing a message but before the next auto-commit, it may re-process some messages after restarting.
This behavior ensures no message loss but can lead to duplicate processing.
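The failure scenario can be simulated without a broker. In this sketch the commit happens after every `commit_every` processed messages, which is an assumed stand-in for a timed auto-commit, and `run_consumer` is a hypothetical function, not part of any Kafka client.

```python
def run_consumer(messages, start_offset, commit_every, crash_after=None):
    """Process messages from start_offset, committing only periodically.

    Returns (processed_messages, committed_offset), where committed_offset
    is the position a restarted consumer would resume from.
    """
    committed = start_offset
    processed = []
    for offset in range(start_offset, len(messages)):
        processed.append(messages[offset])  # process first
        if crash_after is not None and len(processed) == crash_after:
            return processed, committed     # crash before the next commit
        if len(processed) % commit_every == 0:
            committed = offset + 1          # periodic commit
    return processed, len(messages)         # clean shutdown commits the final position


messages = ["m0", "m1", "m2", "m3", "m4"]

# First run crashes after processing m2, but only offsets up to m1 were committed.
first_run, committed = run_consumer(messages, 0, commit_every=2, crash_after=3)
# The restarted consumer resumes from the last committed offset.
second_run, _ = run_consumer(messages, committed, commit_every=2)

print(first_run)   # ['m0', 'm1', 'm2']
print(second_run)  # ['m2', 'm3', 'm4']
```

Message `m2` is processed twice and nothing is skipped, which is the duplicate-but-no-loss behavior described above.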
By contrast, "at most once" delivery means a message is delivered a maximum of one time: it may be lost, but it is never processed twice.
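Delivering a message at most one time can be sketched by committing the offset before processing, so a crash loses the in-flight message instead of repeating it. `consume_at_most_once` and its `crash_at` parameter are hypothetical names for this simulation only.

```python
def consume_at_most_once(messages, start_offset, crash_at=None):
    """Commit each offset before processing it.

    Returns (processed_messages, committed_offset).
    """
    committed = start_offset
    processed = []
    for offset in range(start_offset, len(messages)):
        committed = offset + 1             # commit first
        if offset == crash_at:
            return processed, committed    # crash before processing this message
        processed.append(messages[offset])
    return processed, committed


messages = ["m0", "m1", "m2", "m3", "m4"]

before_crash, committed = consume_at_most_once(messages, 0, crash_at=2)
after_restart, _ = consume_at_most_once(messages, committed)

print(before_crash)   # ['m0', 'm1']
print(after_restart)  # ['m3', 'm4']
```

Here `m2` is never processed because its offset was already committed when the crash happened, and no message is processed more than once.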