Big Data & Tools with NoSQL
  • Big Data & Tools
  • ReadMe
  • Big Data Overview
    • Overview
    • Job Opportunities
    • What is Data?
    • How does it help?
    • Types of Data
    • The Big 4 V's
      • Variety
      • Volume
      • Velocity
      • Veracity
      • Other V's
    • Trending Technologies
    • Big Data Concerns
    • Big Data Challenges
    • Data Integration
    • Scaling
      • CAP Theorem
      • Optimistic concurrency
      • Eventual consistency
      • Concurrent vs. Parallel Programming
    • Big Data Tools
    • No SQL Databases
    • What does Big Data learning means?
  • Linux & Tools
    • Overview
    • Linux Commands - 01
    • Linux Commands - 02
    • AWK
    • CSVKIT
    • CSVSQL
    • CSVGREP
  • Data Format
    • Storage Formats
    • CSV/TSV/Parquet
    • Parquet Example
    • JSON
    • HTTP & REST API
      • Terms to Know
        • Statefulness
        • Statelessness
        • Monolithic Architecture
        • Microservices
        • Idempotency
    • REST API
    • Python
      • Setup
      • Decorator
      • Unit Testing
      • Flask Demo
      • Flask Demo - 01
      • Flask Demo - 02
      • Flask Demo - 03
      • Flask Demo - 04
      • Flask Demo - 06
    • API Testing
    • Flask Demo Testing
    • API Performance
    • API in Big Data World
  • NoSQL
    • Types of NoSQL Databases
    • Redis
      • Overview
      • Terms to know
      • Redis - (RDBMS) MySql
      • Redis Cache Demo
      • Use Cases
      • Data Structures
        • Strings
        • List
        • Set
        • Hash
        • Geospatial Index
        • Pub/Sub
        • Redis - Python
      • Redis JSON
      • Redis Search
      • Persistence
      • Databases
      • Timeseries
    • Neo4J
      • Introduction
      • Neo4J Terms
      • Software
      • Neo4J Components
      • Hello World
      • Examples
        • MySQL: Neo4J
        • Sample Transactions
        • Sample
        • Create Nodes
        • Update Nodes
        • Relation
        • Putting it all together
        • Commonly used Functions
        • Data Profiling
        • Queries
        • Python Scripts
      • More reading
    • MongoDB
      • Sample JSON
      • Introduction
      • Software
      • MongoDB Best Practices
      • MongoDB Commands
      • Insert Document
      • Querying MongoDB
      • Update & Remove
      • Import
      • Logical Operators
      • Data Types
      • Operators
      • Aggregation Pipeline
      • Further Reading
      • Fun Task
        • Sample
    • InfluxDB
      • Data Format
      • Scripts
  • Python
    • Python Classes
    • Serialization-Deserialization
  • Tools
    • JQ
    • DUCK DB
    • CICD Intro
    • CICD Tools
      • CI YAML
      • CD Yaml
    • Containers
      • VMs or Containers
      • What container does
      • Podman
      • Podman Examples
  • Cloud Everywhere
    • Overview
    • Types of Cloud Services
    • Challenges of Cloud Computing
    • High Availability
    • Azure Cloud
      • Services
      • Storages
      • Demo
    • Terraform
  • Data Engineering
    • Batch vs Streaming
    • Kafka
      • Introduction
      • Kafka Use Cases
      • Kafka Software
      • Python Scripts
      • Different types of Streaming
    • Quality & Governance
    • Medallion Architecture
    • Data Engineering Model
    • Data Mesh
  • Industry Trends
    • Roadmap - Data Engineer
    • Good Reads
      • IP & SUBNET
Powered by GitBook
On this page
  • Messages
  • Topic
  • Partitions
  • Keys
  • Offset
  • Batches
  • Producers
  • Consumers
  • Broker
  • At least Once Delivery
  • At most Delivery
  1. Data Engineering
  2. Kafka

Introduction

PreviousKafkaNextKafka Use Cases

Last updated 1 year ago

Developed by LinkedIn and open sourced in 2010.

Apache Kafka is a publish/subscribe messaging system designed to solve this problem. It is often described as a “distributed commit log” or, more recently, as a “distributing streaming platform.”

A filesystem or database commit log is designed to provide a durable record of all transactions so that they can be replayed to build the state of a system consistently.

Messages

The unit of data within Kafka is called a message. In the DB world, it's similar to a Row.

The message is simply an array of bytes and does not have a specific format or meaning.

Topic

The topic is a category or feed name to which messages are published, like a TV Channel / Radio station.

Partitions

Topics are additionally broken down into several partitions.

Messages are written to it in an append-only fashion.

Note that as a topic typically has multiple partitions, message time-ordering is not guaranteed across the entire topic, just within a single partition.

Partitions provide redundancy and scalability. The partition can be hosted on a different server.

Keys

A message can have an optional bit of metadata, a key. As with the message, the key is also a byte array with no specific meaning to Kafka. Keys are used when messages are to be written to partitions in a more controlled manner.

Remember, MongoDB data has _id field. Similarly, keys can be used to relay messages.

TV Shows you have Season/Episode

Offset

Offset values in Kafka are long integers uniquely identifying each message within a partition. They start at 0 for the first message in a partition and increment by 1 for each subsequent message. For example, if a partition has five messages, the offsets would be 0, 1, 2, 3, and 4.

Batches

A batch is a collection of messages on the same topic and partition.

Producers

Producers create new messages. In general, a message will be produced on a specific topic.

In general, a message will be produced to a specific topic. By default, the producer does not care what partition a particular message is written to and will evenly balance messages over all partitions of a topic.

In some cases, the producer will direct messages to specific partitions. This is typically done using the message key and a partitioner to generate a hash of the key and map it to a particular partition.

Consumers

Consumers work as part of a consumer group, one or more consumers working together to consume a topic. The group ensures that each partition is only consumed by one member.

Broker

A single Kafka server is called a broker. The broker receives messages from producers, assigns offsets to them, and commits the messages to storage on disk.

Kafka brokers are designed to operate as part of a cluster.

One broker will also function as the cluster controller within a cluster of brokers.

A partition may be assigned to multiple brokers, which will result in Replication.

At least Once Delivery

  • Automatic offset commits provide "at least once" delivery semantics because if the consumer fails after processing a message but before the next auto-commit, it may re-process some messages after restarting.

  • This behavior ensures no message loss but can lead to duplicate processing.

At most Delivery

  • The message is sent a maximum of only one time.

Src: Oreilly Kafka Book
Src: Oreilly Kafka Book
Src: Oreilly Kafka Book