
Different types of Streaming

  • Structured Streaming (Spark)
  • Kafka (open source)
  • MSK (Amazon Managed Streaming for Apache Kafka)
  • Kinesis (AWS)
  • Autoloader (Databricks)
  • DMS (AWS Database Migration Service)
  • Event Hubs (Azure)
  • MLflow
  • Terraform

Stateless Streaming:

An example of a stateless streaming application is a real-time weather data processing system. The system receives a continuous stream of readings, such as temperature, humidity, and wind speed, from various weather sensors. As each data point arrives, the system transforms it, for example converting units or filtering out invalid readings, and sends the result to a downstream system for further analysis or storage. No state is kept between data points: each reading is processed on its own, and the output never depends on earlier readings.
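The idea can be sketched in a few lines of Python: each record is transformed in isolation, with nothing carried over between calls. The field names (`sensor_id`, `temp_c`) are illustrative, not from any real sensor API.

```python
def process_reading(reading):
    """Transform one reading; the output depends only on this record."""
    return {
        "sensor_id": reading["sensor_id"],
        # Convert Celsius to Fahrenheit -- a purely per-record calculation.
        "temp_f": reading["temp_c"] * 9 / 5 + 32,
    }

stream = [
    {"sensor_id": "s1", "temp_c": 20.0},
    {"sensor_id": "s2", "temp_c": 25.0},
]

# Each element is processed independently; order and history do not matter.
results = [process_reading(r) for r in stream]
print(results)
```

Because there is no shared state, such a pipeline can be scaled out trivially: any worker can process any record.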

Stateful Streaming:

An example of a stateful streaming application is a real-time fraud detection system at a financial institution. The system receives a continuous stream of transactions from sources such as ATM withdrawals, online purchases, and wire transfers. As each transaction arrives, the system compares it against the current state of the customer's account and their previous transactions. If it detects unusual or suspicious activity, such as a large withdrawal from an account with a low balance or a purchase from a location the customer has never visited, it flags the transaction for further review. It then updates the account's state, so later transactions are evaluated against the most recent history.
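A minimal sketch of this pattern, assuming a single fraud rule (a withdrawal larger than the known balance) and illustrative field names; a production system would track far richer per-customer state:

```python
# State kept between events: last known balance per account.
balances = {}

def process_txn(txn, balances):
    """Flag one transaction against the account's running state."""
    account = txn["account"]
    prev = balances.get(account, 0.0)
    # Rule (illustrative): a withdrawal exceeding the known balance is suspicious.
    suspicious = txn["amount"] < 0 and -txn["amount"] > prev
    # Update the state so the next event sees the latest balance.
    balances[account] = prev + txn["amount"]
    return suspicious

events = [
    {"account": "a1", "amount": 100.0},   # deposit
    {"account": "a1", "amount": -30.0},   # normal withdrawal
    {"account": "a1", "amount": -500.0},  # exceeds balance -> flagged
]
flags = [process_txn(e, balances) for e in events]
print(flags)  # -> [False, False, True]
```

Note that the result for each event depends on the events before it, which is exactly what makes the pipeline stateful: the state must be stored, partitioned, and recovered on failure.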

Structured Streaming

Structured Streaming is a high-level API for processing real-time and batch data streams in Apache Spark. It lets developers express stream processing logic the same way as batch processing, using the same DataFrame and SQL APIs, so real-time data pipelines can be built and maintained with familiar tools and frameworks.

Structured Streaming provides a unified batch and streaming API, allowing developers to write their code once and run it in both batch and streaming modes. It also offers built-in support for event-time processing, watermarking, and state management, making it easy to handle out-of-order and late-arriving data. Additionally, it uses a "micro-batch" processing model, which allows for low-latency processing of small batches of data rather than processing each data point individually.

In summary, Structured Streaming is a high-level API for stream processing in Spark that lets developers express their logic with the DataFrame and SQL APIs, handle out-of-order and late-arriving data, and process data in micro-batches for low-latency pipelines.
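The micro-batch model described above can be illustrated in plain Python (a real job would use `pyspark.sql`'s `readStream`/`writeStream`; this sketch only mimics the execution model). Each micro-batch is aggregated as a unit, and running state carries the aggregation across batches, the way a streaming `groupBy` keeps state across triggers:

```python
from collections import defaultdict

def run_micro_batch(batch, state):
    """Aggregate one micro-batch of (sensor, temp) pairs into running state."""
    for sensor, temp in batch:
        count, total = state[sensor]
        state[sensor] = (count + 1, total + temp)
    # Emit the current running average per sensor, as a streaming
    # aggregation would on each trigger.
    return {s: total / count for s, (count, total) in state.items()}

# state maps sensor -> (count, total); it survives between micro-batches.
state = defaultdict(lambda: (0, 0.0))
batch1 = [("s1", 20.0), ("s1", 22.0)]
batch2 = [("s1", 24.0), ("s2", 30.0)]

print(run_micro_batch(batch1, state))  # -> {'s1': 21.0}
print(run_micro_batch(batch2, state))  # -> {'s1': 22.0, 's2': 30.0}
```

Spark adds what this toy version omits: fault-tolerant checkpointing of the state, event-time windows, and watermarks for deciding when late data can be dropped.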
