Different types of Streaming
Stateless Streaming:
An example of a stateless streaming application is a real-time weather data processing system. The system receives a continuous stream of weather data, such as temperature, humidity, and wind speed, from various weather sensors. As each data point arrives, the system applies a per-record operation, such as converting units or filtering out invalid readings, and then sends the processed data to a downstream system for further analysis or storage. The system does not maintain any state between data points: each one is processed independently, and the result does not depend on previous data points. (Averaging the temperature over a period, by contrast, would require remembering earlier readings and would make the application stateful.)
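A minimal sketch of stateless processing in plain Python, assuming each reading arrives as a dictionary. The field names (`sensor_id`, `temperature_c`, `wind_speed_ms`, `humidity`) and both function names are hypothetical and chosen only for illustration. The point is that each output is computed from a single input reading and nothing is remembered between readings.

```python
from typing import Any, Dict, Iterable, Iterator

Reading = Dict[str, Any]  # hypothetical record shape, not a real sensor schema


def process_reading(reading: Reading) -> Reading:
    """Transform one reading using only the fields of that reading."""
    return {
        "sensor_id": reading["sensor_id"],
        # Celsius to Fahrenheit
        "temperature_f": reading["temperature_c"] * 9 / 5 + 32,
        # metres per second to kilometres per hour
        "wind_speed_kmh": reading["wind_speed_ms"] * 3.6,
        "humidity": reading["humidity"],
    }


def stateless_stream(readings: Iterable[Reading]) -> Iterator[Reading]:
    """Process readings one at a time; no variable survives between iterations."""
    for reading in readings:
        yield process_reading(reading)


if __name__ == "__main__":
    sample = [
        {"sensor_id": "s1", "temperature_c": 20.0, "wind_speed_ms": 5.0, "humidity": 40.0},
        {"sensor_id": "s2", "temperature_c": 25.0, "wind_speed_ms": 2.0, "humidity": 55.0},
    ]
    for processed in stateless_stream(sample):
        print(processed)
```

Because no reading depends on another, the readings could be split across any number of workers and processed in any order with the same results.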
Stateful Streaming:
An example of a stateful streaming application is a real-time fraud detection system for a financial institution. The system receives a continuous stream of transactions from various sources, such as ATM withdrawals, online purchases, and wire transfers. As each transaction arrives, the system compares it to the current state of the customer's account and to previous transactions. If the system detects unusual or suspicious activity, such as a large withdrawal from an account with a low balance or a purchase from a location the customer has never visited, it flags the transaction for further review. The system also updates the customer's account state after each transaction, so that later transactions are checked against up-to-date information.
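The same idea can be sketched in plain Python. This is an illustrative toy under stated assumptions, not a real fraud model: the `AccountState` and `FraudDetector` names, the transaction fields, and the 80% withdrawal threshold are all hypothetical. What it shows is that the result for a transaction depends on state built up from earlier transactions, and that the state is updated after every transaction.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set


@dataclass
class AccountState:
    """Per-customer state kept between transactions (hypothetical shape)."""
    balance: float
    seen_locations: Set[str] = field(default_factory=set)


class FraudDetector:
    def __init__(self, accounts: Dict[str, AccountState], withdrawal_ratio: float = 0.8):
        self.accounts = accounts
        # Assumed rule: a withdrawal above this fraction of the balance is suspicious.
        self.withdrawal_ratio = withdrawal_ratio

    def process(self, txn: Dict[str, Any]) -> List[str]:
        """Return the reasons this transaction was flagged (empty list if none)."""
        state = self.accounts[txn["account_id"]]
        reasons: List[str] = []

        if txn["kind"] == "withdrawal" and txn["amount"] > self.withdrawal_ratio * state.balance:
            reasons.append("large withdrawal relative to balance")
        if state.seen_locations and txn["location"] not in state.seen_locations:
            reasons.append("new location")

        # Update the state so later transactions are checked against it.
        state.balance -= txn["amount"]
        state.seen_locations.add(txn["location"])
        return reasons


if __name__ == "__main__":
    detector = FraudDetector({"a1": AccountState(balance=1000.0, seen_locations={"Berlin"})})
    stream = [
        {"account_id": "a1", "kind": "purchase", "amount": 50.0, "location": "Berlin"},
        {"account_id": "a1", "kind": "withdrawal", "amount": 900.0, "location": "Berlin"},
        {"account_id": "a1", "kind": "purchase", "amount": 10.0, "location": "Lagos"},
    ]
    for txn in stream:
        print(txn["kind"], txn["amount"], detector.process(txn))
```

Unlike the stateless case, the order of transactions matters here, and the state for a given account has to live somewhere that survives between events.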
Structured Streaming
Structured Streaming is a high-level API for stream processing in Apache Spark. It lets developers express stream processing logic the same way as batch processing, using the same DataFrame and SQL APIs. Because these tools are already familiar from batch work, real-time data pipelines are easier to build and maintain.
Because the API is unified across batch and streaming, much of the same code can run in either mode. Structured Streaming also offers built-in support for event-time processing, watermarking, and state management, which makes it easier to handle out-of-order and late-arriving data. By default, it uses a "micro-batch" processing model, which processes incoming data in small batches rather than one data point at a time while still keeping latency low.
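To make the interaction between micro-batches, event time, and watermarks more concrete, here is a simplified plain-Python model. It is a sketch of the general idea only, not Spark's implementation or API: the `WindowedCounter` class, its parameters, and the rule of advancing the watermark once per batch are assumptions made for illustration. Events are represented by their event time in seconds, and the model counts events per fixed-size window.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


class WindowedCounter:
    """Toy model: count events per event-time window, batch by batch."""

    def __init__(self, window_size: int, allowed_lateness: int):
        self.window_size = window_size
        self.allowed_lateness = allowed_lateness
        self.max_event_time = 0
        self.watermark = 0  # events for windows ending at or before this are too late
        self.open_windows: Dict[int, int] = defaultdict(int)  # window start -> count
        self.dropped = 0

    def process_batch(self, event_times: Iterable[int]) -> List[Tuple[int, int]]:
        """Process one micro-batch; return (window_start, count) for finalized windows."""
        for t in event_times:
            start = t - t % self.window_size
            if start + self.window_size <= self.watermark:
                self.dropped += 1  # window already finalized: event is too late
                continue
            self.open_windows[start] += 1
            self.max_event_time = max(self.max_event_time, t)

        # Advance the watermark once per batch, trailing the newest event time seen.
        self.watermark = max(self.watermark, self.max_event_time - self.allowed_lateness)

        finalized = sorted(
            (start, count)
            for start, count in self.open_windows.items()
            if start + self.window_size <= self.watermark
        )
        for start, _ in finalized:
            del self.open_windows[start]  # state for closed windows can be discarded
        return finalized


if __name__ == "__main__":
    counter = WindowedCounter(window_size=10, allowed_lateness=5)
    for batch in ([1, 3, 12], [8, 14, 26], [5, 27]):
        print(batch, "->", counter.process_batch(batch), "watermark:", counter.watermark)
    print("dropped:", counter.dropped)
```

In this run, the event at time 8 arrives out of order in the second batch but is still counted because its window has not been finalized, while the event at time 5 in the third batch arrives after its window closed and is dropped. The watermark is what lets the model bound how much window state it has to keep.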
In summary, Structured Streaming is a high-level API for stream processing in Spark that lets developers express stream processing logic with the DataFrame and SQL APIs, handle out-of-order and late-arriving data, and process data in small batches with low latency.