Structured Streaming and Delta Live Tables.
Structured Streaming: Structured Streaming is a stream processing engine built on Apache Spark that provides high-level, declarative APIs for processing and analyzing continuous data streams. It lets developers treat a stream as an unbounded, continuously growing table (a DataFrame), enabling seamless integration with batch processing and ordinary SQL queries. Through checkpointing and replayable sources, it provides fault tolerance and, for supported sinks, exactly-once processing semantics, ensuring data reliability and consistency. It supports a variety of sources and sinks, including files, Kafka, and more. With Structured Streaming, developers write continuous queries whose results update incrementally as new data arrives, enabling real-time analytics and insights.
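As a minimal sketch of the idea above: a continuous query over a Kafka topic, with checkpointing providing the fault tolerance mentioned. This assumes a running Spark session, a broker at `localhost:9092`, and a topic named `events` (broker address, topic, and checkpoint path are all hypothetical); it runs only where PySpark and Kafka are available.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# Treat the Kafka topic as an unbounded table of rows.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # hypothetical broker
    .option("subscribe", "events")                        # hypothetical topic
    .load()
)

# A continuous query: counts per 1-minute window, updated as new data arrives.
counts = (
    events
    .selectExpr("CAST(value AS STRING) AS value", "timestamp")
    .groupBy(window(col("timestamp"), "1 minute"))
    .count()
)

# Checkpointing plus a replayable source is what enables fault tolerance
# and exactly-once semantics for supported sinks.
query = (
    counts.writeStream
    .outputMode("complete")
    .format("console")
    .option("checkpointLocation", "/tmp/ckpt")  # hypothetical path
    .start()
)
query.awaitTermination()
```

The same query could be written against a file source or a different sink without changing the windowed aggregation itself; that batch/streaming symmetry is the point of the unbounded-table model.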
Delta Live Tables: Delta Live Tables (DLT) is a Databricks framework built on top of Delta Lake, the open-source storage layer that brings reliability and scalability to data lakes on Apache Spark. DLT provides a high-level, declarative API for building data pipelines: developers define tables as queries over other tables, and the framework infers the dependency graph, orchestrates updates, and tracks changes to those tables automatically. It supports both streaming tables and batch (materialized) tables, so real-time and batch processing can be mixed in a single pipeline. On top of the Delta Lake capabilities it inherits, such as transactional writes, schema evolution, time travel, and data versioning, DLT adds data-quality expectations for managing and processing data in a reliable and scalable manner.
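A small sketch of the declarative style described above, runnable only inside a Databricks DLT pipeline (the `dlt` module does not exist outside that runtime); the table names `orders_clean` and `orders_raw` and the expectation are hypothetical examples.

```python
import dlt  # available only inside a Databricks Delta Live Tables pipeline
from pyspark.sql.functions import col

# Declares a live table: DLT materializes it as a Delta table and
# keeps it up to date as the pipeline runs.
@dlt.table(comment="Cleaned orders, declared rather than orchestrated by hand")
@dlt.expect_or_drop("valid_amount", "amount > 0")  # data-quality expectation
def orders_clean():
    # Reading another pipeline table via dlt.read is how DLT
    # infers the dependency graph automatically.
    return dlt.read("orders_raw").where(col("order_status").isNotNull())
```

Note there is no scheduling or ordering code here; the query itself tells DLT that `orders_clean` depends on `orders_raw`.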
I think Delta Live Tables shines in a particular use case: when a team starts a project from scratch that involves many dependent pipelines, its automatic dependency management is a really important feature to consider.
A concrete scenario: IoT data flows into the organization continuously, and the underlying datasets change constantly. When you want to combine several such datasets, you end up with dependent pipelines, where upstream workloads must complete before downstream systems can start. And because IoT sources keep feeding new data, the streaming work needs to be processed incrementally, in a batch-like fashion; that is where Delta Live Tables comes into the picture.
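The IoT scenario above can be sketched as a two-stage DLT pipeline mixing a streaming dependency with a batch one. This runs only inside a Databricks DLT pipeline (where `dlt` and `spark` are provided); the landing path and the table names `device_dim`, `iot_bronze`, and `iot_silver` are hypothetical.

```python
import dlt  # runs only inside a Databricks DLT pipeline
from pyspark.sql.functions import col

# Bronze: continuously ingest raw IoT readings as a streaming table.
@dlt.table(comment="Raw IoT readings ingested incrementally")
def iot_bronze():
    return (
        spark.readStream.format("cloudFiles")   # Databricks Auto Loader
        .option("cloudFiles.format", "json")
        .load("/mnt/iot/landing")               # hypothetical landing path
    )

# Silver: enrich the stream with a slowly changing batch dataset.
# DLT sees the dlt.read* calls below and orders the tables itself,
# so this downstream table waits on its upstream dependencies.
@dlt.table(comment="Readings enriched with device metadata")
def iot_silver():
    devices = dlt.read("device_dim")            # batch dependency
    readings = dlt.read_stream("iot_bronze")    # streaming dependency
    return readings.join(devices, "device_id")
```

Mixing `dlt.read` and `dlt.read_stream` in one table definition is how real-time and batch processing combine in a single pipeline, with the dependency ordering maintained by the framework rather than by hand.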