Hello,
You can define either a live table/view or a streaming live table/view:
A live table or view always reflects the results of the query that defines it, including when that query or an input data source is updated. Like a traditional materialized view, a live table or view may be entirely recomputed when possible to optimize computation resources and time.
A streaming live table or view processes only data that has been added since the last pipeline update. Streaming tables and views are stateful; if the defining query changes, new data will be processed based on the new query, and existing data is not recomputed.
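As a minimal sketch of the difference using the Delta Live Tables Python API (the table names and landing path below are placeholders I've made up, and this only runs inside a DLT pipeline, not as a standalone script):

```python
import dlt

# Streaming live table: processes only files added since the last update.
@dlt.table(name="raw_events")
def raw_events():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .load("/path/to/landing")  # placeholder landing path
    )

# Live table: kept consistent with its defining query, recomputing as needed.
@dlt.table(name="group_counts")
def group_counts():
    return dlt.read("raw_events").groupBy("group").count()
```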
Auto Loader can also be used for batch processing by using readStream with the Trigger.Once option. https://learn.microsoft.com/en-us/azure/databricks/ingestion/auto-loader/production#cost-considerati....
So in your case you could set up an Auto Loader readStream to watch for new files landing in a file path, with wildcards, like:
input_stream = (
    spark.readStream.format("cloudFiles")           # Auto Loader source
    .option("cloudFiles.format", "csv")             # format of the incoming files
    .schema("name STRING, group STRING")            # explicit schema, skips inference
    .load(base_dir + "/data/*/datasource_A_*.csv")  # glob pattern for matching files
)
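To run this in the batch style mentioned above, you could then write the stream out with a one-shot trigger. A sketch (the checkpoint location and target table name are placeholders I've made up):

```python
# Process all files added since the last run, then stop (batch-style).
query = (
    input_stream.writeStream
    .option("checkpointLocation", base_dir + "/checkpoints/datasource_A")  # placeholder path
    .trigger(once=True)             # Trigger.Once: one batch of new data, then stop
    .toTable("datasource_A_bronze")  # placeholder target table
)
query.awaitTermination()
```

On newer runtimes, `.trigger(availableNow=True)` is the recommended replacement for `once=True`; it processes all available data in multiple rate-limited batches before stopping.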