Using DLT pipeline with non-incremental data

140015
New Contributor III

Hi,

I would like to know what you think about using Delta Live Tables when the source for the pipeline is not incremental. What I mean is: suppose the data provider creates a new folder of files each time it updates its data (e.g. /data/folder_1, /data/folder_2, /data/folder_3). Each time a new update arrives, I need to process the entire new folder and drop the old data from the previous folder.

I know that DLT and Auto Loader were designed for incremental data, so I ended up running a full refresh each time I ran the pipeline. I then tried dropping Auto Loader's readStream() and using a plain PySpark read() for ingestion into the pipeline instead, but the "Setting up tables" stage now takes a really long time.
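Roughly, the non-streaming version I tried looks like this (a simplified sketch; the table name, schema, and folder path are just placeholders):

import dlt

@dlt.table(name="datasource_a_raw")
def datasource_a_raw():
    # Plain batch read of the latest folder instead of Auto Loader / readStream;
    # the whole folder is re-read on every pipeline update.
    return (
        spark.read.format("csv")
        .schema("name STRING, group STRING")
        .load("/data/folder_3/")
    )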

What do you think?

1 ACCEPTED SOLUTION

AmarK
New Contributor III

Hello,

You can define a live or streaming live view or table:

A live table or view always reflects the results of the query that defines it, including when the query defining the table or view is updated, or an input data source is updated. Like a traditional materialized view, a live table or view may be entirely computed when possible to optimize computation resources and time.

A streaming live table or view processes data that has been added only since the last pipeline update. Streaming tables and views are stateful; if the defining query changes, new data will be processed based on the new query and existing data is not recomputed.
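In Python, the difference comes down to whether the table's defining query uses a batch read or a stream. A streaming live table over an Auto Loader source could be sketched like this (table name, schema, and path are placeholders):

import dlt

@dlt.table(name="datasource_a_stream")
def datasource_a_stream():
    # Streaming read: only files added since the last pipeline update are processed;
    # previously ingested rows are not recomputed.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .schema("name STRING, group STRING")
        .load("/data/*/")
    )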

Auto Loader can be used for batch processing as well, by running the readStream with a Trigger.Once option: https://learn.microsoft.com/en-us/azure/databricks/ingestion/auto-loader/production#cost-considerati....
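Used outside a DLT pipeline, that pattern is an Auto Loader readStream run as a one-shot, batch-style job; a rough sketch (the checkpoint path and target table name are made up):

(
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .schema("name STRING, group STRING")
    .load("/data/*/")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/datasource_a")
    .trigger(once=True)   # process everything currently available, then stop
    .toTable("datasource_a_bronze")
)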

So in your case you could set up an Auto Loader readStream to watch for new files that land in a path with wildcards, like:

input_stream = (
    spark.readStream.format("cloudFiles")               # Auto Loader source
    .option("cloudFiles.format", "csv")                 # incoming files are CSV
    .schema("name STRING, group STRING")                # explicit schema, so no inference pass
    .load(base_dir + "/data/*/datasource_A_*.csv")      # wildcard matches every update folder
)


3 REPLIES


140015
New Contributor III

Thank you for your answer!

Do you know how to save that created stream into a Delta table right away?

I need to save this stream into a temporary Delta table and then apply some transformations to it.
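Roughly what I'm imagining, with made-up table names, is a raw table that persists the stream plus a downstream table for the transformations:

import dlt

@dlt.table(name="raw_input")
def raw_input():
    # Persist the Auto Loader stream as a Delta table managed by the pipeline.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .schema("name STRING, group STRING")
        .load("/data/*/")
    )

@dlt.table(name="transformed")
def transformed():
    # Read the raw table as a stream and apply the transformations downstream.
    return dlt.read_stream("raw_input").where("name IS NOT NULL")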

Joe_Suarez
New Contributor III

When dealing with B2B data, updating and managing it can present unique challenges. Since your updates arrive as new folders of files and you need to process each entire folder, incremental processing doesn't directly apply to your situation. Running a full refresh every time may indeed be necessary, since you are effectively processing a new chunk of data with each update. While not as efficient as incremental updates, this is probably the most straightforward approach given your data source's behavior.

Tools like Delta Live Tables and Delta Lake can still be highly relevant in this scenario, especially if you're working with large volumes of data or need features like ACID transactions and schema evolution; they provide a robust framework for managing complex data workflows. However, since your updates aren't truly incremental, some of DLT's optimization benefits won't be fully realized in your case. Even so, DLT still offers valuable benefits such as improved data organization, versioning, and metadata management, which matter for maintaining high-quality data.
