topic Re: DLT refresh time for combination of streaming and non streaming tables? in Get Started Discussions

DLT refresh time for combination of streaming and non streaming tables?

surajitDE — Mon, 17 Mar 2025 06:29:26 GMT

@dlt.table
def joined_table():
    dim_df = spark.read.table("dim_table")  # Reloads every batch
    fact_df = spark.readStream.table("fact_stream")
    return fact_df.join(dim_df, "id", "left")

Re: DLT refresh time for combination of streaming and non streaming tables?

surajitDE — Mon, 17 Mar 2025 06:44:18 GMT

The question is: the default DLT pipeline refresh time is 5 seconds, but if I use a combination of streaming and non-streaming data, will it still be 5 seconds?

Re: DLT refresh time for combination of streaming and non streaming tables?

Advika — Tue, 18 Mar 2025 11:15:16 GMT

Hello @surajitDE!

When using both streaming and batch data, the pipeline may not always refresh every 5 seconds. While the streaming table (fact_stream) updates every 5 seconds, the batch table (dim_table) fully reloads each time, adding overhead from repeatedly loading the batch data.

The actual refresh time depends on the size of dim_table: larger tables take longer to reload, which can delay updates.
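
If the default interval turns out to be too tight for the reload, the trigger interval can be set per table. Here is a minimal sketch, assuming the `pipelines.trigger.interval` setting is passed through `spark_conf` on the table decorator. The helper name `trigger_conf` and the 30-second value are only illustrative. The guard around `dlt` is there so the snippet loads outside a pipeline; inside a DLT pipeline, `dlt` and `spark` are provided by the runtime.

```python
def trigger_conf(interval: str) -> dict:
    """Build the per-table Spark conf that sets the DLT trigger interval."""
    return {"pipelines.trigger.interval": interval}


try:
    import dlt  # available inside a Databricks DLT pipeline
except ImportError:
    dlt = None  # running somewhere else; skip the table definition

if dlt is not None and hasattr(dlt, "table"):

    @dlt.table(spark_conf=trigger_conf("30 seconds"))  # illustrative value
    def joined_table():
        # `spark` is the session the DLT runtime injects into the notebook.
        dim_df = spark.read.table("dim_table")
        fact_df = spark.readStream.table("fact_stream")
        return fact_df.join(dim_df, "id", "left")
```

A longer interval does not make the dim_table reload any cheaper; it only gives each update more room to finish before the next one is triggered.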

Re: DLT refresh time for combination of streaming and non streaming tables?

surajitDE — Fri, 21 Mar 2025 07:06:00 GMT

In a Delta Live Tables (DLT) continuous pipeline, does it make a difference if df_dim_prev (loaded in cell 1) is only read once at the start?

For example, if df_dim_prev is initialized as:

# Cell 1: Read dim_table once
df_dim_prev = spark.read.table("dim_table")

Then used in a streaming join inside a DLT table:

# Cell 2: DLT table with a streaming source
@dlt.table
def joined_table():
    dim_df = df_dim_prev  # Using the preloaded dimension table
    fact_df = spark.readStream.table("fact_stream")
    return fact_df.join(dim_df, "id", "left")

Would this mean that dim_df remains static until the entire pipeline is refreshed, rather than updating dynamically as dim_table changes?

Is there a better way to handle this if we want dim_table to update periodically in a continuous pipeline?

Re: DLT refresh time for combination of streaming and non streaming tables?

brycejune — Sat, 22 Mar 2025 11:03:15 GMT

Hi,

The current approach reloads dim_df in every batch, which can be inefficient. To optimize, consider broadcasting dim_df if it's small, or using a mapGroupsWithState function for stateful joins. Also, ensure that fact_df has sufficient watermarking to handle late data efficiently. Let me know if you need further optimization suggestions!

Regards,
Bryce June
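
The broadcast suggestion above could look roughly like this. It is a sketch, assuming dim_table is small enough to fit in executor memory. `join_with_broadcast_dim` is a hypothetical helper name, while `broadcast` is the standard hint from `pyspark.sql.functions`.

```python
def join_with_broadcast_dim(fact_df, dim_df, key="id"):
    """Left-join a streaming fact DataFrame to a small static dimension."""
    # Imported lazily so this module still loads where PySpark is absent.
    from pyspark.sql.functions import broadcast

    # broadcast() is only a hint: it asks Spark to ship dim_df to every
    # executor so the fact side does not need to be shuffled for the join.
    return fact_df.join(broadcast(dim_df), key, "left")
```

Inside joined_table, the return line would then become `return join_with_broadcast_dim(fact_df, dim_df)`. The hint changes how the join is executed, not how often dim_table is read.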