Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Partition cols for a temporary table in Lakeflow SDP

IM_01
Contributor

Hi,

I was going through the documentation on quarantining records. Initially I thought that partitioning was not supported for temporary tables; however, I came across the following:

@dp.table(
  temporary=True,
  partition_cols=["is_quarantined"],
)
@dp.expect_all(rules)
def trips_data_quarantine():
  return (
    spark.readStream.table("raw_trips_data").withColumn("is_quarantined", expr(quarantine_rules))
  )

I'm unable to understand how partitioning is being applied to a temporary table. Could anyone please explain this? 🙂

1 ACCEPTED SOLUTION

Accepted Solutions

Ashwin_DSA
Databricks Employee
Databricks Employee

Hi @IM_01,

Good question. It's the terminology (temporary) that is probably causing the confusion. 

In Declarative Pipelines, @dp.table(temporary=True, ...) creates a real Delta table on disk, just like a normal table. The only difference is visibility: it is not published to Unity Catalog or the Hive metastore, and it is accessible only inside that pipeline. It also persists across pipeline runs for the lifetime of the pipeline, not just for a single run/session. Because it’s still a proper Delta table under the hood, all the physical table features apply, including partition_cols. The engine creates an internal table in the pipeline’s internal schema with the same partitioning as for a non‑temporary table. This is also explained in a technical blog here.

In your example, trips_data_quarantine is a physically partitioned Delta table on is_quarantined (so reads from it can prune on that column). This is private to the pipeline (not queryable from outside, not visible in UC) and is cleaned up when the pipeline is deleted.

Partitioning is fully honoured. Temporary here means pipeline‑private, not an ephemeral in-memory temp view.
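To make the pruning benefit concrete, here is a minimal sketch of a downstream split on the partition column. It assumes the Lakeflow SDP Python import (from pyspark import pipelines as dp) and reuses the table/column names from your example; the downstream table names are hypothetical.

```python
from pyspark import pipelines as dp  # Lakeflow SDP import (assumption)

@dp.table()
def trips_data_valid():
    # Filtering on is_quarantined = false can prune whole partitions,
    # because the temporary table is physically partitioned on that column.
    return (
        spark.readStream.table("trips_data_quarantine")
             .filter("is_quarantined = false")
             .drop("is_quarantined")
    )

@dp.table()
def trips_data_invalid():
    # Quarantined rows land in their own partition and can be inspected separately.
    return (
        spark.readStream.table("trips_data_quarantine")
             .filter("is_quarantined = true")
    )
```

This only runs inside a pipeline, but it shows why partitioning a temporary quarantine table is still useful: both downstream reads touch only the partition they need.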

Does this answer your question?

If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***


9 REPLIES


IM_01
Contributor

Hi,

Thanks for the response and for sharing the article.

While going through the article, I had a doubt: a view is a logical construct, so it gives results only when we read it, and as far as I know only an append-only source can be used as the source for a streaming table. I'm confused about how we are able to use a view as a streaming source. Could you please explain this?

@Ashwin_DSA could you please help with the above 🙂

Sorry, I forgot to mention you above and I'm unable to edit it 🙂

Ashwin_DSA
Databricks Employee
Databricks Employee

Hi @IM_01,

You’re correct on both points. A view is purely logical, and a streaming table still needs an append-only source. What makes this work is that streaming is applied to the underlying Delta tables, not to the view object itself.

When you do STREAM(my_view) or spark.readStream.table("my_view"), Databricks expands the view definition (CREATE VIEW my_view AS SELECT ... FROM base_table ...) and builds a streaming plan directly over the base Delta table(s) that the view references. In other words, it’s equivalent to copying the view’s SELECT into your streaming query. The fact that it’s called a view is just naming. The streaming engine still talks to the underlying Delta tables.

The constraint "source must be append‑only" is checked on the Delta table behind the view, not on the view itself. Structured Streaming only supports streaming from Delta tables whose upstream writes are append‑only; UPDATE/DELETE/MERGE will break the stream unless you use options like skipChangeCommits.

In DLT/SDP, streaming tables can only read from streaming/append‑only sources (Auto Loader, streaming tables, Delta tables used in append‑only mode). If the view’s query wraps an append‑only Delta table with allowed operators (projection, filters, certain joins, etc.), then the resulting streaming plan is still over an append‑only source.

So, you're not really streaming from a view in the physical sense. You’re streaming from the append‑only Delta table(s) behind that view, with the view’s SQL simply applied on top of the streaming plan.
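As a sketch of that expansion (the names are illustrative, and I'm assuming the Lakeflow SDP Python API), streaming "from" the view below is planned as streaming from raw_trips_data with the filter applied on top:

```python
from pyspark import pipelines as dp  # Lakeflow SDP import (assumption)
from pyspark.sql.functions import col

@dp.view()
def positive_fare_trips():
    # Purely logical: just a SELECT over an append-only source.
    return spark.readStream.table("raw_trips_data").filter(col("fare_amount") > 0)

@dp.table()
def cleaned_trips():
    # Planned as a stream over raw_trips_data (the view is inlined);
    # the append-only check applies to raw_trips_data, not to the view.
    return spark.readStream.table("positive_fare_trips")
```

The stream's checkpoint and the append-only validation both attach to raw_trips_data; the view contributes only its SELECT logic.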

Hope that answers your question.

If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***

IM_01
Contributor

Hi @Ashwin_DSA,

Thanks for the response 🙂
But in the case of a materialized view as the source for a streaming table, it should not work, right? Yet I see the pipeline running without error. Could you please help with this? Also, the refresh can be full or incremental based on row_id tracking, right?
Sorry for so many questions 🙂

Ashwin_DSA
Databricks Employee
Databricks Employee

Hi @IM_01,

You’re right to be suspicious here: a materialized view is not a safe or supported source for a streaming table, even if your pipeline currently runs without errors.

Streaming tables are built on Structured Streaming and assume the source is append‑only. New versions of the source table may only add rows. Existing rows are never changed. They only process new data since the last checkpoint and do not recompute history.

Materialized views are maintained by Enzyme with batch semantics. On each refresh (incremental or full) they are free to update or delete existing rows in their backing Delta table to keep the result correct. A full refresh can even truncate and recompute everything.

So if you do:

CREATE OR REFRESH STREAMING TABLE st AS
SELECT * FROM STREAM(my_mv);

you get a mismatch: from the streaming engine’s perspective, my_mv looks like a Delta table whose history can be rewritten by MV refresh. As soon as the MV performs non‑append changes (corrections, late data, full refresh), you either hit a "non‑append source" style failure, or silently lose corrections if you’ve enabled options that skip non‑append commits.

The fact that your pipeline runs today usually just means the MV hasn’t yet done anything but pure appends. It’s not a pattern you should rely on.
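For completeness, the escape hatch mentioned above looks like the sketch below. skipChangeCommits is a Delta streaming read option, but be aware it silently drops rewritten data rather than reprocessing it; the table name here is hypothetical, and this is for ordinary Delta tables that receive occasional updates, not a blessing to stream from MVs.

```python
# Sketch: streaming from a Delta table that is NOT strictly append-only.
# skipChangeCommits makes the stream ignore commits that rewrite existing
# files (UPDATE/DELETE/MERGE) instead of failing. Corrections are lost.
df = (
    spark.readStream
         .option("skipChangeCommits", "true")
         .table("some_delta_table")  # hypothetical table name
)
```

If you need the corrections to flow downstream instead of being dropped, the usual fix is to restructure the pipeline so the streaming table reads from a genuinely append-only source.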

Hope this answers your question.

If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.

 
Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***

IM_01
Contributor

@Ashwin_DSA So if we use a view as a streaming source, it can only reference streaming tables that are part of the SDP pipeline, or Delta tables created outside the pipeline, but not materialized views?

And even a temporary view can be used as a streaming source, right?

Ashwin_DSA
Databricks Employee
Databricks Employee

Hi @IM_01,

You can stream from views (including SDP temporary views) as long as those views ultimately sit on append‑only streaming/Delta tables. Using a materialized view as the base is not supported: MV refresh can rewrite history (updates/deletes), breaking the append‑only contract that streaming tables rely on, even if your pipeline currently runs without errors.

If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***

IM_01
Contributor

Thanks Ashwin