Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Partition cols for a temporary table in Lakeflow SDP

IM_01
Contributor III

Hi,

I was going through the documentation on quarantining records. Initially I thought that partitioning is not supported for temporary tables; however, I came across the following:

from pyspark import pipelines as dp
from pyspark.sql.functions import expr

@dp.table(
  temporary=True,
  partition_cols=["is_quarantined"],
)
@dp.expect_all(rules)
def trips_data_quarantine():
  return (
    spark.readStream.table("raw_trips_data")
      .withColumn("is_quarantined", expr(quarantine_rules))
  )
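For context, the quarantine_rules used in expr(...) comes from the same rules dict in that docs pattern; it's built roughly like this (rule names and column expressions here are just placeholders):

```python
# Placeholder expectation rules; in the docs pattern each value is a
# Boolean SQL expression that a good row must satisfy.
rules = {
    "valid_fare": "fare_amount > 0",
    "valid_distance": "trip_distance > 0",
}

# A row is quarantined when it fails any expectation, i.e. NOT(all rules pass).
quarantine_rules = "NOT({0})".format(" AND ".join(rules.values()))
print(quarantine_rules)  # NOT(fare_amount > 0 AND trip_distance > 0)
```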

I'm unable to understand how partitioning is being applied for a temporary table. Could anyone please explain this? 🙂

9 REPLIES

Ashwin_DSA
Databricks Employee

Hi @IM_01,

Good question. It's the terminology ("temporary") that is probably causing the confusion.

In Declarative Pipelines, @dp.table(temporary=True, ...) creates a real Delta table on disk, just like a normal table. The only difference is visibility: it is not published to Unity Catalog or the Hive metastore, and it is accessible only inside that pipeline. It also persists across pipeline runs for the lifetime of the pipeline, not just for a single run or session. Because it's still a proper Delta table under the hood, all the physical table features apply, including partition_cols. The engine creates an internal table in the pipeline's internal schema with the same partitioning as for a non-temporary table. This is also explained in a technical blog here.

In your example, trips_data_quarantine is a Delta table physically partitioned on is_quarantined (so reads from it can prune on that column). It is private to the pipeline (not queryable from outside, not visible in Unity Catalog) and is cleaned up when the pipeline is deleted.

Partitioning is fully honoured. "Temporary" here means pipeline-private, not an ephemeral in-memory temp view.
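To make that concrete, here is a minimal sketch of what the partitioning buys you downstream inside the same pipeline (the downstream table names are illustrative, not from your example):

```python
from pyspark import pipelines as dp

# Downstream tables that split the pipeline-private quarantine table.
# Because trips_data_quarantine is physically partitioned on
# is_quarantined, each filter below can prune to a single partition.

@dp.table
def valid_trips():
    return spark.read.table("trips_data_quarantine").where("is_quarantined = false")

@dp.table
def quarantined_trips():
    return spark.read.table("trips_data_quarantine").where("is_quarantined = true")
```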

Does this answer your question?

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***

IM_01
Contributor III

Hi,

Thanks for the response and for sharing the article.

While going through the article I had a doubt: a view is a logical construct, so it only produces results when we read it, and as far as I know only an append-only source can be used as the source for a streaming table. I'm confused about how we are able to use a view as a streaming source. Could you please explain this?

IM_01
Contributor III

@Ashwin_DSA could you please help with the above? 🙂

Sorry, I forgot to mention you in the post above and I'm unable to edit it 🙂

Ashwin_DSA
Databricks Employee

Hi @IM_01,

You're correct on both points: a view is purely logical, and a streaming table still needs an append-only source. What makes this work is that streaming is applied to the underlying Delta tables, not to the view object itself.

When you do STREAM(my_view) or spark.readStream.table("my_view"), Databricks expands the view definition (CREATE VIEW my_view AS SELECT ... FROM base_table ...) and builds a streaming plan directly over the base Delta table(s) that the view references. In other words, it's equivalent to copying the view's SELECT into your streaming query. The fact that it's called a view is just naming; the streaming engine still talks to the underlying Delta tables.

The constraint "the source must be append-only" is checked on the Delta table behind the view, not on the view itself. Structured Streaming only supports streaming from a Delta table if the upstream writes are append-only; UPDATE/DELETE/MERGE will break the stream unless you use options like skipChangeCommits.
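For reference, that Delta option looks like this when streaming from a table whose history contains non-append commits (the table name is illustrative):

```python
# Skip commits that updated or deleted existing rows. Note that those
# corrections are NOT propagated downstream, so only use this when
# losing them is acceptable.
events = (
    spark.readStream
        .option("skipChangeCommits", "true")
        .table("base_table")
)
```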

In DLT/SDP, streaming tables can only read from streaming/append-only sources (Auto Loader, streaming tables, Delta tables used in append-only mode). If the view's query wraps an append-only Delta table with allowed operators (projections, filters, certain joins, etc.), then the resulting streaming plan is still over an append-only source.

So you're not really streaming from a view in the physical sense. You're streaming from the append-only Delta table(s) behind that view, with the view's SQL simply applied on top of the streaming plan.
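To make the inlining concrete, these two streaming reads end up logically equivalent (view, table, and column names are illustrative):

```python
# Given: CREATE VIEW my_view AS
#        SELECT trip_id, fare FROM base_table WHERE fare > 0

# Streaming "from the view"...
s1 = spark.readStream.table("my_view")

# ...is planned as streaming from the base table with the view's
# SELECT applied on top of the streaming plan:
s2 = (
    spark.readStream.table("base_table")
        .where("fare > 0")
        .select("trip_id", "fare")
)
```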

Hope that answers your question.

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***

Hi @Ashwin_DSA,

Thanks for the response 🙂
But in the case of a materialized view as the source for a streaming table, it should not work, right? Yet I see the pipeline running without error. Could you please help with this, given that the refresh can be full or incremental as per row_id tracking?
Sorry for so many questions 🙂

Ashwin_DSA
Databricks Employee

Hi @IM_01,

You're right to be suspicious here: a materialized view is not a safe or supported source for a streaming table, even if your pipeline currently runs without errors.

Streaming tables are built on Structured Streaming and assume the source is append-only: new versions of the source table may only add rows, and existing rows are never changed. They only process new data since the last checkpoint and do not recompute history.

Materialized views are maintained by Enzyme with batch semantics. On each refresh (incremental or full) they are free to update or delete existing rows in their backing Delta table to keep the result correct. A full refresh can even truncate and recompute everything.

So if you do:

CREATE OR REFRESH STREAMING TABLE st AS
SELECT * FROM STREAM(my_mv);
you get a mismatch.

From the streaming engine's perspective, my_mv looks like a Delta table whose history can be rewritten by MV refresh. As soon as the MV performs non-append changes (corrections, late data, a full refresh), you either hit a "non-append source" style failure, or silently lose the corrections if you've enabled options that skip non-append commits.

The fact that your pipeline runs today usually just means the MV hasn't yet done anything but pure appends. It's not a pattern you should rely on.
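One way to see this coming is to check the source's Delta history for commits that are not pure appends. A hypothetical helper (the operation names follow Delta's DESCRIBE HISTORY output, but the exact allow-list is an assumption you should adjust for your workloads):

```python
# Assumed allow-list of Delta history operations that keep a table
# append-only; anything outside it rewrites existing rows.
APPEND_ONLY_OPS = {"WRITE", "STREAMING UPDATE", "CREATE TABLE AS SELECT"}

def non_append_commits(operations):
    """Return the history operations that break the append-only contract."""
    return [op for op in operations if op not in APPEND_ONLY_OPS]

# On Databricks you would feed this from, e.g.:
#   ops = [r["operation"] for r in spark.sql("DESCRIBE HISTORY my_mv").collect()]
print(non_append_commits(["WRITE", "MERGE", "DELETE", "STREAMING UPDATE"]))
# ['MERGE', 'DELETE']
```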

Hope this answers your question.

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.

 
Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***

IM_01
Contributor III

@Ashwin_DSA So if we use a view as a streaming source, it can only reference streaming tables that are part of the SDP pipeline, or Delta tables created outside the pipeline, but not materialized views?

And even a temporary view can be used as a streaming source, right?

Ashwin_DSA
Databricks Employee

Hi @IM_01,

You can stream from views (including SDP temporary views) as long as those views ultimately sit on append-only streaming/Delta tables. Using a materialized view as the base is not supported: MV refresh can rewrite history (updates/deletes), breaking the append-only contract that streaming tables rely on, even if your pipeline currently runs without errors.
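As a sketch of the supported pattern (the decorator and table names are assumed from the Lakeflow Python API; adjust to your pipeline):

```python
from pyspark import pipelines as dp

# A temporary view over an append-only Delta table. The view is purely
# logical, so the stream below is planned directly over raw_events.
@dp.temporary_view
def cleaned_events():
    return spark.readStream.table("raw_events").where("event_id IS NOT NULL")

# A streaming table fed by the view: valid because the view's base is
# append-only. A materialized view in its place would not be supported.
@dp.table
def events_silver():
    return spark.readStream.table("cleaned_events")
```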

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***

IM_01
Contributor III

Thanks Ashwin