Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Delta Sharing with Materialized View - recipient data not refreshing when using Open Protocol

ittzzmalind
New Contributor III

Scenario: Delta Sharing with Materialized View

Provider Side Setup:

-> A Delta Share was created.

-> A materialized view was added to the share.

-> Two recipients were created:

1) Open Delta Sharing recipient, accessed using Python (import delta_sharing)

2) Databricks-to-Databricks recipient, accessed directly from another Databricks workspace
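For reference, the open recipient's read path looks roughly like the sketch below, using the delta-sharing Python client. The profile file name and the share/schema/view names are placeholders, not the actual names from this setup:

```python
# Open Delta Sharing access sketch. The profile file ("config.share") is
# downloaded from the provider's activation link; share, schema, and view
# names below are placeholders for this example.
profile = "config.share"

# A shared table or MV is addressed as <profile>#<share>.<schema>.<name>.
table_url = f"{profile}#my_share.my_schema.sales_mv"
print(table_url)

try:
    import delta_sharing
    # Reads the provider-side materialization of the shared MV into pandas.
    df = delta_sharing.load_as_pandas(table_url)
    print(df.head())
except Exception:
    # Not runnable without the client installed and a real provider;
    # any error here just means this environment isn't set up for sharing.
    pass
```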

Initial Behavior 

Both recipients executed a SELECT query on the shared materialized view.

Result: Both returned the same (correct) data.

After Data Update

Source data for the materialized view changed.

The materialized view was refreshed, and new records were added.
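The refresh in this step is the standard Databricks SQL statement (the view name is a placeholder):

```sql
-- Recompute the materialized view from its (updated) source data.
REFRESH MATERIALIZED VIEW my_schema.sales_mv;
```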

Observed Behavior at Recipient Side

Recipient Type              | Result After Refresh
Open Delta Sharing (Python) | Returned old data only
Databricks-to-Databricks    | Returned updated data (including new records)

Then I created another Open Delta Sharing (Python) recipient on the provider side. With this new recipient, the Python code returned the correct results, including the newly added records.

I also tried removing the materialized view from the share and adding it back, but that did not help. Why does it behave like this? Is this the correct behavior, and is there any way to solve it?

1 ACCEPTED SOLUTION

Accepted Solutions

Ashwin_DSA
Databricks Employee

Hi @ittzzmalind,

This is expected behaviour and is mainly due to how Delta Sharing handles materialized views for open (non-Databricks) recipients versus Databricks-to-Databricks recipients.

For Databricks-to-Databricks recipients, the shared materialized view is read almost directly from its backing table. After you run REFRESH MATERIALIZED VIEW, those recipients see the new data right away.

However, for open recipients using the Python delta_sharing client, Databricks uses provider-side materialization. This means the first query for that MV builds a hidden, cached table on the provider side, and subsequent queries for that same recipient reuse that cached result for a configurable time-to-live (TTL, default 8 hours); see the snapshot from the documentation below. During that TTL, the open recipient can still see stale data even after the MV has been refreshed.
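The staleness window can be pictured as a simple per-recipient time-based cache. This is a conceptual sketch, not the actual provider implementation; only the 8-hour default comes from the documentation:

```python
TTL_SECONDS = 8 * 60 * 60  # default materialization TTL from the docs

# cache: recipient -> (materialized_at, result)
cache = {}

def query_mv(recipient, compute_result, now):
    """Return the cached materialization for this recipient, recomputing
    only once the TTL has elapsed (conceptual model of provider-side
    materialization for open recipients)."""
    entry = cache.get(recipient)
    if entry is not None and now - entry[0] < TTL_SECONDS:
        return entry[1]            # still within TTL: cached (stale) data served
    result = compute_result()      # first query or TTL expired: re-materialize
    cache[recipient] = (now, result)
    return result

# Existing recipient queries at t=0, the source is refreshed, then it
# queries again one hour later -- still inside the TTL window:
old = query_mv("recipient_a", lambda: ["old rows"], 0)
after_refresh = query_mv("recipient_a", lambda: ["new rows"], 3600)
print(after_refresh)   # still ["old rows"]

# A brand-new recipient has no cache entry, so it materializes fresh data:
fresh = query_mv("recipient_b", lambda: ["new rows"], 3600)
print(fresh)           # ["new rows"]
```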

[Screenshot from the Delta Sharing documentation on the materialization TTL: delta share TTL.png]

That’s why your original open recipient continued to see the old data, while the new open recipient (with a new token) immediately saw the updated data: it triggered a fresh materialization.

To mitigate this, you can reduce the TTL for data materialization in the Delta Sharing settings at the metastore level, so that cached results expire sooner (at the cost of more frequent recomputation and higher provider cost). Check this link for the steps.

Where possible, use Databricks-to-Databricks sharing for consumers that need near-real-time MV freshness, or share a Delta table instead of an MV for open recipients and let them do the aggregation on their side.
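As a sketch of that last option: the provider shares the underlying Delta table, and the open recipient runs the aggregation itself. The rows below stand in for what delta_sharing.load_as_pandas would return; the column names and share path are placeholders:

```python
from collections import defaultdict

# In practice the rows would come from the shared Delta table, e.g.:
#   rows = delta_sharing.load_as_pandas(
#       "config.share#my_share.my_schema.sales").to_dict("records")
# Placeholder rows standing in for that shared table:
rows = [
    {"region": "EU", "amount": 10},
    {"region": "EU", "amount": 5},
    {"region": "US", "amount": 7},
]

# Recipient-side aggregation replacing the provider-side MV:
totals = defaultdict(int)
for row in rows:
    totals[row["region"]] += row["amount"]

print(dict(totals))  # {'EU': 15, 'US': 7}
```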

Hope this helps.

If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***
