Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

DELTA LIVE TABLES - MATERIALIZED VIEW NEVER UPDATES INCREMENTALLY!

ismaelhenzel
Contributor
I'm very disappointed with this framework. The documentation is inadequate, and it has many limitations. I want my materialized views to refresh incrementally, but DLT insists on performing a full recompute. Why is it doing this? Here is the log from a test execution:

```json
{
  "planning_information": {
    "technique_information": [
      {
        "incrementalization_issues": [
          {
            "issue_type": "INCREMENTAL_PLAN_REJECTED_BY_COST_MODEL",
            "prevent_incrementalization": true,
            "cost_model_rejection_subtype": "CHANGESET_SIZE_THRESHOLD_EXCEEDED"
          }
        ]
      },
      {
        "maintenance_type": "MAINTENANCE_TYPE_COMPLETE_RECOMPUTE",
        "is_chosen": true,
        "is_applicable": true,
        "cost": 2163.0
      },
      {
        "maintenance_type": "MAINTENANCE_TYPE_ROW_BASED",
        "is_chosen": false,
        "is_applicable": true,
        "cost": 616.0
      }
    ],
    "source_table_information": [
      {
        "table_name": "`mul_dev_tests`.`dlt_managed`.`teste`",
        "table_id": "810f74f2-fc09-45b6-93f5-2a544ac93002",
        "full_size": 2950.0,
        "change_size": 710.0,
        "is_size_after_pruning": true,
        "is_row_id_enabled": true,
        "is_cdf_enabled": true,
        "is_deletion_vector_enabled": true,
        "is_change_from_legacy_cdf": false
      }
    ],
    "target_table_information": {
      "table_name": "`mul_dev_tests`.`default`.`teste_novo_99`",
      "table_id": "8bc37e86-6cf7-4e92-a69a-85f5da7e1099",
      "full_size": 1320.0,
      "is_row_id_enabled": true,
      "is_cdf_enabled": true,
      "is_deletion_vector_enabled": true
    }
  }
}
```

It states that the cost of running the incremental update is too high, yet by the log's own numbers the incremental plan (cost 616) is roughly four times cheaper than a full recompute (cost 2163). Please note that I'm using a small dataset for this example, but with large tables the issue becomes significant. Furthermore, this error is not documented anywhere. Yes, I'm using a serverless setup, which is indeed fast, but it is also a complete black box.
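For what it's worth, the decision and its reason can be pulled out of that planning log with a few lines of plain Python. This is just a sketch that parses the JSON excerpt above (only the `technique_information` part is reproduced here); the field names come straight from the log itself:

```python
import json

# Excerpt of the planning log shown above (technique_information only)
log = json.loads("""
{
  "planning_information": {
    "technique_information": [
      {"incrementalization_issues": [
        {"issue_type": "INCREMENTAL_PLAN_REJECTED_BY_COST_MODEL",
         "prevent_incrementalization": true,
         "cost_model_rejection_subtype": "CHANGESET_SIZE_THRESHOLD_EXCEEDED"}]},
      {"maintenance_type": "MAINTENANCE_TYPE_COMPLETE_RECOMPUTE",
       "is_chosen": true, "is_applicable": true, "cost": 2163.0},
      {"maintenance_type": "MAINTENANCE_TYPE_ROW_BASED",
       "is_chosen": false, "is_applicable": true, "cost": 616.0}
    ]
  }
}
""")

techniques = log["planning_information"]["technique_information"]
# Which maintenance strategy did the planner actually pick?
chosen = next(t for t in techniques if t.get("is_chosen"))
# Collect any issues that blocked incrementalization
issues = [i for t in techniques
            for i in t.get("incrementalization_issues", [])]

print(chosen["maintenance_type"])   # MAINTENANCE_TYPE_COMPLETE_RECOMPUTE
print(issues[0]["cost_model_rejection_subtype"])
```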
1 ACCEPTED SOLUTION

Accepted Solutions

XP
Databricks Employee

I'll try to clear up some misunderstandings surrounding the incremental refresh feature of materialized views:

There isn't a feature to force materialized views to update incrementally. Instead, an optimizer called Enzyme selectively refreshes materialized views incrementally when it determines that an incremental update is a better strategy than a full recompute. Enzyme chooses an incremental strategy only when a number of conditions hold.

@ismaelhenzel, in your case Enzyme determined that a full load was more optimal, as indicated by:

"issue_type": "INCREMENTAL_PLAN_REJECTED_BY_COST_MODEL"
and the reason is indicated by:
"cost_model_rejection_subtype": "CHANGESET_SIZE_THRESHOLD_EXCEEDED"

This rejection subtype indicates that the change size is greater than the threshold allowed for incremental loading. The current default threshold is fairly conservative but will be relaxed over time. This threshold is potentially tunable, depending on your query pattern.
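To illustrate what a changeset-size threshold check looks like (this is only a sketch: the real Enzyme cost model and the actual threshold value are internal to Databricks and not public; the 10% ratio below is an assumption for demonstration), one can compare the `change_size` and `full_size` fields from the planning log:

```python
# Hypothetical illustration of a changeset-size threshold check.
# The real Enzyme formula and threshold are not documented; the 10%
# ratio here is an invented value purely for demonstration.
def exceeds_changeset_threshold(full_size: float, change_size: float,
                                threshold_ratio: float = 0.10) -> bool:
    """Return True if the changeset is too large for incremental refresh."""
    return change_size > threshold_ratio * full_size

# Values from the planning log above: full_size=2950.0, change_size=710.0,
# i.e. roughly 24% of the source table changed.
print(exceeds_changeset_threshold(2950.0, 710.0))  # True
```

Under an assumption like this, even a changeset whose incremental plan is nominally cheaper can still be rejected once it crosses the size ratio.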

@lucassvrielink You might be getting tripped up by taking the cost outputs from the log and doing a simple inequality. The calculation Enzyme uses to decide which strategy to execute is much more involved than that. It's important to remember that the goal here isn't to incrementally load a materialized view; it's to make loading materialized views faster. The optimizer assumes you wouldn't want to force an incremental load if it would be slower.

The goal of DLT more broadly is to simplify ETL by abstracting and automating away some of the inherent complexity. While this might feel like a black box to some, the goal isn't to obscure things. If you want more control over the process, it is possible to implement a similar pattern using other tools in the Databricks platform.


3 REPLIES

lucassvrielink
New Contributor III

I'm dealing with the same problem.

It doesn't make sense to ship a feature that is supposed to make our DLT jobs faster but is unusable in practically every context. Is there an explanation for this? In my view, even if the incremental process were only 1.1x faster, the optimizer should opt for it.


ismaelhenzel
Contributor
Hello, I appreciate your response.

I understand your point. I have tested Delta Live Tables (DLT) with incremental refresh of materialized views over queries of various sizes, and I'm not impressed with the results. I imagine the engine will continue to improve, but the lack of information about the procedures DLT uses and its internal mechanisms is not what I expected at this point. Databricks has always been a company that values transparency and open-source technologies. Don't get me wrong: I love the platform, and I'm just sharing my perspective.

I created a simple library that applies changes incrementally using Spark Structured Streaming, and it is working well.
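For anyone curious, the core of such an approach is just change-data-feed merge semantics: read the change rows produced since the last processed version and upsert or delete them into the target. The sketch below shows only that merge logic in plain Python; in a real pipeline this would run inside a Structured Streaming `foreachBatch` issuing a Delta `MERGE`, and the table state and row format here are invented for illustration (only `_change_type` is a real Delta CDF column name):

```python
# Illustration of applying a CDC changeset incrementally, in plain Python.
# In a real pipeline this logic would live in a Structured Streaming
# foreachBatch that issues a Delta MERGE; the row shape is made up here.
def apply_changes(target: dict, changes: list) -> dict:
    """Upsert/delete change rows (keyed by 'id') into the target state."""
    for row in changes:
        if row["_change_type"] == "delete":
            target.pop(row["id"], None)
        else:  # "insert" or "update_postimage"
            target[row["id"]] = row["value"]
    return target

state = {1: "a", 2: "b"}
changes = [
    {"id": 2, "value": "b2", "_change_type": "update_postimage"},
    {"id": 3, "value": "c",  "_change_type": "insert"},
    {"id": 1, "value": None, "_change_type": "delete"},
]
print(apply_changes(state, changes))  # {2: 'b2', 3: 'c'}
```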
