Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

DELTA LIVE TABLES - MATERIALIZED VIEW NEVER UPDATES INCREMENTALLY!

ismaelhenzel
Contributor
I'm very disappointed with this framework. The documentation is inadequate, and it has many limitations. I want my materialized views to refresh incrementally, but DLT insists on performing a full recompute. Why is it doing this? Here is the log from a test execution:

```json
{
  "planning_information": {
    "technique_information": [
      {
        "incrementalization_issues": [
          {
            "issue_type": "INCREMENTAL_PLAN_REJECTED_BY_COST_MODEL",
            "prevent_incrementalization": true,
            "cost_model_rejection_subtype": "CHANGESET_SIZE_THRESHOLD_EXCEEDED"
          }
        ]
      },
      {
        "maintenance_type": "MAINTENANCE_TYPE_COMPLETE_RECOMPUTE",
        "is_chosen": true,
        "is_applicable": true,
        "cost": 2163.0
      },
      {
        "maintenance_type": "MAINTENANCE_TYPE_ROW_BASED",
        "is_chosen": false,
        "is_applicable": true,
        "cost": 616.0
      }
    ],
    "source_table_information": [
      {
        "table_name": "`mul_dev_tests`.`dlt_managed`.`teste`",
        "table_id": "810f74f2-fc09-45b6-93f5-2a544ac93002",
        "full_size": 2950.0,
        "change_size": 710.0,
        "is_size_after_pruning": true,
        "is_row_id_enabled": true,
        "is_cdf_enabled": true,
        "is_deletion_vector_enabled": true,
        "is_change_from_legacy_cdf": false
      }
    ],
    "target_table_information": {
      "table_name": "`mul_dev_tests`.`default`.`teste_novo_99`",
      "table_id": "8bc37e86-6cf7-4e92-a69a-85f5da7e1099",
      "full_size": 1320.0,
      "is_row_id_enabled": true,
      "is_cdf_enabled": true,
      "is_deletion_vector_enabled": true
    }
  }
}
```

It states that the cost of running the incremental update is too high, yet by the log's own numbers the incremental plan (cost 616) is roughly four times cheaper than a full recompute (cost 2163). Please note that I'm using a small dataset for this example, but with large tables the issue becomes significant. Furthermore, this error is not documented anywhere. Yes, I'm using a serverless setup, which is indeed fast, but it is also a complete black box.
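For what it's worth, the decision and its reason can be pulled out of that planning log with a few lines of plain Python. This is just a sketch that parses the JSON excerpt above (only the `technique_information` part is reproduced here); the field names come straight from the log itself:

```python
import json

# Excerpt of the planning log shown above (technique_information only)
log = json.loads("""
{
  "planning_information": {
    "technique_information": [
      {"incrementalization_issues": [
        {"issue_type": "INCREMENTAL_PLAN_REJECTED_BY_COST_MODEL",
         "prevent_incrementalization": true,
         "cost_model_rejection_subtype": "CHANGESET_SIZE_THRESHOLD_EXCEEDED"}]},
      {"maintenance_type": "MAINTENANCE_TYPE_COMPLETE_RECOMPUTE",
       "is_chosen": true, "is_applicable": true, "cost": 2163.0},
      {"maintenance_type": "MAINTENANCE_TYPE_ROW_BASED",
       "is_chosen": false, "is_applicable": true, "cost": 616.0}
    ]
  }
}
""")

techniques = log["planning_information"]["technique_information"]
# Which maintenance strategy did the planner actually pick?
chosen = next(t for t in techniques if t.get("is_chosen"))
# Collect any issues that blocked incrementalization
issues = [i for t in techniques
            for i in t.get("incrementalization_issues", [])]

print(chosen["maintenance_type"])   # MAINTENANCE_TYPE_COMPLETE_RECOMPUTE
print(issues[0]["cost_model_rejection_subtype"])
```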
1 ACCEPTED SOLUTION

Accepted Solutions

XP
Databricks Employee

I'll try to clear up some misunderstandings surrounding the incremental refresh feature of materialized views:

There isn't a feature to force materialized views to update incrementally. Instead, an optimizer called Enzyme selectively refreshes materialized views incrementally when it determines that an incremental update is a better strategy than a full recompute. Enzyme chooses an incremental strategy only when a number of conditions hold.

@ismaelhenzel, in your case Enzyme determined that a full load was more optimal, as indicated by:

"issue_type": "INCREMENTAL_PLAN_REJECTED_BY_COST_MODEL"
and the reason is indicated by:
"cost_model_rejection_subtype": "CHANGESET_SIZE_THRESHOLD_EXCEEDED"

This rejection subtype indicates that the change size is greater than the threshold allowed for incremental loading. The current default threshold is fairly conservative but will be relaxed over time. This threshold is potentially tunable, depending on your query pattern.
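To illustrate what a changeset-size threshold check looks like (this is only a sketch: the real Enzyme cost model and the actual threshold value are internal to Databricks and not public; the 10% ratio below is an assumption for demonstration), one can compare the `change_size` and `full_size` fields from the planning log:

```python
# Hypothetical illustration of a changeset-size threshold check.
# The real Enzyme formula and threshold are not documented; the 10%
# ratio here is an invented value purely for demonstration.
def exceeds_changeset_threshold(full_size: float, change_size: float,
                                threshold_ratio: float = 0.10) -> bool:
    """Return True if the changeset is too large for incremental refresh."""
    return change_size > threshold_ratio * full_size

# Values from the planning log above: full_size=2950.0, change_size=710.0,
# i.e. roughly 24% of the source table changed.
print(exceeds_changeset_threshold(2950.0, 710.0))  # True
```

Under an assumption like this, even a changeset whose incremental plan is nominally cheaper can still be rejected once it crosses the size ratio.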

@lucassvrielink You might be getting tripped up by taking the cost outputs from the log and doing a simple inequality. The calculation Enzyme uses to decide which strategy to execute is much more involved than that. It's important to remember that the goal here isn't to incrementally load a materialized view; it's to make loading materialized views faster. The optimizer assumes you wouldn't want to force an incremental load if it would be slower.

The goal of DLT more broadly is to simplify ETL by abstracting and automating away some of the inherent complexity. While this might feel like a black box to some, the goal isn't to obscure things. If you want more control over the process, it is possible to implement a similar pattern using other tools in the Databricks platform.


3 REPLIES

lucassvrielink
New Contributor III

I'm dealing with the same problem.

It doesn't make sense to ship a feature that is supposed to make our DLT jobs faster but is unusable in practically every context. Is there an explanation for this? In my view, even if the incremental process were only 1.1x faster, the optimizer should opt for it.


ismaelhenzel
Contributor
Hello, I appreciate your response.

I understand your point. I have tested Delta Live Tables (DLT) with incremental refresh of materialized views over queries of various sizes, and I'm not impressed with the results. I imagine the engine will continue to improve, but the lack of information about the procedures DLT uses and its internal mechanisms is not what I expected at this point. Databricks has always been a company that values transparency and open-source technologies. Don't get me wrong: I love the platform, and I'm just sharing my perspective.

I created a simple library that applies changes incrementally using Spark Structured Streaming, and it is working well.
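For anyone curious, the core of such an approach is just change-data-feed merge semantics: read the change rows produced since the last processed version and upsert or delete them into the target. The sketch below shows only that merge logic in plain Python; in a real pipeline this would run inside a Structured Streaming `foreachBatch` issuing a Delta `MERGE`, and the table state and row format here are invented for illustration (only `_change_type` is a real Delta CDF column name):

```python
# Illustration of applying a CDC changeset incrementally, in plain Python.
# In a real pipeline this logic would live in a Structured Streaming
# foreachBatch that issues a Delta MERGE; the row shape is made up here.
def apply_changes(target: dict, changes: list) -> dict:
    """Upsert/delete change rows (keyed by 'id') into the target state."""
    for row in changes:
        if row["_change_type"] == "delete":
            target.pop(row["id"], None)
        else:  # "insert" or "update_postimage"
            target[row["id"]] = row["value"]
    return target

state = {1: "a", 2: "b"}
changes = [
    {"id": 2, "value": "b2", "_change_type": "update_postimage"},
    {"id": 3, "value": "c",  "_change_type": "insert"},
    {"id": 1, "value": None, "_change_type": "delete"},
]
print(apply_changes(state, changes))  # {2: 'b2', 3: 'c'}
```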
