a week ago
Hello everyone,
I would like to clarify some doubts regarding how Databricks Pipelines (DLT) behave when using serverless pipelines with incremental updates.
In general, incremental processing is enabled and works as expected. However, I have observed some behaviors that I do not fully understand:
In several pipelines, the system selects the incremental execution plan even when the estimated cost of the incremental run appears to be higher than a complete recompute. In those cases, the incremental run indeed ends up taking longer and costing more, which we have been able to verify after execution.
In other pipelines, I have noticed that there seems to be a cost limit or threshold (which I assume is related to the estimated cost of a complete recompute). When the incremental plan exceeds that limit, Databricks chooses to prioritize a complete recompute instead.
This suggests that such a mechanism exists, but it is not consistently triggered across pipelines.
I would like to know more about:
I have tried to find official documentation explaining these internals, but I have found very limited details. Any pointers to relevant documentation, blog posts, or technical explanations would be greatly appreciated.
Finally, a somewhat related question:
In the documentation I did find, it is mentioned that non-deterministic operations (e.g. those based on the current date/time) should invalidate incremental updates. However, in recent tests I have performed filters based on the current date (e.g. distance to today) and incremental updates were still applied (although some execution plans were discarded).
Under what conditions such operations still allow incremental execution?
{
"comment": "This execution chose complete_recompute using a threshold"
"technique_information": [
{
"maintenance_type": "MAINTENANCE_TYPE_NO_OP",
"incrementalization_issues": [
{
"issue_type": "DATA_HAS_CHANGED",
"prevent_incrementalization": true
}
]
},
{
"incrementalization_issues": [
{
"issue_type": "INCREMENTAL_PLAN_REJECTED_BY_COST_MODEL",
"prevent_incrementalization": true,
"cost_model_rejection_subtype": "CHANGESET_SIZE_THRESHOLD_EXCEEDED"
}
]
},
{
"maintenance_type": "MAINTENANCE_TYPE_COMPLETE_RECOMPUTE",
"is_chosen": true,
"is_applicable": true,
"cost": 24966691933335
},
{
"maintenance_type": "MAINTENANCE_TYPE_ROW_BASED",
"is_chosen": false,
"is_applicable": true,
"cost": 51838151965590
}
],
"source_table_information": [
{
"table_name": "`inc_customer_event_with_customer_info`",
"catalog_table_type": "MATERIALIZED_VIEW",
"full_size": 77759940,
"num_rows": 2057197,
"num_files": 3,
"change_size": 77759940,
"num_changed_rows": 1166,
"num_rows_in_changed_files": 1166,
"num_changed_files": 3,
"change_file_read_size": 77759940,
"is_size_after_pruning": true,
"is_row_id_enabled": true,
"is_cdf_enabled": true,
"is_deletion_vector_enabled": true,
"is_change_from_legacy_cdf": true
},
{
"table_name": "`platform_event`",
"catalog_table_type": "STREAMING_TABLE",
"partition_columns": [
"TenantId"
],
"full_size": 108057,
"num_rows": 1219,
"num_files": 14,
"change_size": 0,
"num_changed_rows": 0,
"num_rows_in_changed_files": 0,
"num_changed_files": 0,
"change_file_read_size": 0,
"is_size_after_pruning": true,
"is_row_id_enabled": true,
"is_cdf_enabled": true,
"is_deletion_vector_enabled": true,
"is_change_from_legacy_cdf": false
},
{
"table_name": "`platform_event_group`",
"catalog_table_type": "STREAMING_TABLE",
"partition_columns": [
"TenantId"
],
"full_size": 84011,
"num_rows": 250,
"num_files": 13,
"change_size": 0,
"num_changed_rows": 0,
"num_rows_in_changed_files": 0,
"num_changed_files": 0,
"change_file_read_size": 0,
"is_size_after_pruning": true,
"is_row_id_enabled": true,
"is_cdf_enabled": true,
"is_deletion_vector_enabled": true,
"is_change_from_legacy_cdf": false
}
],
"target_table_information": {
"table_name": "`inc_event_dashboard_model`",
"full_size": 4860340,
"is_row_id_enabled": true,
"is_cdf_enabled": true,
"is_deletion_vector_enabled": true
},
"planning_wall_time_ms": 4864,
"fingerprint_info": {
"primary_fingerprint_version": 1
}
}{
"comment": "This execution chose incremental_refresh even with more cost than complete_recompute",
"technique_information": [
{
"maintenance_type": "MAINTENANCE_TYPE_NO_OP",
"incrementalization_issues": [
{
"issue_type": "DATA_HAS_CHANGED",
"prevent_incrementalization": true
}
]
},
{
"maintenance_type": "MAINTENANCE_TYPE_COMPLETE_RECOMPUTE",
"is_chosen": false,
"is_applicable": true,
"cost": 925057881923
},
{
"maintenance_type": "MAINTENANCE_TYPE_ROW_BASED",
"is_chosen": true,
"is_applicable": true,
"cost": 11994639573063
}
],
"source_table_information": [
{
"table_name": "`order`",
"catalog_table_type": "STREAMING_TABLE",
"partition_columns": [
"TenantId"
],
"full_size": 14840711065,
"num_rows": 42379500,
"num_files": 228,
"change_size": 0,
"num_changed_rows": 0,
"num_rows_in_changed_files": 0,
"num_changed_files": 0,
"change_file_read_size": 0,
"is_size_after_pruning": true,
"is_row_id_enabled": true,
"is_cdf_enabled": true,
"is_deletion_vector_enabled": true,
"is_change_from_legacy_cdf": false
},
{
"table_name": "`dim_unaccounted_products`",
"catalog_table_type": "MANAGED",
"full_size": 3681,
"num_rows": 19,
"num_files": 2,
"change_size": 4379,
"num_changed_rows": 2,
"num_rows_in_changed_files": 2,
"num_changed_files": 2,
"change_file_read_size": 4379,
"is_size_after_pruning": true,
"is_row_id_enabled": true,
"is_cdf_enabled": true,
"is_deletion_vector_enabled": true,
"is_change_from_legacy_cdf": false
}
],
"target_table_information": {
"table_name": "`inc_operation_analytics`",
"full_size": 14554925095,
"is_row_id_enabled": true,
"is_cdf_enabled": true,
"is_deletion_vector_enabled": true
},
"planning_wall_time_ms": 12122,
"fingerprint_info": {
"primary_fingerprint_version": 1
}
}
Wednesday
CHANGESET_SIZE_THRESHOLD_EXCEEDED in your logs, try running updates more frequently to reduce the volume of changes per update.WHERE clauses. Support for operators is constantly updated in the doc: Azure l AWS l GCP. Wednesday
CHANGESET_SIZE_THRESHOLD_EXCEEDED in your logs, try running updates more frequently to reduce the volume of changes per update.WHERE clauses. Support for operators is constantly updated in the doc: Azure l AWS l GCP. 7 hours ago
Hi @Alf01 , thanks for accepting the solution!
To keep you updated, the REFRESH POLICY feature, that I mentioned in my post, is out now! It allows manual control of the refresh strategy (AUTO, INCREMENTAL, INCREMENTAL STRICT, FULL), just as you stated in your question.
Check out the docs: Azure | AWS | GCP
Best regards,