Materialized views in Databricks, especially those maintained by DLT (Delta Live Tables) pipelines, often follow different execution and optimization paths than the same SQL run in a standard serverless notebook. This can lead to unexpected differences in the query profile, such as the pronounced "Columnar To Row" step you observed in the materialized view but not in the direct CREATE TABLE query in the notebook.
Key Differences in Execution
DLT Materialized Views
- DLT pipelines apply additional checks, lineage tracking, and reliability guarantees, which can add steps to the physical execution plan.
- A "Columnar To Row" conversion is common in DLT executions when an operation requires row-wise processing, for example streaming ingestion, data-quality checks, or certain aggregations. This conversion can become a bottleneck, especially when the rest of the pipeline is heavily columnar-optimized but the output needs row-based serialization.
- DLT pipelines may also add extra shuffling, checkpointing, and monitoring logic for manageability and observability.
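For context, here is a minimal sketch of what a DLT-managed materialized view with a row-level expectation might look like. The table and column names (orders_raw, customer_id, amount) are illustrative, not taken from your pipeline; expectations like this are one example of the extra per-row checks DLT can attach to the plan.

```python
# Minimal sketch of a DLT-managed table with a data-quality expectation.
# All table/column names are illustrative placeholders.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Illustrative aggregate maintained by a DLT pipeline")
@dlt.expect_or_drop("valid_customer_id", "customer_id IS NOT NULL")  # row-level check added to the plan
def orders_summary():
    # `spark` is provided by the Databricks/DLT runtime
    return (
        spark.read.table("orders_raw")  # hypothetical source table
        .groupBy("customer_id")
        .agg(F.sum("amount").alias("total_amount"))
    )
```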
Serverless Notebooks
- Serverless notebooks leverage Spark's latest query engine and optimizations directly, without additional pipeline overhead.
- Notebooks typically execute queries in a more ad hoc way, allowing Spark to optimize the entire plan for speed and possibly avoid row-based conversions when the output format does not strictly require them.
- The physical plan for SQL queries in notebooks can therefore be simpler and more streamlined for quick result turnaround.
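By contrast, the same aggregation run ad hoc in a serverless notebook might look like the sketch below (reusing the illustrative table names from above), with no pipeline-level expectations or lineage tracking attached to the plan.

```python
# The same aggregation as a direct CREATE TABLE in a notebook; names are illustrative.
spark.sql("""
    CREATE OR REPLACE TABLE orders_summary_adhoc AS
    SELECT customer_id, SUM(amount) AS total_amount
    FROM orders_raw
    GROUP BY customer_id
""")
```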
Why "Columnar To Row" May Slow Down DLT
- If the final stage of a DLT pipeline requires row output (for writing results, audits, or tracking changes), Spark converts data from its internal columnar format (fast for analytics) to row format (required for serialization or external consumption), which is notably slower for large datasets.
- This step often appears when your materialized view involves UDFs, complex aggregations, or output to certain external targets.
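As a concrete illustration, a Python UDF is a typical operation that forces this transition, and you can see the resulting ColumnarToRow node directly in the physical plan. The UDF and column names below are hypothetical.

```python
# Sketch: a Python UDF forces Spark to convert columnar batches to rows
# so the rows can be shipped to the Python worker (BatchEvalPython).
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(StringType())
def normalize_region(value):  # hypothetical row-wise transformation
    return value.strip().upper() if value else None

df = (
    spark.read.table("orders_raw")  # illustrative table
    .withColumn("region_norm", normalize_region("region"))
)
df.explain(mode="formatted")  # look for a ColumnarToRow node feeding the Python UDF
```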
Troubleshooting and Recommendations
- Review your DLT pipeline settings and query logic to see whether you can avoid row-based operations in the final steps; keep as much processing as possible in columnar format.
- Check whether recent changes added operations (such as UDFs or other row-wise transformations) that force this conversion.
- Consider using OPTIMIZE or appropriate partitioning to minimize the work required in the conversion step.
- Compare execution plans side by side by exporting them from both DLT and the notebook, focusing on the steps before and after "Columnar To Row" for clues.
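One way to do that comparison is to capture the formatted plan of the notebook version of the query as text and search it for the ColumnarToRow node and its neighbors. The sketch below assumes you can run the view's defining SELECT interactively; the table names are again illustrative.

```python
# Capture the notebook-side physical plan as text for comparison with the
# plan shown in the DLT query profile. Table names are illustrative.
plan = spark.sql("""
    EXPLAIN FORMATTED
    SELECT customer_id, SUM(amount) AS total_amount
    FROM orders_raw
    GROUP BY customer_id
""").collect()[0][0]

print(plan)  # search the output for "ColumnarToRow" and inspect the operators around it
```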
Conclusion
The difference in execution time is most likely due to the extra work and guarantees enforced by the DLT pipeline, plus the conversion to rows required for certain outputs. Serverless notebooks can avoid this step by keeping results purely columnar, which explains the speedup. Keeping your DLT pipeline's processing columnar and avoiding operations that force row conversions should help.
If you want specific optimization guidance, sharing the detailed execution plans for the materialized view and the notebook query would be helpful. This will allow a more targeted recommendation for tuning either the pipeline or the query.