
Slow performance in REFRESH MATERIALIZED VIEW over CTAS

MarcoRezende
New Contributor III

Hello guys, I have some materialized views in my Databricks workspace, and after one change to one of them it became 3x slower (from 9 minutes to 30 minutes). After some debugging I found that the bottleneck in the execution plan is a step called "Columnar To Row" (108h in aggregate task time). Strangely, if I run the exact same query as a CREATE TABLE in a serverless notebook, the execution plan does not contain this "Columnar To Row" step and it finishes in 9 minutes. I know that a MATERIALIZED VIEW refresh runs in a serverless DLT pipeline; does it apply different optimizations than a serverless notebook? It is really strange behavior.
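
For reference, here is roughly what I am comparing (catalog, schema, and table names are placeholders, and the toy query stands in for my real view definition):

    -- Refresh path: runs in a serverless DLT pipeline (~30 min),
    -- query profile shows a heavy "Columnar To Row" step
    REFRESH MATERIALIZED VIEW my_catalog.my_schema.my_mv;

    -- Notebook path: the same SELECT as a CTAS on serverless compute (~9 min),
    -- no "Columnar To Row" in the plan
    CREATE OR REPLACE TABLE my_catalog.my_schema.my_mv_test AS
    SELECT customer_id, SUM(amount) AS total_amount
    FROM my_catalog.my_schema.orders
    GROUP BY customer_id;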

1 REPLY

mark_ott
Databricks Employee

Materialized views in Databricks, especially those maintained via DLT (Delta Live Tables) pipelines, often have different execution patterns and optimization strategies compared to running the same SQL in a standard serverless notebook. This can lead to unexpected differences in query profiles, such as the pronounced "Columnar To Row" step you observed in your materialized view but not in the direct CREATE TABLE query in the notebook.

Key Differences in Execution

DLT Materialized Views

  • DLT pipelines apply additional checks, lineage tracking, and reliability guarantees, which can add certain steps to the physical execution plan.

  • "Columnar To Row" conversion is common in DLT executions when an operation requires row-wise processing for tasks like streaming ingestion, data consistency checks, or certain aggregations. This conversion can become a bottleneck, especially if the rest of the pipeline is highly columnar optimized but the output needs row-based serialization.

  • DLT pipelines may also add extra shuffling, checkpointing, and monitoring logic for manageability and observability.
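
To confirm where the conversion happens, running EXPLAIN FORMATTED on the view's defining query is a quick check. A minimal sketch with placeholder table names (the step shows up as a ColumnarToRow node in the physical plan):

    -- Print the physical plan for the defining query and look for
    -- ColumnarToRow nodes between the scan and downstream operators
    EXPLAIN FORMATTED
    SELECT customer_id, SUM(amount) AS total_amount
    FROM my_catalog.my_schema.orders
    GROUP BY customer_id;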

Serverless Notebooks

  • Serverless notebooks often leverage Spark's latest query engine and optimizations directly without additional pipeline overhead.

  • Notebooks typically execute queries in a more ad hoc way, allowing Spark to optimize the entire plan for speed, possibly avoiding row-based conversions if not strictly required by the output format.

  • The physical plan for SQL queries in notebooks may be simpler and more streamlined for quick result turnaround.

Why "Columnar To Row" May Slow Down DLT

  • If the final stage of a DLT pipeline requires row output (for writing results, audits, or tracking changes), Spark will convert data from its internal columnar format (fast for analytics) to row format (required for serialization or external consumption), which is notably slower for large datasets.

  • This step often appears if your materialized view involves UDFs, complex aggregations, or output to certain external targets; a before/after sketch follows this list.
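
To make the UDF point concrete, here is a minimal before/after sketch (function, table, and column names are hypothetical): a Python UDF forces row-by-row evaluation, while an equivalent built-in expression can stay in the columnar engine.

    -- Before: a Python UDF like this forces row-wise evaluation,
    -- introducing a Columnar To Row transition ahead of it
    CREATE OR REPLACE FUNCTION my_catalog.my_schema.clean_name(s STRING)
    RETURNS STRING
    LANGUAGE PYTHON
    AS $$
    return s.strip().upper() if s is not None else None
    $$;

    -- After: the equivalent built-in expression keeps processing columnar
    SELECT upper(trim(customer_name)) AS customer_name_clean
    FROM my_catalog.my_schema.customers;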

Troubleshooting and Recommendations

  • Review your DLT pipeline settings and query logic to see if you can avoid row-based operations in the final steps. Try to keep as much processing in columnar format.

  • Check if recent changes added operations (like UDFs, row-wise transformations) that force this conversion.

  • Consider using OPTIMIZE or appropriate partitioning to minimize the work required in the conversion step (sketch after this list).

  • Compare execution plans side by side by exporting them from both DLT and the notebook, focusing on the steps before and after "Columnar To Row" for clues.
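
For the OPTIMIZE suggestion above, a minimal sketch (table and column names are placeholders; pick a column your refresh actually filters or joins on):

    -- Compact small files and co-locate rows by a frequently filtered column,
    -- so each refresh scans (and therefore converts) less data
    OPTIMIZE my_catalog.my_schema.orders
    ZORDER BY (customer_id);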

Conclusion

The difference in execution time is likely due to the extra work and guarantees enforced by the DLT pipeline and the required conversion to rows for certain output types. Serverless notebooks may avoid this step by producing purely columnar results, leading to the speedup. Keeping your DLT pipeline in columnar-friendly operations and avoiding forced row conversions where possible should help.

If you want specific optimization guidance, sharing the detailed execution plans for the materialized view and the notebook query would be helpful. This will allow a more targeted recommendation for tuning either the pipeline or the query.
