Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

What is the "Columnar To Row" node in this simple Databricks SQL query profile?

dave_d
New Contributor II

I am running a relatively simple SQL query that writes back to a table on a Databricks serverless SQL warehouse, and I'm trying to understand why there is a "Columnar To Row" node in the query profile that is consuming the vast majority of the time spent executing the query. The rest of the query plan makes sense to me, but I can't figure out why it would need to convert to row format if the query is just writing back to another table - shouldn't it stay entirely in columnar format? Maybe I'm misunderstanding the purpose of this node...

[Attached screenshot: query profile showing the "Columnar To Row" node]

CREATE OR REPLACE TABLE id_mappings AS
select
  l.id_current as id_old,
  r.id_current as id_new
from
  tuples as l
  join tuples as r on (l.`HashedEmail` = r.`HashedEmail`)
where
  l.id_current > r.id_current
  and l.`HashedEmail` is not null
  and r.`HashedEmail` is not null
  and l.`HashedEmail` != ''
  and r.`HashedEmail` != ''
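One way to see where the transition is placed is to inspect the physical plan before running the write. A sketch using Spark SQL's `EXPLAIN FORMATTED` (available in Spark 3.0+) against the same `SELECT`; the `ColumnarToRow` operator shows up by name in the output:

```sql
-- Print the physical plan without executing the query.
-- Look for a "ColumnarToRow" node in the operator tree.
EXPLAIN FORMATTED
select
  l.id_current as id_old,
  r.id_current as id_new
from
  tuples as l
  join tuples as r on (l.`HashedEmail` = r.`HashedEmail`)
where
  l.id_current > r.id_current
  and l.`HashedEmail` is not null
  and r.`HashedEmail` is not null
  and l.`HashedEmail` != ''
  and r.`HashedEmail` != '';
```

Comparing the plan of the bare `SELECT` with the plan of the full `CREATE OR REPLACE TABLE ... AS` statement can show whether the transition is tied to the query logic itself or to the final write/results stage.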

 

2 REPLIES

dave_d
New Contributor II

Interesting, thanks for the response! Why does the Columnar To Row node happen after all of the query logic here, right before the query results stage? I would assume that if it's needed for a particular operation that only supports row-based processing, it would appear earlier in the plan.

Also, do you know if there is a list anywhere in the Databricks or Spark SQL docs that specifies which Spark SQL expressions/operations support columnar execution? I'd love to know whether I can rewrite this query to avoid that transition, given how expensive it appears to be here.

Annapurna_Hiriy
Contributor

@dave_d We do not have a document listing the operations that introduce a ColumnarToRow node. This node provides a common executor for translating an RDD of ColumnarBatch into an RDD of InternalRow, and it is inserted wherever the planner determines such a transition is needed. You can check the scenarios under which it might be invoked in the following reference:

https://books.japila.pl/spark-sql-internals/physical-operators/ColumnarToRowExec/

A columnar format is optimized for efficiently selecting specific columns, while a row format is best suited for efficiently selecting particular rows. When you frequently need to retrieve specific rows, such as by country or day, a row-based format with an index on the filtering column can be a good fit. You can also refer to the Spark documentation, which discusses the importance of defining an index: link to Spark documentation.
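To make the layout difference concrete, here is a minimal pure-Python sketch of the same records stored both ways. These are illustrative structures only, not Spark's actual ColumnarBatch or InternalRow classes; the `columnar_to_row` helper is hypothetical and mirrors only the conceptual job of the ColumnarToRow node:

```python
# Row format: one record per entry -- cheap to fetch a whole row at once.
rows = [
    {"id_current": 3, "HashedEmail": "a1"},
    {"id_current": 7, "HashedEmail": "b2"},
    {"id_current": 9, "HashedEmail": "a1"},
]

# Columnar format: one array per column -- cheap to scan a single column.
columns = {
    "id_current": [3, 7, 9],
    "HashedEmail": ["a1", "b2", "a1"],
}

def columnar_to_row(cols):
    """Translate a columnar batch into a list of rows, which is
    conceptually what the ColumnarToRow node does per batch."""
    names = list(cols)
    length = len(cols[names[0]])
    return [{name: cols[name][i] for name in names} for i in range(length)]

# The translation reproduces the row-format view of the same data.
assert columnar_to_row(columns) == rows
```

Scanning `columns["id_current"]` touches one contiguous array, whereas the row layout would visit every record; that asymmetry is why the transition between the two layouts has a real cost in the plan.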

 
