
What is the "Columnar To Row" node in this simple Databricks SQL query profile?

dave_d
New Contributor II

I am running a relatively simple SQL query that writes back to a table on a Databricks serverless SQL warehouse, and I'm trying to understand why there is a "Columnar To Row" node in the query profile that is consuming the vast majority of the time spent executing the query. The rest of the query plan makes sense to me, but I can't figure out why it would need to convert to row format if the query is just writing back to another table - shouldn't it stay entirely in columnar format? Maybe I'm misunderstanding the purpose of this node...

[Attached image: screenshot of the query profile]

CREATE OR REPLACE TABLE id_mappings AS
SELECT
  l.id_current AS id_old,
  r.id_current AS id_new
FROM
  tuples AS l
  JOIN tuples AS r ON (l.`HashedEmail` = r.`HashedEmail`)
WHERE
  l.id_current > r.id_current
  AND l.`HashedEmail` IS NOT NULL
  AND r.`HashedEmail` IS NOT NULL
  AND l.`HashedEmail` != ''
  AND r.`HashedEmail` != ''

 

3 REPLIES

Kaniz
Community Manager

Hi @dave_d, the "Columnar To Row" node in your query execution plan is part of the Apache Spark™ SQL execution engine.

Spark SQL uses a columnar format for in-memory computation because it improves data locality and enables vectorized processing, which can significantly speed up queries. However, not every operator supports columnar processing. Some operations, such as certain joins and aggregations, require data in row format. When such an operator appears in the execution plan, Spark SQL must convert the data from columnar to row format, and that conversion is what the "Columnar To Row" node in your plan represents.
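
If you want to see exactly where the transition is inserted, a minimal sketch (using the table and column names from your query, and standard Spark SQL syntax) is to run EXPLAIN FORMATTED, which prints the physical plan with any ColumnarToRow operator in place:

-- Prints the physical plan; look for a ColumnarToRow node between
-- the scans/join and the result collection.
EXPLAIN FORMATTED
SELECT
  l.id_current AS id_old,
  r.id_current AS id_new
FROM
  tuples AS l
  JOIN tuples AS r ON (l.`HashedEmail` = r.`HashedEmail`)
WHERE
  l.id_current > r.id_current
  AND l.`HashedEmail` IS NOT NULL AND l.`HashedEmail` != ''
  AND r.`HashedEmail` IS NOT NULL AND r.`HashedEmail` != '';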

 

In your case, the join in your query might be what triggers the conversion from columnar to row format. Even though you're writing the result back to another table, the join may require the data to be in row format for processing.

dave_d
New Contributor II

Interesting, thanks for the response! Why does the Columnar To Row node appear here after all of the query logic, right before the query results stage? I would assume that if it's needed for a particular operation that only supports row-based processing, it would happen earlier in the plan.

Also, do you know if there is a list somewhere in the Databricks or Spark SQL docs that specifies which types of Spark SQL expressions/operations support columnar execution? I would love to know whether I can rewrite this query to avoid that transition, given how expensive it appears to be here.

Annapurna_Hiriy
New Contributor III

@dave_d We do not have a document listing the operations that would bring up a ColumnarToRow node. This node provides a common executor to translate an RDD of ColumnarBatch into an RDD of InternalRow, and it is inserted wherever such a transition is determined to be needed. You can check the scenarios under which it might get invoked. Here are a few references:

https://books.japila.pl/spark-sql-internals/physical-operators/ColumnarToRowExec/
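
As a rough experiment (a hedged sketch, not a recommendation: spark.sql.parquet.enableVectorizedReader is a standard Spark configuration, but whether you can set it on a serverless SQL warehouse, and how Photon changes the picture, may vary), you can disable the vectorized Parquet reader so the scan produces rows directly; the ColumnarToRow node should then disappear from the plan, typically at the cost of a slower scan:

-- Hedged experiment: with the vectorized reader off, the scan emits rows
-- directly, so no ColumnarToRow transition should be inserted (usually slower).
SET spark.sql.parquet.enableVectorizedReader = false;
EXPLAIN FORMATTED
SELECT l.id_current AS id_old, r.id_current AS id_new
FROM tuples AS l
JOIN tuples AS r ON (l.`HashedEmail` = r.`HashedEmail`)
WHERE l.id_current > r.id_current;

Comparing the plans with the reader on and off makes the transition point easy to spot.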

A columnar format is optimized for efficiently selecting specific columns, while a row format is best suited for efficiently selecting particular rows. When you frequently need to retrieve specific rows, such as by country or by day, a row-based format with an index on the filtering column can be a beneficial approach. You can also refer to the documentation provided by Spark, where the importance of defining an index is discussed: link to Spark documentation.
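
On Databricks, the closest analogue for a Delta table is data layout optimization rather than a traditional index. As a hedged sketch (OPTIMIZE ... ZORDER BY is standard Databricks Delta SQL; choosing HashedEmail as the clustering column is my assumption based on your join key, and whether it helps depends on your access pattern):

-- Layout optimization, not a true index: co-locates rows with similar
-- HashedEmail values so selective reads can skip files (data skipping).
OPTIMIZE tuples ZORDER BY (HashedEmail);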