Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

What is the "Columnar To Row" node in this simple Databricks SQL query profile?

dave_d
New Contributor II

I am running a relatively simple SQL query that writes back to a table on a Databricks serverless SQL warehouse, and I'm trying to understand why there is a "Columnar To Row" node in the query profile that is consuming the vast majority of the time spent executing the query. The rest of the query plan makes sense to me, but I can't figure out why it would need to convert to row format if the query is just writing back to another table - shouldn't it stay entirely in columnar format? Maybe I'm misunderstanding the purpose of this node...

[Attached screenshot: query profile showing the "Columnar To Row" node]

CREATE OR REPLACE TABLE id_mappings AS
select
  l.id_current as id_old,
  r.id_current as id_new
from
  tuples as l
  join tuples as r on (l.`HashedEmail` = r.`HashedEmail`)
where
  l.id_current > r.id_current
  and l.`HashedEmail` is not null
  and r.`HashedEmail` is not null
  and l.`HashedEmail` != ''
  and r.`HashedEmail` != ''
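One way to see where the transition is placed is to inspect the physical plan before running the write. A sketch using Spark SQL's `EXPLAIN FORMATTED` (available in Spark 3.0+) against the same `SELECT`; the `ColumnarToRow` operator shows up by name in the output:

```sql
-- Print the physical plan without executing the query.
-- Look for a "ColumnarToRow" node in the operator tree.
EXPLAIN FORMATTED
select
  l.id_current as id_old,
  r.id_current as id_new
from
  tuples as l
  join tuples as r on (l.`HashedEmail` = r.`HashedEmail`)
where
  l.id_current > r.id_current
  and l.`HashedEmail` is not null
  and r.`HashedEmail` is not null
  and l.`HashedEmail` != ''
  and r.`HashedEmail` != '';
```

Comparing the plan of the bare `SELECT` with the plan of the full `CREATE OR REPLACE TABLE ... AS` statement can show whether the transition is tied to the query logic itself or to the final write/results stage.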

 

2 REPLIES

dave_d
New Contributor II

Interesting, thanks for the response! Why does the Columnar To Row node happen after all of the query logic here, right before the query results stage? I would assume that if it's needed for a particular operation that only supports row-based processing, it would appear earlier in the plan.

Also, do you know if there is a list anywhere in the Databricks or Spark SQL docs that specifies which Spark SQL expressions/operations support columnar execution? I'd love to know whether I can rewrite this query to avoid that transition, given how expensive it appears to be here.

Annapurna_Hiriy
Contributor

@dave_d We do not have a document listing the operations that introduce a ColumnarToRow node. This node provides a common executor for translating an RDD of ColumnarBatch into an RDD of InternalRow, and it is inserted wherever the planner determines such a transition is needed. You can check the scenarios under which it might be invoked in the following reference:

https://books.japila.pl/spark-sql-internals/physical-operators/ColumnarToRowExec/

A columnar format is optimized for efficiently selecting specific columns, while a row format is best suited for efficiently selecting particular rows. When you frequently need to retrieve specific rows, such as by country or day, a row-based format with an index on the filtering column can be a good fit. You can also refer to the Spark documentation, which discusses the importance of defining an index: link to Spark documentation.
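To make the layout difference concrete, here is a minimal pure-Python sketch of the same records stored both ways. These are illustrative structures only, not Spark's actual ColumnarBatch or InternalRow classes; the `columnar_to_row` helper is hypothetical and mirrors only the conceptual job of the ColumnarToRow node:

```python
# Row format: one record per entry -- cheap to fetch a whole row at once.
rows = [
    {"id_current": 3, "HashedEmail": "a1"},
    {"id_current": 7, "HashedEmail": "b2"},
    {"id_current": 9, "HashedEmail": "a1"},
]

# Columnar format: one array per column -- cheap to scan a single column.
columns = {
    "id_current": [3, 7, 9],
    "HashedEmail": ["a1", "b2", "a1"],
}

def columnar_to_row(cols):
    """Translate a columnar batch into a list of rows, which is
    conceptually what the ColumnarToRow node does per batch."""
    names = list(cols)
    length = len(cols[names[0]])
    return [{name: cols[name][i] for name in names} for i in range(length)]

# The translation reproduces the row-format view of the same data.
assert columnar_to_row(columns) == rows
```

Scanning `columns["id_current"]` touches one contiguous array, whereas the row layout would visit every record; that asymmetry is why the transition between the two layouts has a real cost in the plan.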

 
