08-22-2025 03:34 PM
Hello all,
Have you come across a freely available parser that does a good job of parsing explain plans for SQL queries in Databricks?
08-22-2025 04:53 PM
Hi @noorbasha534,
See if this helps: you can run an EXPLAIN directly in the SQL editor. Try something like:
EXPLAIN FORMATTED
SELECT c.id, SUM(s.val) v
FROM sales s JOIN customers c ON s.cid = c.id
WHERE s.dt >= '2025-08-01'
GROUP BY c.id;
You can also view details via the "See performance" icon that appears after the query runs.
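If you want the plan text as a string (say, to feed into a parser of your own), note that EXPLAIN is itself a statement: running it through spark.sql() returns a one-row DataFrame whose single column holds the plan text. A minimal sketch, reusing the example tables above:

// Capture the EXPLAIN output programmatically instead of reading it in the editor.
// EXPLAIN returns a one-row DataFrame containing the formatted plan as a string.
val planText: String = spark
  .sql("""EXPLAIN FORMATTED
          SELECT c.id, SUM(s.val) v
          FROM sales s JOIN customers c ON s.cid = c.id
          WHERE s.dt >= '2025-08-01'
          GROUP BY c.id""")
  .collect()
  .map(_.getString(0))
  .mkString("\n")
println(planText)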
Hope that helps!
08-22-2025 06:20 PM
@WiliamRosa thanks. What I meant is a parser that gives me the filter columns and join columns from the explain plan.
08-22-2025 06:31 PM - edited 08-22-2025 06:33 PM
See if this helps:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.catalyst.expressions._

def extractFiltersAndJoins(df: DataFrame): (Set[String], Set[(String, String)]) = {
  val plan = df.queryExecution.optimizedPlan
  var filterCols = Set.empty[String]
  var joinCols = Set.empty[(String, String)]
  plan.foreach {
    // collect every column referenced by a filter predicate
    case Filter(condition, _) =>
      condition.references.foreach(ref => filterCols += ref.sql)
    // pair up the columns referenced in each join condition
    case Join(_, _, _, condOpt, _) =>
      condOpt.foreach { cond =>
        cond.references.toSeq.combinations(2).foreach {
          case Seq(a, b) => joinCols += (a.sql -> b.sql)
          case _ =>
        }
      }
    case _ =>
  }
  (filterCols, joinCols)
}
// use it
val df = spark.sql("""
  SELECT c.id, SUM(s.val) v
  FROM sales s JOIN customers c ON s.cid = c.id
  WHERE s.dt >= DATE '2025-08-01'
  GROUP BY c.id
""")
val (filters, joins) = extractFiltersAndJoins(df)
println("Filter columns: " + filters)
println("Join columns: " + joins)
08-22-2025 06:39 PM
@WiliamRosa ah, thanks much.
Just a final question - it seems the code executes the SQL query and then captures the columns from the plan? If so, my need is to go through around 1,000 SQL statements each day and get the columns used in the filters/joins/group-bys of these, so executing them is not an option.
08-22-2025 06:46 PM - edited 08-22-2025 06:50 PM
Got it. If execution is not an option, you can still extract the columns without running the queries by parsing their AST.
Use Spark's internal SQL parser (no execution)
You can parse each statement into an unresolved logical plan and walk the tree for Filter, Join, and Aggregate nodes. This avoids query execution and any reads on data; because the plan stays unresolved, columns come back exactly as written in the SQL (e.g. alias-qualified like s.dt), and the analyzer would only look up catalog objects if you went on to resolve the plan. In Scala:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.catalyst.expressions._

val spark: SparkSession = SparkSession.getActiveSession.get
val parser = spark.sessionState.sqlParser // parse only, no execution

def colsFromExpr(e: Expression): Set[String] =
  e.references.map(_.sql).toSet

case class Parsed(colsFilter: Set[String], colsJoin: Set[(String, String)], colsGroupBy: Set[String])

def scan(sql: String): Parsed = {
  val plan = parser.parsePlan(sql) // unresolved plan (no execution)
  var filterCols = Set.empty[String]
  var joinCols = Set.empty[(String, String)]
  var groupCols = Set.empty[String]
  plan.foreach {
    case f: Filter =>
      filterCols ++= colsFromExpr(f.condition)
    case jn: Join =>
      jn.condition.foreach { cond =>
        val refs = cond.references.toSeq.map(_.sql).distinct
        // pairwise for simple equi-joins
        for {
          i <- refs.indices
          k <- (i + 1) until refs.length
        } joinCols += (refs(i) -> refs(k))
      }
    case a: Aggregate =>
      groupCols ++= a.groupingExpressions.flatMap(colsFromExpr)
    case _ =>
  }
  Parsed(filterCols, joinCols, groupCols)
}
// Example
val sql =
  """SELECT c.id, SUM(s.val)
    |FROM sales s JOIN customers c ON s.cid = c.id
    |WHERE s.dt >= DATE '2025-08-01'
    |GROUP BY c.id""".stripMargin
val out = scan(sql)
println(s"Filter: ${out.colsFilter}")
println(s"Joins: ${out.colsJoin}")
println(s"Group By: ${out.colsGroupBy}")