Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

EXPLAIN PLAN parser

noorbasha534
Valued Contributor II

Hello all,

Have you come across a freely available parser that does a good job of parsing explain plans for SQL queries in Databricks?

5 REPLIES

WiliamRosa
New Contributor III

Hi @noorbasha534,

See if this helps: you can run an EXPLAIN directly in the SQL editor. Try something like:

EXPLAIN FORMATTED
SELECT c.id, SUM(s.val) v
FROM sales s JOIN customers c ON s.cid = c.id
WHERE s.dt >= '2025-08-01'
GROUP BY c.id;

[screenshot: EXPLAIN FORMATTED output in the SQL editor]
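
If you want the plan as a string for logging or downstream parsing, rather than printed output, here is a minimal sketch (assuming Spark 3.x, where QueryExecution exposes explainString):

import org.apache.spark.sql.execution.ExplainMode

val df = spark.sql("""
  SELECT c.id, SUM(s.val) v
  FROM sales s JOIN customers c ON s.cid = c.id
  WHERE s.dt >= '2025-08-01'
  GROUP BY c.id
""")

// Capture the formatted plan as a string instead of printing it to stdout
val planText = df.queryExecution.explainString(ExplainMode.fromString("formatted"))
println(planText)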

You can also view details via the "See performance" icon that appears after the query runs.

[screenshot: the "See performance" query profile view]

Hope that helps!

 

Wiliam Rosa
Data Engineer | Machine Learning Engineer
LinkedIn: linkedin.com/in/wiliamrosa

noorbasha534
Valued Contributor II

@WiliamRosa thanks. What I meant is a parser that gives me the filter columns and join columns from the explain plan 🙂

WiliamRosa
New Contributor III

See if this helps:


import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.catalyst.expressions._

def extractFiltersAndJoins(df: DataFrame): (Set[String], Set[(String, String)]) = {
  val plan = df.queryExecution.optimizedPlan
  var filterCols = Set.empty[String]
  var joinCols = Set.empty[(String, String)]

  // Walk every node of the optimized logical plan
  plan.foreach {
    case Filter(condition, _) =>
      // Columns referenced by a filter predicate
      condition.references.foreach(ref => filterCols += ref.sql)
    case Join(_, _, _, condOpt, _) =>
      // Column pairs referenced by a join condition
      condOpt.foreach { cond =>
        cond.references.toSeq.combinations(2).foreach {
          case Seq(a, b) => joinCols += (a.sql -> b.sql)
          case _ =>
        }
      }
    case _ =>
  }
  (filterCols, joinCols)
}

// use it
val df = spark.sql("""
  SELECT c.id, SUM(s.val) v
  FROM sales s JOIN customers c ON s.cid = c.id
  WHERE s.dt >= DATE '2025-08-01'
  GROUP BY c.id
""")
val (filters, joins) = extractFiltersAndJoins(df)
println("Filter columns: " + filters)
println("Join columns: " + joins)

Wiliam Rosa
Data Engineer | Machine Learning Engineer
LinkedIn: linkedin.com/in/wiliamrosa

noorbasha534
Valued Contributor II

@WiliamRosa ah, thanks much.

Just a final question: it seems the code executes the SQL query and then captures the columns from the plan? If so, my need is to go through around 1,000 SQL statements each day and get the columns used in their filters/joins/group-bys, so executing them is not an option 🙂

WiliamRosa
New Contributor III

Got it. If execution is not an option, you can still extract the columns without running the queries by parsing their AST.

Use Spark's internal SQL parser (no execution)

You can parse each statement to an unresolved logical plan and walk the tree for Filter, Join, and Aggregate nodes. This avoids executing the query and reading any data (the analyzer would only look up catalog objects if you went on to resolve the plan). In Scala:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.catalyst.expressions._

val spark: SparkSession = SparkSession.getActiveSession.get
val parser = spark.sessionState.sqlParser // parse only, no execution

def colsFromExpr(e: Expression): Set[String] =
  e.references.map(_.sql).toSet

case class Parsed(colsFilter: Set[String], colsJoin: Set[(String, String)], colsGroupBy: Set[String])

def scan(sql: String): Parsed = {
  val plan = parser.parsePlan(sql) // unresolved plan (no execution)
  var filterCols = Set.empty[String]
  var joinCols = Set.empty[(String, String)]
  var groupCols = Set.empty[String]

  plan.foreach {
    case f: Filter =>
      filterCols ++= colsFromExpr(f.condition)

    case join: Join =>
      join.condition.foreach { cond =>
        val refs = cond.references.toSeq.map(_.sql).distinct
        // Pair up references; good enough for simple equi-joins
        for {
          i <- refs.indices
          k <- (i + 1) until refs.length
        } joinCols += (refs(i) -> refs(k))
      }

    case agg: Aggregate =>
      groupCols ++= agg.groupingExpressions.flatMap(colsFromExpr)

    case _ =>
  }
  Parsed(filterCols, joinCols, groupCols)
}

// Example
val sql =
  """SELECT c.id, SUM(s.val)
    |FROM sales s JOIN customers c ON s.cid = c.id
    |WHERE s.dt >= DATE '2025-08-01'
    |GROUP BY c.id""".stripMargin

val out = scan(sql)
println(s"Filter: ${out.colsFilter}")
println(s"Joins: ${out.colsJoin}")
println(s"Group By: ${out.colsGroupBy}")
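
For the ~1,000 statements per day, you can then run scan over the whole batch; parsing is pure in-memory work, so nothing is executed against the cluster. A minimal sketch, assuming the statements are available as a Seq[String] (the dailySqls values below are just placeholders for wherever you actually read them from):

// Hypothetical input: in practice, read these from a table, files, or query history
val dailySqls: Seq[String] = Seq(
  "SELECT * FROM sales WHERE dt >= DATE '2025-08-01'",
  "SELECT c.id, COUNT(*) FROM sales s JOIN customers c ON s.cid = c.id GROUP BY c.id"
)

// Parse each statement, keeping failures instead of aborting the whole batch
val results: Seq[(String, Either[Throwable, Parsed])] =
  dailySqls.map(q => q -> scala.util.Try(scan(q)).toEither)

results.foreach {
  case (q, Right(p)) =>
    println(s"OK: filters=${p.colsFilter} joins=${p.colsJoin} groupBy=${p.colsGroupBy}")
  case (q, Left(err)) =>
    println(s"PARSE FAILED: ${q.take(60)} -> ${err.getMessage}")
}
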
Wiliam Rosa
Data Engineer | Machine Learning Engineer
LinkedIn: linkedin.com/in/wiliamrosa
