08-22-2025 03:34 PM
Hello all,
Have you come across a freely available parser that does a good job of parsing explain plans for SQL queries in Databricks?
08-22-2025 04:53 PM
Hi @noorbasha534,
See if this helps: you can run an EXPLAIN directly in the SQL editor. Try something like:
EXPLAIN FORMATTED
SELECT c.id, SUM(s.val) v
FROM sales s JOIN customers c ON s.cid = c.id
WHERE s.dt >= '2025-08-01'
GROUP BY c.id;
You can also view details via the "See performance" icon that appears after the query runs.
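If you want the plan text as a string (say, to feed into a parser of your own), note that EXPLAIN is itself a statement: running it through spark.sql() returns a one-row DataFrame whose single column holds the plan text. A minimal sketch, reusing the example tables above:

// Capture the EXPLAIN output programmatically instead of reading it in the editor.
// EXPLAIN returns a one-row DataFrame containing the formatted plan as a string.
val planText: String = spark
  .sql("""EXPLAIN FORMATTED
          SELECT c.id, SUM(s.val) v
          FROM sales s JOIN customers c ON s.cid = c.id
          WHERE s.dt >= '2025-08-01'
          GROUP BY c.id""")
  .collect()
  .map(_.getString(0))
  .mkString("\n")
println(planText)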
Hope that helps!
08-22-2025 06:20 PM
@WiliamRosa thanks. What I meant is a parser that gives me the filter columns and join columns from the explain plan.
08-22-2025 06:31 PM - edited 08-22-2025 06:33 PM
See if this helps:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.catalyst.expressions._

def extractFiltersAndJoins(df: DataFrame): (Set[String], Set[(String, String)]) = {
  val plan = df.queryExecution.optimizedPlan
  var filterCols = Set.empty[String]
  var joinCols = Set.empty[(String, String)]
  plan.foreach {
    // collect every column referenced by a filter predicate
    case Filter(condition, _) =>
      condition.references.foreach(ref => filterCols += ref.sql)
    // pair up the columns referenced in each join condition
    case Join(_, _, _, condOpt, _) =>
      condOpt.foreach { cond =>
        cond.references.toSeq.combinations(2).foreach {
          case Seq(a, b) => joinCols += (a.sql -> b.sql)
          case _ =>
        }
      }
    case _ =>
  }
  (filterCols, joinCols)
}
// use it
val df = spark.sql("""
  SELECT c.id, SUM(s.val) v
  FROM sales s JOIN customers c ON s.cid = c.id
  WHERE s.dt >= DATE '2025-08-01'
  GROUP BY c.id
""")
val (filters, joins) = extractFiltersAndJoins(df)
println("Filter columns: " + filters)
println("Join columns: " + joins)
08-22-2025 06:39 PM
@WiliamRosa ah, thanks much.
Just a final question - it seems the code executes the SQL query and then captures the columns from the plan? If so, my need is to go through around 1,000 SQL statements each day and get the columns used in the filters/joins/group-bys of these, so executing them is not an option.
08-22-2025 06:46 PM - edited 08-22-2025 06:50 PM
Got it. If execution is not an option, you can still extract the columns without running the queries by parsing their AST.
Use Spark's internal SQL parser (no execution)
You can parse each statement into an unresolved logical plan and walk the tree for Filter, Join, and Aggregate nodes. This avoids query execution and any reads on data; because the plan stays unresolved, columns come back exactly as written in the SQL (e.g. alias-qualified like s.dt), and the analyzer would only look up catalog objects if you went on to resolve the plan. In Scala:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.catalyst.expressions._

val spark: SparkSession = SparkSession.getActiveSession.get
val parser = spark.sessionState.sqlParser // parse only, no execution

def colsFromExpr(e: Expression): Set[String] =
  e.references.map(_.sql).toSet

case class Parsed(colsFilter: Set[String], colsJoin: Set[(String, String)], colsGroupBy: Set[String])

def scan(sql: String): Parsed = {
  val plan = parser.parsePlan(sql) // unresolved plan (no execution)
  var filterCols = Set.empty[String]
  var joinCols = Set.empty[(String, String)]
  var groupCols = Set.empty[String]
  plan.foreach {
    case f: Filter =>
      filterCols ++= colsFromExpr(f.condition)
    case jn: Join =>
      jn.condition.foreach { cond =>
        val refs = cond.references.toSeq.map(_.sql).distinct
        // pairwise for simple equi-joins
        for {
          i <- refs.indices
          k <- (i + 1) until refs.length
        } joinCols += (refs(i) -> refs(k))
      }
    case a: Aggregate =>
      groupCols ++= a.groupingExpressions.flatMap(colsFromExpr)
    case _ =>
  }
  Parsed(filterCols, joinCols, groupCols)
}
// Example
val sql =
  """SELECT c.id, SUM(s.val)
    |FROM sales s JOIN customers c ON s.cid = c.id
    |WHERE s.dt >= DATE '2025-08-01'
    |GROUP BY c.id""".stripMargin
val out = scan(sql)
println(s"Filter: ${out.colsFilter}")
println(s"Joins: ${out.colsJoin}")
println(s"Group By: ${out.colsGroupBy}")