Debugging difference between "task time" and execution time for SQL query
10-23-2024 11:35 AM
I have a pretty large SQL query that has the following stats from the query profiler:
Tasks total time: 1.93s
Executing: 27s
According to the query profiler, this gap can be caused by tasks waiting for available nodes.
How should I approach this to figure out where this is happening?
10-23-2024 12:37 PM
Hi nengen,
Could you share more information so we can help you?
10-25-2024 04:01 AM
I have a pretty complex and large SQL query that does a lot of joins on CTEs. Due to the nature of the data, these have to be cross joins, so I suspect that is why it is slow. I was hoping to pinpoint where the tasks are waiting for available nodes, or where the query spends most of its wall-clock time. I tried the query profiler, but it seems to show the execution time of the individual tasks rather than the whole process.
10-25-2024 06:34 AM
@nengen Try using EXPLAIN EXTENDED: it provides a detailed breakdown of the logical and physical plans of a query in Spark SQL.
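As a minimal sketch (the table and CTE names here are placeholders, not from your actual query), you just prefix the statement:

```sql
-- Placeholder tables/CTEs; substitute your real query.
EXPLAIN EXTENDED
WITH recent_orders AS (
  SELECT * FROM orders WHERE order_date >= '2024-01-01'
)
SELECT c.customer_id, o.order_id
FROM customers c
CROSS JOIN recent_orders o;
```

The output shows the parsed, analyzed, optimized, and physical plans, which lets you see which joins become broadcast exchanges and whether filters are pushed down to the scans.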
Based on the EXPLAIN EXTENDED output, here are a few things to consider:
- Broadcast Exchange: If the join causes data skew, consider switching to a sort-merge join (see the sketch after this list).
- FileScan: If the scan is slow, consider partitioning or caching the data to improve performance.
- Filter Pushdown: Ensure the most restrictive filters are applied early to reduce the amount of data processed.
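For the first point, a minimal sketch, assuming the skewed join is on a table named orders (a placeholder name): you can hint Spark toward a sort-merge join, or disable automatic broadcast joins for the session:

```sql
-- Hint Spark to use a sort-merge join for this join (table names are placeholders)
SELECT /*+ MERGE(o) */ o.order_id, c.customer_id
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;

-- Alternatively, disable automatic broadcast joins session-wide
SET spark.sql.autoBroadcastJoinThreshold = -1;
```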
Please review the full EXPLAIN EXTENDED output for more details.

