01-08-2026 07:23 AM
Dear community,
We are running nightly business vault loads. Last year one of them stopped finishing on a classic warehouse; after testing, switching to a serverless warehouse made it run to completion again and it has stayed that way, but costs have increased a lot. I have been testing numerous spark_conf options with the problematic query on a classic warehouse to see whether, after optimization, I could move it back to classic. It hangs at 100% with no rows written, which indicates a commit issue. I also tried materializing it as a view and writing the table with a post-hook, but that only moves the issue to the post-hook.
Longer-running but similarly structured queries in the same schema do finish on classic.
The example image above has shown 100% after a few minutes, but the query has been running for 17 minutes and will run indefinitely until canceled.
What are my options to combat this hanging state on a classic warehouse?
Below are a few spark_conf options I set or changed while testing:
"spark.sql.adaptive.enabled": "true",
"spark.sql.adaptive.coalescePartitions.enabled": "true",
"spark.sql.adaptive.coalescePartitions.minPartitionSize": "134217728",
"spark.sql.adaptive.skewJoin.enabled": "true",
"spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes": "134217728",
"spark.sql.adaptive.skewJoin.skewedPartitionFactor": "5",
"spark.sql.shuffle.partitions": "64",
"spark.sql.autoBroadcastJoinThreshold": "-1",
"spark.sql.broadcastTimeout": "1200",
"spark.sql.network.timeout": "800s",
"spark.sql.execution.arrow.maxRecordsPerBatch": "50000",
"spark.sql.files.maxPartitionBytes": "134217728",
"spark.sql.adaptive.advisoryPartitionSizeInBytes": "134217728",
"spark.sql.adaptive.localShuffleReader.enabled": "true",
"spark.sql.join.preferSortMergeJoin": "true",
"spark.sql.optimizer.dynamicPartitionPruning.enabled": "true",
"spark.databricks.delta.optimizeWrite.enabled": "true",
"spark.databricks.delta.autoCompact.enabled": "true",
"spark.sql.execution.arrow.pyspark.enabled": "true",
"spark.sql.inMemoryColumnarStorage.compressed": "true",
"spark.sql.inMemoryColumnarStorage.batchSize": "10000",
"spark.sql.cbo.enabled": "false",
Thanks for reading!
Ties
01-08-2026 08:13 AM - edited 01-08-2026 08:14 AM
If you have a Support contract, this would be a good one to create a ticket for.
That being said, is there a reason you need to be on classic? It's an engine that should really be considered a starter engine; you should be using Pro or Serverless for anything where performance is a measuring stick. Have you tried the same query on a Pro warehouse?
Also, how are you setting these configs? SQL warehouses don't respect all configs, so if you are setting them in dbt, it's possible they are being ignored or just causing unintended effects.
Also, if I'm reading this correctly, you're going from 7M to 3.7B rows after an inner join? Do you have N matches for each row? It's possible that the engine improvements in Serverless handle this explosion much better than the classic engine.
01-09-2026 02:16 AM
@MoJaMa thanks for your response,
The reason I want to be on classic is cost: serverless runs have sadly tripled our costs. I have tried the query in question on Pro and it finishes in two minutes, but the cost indication for running only on Pro does not seem to fit the budget either. Hence the choice of serverless for now.
You can set a spark_conf dictionary from within the model, inside the config block, and I noticed those settings are being picked up. You can also set them in dbt_project.yml if you want to apply them to a whole schema or just one model.
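For illustration, a minimal sketch of the in-model variant (the model and ref names are placeholders, and which config keys the dbt-databricks adapter actually honors depends on the adapter version and warehouse type):

```sql
-- models/businessvault/my_model.sql (hypothetical model)
{{
  config(
    materialized='table',
    spark_conf={
      "spark.sql.shuffle.partitions": "64",
      "spark.sql.adaptive.skewJoin.enabled": "true"
    }
  )
}}

select * from {{ ref('my_upstream_model') }}
```

The same keys can be placed under the `models:` section of dbt_project.yml to cover a whole folder instead of a single model.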
The explosion you are seeing is not intentional. Longer-running but similarly structured queries in the same schema show it as well, and those finish within a few minutes. It seems that Databricks' optimizing/restructuring of the query produces this explosion for some reason; there must be some Databricks logic to it, I guess. It is clear that Pro and Serverless are better at handling it.
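To check whether the explosion comes from many-to-many join keys, a quick multiplicity count on each side of the join can help (the table and column names below are placeholders):

```sql
-- Rows sharing a join key on one side multiply against the matches on the
-- other side, so keys with high counts on both sides would explain a
-- 7M -> 3.7B blow-up after an inner join.
select join_key, count(*) as cnt
from left_table
group by join_key
having count(*) > 1
order by cnt desc
limit 20;
```

Running the same query against both join inputs shows which keys fan out.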
01-08-2026 08:19 AM
Hey @Ties,
When a query reaches 100% completion but fails to commit, it usually points to a metadata bottleneck, file-system contention, or massive data skew.
While the job is hanging at 100%, go to the Spark UI -> Executors tab -> Driver -> Thread Dump. Look for threads stuck in FileCommitProtocol or DeltaLog. If you see many threads waiting on S3/ADLS/GCS listing, you have a file-count problem.
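If a file-count problem is suspected, one quick check on the target Delta table (the table name is a placeholder) is:

```sql
-- DESCRIBE DETAIL reports numFiles; a very large file count can make the
-- listing/commit phase crawl.
describe detail my_target_table;

-- If numFiles is high, compacting the table may help:
optimize my_target_table;
```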
Checking the driver log will also give an idea of which action is stuck; when Spark says 100% but won't finish, it's often in the driver. I would still ask you to go to the Spark UI and see whether any tasks are running. A screenshot of that can help us check further.
Thanks!
01-09-2026 03:09 AM
@NandiniN thanks for your response,
I tried influencing how the skew join is applied and how files are written, but with no results.
This query is part of a business vault run and is called from dbt. It is not a job run native to Databricks, and sadly I cannot simulate this run from a job with a dbt command, as a classic warehouse is unavailable to use there.
02-02-2026 02:57 AM
I saw that a reply in this topic was marked as a solution. Sadly that is not the case, and we are still in limbo with this issue.
I tried setting up a meeting through Databricks support on Azure, but the third-party rep Microsoft provided did not show up.
I also set the business vault runs to run on a Pro warehouse for a week to compare. Costs are slightly higher than serverless, while performance is the same.