
No space left on device and IllegalStateException: Have already allocated a maximum of 8192 pages

SaiCharan
New Contributor

Hello, I'm writing to bring to your attention an issue we have encountered while working with Databricks, and to seek your assistance in resolving it.

Context of the error:

When a SQL query (~1,700 lines) is run, the corresponding Databricks job fails with "No space left on device". Upgrading the cluster configuration and adding the relevant Spark configurations still errors out, now with IllegalStateException: Have already allocated a maximum of 8192 pages.

Initially, runs were completing with the following configuration: Driver: m5.4xlarge · Workers: m5.4xlarge · 20-50 workers, plus the Spark configurations spark.driver.maxResultSize 8g and spark.sql.autoBroadcastJoinThreshold -1. This now fails with:

Caused by: java.io.IOException: No space left on device
ERROR Uncaught throwable from user code: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 149 (count at DuplicatesFinder.scala:23) has failed the maximum allowable number of times: 4. Most recent failure reason:
org.apache.spark.shuffle.FetchFailedException.
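
For context, a minimal sketch of where these settings go on a Databricks cluster (assuming a notebook where spark, the SparkSession, is in scope; the values are the ones quoted above):

// SQL confs such as the broadcast threshold can be changed per session:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1") // disable broadcast joins

// Core driver confs cannot be changed at runtime; they belong in the
// cluster's "Spark config" box and take effect only after a restart:
//   spark.driver.maxResultSize 8g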

Based on this error, I checked the stages and the DAG, and observed that the failed stage, which collects samples for range partitioning in the count at DuplicatesFinder.scala, consumes most of the driver memory for shuffle read and shuffle write.
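
If the driver pressure really does come from collecting samples for a range exchange (typically introduced by a global ORDER BY or a window function), one knob to experiment with is the per-partition sample size. spark.sql.execution.rangeExchange.sampleSizePerPartition is an internal Spark conf, so treat this sketch as an assumption to verify on your runtime:

// Assumption: the failed stage is collecting range-partition samples on the
// driver. Lowering the per-partition sample size (default 100) reduces what
// the driver has to collect; internal confs may change between versions.
spark.conf.set("spark.sql.execution.rangeExchange.sampleSizePerPartition", "50")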

I have tried upgrading to larger cluster resources, scaling the relevant Spark configurations each time:
spark.sql.shuffle.partitions (auto, 100, 300, 400, up to 4000)
spark.driver.maxResultSize 16g
spark.default.parallelism 4000
spark.driver.memory 16g
spark.executor.memory 6g

enable-cache: false
view-partitions: 8000

and now it errors out with IllegalStateException: Have already allocated a maximum of 8192 pages.
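
For what it's worth, the 8192-page limit is the per-task page-table size in Spark's TaskMemoryManager and is hard-coded, so raising executor memory alone will not lift it; the usual direction is to make each individual task smaller. A hedged sketch, assuming one oversized or skewed shuffle task is the culprit (the table name events is a placeholder):

// The per-task page cap (8192 pages) is not configurable, so aim for
// smaller tasks instead of a bigger limit.
spark.conf.set("spark.sql.shuffle.partitions", "8000")        // more, smaller shuffle tasks
spark.conf.set("spark.sql.adaptive.enabled", "true")          // let AQE coalesce/split post-shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true") // split skewed join partitions

// Hypothetical illustration: spread a skewed input across more partitions
// before the expensive count ("events" stands in for your table).
val df = spark.table("events").repartition(8000)
println(df.count())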

Kindly let us know what points should be considered with respect to this issue. Any suggestions to resolve it would be much appreciated.

Thanks for your time and assistance. Appreciate it.

Best regards,

Sai Charan

1 REPLY

jose_gonzalez
Databricks Employee

Are you processing Parquet files, or what is the format of your tables?
Can you split your SQL query into smaller pieces instead of having one huge 1,700-line query?
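
A rough sketch of that split, materializing intermediate chunks so that no single job carries the whole plan (table names and the toy SELECTs are placeholders for pieces of the real query):

// Break the big query into materialized stages; each job then runs a
// smaller plan with its own, smaller shuffle.
spark.sql("""
  CREATE OR REPLACE TABLE stage1_tmp AS
  SELECT key, value FROM source_table WHERE value IS NOT NULL
""")
spark.sql("""
  CREATE OR REPLACE TABLE stage2_tmp AS
  SELECT key, COUNT(*) AS cnt FROM stage1_tmp GROUP BY key
""")
val result = spark.table("stage2_tmp") // final step replaces the one big query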
