Hubert-Dudek
Databricks MVP

It is hard to analyze without Spark UI and more detailed information, but anyway few tips:

  • look for data skews some partitions can be very big some small because of incorrect partitioning. You can use Spark UI to do that but also debug your code a bit (get getNumPartitions())
  • increase shuffle size spark.sql.shuffle.partitions default is 200 try bigger, I would go to 1000 at least even
  • increase size of driver to be 2 times bigger than executor (but to get optimal size please analyze load - in databricks on cluster tab look to Metrics there is Ganglia or even better integarte datadog with cluster)
  • make sure that everything run in distributed way, specially udf, you need to use vectorized pandas udfs so they will run on executors

My blog: https://databrickster.medium.com/