Options
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
03-17-2022 03:42 AM
It is hard to analyze without Spark UI and more detailed information, but anyway few tips:
- look for data skews some partitions can be very big some small because of incorrect partitioning. You can use Spark UI to do that but also debug your code a bit (get getNumPartitions())
- increase shuffle size spark.sql.shuffle.partitions default is 200 try bigger, I would go to 1000 at least even
- increase size of driver to be 2 times bigger than executor (but to get optimal size please analyze load - in databricks on cluster tab look to Metrics there is Ganglia or even better integarte datadog with cluster)
- make sure that everything run in distributed way, specially udf, you need to use vectorized pandas udfs so they will run on executors
My blog: https://databrickster.medium.com/