How to make the write operation faster when writing a Spark DataFrame to a Delta table
02-26-2025 05:22 PM
I am doing 4 spatial join operations on files with the following sizes:
- Base_road_file, which is 1 GB
- Telematics file, which is 1.2 GB
- state boundary file, BH road file, client_geofence file, and kpmg_geofence_file, which are not too large
My databricks cluster details are as follows:
13.3 LTS runtime, Standard_DS5_v2 (56 GB memory, 16 cores) for both driver and worker nodes
The issue is that the joins happen within seconds, but writing to a Delta table times out my entire run. Moreover, even if I increase the timeout, the whole operation keeps running for hours, which is not acceptable for my client.
I have tried repartitioning and have also enabled optimizeWrite in my Spark session settings, but nothing seems to help. Could anyone please suggest a way to make the write operation faster?
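For reference, the write path currently looks roughly like this; the DataFrame and table names here are simplified placeholders:

```python
# Enable Delta optimized writes and auto-compaction (Databricks settings).
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

# joined_df is the result of the four spatial joins.
(joined_df
    .write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("telematics_joined"))
```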
02-28-2025 08:12 AM
Hi @Sjoshi ,
I think the following information would be helpful to understand more about the problem you're experiencing:
- The schema of the tables involved in the join.
- The join condition used on each join.
- How the inputs are stored before the job reads them for the join.
- Any Spark configuration options you're setting apart from the default settings.
- The query plans, either from the Spark UI or from running an explain (a minimal example follows this list).
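One note on why the write is the slow step: Spark evaluates lazily, so the joins only appear to finish in seconds because nothing actually executes until the write action runs. The write is paying for the whole join. Capturing the physical plan will show what that work really is, for example:

```python
# Print the formatted physical plan of the joined DataFrame (name assumed).
# Watch for CartesianProduct or BroadcastNestedLoopJoin nodes, which are
# common culprits when a spatial predicate can't use an equi-join.
joined_df.explain(mode="formatted")
```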
Additionally, I've found this past talk from a Spark Summit very helpful for inspecting and improving the performance of my own workloads. https://www.youtube.com/watch?v=daXEp4HmS-E
02-28-2025 08:34 AM
We recommend using a spatial framework such as Databricks Mosaic or Apache Sedona to speed up operations like spatial joins and point-in-polygon tests. Without such a framework, many of these operations fall back to unoptimized, explosive cross joins.
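As a rough illustration, here is what a point-in-polygon join might look like with Apache Sedona; this sketch assumes Sedona is installed on the cluster, and the view and column names are made up for the example:

```python
from sedona.spark import SedonaContext

# Register Sedona's spatial SQL functions and optimizer rules.
sedona = SedonaContext.create(spark)

# Illustrative inputs: a polygon table and a point table.
roads_df.createOrReplaceTempView("roads")            # has a geometry column `geom`
telematics_df.createOrReplaceTempView("telematics")  # has `lon`/`lat` columns

# Sedona's optimizer rewrites this predicate into a partitioned spatial
# join rather than a cross join.
joined = sedona.sql("""
    SELECT t.*, r.road_id
    FROM telematics t
    JOIN roads r
      ON ST_Contains(r.geom, ST_Point(t.lon, t.lat))
""")
```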

