topic Re: How to make the write operation faster for writing a spark dataframe to a delta table in Data Engineering

How to make the write operation faster for writing a spark dataframe to a delta table

Sjoshi — Thu, 27 Feb 2025 01:22:58 GMT

So, I am doing 4 spatial join operations on files with the following sizes:

Base_road_file, which is 1 GB
Telematics file, which is 1.2 GB
State boundary file, BH road file, client_geofence file and kpmg_geofence_file, none of which are very large

My databricks cluster details are as follows:

13.3 LTS runtime, Standard_DS5_v2 (56 GB memory, 16 cores) for both driver and worker nodes

The issue is that the joins happen within seconds, but writing to a Delta table is timing out my entire run. Moreover, even if I increase the timeout, the whole operation keeps running for hours, which is not good for my client.

I have tried repartition and have also added optimizeWrite to my Spark session settings, but nothing seems to help. Could anyone please suggest a way to make my write operation faster?
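
For readers following along, the two settings mentioned above would typically look something like the sketch below. The poster's actual code is not shown in the thread, so the function name, table name, and partition count here are hypothetical; `spark` and `df` are assumed to be an existing Databricks Spark session and the joined DataFrame.

```python
def write_with_tuning(spark, df, table_name, num_partitions=32):
    """Hypothetical helper: enable optimized writes, repartition, then write to Delta.

    `spark` is assumed to be a SparkSession on Databricks and `df` a Spark
    DataFrame; `num_partitions` is an illustrative value, not a recommendation.
    """
    # Session-level switch for Delta optimized writes on Databricks.
    spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")

    # Explicitly reshuffle the data into a fixed number of partitions,
    # then write the result as a managed Delta table.
    (
        df.repartition(num_partitions)
        .write.format("delta")
        .mode("overwrite")
        .saveAsTable(table_name)
    )
```

Note that neither setting changes how much work the joins themselves do. Optimized writes already rebalance data before writing, so an explicit `repartition` on top of them may just add another shuffle.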

Re: How to make the write operation faster for writing a spark dataframe to a delta table

Nik_Vanderhoof — Fri, 28 Feb 2025 16:12:43 GMT

Hi @Sjoshi ,

I think the following information would be helpful to understand more about the problem you're experiencing:
- The schema of the tables involved in the join.
- The join condition used on each join.
- How the inputs are stored before the job reads them for the join.
- Any spark configuration options you're setting apart from the default settings.
- The query plans you can see, either from the Spark UI or from running an explain.
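
One possible way to gather the last item, and to check where the time actually goes, is sketched below. Spark transformations are lazy, so joins that appear to finish in seconds have usually only built a plan, and the real join work runs when the write is triggered. The helper name `profile_plan` is hypothetical; it assumes `df` is the joined Spark DataFrame and relies on the `noop` data source available in Spark 3.0 and later, which executes the full plan without writing any output.

```python
import time

def profile_plan(df):
    """Hypothetical helper: print the physical plan and time a write-free run.

    Returns the elapsed seconds for executing the whole plan with the
    `noop` sink, i.e. everything except the actual Delta write.
    """
    # Print the formatted physical plan. Operators such as
    # BroadcastNestedLoopJoin or CartesianProduct suggest a cross join.
    df.explain(mode="formatted")

    # Execute the full plan but discard the rows instead of writing them.
    start = time.perf_counter()
    df.write.format("noop").mode("overwrite").save()
    return time.perf_counter() - start
```

If this call alone takes hours, the bottleneck is likely in the joins rather than in the Delta write itself.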

Additionally, I've found this past talk from a Spark Summit very helpful for inspecting and improving the performance of my own workloads. https://www.youtube.com/watch?v=daXEp4HmS-E

Re: How to make the write operation faster for writing a spark dataframe to a delta table

cgrant — Fri, 28 Feb 2025 16:34:58 GMT

We recommend using spatial frameworks such as Databricks Mosaic or Apache Sedona to speed up operations like spatial joins and point-in-polygon tests. Without these frameworks, many of these operations result in unoptimized and explosive cross joins.
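
To illustrate why this matters, here is a toy sketch in plain Python. It is not how Mosaic or Sedona are implemented; it only shows the general idea of pruning candidate pairs with a grid before running the exact containment test. Axis-aligned rectangles stand in for polygons, and all names (`CELL`, `build_index`, `naive_join`, `grid_join`) are hypothetical.

```python
from collections import defaultdict
from math import floor

CELL = 1.0  # grid cell size in coordinate units (illustrative value)

def cell_of(x, y, size=CELL):
    """Return the grid cell (column, row) that contains the point (x, y)."""
    return (floor(x / size), floor(y / size))

def contains(box, x, y):
    """Exact test: is (x, y) inside the box (xmin, ymin, xmax, ymax)?"""
    xmin, ymin, xmax, ymax = box
    return xmin <= x <= xmax and ymin <= y <= ymax

def build_index(boxes, size=CELL):
    """Map each grid cell to the ids of the boxes that overlap it."""
    index = defaultdict(list)
    for box_id, (xmin, ymin, xmax, ymax) in boxes.items():
        cx0, cy0 = cell_of(xmin, ymin, size)
        cx1, cy1 = cell_of(xmax, ymax, size)
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                index[(cx, cy)].append(box_id)
    return index

def naive_join(points, boxes):
    """Cross join: every point is tested against every box."""
    pairs, checks = [], 0
    for pid, (x, y) in points.items():
        for bid, box in boxes.items():
            checks += 1
            if contains(box, x, y):
                pairs.append((pid, bid))
    return sorted(pairs), checks

def grid_join(points, boxes, size=CELL):
    """Grid join: each point is tested only against boxes sharing its cell."""
    index = build_index(boxes, size)
    pairs, checks = [], 0
    for pid, (x, y) in points.items():
        for bid in index.get(cell_of(x, y, size), ()):
            checks += 1
            if contains(boxes[bid], x, y):
                pairs.append((pid, bid))
    return sorted(pairs), checks

points = {"p1": (0.5, 0.5), "p2": (5.2, 5.7), "p3": (9.0, 1.0)}
boxes = {"a": (0, 0, 1, 1), "b": (5, 5, 6, 6), "c": (20, 20, 21, 21)}

naive_pairs, naive_checks = naive_join(points, boxes)
grid_pairs, grid_checks = grid_join(points, boxes)
print(naive_pairs, naive_checks)
print(grid_pairs, grid_checks)
```

Both functions return the same matches, but the naive version performs one containment check per point-box pair, while the grid version only checks boxes that share a cell with the point. With inputs of around a gigabyte each, that difference in candidate pairs is what tends to separate a join that finishes from one that runs for hours.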