Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to make the write operation faster for writing a spark dataframe to a delta table

Sjoshi
New Contributor

I am performing four spatial join operations on files with the following sizes:

  1. Base_road_file, which is 1 GB
  2. Telematics file, which is 1.2 GB
  3. State boundary file, BH road file, client_geofence file, and kpmg_geofence_file, which are not too large

My databricks cluster details are as follows:

13.3 LTS runtime, Standard_DS5_v2 (56 GB memory, 16 cores) for both driver and worker nodes

The issue is that the joins complete within seconds, but writing to a Delta table times out my entire run. Moreover, even if I increase the timeout, the whole operation keeps running for hours, which is not acceptable for my client.

I have tried repartitioning and have also added optimizeWrite to my Spark session settings, but nothing seems to help. Could anyone please suggest a way to make the write operation faster?
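For reference, my write step looks roughly like this (the DataFrame name and output path are simplified placeholders, not my actual code):

```python
# Sketch of the current setup: Delta optimized writes enabled on the
# session, then a plain Delta write of the joined result.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.databricks.delta.optimizeWrite.enabled", "true")
    .config("spark.databricks.delta.autoCompact.enabled", "true")
    .getOrCreate()
)

# joined_df stands in for the result of the four spatial joins above.
(
    joined_df
    .write
    .format("delta")
    .mode("overwrite")
    .save("/mnt/output/joined_telematics")  # placeholder path
)
```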

2 REPLIES 2

Nik_Vanderhoof
Contributor

Hi @Sjoshi ,

I think the following information would be helpful to understand more about the problem you're experiencing:
- The schema of the tables involved in the join.
- The join condition used on each join.
- How the inputs are stored before the job reads them for the join.
- Any Spark configuration options you're setting apart from the default settings.
- The query plans you can see, either from the Spark UI or from running an explain.
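For example, you can print the physical plan for the DataFrame just before the write (the DataFrame name here is a placeholder):

```python
# Print the formatted physical plan for the final DataFrame.
# Look for BroadcastNestedLoopJoin or CartesianProduct nodes, which
# usually explain why a write that triggers the full job runs for hours.
joined_df.explain(mode="formatted")
```

Note that Spark is lazy: the joins only "happen within seconds" because nothing is computed until the write, so the write step is actually executing the whole join.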

Additionally, I've found this past talk from a Spark Summit very helpful for inspecting and improving the performance of my own workloads. https://www.youtube.com/watch?v=daXEp4HmS-E


cgrant
Databricks Employee

We recommend using spatial frameworks such as Databricks Mosaic or Apache Sedona to speed up operations like spatial joins and point-in-polygon lookups. Without these frameworks, many of these operations degenerate into unoptimized, explosive cross joins.
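A point-in-polygon join with Apache Sedona looks roughly like this (table and column names are illustrative, not your schema):

```python
# Illustrative Sedona sketch: register the spatial SQL extensions, build
# geometry columns, and express the join as a spatial predicate.
from sedona.spark import SedonaContext

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

# Points from the telematics data (placeholder table/columns).
sedona.sql(
    "SELECT id, ST_Point(lon, lat) AS geom FROM telematics_raw"
).createOrReplaceTempView("points")

# Polygons from the state boundary data (placeholder table/columns).
sedona.sql(
    "SELECT state, ST_GeomFromWKT(wkt) AS geom FROM state_boundaries_raw"
).createOrReplaceTempView("polygons")

# With Sedona's extensions registered, this predicate is planned as a
# spatial range join rather than a cartesian product.
joined = sedona.sql("""
    SELECT p.id, g.state
    FROM points p JOIN polygons g
    ON ST_Contains(g.geom, p.geom)
""")
```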
