How to make the write operation faster when writing a Spark DataFrame to a Delta table
02-26-2025 05:22 PM
I am doing 4 spatial join operations on files with the following sizes:
- Base_road_file, which is 1 GB
- Telematics file, which is 1.2 GB
- state boundary file, BH road file, client_geofence file, and kpmg_geofence_file, which are not too large
My databricks cluster details are as follows:
13.3 LTS runtime, Standard_DS5_v2 (56 GB memory, 16 cores) for both driver and worker nodes
The issue is that the joins happen within seconds, but writing to a Delta table times out my entire run. Moreover, even if I increase the timeout, the whole operation keeps running for hours, which is not acceptable for my client.
I have tried repartitioning and have also enabled optimizeWrite in my Spark session settings, but nothing seems to help. Could anyone please suggest a way to make the write operation faster?
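For reference, the write path currently looks roughly like this; the DataFrame and table names here are simplified placeholders:

```python
# Enable Delta optimized writes and auto-compaction (Databricks settings).
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

# joined_df is the result of the four spatial joins.
(joined_df
    .write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("telematics_joined"))
```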
02-28-2025 08:12 AM
Hi @Sjoshi ,
I think the following information would be helpful to understand more about the problem you're experiencing:
- The schema of the tables involved in the join.
- The join condition used on each join.
- How the inputs are stored before the job reads them for the join.
- Any Spark configuration options you're setting apart from the default settings.
- The query plans, either from the Spark UI or from running an explain (a minimal example follows this list).
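One note on why the write is the slow step: Spark evaluates lazily, so the joins only appear to finish in seconds because nothing actually executes until the write action runs. The write is paying for the whole join. Capturing the physical plan will show what that work really is, for example:

```python
# Print the formatted physical plan of the joined DataFrame (name assumed).
# Watch for CartesianProduct or BroadcastNestedLoopJoin nodes, which are
# common culprits when a spatial predicate can't use an equi-join.
joined_df.explain(mode="formatted")
```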
Additionally, I've found this past talk from a Spark Summit very helpful for inspecting and improving the performance of my own workloads. https://www.youtube.com/watch?v=daXEp4HmS-E
02-28-2025 08:34 AM
We recommend using a spatial framework such as Databricks Mosaic or Apache Sedona to speed up operations like spatial joins and point-in-polygon tests. Without such a framework, many of these operations fall back to unoptimized, explosive cross joins.
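As a rough illustration, here is what a point-in-polygon join might look like with Apache Sedona; this sketch assumes Sedona is installed on the cluster, and the view and column names are made up for the example:

```python
from sedona.spark import SedonaContext

# Register Sedona's spatial SQL functions and optimizer rules.
sedona = SedonaContext.create(spark)

# Illustrative inputs: a polygon table and a point table.
roads_df.createOrReplaceTempView("roads")            # has a geometry column `geom`
telematics_df.createOrReplaceTempView("telematics")  # has `lon`/`lat` columns

# Sedona's optimizer rewrites this predicate into a partitioned spatial
# join rather than a cross join.
joined = sedona.sql("""
    SELECT t.*, r.road_id
    FROM telematics t
    JOIN roads r
      ON ST_Contains(r.geom, ST_Point(t.lon, t.lat))
""")
```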

