Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Exponentially slower joins using PySpark

datatello
New Contributor II

I'm new to PySpark, and I've run into an odd issue with joins: the action seems to take exponentially longer every time I add a new join to a function I'm writing.

I'm trying to join a dataset of ~3 million records to one of ~17 million records ten times, each time with slightly different join criteria. Each join on its own takes 15-50 seconds to complete. However, when I chain the joins together in one function, the action takes exponentially longer: with 2 joins it runs in about a minute, with 5 joins the function takes about 11 minutes, and with 7 or 8 joins the notebook runs for hours and then fails with a generic cluster error.

I've tried repartitioning and caching the data before the joins, but if anything this seems to slow the joins down even further.

I can't work out what I've done wrong, and from QAing every line of the notebook, nothing obvious is jumping out.

3 REPLIES

-werners-
Esteemed Contributor III

Probably some bug in your function.

What I suggest is to first execute all the joins manually and run an explain to get the query plan.

Then compare that query plan to the one created by your function.

If your function builds the joins in a loop, that loop is the most likely culprit.
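One way to do that comparison is sketched below. It is only an illustration, not a definitive recipe: `capture_plan` and `diff_plans` are hypothetical helper names, and the sketch assumes Spark 3.0 or later, where `DataFrame.explain(mode="formatted")` is available and PySpark prints the plan to standard output. Expression IDs such as `#123` change between runs, so the sketch masks them before diffing; that regex is an assumption about the plan text and may need adjusting.

```python
import contextlib
import difflib
import io
import re


def capture_plan(df, mode="formatted"):
    """Return the text that df.explain(mode=...) prints, with expression IDs masked."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        df.explain(mode=mode)  # PySpark prints the plan instead of returning it
    # Mask IDs like "#42" or "#42L" so unrelated numbering does not show up as a difference.
    return re.sub(r"#\d+L?", "#N", buf.getvalue())


def diff_plans(manual_df, function_df, mode="formatted"):
    """Return a unified diff (list of lines) between the two query plans."""
    manual = capture_plan(manual_df, mode).splitlines()
    generated = capture_plan(function_df, mode).splitlines()
    return list(
        difflib.unified_diff(manual, generated, "manual", "function", lineterm="")
    )


# Hypothetical usage, assuming `manual_df` was built by writing the joins out
# by hand and `function_df` is what your function returns:
#
#     for line in diff_plans(manual_df, function_df):
#         print(line)
```

An empty diff means both versions produce the same plan, so the slowdown would not be caused by the function itself. Extra scans, repeated joins, or a much deeper plan on the "function" side would point at how the function builds the joins.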

jose_gonzalez
Databricks Employee

Hi @Lee Bevers​,

Which DBR version are you using? Could you share some code snippets, along with the physical query plans and the DAGs?

Vidula
Honored Contributor

Hi @Lee Bevers​ 

Hope all is well! Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
