Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

slow running query

joakon
New Contributor III

Hi All,

I would like to get some ideas on how to improve performance when joining data frames with around 10M rows each.

Source: ADLS Gen2

df1 = source1, format = parquet (~10M rows)

df2 = source2, format = parquet (~10M rows)

df = inner join of df1 and df2

df.count() is taking forever.

I am trying to join the above sources, aggregate the result, and write it back to ADLS.
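For reference, a minimal PySpark sketch of the pipeline described above might look like the following. The ADLS paths, the join key `id`, and the aggregation are placeholders since the post does not include the actual code, and `spark` is the session Databricks provides in a notebook.

```python
from pyspark.sql import functions as F

# Placeholder ADLS Gen2 paths and an assumed join key "id" -- adjust to the real sources.
df1 = spark.read.parquet("abfss://container@account.dfs.core.windows.net/source1/")
df2 = spark.read.parquet("abfss://container@account.dfs.core.windows.net/source2/")

# Inner join on the assumed key column.
df = df1.join(df2, on="id", how="inner")

# Example aggregation, then write the result back to ADLS.
agg = df.groupBy("id").agg(F.count("*").alias("row_count"))
agg.write.mode("overwrite").parquet("abfss://container@account.dfs.core.windows.net/output/")
```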

1 ACCEPTED SOLUTION

Accepted Solutions

LandanG
Databricks Employee

@raghu maremanda It's hard to provide an answer without more info. Can you add the actual code used in the join, as well as the total data size and cluster configuration (node types and number of nodes)?


5 REPLIES

LandanG
Databricks Employee

@raghu maremanda It's hard to provide an answer without more info. Can you add the actual code used in the join, as well as the total data size and cluster configuration (node types and number of nodes)?

labtech
Valued Contributor II

What size are the Parquet files? If they are small enough, you can try pandas and compare it against PySpark.
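A small sketch of that comparison, assuming the files sit under a DBFS mount and share an `id` key (both placeholders), could be:

```python
import pandas as pd

# Placeholder paths; only sensible if the data fits comfortably on the driver.
pdf1 = pd.read_parquet("/dbfs/mnt/source1/")
pdf2 = pd.read_parquet("/dbfs/mnt/source2/")

# Same inner join as in the question, done in pandas for comparison.
merged = pdf1.merge(pdf2, on="id", how="inner")
print(len(merged))
```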

sher
Valued Contributor II

You can use a shuffle hash join to improve performance.
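A hedged sketch of how that could look in PySpark (Spark 3.x), reusing the placeholder `df1`/`df2` and `id` key from the question:

```python
# Join hint asking Spark to use a shuffle hash join instead of a sort-merge join.
df = df1.hint("SHUFFLE_HASH").join(df2, on="id", how="inner")

# Alternatively, make Spark less eager to pick sort-merge joins globally.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
```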

Aviral-Bhardwaj
Esteemed Contributor III

Yeah, this is easy: you can do some performance tuning on your cluster and it will work. For example, you can adjust the auto broadcast join configuration, or other settings where you can set up your performance tuning.

AviralBhardwaj
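A possible sketch of the broadcast-join tuning mentioned above, again using the placeholder `df1`/`df2` and `id` key; the 100 MB threshold is only illustrative, and broadcasting helps only if one side is small enough to fit in executor memory:

```python
from pyspark.sql.functions import broadcast

# Raise the automatic broadcast threshold (default is about 10 MB).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(100 * 1024 * 1024))

# Or force a broadcast hash join explicitly on the smaller side.
df = df1.join(broadcast(df2), on="id", how="inner")
```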

Aviral-Bhardwaj
Esteemed Contributor III

Hey @raghu maremanda, did you get an answer? If yes, please update here so that other people can also get the solution.

AviralBhardwaj
