Databricks Community

joakon · ‎12-16-2022

Hi All,

I would like to get some ideas on how to improve performance on a data frame with around 10M rows.

Storage: ADLS Gen2

df1 = source1, format: Parquet (10M rows)

df2 = source2, format: Parquet (10M rows)

df = df1 inner join df2

df.count() is taking forever.

I am trying to join the above sources, aggregate the result, and write it back to ADLS.
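
For readers following along, the pipeline described above might look roughly like the sketch below. The function name, the paths, and the `key` and `value` column names are hypothetical, since the original post does not give them, and the sketch assumes `value` exists in only one of the two sources. It skips the intermediate `df.count()`, because that forces a full extra pass over the join before the write does the same work again.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:  # type hints only; a real SparkSession is supplied by the cluster
    from pyspark.sql import DataFrame, SparkSession


def join_aggregate_write(
    spark: "SparkSession",
    path1: str,
    path2: str,
    out_path: str,
    key: str = "id",        # hypothetical join column
    value: str = "amount",  # hypothetical column to aggregate
) -> "DataFrame":
    """Read two Parquet sources, inner-join them, aggregate, and write the result."""
    df1 = spark.read.parquet(path1)
    df2 = spark.read.parquet(path2)

    # Inner join on the shared key column.
    joined = df1.join(df2, on=key, how="inner")

    # Sum the value column per key; the dict form of agg() avoids extra imports.
    result = joined.groupBy(key).agg({value: "sum"})

    # The write is the only action, so the join is computed once.
    result.write.mode("overwrite").parquet(out_path)
    return result
```

On Databricks this would be called with the notebook's `spark` session and ADLS paths, for example `join_aggregate_write(spark, source1_path, source2_path, output_path)`, where the three path variables are placeholders.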

LandanG · ‎12-16-2022

@raghu maremanda It's hard to provide an answer without more information. Can you add the actual code used in the join, as well as the total data size and the cluster configuration (node types and number of nodes)?
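
One possible way to gather part of that information from a notebook is sketched below. The helper name `collect_join_diagnostics` and the shape of its return value are illustrative, not a Databricks API. It assumes a classic cluster where `df.rdd` and `spark.sparkContext` are available. Node types and node counts still have to come from the cluster configuration page.

```python
def collect_join_diagnostics(spark, joined, inputs):
    """Print the join's physical plan and return a few relevant settings.

    spark  -- the active SparkSession
    joined -- the joined DataFrame (not yet executed)
    inputs -- dict mapping a label to each input DataFrame
    """
    # Prints the physical plan, which shows the join strategy Spark picked
    # (for example SortMergeJoin or BroadcastHashJoin).
    joined.explain()

    return {
        "shuffle_partitions": spark.conf.get("spark.sql.shuffle.partitions"),
        "broadcast_threshold": spark.conf.get("spark.sql.autoBroadcastJoinThreshold"),
        "default_parallelism": spark.sparkContext.defaultParallelism,
        # Number of partitions each input was read into.
        "input_partitions": {
            name: df.rdd.getNumPartitions() for name, df in inputs.items()
        },
    }
```

A call such as `collect_join_diagnostics(spark, df, {"df1": df1, "df2": df2})` returns a small dict that can be pasted into a reply alongside the printed plan.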

labtech · ‎12-16-2022

What is the size of the Parquet files? If they are small, you could try the same join in pandas and compare it with PySpark.
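
A rough way to make that comparison is to time both versions with the same small helper. The sketch below uses only the standard library; `time_call` is a hypothetical name, and the callables passed to it are up to the reader.

```python
import time


def time_call(fn, repeats=3):
    """Run fn() `repeats` times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best
```

For pandas, the callable could read both files with `pd.read_parquet` and merge them. For PySpark, the callable needs to end in an action such as a write or a `count()`, because Spark is lazy and a bare `join` returns immediately without doing any work.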

sher · ‎12-16-2022

You can try a shuffle hash join (ShuffleHashJoin) to improve performance.
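
In Spark 3.0 and later, one way to ask for this strategy is the `shuffle_hash` join hint. The sketch below is a minimal example; the function name and the `key` column are hypothetical. The hint is a request rather than a guarantee, and whether it beats the default sort-merge join depends on the data, so it is worth checking the plan with `explain()` afterwards.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:  # type hints only
    from pyspark.sql import DataFrame


def shuffle_hash_join(left: "DataFrame", right: "DataFrame", key: str) -> "DataFrame":
    """Inner-join two DataFrames, hinting Spark to use a shuffle hash join.

    The hint is placed on the right side, which Spark will prefer as the
    build side of the hash join.
    """
    return left.join(right.hint("shuffle_hash"), on=key, how="inner")
```

With the data frames from the question this would be `df = shuffle_hash_join(df1, df2, "id")`, where `"id"` stands in for the real join column.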

Aviral-Bhardwaj · ‎12-17-2022

Yes, this should be straightforward: with some performance tuning on your cluster it will work. For example, you can use the auto broadcast join configuration, or other Spark settings, to tune the join.
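
As a sketch of what that tuning could look like: `spark.sql.autoBroadcastJoinThreshold` (10 MB by default) controls the size below which Spark broadcasts one side of a join automatically, and the `broadcast` hint requests it explicitly. The function names and the 50 MB value below are illustrative assumptions. Broadcasting only helps when one side comfortably fits in memory on every executor, which may not be the case for two 10M-row tables.

```python
def configure_broadcast(spark, threshold_mb=50):
    """Raise the auto-broadcast threshold and enable adaptive query execution.

    threshold_mb is an illustrative value, not a recommendation.
    Returns the threshold in bytes.
    """
    threshold_bytes = threshold_mb * 1024 * 1024
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(threshold_bytes))
    # Adaptive query execution can switch join strategies at runtime.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    return threshold_bytes


def broadcast_join(large, small, key):
    """Inner-join, explicitly hinting that the smaller side be broadcast."""
    return large.join(small.hint("broadcast"), on=key, how="inner")
```

If neither side is small enough to broadcast, these settings will not help on their own, and the partitioning and cluster size asked about earlier in the thread become the more relevant levers.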
