12-16-2022 01:10 PM
Hi All,
I would like to get some ideas on how to improve the performance of a join between two DataFrames with around 10M rows each.

Storage: ADLS Gen2

df1 = spark.read.parquet("<source1 path>")  # parquet, ~10M rows
df2 = spark.read.parquet("<source2 path>")  # parquet, ~10M rows
df = df1.join(df2, on="<join key>", how="inner")
df.count()  # this is taking forever

I am trying to join the above sources, aggregate the result, and write it back to ADLS.
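Below is a minimal sketch of the aggregate-and-write step described above, assuming a hypothetical grouping column "category" and a placeholder ADLS Gen2 output path (neither is from the original post):

from pyspark.sql import functions as F

# Hypothetical aggregation after the join above: row counts per category
agg = df.groupBy("category").agg(F.count("*").alias("row_count"))

# Write the result back to ADLS Gen2 (the abfss path is a placeholder)
agg.write.mode("overwrite").parquet("abfss://container@account.dfs.core.windows.net/output")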
12-16-2022 01:40 PM
@raghu maremanda It's hard to provide an answer without more info. Can you add the actual code used in the join, as well as the total data size and the cluster configuration (node types and number of nodes)?
12-16-2022 11:01 PM
What is the size of the parquet files? If they are small enough, you could try pandas and compare it to PySpark.
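A minimal sketch of that comparison in pandas, assuming placeholder file paths and a hypothetical join key "id":

import pandas as pd

# Read both parquet sources into pandas (paths are placeholders)
pdf1 = pd.read_parquet("source1.parquet")
pdf2 = pd.read_parquet("source2.parquet")

# Inner join on the hypothetical key, then count the result rows
merged = pdf1.merge(pdf2, on="id", how="inner")
print(len(merged))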
12-16-2022 11:41 PM
You can use a shuffle hash join to improve performance.
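A minimal sketch of forcing that strategy (Spark 3.0+), assuming a hypothetical join key "id"; the two lines are alternatives, not the poster's actual code:

# Make the planner prefer shuffled hash join over sort-merge join
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")

# Or request it explicitly with a join hint on one side (Spark 3.0+)
df = df1.join(df2.hint("SHUFFLE_HASH"), on="id", how="inner")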
12-17-2022 10:25 PM
You can do some performance tuning on your cluster and it should work. For example, you can adjust the auto broadcast join configuration, among other tuning settings, as sketched below.
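A minimal sketch of that broadcast tuning, assuming a hypothetical join key "id"; the 100 MB threshold is just an example value, and broadcasting only pays off when one side of the join is small enough to fit in executor memory:

from pyspark.sql.functions import broadcast

# Raise the automatic broadcast threshold (Spark's default is 10 MB)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(100 * 1024 * 1024))

# Or force a broadcast of the smaller side explicitly
df = df1.join(broadcast(df2), on="id", how="inner")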
12-23-2022 08:37 PM
Hey @raghu maremanda, did you get an answer? If yes, please update here so that other people can also find the solution.