Warehousing & Analytics
Engage in discussions on data warehousing, analytics, and BI solutions within the Databricks Community. Share insights, tips, and best practices for leveraging data for informed decision-making.

Spark dataframe performing poorly

qwerty3
Contributor

I have huge datasets. Transformations, display, print, and show all work well on this data when it is read into a pandas dataframe. But once the same dataframe is converted to a Spark dataframe, it takes minutes to display even a single row and hours to write the data to a Delta table.

21 REPLIES

qwerty3
Contributor
df = spark.sql("""SELECT df1.*, df2.*
    FROM df1
    JOIN df2
    ON ST_Intersects(df1.geometry, df2.geometry)""")

df.write.format("delta").mode("append").saveAsTable(table_name)
 
df1 has 1500 rows, df2 has 90 lakh (9 million) rows.

gchandra
Databricks Employee

Since df1 has 1500 rows and df2 has 9 million rows, try a broadcast join on df1:

df = spark.sql("""SELECT /*+ BROADCAST(df1) */
    df1.*, df2.*
    FROM df1
    JOIN df2
    ON ST_Intersects(df1.geometry, df2.geometry)""")
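If you prefer the DataFrame API, the same broadcast can be expressed with functions.broadcast. This is only a sketch, assuming df1 and df2 are registered as temp views and that ST_Intersects comes from a spatial library already on the cluster (e.g. Apache Sedona or the built-in ST_ functions):

from pyspark.sql import functions as F

small = spark.table("df1")   # ~1500 rows, cheap to broadcast to every executor
large = spark.table("df2")   # ~90 lakh (9 million) rows

joined = large.alias("df2").join(
    F.broadcast(small.alias("df1")),                          # broadcast the small side
    on=F.expr("ST_Intersects(df1.geometry, df2.geometry)"),   # spatial predicate (non-equi join)
    how="inner",
)
joined.write.format("delta").mode("append").saveAsTable(table_name)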

 




qwerty3
Contributor

But the issue comes up while writing the df. Sure, broadcast might optimize the join, but how will it speed up the write operation?

gchandra
Databricks Employee

Spark uses lazy evaluation: transformations are not processed until an action is invoked. In your case the SQL join is the transformation and the write is the action, so the time you see at the write step is mostly the join actually running.

So try the Broadcast join and check the result.
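To make the lazy evaluation point concrete, here is a minimal sketch (illustrative names, not your exact code):

# Nothing executes while the plan is being built
df = spark.sql("SELECT * FROM df1")      # transformation: returns immediately
df = df.filter("geometry IS NOT NULL")   # still lazy, only the query plan grows

# The action triggers the whole plan, so the join cost shows up here
df.write.format("delta").mode("append").saveAsTable("some_table")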

 




qwerty3
Contributor

Didn't work, it's still taking 10 minutes to write the df, which is a very long time considering I have 5000 such chunks to process.

gchandra
Databricks Employee

I understand you want it faster. Did it at least complete the write in 10 minutes, compared to not completing at all before?

There are more knobs you can tweak, like:

spark.sql.shuffle.partitions=auto

Do you have any index columns in your spatial data that can be used for joining?

Also please check whether your data is partitioned correctly.
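As a rough sketch of those knobs (the "auto" value for shuffle partitions is Databricks-specific and relies on Adaptive Query Execution being enabled):

# Let AQE pick the shuffle partition count at runtime
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.shuffle.partitions", "auto")

# Quick checks on the joined DataFrame before writing
print(df.rdd.getNumPartitions())   # roughly how many tasks/files the write will produce
df.explain()                       # confirm the broadcast shows up in the physical plan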

Finally, what is your rationale for saying this should complete in less than 10 minutes? Do you have anything to compare it against, or is it just a feeling?

 



qwerty3
Contributor

The reason I feel 10 minutes is way too long is that I have written a 90 lakh (9 million) row table in a few minutes. But with this spatial data, even when the row count is much lower, say 35 lakh (3.5 million), it still takes 10 minutes. That shouldn't be the case, considering the final dataset I am writing doesn't even have a geometry column; it has only 10-12 regular string and numeric columns.

Additionally, this duration is not acceptable because I have 3000 such datasets to process. I have already partitioned my data because of the volume, but there is still no performance improvement.

I will try the other things you have mentioned.

Thanks for your input.
