09-23-2024 08:08 AM
I have huge datasets. Transformations, display, print, and show all work well on this data when it is read into a pandas DataFrame. But once the same DataFrame is converted to a Spark DataFrame, it takes minutes to display even a single row and hours to write the data to a Delta table.
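Roughly, the workflow looks like this (a minimal sketch; the path and table name are placeholders, not my actual data):

import pandas as pd

pdf = pd.read_parquet("/dbfs/tmp/huge_dataset.parquet")   # placeholder path; pandas handles this quickly
sdf = spark.createDataFrame(pdf)                          # convert to a Spark DataFrame
display(sdf.limit(1))                                     # takes minutes even for a single row
sdf.write.format("delta").mode("overwrite").saveAsTable("my_schema.my_table")  # placeholder table; takes hours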
09-26-2024 09:02 AM
09-26-2024 09:12 AM
df1 has 1,500 rows and df2 has 9 million rows, so try a broadcast join on df1:
df = spark.sql("""
    SELECT /*+ BROADCAST(df1) */ df1.*, df2.*
    FROM df1
    JOIN df2
      ON ST_Intersects(df1.geometry, df2.geometry)
""")
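If you prefer to stay in the DataFrame API, the same hint can be expressed with broadcast() and expr(). This is just a sketch and assumes an ST_Intersects SQL function (e.g. from Apache Sedona or Databricks spatial SQL) is registered in your session:

from pyspark.sql.functions import broadcast, expr

df = df2.alias("df2").join(
    broadcast(df1.alias("df1")),                        # replicate the 1,500-row side to every executor
    expr("ST_Intersects(df1.geometry, df2.geometry)")   # spatial predicate evaluated against the broadcast copy
)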
09-26-2024 09:31 AM
But the issue comes up while writing the df. Sure, the broadcast hint might optimize the join, but how will it speed up the write operation itself?
09-26-2024 09:46 AM
Spark uses lazy evaluation: transformations are only processed when an action is invoked. In your case, the SQL join is the transformation and the write is the action, so the time you see during the write includes the join itself.
So try the broadcast join and check the result.
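As a tiny self-contained illustration of lazy evaluation (the toy DataFrames and output path below are placeholders, not your data):

small = spark.range(10).withColumnRenamed("id", "k")
large = spark.range(1_000_000).withColumnRenamed("id", "k")

joined = large.join(small, "k")   # transformation: only builds a query plan
joined.explain()                  # inspect the plan; look for BroadcastHashJoin

# Nothing has run yet. The work happens only when an action fires,
# e.g. a count or the Delta write itself:
joined.write.format("delta").mode("overwrite").save("/tmp/lazy_eval_demo")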
09-26-2024 11:50 AM
Didn't work; it still takes 10 minutes to write the df, which is a very long time considering I have 5000 such chunks to process.
09-26-2024 12:06 PM
I understand you want it to be faster. Did it at least finish writing the data in 10 minutes, compared to not completing the write before?
There are more knobs you can tweak, for example:
spark.sql.shuffle.partitions=auto
Do you have any index columns in your spatial data that can be used for joining?
Also, please check whether your data is partitioned correctly (see the sketch below).
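A rough sketch of those knobs in code (the config name is standard; the coalesce target and table name are illustrative assumptions you would need to tune for your cluster):

# Let AQE choose the shuffle partition count (Databricks supports the "auto" value)
spark.conf.set("spark.sql.shuffle.partitions", "auto")

# Check how the joined result is partitioned before the write
print(df.rdd.getNumPartitions())

# If there are many tiny partitions, coalescing before the Delta write can
# reduce task overhead and small-file churn (32 is an illustrative number)
df.coalesce(32).write.format("delta").mode("append").saveAsTable("my_schema.my_table")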
Finally, what is your rationale for stating that this should be completed in less than 10 minutes? Do you have anything to compare it against, or is it just a gut feeling?
09-27-2024 08:50 AM
The reason I feel 10 minutes is way too long is that I have written a 90 lakh (9 million) row table in a few minutes. But with this spatial data, even when the row count is far lower, say 35 lakh (3.5 million), it still takes 10 minutes. That shouldn't be the case, considering the final dataset I am writing does not even have a geometry column; it only has 10-12 regular string and numeric columns.
Additionally, this duration is not acceptable because I have 3000 such datasets to process. I have already partitioned my data because of the volume, and there is still no performance improvement.
Rest will try the other things you have mentioned.
Thanks for your input.