09-23-2024 08:08 AM
I have huge datasets. Transformations, display, print, and show all work well on this data when it is read into a pandas DataFrame. But once the same DataFrame is converted to a Spark DataFrame, it takes minutes to display even a single row and hours to write the data to a Delta table.
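Roughly, the workflow looks like this (a minimal sketch; the path and table name are placeholders, not my actual data):

import pandas as pd

pdf = pd.read_parquet("/dbfs/tmp/huge_dataset.parquet")   # placeholder path; pandas handles this quickly
sdf = spark.createDataFrame(pdf)                          # convert to a Spark DataFrame
display(sdf.limit(1))                                     # takes minutes even for a single row
sdf.write.format("delta").mode("overwrite").saveAsTable("my_schema.my_table")  # placeholder table; takes hours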
09-26-2024 09:02 AM
09-26-2024 09:12 AM
df1 has 1,500 rows and df2 has 9 million rows, so try a broadcast join on df1:
df = spark.sql("""
    SELECT /*+ BROADCAST(df1) */ df1.*, df2.*
    FROM df1
    JOIN df2
      ON ST_Intersects(df1.geometry, df2.geometry)
""")
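If you prefer to stay in the DataFrame API, the same hint can be expressed with broadcast() and expr(). This is just a sketch and assumes an ST_Intersects SQL function (e.g. from Apache Sedona or Databricks spatial SQL) is registered in your session:

from pyspark.sql.functions import broadcast, expr

df = df2.alias("df2").join(
    broadcast(df1.alias("df1")),                        # replicate the 1,500-row side to every executor
    expr("ST_Intersects(df1.geometry, df2.geometry)")   # spatial predicate evaluated against the broadcast copy
)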
09-26-2024 09:31 AM
But the issue comes up while writing the df. Sure, the broadcast hint might optimize the join, but how will it speed up the write operation itself?
09-26-2024 09:46 AM
Spark uses lazy evaluation: transformations are only processed when an action is invoked. In your case, the SQL join is the transformation and the write is the action, so the time you see during the write includes the join itself.
So try the broadcast join and check the result.
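As a tiny self-contained illustration of lazy evaluation (the toy DataFrames and output path below are placeholders, not your data):

small = spark.range(10).withColumnRenamed("id", "k")
large = spark.range(1_000_000).withColumnRenamed("id", "k")

joined = large.join(small, "k")   # transformation: only builds a query plan
joined.explain()                  # inspect the plan; look for BroadcastHashJoin

# Nothing has run yet. The work happens only when an action fires,
# e.g. a count or the Delta write itself:
joined.write.format("delta").mode("overwrite").save("/tmp/lazy_eval_demo")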
09-26-2024 11:50 AM
Didn't work; it still takes 10 minutes to write the df, which is a very long time considering I have 5000 such chunks to process.
09-26-2024 12:06 PM
I understand you want it to be faster. Did it at least finish writing the data in 10 minutes, compared to not completing the write before?
There are more knobs you can tweak, for example:
spark.sql.shuffle.partitions=auto
Do you have any index columns in your spatial data that can be used for joining?
Also, please check whether your data is partitioned correctly (see the sketch below).
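A rough sketch of those knobs in code (the config name is standard; the coalesce target and table name are illustrative assumptions you would need to tune for your cluster):

# Let AQE choose the shuffle partition count (Databricks supports the "auto" value)
spark.conf.set("spark.sql.shuffle.partitions", "auto")

# Check how the joined result is partitioned before the write
print(df.rdd.getNumPartitions())

# If there are many tiny partitions, coalescing before the Delta write can
# reduce task overhead and small-file churn (32 is an illustrative number)
df.coalesce(32).write.format("delta").mode("append").saveAsTable("my_schema.my_table")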
Finally, what is your rationale for stating that this should be completed in less than 10 minutes? Do you have anything to compare it against, or is it just a gut feeling?
09-27-2024 08:50 AM
The reason I feel 10 minutes is way too long is that I have written a 90 lakh (9 million) row table in a few minutes. But with this spatial data, even when the row count is far lower, say 35 lakh (3.5 million), it still takes 10 minutes. That shouldn't be the case, considering the final dataset I am writing does not even have a geometry column; it only has 10-12 regular string and numeric columns.
Additionally, this duration is not acceptable because I have 3000 such datasets to process. I have already partitioned my data because of the volume, and there is still no performance improvement.
Rest will try the other things you have mentioned.
Thanks for your input.